CSC548 Project – Fall 2009

An MPI failure detector over PMPI

 

Donghoon Kim

dkim2@ncsu.edu

 

 

 

1. Problem description

Modern scientific applications on large-scale MPPs have execution times ranging from days to months. These long-running MPI applications are prone to node or network failures as systems scale [1]. An MPI application that must exchange computation results through communication may make no progress when a node or network failure occurs. Furthermore, recovery overhead grows unless the fault detection service provides timely detection; on the other hand, the overhead of fault detection itself grows as the detection frequency is raised to report failures more accurately. Although many theoretical solutions for fault detection services exist, characterized by the completeness and accuracy properties of unreliable failure detectors [2], a practical, transparent fault detector built on the MPI standard is still needed for contemporary applications.

 

 

2. Outline for approaching a solution

Assuming a system model that provides certain temporal guarantees on communication and computation, called partially synchronous [3], the Fault Detector (FD) can use a time-based scheme, namely a ping-ack implementation. The FD is a thread created with pthread_create() that works independently underneath the application program. The FD uses two messages, ALIVE and ACK: an ALIVE message checks whether a destination node is alive, and an ACK message is the destination node's confirmation. The FD must account for the communication and computation delay between two nodes; it suspects a destination node of having failed if no ACK message is received for a corresponding ALIVE message within that delay. The FD is integrated into the MPI environment with the following two properties:

- Transparency – The FD is launched from the MPI_Init() routine through the profiling interface (PMPI), which creates the FD threads (a minimal sketch follows this list). The FD runs independently on its own communicator, separate from the application program's.

- Scalability – The FD sends a check message sporadically, whenever the application program invokes a communication routine. Compared with periodic check messages, this does not lead to high communication overhead, since the FD at each node suppresses redundant check messages within a defined time period.
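The transparency property rests on the MPI profiling interface. Below is a minimal sketch, in C, of how a profiled MPI_Init() could launch the FD threads; the names fd_sender(), fd_receiver(), and fd_comm are illustrative placeholders rather than the project's actual identifiers, and a thread-safe MPI library is assumed.

/* Sketch: profiled MPI_Init() that starts the FD threads (illustrative names). */
#include <mpi.h>
#include <pthread.h>

MPI_Comm  fd_comm;                         /* FD's private communicator      */
static pthread_t fd_send_tid, fd_recv_tid;

void *fd_sender(void *arg);                /* sends ALIVE probes             */
void *fd_receiver(void *arg);              /* answers ALIVE, collects ACKs   */

int MPI_Init(int *argc, char ***argv)
{
    int rc = PMPI_Init(argc, argv);        /* real initialization via PMPI   */

    /* Give the FD a duplicate communicator so its traffic never mixes
     * with the application's messages. */
    PMPI_Comm_dup(MPI_COMM_WORLD, &fd_comm);

    /* Launch the detector threads; they run beneath the application. */
    pthread_create(&fd_send_tid, NULL, fd_sender,   &fd_comm);
    pthread_create(&fd_recv_tid, NULL, fd_receiver, &fd_comm);

    return rc;
}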

 


Figure 1. Diagram of the Fault Detector

 

 

3. Implementation

Fault Detector (FD)

- Each node launches the FD after passing MPI_Init().

- The FD consists of two threads, as follows (a minimal sketch of both loops appears after this list):

   Sender : sends an ALIVE message to each neighbor currently in the queue.

   Receiver : receives both ACK and ALIVE messages from neighbors, answering each ALIVE with an ACK.
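A minimal sketch of the two FD threads is shown below. The tags, the neighbor-queue helpers (queue_empty(), queue_pop()), the probe period, and microsleep() are assumed names for illustration; PMPI_* calls are used so that FD traffic does not re-enter the profiling hooks.

/* Sketch: FD Sender and Receiver thread loops over fd_comm (illustrative names). */
#include <mpi.h>

#define TAG_ALIVE    100
#define TAG_ACK      101
#define FD_PERIOD_US 100000L               /* pause between probe rounds */

extern MPI_Comm fd_comm;
extern int  queue_empty(void);
extern int  queue_pop(void);               /* returns a neighbor rank */
extern void microsleep(long usec);

void *fd_sender(void *arg)
{
    int dummy = 0;
    for (;;) {
        while (!queue_empty()) {
            int neighbor = queue_pop();
            /* Probe the neighbor; the Receiver thread collects the ACK.
             * If no ACK arrives within the expected delay, the neighbor
             * is suspected to have failed. */
            PMPI_Send(&dummy, 1, MPI_INT, neighbor, TAG_ALIVE, fd_comm);
        }
        microsleep(FD_PERIOD_US);          /* avoid busy waiting */
    }
    return NULL;
}

void *fd_receiver(void *arg)
{
    int dummy;
    MPI_Status st;
    for (;;) {
        PMPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                  fd_comm, &st);
        if (st.MPI_TAG == TAG_ALIVE)       /* answer a probe */
            PMPI_Send(&dummy, 1, MPI_INT, st.MPI_SOURCE, TAG_ACK, fd_comm);
        /* else TAG_ACK: record that st.MPI_SOURCE responded in time */
    }
    return NULL;
}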

MPI Routines

- Added pre/post routines to the following communication routines (an MPI_Send() wrapper is sketched after the list):

    MPI_Send()

    MPI_Isend()

    MPI_Bcast()

    MPI_Reduce()

    MPI_Allreduce()

    MPI_Alltoall()
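As an example of the pre/post instrumentation, the profiled MPI_Send() below brackets the real PMPI call with two hooks; fd_pre_comm() and fd_post_comm() are hypothetical names standing in for the project's actual pre/post routines.

/* Sketch: pre/post-instrumented MPI_Send() (hook names are illustrative). */
#include <mpi.h>

extern void fd_pre_comm(int peer);         /* e.g. queue the peer for an ALIVE probe */
extern void fd_post_comm(int peer);        /* e.g. consult the suspect list          */

int MPI_Send(void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    fd_pre_comm(dest);                                 /* pre routine  */
    int rc = PMPI_Send(buf, count, type, dest, tag, comm);
    fd_post_comm(dest);                                /* post routine */
    return rc;
}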

Wrapper Functions

- Fortran-to-C shims so that benchmarks written in Fortran still pass through the profiled C routines (one shim is sketched below).
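As one possible illustration, a shim for MPI_Send could look like the following; Fortran passes arguments by reference and uses integer handles, which the MPI_*_f2c helpers convert, and the trailing underscore in the symbol name is compiler dependent.

/* Sketch: Fortran-to-C shim that forwards mpi_send() to the profiled MPI_Send(). */
#include <mpi.h>

void mpi_send_(void *buf, MPI_Fint *count, MPI_Fint *datatype,
               MPI_Fint *dest, MPI_Fint *tag, MPI_Fint *comm,
               MPI_Fint *ierr)
{
    /* Calls the profiled C MPI_Send(), so the FD pre/post hooks run. */
    *ierr = (MPI_Fint) MPI_Send(buf, (int)*count, MPI_Type_f2c(*datatype),
                                (int)*dest, (int)*tag, MPI_Comm_f2c(*comm));
}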

Utility Functions

- Queue management functions

- A microsleep function (sketched below)
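A minimal sketch of microsleep(), assuming POSIX nanosleep() is available; the FD uses it to pace its probe loop without busy waiting.

/* Sketch: microsecond-resolution sleep built on POSIX nanosleep(). */
#include <time.h>

void microsleep(long usec)
{
    struct timespec ts;
    ts.tv_sec  = usec / 1000000L;
    ts.tv_nsec = (usec % 1000000L) * 1000L;
    nanosleep(&ts, NULL);                  /* interruption by signals ignored here */
}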

 

 

4. Test  

    - Tested with the NAS Parallel Benchmarks on the opt cluster [4]:

       CG, IS, DT, BT, EP, MG, SP, LU, FT

 

 

5. Experimental Results

Figure 2. Performance results with the FD

       

 

6. Conclusion

In this work, I implemented a sporadic Fault Detector based on ping-ack messages. Some implementation issues remain; nevertheless, the results indicate that a practical Fault Detector can be built with the properties of transparency, scalability, and portability. The experimental results show that the Fault Detector incurs negligible communication and performance overhead. A global view list of node failures, together with a consensus mechanism to reconcile differences among the global view lists held at different nodes, still needs to be added; I will implement this feature in future work.

 

 

7. Reports

- HW4 (Problem Statement)

- HW5 (Progress Report)

- HW6 (Final Report)

- The source code is not publicly released. Please email me if you need it.

 

 

8. References

[1] H. Jitsumoto, T. Endo, and S. Matsuoka, "ABARIS: An Adaptable Fault Detection/Recovery Component Framework for MPIs," IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007), pp. 1-8, March 2007.

[2] T. D. Chandra and S. Toueg, "Unreliable Failure Detectors for Reliable Distributed Systems," Journal of the ACM, vol. 43, no. 2, pp. 225-267, March 1996.

[3] S. Sastry, S. M. Pike, and J. L. Welch, "Crash Fault Detection in Celerating Environments," IEEE International Parallel and Distributed Processing Symposium (IPDPS 2009), pp. 1-12, 2009.

[4] http://moss.csc.ncsu.edu/~mueller/cluster/opt/

[5] S. Genaud and C. Rattanapoka, "Evaluation of Replication and Fault Detection in P2P-MPI," IEEE International Parallel and Distributed Processing Symposium (IPDPS 2009), 2009.