Reliability, Availability, and Serviceability (RAS) for Petascale High-End Computing and Beyond

funded by: DOE
funding level: $150,000 (for NCSU)
duration: 06/01/2008 - 05/31/2011
PIs (total funding: $1,950,000):
- Stephen L. Scott, Christian Engelmann, Hong Ong, Geoffroy Vallee - ORNL
- Frank Mueller - North Carolina State University
- Chokchai Leangsuksun, Mihaela Paun - Louisiana Tech University

Reliability, Availability, and Serviceability (RAS) for Petascale High-End Computing and Beyond is a collaborative computer science research effort of Oak Ridge National Laboratory (ORNL), Louisiana Tech University, and North Carolina State University in advanced software solutions for parallel and distributed computing systems, with an emphasis on extreme-scale scientific high-performance computing (HPC). Specifically, this project aims to provide high-level RAS for next-generation supercomputers and to improve their resiliency (and ultimately their efficiency) through research and development in novel high-availability and fault tolerance system software solutions.

Based on virtualized adaptation, reconfiguration, and preemptive measures, the ultimate goal is to provide non-stop scientific computing on a 24x7 basis. The technical approach leverages system-level virtualization technology to enable transparent proactive and reactive fault tolerance mechanisms on extreme-scale high-end computing (HEC) systems. This effort targets:

- reliability analysis for identifying pre-fault indicators, predicting failures, and modeling and monitoring individual system component reliability as well as overall system reliability,
- proactive fault tolerance technology based on preemptive migration of computation away from components that are about to fail, using system-level virtualization,
- reactive fault tolerance enhancements, such as adaptation of checkpoint interval and placement to actual and predicted system health threats, using system- and process-level virtualization, and
- holistic fault tolerance through the combination of adaptive proactive and reactive fault tolerance in conjunction with system health monitoring and reliability analysis.
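
As an illustration of the checkpoint interval adaptation named among the targets above, the following minimal Python sketch uses Young's first-order approximation, T = sqrt(2 * C * MTBF), where C is the time to write one checkpoint and MTBF is the mean time between failures. This is not the project's implementation: the function names, the choice of Young's formula, and the rule of using the smaller of the baseline and predicted MTBF are assumptions made for illustration only.

```python
import math
from typing import Optional


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval (seconds)."""
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def adapted_interval(
    checkpoint_cost_s: float,
    baseline_mtbf_s: float,
    predicted_mtbf_s: Optional[float] = None,
) -> float:
    """Shorten the interval when a health monitor predicts a lower MTBF.

    Hypothetical policy: use the more pessimistic (smaller) of the
    baseline MTBF and the predicted MTBF, if a prediction is available.
    """
    if predicted_mtbf_s is None:
        mtbf = baseline_mtbf_s
    else:
        mtbf = min(baseline_mtbf_s, predicted_mtbf_s)
    return young_interval(checkpoint_cost_s, mtbf)


if __name__ == "__main__":
    # Illustrative numbers only: 50 s per checkpoint, 10,000 s baseline MTBF.
    print(adapted_interval(50.0, 10000.0))          # no health warning
    print(adapted_interval(50.0, 10000.0, 2500.0))  # predicted degradation
```

Under these assumptions, a predicted drop in MTBF leads to more frequent checkpoints, which is the kind of reaction to predicted system health threats that the reactive fault tolerance target describes.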

Publications:

"On-the-fly Recovery of Job Input Data in Supercomputers" by C. Wang, Z. Zhang, S. Vazhkudai, X. Ma, and F. Mueller in International Conference on Parallel Processing, Sep 2008, pages 620-627.
"Proactive Process-Level Live Migration in HPC Environments" by C. Wang, F. Mueller, C. Engelmann and S. Scott in Supercomputing, Nov 2008.
"Improving the Availability of Supercomputer Job Input Data Using Temporal Replication" by C. Wang, Z. Zhang, S. Vazhkudai, X. Ma, and F. Mueller in International Supercomputing Conference, Jun 2009, pages 149-157.
"Back-Migration for MPI Jobs in HPC Environments" by C. Wang, F. Mueller, C. Engelmann and S. Scott in Forum to Address Scalable Technology for runtime and Operating Systems (FastOS), Jun 2009.
"A Tunable Holistic Resiliency Approach for High-Performance Computing Systems" by S. Scott, C. Engelmann, G. Vallee, T. Naughton, A. Tikotekar, G. Ostrouchov, C. Leangsuksun, N. Naksinehaboon, R. Nassar, M. Paun, F. Mueller, C. Wang, A. Nagarajan, and J. Varma, refereed poster at PPoPP, Feb 2009.
"Proactive Process-Level Live Migration and Back Migration in HPC Environments" by C. Wang, F. Mueller, C. Engelmann and S. Scott in TR 2009-14, Dept. of Computer Science, North Carolina State University, Jun 2009.
"Hybrid Full/Incremental Checkpoint/Restart for MPI Jobs in HPC Environments" by C. Wang, F. Mueller, C. Engelmann and S. Scott in TR 2009-14, Dept. of Computer Science, North Carolina State University, Jun 2009.
"Hybrid Checkpointing for MPI Jobs in HPC Environments" by C. Wang, F. Mueller, C. Engelmann and S. Scott in International Conference on Parallel and Distributed Systems (ICPADS), Dec 2010.

Theses:

"Transparent Fault Tolerance for Job Healing in HPC Environments" by C. Wang, Ph.D. Thesis, North Carolina State University, Jun 2009 (last known position: post-doc at ORNL)