Reliability, Availability and Serviceability (RAS)
for Petascale High-End Computing and Beyond
- funded by: DOE
- funding level: $150,000 (for NCSU)
- duration: 06/01/2008 - 05/31/2011
- PIs (total funding: $1,950,000):
  - Stephen L. Scott, Christian Engelmann, Hong Ong, Geoffroy Vallee - ORNL
  - Frank Mueller - North Carolina State University
  - Chokchai Leangsuksun, Mihaela Paun - Louisiana Tech University
Reliability, Availability, and
Serviceability (RAS) for Petascale High-End Computing and Beyond is a
collaborative computer science research effort of Oak Ridge National
Laboratory (ORNL), Louisiana Tech
University, and North Carolina State
University in advanced software solutions for parallel and distributed
computing systems with an emphasis on extreme-scale scientific high
performance computing (HPC). Specifically, this project aims at
providing high-level RAS for next-generation supercomputers to improve
their resiliency (and ultimately efficiency) by performing research
and development in novel high availability and fault tolerance system
software solutions.
Based on virtualized adaptation, reconfiguration, and preemptive measures, the ultimate goal is to provide non-stop scientific computing on a 24x7 basis. The technical approach leverages system-level virtualization technology to enable transparent proactive and reactive fault tolerance mechanisms on extreme-scale HEC systems. This effort targets:
- reliability analysis for identifying pre-fault indicators,
predicting failures, and modeling and monitoring individual system
component reliability as well as overall system reliability,
- proactive fault tolerance technology based on preemptive
migration of computation away from components that are about to fail
using system-level virtualization,
- reactive fault tolerance enhancements, such as checkpoint
interval and placement adaptation to actual and predicted system health
threats, using system- and process-level virtualization, and
- holistic fault tolerance through a combination of adaptive proactive and reactive fault tolerance in conjunction with system health monitoring and reliability analysis.
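As an illustration of the checkpoint interval adaptation mentioned in the list above, the following is a minimal sketch and not the project's actual method, which this page does not describe. It assumes Young's well-known first-order approximation for the checkpoint interval, sqrt(2 * C * MTBF), where C is the cost of writing one checkpoint. The function names and the `health_factor` parameter are hypothetical: `health_factor` stands in for whatever signal a health monitor or failure predictor might supply, scaling the expected mean time between failures (MTBF) down when the system looks less healthy.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval (seconds)."""
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def adapted_interval(checkpoint_cost_s: float,
                     baseline_mtbf_s: float,
                     health_factor: float) -> float:
    """Shorten the interval when predicted health degrades.

    health_factor is a hypothetical score in (0, 1]: 1.0 means the system
    looks as reliable as its baseline MTBF suggests, smaller values mean a
    failure is considered more likely soon.
    """
    if not 0.0 < health_factor <= 1.0:
        raise ValueError("health_factor must be in (0, 1]")
    return young_interval(checkpoint_cost_s, baseline_mtbf_s * health_factor)


if __name__ == "__main__":
    cost = 50.0        # seconds to write one checkpoint (assumed value)
    mtbf = 10_000.0    # baseline mean time between failures (assumed value)
    print(young_interval(cost, mtbf))          # healthy system
    print(adapted_interval(cost, mtbf, 0.25))  # degraded health
```

With these assumed numbers, a healthy system checkpoints every 1000 seconds, and a health factor of 0.25 halves that to 500 seconds. A real system would presumably also weigh proactive migration against more frequent checkpointing, which this sketch leaves out.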
Publications:
-
"On-the-fly Recovery of Job Input Data in Supercomputers"
by C. Wang, Z. Zhang, S. Vazhkudai, X. Ma,
and F. Mueller
in International Conference on Parallel Processing, Sep 2008, pages 620-627.
-
"Proactive Process-Level Live Migration in HPC Environments"
by C. Wang, F. Mueller, C. Engelmann and S. Scott
in Supercomputing, Nov 2008.
-
"Improving the Availability of Supercomputer Job Input Data Using Temporal Replication"
by C. Wang, Z. Zhang, S. Vazhkudai, X. Ma,
and F. Mueller
in International Supercomputing Conference, Jun 2009, pages 149-157.
-
"Back-Migration for MPI Jobs in HPC Environments"
by C. Wang, F. Mueller, C. Engelmann and S. Scott
in Forum to Address Scalable Technology for runtime and Operating Systems (FastOS), Jun 2009.
-
"A Tunable Holistic Resiliency Approach for High-Performance Computing Systems"
by S. Scott, C. Engelmann, G. Vallee,
T. Naughton, A. Tikotekar, G. Ostrouchov,
C. Leangsuksun, N. Naksinehaboon, R. Nassar, M. Paun, F. Mueller,
C. Wang, A. Nagarajan, and J. Varma,
refereed poster at PPoPP, Feb 2009.
-
"Proactive Process-Level Live Migration and Back Migration in HPC Environments"
by C. Wang, F. Mueller, C. Engelmann and S. Scott
in TR 2009-14, Dept. of Computer Science, North Carolina State
University, Jun 2009.
-
"Hybrid Full/Incremental Checkpoint/Restart for MPI Jobs in HPC Environments"
by C. Wang, F. Mueller, C. Engelmann and S. Scott
in TR 2009-14, Dept. of Computer Science, North Carolina State
University, Jun 2009.
-
"Hybrid Checkpointing for MPI Jobs in HPC Environments"
by C. Wang, F. Mueller, C. Engelmann and S. Scott
in International Conference on Parallel and Distributed Systems (ICPADS), Dec 2010.
Theses: