Reliability, Availability, and Serviceability (RAS) for Petascale High-End Computing and Beyond


Reliability, Availability, and Serviceability (RAS) for Petascale High-End Computing and Beyond is a collaborative computer science research effort of Oak Ridge National Laboratory (ORNL), Louisiana Tech University, and North Carolina State University in advanced software solutions for parallel and distributed computing systems, with an emphasis on extreme-scale scientific high-performance computing (HPC). Specifically, the project aims to provide high-level RAS for next-generation supercomputers, improving their resiliency (and ultimately their efficiency) through research and development of novel high-availability and fault-tolerance system software solutions.

The ultimate goal is non-stop scientific computing on a 24x7 basis, achieved through virtualized adaptation, reconfiguration, and preemptive measures. The technical approach leverages system-level virtualization technology to enable transparent proactive and reactive fault tolerance mechanisms on extreme-scale high-end computing (HEC) systems. This effort targets:

Publications:

Theses: