Operating and Runtime System Resilience on the Path to Exascale

funded by: SNL
funding level: $55,448
duration: 01/01/2012 - 12/31/2012

For large-scale high-performance computing (HPC) systems with tens to hundreds of thousands of cores, faults have become the norm rather than the exception. To address this problem, we propose to develop and evaluate advanced mechanisms that protect the operating and runtime systems and thereby increase resilience to failures.

Publications:

"Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing" by D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira, R. Brightwell in Supercomputing, Nov 2012, pages 2069-2072, DOI 10.1109/IPDPS.2011.379.
"Combining Partial Redundancy and Checkpointing for HPC" by J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. Ferreira, C. Engelmann in International Conference on Distributed Computing Systems, Jun 2012, DOI 10.1109/ICDCS.2012.56.
"Evaluating Operating System Vulnerability to Memory Errors" by Kurt B. Ferreira, Kevin Pedretti, Patrick G. Bridges, Ron Brightwell, David Fiala and Frank Mueller, Workshop on Runtime and Operating Systems for Supercomputers, Jun 2012, DOI 10.1145/2318916.2318930.
"A Tunable, Software-based DRAM Error Detection and Correction Library for HPC" by D. Fiala, K. Ferreira, F. Mueller, C. Engelmann, Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, Sep 2011, DOI 10.1007/978-3-642-29740-3_29.
"Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing" by D. Fiala, F. Mueller, C. Engelmann, K. Ferreira, R. Brightwell, R. Riesen in TR 2012-5, Dept. of Computer Science, North Carolina State University, May 2012.