Operating and Runtime System Resilience on the Path to Exascale
- funded by: SNL
- funding level: $55,448
- duration: 01/01/2012 - 12/31/2012
For large-scale high-performance computing (HPC) systems with tens to hundreds
of thousands of cores, faults have become the norm rather than the
exception. To address this problem, we propose to develop and
evaluate advanced mechanisms to protect the operating and runtime
systems and thereby increase resilience to failures.
Publications:
-
"Detection and Correction of Silent Data Corruption for Large-Scale
High-Performance Computing" by D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira, R. Brightwell,
in Supercomputing, Nov 2012, pages 2069-2072, DOI 10.1109/IPDPS.2011.379.
-
"Combining Partial Redundancy and Checkpointing for HPC" by J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. Ferreira,
C. Engelmann in International
Conference on Distributed Computing Systems, Jun 2012, DOI 10.1109/ICDCS.2012.56.
-
"Evaluating Operating System
Vulnerability to Memory Errors"
by Kurt B. Ferreira, Kevin Pedretti, Patrick G. Bridges, Ron Brightwell, David Fiala and Frank Mueller, Workshop on
Runtime and Operating Systems for Supercomputers, Jun 2012, DOI 10.1145/2318916.2318930.
-
"A Tunable, Software-based
DRAM Error Detection and Correction Library for HPC"
by D. Fiala, K. Ferreira, F. Mueller,
C. Engelmann, Workshop on Resiliency in High Performance Computing
(Resilience) in Clusters, Clouds, and Grids, Sep 2011, DOI 10.1007/978-3-642-29740-3_29.
-
"Detection and Correction of Silent Data Corruption for Large-Scale
High-Performance Computing" by D. Fiala, F. Mueller, C. Engelmann,
K. Ferreira, R. Brightwell, R. Riesen,
in TR 2012-5, Dept. of Computer Science, North Carolina State
University, May 2012.