Operating and Runtime System Resilience on the Path to Exascale

For large-scale high-performance computing (HPC) systems with tens to hundreds of thousands of cores, faults have become the norm rather than the exception. To address this problem, we propose to develop and evaluate advanced mechanisms that protect the operating and runtime systems and thereby increase resilience to failures.