RESYST: Resilience via Synergistic Redundancy and Fault Tolerance for High-End Computing

In High-End Computing (HEC), faults have become the norm rather than the exception for parallel computation on clusters with tens to hundreds of thousands of cores. As the core count increases, so does the overhead of fault-tolerant techniques that rely on checkpoint/restart (C/R) mechanisms. Once C/R overhead reaches 50%, redundancy becomes a viable alternative to fault recovery that actually scales, which makes the approach attractive for HEC.

The objective of this work is to develop a synergistic approach that combines C/R-based fault tolerance with redundancy in HEC installations to achieve high levels of resilience.

This work alleviates the scalability limitations of current fault-tolerant practices. It contributes to fault modeling as well as to fault detection and recovery, significantly advancing existing techniques by controlling levels of redundancy and checkpointing intervals in the presence of faults. It is transformative in providing a model in which users select a target failure probability at the price of using additional resources.
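
The trade-off between a target failure probability and additional resources can be illustrated with a minimal sketch. It is not the project's actual model: it assumes that node failures are independent, that every rank is replicated the same number of times, and that a job fails only when all replicas of some rank fail. The function names and the parameters `p_node`, `n_ranks`, `target`, and `max_level` are hypothetical and chosen only for illustration.

```python
import math
from typing import Optional


def system_failure_probability(p_node: float, n_ranks: int, redundancy: int) -> float:
    """Probability that at least one rank loses all of its replicas.

    Assumes independent node failures, each with probability p_node
    over the period of interest (an illustrative simplification).
    """
    if not 0.0 <= p_node <= 1.0:
        raise ValueError("p_node must lie in [0, 1]")
    if n_ranks < 1 or redundancy < 1:
        raise ValueError("n_ranks and redundancy must be at least 1")
    # A rank is lost only if every one of its replicas fails.
    p_rank = p_node ** redundancy
    if p_rank >= 1.0:
        return 1.0
    # 1 - (1 - p_rank)^n_ranks, evaluated stably for very small p_rank.
    return -math.expm1(n_ranks * math.log1p(-p_rank))


def min_redundancy(p_node: float, n_ranks: int, target: float,
                   max_level: int = 8) -> Optional[int]:
    """Smallest redundancy level whose failure probability meets the target.

    Returns None if no level up to max_level is sufficient.
    """
    for level in range(1, max_level + 1):
        if system_failure_probability(p_node, n_ranks, level) <= target:
            return level
    return None
```

Under these assumptions, a call such as `min_redundancy(0.001, 100_000, 0.01)` returns the number of replicas per rank needed to keep the job failure probability at or below 1%, and the resource cost grows in proportion to that level. A fuller model would also account for the checkpointing interval, which this sketch leaves out.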

"This material is based upon work supported by the National Science Foundation under Grant No. 1058779."

"Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation."