RESYST: Resilience via Synergistic Redundancy and Fault Tolerance for High-End Computing

funded by: NSF (award abstract)
funding level: $376,219
duration: 10/01/2010 - 09/30/2013 (no-cost extension until 09/30/2016)
PI: Frank Mueller

In High-End Computing (HEC), faults have become the norm rather than the exception for parallel computation on clusters with tens or hundreds of thousands of cores. As the core count increases, so does the overhead of fault-tolerant techniques that rely on checkpoint/restart (C/R) mechanisms. At 50% overhead, redundancy is a viable alternative to fault recovery and actually scales, which makes the approach attractive for HEC.

The objective of this work is to develop a synergistic approach that combines C/R-based fault tolerance with redundancy in HEC installations to achieve high levels of resilience.

This work alleviates scalability limitations of current fault-tolerant practices. It contributes to fault modeling as well as fault detection and recovery, significantly advancing existing techniques by controlling levels of redundancy and checkpointing intervals in the presence of faults. It is transformative in providing a model where users select a target failure probability at the price of using additional resources.

Publications:

"FuncyTuner: Auto-tuning Scientific Applications With Per-loop Compilation" by Tao Wang, Nikhil Jain, David Beckingsale, David Boehme, Frank Mueller, Todd Gamblin in International Conference on Parallel Processing (ICPP), Aug 2019.
"Hybrid MPI/OpenMP Programming on the Tilera Manycore Architecture" by Vishwanathan Chandru, Frank Mueller in International Conference on High Performance Computing & Simulation (HPCS), Jul 2016.
"Efficient and Predictable Group Communication for Manycore NoCs" by Karthik Yagna, Onkar Patil, Frank Mueller in International Supercomputing Conference (ISC), Jun 2016.
"Distributed Job Allocation for Large-Scale Manycores" by Subramanian Ramachandran, Frank Mueller in International Supercomputing Conference (ISC), Jun 2016.
"Mini-Ckpts: Surviving OS Failures in Persistent Memory" by David Fiala, Frank Mueller, Kurt Ferreira, Christian Engelmann in International Conference on Supercomputing (ICS), Jun 2016.
"TintMalloc: Reducing Memory Access Divergence via Controller-Aware Coloring" by Xing Pan, Yasaswini Gownivaripalli, Frank Mueller in International Parallel and Distributed Processing Symposium (IPDPS), May 2016.
"Reducing NoC and Memory Contention for Manycores" by V. Chandru, F. Mueller in Architecture of Computing Systems (ARCS), Apr 2016.
"DINO: Divergent Node Cloning for Sustained Redundancy in HPC" by A. Rezaei, F. Mueller in Cluster, Sep 2015, pages 180-183.
"Affinity-Aware Checkpoint Restart" by A. Saini, A. Rezaei, F. Mueller, P. Hargrove, E. Roman in Middleware, Dec 2014.
"Skeptical Programming and Selective Reliability" by James Elliott, Mark Hoemmen, Frank Mueller, refereed poster at Supercomputing, Nov 2014.
"Exploiting Data Representation for Fault Tolerance" by James Elliott, Mark Hoemmen, and Frank Mueller, Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA), Nov 2014.
"Snapify: Capturing Snapshots of Offload Applications on Xeon Phi Manycore Processors" by Arash Rezaei, Giuseppe Coviello, Cheng-Hong Li, Srimat Chakradhar, Frank Mueller in High-Performance Parallel and Distributed Computing, Jun 2014.
"Evaluating the Impact of SDC on the GMRES Iterative Solver" by James Elliott, Mark Hoemmen, Frank Mueller in International Parallel and Distributed Processing Symposium, May 2014.
"Resilience in Numerical Methods: A Position on Fault Models and Methodologies" by J. Elliott, M. Hoemmen, F. Mueller, invited talk at SIAM Conference on Computational Science and Engineering, Feb 2014.
"Tolerating Silent Data Corruption in Opaque Preconditioners" by J. Elliott, M. Hoemmen, F. Mueller, Computing Research Repository, Feb 2014.
"Sustained Resilience via Live Process Cloning" by Arash Rezaei, Frank Mueller, Workshop on Dependable Parallel, Distributed and Network-Centric Systems, May 2013.
"Auto-Generation and Auto-Tuning of 3D Stencil Codes on Homogeneous and Heterogeneous GPU Clusters" by Y. Zhang and F. Mueller in Transactions on Parallel and Distributed Systems, Vol. 24, No. 3, Mar 2013, pages 417-427, DOI 10.1109/TPDS.2012.160.
"Highly Efficient and Predictable Group Communication over Multi-core NoCs" by K. Yagna, F. Mueller, refereed work-in-progress at RTAS, Apr 2013.
"Exploiting Data Representation for Fault Tolerance" by J. Elliott, M. Hoemmen, F. Mueller, Computing Research Repository, Feb 2013.
"Quantifying the Impact of Single Bit Flips on Floating Point Arithmetic" by J. Elliott, F. Mueller, M. Stoyanov, C. Webster, invited talk at SIAM Conference on Computational Science and Engineering, Feb 2013, see TR 2013-2, Dept. of Computer Science, North Carolina State University, Mar 2013.
"Quantifying the Impact of Single Bit Flips on Floating Point Arithmetic" by J. Elliott, F. Mueller, M. Stoyanov, C. Webster, invited talk at Smoky Mountains Computational Sciences and Engineering Conference, Sep 2012, see TR 2013-2, Dept. of Computer Science, North Carolina State University, Mar 2013.
"HiDP: A Hierarchical Data Parallel Language" by Y. Zhang and F. Mueller in International Symposium on Code Generation and Optimization, Feb 2013, accepted.
"Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing" by D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira, R. Brightwell, " in Supercomputing, Nov 2012, pages 78:1--78:12.
"CuNesl: Compiling Nested Data-Parallel Languages for SIMT Architectures" by Y. Zhang, Frank Mueller in International Conference on Parallel Processing, Sep 2012, DOI 10.1109/ICPP.2012.21.
"Combining Partial Redundancy and Checkpointing for HPC" by J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. Ferreira, C. Engelmann in International Conference on Distributed Computing Systems, Jun 2012, DOI 10.1109/ICDCS.2012.56.
"Evaluating Operating System Vulnerability to Memory Errors" by Kurt B. Ferreira, Kevin Pedretti, Patrick G. Bridges, Ron Brightwell, David Fiala and Frank Mueller, Workshop on Runtime and Operating Systems for Supercomputers, Jun 2012, DOI 10.1145/2318916.2318930.
"ScalaBenchGen: Auto-Generation of Communication Benchmark Traces" by X. Wu, V. Deshpande, F. Mueller, in International Parallel and Distributed Processing Symposium, May 2012 DOI 10.1109/IPDPS.2012.114.
"Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing" by D. Fiala, F. Mueller, C. Engelmann, K. Ferreira, R. Brightwell, R. Riesen" in TR 2012-5, Dept. of Computer Science, North Carolina State University, May 2012.
"Proactive Process-Level Live Migration and Back Migration in HPC Environments" by C. Wang, F. Mueller, C. Engelmann and S. Scott in Journal of Parallel and Distributed Computing, V 72, No 2, Feb 2012, pages 254-267, DOI 10.1016/j.jpdc.2011.10.009.
"Assessing HPC Failure Detectors for MPI Jobs" by K. Kharbas, D. Kim, T. Hoefler and F. Mueller in Euromicro International Conference on Parallel, Distributed and Network-Based Computing, Feb 2012, pages 81-88.
"A Tunable, Software-based DRAM Error Detection and Correction Library for HPC" by David Fiala, Kurt Ferreira, Frank Mueller, Christian Engelmann, refereed poster at Supercomputing, Nov 2011.
"Detection and Correction of Silent Data Corruption for Large-Scale High-Performance" by David Fiala, Frank Mueller, Christian Engelmann, Rolf Riesen, Kurt Ferreira, refereed poster at Supercomputing, Nov 2011.
"A Tunable, Software-based DRAM Error Detection and Correction Library for HPC" by D. Fiala, K. Ferreira, F. Mueller, C. Engelmann, Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, Sep 2011, DIO 10.1007/978-3-642-29740-3_29.
"Comparing different approaches for Incremental Checkpointing: The Showdown" by M. Vasavada, F. Mueller, P. Hargrove in Linux Symposium, Jun 2011, pages 69-79.
"Failure Detection within MPI Jobs: Periodic Outperforms Sporadic" by K. Kharbas, D. Kim, K. KC, T. Hoefler and F. Mueller" in TR 2011-13, Dept. of Computer Science, North Carolina State University, Jun 2011.

Theses:

"Analysis of Memory Performance and Execution Models for Large-Scale Manycores" by Vishwanathan Chandru, M.S. Thesis, North Carolina State University, Aug 2015 (last known position: Intel, IL)
"Distributed Job Allocation for Large-Scale Many-cores" by Subramanian Ramachandran, M.S. Thesis, North Carolina State University, May 2014 (last known position: Riverbed, CA)
"Collective Communication for Multi-core NOC Interconnects" by Karthik Yagna, M.S. Thesis, North Carolina State University, May 2013 (last known position: Riverbed technologies, CA)
"Exploiting Data-Parallelism in GPUs" by Y. Zhang, Ph.D. Thesis, North Carolina State University, Sep 2012 (last known position: Stone Ridge Technologies, MD)
"Failure Detection and Partial Redundancy in HPC" by Kirshor Kharbas, M.S. Thesis, North Carolina State University, Aug 2011 (last known position: Intel, OR)
"Design and Implementation of Process Migration and Cloning in BLCR" by Shobit Mishra, M.S. Thesis, North Carolina State University, Aug 2011 (last known position: Intel, CA)

"This material is based upon work supported by the National Science Foundation under Grant No. 1058779."

"Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation."