RESYST: Resilience via Synergistic Redundancy and Fault Tolerance for High-End Computing
- funded by: NSF (award abstract)
- funding level: $376,219
- duration: 10/01/2010 - 09/30/2013 (no-cost extension until 09/30/2016)
- PI: Frank Mueller
In High-End Computing (HEC), faults have become the norm rather than
the exception for parallel computation on clusters with 10s/100s of
thousands of cores. As the core count increases, so does the overhead
for fault-tolerant techniques relying on checkpoint/restart
(C/R) mechanisms. Once C/R overhead reaches 50%, redundancy becomes a viable
alternative to fault recovery and actually scales, which
makes the approach attractive for HEC.
The objective of this work is to develop a synergistic approach by
combining C/R-based fault tolerance with redundancy in HEC
installations to achieve high levels of resilience.
This work alleviates scalability limitations of current fault-tolerant
practices. It contributes to fault modeling as well as fault detection
and recovery, significantly advancing existing techniques by
controlling levels of redundancy and checkpointing intervals in the
presence of faults. It is transformative in providing a model where
users select a target failure probability at the price of using
additional resources.
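The trade-off between a target failure probability and additional resources can be illustrated with a small sketch. This is not the project's model: it assumes independent, exponentially distributed node failures, assumes a job fails only when every replica of some rank has failed, ignores replica repair and restart, and uses Young's first-order approximation as a stand-in for choosing a checkpoint interval. All function names and parameter values are hypothetical.

```python
import math


def node_failure_prob(hours, node_mtbf_hours):
    """P(one node fails within `hours`), assuming exponential failures."""
    return 1.0 - math.exp(-hours / node_mtbf_hours)


def job_failure_prob(hours, node_mtbf_hours, ranks, degree):
    """P(job fails within `hours`) when each rank runs on `degree` replicas.

    Assumption: a rank is lost only if all of its replicas fail, and the
    job is lost if any rank is lost. Failed replicas are not replaced.
    """
    p_rank = node_failure_prob(hours, node_mtbf_hours) ** degree
    return 1.0 - (1.0 - p_rank) ** ranks


def min_redundancy(target, hours, node_mtbf_hours, ranks, max_degree=8):
    """Smallest replication degree that meets the target failure probability.

    Returns None if no degree up to `max_degree` is sufficient.
    """
    for degree in range(1, max_degree + 1):
        if job_failure_prob(hours, node_mtbf_hours, ranks, degree) <= target:
            return degree
    return None


def young_interval(checkpoint_cost_hours, system_mtbf_hours):
    """Young's first-order approximation of the checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost_hours * system_mtbf_hours)


if __name__ == "__main__":
    # Hypothetical job: 100,000 ranks, 24 hours, 5-year node MTBF.
    ranks, hours, node_mtbf = 100_000, 24.0, 5 * 365 * 24.0
    for target in (0.05, 0.01):
        degree = min_redundancy(target, hours, node_mtbf, ranks)
        print(f"target {target}: replication degree {degree}")
    # Without redundancy, the system MTBF shrinks with the node count.
    print(f"checkpoint interval: {young_interval(0.1, node_mtbf / ranks):.3f} h")
```

In this toy model, each additional replica multiplies the resource cost while reducing the per-rank failure probability geometrically, which is the kind of price a user would pay for a lower target failure probability.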
Publications:
-
"FuncyTuner: Auto-tuning Scientific Applications With Per-loop
Compilation" by Tao Wang, Nikhil Jain, David
Beckingsale, David Boehme, Frank
Mueller, Todd Gamblin in International Conference
on Parallel Processing (ICPP), Aug 2019.
- Hybrid MPI/OpenMP
Programming on the Tilera Manycore Architecture
by Vishwanathan Chandru, Frank Mueller
in International Conference on High Performance Computing & Simulation (HPCS), Jul 2016.
- Efficient and
Predictable Group Communication for Manycore NoCs
by Karthik Yagna, Onkar Patil, Frank Mueller
in International Supercomputing Conference (ISC), Jun 2016.
- Distributed Job
Allocation for Large-Scale Manycores
by Subramanian Ramachandran, Frank Mueller
in International Supercomputing Conference (ISC), Jun 2016.
- Mini-Ckpts: Surviving
OS Failures in Persistent Memory
by David Fiala, Frank Mueller, Kurt Ferreira,
Christian Engelmann
in International Conference on Supercomputing (ICS), Jun 2016.
- TintMalloc:
Reducing Memory Access Divergence via Controller-Aware Coloring
by Xing Pan, Yasaswini Gownivaripalli, Frank Mueller in
International Parallel and Distributed Processing Symposium (IPDPS), May 2016.
-
Reducing NoC and Memory Contention for Manycores
by V. Chandru, F. Mueller
in Architecture of Computing Systems (ARCS), Apr 2016.
-
"DINO: Divergent Node Cloning for Sustained Redundancy in HPC"
by A. Rezaei, F. Mueller
in Cluster, Sep 2015, pages 180-183.
-
"Affinity-Aware Checkpoint Restart"
by A. Saini, A. Rezaei, F. Mueller, P. Hargrove, E. Roman
in Middleware, Dec 2014.
-
"Skeptical Programming and Selective Reliability" by
James Elliott, Mark Hoemmen, Frank Mueller, refereed poster at Supercomputing, Nov 2014.
-
"Exploiting Data Representation for Fault Tolerance"
by James Elliott, Mark Hoemmen, and Frank
Mueller, Workshop on Latest Advances in Scalable Algorithms for
Large-Scale Systems (ScalA), Nov 2014.
- Snapify: Capturing
Snapshots of Offload Applications on Xeon Phi Manycore
Processors by Arash Rezaei, Giuseppe Coviello,
Cheng-Hong Li, Srimat Chakradhar, Frank Mueller in
High-Performance Parallel and Distributed Computing, Jun 2014.
- Evaluating the Impact of SDC on the GMRES Iterative Solver by James Elliott, Mark Hoemmen, Frank Mueller in
International Parallel and Distributed Processing Symposium, May 2014.
-
"Resilience in Numerical Methods: A Position on
Fault Models and Methodologies" by J. Elliott, M. Hoemmen, F. Mueller, invited talk at SIAM Conference on
Computational Science and Engineering, Feb 2014.
-
"Tolerating Silent Data Corruption in Opaque Preconditioners" by J. Elliott, M. Hoemmen, F. Mueller, Computing Research Repository, Feb 2014.
-
"Sustained Resilience via Live Process Cloning"
by Arash Rezaei, Frank Mueller, Workshop on
Dependable Parallel, Distributed and Network-Centric Systems, May 2013.
-
"Auto-Generation and
Auto-Tuning of 3D Stencil Codes on Homogeneous and Heterogeneous GPU Clusters" by Y. Zhang and F. Mueller in Transactions on
Parallel and Distributed Systems, Vol. 24, No. 3, Mar 2013, pages 417-427, DOI 10.1109/TPDS.2012.160.
-
Highly Efficient and Predictable Group Communication over Multi-core NoCs
by K. Yagna, F. Mueller, refereed work-in-progress at RTAS, Apr 2013.
-
"Exploiting Data Representation for Fault
Tolerance" by J. Elliott, M. Hoemmen,
F. Mueller, Computing Research Repository, Feb 2013.
-
"Quantifying the Impact of Single Bit Flips on Floating Point
Arithmetic" by J. Elliott, F. Mueller,
M. Stoyanov, C. Webster, invited talk at SIAM Conference on
Computational Science and Engineering, Feb 2013,
see TR 2013-2,
Dept. of Computer Science, North Carolina State University, Mar 2013.
-
"Quantifying the Impact of Single Bit Flips on Floating Point
Arithmetic" by J. Elliott, F. Mueller,
M. Stoyanov, C. Webster, invited talk at Smoky Mountains
Computational Sciences and Engineering Conference, Sep 2012,
see TR 2013-2,
Dept. of Computer Science, North Carolina State University, Mar 2013.
-
"HiDP: A Hierarchical Data
Parallel Language" by Y. Zhang and F. Mueller in International
Symposium on Code Generation and Optimization, Feb 2013, accepted.
-
"Detection and Correction of Silent Data Corruption for Large-Scale
High-Performance Computing" by D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira, R. Brightwell
in Supercomputing, Nov 2012, pages 78:1--78:12.
-
"CuNesl: Compiling Nested Data-Parallel Languages for SIMT
Architectures" by Y. Zhang, Frank
Mueller in International Conference
on Parallel Processing, Sep 2012, DOI 10.1109/ICPP.2012.21.
-
"Combining Partial Redundancy and Checkpointing for HPC" by J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. Ferreira,
C. Engelmann in International
Conference on Distributed Computing Systems, Jun 2012, DOI 10.1109/ICDCS.2012.56.
-
"Evaluating Operating System
Vulnerability to Memory Errors"
by Kurt B. Ferreira, Kevin Pedretti, Patrick G. Bridges, Ron Brightwell, David Fiala and Frank Mueller, Workshop on
Runtime and Operating Systems for Supercomputers, Jun 2012, DOI 10.1145/2318916.2318930.
-
"ScalaBenchGen:
Auto-Generation of Communication Benchmark Traces"
by X. Wu, V. Deshpande, F. Mueller, in
International Parallel and Distributed Processing Symposium, May 2012,
DOI 10.1109/IPDPS.2012.114.
-
"Detection and Correction of Silent Data Corruption for Large-Scale
High-Performance Computing" by D. Fiala, F. Mueller, C. Engelmann,
K. Ferreira, R. Brightwell, R. Riesen
in TR 2012-5, Dept. of Computer Science, North Carolina State
University, May 2012.
-
"Proactive Process-Level Live Migration and Back Migration in HPC Environments"
by C. Wang, F. Mueller, C. Engelmann and S. Scott
in Journal of Parallel and Distributed Computing, Vol. 72,
No. 2, Feb 2012, pages 254-267, DOI 10.1016/j.jpdc.2011.10.009.
-
"Assessing HPC Failure Detectors for MPI Jobs"
by K. Kharbas, D. Kim, T. Hoefler and
F. Mueller
in Euromicro International Conference on Parallel, Distributed and
Network-Based Computing, Feb 2012, pages 81-88.
-
"A Tunable, Software-based DRAM Error Detection and Correction Library for HPC" by
David Fiala, Kurt Ferreira, Frank Mueller, Christian Engelmann, refereed poster at Supercomputing, Nov 2011.
-
"Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing" by
David Fiala, Frank Mueller, Christian Engelmann, Rolf Riesen, Kurt
Ferreira, refereed poster at Supercomputing, Nov 2011.
-
"A Tunable, Software-based
DRAM Error Detection and Correction Library for HPC"
by D. Fiala, K. Ferreira, F. Mueller,
C. Engelmann, Workshop on Resiliency in High Performance Computing
(Resilience) in Clusters, Clouds, and Grids, Sep 2011, DOI 10.1007/978-3-642-29740-3_29.
-
"Comparing different approaches for Incremental Checkpointing: The Showdown"
by M. Vasavada, F. Mueller, P. Hargrove
in Linux Symposium, Jun 2011, pages 69-79.
-
"Failure Detection within MPI Jobs: Periodic Outperforms Sporadic"
by K. Kharbas, D. Kim, K. KC, T. Hoefler and
F. Mueller
in TR 2011-13, Dept. of Computer Science, North Carolina State
University, Jun 2011.
Theses:
-
"Analysis of Memory
Performance and Execution Models for Large-Scale Manycores"
by Vishwanathan Chandru, M.S. Thesis, North Carolina State
University, Aug 2015 (last known position: Intel, IL)
-
"Distributed Job Allocation for Large-Scale Many-cores"
by Subramanian Ramachandran, M.S. Thesis, North Carolina State University, May 2014
(last known position: Riverbed, CA)
-
"Collective Communication
for Multi-core NOC Interconnects" by Karthik
Yagna, M.S. Thesis, North Carolina State University, May 2013
(last known position: Riverbed Technologies, CA)
-
"Exploiting Data-Parallelism in GPUs"
by Y. Zhang, Ph.D. Thesis, North Carolina State
University, Sep 2012 (last known position: Stone Ridge Technologies, MD)
-
"Failure Detection and
Partial Redundancy in HPC" by Kishor Kharbas,
M.S. Thesis, North Carolina State University, Aug 2011 (last known
position: Intel, OR)
-
"Design and
Implementation of Process Migration and Cloning in BLCR"
by Shobit Mishra, M.S. Thesis, North Carolina
State University, Aug 2011 (last known position: Intel, CA)
"This material is based upon work supported by the National Science Foundation under Grant No. 1058779."
"Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation."