Resilience for Global Address Spaces
- funded by: LBNL under the DEGAS project
- funding level: $203,393
- duration: 09/24/2013 - 08/15/2016
The objective of this work is to provide functionality for the BLCR
(Berkeley Lab Checkpoint/Restart) Linux module under a PGAS
(partitioned global address space) runtime system, within the DEGAS
software stack, that supports advanced fault-tolerance capabilities.
These capabilities are of particular value to large-scale
computational science codes running on high-end clusters and,
ultimately, exascale facilities. We propose to develop and integrate
into DEGAS a set of advanced techniques that reduce checkpoint/restart
(C/R) overhead.
Publications:
- "DINO: Divergent Node Cloning for Sustained Redundancy in HPC"
  by A. Rezaei, F. Mueller
  in Cluster, Sep 2015, pages 180-183.
- "Affinity-Aware Checkpoint Restart"
  by A. Saini, A. Rezaei, F. Mueller, P. Hargrove, E. Roman
  in Middleware, Dec 2014, pages 121-132.
- "DINO: Divergent Node Cloning for Sustained Redundancy in HPC"
  by A. Rezaei, F. Mueller
  in TR 2014-7, Dept. of Computer Science, North Carolina State
  University, Jun 2014.
- "Sustained Resilience via Live Process Cloning"
  by A. Rezaei, F. Mueller
  in Workshop on Dependable Parallel, Distributed and Network-Centric
  Systems, May 2013.
Theses: