Resilience for Global Address Spaces


The objective of this work is to make the functionality of the BLCR (Berkeley Lab Checkpoint/Restart) Linux module available under a PGAS (Partitioned Global Address Space) runtime system, within the DEGAS software stack, in order to support advanced fault-tolerance capabilities. These capabilities are of particular value to large-scale computational science codes running on high-end clusters and, ultimately, on exascale facilities. We propose to develop and integrate into DEGAS a set of advanced techniques that reduce checkpoint/restart (C/R) overhead.

Publications:

Theses: