BLCR Support for Job Pause, Live Migration and Incremental Checkpointing


The objective of this work is to extend the Berkeley Lab Checkpoint/Restart (BLCR) Linux module with advanced fault-tolerance capabilities, which are of particular value for large-scale computational science codes running on high-end clusters. We have developed a set of techniques to reduce checkpoint/restart overhead. We propose to integrate a job pause mechanism, live migration support, and an incremental checkpointing mechanism into the latest BLCR version.

Publications:

Theses: