This work seeks to build online recovery mechanisms for transient supercomputer job data. With our on-demand data reconstruction, staged input files that become unavailable because of I/O node failures in a parallel file system are transparently patched from their source copies using recovery metadata. To this end, the work will build on previous research on (1) extending parallel file system metadata with recovery hints and (2) offline data recovery. On-demand reconstruction takes advantage of two properties of scientific datasets on supercomputers: copies exist at external data sources, and the data are immutable once staged, so a source copy remains a valid replacement. It handles both underlying storage device failures and file system server node failures.
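As a rough illustration of the on-demand reconstruction idea, the Python sketch below simulates a striped staged file whose missing stripes are patched from a source copy at read time. It is not the project's implementation: the names `RecoveryHint`, `StagedFile`, `read_stripe`, and `read_file`, the `source_uri` field, and the example URI are all hypothetical, and the parallel file system and remote transfer are replaced by in-memory byte strings.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class RecoveryHint:
    """Hypothetical recovery metadata kept alongside the file's layout."""
    source_uri: str    # where the original copy lives (assumed field)
    stripe_size: int   # bytes per stripe


@dataclass
class StagedFile:
    hint: RecoveryHint
    # stripe index -> data; None models a stripe lost to an I/O node failure
    stripes: Dict[int, Optional[bytes]]


def read_stripe(f: StagedFile, index: int, sources: Dict[str, bytes]) -> bytes:
    """Return one stripe, patching it from the source copy if it is missing."""
    data = f.stripes.get(index)
    if data is None:
        # Because the dataset is immutable, the same byte range of the
        # source copy is a valid replacement for the lost stripe.
        offset = index * f.hint.stripe_size
        source = sources[f.hint.source_uri]  # stands in for a remote fetch
        data = source[offset:offset + f.hint.stripe_size]
        f.stripes[index] = data  # patch the staged copy in place
    return data


def read_file(f: StagedFile, n_stripes: int, sources: Dict[str, bytes]) -> bytes:
    """Read the whole file, reconstructing unavailable stripes on demand."""
    return b"".join(read_stripe(f, i, sources) for i in range(n_stripes))


# Example: stripe 1 is unavailable and is rebuilt from the source copy.
original = b"abcdefghij"
sources = {"source://example/input.dat": original}  # hypothetical URI
staged = StagedFile(
    hint=RecoveryHint("source://example/input.dat", stripe_size=4),
    stripes={0: b"abcd", 1: None, 2: b"ij"},
)
recovered = read_file(staged, 3, sources)
```

In this sketch only the missing byte range is fetched, so the cost of recovery scales with the amount of lost data rather than with the file size.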
Related ORNL project: Robust Supercomputing I/O for Petascale Environments
Publications: