Online Data Reconstruction for Supercomputers

funded by: ORNL
funding level: $21,000
duration: 01/01/2007 - 06/30/2007

This work seeks to build online recovery mechanisms for transient supercomputer job data. With on-demand data reconstruction, staged input files that become unavailable due to I/O node failures in a parallel file system are transparently patched from their source copies using recovery metadata. To this end, the work leverages and builds on previous research on (1) extending parallel file system metadata with recovery hints and (2) offline data recovery. The approach takes advantage of the existence of external data sources and the immutable nature of scientific datasets on supercomputers, and it handles both underlying storage device failures and file system server node failures.
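
The idea of patching a staged file on demand can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the project's implementation: the StagedFile class, the round-robin striping, the in-memory "sources" dictionary standing in for an external data source, and the URI "src://archive/input.dat" are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass, field

STRIPE_SIZE = 4  # bytes per stripe; kept tiny so the example is easy to follow


@dataclass
class StagedFile:
    """Toy model of a staged input file striped round-robin over I/O nodes."""
    source_uri: str   # recovery hint: where the original copy lives (assumed field)
    num_nodes: int
    stripes: dict = field(default_factory=dict)  # stripe index -> bytes, or None if lost

    def node_of(self, stripe_idx: int) -> int:
        # Assumption: simple round-robin placement of stripes on nodes.
        return stripe_idx % self.num_nodes


def stage(source_uri, sources, num_nodes):
    """Copy the source data into stripes and record the recovery hint."""
    data = sources[source_uri]
    staged = StagedFile(source_uri, num_nodes)
    for offset in range(0, len(data), STRIPE_SIZE):
        staged.stripes[offset // STRIPE_SIZE] = data[offset:offset + STRIPE_SIZE]
    return staged


def fail_node(staged, node):
    """Simulate an I/O node failure: every stripe on that node becomes unavailable."""
    for idx in staged.stripes:
        if staged.node_of(idx) == node:
            staged.stripes[idx] = None


def read(staged, sources):
    """Read the whole file, patching missing stripes from the source copy.

    Refetching is safe only because the dataset is assumed immutable, so the
    byte range at the source still matches what was originally staged.
    """
    out = bytearray()
    patched = []
    for idx in sorted(staged.stripes):
        chunk = staged.stripes[idx]
        if chunk is None:
            start = idx * STRIPE_SIZE
            chunk = sources[staged.source_uri][start:start + STRIPE_SIZE]
            staged.stripes[idx] = chunk  # keep the patched stripe for later reads
            patched.append(idx)
        out += chunk
    return bytes(out), patched


if __name__ == "__main__":
    sources = {"src://archive/input.dat": b"ABCDEFGHIJKLMNOP"}
    staged = stage("src://archive/input.dat", sources, num_nodes=2)
    fail_node(staged, 1)
    data, patched = read(staged, sources)
    print(data, patched)
```

In this sketch the reader never sees the failure: the read path detects the missing stripes, fetches only the affected byte ranges from the source, and returns the complete file. A real parallel file system would also have to place the recovered stripes on surviving nodes, which the toy model does not attempt.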

Publications:

"Optimizing Center Performance through Coordinated Data Staging, Scheduling and Recovery" by Z. Zhang, C. Wang, S. Vazhkudai, X. Ma, G. Pike, J. Cobb and F. Mueller in Supercomputing, Nov 2007.
"On-the-fly Recovery of Job Input Data in Supercomputers" by C. Wang, Z. Zhang, S. Vazhkudai, X. Ma, and F. Mueller in International Conference on Parallel Processing, Sep 2008, pages 620-627.
"Improving the Availability of Supercomputer Job Input Data Using Temporal Replication" by C. Wang, Z. Zhang, S. Vazhkudai, X. Ma, and F. Mueller in International Supercomputing Conference, Jun 2009, pages 149-157.