Collaborative Research: Automatic Extraction of Parallel
I/O Benchmarks from HEC Applications
- funded by: NSF
(award abstract)
- funding level: $499,999 (for NCSU), $245,974 (for Rochester), $250,000 (for UIUC)
- duration: 09/15/2009 - 08/31/2012 (no-cost extension until 08/31/2014)
- PIs/co-PIs: Xiaosong Ma
and Frank Mueller (NCSU), Kai Shen (Rochester), Marianne Winslett (UIUC)
I/O performance is often an issue for high-end computing
(HEC) codes, due to
their increasingly data-intensive nature and the ever-growing CPU-I/O
performance gap. Portable parallel I/O benchmarks can help
(1) application developers to improve their codes' performance,
(2) HEC storage systems architects to improve their designs, and
(3) future and current owners of HEC platforms to reduce hardware cost
and
improve application performance through better system provisioning and
configuration.
To keep up with the growing scale and complexity of
HEC applications, this project develops automated generation of
parallel
I/O benchmarks, analogous to the SPEC and NAS
benchmarks for computation. Our approach will be embedded in
BenchMaker,
a prototype tool that takes a real-world, large-scale parallel
application and automatically distills it into a compact,
human-intelligible, I/O-intensive, and parameterized benchmark. Such a
benchmark
accurately reflects the original application's I/O characteristics and
I/O performance, yet with shorter execution time, reduced need for
libraries, better portability, and easy scalability.
This research will produce benchmarks and tools
that benefit the computational science community at large.
Our benchmark prototypes will be used for parallel computing
course projects and student research contests.
Publications:
-
Stan Park and Kai Shen,
"A Performance Evaluation of Scientific I/O Workloads on Flash-Based SSDs".
In Workshop on Interfaces and Architectures for Scientific Data Storage (IASDS'09),
New Orleans, LA, September 2009.
-
"Scalable I/O Tracing and Analysis" by Karthik Vijayakumar, Frank Mueller, Xiasong Ma, Philip C. Roth in Petascale Data Storage Workshop, Nov 2009.
-
"ScalaTrace: Scalable
Compression and Replay of Communication Traces in High Performance Computing"
by M. Noeth and P. Ratn and F. Mueller and M. Schulz and B. de
Supinski,
Journal of Parallel and Distributed Computing, V 69, No 8, Aug 2009, pages 696-710.
-
"ScalaExtrap: Trace-Based Communication Extrapolation for SPMD Programs"
by X. Wu, F. Mueller
in ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming, Feb 2011.
-
"Probabilistic Communication and I/O Tracing with Deterministic Replay
at Scale" by Xing Wu, Karthik Vijayakumar, Frank
Mueller, Xiaosong Ma, Philip C. Roth in International Conference
on Parallel Processing, Sep 2011 (accepted).
-
"GStream: A General-Purpose Data Streaming Framework on GPU
Clusters" by Yongpeng Zhang, Frank
Mueller in International Conference on Parallel Processing, Sep
2011.
-
Stan Park and Kai Shen,
"FIOS: A Fair, Efficient Flash I/O Scheduler".
In Proc. of the 10th USENIX Conference on File and Storage Technologies (FAST'12),
San Jose, CA, February 2012.
-
"ScalaExtrap: Trace-Based Communication Extrapolation for SPMD Programs"
by X. Wu, F. Mueller
in ACM Transactions on Programming Languages and Systems, Vol. 34, No. 1, Apr
2012, DOI 10.1145/2160910.2160914.
-
"ScalaBenchGen:
Auto-Generation of Communication Benchmark Traces"
by X. Wu, V. Deshpande, F. Mueller, in
International Parallel and Distributed Processing Symposium, May 2012.
-
Stan Park, Terence Kelly, and Kai Shen,
"Failure-Atomic msync(): A Simple and Efficient Mechanism for Preserving the Integrity of Durable Data".
In Proc. of the EuroSys Conference (EuroSys'13),
Prague, Czech Republic, April 2013.
-
Kai Shen and Stan Park,
"FlashFQ: A Fair Queueing I/O Scheduler for Flash-Based SSDs".
In Proc. of the USENIX Annual Technical Conference (USENIX ATC'13),
San Jose, CA, June 2013.
-
"Elastic and Scalable Tracing and Accurate Replay of Non-Deterministic Events"
by X. Wu, F. Mueller
in International Conference on Supercomputing, Jun 2013, pages 59-68.
- "ScalaJack: Customized Scalable Tracing with in-situ Data Analysis" by S. Ananthakrishnan, Frank Mueller in
Euro-Par Conference, Aug 2014.
-
"A Methodology for Automatic Generation of Executable Communication
Specifications from Parallel MPI Applications"
by X. Wu, F. Mueller, S. Pakin
in ACM Transactions on Parallel Computing, Sep 2014.
Theses:
-
"Automatic Generation
of Complete Communication Skeletons from Traces"
by Vivek Deshpande, M.S. Thesis, North Carolina
State University, Aug 2011 (last known position: Intel, OR)
-
"Scalable Communication Tracing for Performance Analysis of Parallel Applications"
by X. Wu, Ph.D. Thesis, North Carolina State
University, Dec 2012 (last known position: Amazon, WA)
-
"Customized
Scalable Tracing with in-situ Data Analysis"
by Srinath Krishna Ananthakrishnan, M.S. Thesis, North Carolina
State University, May 2013 (last known position: Riverbed Technologies, CA)
Software:
ScalaTrace: available from the ScalaTrace web site.
Trace-Driven Scientific I/O Benchmarks:
Traces of scientific I/O workloads are being made available to enable computing-related
research. Examples include traces from
Sandia National Laboratories
and Los Alamos National Laboratory.
Useful statistics can be extracted from such traces. However, it is sometimes desirable
to run the I/O activity that the traces represent, so as to evaluate the performance and other
behaviors of I/O and storage systems. For this purpose, we have created a trace player,
TracePlay/Control, written in C, that recreates some of the trace conditions in the form of
runnable benchmarks. The trace player uses formatted traces derived from the original
scientific I/O traces. It is less a full-blown utility than a benchmark driven by traces
extracted from actual scientific applications: the benchmark itself is largely a thin shell
or wrapper around a parsed trace.
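To make the "thin shell around a parsed trace" idea concrete, the following minimal C sketch
shows how such a replay loop can dispatch traced operations onto ordinary I/O system calls.
The trace_rec layout and the hard-coded records are hypothetical illustrations; the actual
formatted trace used by TracePlay/Control is different.

  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <fcntl.h>
  #include <sys/types.h>

  /* Hypothetical parsed trace record; the real formatted trace differs. */
  struct trace_rec {
      char   op;      /* 'r' = read, 'w' = write, 's' = seek */
      int    fd;      /* descriptor of a pre-created file */
      off_t  offset;  /* seek target (for 's') */
      size_t size;    /* transfer size (for 'r'/'w') */
  };

  /* Replay one traced operation against the recreated file. */
  static void replay_one(const struct trace_rec *r, char *buf)
  {
      switch (r->op) {
      case 'r': read(r->fd, buf, r->size);          break;
      case 'w': write(r->fd, buf, r->size);         break;
      case 's': lseek(r->fd, r->offset, SEEK_SET);  break;
      }
  }

  int main(void)
  {
      /* In the real player the records come from the formatted trace file;
         here two hard-coded records stand in for a parsed trace. */
      int fd = open("replay.dat", O_CREAT | O_RDWR, 0644);
      if (fd < 0) { perror("open"); return 1; }
      char *buf = calloc(1, 1 << 20);
      struct trace_rec trace[] = {
          { 'w', fd, 0,      65536 },
          { 's', fd, 131072, 0     },
      };
      for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++)
          replay_one(&trace[i], buf);
      free(buf);
      close(fd);
      return 0;
  }

In the real player the replayed operations come from the parsed trace, and the files they
touch are created in advance as described next.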
In order to successfully replay a trace, the file system context must be recreated.
Our trace player extracts directory and file names accessed throughout the trace and
recreates the hierarchy. Files are created with the maximum estimated size inferred from
I/O system calls (read, write, seek). Our trace player is capable of running in two modes:
- Sequential (using traceplay): Sequential mode is essentially a batch mode for traces.
It replays a single trace or, given a list of traces, replays them one after another.
- Parallel (using tracecontrol): In parallel mode, two or more traces are replayed
concurrently.
- unsynchronized: an input list of traces is replayed concurrently by parallel
processes created via the fork system call. All processes run as fast as possible,
with no coordination between them.
- synchronized: Since many scientific applications use concurrent processes
with some form of synchronization, we also support a synchronized replay mode.
Synchronized mode requires that some version of MPI be installed (MPICH2 in our
setup), because MPI calls in the traces are used to enforce synchronization; a minimal
sketch of this mode follows the list.
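The following minimal MPI sketch illustrates the idea behind synchronized replay. It assumes
a fixed file size and phase count and uses MPI_Barrier as a stand-in for whatever MPI calls
appear in a given trace; these are illustrative simplifications, not the actual logic of
tracecontrol, which is driven entirely by the parsed trace.

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <fcntl.h>
  #include <sys/types.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* Each rank recreates its own file, pre-sized to the maximum offset
         inferred from its trace (a fixed, hypothetical 1 MB here). */
      char path[64];
      snprintf(path, sizeof path, "replay.%d.dat", rank);
      int fd = open(path, O_CREAT | O_RDWR, 0644);
      if (fd < 0) { perror("open"); MPI_Abort(MPI_COMM_WORLD, 1); }
      ftruncate(fd, 1 << 20);

      /* Replay a few write phases; a barrier after each phase stands in for
         the synchronizing MPI calls recorded in the trace. */
      char *buf = calloc(1, 65536);
      for (int phase = 0; phase < 4; phase++) {
          lseek(fd, (off_t)phase * 65536, SEEK_SET);
          write(fd, buf, 65536);
          MPI_Barrier(MPI_COMM_WORLD);
      }

      free(buf);
      close(fd);
      MPI_Finalize();
      return 0;
  }

Roughly speaking, an unsynchronized run omits the barriers (and uses fork rather than MPI
ranks), letting every process proceed at full speed.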
Download source: traceplayer_v0.9.zip
The traces below originated from those released by
Sandia National Laboratories
and Los Alamos National Laboratory.
We sanitized the original traces and converted them into a format suitable for our
trace player.
Installation and usage notes (also in the README file).
"This material is based upon work supported by the National Science Foundation under Grant No. 0937908."
"Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation."