Collaborative Research: Automatic Extraction of Parallel
I/O Benchmarks from HEC Applications
- funded by: NSF
(award abstract)
- funding level: $499,999 (for NCSU), $245,974 (for Rochester), $250,000 (for UIUC)
- duration: 09/15/2009 - 08/31/2012 (no-cost extension until 08/31/2014)
- PIs/co-PIs: Xiaosong Ma
and Frank Mueller (NCSU), Kai Shen (Rochester), Marianne Winslett (UIUC)
I/O performance is often an issue for high-end computing
(HEC) codes, due to
their increasingly data-intensive nature and the ever-growing CPU-I/O
performance gap. Portable parallel I/O benchmarks can help
(1) application developers to improve their codes' performance,
(2) HEC storage systems architects to improve their designs, and
(3) future and current owners of HEC platforms to reduce hardware cost
and
improve application performance through better system provisioning and
configuration.
To keep up with the growing scale and complexity of
HEC applications, this project develops automated generation of
parallel
I/O benchmarks, analogous to the SPEC and NAS
benchmarks for computation. Our approach will be embedded in
BenchMaker,
a prototype tool that takes a real-world, large-scale parallel
application and automatically distills it into a compact,
human-intelligible, I/O-intensive, and parameterized benchmark. Such a
benchmark
accurately reflects the original application's I/O characteristics and
I/O performance, yet with shorter execution time, reduced need for
libraries, better portability, and easy scalability.
This research will produce benchmarks and tools
that benefit the computational science community at large.
Our benchmark prototypes will be used for parallel computing
course projects and student research contests.
Publications:
-
Stan Park and Kai Shen,
"A Performance Evaluation of Scientific I/O Workloads on Flash-Based SSDs".
In Workshop on Interfaces and Architectures for Scientific Data Storage (IASDS'09),
New Orleans, LA, September 2009.
-
"Scalable I/O Tracing and Analysis" by Karthik Vijayakumar, Frank Mueller, Xiasong Ma, Philip C. Roth in Petascale Data Storage Workshop, Nov 2009.
-
"ScalaTrace: Scalable
Compression and Replay of Communication Traces in High Performance Computing"
by M. Noeth and P. Ratn and F. Mueller and M. Schulz and B. de
Supinski,
Journal of Parallel and Distributed Computing, V 69, No 8, Aug 2009, pages 696-710.
-
"ScalaExtrap: Trace-Based Communication Extrapolation for SPMD Programs"
by X. Wu, F. Mueller
in ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming, Feb 2011.
-
"Probabilistic Communication and I/O Tracing with Deterministic Replay
at Scale" by Xing Wu, Karthik Vijayakumar, Frank
Mueller, Xiaosong Ma, Philip C. Roth in International Conference
on Parallel Processing, Sep 2011 (accepted).
-
"GStream: A General-Purpose Data Streaming Framework on GPU
Clusters" by Yongpeng Zhang, Frank
Mueller in International Conference on Parallel Processing, Sep
2011.
-
Stan Park and Kai Shen,
"FIOS: A Fair, Efficient Flash I/O Scheduler".
In Proc. of the 10th USENIX Conference on File and Storage Technologies (FAST'12),
San Jose, CA, February 2012.
-
"ScalaExtrap: Trace-Based Communication Extrapolation for SPMD Programs"
by X. Wu, F. Mueller
in ACM Transactions on Programming Languages and Systems, Vol. 34, No. 1, Apr
2012, DOI 10.1145/2160910.2160914.
-
"ScalaBenchGen:
Auto-Generation of Communication Benchmark Traces"
by X. Wu, V. Deshpande, F. Mueller, in
International Parallel and Distributed Processing Symposium, May 2012.
-
Stan Park, Terence Kelly, and Kai Shen,
"Failure-Atomic msync(): A Simple and Efficient Mechanism for Preserving the Integrity of Durable Data".
In Proc. of the EuroSys Conference (EuroSys'13),
Prague, Czech Republic, April 2013.
-
Kai Shen and Stan Park,
"FlashFQ: A Fair Queueing I/O Scheduler for Flash-Based SSDs".
In Proc. of the USENIX Annual Technical Conference (USENIX ATC'13),
San Jose, CA, June 2013.
-
"Elastic and Scalable Tracing and Accurate Replay of Non-Deterministic Events"
by X. Wu, F. Mueller
in International Conference on Supercomputing, Jun 2013, pages 59-68.
- "ScalaJack: Customized Scalable Tracing with in-situ Data Analysis" by S. Ananthakrishnan, Frank Mueller in
Euro-Par Conference, Aug 2014.
-
"A Methodology for Automatic Generation of Executable Communication
Specifications from Parallel MPI Applications"
by X. Wu, F. Mueller, S. Pakin
in ACM Transactions on Parallel Computing, Sep 2014.
Theses:
-
"Automatic Generation
of Complete Communication Skeletons from Traces"
by Vivek Deshpande, M.S. Thesis, North Carolina
State University, Aug 2011 (last known position: Intel, OR)
-
"Scalable Communication Tracing for Performance Analysis of Parallel Applications"
by X. Wu, Ph.D. Thesis, North Carolina State
University, Dec 2012 (last known position: Amazon, WA)
-
"Customized
Scalable Tracing with in-situ Data Analysis"
by Srinath Krishna Ananthakrishnan, M.S. Thesis, North Carolina
State University, May 2013 (last known position: Riverbed Technologies, CA)
Software:
ScalaTrace: available from the ScalaTrace web site.
Trace-Driven Scientific I/O Benchmarks:
Traces of scientific I/O workloads are being made available to enable computing-related
research. Examples include traces from
Sandia National Laboratories
and Los Alamos National Laboratory.
Useful statistics can be extracted from such traces. However, it is sometimes desirable
to run the I/O activity that the traces represent, so as to evaluate the performance and other
behaviors of I/O and storage systems. For this purpose, we have created a trace player,
TracePlay/Control, written in C, that recreates some of the trace conditions in the form of
runnable benchmarks. The trace player uses formatted traces derived from the original
scientific I/O traces. It is less a full-blown utility than a benchmark driven by traces
extracted from actual scientific applications: the benchmark itself is largely a thin shell
or wrapper around a parsed trace.
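To make the "thin shell around a parsed trace" idea concrete, the following minimal C sketch
shows how such a replay loop can dispatch traced operations onto ordinary I/O system calls.
The trace_rec layout and the hard-coded records are hypothetical illustrations; the actual
formatted trace used by TracePlay/Control is different.

  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <fcntl.h>
  #include <sys/types.h>

  /* Hypothetical parsed trace record; the real formatted trace differs. */
  struct trace_rec {
      char   op;      /* 'r' = read, 'w' = write, 's' = seek */
      int    fd;      /* descriptor of a pre-created file */
      off_t  offset;  /* seek target (for 's') */
      size_t size;    /* transfer size (for 'r'/'w') */
  };

  /* Replay one traced operation against the recreated file. */
  static void replay_one(const struct trace_rec *r, char *buf)
  {
      switch (r->op) {
      case 'r': read(r->fd, buf, r->size);          break;
      case 'w': write(r->fd, buf, r->size);         break;
      case 's': lseek(r->fd, r->offset, SEEK_SET);  break;
      }
  }

  int main(void)
  {
      /* In the real player the records come from the formatted trace file;
         here two hard-coded records stand in for a parsed trace. */
      int fd = open("replay.dat", O_CREAT | O_RDWR, 0644);
      if (fd < 0) { perror("open"); return 1; }
      char *buf = calloc(1, 1 << 20);
      struct trace_rec trace[] = {
          { 'w', fd, 0,      65536 },
          { 's', fd, 131072, 0     },
      };
      for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++)
          replay_one(&trace[i], buf);
      free(buf);
      close(fd);
      return 0;
  }

In the real player the replayed operations come from the parsed trace, and the files they
touch are created in advance as described next.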
In order to successfully replay a trace, the file system context must be recreated.
Our trace player extracts directory and file names accessed throughout the trace and
recreates the hierarchy. Files are created with the maximum estimated size inferred from
I/O system calls (read, write, seek). Our trace player is capable of running in two modes:
- Sequential (using traceplay): Sequential mode is essentially a batch mode for traces.
It replays a single trace or, given a list of traces, replays them one after another.
- Parallel (using tracecontrol): In parallel mode, two or more traces are replayed
concurrently.
- unsynchronized: an input list of traces is replayed concurrently by parallel
processes created via the fork system call. All processes run as fast as possible,
with no coordination between them.
- synchronized: Since many scientific applications use concurrent processes
with some form of synchronization, we also support a synchronized replay mode.
Synchronized mode requires that some version of MPI be installed (MPICH2 in our
setup), because MPI calls in the traces are used to enforce synchronization; a minimal
sketch of this mode follows the list.
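The following minimal MPI sketch illustrates the idea behind synchronized replay. It assumes
a fixed file size and phase count and uses MPI_Barrier as a stand-in for whatever MPI calls
appear in a given trace; these are illustrative simplifications, not the actual logic of
tracecontrol, which is driven entirely by the parsed trace.

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <fcntl.h>
  #include <sys/types.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* Each rank recreates its own file, pre-sized to the maximum offset
         inferred from its trace (a fixed, hypothetical 1 MB here). */
      char path[64];
      snprintf(path, sizeof path, "replay.%d.dat", rank);
      int fd = open(path, O_CREAT | O_RDWR, 0644);
      if (fd < 0) { perror("open"); MPI_Abort(MPI_COMM_WORLD, 1); }
      ftruncate(fd, 1 << 20);

      /* Replay a few write phases; a barrier after each phase stands in for
         the synchronizing MPI calls recorded in the trace. */
      char *buf = calloc(1, 65536);
      for (int phase = 0; phase < 4; phase++) {
          lseek(fd, (off_t)phase * 65536, SEEK_SET);
          write(fd, buf, 65536);
          MPI_Barrier(MPI_COMM_WORLD);
      }

      free(buf);
      close(fd);
      MPI_Finalize();
      return 0;
  }

Roughly speaking, an unsynchronized run omits the barriers (and uses fork rather than MPI
ranks), letting every process proceed at full speed.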
Download source: traceplayer_v0.9.zip
The traces below originated from those released by
Sandia National Laboratories
and Los Alamos National Laboratory.
We sanitized the original traces and converted them into a format suitable for our
trace player.
Installation and usage notes (also in the README file).
"This material is based upon work supported by the National Science Foundation under Grant No. 0937908."
"Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation."