ScalaJack: Scalable Trace-Based Tools for In-Situ Data Analysis of HPC Applications
- funded by: NSF (award abstract)
- funding level: $457,395
- duration: 06/01/2012 - 05/31/2015 (no-cost extension until 05/31/2017)
- PI: Frank Mueller
Production codes on supercomputers are struggling to remain scalable
each time the processor core count increases by a factor of 10, even
though they run efficiently at smaller scale.
Root cause diagnosis, however, fails at petascale because (1) symptoms of
performance problems can be subtle, (2) only a few
metrics can be collected efficiently and (3) tools can feasibly record
only a small subset of even these metrics.
This work addresses these problems by creating a framework that allows
application developers to focus on data analysis that drives customized
data extraction combined with on-the-fly analysis specifically geared
to their individual problems. This is accomplished by combining trace
analysis and in-situ data analysis techniques at runtime, thereby
lifting data reduction to a new level where it IS analysis. With this
approach, modular measurement and analysis components are combined to
selectively extract representative data from production codes in a
problem-specific manner, which enables root cause analysis.
The work demonstrates the feasibility of customized data
extraction and analysis at scale for root cause analysis on current
and forthcoming multi-petascale supercomputers. It thus helps
sustain scalable scientific computing into the future, up to the largest
scales. Results of this work will be contributed as open-source code
to the research community and beyond, allowing other groups to
not only build tools on top of our framework but also contribute their
own components.
Publications:
- "FuncyTuner: Auto-tuning Scientific Applications With Per-loop Compilation"
by Tao Wang, Nikhil Jain, David Beckingsale, David Boehme, Frank Mueller, Todd Gamblin
in International Conference on Parallel Processing (ICPP), Aug 2019.
- "HiDP: A Hierarchical Data Parallel Language"
by Y. Zhang and F. Mueller
in International Symposium on Code Generation and Optimization (CGO), Feb 2013.
- Elastic and Scalable Tracing and Accurate Replay of Non-Deterministic Events
by X. Wu, F. Mueller
in International Conference on Supercomputing (ICS), Jun 2013.
- ScalaJack: Customized Scalable Tracing with in-situ Data Analysis
by S. Ananthakrishnan, Frank Mueller
in Euro-Par Conference, Aug 2014.
- Scalable Tracing of MPI Programs through Signature-Based Clustering Algorithms
by A. Bahmani, F. Mueller
in International Conference on Supercomputing (ICS), Jun 2014.
- ACURDION: An Adaptive Clustering-based Algorithm for Tracing Large-scale MPI Applications
by A. Bahmani, F. Mueller
in IEEE Big Data, Oct 2015.
- "HPC I/O Trace Extrapolation"
by Xiaoqing Luo, Frank Mueller, Philip Carns, John Jenkins, Robert Latham, Robert Ross, Shane Snyder
in Workshop on Extreme-Scale Programming Tools (ESPT15), Nov 2015.
- "SparkScore: Leveraging Apache Spark for Distributed Genomic Inference"
by Amir Bahmani, Alex B. Sibley, Mahmoud Parsian, Kouros Owzar, Frank Mueller
in Workshop on High Performance Computational Biology (HiCOMB16), May 2016.
- Performance Analysis of a Multi-Tenant In-memory Data Grid
by Anwesha Das, Frank Mueller, Xiaohui Gu, Arun Iyengar
in IEEE Cloud, Jun/Jul 2016.
- "Efficient Clustering for Ultra-Scale Application Tracing"
by A. Bahmani, F. Mueller
in Journal of Parallel and Distributed Computing (JPDC), V ??,
No ?, Aug 2016, pages ???, DOI 10.1016/j.jpdc.2016.08.001, accepted.
- Power Tuning HPC Jobs on Power-Constrained Systems
by Neha Gholkar, Frank Mueller, Barry Rountree
in International Conference on Parallel Architecture and
Compilation Techniques (PACT), Sep 2016.
- Benchmark Generation and Simulation at Extreme Scale
by Mahesh Lagadapati, Frank Mueller, Christian Engelmann
in International Symposium on Distributed Simulation and Real Time
Applications (DS-RT), Sep 2016, pages 9-18.
- ScalaIOExtrap: Elastic I/O Tracing and Extrapolation
by Xiaoqing Luo, Frank Mueller, Philip Carns, Jonathan Jenkins, Robert Latham, Robert Ross, Shane Snyder
in International Parallel and Distributed Processing Symposium (IPDPS), May 2017.
Theses:
- "Exploiting Data-Parallelism in GPUs"
by Y. Zhang, Ph.D. Thesis, North Carolina State
University, Sep 2012 (last known position: Stone Ridge Technologies, MD)
- "Scalable Communication Tracing for Performance Analysis of Parallel Applications"
by X. Wu, Ph.D. Thesis, North Carolina State
University, Dec 2012 (last known position: Amazon, WA)
- "Customized Scalable Tracing with in-situ Data Analysis"
by Srinash Krishna Ananthakrishnan, M.S. Thesis, North Carolina
State University, May 2013 (last known position: Riverbed Technologies, CA)
- "ScalaIOExtrap: Elastic I/O Tracing and Extrapolation"
by Xiaoqing Luo, M.S. Thesis, North Carolina State University, Jun 2015
(last known position: TBD)
- "Scalable Communication Tracing via Clustering"
by A. Bahmani, Ph.D. Thesis, North Carolina State
University, May 2017 (last known position: research staff, Stanford Univ., CA)
Other:
"This material is based upon work supported by the National Science Foundation under Grant No. 1217748."
"Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation."