ScalaJack: Scalable Trace-Based Tools for In-Situ Data Analysis of HPC Applications

Production codes on supercomputers are struggling to remain scalable each time the processor core count increases by a factor of 10, even though they run efficiently at smaller scale. But root cause diagnosis fails at petascale since (1) symptoms of performance problems can be subtle, (2) only few metrics can be efficiently collected and (3) tools can only feasibly record a small subset of even these metrics.

This work addresses these problems by creating a framework that allows application developers to focus on data analysis that drives customized data extraction combined with on-the-fly analysis specifically geared to their individual problems. This is accomplished by combining trace analysis and in-situ data analysis techniques at runtime, thereby lifting data reduction to a new level where it IS analysis. With this approach, modular measurement and analysis components are combined to selectively extract representative data from production codes in a problem-specific manner, which enables root cause analysis.

The work demonstrates the feasibility of customized data extraction and analysis at scale for root cause analysis on current and forthcoming multi-petascale supercomputers. It thus contributes to sustain scalable scientific computing into the future up to the largest scales. Results of this work will be contributed as open-source code to the research community and beyond as done, allowing other groups to not only build tools on top of our framework but also contribute their own components.

Publications: Theses: Other:
"This material is based upon work supported by the National Science Foundation under Grant No. 1217748."

"Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation."