An Open Framework for Scalable, Reconfigurable Performance Analysis

Todd Gamblin¹, Prasun Ratn²,³, Bronis R. de Supinski³, Martin Schulz³, Frank Mueller², Robert J. Fowler¹, Daniel Reed¹

¹Renaissance Computing Institute, ²North Carolina State University, ³Lawrence Livermore National Laboratory

Problem description

Size of machines is rapidly increasing (130,000+ processors)
Tools will be overwhelmed with data
Need scalable, online measurement and analysis

ScalaTrace: Reconfigurable Scalable Performance Analysis


ScalaTrace compression framework provides:

ScalaReplay: Replay Using Histogram Timing Annotations


Figure: Bins generated for synthetic input span entire range with similar sample counts

Idea: preserve time in compressed traces

  • Encode time deltas instead of timestamps
  • Create delta histograms automatically
  • Dynamically balance histograms
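The three steps above can be sketched in a few lines of C. This is only an illustrative sketch, not ScalaTrace's actual data structure: the names `DeltaHistogram`, `hist_record`, and `MAX_BINS`, and the rule of merging the lightest adjacent pair of bins, are all assumptions made for the example. Each event stores the time since the previous event rather than its timestamp, bins are created as new delta values appear, and bins are merged whenever the budget is exceeded so that sample counts stay roughly even.

```c
#include <stddef.h>

#define MAX_BINS 4  /* hypothetical bin budget per histogram */

typedef struct {
    double lo, hi;        /* range of deltas covered by this bin */
    unsigned long count;  /* number of samples in this bin */
} Bin;

typedef struct {
    Bin bins[MAX_BINS + 1];  /* one spare slot, used just before a merge */
    size_t nbins;
    double last_timestamp;   /* previous event time, for delta encoding */
} DeltaHistogram;

void hist_init(DeltaHistogram *h, double start_time) {
    h->nbins = 0;
    h->last_timestamp = start_time;
}

/* Rebalance: merge the adjacent pair of bins with the smallest combined count. */
static void merge_lightest_pair(DeltaHistogram *h) {
    size_t best = 0;
    for (size_t i = 1; i + 1 < h->nbins; i++) {
        if (h->bins[i].count + h->bins[i + 1].count <
            h->bins[best].count + h->bins[best + 1].count)
            best = i;
    }
    h->bins[best].hi = h->bins[best + 1].hi;
    h->bins[best].count += h->bins[best + 1].count;
    for (size_t i = best + 1; i + 1 < h->nbins; i++)
        h->bins[i] = h->bins[i + 1];
    h->nbins--;
}

/* Add one delta: reuse the bin that contains it, or create a new bin in
   sorted position and merge if the budget is exceeded. */
static void hist_add_delta(DeltaHistogram *h, double delta) {
    size_t i = 0;
    while (i < h->nbins && h->bins[i].hi < delta)
        i++;
    if (i < h->nbins && h->bins[i].lo <= delta) {
        h->bins[i].count++;
        return;
    }
    for (size_t j = h->nbins; j > i; j--)
        h->bins[j] = h->bins[j - 1];
    h->bins[i].lo = delta;
    h->bins[i].hi = delta;
    h->bins[i].count = 1;
    h->nbins++;
    if (h->nbins > MAX_BINS)
        merge_lightest_pair(h);
}

/* Record an event by its time delta instead of its absolute timestamp. */
void hist_record(DeltaHistogram *h, double timestamp) {
    hist_add_delta(h, timestamp - h->last_timestamp);
    h->last_timestamp = timestamp;
}
```

With this scheme the bins always span the full observed range of deltas, and a real tool would likely use a larger bin budget than the four bins chosen here for readability.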

Number of histograms per record depends on the number of possible call paths

Path-sensitive histograms

  • Time depends on path taken
  • Distinguish histograms by path


MPI_Allreduce (..);
for (..) {
  for (..) {
    MPI_Send (..);
    MPI_Recv (..);
  }
  MPI_Barrier (..);
}
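To make the path-sensitivity concrete: assuming the inner loop closes after MPI_Recv and the outer loop after MPI_Barrier, MPI_Send can be reached from MPI_Allreduce (first iteration), from MPI_Recv (next inner iteration), or from MPI_Barrier (next outer iteration), so its record would carry three histograms. The sketch below keys per-record statistics by the preceding call site. The types `TraceRecord` and `PathStats`, the `OP_*` identifiers, and the use of a count and sum in place of a full histogram are all hypothetical simplifications, not ScalaTrace's actual record layout.

```c
#include <stddef.h>

#define MAX_PATHS 8  /* hypothetical cap on distinct predecessor paths */

/* Hypothetical IDs for the call sites in the loop nest above. */
enum { OP_ALLREDUCE, OP_SEND, OP_RECV, OP_BARRIER };

typedef struct {
    int prev_op;          /* call site that preceded this record */
    unsigned long count;  /* samples observed along this path */
    double total;         /* sum of time deltas along this path */
} PathStats;

typedef struct {
    int op;                       /* call site this record describes */
    PathStats paths[MAX_PATHS];   /* one entry per observed path */
    size_t npaths;
} TraceRecord;

/* Return the stats slot for the given predecessor, creating it on first
   use. Returns NULL if the record has no free slots left. */
PathStats *path_lookup(TraceRecord *r, int prev_op) {
    for (size_t i = 0; i < r->npaths; i++) {
        if (r->paths[i].prev_op == prev_op)
            return &r->paths[i];
    }
    if (r->npaths == MAX_PATHS)
        return NULL;
    PathStats *p = &r->paths[r->npaths++];
    p->prev_op = prev_op;
    p->count = 0;
    p->total = 0.0;
    return p;
}

/* Attribute one time delta to the path it arrived on.
   Returns 0 on success, -1 if the path table is full. */
int record_delta(TraceRecord *r, int prev_op, double delta) {
    PathStats *p = path_lookup(r, prev_op);
    if (p == NULL)
        return -1;
    p->count++;
    p->total += delta;
    return 0;
}
```

Keeping the paths separate means a long delta after a barrier does not blur the short deltas seen between back-to-back sends and receives.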

Sample bimodal distribution from UMT2k collectives

  • Histograms detect imbalances
  • Variable sizes capture variance

Trace sizes (NAS Benchmarks and UMT2k)


The benchmarks fall into three categories:

  • near-constant trace sizes, e.g. DT, EP, LU
  • sub-linear trace sizes, e.g. CG, MG, FT
  • non-scalable trace sizes, e.g. BT, IS, UMT2k

Replay Accuracy (NAS Benchmarks and UMT2k)


The benchmarks fall into three categories:

  • accurate replay: DT, EP, FT, LU, IS, UMT2k
  • replay inaccurate in MPI time: CG, MG
  • replay inaccurate in compute time: BT

Evolutionary Load-Balance Analysis with Scalable Data Collection


Idea: Normalize measurements and models based on application semantics

Progress loops

  • Typically outer loops in SPMD codes indicate absolute progress towards some domain-specific goal
  • Basis for comparison of load over time

Effort loops

  • Variable-time loops that represent load
  • Data-dependent execution

Progress instrumentation

  • User marks progress loop explicitly

Effort modeled with code regions

  • Dynamically detected at runtime
  • Split MPI-op trace at collectives and wait operations
  • User can further divide code into phases with instrumentation
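The splitting rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the framework's implementation: the `Event` type, the `EventKind` values, and the function `split_effort_regions` are hypothetical names, and each event is assumed to carry the time at which the MPI operation was entered. Within one progress-loop step, every collective or wait operation closes the current effort region and its elapsed time is recorded.

```c
#include <stddef.h>

/* Hypothetical classification of traced MPI operations. */
typedef enum { EV_SEND, EV_RECV, EV_WAIT, EV_COLLECTIVE } EventKind;

typedef struct {
    EventKind kind;
    double time;  /* time at which the operation was entered */
} Event;

/* Effort regions end at collectives and wait operations. */
static int is_split_point(EventKind k) {
    return k == EV_WAIT || k == EV_COLLECTIVE;
}

/* Split events[0..n) from one progress step into effort regions.
   `start` is the time at which the step began. The elapsed time of each
   closed region is written to effort[] (capacity max_regions); the return
   value is the number of regions written. Events after the last split
   point belong to a region that is still open and are not reported. */
size_t split_effort_regions(const Event *events, size_t n, double start,
                            double *effort, size_t max_regions) {
    size_t nregions = 0;
    double region_start = start;
    for (size_t i = 0; i < n && nregions < max_regions; i++) {
        if (is_split_point(events[i].kind)) {
            effort[nregions++] = events[i].time - region_start;
            region_start = events[i].time;
        }
    }
    return nregions;
}
```

Calling this once per progress-loop iteration yields one effort value per region per step, which gives a basis for comparing load across processes and over time.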

Load Balance in ParaDiS

Models dislocation dynamics in crystals

  • Dislocations discretized as nodes and arms
  • Recursive spatial domain decomposition
  • Balancer subdivides nodes/arms along x, then y, then z
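A rough sketch of an x, then y, then z decomposition is shown below. It is not ParaDiS code: the `Node` type, the `decompose_xyz` function, and the choice to balance by equal node counts (as a stand-in for measured load) are assumptions made for illustration. The nodes are sorted along x and cut into slabs, each slab is sorted along y and cut into columns, and each column is sorted along z and cut into the final domains.

```c
#include <stdlib.h>

typedef struct {
    double pos[3];  /* x, y, z coordinates */
    int domain;     /* domain id assigned by the decomposition */
} Node;

/* Axis used by the comparator; qsort takes no context argument. */
static int cmp_axis;

static int cmp_nodes(const void *a, const void *b) {
    double da = ((const Node *)a)->pos[cmp_axis];
    double db = ((const Node *)b)->pos[cmp_axis];
    return (da > db) - (da < db);
}

/* Recursively split nodes[0..n) along `axis`, then along the remaining
   axes. parts[a] is the number of pieces along axis a. Domain ids are
   assigned in row-major (x, y, z) order starting at `base`. */
static void decompose(Node *nodes, size_t n, int axis,
                      const int parts[3], int base) {
    if (axis == 3) {
        for (size_t i = 0; i < n; i++)
            nodes[i].domain = base;
        return;
    }
    cmp_axis = axis;
    qsort(nodes, n, sizeof *nodes, cmp_nodes);

    /* Number of domain ids covered by one piece along this axis. */
    int stride = 1;
    for (int a = axis + 1; a < 3; a++)
        stride *= parts[a];

    for (int p = 0; p < parts[axis]; p++) {
        /* Equal-count cuts: an assumed proxy for balanced load. */
        size_t lo = n * (size_t)p / (size_t)parts[axis];
        size_t hi = n * (size_t)(p + 1) / (size_t)parts[axis];
        decompose(nodes + lo, hi - lo, axis + 1, parts, base + p * stride);
    }
}

/* Entry point: subdivide along x, then y, then z. */
void decompose_xyz(Node *nodes, size_t n, const int parts[3]) {
    decompose(nodes, n, 0, parts, 0);
}
```

Note that the node array is reordered in place, so callers that need the original ordering would have to keep an index alongside each node.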

Future Directions

Flexible framework for application-specific tools

Near term
