ScalaTrace
ScalaTrace Overview
ScalaTrace is an MPI tracing toolset that produces communication traces that are
orders of magnitude smaller, often near-constant in size, regardless of the
number of nodes, while preserving structural information. Combining
intra- and inter-node compression techniques for MPI events,
the trace tool extracts an application's communication structure.
A replay tool allows communication events recorded by our
trace tool to be issued in an order-preserving manner without running
the original application code.
The tool has been tested so far on BlueGene and x86_64 platforms with
different MPI implementations. ScalaTrace may be used for
communication tuning, procurement of future machines, and beyond. To the
best of our knowledge, such a concise, scalable representation of MPI traces
combined with deterministic MPI call replay is without precedent.
Detailed overview:
Scalable Compression, Replay and
Extrapolation of Communication and I/O Traces in Massively Parallel
Environments
MPI Introduction:
First of all, here's a quick MPI tutorial with examples:
MPI Tutorial
For more details about parallel computing, you can refer to this book online:
Book
Compiling MPI programs:
On most platforms you have mpicc, mpicxx and mpif77/mpif90 as the C, C++ and Fortran compilers (many
scientific benchmarks are written in Fortran, and the ScalaTrace framework is written in C/C++). These are
usually wrappers around the GCC or Intel compilers.
On BG/L we normally use the IBM compilers (though you can use the GCC compilers if you want). The corresponding
compilers on BG/L are mpixlc, mpixlcxx and mpixlf77 (wrappers around blrts_xl*).
Compile:
$ mpixlc -o main main.c
Running MPI programs:
Many supercomputers and clusters have batch systems: you submit jobs to a queue and the scheduler takes care of
running them. The normal sequence is as follows:
Run:
$ cqsub -t <time> -n <nodes> ./main
(On BG/L you need to specify the expected run time in minutes; if the program is still running after that time, it is terminated.)
This command queues your job for the scheduler to run. It also outputs a job ID, e.g. 20495.
Once the program terminates, you can check the stdout and stderr output in files named, e.g., 20495.output and 20495.error.
If your program segfaults and crashes, you will also get one core file per segfaulting node, named core.n, where n is the node ID.
Check the status of the queue:
$ cqstat
Delete a job:
$ cqdel <job-id>
Building ScalaTrace library:
To build the library:
$ cd record
$ make
Three libraries are built by default in record/lib:
- libdump.a (flat traces, no compression)
- libnode.a (intra-node compression only)
- libglob.a (global compression)
You might have to modify record/Makefile depending on which version(s) you want to compile.
To change compilers, edit record/config/Makefile.config.
To enable/disable timing deltas, edit record/libsrc/Makefile.libsrc.
The main library source is in record and common. Please read the README and BUILD
files in record.
To build samples:
$ make test
You can select which samples to build by editing
record/tests/Makefile.
This step compiles a sample MPI program and links it with the library we built above.
Running samples:
Run them as you would run any MPI program. Once the program runs successfully, it
generates a folder called recorded_ops_n, where n is the number of nodes you ran on.
This folder contains trace files named 0, 1, ..., n-1. If you link with -lglob, there will be only
one file, named 0. There is also a file called "times" with the running time information.
Reading trace files:
Reading trace files directly can be cumbersome.
Go to the rcat directory and run 'make rcat'.
The rcat tool transforms a trace into a more readable format.
See rcat -h for options (-p, -e, and to some extent -c
are the ones I find most useful).
People
Frank Mueller
Martin Schulz
Bronis de Supinski
Prasun Ratn
Todd Gamblin
Mike Noeth
Karthik Vijayakumar
Sandeep Budanur Ramanna
- ScalaTrace V4 (adds several clustering options)
- ScalaTrace V3 (adds
extrapolation of MPI, MPI-IO and POSIX I/O traces; further improves MPI-IO and POSIX I/O support)
- ScalaTrace V2.2 (adds elastic
compression; redesigns MPI-IO and POSIX I/O tracing)
- ScalaTrace V0.5 (adds
MPI-IO and POSIX I/O tracing with lossless and lossy/histogram recordings)
- ScalaMem V0.1 (Scalable
record and replay framework for memory references under x86 with PIN)
Publications
- ScalaJack: Customized Scalable Tracing with in-situ Data Analysis by S. Ananthakrishnan, F. Mueller in
Euro-Par Conference, Aug 2014 (accepted).
-
Scalable Tracing of MPI Programs through Signature-Based Clustering Algorithms
by A. Bahmani, F. Mueller
in International Conference on Supercomputing, Jun 2014 (accepted).
-
Elastic and Scalable Tracing and Accurate Replay of Non-Deterministic Events
by X. Wu, F. Mueller
in International Conference on Supercomputing, Jun 2013 (accepted).
-
"ScalaBenchGen:
Auto-Generation of Communication Benchmark Traces"
by X. Wu, V. Deshpande, F. Mueller, in
International Parallel and Distributed Processing Symposium, May 2012
DOI 10.1109/IPDPS.2012.114.
-
"Probabilistic Communication and I/O Tracing with Deterministic Replay
at Scale" by Xing Wu, Karthik Vijayakumar, Frank
Mueller, Xiaosong Ma, Philip C. Roth in International Conference
on Parallel Processing, Sep 2011, pages 196-205.
-
Automatic Generation of Executable Communication Specifications from Parallel Applications
by X. Wu, F. Mueller, S. Pakin
in International Conference on Supercomputing, Jun 2011, pages 12-21.
-
"ScalaExtrap: Trace-Based Communication Extrapolation for SPMD Programs"
by X. Wu, F. Mueller
in ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming, Feb 2011, pages 113-122.
-
"ScalaTrace: Tracing, Analysis and Modeling of HPC Codes at Scale"
by F. Mueller, X. Wu, M. Schulz, B. de
Supinski, T. Gamblin in Para 2010: State of the Art in
Scientific and Parallel Computing (invited), Springer LNCS 7133,
eds. K. Jonasson, Jun 2010, pages 410-418.
-
"ScalaTrace: Scalable
Compression and Replay of Communication Traces in High Performance Computing"
by M. Noeth and P. Ratn and F. Mueller and M. Schulz and B. de
Supinski,
Journal of Parallel and Distributed Computing, V ?, No ?, accepted Sep
2008, pages ???.
-
"Preserving Time in Large-Scale Communication Traces"
by P. Ratn and F. Mueller and M. Schulz and B. de
Supinski
in International Conference on Supercomputing, Jun 2008, pages 46-55.
-
"Scalable Compression and Replay of Communication Traces in Massively Parallel Environments"
by M. Noeth and F. Mueller and M. Schulz and B. de
Supinski in International Parallel and Distributed Processing Symposium, Mar
2007, Best Paper Award.
-
"An Open Infrastructure for Scalable, Reconfigurable Analysis"
by B. de Supinski and R. Fowler and T. Gamblin
and F. Mueller and P. Ratn and M. Schulz
in International Workshop on
Scalable Tools for High-End Computing, Jun 2008, pages 39-50.
-
"An Open Framework for Scalable, Reconfigurable Performance
Analysis" by T. Gamblin, P. Ratn, B. de
Supinski, M. Schulz, F. Mueller, R. Fowler and D. Reed, refereed poster
at Supercomputing, Nov 2007.
-
"Scalable Compression and Replay of Communication Traces in Massively Parallel Environments"
by M. Noeth and F. Mueller and M. Schulz and B. de
Supinski, refereed poster at Supercomputing, Nov 2006.