ScalaTrace Overview

ScalaTrace is an MPI tracing toolset that produces communication traces orders of magnitude smaller, if not near-constant in size, regardless of the number of nodes, while preserving structural information. By combining intra- and inter-node compression of MPI events, the trace tool extracts an application's communication structure. A replay tool allows communication events recorded by the trace tool to be issued in an order-preserving manner without running the original application code.

The tool has been tested on BlueGene and x86_64 platforms with different MPI implementations so far. ScalaTrace may be used for communication tuning, procurement of future machines, and beyond. To the best of our knowledge, such a concise and scalable representation of MPI traces combined with deterministic MPI call replay is without precedent.

Detailed overview: Scalable Compression, Replay and Extrapolation of Communication and I/O Traces in Massively Parallel Environments

MPI Introduction:

First of all, here's a quick MPI tutorial with examples: MPI Tutorial. For more details about parallel computing, you can refer to this book online: Book
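As a quick taste of what the tutorial covers, here is a minimal MPI program. This is a standard hello-world sketch using only the core MPI API; it is not part of ScalaTrace.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    /* Initialize the MPI runtime before any other MPI call. */
    MPI_Init(&argc, &argv);

    /* Each process learns its rank and the total process count. */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello from rank %d of %d\n", rank, size);

    /* Clean shutdown; no MPI calls are allowed after this. */
    MPI_Finalize();
    return 0;
}
```

Compile it with your platform's MPI wrapper (mpicc, or mpixlc on BG/L) and run it with however many processes you like; each rank prints one line.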

Compiling MPI programs:

On most platforms you have mpicc, mpicxx, and mpif77/mpif90 as the C, C++, and Fortran compilers (many scientific benchmarks are written in Fortran, and the ScalaTrace framework is written in C/C++). These are usually wrappers around the GCC or Intel compilers. On BG/L we normally use the IBM compilers (though you can use the GCC compilers if you want). The corresponding compilers on BG/L are mpixlc, mpixlcxx, and mpixlf77 (which are wrappers around blrts_xl*).


$ mpixlc -o main main.c

Running MPI programs:

Many supercomputers and clusters have batch systems: you submit jobs to a queue, and the scheduler takes care of running them. The normal sequence is as follows:


$ cqsub -t <time> -n <nodes> ./main

(On BG/L you need to specify the expected run time in minutes; if the program is still running after that, it will be terminated.) This command queues your job for the scheduler to run. It also outputs a job id, e.g. 20495. Once the program terminates, you can check the stdout and stderr output in files named e.g. 20495.output and 20495.error. If your program segfaults and crashes, you'll also get one core file per segfaulting node, named core.n where n is the node id.

Check status of the queue
$ cqstat

Delete a job
$ cqdel <job-id>

Building ScalaTrace library:

To build the library :-
$ cd record
$ make

Three libraries are built by default in record/lib.
You might have to modify record/Makefile depending on which version(s) you want to compile. To change compilers, edit record/config/Makefile.config. To enable or disable timing deltas, edit record/libsrc/Makefile.libsrc.
The main library source is in record and common. Please read the README and BUILD files in record.

To build samples:
$ make test
You can select which samples to build by editing record/tests/Makefile. This step compiles a sample MPI program and links it with the library we built above.

Running samples:

Run them as you would run any MPI program. Once the program runs successfully, it generates a folder called recorded_ops_n, where n is the number of nodes you ran on. This folder contains trace files named 0, 1, ..., n-1. If you link with -lglob, there will be only one file, named 0. There is also a file called "times" which holds the running time information.
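A typical session might look like the following. This is a hypothetical sketch: the launcher (mpirun here), the node count, and the sample name all depend on your platform and on which tests you built.

```shell
# Run a sample on 4 nodes (substitute your platform's launcher
# and the sample binary you actually built).
$ mpirun -np 4 ./tests/sample

# Inspect the generated trace directory: one trace file per node
# (or a single file 0 if linked with -lglob), plus "times".
$ ls recorded_ops_4
```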

Reading trace files:

Reading trace files directly can be cumbersome. Instead, go to the rcat directory and run 'make rcat'. The rcat tool transforms a trace into a more readable format. See rcat -h for options (-p, -e, and to some extent -c are the ones I find most useful).
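For example, building rcat and pretty-printing one node's trace could look like this. The paths are hypothetical and assume a 4-node run whose output landed in recorded_ops_4; adjust them to your own output directory.

```shell
# Build the rcat tool.
$ cd rcat && make rcat

# Pretty-print the trace recorded by node 0 (see rcat -h for
# the full option list, e.g. -e and -c).
$ ./rcat -p ../recorded_ops_4/0
```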


Contributors:

Frank Mueller
Martin Schulz
Bronis de Supinski
Prasun Ratn
Todd Gamblin
Mike Noeth
Karthik Vijayakumar
Sandeep Budanur Ramanna