Alex Balik and Tristan Ravitch
In massively parallel applications, communication is often the cause of poor scalability. To better understand, and ideally improve, the communication patterns used in parallel applications, various tools such as mpiP and Vampir have been developed to record MPI usage. Unfortunately, these tools generally suffer from one of two problems: either they record only aggregate statistics, losing the per-call detail needed to reconstruct communication patterns, or they record complete traces whose size grows with the number of nodes and the length of the run.
In an attempt to get the benefits of both types of MPI tools, Noeth et al. have developed a tool that losslessly compresses MPI traces into a single file whose size is ideally constant, or nearly constant, regardless of the number of nodes [1]. This is achieved using both intra-node and inter-node compression techniques. Intra-node compression uses regular section descriptors (RSDs) to represent repeated sequences of MPI calls (due to loops) in constant size. Stencil identification is also used to compress sequences of MPI calls that communicate in a fixed pattern (for example, in a 2D layout a node might repeatedly communicate with its neighbors to the north, south, east, and west). Inter-node compression takes all the trace files generated during program execution and compresses them down to a single file, grouping common MPI calls along the way.
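To make the intra-node idea concrete, the following is a minimal sketch of how an RSD lets a repeated loop body be recorded in constant space. The `mpi_event` and `rsd` types here are invented for illustration and are not the tool's actual data structures.

```c
#include <stdio.h>

/* A single recorded MPI event (hypothetical, simplified). */
typedef struct {
    const char *op;   /* e.g. "MPI_Send" */
    int peer;         /* rank communicated with, stored as an offset */
    int count;        /* number of elements transferred */
} mpi_event;

/* A regular section descriptor: "the following sequence of events
 * repeats `iterations` times", stored in constant space no matter
 * how many iterations the loop actually ran. */
typedef struct {
    int iterations;
    int num_events;
    mpi_event events[8];  /* fixed-size loop body for this sketch */
} rsd;

int main(void) {
    /* A loop that sends to rank+1 and receives from rank-1 a
     * thousand times would naively produce 2000 trace records... */
    rsd compressed = {
        .iterations = 1000,
        .num_events = 2,
        .events = {
            { "MPI_Send", +1, 1024 },
            { "MPI_Recv", -1, 1024 },
        },
    };

    /* ...but the RSD stores only the loop body and the repeat count. */
    printf("RSD: %d iterations of %d events\n",
           compressed.iterations, compressed.num_events);
    return 0;
}
```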
The benchmarks in the NAS Parallel Benchmark suite fall into three groups according to how well the MPI trace compression utility performs on them:
The poor scaling affects both the size of the trace output and the time required to write the traces out; both of these will need to be addressed. The effects of these scaling anomalies also show up in both task-level and cross-node compression.
The fact that there are two distinct groups of performance anomalies suggests that at least two different problems are at work (or, in the best case, that the use of one particular idiom is aggravated by some unusual pattern in CG, BT, and FT). An initial guess is that there is a common problem shared by both groups, and that the super-linear group has additional odd usage patterns on top of it.
Another major problem is that the NAS code is all in Fortran.
There are four immediately obvious vectors from which to approach the problem:
We will have to become familiar enough with Fortran to at least be able to examine loop constructs and MPI calls (and to determine where those MPI calls point).
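As a concrete example of the kind of pattern we will be looking for, the sketch below shows a periodic 2D halo exchange in which each rank's four neighbors are computed purely from its own rank; this is exactly the shape of communication that stencil identification targets. It is written in C for brevity (the benchmarks themselves are Fortran), and the routine name and parameters are illustrative rather than taken from the NAS code.

```c
#include <mpi.h>

/* Each rank exchanges a halo face of n doubles with its four
 * neighbours in a px-by-py Cartesian layout with wrap-around
 * boundaries.  `send` and `recv` each hold 4*n doubles (one face
 * per neighbour).  Because every peer rank is a pure function of
 * the caller's rank, the resulting trace is stencil-compressible. */
void exchange_halo(double *send, double *recv, int n,
                   int rank, int px, int py)
{
    int x = rank % px, y = rank / px;
    int neighbours[4] = {
        ((x + 1) % px) + y * px,        /* east  */
        ((x - 1 + px) % px) + y * px,   /* west  */
        x + ((y + 1) % py) * px,        /* north */
        x + ((y - 1 + py) % py) * px,   /* south */
    };

    MPI_Request reqs[8];
    for (int i = 0; i < 4; i++) {
        MPI_Irecv(recv + i * n, n, MPI_DOUBLE, neighbours[i], 0,
                  MPI_COMM_WORLD, &reqs[i]);
        MPI_Isend(send + i * n, n, MPI_DOUBLE, neighbours[i], 0,
                  MPI_COMM_WORLD, &reqs[4 + i]);
    }
    MPI_Waitall(8, reqs, MPI_STATUSES_IGNORE);
}
```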
Week | Activity | Summary |
---|---|---|
1 | Goal - Become familiar with the utility and benchmark code, and identify areas for improvement. | We primarily tackled linking problems and identified probable areas that will need to be modified after we manage to acquire traces for our two benchmarks. We have two separate paths of inquiry in addressing these linking errors. We managed to fix the Fortran linking through the process described in the first link above. |
2 | Goal - Determine whether CG and LU can be improved with changes to the stencil code alone. | The last fix we found allowed us to get production traces from all of the benchmarks; debug traces still seemed problematic (the benchmarks would complete but segfault at the end, before outputting anything). With a very limited window, the traces sometimes completed but were less than informative (due to the limited window size). The first result below fixes some of this limitation and allows us to get partial debug traces (task-level compressed only). Coming Soon |
3 | Goal - Determine if changes to the task-level compression can improve performance on CG and LU. | Coming Soon |
4 | Goal - Determine if changes to the cross-node compression can improve performance on CG and LU. | Coming Soon |