Alex Balik and Tristan Ravitch
In massively parallel applications, communication is often the cause of poor scalability. To better understand, and ideally improve, the communication patterns used in parallel applications, various tools such as mpiP and Vampir have been developed to record MPI usage. Unfortunately, these tools generally suffer from one of two problems: either they record only aggregate statistics, losing the per-call detail needed to reconstruct communication patterns, or they record complete traces whose size grows with the number of nodes and the length of the run.
In an attempt to get the benefits of both types of MPI tools, Noeth et al. have developed a tool that losslessly compresses MPI traces into a single file whose size is ideally constant, or nearly constant, regardless of the number of nodes [1]. This is achieved using both intra-node and inter-node compression techniques. Intra-node compression uses regular section descriptors (RSDs) to represent repeated sequences of MPI calls (due to loops) in constant size. Stencil identification is also used to compress sequences of MPI calls that communicate in a fixed pattern (for example, in a 2D layout a node might repeatedly communicate with its neighbors to the north, south, east, and west). Inter-node compression takes all the trace files generated during program execution and compresses them down to a single file, grouping common MPI calls along the way.
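To make the intra-node idea concrete, the following is a minimal sketch of how an RSD lets a repeated loop body be recorded in constant space. The `mpi_event` and `rsd` types here are invented for illustration and are not the tool's actual data structures.

```c
#include <stdio.h>

/* A single recorded MPI event (hypothetical, simplified). */
typedef struct {
    const char *op;   /* e.g. "MPI_Send" */
    int peer;         /* rank communicated with, stored as an offset */
    int count;        /* number of elements transferred */
} mpi_event;

/* A regular section descriptor: "the following sequence of events
 * repeats `iterations` times", stored in constant space no matter
 * how many iterations the loop actually ran. */
typedef struct {
    int iterations;
    int num_events;
    mpi_event events[8];  /* fixed-size loop body for this sketch */
} rsd;

int main(void) {
    /* A loop that sends to rank+1 and receives from rank-1 a
     * thousand times would naively produce 2000 trace records... */
    rsd compressed = {
        .iterations = 1000,
        .num_events = 2,
        .events = {
            { "MPI_Send", +1, 1024 },
            { "MPI_Recv", -1, 1024 },
        },
    };

    /* ...but the RSD stores only the loop body and the repeat count. */
    printf("RSD: %d iterations of %d events\n",
           compressed.iterations, compressed.num_events);
    return 0;
}
```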
The benchmarks in the NAS Parallel Benchmark suite fall into three groups according to how well the MPI trace compression utility performs on them:
The poor scaling affects both the size of the trace output and the time required to write the traces out; both of these will need to be addressed. The effects of these scaling anomalies also show up in both task-level and cross-node compression.
The fact that there are two distinct groups of performance anomalies suggests that at least two different problems are at work (or, in the best case, that the use of one particular idiom is aggravated by some unusual pattern in CG, BT, and FT). An initial guess is that there is a common problem shared by both groups, and that the super-linear group has additional odd usage patterns on top of it.
Another major problem is that the NAS code is all in Fortran.
There are four immediately obvious vectors from which to approach the problem:
We will have to become familiar enough with Fortran to at least be able to examine loop constructs and MPI calls (and to determine where those MPI calls point).
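As a concrete example of the kind of pattern we will be looking for, the sketch below shows a periodic 2D halo exchange in which each rank's four neighbors are computed purely from its own rank; this is exactly the shape of communication that stencil identification targets. It is written in C for brevity (the benchmarks themselves are Fortran), and the routine name and parameters are illustrative rather than taken from the NAS code.

```c
#include <mpi.h>

/* Each rank exchanges a halo face of n doubles with its four
 * neighbours in a px-by-py Cartesian layout with wrap-around
 * boundaries.  `send` and `recv` each hold 4*n doubles (one face
 * per neighbour).  Because every peer rank is a pure function of
 * the caller's rank, the resulting trace is stencil-compressible. */
void exchange_halo(double *send, double *recv, int n,
                   int rank, int px, int py)
{
    int x = rank % px, y = rank / px;
    int neighbours[4] = {
        ((x + 1) % px) + y * px,        /* east  */
        ((x - 1 + px) % px) + y * px,   /* west  */
        x + ((y + 1) % py) * px,        /* north */
        x + ((y - 1 + py) % py) * px,   /* south */
    };

    MPI_Request reqs[8];
    for (int i = 0; i < 4; i++) {
        MPI_Irecv(recv + i * n, n, MPI_DOUBLE, neighbours[i], 0,
                  MPI_COMM_WORLD, &reqs[i]);
        MPI_Isend(send + i * n, n, MPI_DOUBLE, neighbours[i], 0,
                  MPI_COMM_WORLD, &reqs[4 + i]);
    }
    MPI_Waitall(8, reqs, MPI_STATUSES_IGNORE);
}
```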
Week | Activity | Summary |
---|---|---|
1 | Goal - Become familiar with the utility and benchmark code, and identify areas for improvement. | We primarily tackled linking problems and identified probable areas that will need to be modified after we manage to acquire traces for our two benchmarks. We have two separate paths of inquiry in addressing these linking errors. We managed to fix the Fortran linking through the process described in the first link above. |
2 | Goal - Determine whether CG and LU can be improved with changes to the stencil code alone. | The last fix we found allowed us to get production traces from all of the benchmarks; debug traces still seemed problematic (the benchmarks would complete but segfault at the end, before outputting anything). With a very limited window, the traces sometimes completed but were less than informative (due to the limited window size). The first result below fixes some of this limitation and allows us to get partial debug traces (task-level compressed only). Coming Soon |
3 | Goal - Determine if changes to the task-level compression can improve performance on CG and LU. | Coming Soon |
4 | Goal - Determine if changes to the cross-node compression can improve performance on CG and LU. | Coming Soon |