Debugging MPI Applications
Throughout this project, we did a great deal of debugging
with the venerable printf family of functions. This method
is tedious and involves considerable guesswork, but it was
necessary because our cluster does not have a distributed
debugger available. We found a partial workaround:
Debugging MPI Applications with GDB
GDB is not suitable for debugging distributed applications in
the usual way (launching the application under GDB). Instead,
the debugger must attach to an already running process, so that
proper initialization can take place and the real processes can
be inspected. Unfortunately, this requires one instance of GDB
per task to be debugged. The general process is as follows:
-
(Optional) Instrument the code: either in the MPI_Init_post
wrapper or in the program itself, add a line similar to the following:
printf("Rank=%d,pid=%d\n", my_rank, getpid());
Printing the hostname as well (via gethostname()) is
useful when running on multiple nodes.
-
Further, instrument the code with a call to the sleep()
function for a short time (60 seconds, for example), long
enough to attach the debugger to the process.
-
The easiest way to proceed is to use the -machinefile
argument to mpirun to confine all of the tasks to
a single host; this makes the processes easier to find
and attach to.
-
Once the task is running (and has been initialized), it will sleep
because of the instrumentation. Before the sleep ends, start one
GDB instance for that task and run the following at its prompt:
(gdb) attach <pid>
(add breakpoints, etc)
(gdb) continue
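
The two instrumentation steps above can be combined into one small helper. What follows is a minimal sketch, not the code we actually used: the names format_task_banner and pause_for_debugger are hypothetical, and the rank is passed in as a plain integer (as obtained from MPI_Comm_rank) so that the helper itself has no MPI dependency. It prints the rank, pid, and hostname, then sleeps so that GDB can be attached.

```c
#define _POSIX_C_SOURCE 200112L  /* exposes gethostname() under strict -std modes */

#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write "Rank=<r>,pid=<p>,host=<h>" into buf.
 * Returns the value of snprintf(), i.e. the length the full line needs. */
static int format_task_banner(char *buf, size_t len, int rank)
{
    char host[256];

    assert(buf != NULL && len > 0);

    /* Fall back to a placeholder if the hostname cannot be read. */
    if (gethostname(host, sizeof host) != 0) {
        strcpy(host, "unknown");
    }
    host[sizeof host - 1] = '\0';  /* gethostname() may not terminate on truncation */

    return snprintf(buf, len, "Rank=%d,pid=%ld,host=%s",
                    rank, (long)getpid(), host);
}

/* Hypothetical helper: announce this task, then sleep for `seconds`
 * so that a debugger can be attached with "attach <pid>". */
static void pause_for_debugger(int rank, unsigned int seconds)
{
    char banner[512];

    format_task_banner(banner, sizeof banner, rank);
    printf("%s\n", banner);
    fflush(stdout);  /* make sure the pid is visible before sleeping */
    sleep(seconds);
}
```

In an MPI program, pause_for_debugger(my_rank, 60) would be called once, after MPI_Init and MPI_Comm_rank have set my_rank. The explicit fflush matters because output redirected through mpirun is often buffered, and the pid is of no use if it only appears after the sleep has ended.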