In this path of inquiry, we use the nm tool to examine the symbols present in our improperly linked executables. As baselines from two different directions, we consider:
Initial observations suggest that the Fortran programs (from both the NAS Parallel Benchmarks and the hellof example) are not linking properly against the umpire library. Comparing the symbol dumps of hello.exe and hellof.exe bears this out. In the C version (hello.exe), the executable contains six "sets" of MPI symbols: five from the umpire wrapper library (all of which resolve to file/line numbers in the umpire sources) and one set from PMPI, the profiling layer provided by the MPI implementation (as expected). The Fortran version, on the other hand, contains only the symbols from the default (non-profiling) MPI implementation.
This seems to indicate that the linking process for the Fortran programs skips all of the symbols in the umpire library. Fortran seems to expect (from experiments with mpiP) a trailing underscore after every exported symbol name, while the umpire library provides only bare symbols. The first attempt at a fix was to patch wrapper-engine/mpi_wrappers.c at line 258:
    --- wrapper-engine/mpi_wrappers.c 2006-11-06 21:33:24.000000000 -0500
    +++ /home/travitc/tmp/mpi_wrappers.c 2006-11-06 21:55:34.768025568 -0500
    @@ -255,7 +255,7 @@
     }
     lcname[i] = '\0';
    - fprintf (wrapperFile, "\nextern void\n%s( ", lcname);
    + fprintf (wrapperFile, "\nextern void\n%s_( ", lcname);
     /* the only function whose parameters differ from the c version */
     if(!strcmp(lcname, "mpi_init")) {
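The naming rule that the patch applies can be sketched in isolation. The helper below is not part of umpire; `fortran_symbol_name` is a hypothetical name used only for illustration. It assumes the single-trailing-underscore convention seen in the symbol dumps here; other Fortran compilers or compiler flags may mangle names differently.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from the umpire sources): write the lowercase,
 * trailing-underscore form of an MPI routine name into dst.
 * Returns 0 on success, -1 if dst is too small.
 */
int fortran_symbol_name(const char *c_name, char *dst, size_t dst_size)
{
    assert(c_name != NULL && dst != NULL);

    size_t len = strlen(c_name);
    if (len + 2 > dst_size)        /* name + '_' + terminating '\0' */
        return -1;

    for (size_t i = 0; i < len; i++)
        dst[i] = (char)tolower((unsigned char)c_name[i]);

    dst[len] = '_';                /* the suffix the Fortran linker looks for */
    dst[len + 1] = '\0';
    return 0;
}
```

Given "MPI_Init" and a large enough buffer, this produces "mpi_init_", which matches the mpi_*_ family of symbols that the Fortran objects reference.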
With this change, libumpire.a provides these symbols (note the trailing underscores), and hellof.exe gives this symbol dump; all of the expected symbols are present in the Fortran executable (the PMPI profiling layer from the MPI implementation is linked in, and there are several sets of symbols from the umpire library). When running the resulting executable, however, the only output is:
    Initializing
    Initializing
    ** Address Error **
    End of diagnostics
    ** Address Error **
    End of diagnostics
The most direct analysis we can do is to compare the sources of each "set" of MPI symbols in the original (and working) hello.exe to the symbols from the two versions of hellof.exe:
Symbol | Source in hello.exe | Source in hellof.exe (Original) | Source in hellof.exe (Modified) |
---|---|---|---|
MPID_* | MPICH | MPICH | MPICH |
MPIR_* | MPICH | MPICH | MPICH |
MPI_* | umpire | MPICH | umpire |
PMPI_* | MPICH | Not Present | MPICH |
gmpi_* | MPICH | MPICH | MPICH |
gwrap_MPI_* | umpire | Not Present | umpire |
mpi_*_ | Not Present | MPICH | umpire |
umpi_* | umpire | Not Present | umpire |
Note that we know a symbol comes from MPICH if nm -l cannot resolve a file name and line number for it (MPICH does not provide debugging symbols in the default linking configuration, whereas the test umpire libraries do). This confirms our original suspicions about the linking process; however, the "Address Error" is still an issue. Examining the differences between the symbol dumps of libmpiP.a and the modified libumpire.a is not particularly enlightening: both provide a number of their own internal symbols plus the public MPI_* and mpi_*_ interfaces, allowing the PMPI_* symbols to be resolved at link time against the MPICH profiling library.
Since the symbol dumps of the working C version and the non-working Fortran version of the hello program are so similar, and since there are no conspicuous linker errors or warnings, the only avenues of inquiry that really spring to mind are:
We finally tracked down the real problem: the wrapper being generated for mpi_init_ was passing an integer pointer to the PMPI_Init function. The Fortran binding of MPI_Init takes only a single integer error argument, whereas the C PMPI_Init expects pointers to argc and argv, so the MPI setup steps failed and the executables aborted. The basic fix is to add a function to the umpire library that parses the executable's command line parameters (I chose to put it in libsrc/util.c) and to expose a few symbols to the mpi_init_ wrapper function. The change to the wrapper must happen in the wrapper generator. A full listing of changes in unified diff format is available. The basic idea of the command line parsing function is demonstrated here; we read through /proc/self/cmdline and parse it into an array of strings, exposing the resulting table through an externally linkable symbol.
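As a rough illustration of that idea, and not the code from the actual patch, the following sketch reads /proc/self/cmdline (Linux-specific), splits the NUL-separated arguments into a table, and publishes it through two global symbols. The names `umpi_saved_argc`, `umpi_saved_argv`, `split_cmdline`, and `load_proc_cmdline` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical externally linkable symbols for the mpi_init_ wrapper. */
int    umpi_saved_argc = 0;
char **umpi_saved_argv = NULL;

/*
 * Split a buffer of NUL-terminated strings (the /proc/self/cmdline layout)
 * into a NULL-terminated argv table whose entries point into buf.
 * The buffer must end with '\0'. Returns argc, or -1 on error.
 */
int split_cmdline(char *buf, size_t len, char ***argv_out)
{
    assert(buf != NULL && argv_out != NULL);
    if (len == 0 || buf[len - 1] != '\0')
        return -1;

    int argc = 0;
    for (size_t i = 0; i < len; i++)
        if (buf[i] == '\0')
            argc++;                      /* one terminator per argument */

    char **argv = malloc(((size_t)argc + 1) * sizeof *argv);
    if (argv == NULL)
        return -1;

    size_t pos = 0;
    for (int n = 0; n < argc; n++) {
        argv[n] = buf + pos;
        pos += strlen(buf + pos) + 1;    /* skip the argument and its '\0' */
    }
    argv[argc] = NULL;
    *argv_out = argv;
    return argc;
}

/* Read /proc/self/cmdline and fill the saved table. Returns 0 or -1. */
int load_proc_cmdline(void)
{
    FILE *fp = fopen("/proc/self/cmdline", "rb");
    if (fp == NULL)
        return -1;

    size_t cap = 4096, len = 0, n;
    char *buf = malloc(cap + 1);         /* +1 leaves room for a final '\0' */
    if (buf == NULL) { fclose(fp); return -1; }

    while ((n = fread(buf + len, 1, cap - len, fp)) > 0) {
        len += n;
        if (len == cap) {                /* buffer full: grow and keep reading */
            char *tmp = realloc(buf, cap * 2 + 1);
            if (tmp == NULL) { free(buf); fclose(fp); return -1; }
            buf = tmp;
            cap *= 2;
        }
    }
    fclose(fp);

    if (len == 0) { free(buf); return -1; }
    if (buf[len - 1] != '\0')
        buf[len++] = '\0';               /* make sure the last argument ends */

    umpi_saved_argc = split_cmdline(buf, len, &umpi_saved_argv);
    if (umpi_saved_argc < 0) { free(buf); return -1; }
    return 0;                            /* buf stays alive: argv points into it */
}
```

With a table like this available, a generated mpi_init_ wrapper could call PMPI_Init(&umpi_saved_argc, &umpi_saved_argv) and store the return code in the Fortran error argument, rather than forwarding that integer pointer as if it were argc.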
This line of reasoning became apparent after the experience with the f2c utility and attempts to work with the differing signatures of MPI_Init in the conversion from Fortran to C. The same type of address error arose when trying to get away with passing junk to MPI_Init and the same eventual solution was reached.
This side project was successful and we can now generate traces for the Fortran benchmarks. This patch can be applied at the root of the record distribution with the command:
patch -p1 < /path/to/patch