We have begun working through the issues related to the Fortran benchmarks, though several remain unresolved.
We do not believe we have ever obtained cross-node compression traces (debug or otherwise) from any of the Fortran benchmarks (though IS generally seems to work).
We have traced at least part of the problem to the instrumentation of the MPI_Finalize routine; specifically, the crash appears to occur in transfer.c:201:void flat_addr(void *fqueue_offsets). This crash in MPI_Finalize seems to prevent full cross-node traces from completing.
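For context, the sketch below shows the general PMPI-style shape such instrumentation usually takes and where a post-processing routine like flat_addr would run during finalization. This is not the tool's actual code: apart from flat_addr and the standard MPI calls, every name here is a placeholder of ours.

    /* Hedged sketch of PMPI-style interception of MPI_Finalize; not the
     * actual umpire code.  flat_addr is the routine named above
     * (transfer.c:201); trace_queue_offsets is a hypothetical stand-in
     * for however the tool locates its per-task queue-offset table. */
    #include <mpi.h>
    #include <stddef.h>

    extern void flat_addr(void *fqueue_offsets);  /* real routine, transfer.c:201 */

    /* Hypothetical accessor, stubbed so the sketch is self-contained. */
    static void *trace_queue_offsets(void) { return NULL; }

    int MPI_Finalize(void)
    {
        /* Flatten queue offsets into absolute addresses before the final
         * cross-node merge.  If this table was never set up on some task
         * (or was already freed), this is a plausible site for the crash
         * described above. */
        void *offsets = trace_queue_offsets();
        if (offsets != NULL)
            flat_addr(offsets);

        return PMPI_Finalize();
    }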
While we did solve the problem with the incorrect Fortran linking of the umpire library, some Fortran-specific problems still seem to remain (they may also be affecting IS and simply manifesting in a different way).
In particular, none of the Fortran benchmarks runs to completion, as mentioned above, due to an apparent crash in MPI_Finalize. This is true even of the benchmarks (such as EP) that exhibited acceptable compression profiles in the original research. In IS, this may manifest as a hang (rather than a crash) after it prints its final two-line summary, though that may also be a separate issue.
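One way a linking or wrapping problem of this kind typically enters the Fortran path is through the name-mangled Fortran entry points: if the Fortran symbols resolve straight to the MPI library rather than through the tool's wrappers, the finalize-time instrumentation either never runs for Fortran programs or runs against state it never initialized. The sketch below only illustrates that mechanism under our assumptions (gfortran-style mangling, the usual ierr convention); it is not the umpire code.

    /* Hedged sketch: a thin Fortran-to-C forwarding wrapper of the kind
     * PMPI-based tools usually provide.  The mangled name and ierr
     * convention are assumptions about this particular build. */
    #include <mpi.h>

    void mpi_finalize_(int *ierr)
    {
        /* Forward into the C profiling wrapper so Fortran and C codes
         * share a single instrumentation path for finalization. */
        *ierr = MPI_Finalize();
    }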
The LU benchmark remains intractable (we focused our attention primarily on CG because its implementation is more compact and because its results, as described in the original problem statement, appeared to cover something of a superset of the problems seen in LU).
The crash appears consistently in the first iteration of LU (so the problem is probably unrelated to initialization). Some anomalies were observed in certain collective operations during earlier debugging, and the problems with LU could be related to those.
We can now obtain partial traces of CG (up to the cross-node compression stage) for small runs. Runs on 1, 2, and 4 nodes work fairly consistently; runs on 8 or more nodes consistently fail somewhere in the middle of their iterations. The point of failure varies between them (8-node runs usually fail earlier than 16-node runs).
Despite the above, we were able to collect some partial debug traces (with RSD_DEBUG output), including at least some task-level compression.
Three major factors seem to come up with great frequency:
The last two appear to be related to loop-merging failures and recur frequently. This pair of errors gives us at least a hint of a direction to pursue in addressing the underlying scaling problem. Both originate in rsd_queue.c:476:void compress_rsd, as expected.
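To make the connection to scaling concrete, the sketch below illustrates the kind of loop-merge test an RSD-style compressor performs and why a merge failure hurts: when the merge is rejected, iterations are recorded verbatim and the trace grows with the iteration count. This is only our illustration, not the code in rsd_queue.c; the struct and helper names are hypothetical.

    /* Hedged illustration of RSD loop merging; NOT the actual
     * compress_rsd in rsd_queue.c:476.  All names are hypothetical. */
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct rsd {
        const void *body;      /* canonical event sequence for one iteration  */
        size_t      body_len;  /* number of events in that sequence           */
        unsigned    iters;     /* identical iterations this descriptor covers */
    } rsd_t;

    /* Hypothetical body comparison (assumes bodies are interned, so a
     * pointer/length check suffices). */
    static bool rsd_bodies_match(const rsd_t *a, const rsd_t *b)
    {
        return a->body_len == b->body_len && a->body == b->body;
    }

    /* Try to fold a newly closed iteration into an existing descriptor.
     * When this returns false the iteration is appended uncompressed,
     * so the trace grows with every loop iteration; that is the pattern
     * the loop-merging failures in the CG debug output suggest. */
    static bool rsd_try_merge(rsd_t *existing, const rsd_t *incoming)
    {
        if (!rsd_bodies_match(existing, incoming))
            return false;
        existing->iters += incoming->iters;
        return true;
    }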
A number of the errors discovered in previous reports seem to have been present in all cases; however, good compression ratios on certain benchmarks (i.e., IS) seem to have masked them (the buffer overflows in the debug tracing code, for instance, were only triggered strongly when large traces from unscalable benchmarks were run).
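The overflow pattern involved is the familiar one in which a fixed-size debug buffer is only exceeded once individual records become large. The sketch below shows a bounded alternative to that pattern; the buffer size and function name are our placeholders, not the actual tracing code.

    /* Hedged sketch of a bounded debug-trace writer; the buffer size and
     * name are hypothetical, not the actual RSD_DEBUG code. */
    #include <stdarg.h>
    #include <stdio.h>

    #define DBG_BUF_LEN 256   /* hypothetical fixed buffer size */

    static void rsd_debug_printf(FILE *out, const char *fmt, ...)
    {
        char dbg_buf[DBG_BUF_LEN];
        va_list ap;

        va_start(ap, fmt);
        /* vsnprintf truncates instead of writing past dbg_buf; an
         * unbounded vsprintf here would overflow exactly when a single
         * record (e.g. a long uncompressed event list) exceeds the
         * buffer, which only large, poorly compressing traces produce. */
        vsnprintf(dbg_buf, sizeof dbg_buf, fmt, ap);
        va_end(ap);

        fputs(dbg_buf, out);
    }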