This work addresses problems in exploiting the memory bandwidth of shared-memory multiprocessors (SMPs) for scientific applications. On contemporary high-performance clusters of SMPs, a number of scientific applications that use a mixed mode of MPI+OpenMP have been found to perform worse than when they rely on MPI alone. Given that the architectural model of SMPs appears to be a close fit to the OpenMP threading model, this performance gap is particularly surprising. The objective of this proposal is to determine the sources of inefficiency in the use of memory hierarchies by threaded programs compared with parallel processes, and to assist the programmer in alleviating these problems. The analysis methodology relies on binary rewriting.