Cache-coherent non-uniform memory architectures (ccNUMA) constitute an important class of high-performance computing platforms. Contemporary ccNUMA systems, such as the SGI Altix, have a large number of nodes, where each node consists of a small number of processors and a fixed amount of physical memory. All processors in the system access the same global virtual address space, but the physical memory is distributed across nodes, and coherence is maintained using hardware mechanisms. Accesses to local physical memory (on the same node as the requesting processor) result in lower latencies than accesses to remote memory (on a different node). Since many scientific programs are memory-bound, an intelligent page-placement policy that allocates pages closer to the requesting processor can significantly reduce the number of cycles required to access memory. We show that such a policy can lead to significant savings in wall-clock execution time. In this paper, we introduce a novel hardware-assisted page placement scheme based on automated profiling. The placement scheme allocates pages near the processors that most frequently access each page. The scheme leverages the performance monitoring capabilities of contemporary microprocessors to efficiently extract an approximate trace of memory accesses. This information is used to decide page affinity, i.e., the node to which the page is bound. Our method operates entirely in user space, is largely automated, and handles not only static but also dynamic memory allocation. We evaluate our framework with a set of multi-threaded benchmarks from the NAS and SPEC OpenMP suites. We investigate the use of two different hardware profile sources with respect to their cost (e.g., time to trace, number of records in the profile) vs. the accuracy of the profile and the corresponding savings in wall-clock execution time. We show that long-latency loads provide a better indicator for page placement than TLB misses.
Our experiments show that our method can efficiently improve page placement, leading to an average wall-clock execution time saving of more than 20% for our benchmarks, with a one-time profiling overhead of 2.7% of the original program's wall-clock time. To the best of our knowledge, this is the first evaluation on a real machine of a completely user-mode, interrupt-driven, profile-guided page placement scheme that requires no special compiler, operating system, or network interconnect support.
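The core placement decision described above can be sketched as follows: given an approximate trace of sampled memory accesses, bind each page to the node whose processors touched it most often. This is a minimal illustrative sketch only; the sample format (`(node_id, virtual_address)` pairs), the function name, and the page size are assumptions, not the paper's actual implementation.

```python
from collections import Counter, defaultdict

def page_affinity(samples, page_size=4096):
    """Decide page affinity from a sampled access trace.

    samples: iterable of (node_id, virtual_address) pairs, e.g. as
    recovered from hardware performance-monitoring interrupts.
    Returns a mapping from page number to the node that accessed
    that page most frequently (the node the page should be bound to).
    """
    counts = defaultdict(Counter)           # page -> Counter of node hits
    for node, addr in samples:
        counts[addr // page_size][node] += 1
    return {page: c.most_common(1)[0][0] for page, c in counts.items()}

# Example: page 0 is touched twice from node 0, once from node 1,
# so the policy would place it on node 0.
trace = [(0, 0x0000), (0, 0x0100), (1, 0x0200), (1, 0x1000), (1, 0x1004)]
affinity = page_affinity(trace)
```

In the real system the resulting affinity map would drive a user-space binding call (e.g. a `move_pages`-style facility) rather than remain a Python dictionary; the sketch captures only the majority-vote decision.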