Cache-coherent non-uniform memory architectures (ccNUMA) constitute an important class of high-performance computing platforms. Contemporary ccNUMA systems, such as the SGI Altix, have a large number of nodes, where each node consists of a small number of processors and a fixed amount of physical memory. All processors in the system access the same global virtual address space, but the physical memory is distributed across nodes, and coherence is maintained using hardware mechanisms. Accesses to local physical memory (on the same node as the requesting processor) result in lower latencies than accesses to remote memory (on a different node). Since many scientific programs are memory-bound, an intelligent page-placement policy that allocates pages closer to the requesting processor can significantly reduce the number of cycles required to access memory. We show that such a policy can lead to significant savings in wall-clock execution time. In this paper, we introduce a novel hardware-assisted page placement scheme based on automated profiling. The placement scheme allocates pages near the processors that most frequently access each page. The scheme leverages the performance monitoring capabilities of contemporary microprocessors to efficiently extract an approximate trace of memory accesses. This information is used to decide page affinity, i.e., the node to which the page is bound. Our method operates entirely in user space, is largely automated, and handles not only static but also dynamic memory allocation. We evaluate our framework with a set of multi-threaded benchmarks from the NAS and SPEC OpenMP suites. We investigate the use of two different hardware profile sources with respect to their cost (e.g., time to trace, number of records in the profile) vs. the accuracy of the profile and the corresponding savings in wall-clock execution time. We show that long-latency loads provide a better indicator for page placement than TLB misses.
Our experiments show that our method can efficiently improve page placement, leading to an average wall-clock execution time saving of more than 20% for our benchmarks, with a one-time profiling overhead of 2.7% of the original program's wall-clock time. To the best of our knowledge, this is the first evaluation on a real machine of a completely user-mode, interrupt-driven, profile-guided page placement scheme that requires no special compiler, operating system, or network interconnect support.
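The core placement decision described above can be sketched as follows: given an approximate trace of sampled memory accesses, bind each page to the node whose processors touched it most often. This is a minimal illustrative sketch only; the sample format (`(node_id, virtual_address)` pairs), the function name, and the page size are assumptions, not the paper's actual implementation.

```python
from collections import Counter, defaultdict

def page_affinity(samples, page_size=4096):
    """Decide page affinity from a sampled access trace.

    samples: iterable of (node_id, virtual_address) pairs, e.g. as
    recovered from hardware performance-monitoring interrupts.
    Returns a mapping from page number to the node that accessed
    that page most frequently (the node the page should be bound to).
    """
    counts = defaultdict(Counter)           # page -> Counter of node hits
    for node, addr in samples:
        counts[addr // page_size][node] += 1
    return {page: c.most_common(1)[0][0] for page, c in counts.items()}

# Example: page 0 is touched twice from node 0, once from node 1,
# so the policy would place it on node 0.
trace = [(0, 0x0000), (0, 0x0100), (1, 0x0200), (1, 0x1000), (1, 0x1004)]
affinity = page_affinity(trace)
```

In the real system the resulting affinity map would drive a user-space binding call (e.g. a `move_pages`-style facility) rather than remain a Python dictionary; the sketch captures only the majority-vote decision.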