Exploiting Hyper-threading and MPI

CSC591c Course Project

Nikola Vouk (nvouk@ncsu.edu)

Frank Castaneda (fjcastan@ncsu.edu)

April 10, 2003

Goals:

The goal of this project is to exploit the hyper-threading architecture of the Intel Xeon processor, together with the experimental Linux 2.5 kernel, to overlap the communication overhead of a work unit with its computation: a second thread handles communication while the main work thread does the actual work. The idea is that the two threads share the functional units of the processor and run in parallel. The key is to find a main work unit, such as a render, a compression, or a numerical computation, that can share the processor with the send/receive overhead of node-to-node communication.

Software Setup:

Red Hat Linux with the latest 2.5 kernel, which supports hyper-threading.

PAPI for reading the hardware performance counters

MPI for message passing that bypasses the kernel

ASCI Benchmark Experiment

As part of the project, we have to install an ASCI benchmark on the class cluster. We have decided to install sPPM.

Hardware Setup:

The target machines are IBM X232 servers with dual 2.0 GHz Intel Xeon MP processors.

Experiment Setup:

A master node that sends data to a hyper-threaded slave node.

The hyper-threaded node runs a long-running application, such as a render, that can pipeline I/O calls with communication to the master node. The I/O should take long enough that a noticeable delay would occur if it were done sequentially (Amdahl's Law).

The hyper-thread will spin-wait on a global variable so that it stays resident on the processor, and will perform I/O on behalf of the main thread when called.

The main thread will do some sort of calculation work, probably taken from the SPEC suite of benchmarks. These benchmarks let us target specific parts of the processor, such as particular ALUs.
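
The helper-thread arrangement described above can be sketched with POSIX threads and C11 atomics. This is only a minimal illustration under stated assumptions, not the project's code: the names request, outbox, wire, sends_done, io_helper and run_once are hypothetical, a memcpy stands in for the real send to the master node, and pinning the two threads to sibling logical processors is left out.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <string.h>

/* Hypothetical request states shared by the two threads. */
enum { IDLE = 0, IO_REQUESTED = 1, SHUTDOWN = 2 };

atomic_int request = IDLE;   /* the global variable the helper spins on */
atomic_int sends_done = 0;   /* number of I/O requests the helper served */
char outbox[64];             /* data the main thread wants sent */
char wire[64];               /* stand-in for the network (assumption) */

/* Helper thread: spin until asked, do the I/O, go back to spinning. */
void *io_helper(void *arg)
{
    (void)arg;
    for (;;) {
        int r = atomic_load_explicit(&request, memory_order_acquire);
        if (r == SHUTDOWN)
            break;
        if (r == IO_REQUESTED) {
            memcpy(wire, outbox, sizeof wire);   /* placeholder for the send */
            atomic_fetch_add(&sends_done, 1);
            atomic_store_explicit(&request, IDLE, memory_order_release);
        }
        /* A real spin loop on a hyper-threaded CPU would likely issue the
           PAUSE instruction here so the sibling thread is not starved. */
    }
    return NULL;
}

/* Main thread: hand one buffer to the helper, then keep computing.
   Returns the result of the "work", or -1 if the helper could not start. */
long run_once(const char *msg)
{
    pthread_t helper;
    long acc = 0;

    atomic_store(&request, IDLE);
    atomic_store(&sends_done, 0);
    if (pthread_create(&helper, NULL, io_helper, NULL) != 0)
        return -1;

    strncpy(outbox, msg, sizeof outbox - 1);
    outbox[sizeof outbox - 1] = '\0';
    atomic_store_explicit(&request, IO_REQUESTED, memory_order_release);

    for (long i = 1; i <= 1000; i++)   /* work that overlaps the I/O */
        acc += i;

    /* Wait until the helper has finished the request, then stop it. */
    while (atomic_load_explicit(&request, memory_order_acquire) != IDLE)
        ;
    atomic_store_explicit(&request, SHUTDOWN, memory_order_release);
    pthread_join(helper, NULL);
    return acc;
}
```

Calling run_once("hello") copies the message to wire on the helper thread while the main thread sums 1 to 1000. The release/acquire pair on request is what makes the contents of outbox visible to the helper before it starts copying.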

Results

We will test the system in hyper-threaded and non-hyper-threaded environments. We expect to see a major improvement when there is a lot of communication overhead. Hyper-threading divides the processor's resources, including its buffers, between the two threads during processing. For short runs, the overhead of the extra thread will outweigh the benefit, but for large, long-running renders or compressions, pipelining the sending and receiving of data should yield higher performance. Our application will attempt to use the SIMD units to maximize performance. The limitation will show up in code where both threads use the same internal buffers. [1]
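
The expectation above can be made concrete with a simple timing model. The sketch below is an idealization and not a measured result: it assumes the sequential run costs compute time plus communication time, that perfect overlap costs only the longer of the two, and that the helper thread adds a fixed synchronization cost. The function name overlap_speedup and its parameters are hypothetical. The model ignores the slowdown each thread suffers from sharing execution units and buffers, which is exactly what the experiment is meant to measure.

```c
/* Idealized speedup from overlapping communication with computation.
   All times are in the same unit (for example, seconds).
   t_sync is the assumed fixed cost of coordinating with the helper thread. */
double overlap_speedup(double t_compute, double t_comm, double t_sync)
{
    double sequential = t_compute + t_comm;
    double longest = (t_compute > t_comm) ? t_compute : t_comm;
    double overlapped = longest + t_sync;

    if (overlapped <= 0.0)
        return 0.0;               /* nothing to run, no meaningful speedup */
    return sequential / overlapped;
}
```

Under this model the best case is a factor of two, reached when compute and communication times are equal and the synchronization cost is negligible. When the work is short relative to t_sync, the ratio drops below one, which matches the expectation that short runs will not benefit.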

Questions and Concerns:

What is the actual processor overhead of an isend/irecv?
Does isend/irecv cause a context switch?
Does isend/irecv use DMA and what actual processor involvement is required?

We think that packetizing the data, even when it is done in user space, has overhead that uses some CPU resources and is not handled completely by the NIC.
Where does DMA fit into this equation?
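
One way to start answering the overhead questions is to compare the wall-clock time of a call with the CPU time charged to the calling thread. The sketch below is a generic harness under stated assumptions: measure_call, clock_seconds, struct call_cost and count_call are hypothetical names, and the function pointer is only a stand-in for a wrapper around the real isend/irecv call, which is not shown here.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* exposes clock_gettime in strict C modes */
#endif
#include <time.h>

/* Hypothetical signature for the call being measured. */
typedef void (*call_fn)(void *ctx);

struct call_cost {
    double wall_sec;   /* elapsed time per call */
    double cpu_sec;    /* CPU time charged to this thread per call */
};

/* Current reading of the given POSIX clock, in seconds. */
double clock_seconds(clockid_t id)
{
    struct timespec ts;
    clock_gettime(id, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Call fn(ctx) reps times and report the average cost of one call. */
struct call_cost measure_call(call_fn fn, void *ctx, int reps)
{
    struct call_cost cost = {0.0, 0.0};
    if (reps <= 0)
        return cost;

    double wall0 = clock_seconds(CLOCK_MONOTONIC);
    double cpu0 = clock_seconds(CLOCK_THREAD_CPUTIME_ID);
    for (int i = 0; i < reps; i++)
        fn(ctx);
    cost.cpu_sec = (clock_seconds(CLOCK_THREAD_CPUTIME_ID) - cpu0) / reps;
    cost.wall_sec = (clock_seconds(CLOCK_MONOTONIC) - wall0) / reps;
    return cost;
}

/* Example stand-in: counts how many times it was invoked. */
void count_call(void *ctx)
{
    ++*(int *)ctx;
}
```

If the wall-clock figure is much larger than the CPU figure, the thread probably spent most of the call waiting rather than computing. Similar figures would suggest the call is doing its work on the CPU. Hardware counters read through PAPI would give a finer breakdown than this.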

How many ALU/FP/SIMD units are available?
Cache misses due to data size could stall the working thread, leaving the hyper-thread to run alone on the CPU.
Cache thrashing could occur when the hyper-thread and the working thread operate on data sets that are both not in cache.

Update 4/24/2003

The latest update on the project is available here.

Final Report

The final report of the project is available here.

Website

http://www4.ncsu.edu/~nvouk/exploitinghyper.html

References:

The IA-32 Intel Architecture Software Developer's Manual, Volume 1: Basic Architecture (Order Number 245470).
ftp://download.intel.com/design/Pentium4/manuals/24547011.pdf

The IA-32 Intel Architecture Software Developer's Manual, Volume 2: Instruction Set Reference (Order Number 245471).
ftp://download.intel.com/design/Pentium4/manuals/24547111.pdf
The IA-32 Intel Architecture Software Developer's Manual, Volume 3: System Programming Guide (Order Number 245472).
ftp://download.intel.com/design/Pentium4/manuals/24547210.pdf

D. Tullsen, S. Eggers, J. Emer, H. Levy, J. Lo, and R. Stamm, "Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor," 23rd Annual International Symposium on Computer Architecture, May 1996.
http://citeseer.nj.nec.com/cache/papers/cs/7286/http:zSzzSzwww.csrd.uiuc.eduzSz~ece412zSzpaperszSztullsen_ISCA96.pdf/tullsen96exploiting.pdf
Download of performance libs
http://www.intel.com/software/products/global/eval.htm#perflib
Pentium optimized libraries
http://www.intel.com/software/products/ipp/ipp30/index.htm
Detailed Article on Hyper-threading in the Pentium Xeon
http://developer.intel.com/technology/itj/2002/volume06issue01/art01_hyper/p01_abstract.htm
Intel Processor Programming Manuals
http://developer.intel.com/design/Pentium4/manuals/
Pentium 4 and the G4e: Architectural Comparison
http://arstechnica.com/cpu/01q2/p4andg4e/p4andg4e-6.html
IBM hyper-threading document
https://mail.gininet.com/Redirect/www-106.ibm.com/developerworks/linux/library/l-htl/
SPEC CPU2000 CINT2000 Test Suite
http://www.specbench.org/osg/cpu2000/CINT2000/