Exploiting Hyper-threading and MPI
CSC591c Course Project
Nikola Vouk (nvouk@ncsu.edu)
Frank Castaneda (fjcastan@ncsu.edu)
April 10, 2003
Goals:
The goal of this project is to exploit the hyper-threading architecture of
the Intel Xeon processor, together with the experimental Linux 2.5 kernel, to
run the communication overhead of a work unit in parallel with the main work
thread's computation. The idea is that the processor's functional units are
shared among the hardware threads, so the two activities can execute
concurrently. The key is to find a main work unit, such as a render, a
compression, or a numerical computation, that can share the processor with the
send/receive overhead required for node-to-node communication.
Software Setup:
Red Hat Linux with the latest 2.5 kernel, which supports hyper-threading.
PAPI for access to hardware performance counters
MPI for message passing that bypasses the kernel
ASCI Benchmark Experiment
As part of the project, we have to install an ASCI benchmark
on the class cluster. We have decided to install sPPM.
Hardware Setup:
The target machines are IBM X232 systems with dual 2.0 GHz Intel Xeon MP
processors.
Experiment Setup:
A master node that sends data to a hyper-threaded slave node.
The hyper-threaded node runs a long-running application, such as a render,
that can pipeline its I/O calls with communication to the master node. The I/O
should take long enough that a noticeable delay would occur if it were done
sequentially (Amdahl's Law).
The helper hyper-thread spin-waits on a global variable so that it stays
resident in memory, and performs I/O on behalf of the main thread when called.
The main thread does calculation work, probably drawn from the SPEC suite of
benchmarks. These benchmarks let us target specific execution units on the
processor, such as the ALUs.
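The helper-thread arrangement described above can be sketched in C with POSIX threads and C11 atomics. This is a minimal illustration under stated assumptions, not the project's implementation: the names `mailbox_t`, `helper_main`, `post_send`, and `run_overlap_demo` are hypothetical, and a `memcpy` into a local buffer stands in for the real send, where an MPI call would go.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mailbox shared by the main (work) thread and the helper
 * (I/O) thread. All names are illustrative, not taken from the project. */
enum { IDLE = 0, REQUEST = 1, SHUTDOWN = 2 };

typedef struct {
    atomic_int  state;     /* the shared variable the helper spins on */
    const char *src;       /* data the main thread wants sent */
    size_t      len;
    char        wire[256]; /* stand-in for the network (assumption) */
    int         handled;   /* requests completed by the helper */
} mailbox_t;

/* Helper thread: spin until a request appears, service it, repeat. */
static void *helper_main(void *arg)
{
    mailbox_t *mb = arg;
    for (;;) {
        int s = atomic_load_explicit(&mb->state, memory_order_acquire);
        if (s == SHUTDOWN)
            break;
        if (s == REQUEST) {
            memcpy(mb->wire, mb->src, mb->len);  /* simulated send */
            mb->handled++;
            atomic_store_explicit(&mb->state, IDLE, memory_order_release);
        }
        /* otherwise keep spinning so the helper stays ready */
    }
    return NULL;
}

/* Main thread: hand one message to the helper. Spins only if the
 * previous request has not been serviced yet. */
static void post_send(mailbox_t *mb, const char *src, size_t len)
{
    assert(len <= sizeof mb->wire);
    while (atomic_load_explicit(&mb->state, memory_order_acquire) != IDLE)
        ;
    mb->src = src;
    mb->len = len;
    atomic_store_explicit(&mb->state, REQUEST, memory_order_release);
}

/* Post `rounds` messages, computing between posts while the helper
 * services them. Returns the number of requests the helper handled. */
int run_overlap_demo(int rounds, double *work_out)
{
    static const char msg[] = "work-unit result";
    mailbox_t mb;
    pthread_t helper;
    double acc = 0.0;

    atomic_init(&mb.state, IDLE);
    mb.handled = 0;
    if (pthread_create(&helper, NULL, helper_main, &mb) != 0)
        return -1;

    for (int r = 0; r < rounds; r++) {
        post_send(&mb, msg, sizeof msg);
        for (int i = 1; i <= 1000; i++)   /* main-thread "work" */
            acc += 1.0 / i;
    }

    /* Let the last request drain, then stop the helper. */
    while (atomic_load_explicit(&mb.state, memory_order_acquire) != IDLE)
        ;
    atomic_store_explicit(&mb.state, SHUTDOWN, memory_order_release);
    pthread_join(helper, NULL);

    *work_out = acc;
    return mb.handled;
}
```

Calling `run_overlap_demo(50, &w)` posts 50 messages and returns the count the helper serviced. The acquire/release pair on `state` is what makes the `src` and `len` fields visible to the helper. On a hyper-threaded processor the spin loops would normally also issue the `pause` instruction so the spinning logical processor gives up execution resources to its sibling; that is omitted here to keep the sketch portable.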
Results
We will test the system in hyper-threaded and non-hyper-threaded
configurations. We expect a major improvement when communication overhead is
large. Hyper-threading divides processor resources, including access to
internal buffers, between the two threads. For short runs, the cost of setting
up the overlap will outweigh the benefit, but for long-running renders or
compressions, pipelining the sending and receiving of data should yield
higher performance. Our application will attempt to use the SIMD units to
maximize performance. The limitation will appear in code whose threads contend
for the same internal buffers. [1]
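The expectation that short runs lose and long runs win can be made concrete with a simple timing model. This is a sketch under assumptions of our own, not a measured result: N is the number of work units, T_c and T_m are the per-unit compute and communication times, T_s is a one-time cost of setting up the helper thread, and alpha >= 1 is a hypothetical slowdown of the compute thread caused by sharing execution resources.

```latex
\begin{align*}
T_{\text{seq}} &= N\,(T_c + T_m) \\
T_{\text{ovl}} &= T_s + N \max(\alpha T_c,\; T_m) \\
T_{\text{ovl}} < T_{\text{seq}}
  &\iff N\,\bigl(T_m - (\alpha - 1)\,T_c\bigr) > T_s
  \qquad \text{when } \alpha T_c \ge T_m
\end{align*}
```

With purely illustrative numbers (T_c = 10 ms, T_m = 4 ms, alpha = 1.2, T_s = 100 ms), each overlapped unit saves 4 - 0.2 * 10 = 2 ms, so overlap pays off only once N exceeds 50 work units. If T_m is not larger than (alpha - 1) T_c, the model predicts no benefit at any run length.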
Questions and Concerns:
- What is the actual processor overhead of an isend/irecv?
- Does isend/irecv cause a context switch?
- Does isend/irecv use DMA, and what actual processor involvement is
required?
- We think that packetizing the data carries overhead that uses some CPU
resources, even in user space, and is not handled completely by the NIC.
- Does DMA fit into this equation?
- How many ALU/FP/SIMD units are available?
- Cache misses due to data size could cause the working thread to stall,
allowing the hyper-thread to run alone on the CPU.
- Cache thrashing, where the hyper-thread and the working thread operate on
data sets neither of which is in memory.
Update 4/24/2003
The latest update on the project is available
here.
Final Report
The final report of the project is available
here.
Website
http://www4.ncsu.edu/~nvouk/exploitinghyper.html
References:
- The IA-32 Intel Architecture Software Developer's Manual, Volume 1: Basic
Architecture (Order Number 245470).
ftp://download.intel.com/design/Pentium4/manuals/24547011.pdf
- The IA-32 Intel Architecture Software Developer's Manual, Volume 2:
Instruction Set Reference (Order Number 245471).
ftp://download.intel.com/design/Pentium4/manuals/24547111.pdf
- The IA-32 Intel Architecture Software Developer's Manual, Volume 3: System
Programming Guide (Order Number 245472).
ftp://download.intel.com/design/Pentium4/manuals/24547210.pdf
- D. Tullsen, S. Eggers, J.
Emer, H. Levy, J. Lo, and R. Stamm, "Exploiting choice: Instruction
fetch and issue on an implementable simultaneous multithreading
processor," 23rd Annual International Symposium on Computer
Architecture, May 1996.
http://citeseer.nj.nec.com/cache/papers/cs/7286/http:zSzzSzwww.csrd.uiuc.eduzSz~ece412zSzpaperszSztullsen_ISCA96.pdf/tullsen96exploiting.pdf
- Download of performance libs
http://www.intel.com/software/products/global/eval.htm#perflib
- Pentium optimized libraries
http://www.intel.com/software/products/ipp/ipp30/index.htm
- Detailed Article on
Hyper-threading in the Pentium Xeon
http://developer.intel.com/technology/itj/2002/volume06issue01/art01_hyper/p01_abstract.htm
- Intel Processor Programming
Manuals
http://developer.intel.com/design/Pentium4/manuals/
- Pentium 4 and the G4e: Architectural Comparison
http://arstechnica.com/cpu/01q2/p4andg4e/p4andg4e-6.html
- IBM hyper-threading document
https://mail.gininet.com/Redirect/www-106.ibm.com/developerworks/linux/library/l-htl/
- Spec C 2000 Test Suite
http://www.specbench.org/osg/cpu2000/CINT2000/