Exploiting Hyper-threading and MPI
CSC591c Course Project
Nikola Vouk (nvouk@ncsu.edu)
Frank Castaneda (fjcastan@ncsu.edu)
April 10, 2003
 
Goals:
 
The goal of this project is to exploit the hyper-threading architecture of the
Intel Xeon processor, together with the experimental Linux 2.5 kernel, to run
the communication overhead of a work unit in parallel with the main work
thread, which does the actual work. The idea is that the processor's functional
units are shared among the threads, so the two functions run in parallel. The
key is to find a main work unit, such as a render, compression, or computation,
that can share the processor with the send/receive overhead required for
node-to-node communication.
 
 
Software Setup:
 
- Red Hat Linux with the latest 2.5 kernel, which supports hyper-threading
- PAPI for access to the hardware performance counters used for optimization
- MPI for message passing that bypasses the kernel
 
 
 
ASCI Benchmark Experiment
As part of the project, we have to install an ASCI benchmark
on the class cluster. We have decided to install sPPM. 
 
Hardware Setup:
The target machines are IBM X232 systems with dual 2.0 GHz Intel Xeon MP
processors.
 
Experiment Setup:
 
A master node sends data to a hyper-threaded slave node.
 
The hyper-threaded node runs a long-running application, such as a render, that
can pipeline its I/O calls with communication to the master node. The I/O
should take long enough that a noticeable delay would occur if it were done
sequentially (Amdahl's Law).
 
The hyper-thread will spin-lock on a global variable so that it stays resident
in memory, and will perform I/O for the main thread when called.
 
The main thread will do some kind of computational work, probably drawn from
the SPEC suite of benchmarks. These benchmarks allow us to target specific
aspects of the processor, such as particular ALUs.
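
The handoff between the main thread and the hyper-thread described above can be
sketched as follows. This is a minimal illustration in Python, not the
project's implementation: the class name IoHelper, its methods, and the send
callable standing in for the transfer to the master node are all hypothetical.
Python's interpreter lock also keeps the two threads from truly running at the
same time, so the sketch shows only the control flow (spin on a shared flag,
perform the I/O, clear the flag), not the performance behaviour.

```python
import threading


class IoHelper:
    """Helper thread that busy-waits on a shared flag and performs I/O on request.

    All names here are hypothetical; `send` stands in for the real
    communication call to the master node.
    """

    def __init__(self, send):
        self._send = send        # callable that "transmits" one buffer
        self._pending = None     # buffer handed over by the main thread
        self._has_work = False   # the shared variable the helper spins on
        self._stop = False       # set once no more work will be submitted
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def _run(self):
        # Busy-wait loop: the helper never blocks or sleeps, it only polls.
        while not self._stop:
            if self._has_work:
                self._send(self._pending)
                self._has_work = False   # tell the main thread the I/O is done

    def submit(self, buf):
        """Hand one buffer to the helper; called from the main thread."""
        while self._has_work:            # wait until the previous I/O finished
            pass
        self._pending = buf
        self._has_work = True

    def close(self):
        """Wait for outstanding I/O, then stop the helper thread."""
        while self._has_work:
            pass
        self._stop = True
        self._thread.join()
```

For example, passing `sent.append` (with `sent = []`) as the send callable and
submitting three buffers leaves them in `sent`, in submission order, once
`close()` returns. The main thread is free to compute between calls to
`submit()`.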
 
 
Results
We will test the system in hyper-threaded and non-hyper-threaded environments.
We expect to see a major improvement when there is a lot of communication
overhead. Hyper-threading splits the processor's resources between the threads,
including buffer access during processing. Over short runs, the overhead of the
extra thread will outweigh the benefit, but with large, long-running renders or
compressions, pipelining the sending and receiving of data should yield higher
performance. Our application will attempt to use the SIMD units to maximize
performance. The limitation will be seen in code where both threads use the
same internal buffers. [1]
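
The trade-off between short and long runs can be made concrete with a simple
two-stage pipeline model. This is a sketch under our own assumptions, not a
measurement: the function names, the per-unit compute and communication times,
the one-time setup cost of the helper thread, and the slowdown factor (how much
the compute thread slows down when it shares the core) are all hypothetical
parameters.

```python
def sequential_time(units, compute, comm):
    """Total time when every work unit computes and then communicates in turn."""
    return units * (compute + comm)


def overlapped_time(units, compute, comm, setup=0.0, slowdown=1.0):
    """Total time when the communication of unit i overlaps the compute of unit i+1.

    setup    -- hypothetical one-time cost of starting the helper thread
    slowdown -- hypothetical factor (>= 1) by which sharing the core slows compute
    """
    stage = compute * slowdown
    # Two-stage pipeline: fill with one compute stage, then each further unit
    # costs the slower of the two stages, then drain with one communication.
    return setup + stage + (units - 1) * max(stage, comm) + comm


if __name__ == "__main__":
    for units in (1, 1000):
        seq = sequential_time(units, compute=1.0, comm=0.5)
        ovl = overlapped_time(units, compute=1.0, comm=0.5,
                              setup=2.0, slowdown=1.25)
        print(units, seq, ovl)
```

With these made-up numbers, a single work unit takes 1.5 time units
sequentially but 3.75 overlapped, while 1000 units take 1500 sequentially and
1252.5 overlapped. The fixed costs dominate short runs and are amortized over
long ones.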
 
Questions and Concerns:
- What is the actual processor overhead of an isend/irecv?
- Does isend/irecv cause a context switch?
- Does isend/irecv use DMA, and what actual processor involvement is required?
  - We think that packetizing the data incurs overhead, even in user space,
    that uses some CPU resources and is not handled entirely by the NIC.
  - How does DMA fit into this equation?
- How many ALU/FP/SIMD units are available?
- Cache misses due to data size could cause the working thread to stall,
  allowing the hyper-thread to run alone on the CPU.
- Cache thrashing could occur when the hyper-thread and the working thread
  operate on data sets that are both not in memory.
 
Update 4/24/2003
The latest update on the project is available here.
Final Report
The final report of the project is available here.
Website
 
http://www4.ncsu.edu/~nvouk/exploitinghyper.html
 
References:
- The IA-32 Intel Architecture Software Developer's Manual, Volume 1: Basic
  Architecture (Order Number 245470).
  ftp://download.intel.com/design/Pentium4/manuals/24547011.pdf
- The IA-32 Intel Architecture Software Developer's Manual, Volume 2:
  Instruction Set Reference (Order Number 245471).
  ftp://download.intel.com/design/Pentium4/manuals/24547111.pdf
- The IA-32 Intel Architecture Software Developer's Manual, Volume 3: System
  Programming Guide (Order Number 245472).
  ftp://download.intel.com/design/Pentium4/manuals/24547210.pdf
- D. Tullsen, S. Eggers, J. Emer, H. Levy, J. Lo, and R. Stamm, "Exploiting
  choice: Instruction fetch and issue on an implementable simultaneous
  multithreading processor," 23rd Annual International Symposium on Computer
  Architecture, May 1996.
  http://citeseer.nj.nec.com/cache/papers/cs/7286/http:zSzzSzwww.csrd.uiuc.eduzSz~ece412zSzpaperszSztullsen_ISCA96.pdf/tullsen96exploiting.pdf
- Download of performance libs
  http://www.intel.com/software/products/global/eval.htm#perflib
- Pentium optimized libraries
  http://www.intel.com/software/products/ipp/ipp30/index.htm
- Detailed article on hyper-threading in the Pentium Xeon
  http://developer.intel.com/technology/itj/2002/volume06issue01/art01_hyper/p01_abstract.htm
- Intel processor programming manuals
  http://developer.intel.com/design/Pentium4/manuals/
- Pentium 4 and the G4e: an architectural comparison
  http://arstechnica.com/cpu/01q2/p4andg4e/p4andg4e-6.html
- IBM hyper-threading document
  https://mail.gininet.com/Redirect/www-106.ibm.com/developerworks/linux/library/l-htl/
- SPEC CINT2000 test suite
  http://www.specbench.org/osg/cpu2000/CINT2000/