Modern processors provide a multitude of opportunities for instruction-level parallelism that most current applications cannot fully utilize. To increase core execution efficiency, modern processors can fetch instructions from two or more tasks simultaneously, raising the rate of instructions executed per cycle (IPC). These processors implement simultaneous multi-threading (SMT), which increases processor efficiency through thread-level parallelism, but can also introduce cache conflicts and functional-unit contention.

Consider the high-end applications that typically run on clusters of commodity computers, sending, receiving, and computing data. On non-SMT processors such an application must compute data, context switch, communicate that data, context switch, compute more data, and so on. Computation tends to use floating-point functional units, while communication tends to use integer functional units. Until recently, communication libraries could not take full advantage of this parallelism because SMT hardware was not available.

This thesis explores the feasibility of exploiting this natural compute/communicate parallelism in distributed applications, especially applications that are not optimized for the constraints imposed by SMT hardware. This work also explores hardware and software thread-synchronization primitives that reduce inter-thread communication latency and operating system context switch time, in order to maximize a program's ability to compute and communicate simultaneously. We describe the design and implementation of a modified MPICH MPI library that allows legacy applications to take advantage of SMT processor parallelism. Also described is a thread-promoting buddy scheduler that ensures paired threads are always scheduled simultaneously, reducing context switch overhead, scheduling overhead, and memory latency.
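The compute/communicate overlap described above can be sketched in a few lines: one thread performs floating-point-bound computation while a second thread moves data through a socket, mirroring how an SMT core can co-schedule FP-bound and integer/IO-bound work without a context switch between phases. This is an illustrative sketch only, not the thesis's MPICH implementation; the names `compute_worker` and `comm_worker` and the socket-pair stand-in for the MPI send/receive path are assumptions of this example.

```python
import socket
import threading

def compute_worker(out):
    # Floating-point-bound work: a partial harmonic sum stands in for the
    # application's numeric kernel.
    total = 0.0
    for i in range(1, 1_000_001):
        total += 1.0 / i
    out["compute"] = total

def comm_worker(out):
    # Integer/IO-bound work: shuttling buffers through a local socket pair
    # stands in for the MPI communication path. Blocking socket I/O releases
    # the interpreter lock, so this genuinely overlaps the compute thread.
    a, b = socket.socketpair()
    payload = b"x" * 4096
    moved = 0
    for _ in range(64):
        a.sendall(payload)
        need = len(payload)
        while need:                      # drain exactly one payload per send
            need -= len(b.recv(need))
        moved += len(payload)
    a.close()
    b.close()
    out["comm"] = moved

results = {}
workers = [threading.Thread(target=compute_worker, args=(results,)),
           threading.Thread(target=comm_worker, args=(results,))]
for t in workers:
    t.start()
for t in workers:
    t.join()
# results now holds both the computed sum and the bytes moved concurrently
```

On an SMT processor the two threads can occupy one physical core at once, the compute thread keeping the floating-point units busy while the communicate thread uses the integer units, which is precisely the idle-resource overlap the thesis targets.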
Finally, this work investigates reducing inter-thread communication latency through hardware synchronization primitives, which allow threads to "instantly" notify one another of changes in program state. Overall, we show that distributed application performance can be further improved by exploiting the native parallelism that SMT processors provide.
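The notification pattern those primitives support can be illustrated with a portable software stand-in: a shared flag that the waiting thread spins on, avoiding the OS sleep/wake path a condition variable would take. This is a sketch of the idea only; the `SpinFlag` class is hypothetical and the thesis's actual mechanism is a hardware primitive, not Python code.

```python
import threading

class SpinFlag:
    """Minimal one-shot notification flag (illustrative stand-in for a
    hardware synchronization primitive)."""
    def __init__(self):
        self._set = False

    def notify(self):
        # A single attribute store; in CPython this is atomic enough for a
        # one-shot flag. Real hardware primitives make the wake-up immediate.
        self._set = True

    def wait(self):
        # Busy-wait: trades CPU cycles for low wake-up latency, the same
        # trade-off the hardware primitives aim to avoid paying at all.
        while not self._set:
            pass

flag = SpinFlag()
result = []

def waiter():
    flag.wait()              # spins until the other thread notifies
    result.append("woken")

t = threading.Thread(target=waiter)
t.start()
flag.notify()                # "instantly" signal the state change
t.join()
```

The point of the hardware primitives studied in the thesis is to get this immediacy without burning a functional unit on the spin loop, so the waiting thread's SMT sibling keeps its execution resources.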