Modern processors provide a multitude of opportunities for instruction-level parallelism that most current applications cannot fully utilize. To increase core execution efficiency, modern processors can fetch instructions from two or more tasks simultaneously, raising the rate of instructions executed per cycle (IPC). These processors implement simultaneous multi-threading (SMT), which increases processor efficiency through thread-level parallelism, but can also introduce cache conflicts and functional-unit contention.

Consider the high-end applications that typically run on clusters of commodity computers, sending, receiving, and computing data. On non-SMT processors such an application must compute data, context switch, communicate that data, context switch, compute more data, and so on. Computation tends to use floating-point functional units, while communication tends to use integer functional units. Until recently, communication libraries could not take full advantage of this parallelism because SMT hardware was not available.

This thesis explores the feasibility of exploiting this natural compute/communicate parallelism in distributed applications, especially applications that are not optimized for the constraints imposed by SMT hardware. This work also explores hardware and software thread-synchronization primitives that reduce inter-thread communication latency and operating system context switch time, in order to maximize a program's ability to compute and communicate simultaneously. We describe the design and implementation of a modified MPICH MPI library that allows legacy applications to take advantage of SMT processor parallelism. Also described is a thread-promoting buddy scheduler that ensures paired threads are always scheduled simultaneously, reducing context switch overhead, scheduling overhead, and memory latency.
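The compute/communicate overlap described above can be sketched in a few lines: one thread performs floating-point-bound computation while a second thread moves data through a socket, mirroring how an SMT core can co-schedule FP-bound and integer/IO-bound work without a context switch between phases. This is an illustrative sketch only, not the thesis's MPICH implementation; the names `compute_worker` and `comm_worker` and the socket-pair stand-in for the MPI send/receive path are assumptions of this example.

```python
import socket
import threading

def compute_worker(out):
    # Floating-point-bound work: a partial harmonic sum stands in for the
    # application's numeric kernel.
    total = 0.0
    for i in range(1, 1_000_001):
        total += 1.0 / i
    out["compute"] = total

def comm_worker(out):
    # Integer/IO-bound work: shuttling buffers through a local socket pair
    # stands in for the MPI communication path. Blocking socket I/O releases
    # the interpreter lock, so this genuinely overlaps the compute thread.
    a, b = socket.socketpair()
    payload = b"x" * 4096
    moved = 0
    for _ in range(64):
        a.sendall(payload)
        need = len(payload)
        while need:                      # drain exactly one payload per send
            need -= len(b.recv(need))
        moved += len(payload)
    a.close()
    b.close()
    out["comm"] = moved

results = {}
workers = [threading.Thread(target=compute_worker, args=(results,)),
           threading.Thread(target=comm_worker, args=(results,))]
for t in workers:
    t.start()
for t in workers:
    t.join()
# results now holds both the computed sum and the bytes moved concurrently
```

On an SMT processor the two threads can occupy one physical core at once, the compute thread keeping the floating-point units busy while the communicate thread uses the integer units, which is precisely the idle-resource overlap the thesis targets.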
Finally, this work investigates reducing inter-thread communication latency through hardware synchronization primitives, which allow threads to "instantly" notify one another of changes in program state. Overall, we show that distributed application performance can be further improved by exploiting the native parallelism that SMT processors provide.
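The notification pattern those primitives support can be illustrated with a portable software stand-in: a shared flag that the waiting thread spins on, avoiding the OS sleep/wake path a condition variable would take. This is a sketch of the idea only; the `SpinFlag` class is hypothetical and the thesis's actual mechanism is a hardware primitive, not Python code.

```python
import threading

class SpinFlag:
    """Minimal one-shot notification flag (illustrative stand-in for a
    hardware synchronization primitive)."""
    def __init__(self):
        self._set = False

    def notify(self):
        # A single attribute store; in CPython this is atomic enough for a
        # one-shot flag. Real hardware primitives make the wake-up immediate.
        self._set = True

    def wait(self):
        # Busy-wait: trades CPU cycles for low wake-up latency, the same
        # trade-off the hardware primitives aim to avoid paying at all.
        while not self._set:
            pass

flag = SpinFlag()
result = []

def waiter():
    flag.wait()              # spins until the other thread notifies
    result.append("woken")

t = threading.Thread(target=waiter)
t.start()
flag.notify()                # "instantly" signal the state change
t.join()
```

The point of the hardware primitives studied in the thesis is to get this immediacy without burning a functional unit on the spin loop, so the waiting thread's SMT sibling keeps its execution resources.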