Hardware Acceleration

The goal of our project is to assess the benefits of hardware acceleration using the Cell Broadband Engine (CellBE) in an MPI environment. To create such a hybrid environment, we use the Cell Messaging Layer, a communication library for the Cell Broadband Engine, which many people recognize as the PlayStation 3's microprocessor.

To evaluate the benefits of hardware acceleration, we chose a large benchmark, AMG, an algebraic multigrid linear system solver for unstructured mesh physics packages. The benchmark is written in C and was developed solely to evaluate the parallelism of systems that use MPI and/or OpenMP. Our goal is to port the AMG benchmark to a CellBE/MPI hybrid environment. First, we profile the benchmark with mpiP and gprof to identify its main performance hotspot(s). Next, we recode the hotspot(s) as a kernel on the accelerator using the Cell Messaging Layer libraries in addition to the Cell SDK, and then add DMA-based data movement. Finally, we compare the performance before and after for different numbers of nodes.
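
The DMA-based data movement step can be pictured with a small sketch. It is only an illustration under stated assumptions: the kernel is a hypothetical axpy-style loop (not a hotspot identified in AMG), `memcpy` stands in for the DMA get/put that an SPE would issue through the Cell SDK, and the 16 KB chunk size is assumed to match the largest single DMA transfer on the Cell BE.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: 16 KB is the largest single DMA transfer on the Cell BE. */
#define CHUNK_BYTES 16384
#define CHUNK_ELEMS (CHUNK_BYTES / sizeof(double))

/* Hypothetical kernel: y[i] = a * x[i] + y[i], processed chunk by chunk.
 * memcpy stands in for the DMA transfers between main memory and the
 * SPE local store; the static buffers stand in for the local store. */
void axpy_tiled(size_t n, double a, const double *x, double *y)
{
    static double local_x[CHUNK_ELEMS];
    static double local_y[CHUNK_ELEMS];

    assert(x != NULL && y != NULL);

    for (size_t start = 0; start < n; start += CHUNK_ELEMS) {
        /* The last chunk may be shorter than CHUNK_ELEMS. */
        size_t count = n - start;
        if (count > CHUNK_ELEMS)
            count = CHUNK_ELEMS;

        /* "DMA get": bring one chunk of each operand into local buffers. */
        memcpy(local_x, x + start, count * sizeof(double));
        memcpy(local_y, y + start, count * sizeof(double));

        /* Compute only on local data. */
        for (size_t i = 0; i < count; i++)
            local_y[i] = a * local_x[i] + local_y[i];

        /* "DMA put": write the result chunk back to main memory. */
        memcpy(y + start, local_y, count * sizeof(double));
    }
}
```

A real SPE kernel would also need to respect DMA alignment rules and would likely overlap transfers with computation (double buffering); both are left out here to keep the structure visible.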

The challenges we expect in this project are:
•    Understand the AMG benchmark
•    Understand the Cell Messaging Layer
•    Identify the main hotspot(s) in terms of performance in computation and communication
•    Incorporate the benchmark's main hotspot(s) as a kernel
•    Compare the performance before and after the integration
•    Optimize the benchmark's main hotspot(s) for the accelerators
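
For the comparison item in the list above, one common way to report the result is speedup, together with parallel efficiency when the node count changes. The sketch below is only an illustration: the function names and inputs are hypothetical, and the timings would come from whatever timer or profiler is used for the runs.

```c
#include <assert.h>

/* Speedup of the accelerated run over the unmodified run,
 * both measured on the same number of nodes. */
double speedup(double t_before, double t_after)
{
    assert(t_before > 0.0 && t_after > 0.0);
    return t_before / t_after;
}

/* Parallel efficiency on `nodes` nodes relative to a single-node time:
 * 1.0 means the run scales perfectly with the node count. */
double efficiency(double t_one_node, int nodes, double t_nodes)
{
    assert(t_one_node > 0.0 && nodes > 0 && t_nodes > 0.0);
    return t_one_node / (nodes * t_nodes);
}
```

For example, a run that drops from 12 s to 3 s has a speedup of 4, and a 12 s single-node run that takes 4 s on 4 nodes has an efficiency of 0.75.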


In order to develop a solution using the benefits of hardware acceleration for our
benchmark, AMG, we need to first understand how the benchmark works. Before we can
understand the code, we will first need to successfully compile the benchmark for
the henry2 system and install profiling layers. Then, we must profile the benchmark
for different input sizes and node counts to determine where the computational
hotspots are. Depending on the potential for parallelism of these areas, these
functions may give us the largest benefit from parallelizing on the IBM Cell
processor. The last step of understanding the benchmark is to study the source code
of the functions in the hotspots.

Once we understand the benchmark, and more importantly the functions that we will be
parallelizing, we will need to develop a plan to split the work load of these
functions among the available nodes. This will involve deciding what data we need to
share, what data will be private to each node, and how to split the loops and/or
conditionals that do the computation.
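
One simple way to split a loop among nodes is a contiguous block partition, sketched below. This is an assumption about how the split might be done, not the plan itself; `range_t` and `block_range` are hypothetical names, and the rank and node count would come from MPI in the real code.

```c
#include <assert.h>

/* Half-open range [begin, end) of loop iterations owned by one rank. */
typedef struct {
    long begin;
    long end;
} range_t;

/* Split n iterations over nranks as evenly as possible:
 * the first (n % nranks) ranks each receive one extra iteration. */
range_t block_range(long n, int nranks, int rank)
{
    assert(n >= 0 && nranks > 0 && rank >= 0 && rank < nranks);

    long base = n / nranks;   /* iterations every rank gets */
    long extra = n % nranks;  /* leftover iterations to hand out */

    range_t r;
    r.begin = rank * base + (rank < extra ? rank : extra);
    r.end = r.begin + base + (rank < extra ? 1 : 0);
    return r;
}
```

Each rank would then run its loop over `[r.begin, r.end)`. Data indexed only inside that range can stay private to the node, while anything read across a range boundary is a candidate for the shared (communicated) data.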

Once we have a plan of what and how to parallelize, we will need to become familiar
with the Cell Messaging Layer before implementation. Afterwards we can begin
implementing our plan.

With our initial plan implemented, we can begin testing for new hotspots,
communication lag, bottlenecks, and load imbalances. With this new data, we will
be able to decide which areas we think we can feasibly improve and develop a
strategy to improve these areas.
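
Load imbalance can be summarized with a single number, for instance the ratio of the slowest rank's time to the mean time. The sketch below assumes per-rank timings have already been gathered into an array (for example from the profiler output); the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Load-imbalance factor: maximum per-rank time divided by the mean
 * per-rank time. 1.0 means perfectly balanced; larger values mean
 * some ranks sit idle while the slowest one finishes. */
double imbalance_factor(const double *seconds, int nranks)
{
    assert(seconds != NULL && nranks > 0);

    double max = seconds[0];
    double sum = 0.0;
    for (int i = 0; i < nranks; i++) {
        if (seconds[i] > max)
            max = seconds[i];
        sum += seconds[i];
    }

    double mean = sum / nranks;
    assert(mean > 0.0);
    return max / mean;
}
```

With timings of 1, 1, 1, and 5 seconds on four ranks, the factor is 2.5, which would point to one rank holding far more work than the others.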

Once all changes have been made, we can begin collecting data from our final
implementation. We can compare this data with the unmodified data and the data from
our initial implementation to determine what kind of results we were able to achieve.



T. 1: Understand the problem in the AMG benchmark
T. 2: Successfully compile the benchmark for henry2 system
T. 3: Profile the benchmark for different input sizes and numbers of nodes to determine the hotspots
T. 4: Study the source code of the functions in the hotspots
T. 5: Prepare a plan to split the workload of these functions among the available nodes
a. Decide what data we need to share
b. Decide what data is private to each node
c. Decide how to split the loops and conditionals
T. 6: Get familiar with the Cell Messaging Layer
T. 7: Start coding (MPI calls, DMAs, calculations)
T. 8: Start testing for new hotspots (communication lag, bottlenecks, and load imbalances)
T. 9: Collect data for the final implementation
T. 10: Compare this with unmodified data and the data from initial implementation
T. 11: Maintain the documentation and webpage; finalize and prepare a report




AMG benchmark page

OpenMP and Cell

Cell in Scientific Computing

Cell Wiki Page

Cell BE documentation page

A parallel algorithm for algebraic multigrid - paper

Cell Messaging Layer

HW4 Project Page

A. Kejariwal and C. Cascaval. Parallelization Spectroscopy: Analysis of Thread-level Parallelism in HPC Programs. In The 2nd Workshop on Parallel Execution of Sequential Programs on Multi-core Architectures (PESPMA) 2009, pages 30-39, June 2009.