PARALLEL SYSTEMS PROJECT

CSC 548 Fall 2009, under Dr. Frank Mueller

INDEX:
Result Summary
Contributions
Detailed Analysis
Approach Towards Solution
Initial Results
Roadmap
Timeline and Status
Future Work
Downloads
References
Team Member Name           Email ID
Kaustubh Prabhu            ksprabhu AT ncsu DOT edu
Narayanan Subramanian      nsubram AT ncsu DOT edu
Ritesh Anand               ranand AT ncsu DOT edu

Problem Description:



Assessing the benefits of the CUDA accelerator on the AMG2006 benchmark of ASC Sequoia


AMG2006 is a parallel algebraic multi-grid solver for linear systems arising from problems on unstructured grids [1].
The benchmark is an SPMD C code that uses MPI [1] extensively along with OpenMP constructs. A performance study of OpenMP for this application was done some time ago, but a full exploitation of threading remains to be done [2]. Because of this SPMD paradigm and its use of data parallelism through logical sub-problem data grids, we see substantial scope for applying the NVIDIA CUDA accelerator, whose parallelism is organized along similar lines.
NVIDIA CUDA is a general-purpose parallel programming architecture that makes the parallel processing capabilities of a class of GPUs (mostly from NVIDIA) available for general-purpose computation rather than restricting them to graphics processing [3].
To the best of our knowledge, based on an extensive web search, no CUDA implementation of the AMG2006 benchmark exists at this point in time.


Result Summary



TOP

The following table presents a head-to-head comparison of execution time with and without CUDA kernel execution. The function concerned is “hypre_BoomerAMGRelax”.

[Table: execution-time comparison of hypre_BoomerAMGRelax with and without CUDA kernel execution]

Conclusion


We have observed an improvement in function execution time when the AMG2006 application was ported to CUDA. Due to device memory constraints we could not exploit this at larger problem sizes, so the observed speedup is limited to smaller problem sizes and relatively few nodes.



This work assesses the speedup obtained by employing the parallel processing capabilities of the general-purpose graphics processing unit in NVIDIA graphics cards, via CUDA, on AMG2006, an MPI C parallel algebraic multi-grid solver for linear systems arising from problems on unstructured grids.

In our analysis we also found that AMG2006 is memory-access bound, performing only about 1-2 computations per memory access, so memory-access speed has a large impact on performance. The same observation is made in the benchmark's readme. Since the hardware accelerator in the GeForce GTX 280 is tuned towards data-parallel, compute-intensive tasks such as graphics rendering, we did not expect it to extract much speedup on this benchmark; in fact, we expected only a fractional speedup.

The main reasons are:
   1. Huge data structures (matrices, vectors, etc.) must be operated on by the kernel.
   2. Because of the size of these data structures we cannot leverage spatial locality optimally, as the supported cached constant and texture memory is rather limited.
   3. The access pattern of the data structures is very irregular. Array indices are often stored as values in another array, so the access strides are very unpredictable. For example, one of the array accesses inside the kernel has the form a[b[i]], where a and b are both device arrays (see the sketch after this list).
   4. We therefore fall back on global memory accesses, which are expensive; on top of that, the computation-to-I/O ratio is very unfavorable.
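To make reason 3 concrete, the following is a minimal CUDA sketch (the kernel and array names are hypothetical, not taken from AMG2006) of this kind of indirect access. Because b[i] can point anywhere in a, neighboring threads read scattered global-memory locations, so the accesses cannot be coalesced and the small constant/texture caches are of little help:

    /* Hypothetical kernel illustrating the a[b[i]] gather pattern. */
    __global__ void gather_kernel(const double *a, const int *b,
                                  double *out, int n)
    {
        /* one thread per output element */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = a[b[i]];   /* stride depends on the data in b */
    }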
However, keeping these limitations in mind, we have achieved:

   1. Integration of the CUDA kernel with the AMG2006 application
   2. A speedup in the execution time of the function containing the computational kernel
   3. A method that can be applied to port any computational kernel to CUDA

For this, we used the GeForce GTX 280 NVIDIA graphics cards available on the osXX machines, whose hosts have AMD Athlon XP processors. We have been able to completely port the benchmark application to CUDA, and one of the identified compute-intensive kernels has been successfully ported. We currently observe a speedup in function execution time (measured with gprof).


Contributions

TOP



Since our last reported progress, in which we had identified the main performance bottlenecks that could be optimized, the following are the main issues we have solved:

   1. Compiling AMG2006 on the CUDA cluster
   2. Porting the whole application to CUDA
   3. Porting compute-intensive performance bottlenecks to CUDA kernels
   4. Running the AMG2006 application with CUDA (limited semantics)
   5. Optimizing the CUDA code for smaller problem sizes


Detailed Analysis

TOP



We began with a performance analysis of AMG2006 on a regular AMD Athlon XP cluster, from which we identified the performance bottlenecks and hence the targets for optimization. This is captured in our previous reports and on our project web page.

Below we describe the major tasks undertaken, the issues we faced, and how we solved them.
   

1 Compiling:

This was our first road block. After two weeks of almost continuous effort we were able to successfully compile the AMG2006 application with CUDA.
Problems Faced:
Compiling the AMG2006 application involves multiple Makefiles, and integrating the application-specific Makefiles with the CUDA Makefiles was the main blockade. Because the build creates a custom library file, the Makefile integration was difficult. In addition, the CUDA toolchain does not accept relative paths for included library files.
What worked:

The nvcc compiler provided by CUDA; nvcc allowed us to create object files for the custom libraries.
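As a rough illustration (the file and library names here are made up, not those of the actual AMG2006 Makefiles), the build boils down to compiling the .cu files with nvcc into ordinary object files, archiving them into the custom library, and linking the final binary against the CUDA runtime:

    # compile the CUDA kernel file with nvcc into a normal object file
    nvcc -c relax_kernel.cu -o relax_kernel.o

    # compile the C sources with the MPI compiler wrapper as before
    mpicc -c par_relax.c -o par_relax.o

    # archive both into the custom library
    ar cr libparcsr_ls.a par_relax.o relax_kernel.o

    # link the application against the library and the CUDA runtime
    mpicc -o amg2006 amg2006.o libparcsr_ls.a -L/usr/local/cuda/lib -lcudart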

2 Porting the whole application:
Our approach to porting to CUDA was:
(1) identifying kernel regions (code sections to be executed on the GPU),
(2) extracting the kernel regions and transforming them into CUDA kernel functions, and
(3) analyzing the shared data that will be accessed by the GPU and inserting the necessary memory transfer calls.
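The sketch below (the array names and the relaxation update are illustrative placeholders, not the actual hypre_BoomerAMGRelax code) shows the shape that steps (2) and (3) take for a single ported loop:

    /* (2) the extracted loop body, now a CUDA kernel */
    __global__ void relax_kernel(double *u, const double *rhs, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            u[i] += rhs[i];            /* placeholder for the real update */
    }

    /* (3) host-side wrapper replacing the original CPU loop */
    void relax_on_gpu(double *u, const double *rhs, int n)
    {
        double *d_u, *d_rhs;
        size_t bytes = n * sizeof(double);

        cudaMalloc((void **)&d_u, bytes);      /* device copies of */
        cudaMalloc((void **)&d_rhs, bytes);    /* the shared data  */
        cudaMemcpy(d_u, u, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_rhs, rhs, bytes, cudaMemcpyHostToDevice);

        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        relax_kernel<<<blocks, threads>>>(d_u, d_rhs, n);

        cudaMemcpy(u, d_u, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_u);
        cudaFree(d_rhs);
    }

Note that the two cudaMemcpy transfers surrounding the launch are exactly the copying overhead discussed in the optimization section below.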

Problems Faced:
Identifying computational kernels that could be successfully ported to CUDA. Many of the computational kernels involve custom library calls (such as HYPRE_CTAlloc) that cannot be ported to CUDA. Some of the hot spots had to be ignored because of this device-specific constraint.
What worked:
Elimination: any loop with multiple entry or exit points was rejected, as was any compute-intensive code containing a custom library call. This elimination method proved useful.

3 Porting Kernels:
Problems Faced:
A) The primary concern was memory constraints when creating device-specific data structures. As the problem size of the application increased, the CUDA device memory became insufficient.
B) The application accesses arrays in non-uniform patterns, with heavy use of the array[array[offset]] pattern, which limits memory-access efficiency.
C) The sizes of the device-specific arrays are hard to determine, and the complex structure of the code made it difficult to recover them. This remains the major obstacle.
What worked:

A) We limited our scope to optimizing the performance of AMG2006 for reduced problem sizes.
B) Over-provisioning memory is the simple approach we used in this code (see the sketch below). Another possible approach is to instrument the code sections where the arrays are populated and record their sizes, but this would require changing the application code.
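A minimal sketch of the over-provisioning workaround (MAX_ELEMS is an assumed conservative bound we pick, not a value taken from the benchmark):

    /* The exact array length is hard to recover from the surrounding
       code, so allocate a conservative upper bound once and reuse it;
       kernels then touch only the first n entries. */
    #define MAX_ELEMS (1 << 22)   /* assumed upper bound */

    static double *d_buf = NULL;

    double *device_buffer(void)
    {
        if (d_buf == NULL)
            cudaMalloc((void **)&d_buf, MAX_ELEMS * sizeof(double));
        return d_buf;
    }

The obvious cost is wasted device memory, which is one reason larger problem sizes no longer fit on the card.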

4 Optimization:

By optimization we mean the improvement in the execution time of the compute-intensive kernel, and the corresponding improvement in the execution time of the enclosing function. We have observed that the function execution time drops drastically: CUDA computes the kernel much faster, but each call also involves copying huge data structures.
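In addition to gprof, the kernel itself can be timed on the device with CUDA events; a minimal sketch (a code fragment reusing the hypothetical relax_kernel, launch configuration, and device arrays from the porting sketch above, and assuming <stdio.h> is included):

    cudaEvent_t start, stop;
    float ms;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    relax_kernel<<<blocks, threads>>>(d_u, d_rhs, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);        /* wait until the kernel finishes */

    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);

Recording the events around the cudaMemcpy calls as well captures the transfer overhead, which is what limits the end-to-end speedup.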

Problems Faced:
The memory constraints on the device-specific data structures did not allow us to exploit the CUDA kernels for large problem sizes.
We were able to solve this issue only to a certain extent, given the scope of the project.



Approach Towards Solution:

TOP



The benchmark contains several applications, such as Laplace, 27pt, Stencil, Linear, and Solver. We observed that, under weak scaling, there are hotspots with scope for computational improvement. Results with problem scaling on a wide range of processing-element counts are available in the “amg2006.readme” [1]; we compare those trends with what we observe when employing CUDA.
Through our initial analysis with the GNU profiling tool gprof [4], we observe several potential performance bottlenecks, as a few compute-intensive functions consume much of the total application time.
The optimization is done by porting those functions (excluding library and initialization routines) to CUDA.
For computing the 3D Laplace problem on a cube and the 27-point stencil problem, we compiled a list of functions that consistently become the bottleneck of execution. A code walkthrough showed that they consist of simple C code with iterative structures acting on huge data arrays. We believe such code blocks can be sped up using GPU threads, which is likely to result in better performance.


Initial Results:

TOP



After profiling the AMG2006 application over successive runs, we find that the following functions could be performance bottlenecks:


Function                               Frequency of occurrence
hypre_BoomerAMGBuildCoarseOperator*    10
hypre_BoomerAMGCoarsen*                9
hypre_BoomerAMGBuildInterp             7
hypre_BoomerAMGRelax*                  6
hypre_BigBinarySearch^                 3
hypre_CSRMatrixMatvec                  2

      * Functions that consistently show up as bottlenecks, identified by repeated runs at considerable problem sizes.

      ^ Function removed from the list of potential bottlenecks: its total time is large only because it is called very many times, not because individual calls are expensive.

Some other result logs can be seen here.


Some Loop analysis results:


hypre_BoomerAMGBuildCoarseOperator (AMG2006/parcsr_ls/par_rap.c)

Line No.    Type of Loop   Nested   Level of nesting   Remarks
541-741     for            Yes      5                  Type1
768-1021    for            Yes      5                  Type1
1148-1376   for            Yes      5                  Type1
1427-1747   for            Yes      5                  Type1

hypre_BoomerAMGRelax (AMG2006/parcsr_ls/par_relax.c)

Line No.    Type of Loop   Nested   Level of nesting   Remarks
189-211     for            Yes      2                  Type1
223-247     for            Yes      2                  Type1
303-324     for            Yes      2                  Type1
336-359     for            Yes      2                  Type1
425-464     for            Yes      3                  Type2
470-491     for            Yes      2                  Type1
511-552     for            Yes      3                  Type2
607-655     for            Yes      3                  Type2

hypre_BoomerAMGCoarsen (AMG2006/parcsr_ls/par_coarsen.c)

Line No.    Type of Loop   Nested   Level of nesting   Remarks
628-827     for            Yes      3                  Single bottleneck

Roadmap:

TOP



1.) Profile the application and gather statistics; observe trends and the scope for optimization.
2.) Run the application with different problem sizes, different processor topologies, and different applications.
3.) List the functions that consume the most computational time, omitting library functions and set-up related functions.
4.) Study those functions and port them independently to CUDA.
5.) Integrate the ported functions with the application.
6.) Run, profile, and compare the results to observe trends and draw conclusions.


Work distribution and timeline:

TOP



Tasks                                              Deadline   Assignment                           Status
1.) Profiling and bottleneck identification        11/10/09   All                                  Complete
2.) Initial porting to CUDA with minimal changes   11/13/09   All                                  Complete
3.) Application optimization design                11/15/09   Applications divided among members   Complete
4.) Optimized application implementation           11/22/09   Functions divided among members      Complete
5.) Modified application profiling                 11/24/09   All                                  Complete
6.) Reiteration if needed                          11/26/09   All                                  Complete
7.) Drawing results and conclusions                11/27/09   All                                  Complete


Future Work:

TOP



Future work involves porting more of the computational kernels in the AMG2006 application that were identified in our earlier report. A scalable approach to sizing and accessing the device-specific data structures is also required.


Downloads:

TOP



Code Related:
   Modified Source Code tar
   Makefile
   Instructions to run
Reports:
   Report 1
   Report 2
   Final Report


References:

TOP



1.) AMG2006 readme, available in the AMG C source tar
2.) ASC Sequoia AMG benchmark summary
3.) What is NVIDIA CUDA
4.) GNU profiler "gprof"