Group Member | Email | Office |
Fei Meng | fmeng AT ncsu.edu | EBII 3226 |
Abhishek Dhanotia | adhanot AT ncsu.edu | Partners I 2300 |
Fang Liu | fliu3 AT ncsu.edu | Partners I 2300 |
Cell BE and Nvidia GPUs are parallel architectures that can be exploited to accelerate parallel applications. In this project, we choose the Nvidia GPU as the parallel computing platform. The parallelism is two-fold: one is thread-level parallelism, in which the host offloads the kernel computation to the available hardware thread blocks on a single GPU device; the other is process-level parallelism, in which the entire computation is partitioned among multiple processes/hosts and each GPU device handles its assigned part of the computation.
We choose three NAS Parallel Benchmarks (NPB 3.3, MPI version): CG, DT and IS, with the three members Abhishek Dhanotia, Fang Liu and Fei Meng working on them respectively. Using mpiP and gprof profiling as well as source code inspection, we identified the major computation and communication hotspots that can be exploited for parallelism. Below is the characterization of the three benchmarks: IS, CG and DT.
IS stands for Integer Sort: keys are generated by a sequential key-generation algorithm and then sorted in parallel. CG stands for Conjugate Gradient: it approximates the smallest eigenvalue of a large sparse, symmetric, positive definite matrix using inverse iteration. DT stands for Data Traffic: it exercises data-flow communication patterns.
The dominant computations of the IS benchmark are the rank calculation and the random number generation on each node. Each node sorts its assigned partition and then communicates with the other nodes to obtain the final sorted subset; finally, every key gets its rank.
Based on the above analysis, three loops are optimized using CUDA kernel parallelism.
The dominant computation of the CG benchmark is the solution of a linear system, which involves matrix-vector multiplications. For each row of the sparse matrix, the multiplication is repeated over each non-zero element. This code is implemented in the conj_grad function in the cg.f file.
Using gprof profiling, we see that 92.3% of the program time is spent in this function when the program is run on 4 processors with the Class A input size. Similarly, 83.7% of the time is spent in this function when running on 16 processors.
The conj_grad function is therefore the best place to invoke a CUDA kernel to extract parallelism and improve computation performance.
The communication data flow graph (DFG) of the processing nodes is built at the initialization phase. Each source node generates an array of random numbers and sends the array to the nodes attached to it. Each comparator node receives arrays from its upstream nodes, combines them into one array, and sends the new array to the nodes attached to it. Each sink node performs a reduction on the received array and sends the result to the root process 0. The dominant computation is the random number generator "RandomFeatures", which takes 100% and 50% of the execution time for Class A with 21 processes and graph types BH and WH respectively. For Class B with 43 processes, it takes 27.27% and 42.86% of the program time for BH and WH respectively.
Student Name | Benchmark | Language | Computation hotspot function |
Abhishek Dhanotia | CG | Fortran | conj_grad |
Fang Liu | DT | C | RandomFeatures |
Fei Meng | IS | C | rank |