Group Member | Email | Office |
Fei Meng | fmeng AT ncsu.edu | EBII 3226 |
Abhishek Dhanotia | adhanot AT ncsu.edu | Partners I 2300 |
Fang Liu | fliu3 AT ncsu.edu | Partners I 2300 |
Cell BE and Nvidia GPUs are parallel architectures that can be exploited to accelerate parallel applications. In this project, we choose the Nvidia GPU as the parallel computing platform. The parallelism is two-fold: one is thread-level parallelism, in which the host offloads the kernel computation to the available hardware thread blocks on a single GPU device; the other is process-level parallelism, in which the entire computation is partitioned among multiple processes/hosts and each GPU device handles its assigned part of the computation.
We choose three NAS Parallel Benchmarks (NPB 3.3, MPI version): CG, DT and IS, with the three members Abhishek Dhanotia, Fang Liu and Fei Meng working on them respectively. Using mpiP and gprof profiling as well as source code inspection, we identified the major computation and communication hotspots that can be exploited for parallelism. Below is the characterization of the three benchmarks: IS, CG and DT.
IS stands for Integer Sort: keys are generated by a sequential key-generation algorithm and then sorted in parallel. CG stands for Conjugate Gradient: it approximates the smallest eigenvalue of a large sparse, symmetric, positive definite matrix using inverse iteration. DT stands for Data Traffic: it exercises data-flow communication patterns.
The dominant computations of the IS benchmark are the rank calculation and the random number generation on each node. Each node sorts its assigned partition and then communicates with the other nodes to obtain the final sorted subset; finally, every key gets its rank.
Based on the above analysis, three loops are optimized using CUDA kernel parallelism.
The dominant computation of the CG benchmark is the solution of a linear system, which involves matrix-vector multiplications. For each row of the sparse matrix, the multiplication is repeated over each non-zero element. This code is implemented in the conj_grad function in the cg.f file.
Using gprof profiling, we see that 92.3% of the program time is spent in this function when the program is run on 4 processors with the Class A input size. Similarly, 83.7% of the time is spent in this function when running on 16 processors.
The conj_grad function is therefore the best place to invoke a CUDA kernel to extract parallelism and improve computation performance.
The communication data flow graph (DFG) of the processing nodes is built at the initialization phase. Each source node generates an array of random numbers and sends the array to the nodes attached to it. Each comparator node receives arrays from its upstream nodes, combines them into one array, and sends the new array to the nodes attached to it. Each sink node performs a reduction on the received array and sends the result to the root process 0. The dominant computation is the random number generator "RandomFeatures", which takes 100% and 50% of the execution time for Class A with 21 processes and graph types BH and WH respectively. For Class B with 43 processes, it takes 27.27% and 42.86% of the program time for BH and WH respectively.
Student Name | Benchmark | Language | Computation hotspot function |
Abhishek Dhanotia | CG | Fortran | conj_grad |
Fang Liu | DT | C | RandomFeatures |
Fei Meng | IS | C | rank |