Problem Description and Solution approach
We have selected the following three NAS PB benchmarks to optimize in CUDA.
- MG(MultiGrid) : Approximate the solution to a three-dimensional discrete Poisson equation using the V-cycle multigrid method[6]
- FT(Fast Fourier Transform) : Solve a three-dimensional partial differential equation (PDE) using the fast Fourier transform (FFT)[6]
- IS(Integer Sort) : Sort small integers using the bucket sort.[6]
We have profiled these benchmarks using gprof[5]. Below are the bottlenecks we could find for each of these benchmarks
- MG :- Subroutine resid which computes the residual execute for 44.07% of time though the number of calls being made to that call is only 170. This function calculates the residual in a loop. So making this loop calculation to execute in parallel we hope to achieve a better performance.[7]
- FT:- Subroutine fftz2 shows the maximum % execution time. But there are 230912 calls made to this function. So we can conclude this high execution time may not be because this subroutine computation intensive. The subroutine which shows next highest execution time is evolve . This also has a loop calculation which can be parallelized.[7]
- IS:- Two subroutines rank and randlc shows maximum execution time. Here randlc is a random number generator which is invoked 8388631 times. But at the same time rank is invoked only 11 times and it gives a similar exection time as randlc. We plan to optimize performance of this function by parallelizing the loop calculation in this function.[7]