Milestones Achieved

1. Identifying the exact location of the bottleneck within the functions identified using gprof.

  • MG (MultiGrid) : The function resid, which computes the residual (using the formula r = v - Au), contains a three-level nested loop with two further inner loops. The time per call for this function is 0.07 s and it is invoked 170 times, resulting in a total execution time of 12.43 s, which accounts for 42.52% of the total execution time. This nested loop is the one that needs to be accelerated; a simplified sketch of its structure is given after this list.
    The psinv function also performs a stencil computation inside a nested loop. It takes 0.03 s per call and is invoked 168 times, which accounts for an execution time of 5.74 s. Its nested loop can also be accelerated.

  • FT (Fast Fourier Transform) : The computational bottleneck was identified using the gprof tool. Refer to the project report for the analysis of class B with 4 processors.
    The subroutine fftz2 accounts for around 54.52% of the time, but it is called 923648 times and each call is computationally light. Further analysis shows that the subroutines cffts1 and cffts2 take around 6.5 seconds and are considerably more computation-intensive.

  • IS (Integer Sort) : This benchmark sorts a large key set by ranking the keys according to their value. Initially the entire set is distributed among all the processors. A bucket-sort scheme is used for ranking: the keys on each processor are first placed into different buffers (buckets) according to their value, and the rank of a key follows from its position. Then the first NUMKEYS keys (NUMKEYS being the maximum number of keys per processor) from the lowest buckets are moved to the 1st processor, the next NUMKEYS keys to the 2nd processor, and so on, using MPI communication calls, after which the next iteration is performed. There are 10 buckets and 10 iterations, which ensures that the key set is properly sorted. Different levels of verification (partial verification and full verification) are performed to confirm a sorted key set. This is not a computation-intensive benchmark.
    The only computation involved is the generation of keys, which might be parallelized. A minimal sketch of the ranking idea is given after this list.
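
  As a concrete illustration of the MG hotspot described above, the following is a much-simplified Fortran sketch of a resid-style loop nest, with the kind of accelerator region directive discussed under milestone 2 placed around it. It is not the actual NPB source: only the centre point and its six face neighbours appear here (the real resid builds up a fuller stencil in the two inner loops mentioned above), and the array bounds and coefficients are placeholders.

    ! Much-simplified, illustrative sketch of the r = v - Au loop nest in resid.
    program resid_sketch
      implicit none
      integer, parameter :: n1 = 34, n2 = 34, n3 = 34   ! placeholder grid size
      double precision :: u(n1,n2,n3), v(n1,n2,n3), r(n1,n2,n3)
      double precision :: a0, a1
      integer :: i1, i2, i3

      u = 1.0d0
      v = 2.0d0
      a0 = -1.0d0          ! placeholder stencil coefficients
      a1 = 0.25d0

      ! The compiler turns this region into a GPU kernel and copies
      ! u, v and r between host and device memory.
      !$acc region
      do i3 = 2, n3-1
        do i2 = 2, n2-1
          do i1 = 2, n1-1
            r(i1,i2,i3) = v(i1,i2,i3)                            &
                 - a0 *   u(i1,i2,i3)                            &
                 - a1 * ( u(i1-1,i2,i3) + u(i1+1,i2,i3)          &
                        + u(i1,i2-1,i3) + u(i1,i2+1,i3)          &
                        + u(i1,i2,i3-1) + u(i1,i2,i3+1) )
          end do
        end do
      end do
      !$acc end region

      print *, r(2,2,2)
    end program resid_sketch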

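  For IS, the ranking idea can be illustrated with the minimal single-process Fortran sketch below: count how many keys carry each value, prefix-sum the counts, and read each key's rank from the running count for its value. The actual benchmark is written in C, distributes the keys over MPI processes through the buckets, and verifies the result; all of that is omitted here, and nkeys and maxkey are illustrative values.

    ! Minimal single-process sketch of counting-based key ranking.
    program rank_sketch
      implicit none
      integer, parameter :: nkeys = 16, maxkey = 8
      integer :: keys(nkeys), key_count(0:maxkey-1), key_rank(nkeys)
      integer :: i, v
      real    :: rnd(nkeys)

      call random_number(rnd)
      keys = int(rnd * maxkey)          ! illustrative keys in 0 .. maxkey-1

      ! Count the occurrences of each key value.
      key_count = 0
      do i = 1, nkeys
        key_count(keys(i)) = key_count(keys(i)) + 1
      end do

      ! Prefix-sum the counts: smaller key values get smaller ranks.
      do v = 1, maxkey - 1
        key_count(v) = key_count(v) + key_count(v-1)
      end do

      ! A key's rank is the running count for its value.
      do i = 1, nkeys
        key_rank(i) = key_count(keys(i))
        key_count(keys(i)) = key_count(keys(i)) - 1
      end do

      do i = 1, nkeys
        print *, 'key', keys(i), 'rank', key_rank(i)
      end do
    end program rank_sketch
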
2. Writing the CUDA kernel. Deciding among Fortran, C, a Fortran-to-CUDA translator, and the PGI Fortran CUDA compiler. Below are the options we tried.

  • F2C-ACC : The specification says that this tool converts Fortran code to CUDA/C. When we attempted the conversion to CUDA we got two kinds of errors:
    1. 'Language construct not supported' for keywords such as common.
    2. 'Type not supported' for keywords such as double precision and external.
  • f2c : This tool converts Fortran code to C. We were not able to use it successfully because it is not specified how to provide the included files; when we executed f2c it reported errors about all the MPI calls being undefined.
  • PGI Accelerator : The PGI 9.0 release includes the PGI Accelerator Fortran and C99 compilers supporting x64+NVIDIA Linux systems; the PGF95 and PGCC accelerator compilers are supported on all Intel and AMD x64 processor-based systems with CUDA-enabled NVIDIA GPUs. They provide directives, similar to OpenMP, that can be used for acceleration. The PGI Accelerator compilers automatically analyze whole-program structure and data, split portions of the application between the x64 CPU and the GPU as specified by user directives, and define and generate an optimized mapping of loops onto the parallel cores, hardware threading capabilities, and SIMD vector capabilities of modern GPUs. We were able to compile our codes using PGF95 and PGCC, so we have decided to go ahead with the PGI compiler; a minimal sketch of the directive placement is given below.
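
  As an illustration of the directive placement (not code from the benchmarks), the sketch below applies a PGI Accelerator region to a simple loop, alongside the analogous OpenMP directive. Compiling with something like pgf95 -ta=nvidia -Minfo=accel asks the compiler to offload the region to the GPU and report what it generated.

    ! Illustrative sketch of OpenMP-style vs PGI Accelerator-style directives.
    program directive_sketch
      implicit none
      integer, parameter :: n = 100000
      real :: x(n), y(n)
      integer :: i

      x = 1.0
      y = 0.0

      ! OpenMP: the directive spreads the loop across CPU threads.
      !$omp parallel do
      do i = 1, n
        y(i) = y(i) + 2.0 * x(i)
      end do

      ! PGI Accelerator: the region is compiled into a GPU kernel and the
      ! arrays it touches are copied to and from device memory.
      !$acc region
      do i = 1, n
        y(i) = y(i) + 2.0 * x(i)
      end do
      !$acc end region

      print *, y(1)
    end program directive_sketch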

Current Status

  • MG: We added the PGI Accelerator compiler directive around the nested loop in the resid function. It showed an improvement in this particular function; refer to the diagrams in the report file. The problem we faced is that once the compiler directive was added, the total execution time increased by a factor of 3. We are now analyzing this problem.
  • FT: From the gprof output we understood that the computation is concentrated in the subroutines called during the FFT computation: cffts1, cffts2 and cffts3, each called for a different layout. A code walkthrough showed that all of these functions call the subroutine cfftz, which in turn calls the subroutine fftz2; this accounts for the large number of calls to cfftz and fftz2. One option for accelerating the code is to write these nested subroutines inline in the calling functions and then apply the accelerator directives (see the sketch after this list). As with MG, we tried adding directives for FT to confirm that our analysis was correct, and repeated the experiment today, but we found that the directives are not taking effect.
  • IS: We are trying to figure out how to optimize the rank function, which is not very computation-intensive. We are also looking at optimizing the random number generator function.
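
  The inlining option for FT mentioned above can be illustrated with the sketch below; it is not the FT code itself, and scale_column is a hypothetical stand-in for the nested FFT helpers. The point is purely structural: once the helper's body is written out inside the calling loop, a single accelerator region can cover the whole nest.

    ! Illustrative sketch of inlining a helper call before applying a directive.
    module ft_sketch
    contains
      subroutine scale_column(x, n, s)
        implicit none
        integer, intent(in) :: n
        double precision, intent(inout) :: x(n)
        double precision, intent(in) :: s
        integer :: i
        do i = 1, n
          x(i) = s * x(i)
        end do
      end subroutine scale_column
    end module ft_sketch

    program inline_sketch
      use ft_sketch
      implicit none
      integer, parameter :: n = 256, m = 128
      double precision :: a(n,m)
      integer :: i, k
      a = 1.0d0

      ! Before inlining: the call sits inside the loop we want to accelerate.
      do k = 1, m
        call scale_column(a(:,k), n, 2.0d0)
      end do

      ! After inlining: the helper's body replaces the call, so the whole
      ! nest can be placed inside one accelerator region.
      !$acc region
      do k = 1, m
        do i = 1, n
          a(i,k) = 2.0d0 * a(i,k)
        end do
      end do
      !$acc end region

      print *, a(1,1)
    end program inline_sketch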


Task Distribution

Task Owner             Task
Group                  Analysis of F2C-ACC and f2c, installation of the PGI compiler, environment settings, adding the compiler directive for acceleration, compilation using the PGI accelerator directives.
Allen Pradeep Xavier   Performance evaluation of MG, identifying the bottleneck function, performance evaluation after adding the compiler directive.
Anitta Jose            Performance evaluation of IS, identifying the bottleneck function, understanding the IS implementation.
Sreekanth Mavila       Performance evaluation of FT, identifying the bottleneck function, converting all the function calls in the hotspot function to inline functions.

Milestones for the coming weeks