Optimization Approach and Results
CG:
The conj_grad function is the bottleneck, as identified by gprof. Directives were applied to the nested loops in the function. There were issues in the array declarations that caused compile-time errors; once they were resolved, CG showed a substantial performance improvement. Please refer to the report for full performance results:
| Input Size/Processors | Old Code Performance (MOP/s) | New Code Performance (MOP/s) |
|---|---|---|
| B/1 | 110.74 | 150.22 |
| C/1 | 80.44 | 151.40 |
| C/2 | 82.52 | 106.72 |
MG:
The inner loops in the resid function were merged so that the directives could be applied. Since the routine is dominated by memory-copy operations on large arrays, we did not get as much speedup as expected. The performance results for class C with 8 processors are given in the report. We also tried a blocking-based approach and obtained better performance on 1 processor. We tried to apply this blocking approach in CUDA but were not successful. Please refer to the report for the performance results of both approaches.
| Method | Old Code Performance (MOP/s) | New Code Performance (MOP/s) |
|---|---|---|
| CUDA directives after merging the inner loops (Class C, 8 procs) | 1845.76 | 1912.45 |
| Blocking (Class B, 1 proc) | 650.926 | 717.89 |
FT:
fftz took the majority of the time, as reported by gprof. It contains a call to the fftz2 function; because CUDA kernels cannot call host functions directly, we expanded the function inline. The variables used in the hotspot are complex datatypes, so the directives alone could not convert the loop to CUDA code, and we had to write the CUDA kernels by hand. Refer to the report for more details and performance results. Since the cost came from the sheer number of invocations of the same function, we could not extract much performance from the GPGPU.
| Input Size/Processors | Old Code Performance (MOP/s) | New Code Performance (MOP/s) |
|---|---|---|
| A/1 | 533.72 | 531.39 |