Optimization Approach and Results
CG:
The conj_grad function is the bottleneck, as identified by gprof. Directives were applied to the nested loops in the function. There were issues in the array declarations that caused compile-time errors; once they were resolved, CG showed a substantial performance improvement. Please refer to the report for full performance results:
| Input Size/Processors | Old Code Performance (MOP/s) | New Code Performance (MOP/s) |
|---|---|---|
| B/1 | 110.74 | 150.22 |
| C/1 | 80.44 | 151.40 |
| C/2 | 82.52 | 106.72 |
MG:
The inner loops in the resid function were merged so that the directives could be applied. Since the routine is dominated by memory-copy operations on large arrays, we did not get as much speedup as expected. The performance results for class C with 8 processors are given in the report. We also tried a blocking-based approach and obtained better performance on 1 processor. We tried to apply this blocking approach in CUDA but were not successful. Please refer to the report for the performance results of both approaches.
| Method | Old Code Performance (MOP/s) | New Code Performance (MOP/s) |
|---|---|---|
| CUDA directives after merging the inner loops (Class C, 8 procs) | 1845.76 | 1912.45 |
| Blocking (Class B, 1 proc) | 650.926 | 717.89 |
FT:
fftz took the majority of the time, as reported by gprof. It contains a call to the fftz2 function; because CUDA kernels cannot call host functions directly, we expanded the function inline. The variables used in the hotspot are complex datatypes, so the directives alone could not convert the loop to CUDA code, and we had to write the CUDA kernels by hand. Refer to the report for more details and performance results. Since the cost came from the sheer number of invocations of the same function, we could not extract much performance from the GPGPU.
| Input Size/Processors | Old Code Performance (MOP/s) | New Code Performance (MOP/s) |
|---|---|---|
| A/1 | 533.72 | 531.39 |