Performance Analysis Tools Exercise

This brief exercise covers only a few of the tools presented in the tutorial. Individual workshops are offered for the more sophisticated tools, such as VampirGuideView, Paraver, and Dimemas.

  1. Log in to the SP machine

    Workshops differ in how this is done. The instructor will go over this beforehand.

  2. Copy the example files

    In your home directory, create a subdirectory for the example codes and cd to it. Then copy the example codes.

    mkdir ~/performance_tools 
    cd  ~/performance_tools
    cp -R /usr/local/spclass/blaise/performance_tools/samples/*   ~/performance_tools
    

  3. List the contents of your performance_tools subdirectory

    You should notice the following files:

    File Name     Description
    ------------  -------------------------------------------------------------
    Solutions     Example solutions and reports
    mpi_array.f   Simple MPI program used to demonstrate mpiP. Fortran version.
    mpi_array.c   Simple MPI program used to demonstrate mpiP. C version.
    mpi_prime.f   Simple MPI program used to demonstrate gprof. Also used to demonstrate the MPI timing routines. Fortran version.
    mpi_prime.c   Simple MPI program used to demonstrate gprof. Also used to demonstrate the MPI timing routines. C version.
    tuned.c       Tuned version of a simple four-routine code used to demonstrate prof, xprofiler and hpmcount. C version.
    tuned.f       Tuned version of a simple four-routine code used to demonstrate prof, xprofiler and hpmcount. Fortran version.
    untuned.c     Untuned version of a simple four-routine code used to demonstrate prof, xprofiler and hpmcount. C version.
    untuned.f     Untuned version of a simple four-routine code used to demonstrate prof, xprofiler and hpmcount. Fortran version.

  4. MPI timing routines

    If you've never used the MPI_Wtime timing routine before, try using it to time the mpi_prime code. Determine the overall time it takes to calculate the primes within the given range. Print out the overall time and also the timer resolution. A minimal sketch of this timing pattern appears after the steps below.

    1. Modify the mpi_prime.c or mpi_prime.f code. You'll need to do at least the following:
      • Declare double precision variables for use with the MPI timing routines
      • Make a call to the MPI_Wtime routine before calculations begin
      • Make a call to the MPI_Wtime routine after all results are collected (from within task 0)
      • Determine overall time by subtracting the value of the first MPI_Wtime call from the second MPI_Wtime call.
      • Make a call to MPI_Wtick for the timer resolution

    2. Compile your modified code - either C or Fortran:
      mpxlc_r -lm mpi_prime.c -o mpi_prime
      mpxlf_r mpi_prime.f -o mpi_prime
      
    3. Set the necessary POE environment variables
      setenv MP_RMPOOL 1
      setenv MP_PROCS 4
      setenv MP_NODES 1
      
    4. Run the program:
      mpi_prime
      
    5. Compare your timing results against those from the same code in the Solutions directory.
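
    For reference, here is a minimal sketch of the timing pattern described in step 1. It is not the mpi_prime code itself; the variable names and the placement of the calls are illustrative only.

      #include <stdio.h>
      #include <mpi.h>

      int main(int argc, char *argv[])
      {
          int rank;
          double start, end, resolution;    /* double precision variables for the timers */

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          start = MPI_Wtime();              /* call MPI_Wtime before calculations begin */

          /* ... scan the range for primes and collect the results in task 0 ... */

          end = MPI_Wtime();                /* call MPI_Wtime after all results are collected */
          resolution = MPI_Wtick();         /* timer resolution in seconds */

          if (rank == 0) {
              printf("Wallclock time elapsed: %e seconds\n", end - start);
              printf("Timer resolution= %e seconds\n", resolution);
          }

          MPI_Finalize();
          return 0;
      }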

  5. prof

    This example uses two versions of the same simple code: one untuned and one tuned. The prof utility is used to demonstrate how easy it is to measure the effect of your program optimization efforts.

    1. First, examine the untuned version of the program (untuned.c or untuned.f) and note that four routines are called: pipe, unroll, strength, block

    2. Compile either the C or Fortran untuned program being sure to use the -p flag, which permits profiling with prof:
      xlc_r -p -lm untuned.c -o untuned
      xlf_r -p untuned.f -o untuned
      
    3. Run the untuned version.
      untuned
      
    4. After the program completes, list the contents of your directory. You should have a mon.out file. This file contains the profile statistics from your run; however, it is not human readable. You need to use the prof utility to produce a human readable profile report:
      prof > untuned.profile
      
    5. Examine your profile report (untuned.profile) and note the timings for each routine contained in the untuned program. Also note that system library routines appear in the report; the Fortran report shows considerably more of them than the C report. Example reports are in your Solutions subdirectory.

    6. Now, compile either the C or Fortran tuned version of the code:
      xlc_r -p -lm -lessl tuned.c -o tuned
      xlf_r -p -lessl tuned.f -o tuned
      
    7. Run the tuned version. As before, a mon.out file will be produced after your program completes. Then, use prof to produce a human readable profile report:
      tuned
      prof > tuned.profile
      
    8. Compare the timings of the routines in the tuned report versus the untuned report. With the exception of the pipe routine, the tuned version's routines should all perform better. The comments in the tuned version of the example code briefly explain what was done to achieve the performance improvement (a generic sketch of two such transformations appears after these steps). Example tuned profile reports are also in your Solutions directory.

    9. The compiler is generally your friend when it comes to program optimization. In this step, compile the untuned version of the code with moderate optimization:
      xlc_r -p -lm -O2 untuned.c -o untuned.opt
      xlf_r -p -O2 untuned.f -o untuned.opt
      
    10. Run the compiler optimized untuned version and produce a profile report:
      untuned.opt
      prof > untuned.opt.profile
      
      *NOTE: The C compiler optimizer currently produces wrong results for the strength and block routines.

    11. Compare your untuned.opt.profile against your previous two reports. Notice that all four routines in the example code demonstrate improved performance - in some cases even better than the tuned version without optimization.

    12. Finally, compile and run the tuned version with optimization and produce a profile report:
      xlc_r -p -lm -lessl -O2 tuned.c -o tuned.opt
      xlf_r -p -lessl -O2 tuned.f -o tuned.opt
      
      then
      tuned.opt
      prof > tuned.opt.profile
      
    13. The optimized tuned profile report should show the best timings of all. Example reports are in your Solutions subdirectory.
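
    To make the tuning routine names more concrete, here is a rough, generic C sketch of two of the transformations they suggest - loop unrolling and strength reduction. It is illustrative only and is not taken from the example codes.

      #include <stdio.h>

      #define N 1000

      int main(void)
      {
          double a[N], b[N], sum = 0.0, x = 0.0, dx = 0.5;
          int i;

          for (i = 0; i < N; i++)
              a[i] = (double)i;

          /* Loop unrolling: do several iterations' worth of work per pass,
             reducing loop overhead (assumes N is a multiple of 4). */
          for (i = 0; i < N; i += 4)
              sum += a[i] + a[i+1] + a[i+2] + a[i+3];

          /* Strength reduction: replace the multiply inside the loop
             (b[i] = i * dx) with a cheaper running addition. */
          for (i = 0; i < N; i++) {
              b[i] = x;
              x += dx;
          }

          printf("sum = %e   b[N-1] = %e\n", sum, b[N-1]);
          return 0;
      }

    The actual pipe, unroll, strength, and block routines presumably apply related changes; the comments in tuned.c and tuned.f describe exactly what was done.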

  6. gprof

    The gprof utility provides all of the information of the prof utility, plus information about which routines call/are called by other routines (the call graph). For large complex programs, the gprof report can be intimidating. In this exercise, we'll use the simple mpi_prime parallel program to highlight the main features of gprof.

    1. Get a fresh copy of the mpi_prime code by using the copy in your Solutions subdirectory. For either C or Fortran:
      cd ~/performance_tools
      cp Solutions/mpi_prime.c  gprof.prime.c 
      cp Solutions/mpi_prime.f  gprof.prime.f
      
    2. Compile your C or Fortran gprof.prime code with the -pg flag to permit use of gprof profiling:
      mpxlc_r -lm -pg gprof.prime.c -o gprof.prime
      mpxlf_r -pg gprof.prime.f -o gprof.prime
      
    3. Set the necessary POE environment variables
      setenv MP_RMPOOL 1
      setenv MP_PROCS 4
      setenv MP_NODES 1
      
    4. Run the program. When it completes, you should have 4 gmon.out.* files - one for each MPI task. Your execution and output should look similar to that shown below (lines beginning with the smurf01% prompt show the commands you enter):
      smurf01% gprof.prime
         0:Using 4 tasks to scan 2500000 numbers
         0:Done. Largest prime is 2499997 Total primes 183072
         0:Wallclock time elapsed: 3.454344e+00 seconds
         0:Timer resolution= 1.333333e-08 seconds
      smurf01% ls -l gmon*
      -rw-------   1 blaise   blaise      2083444 Jul 17 17:01 gmon.out.0
      -rw-------   1 blaise   blaise      2083396 Jul 17 17:01 gmon.out.1
      -rw-------   1 blaise   blaise      2083396 Jul 17 17:01 gmon.out.2
      -rw-------   1 blaise   blaise      2083396 Jul 17 17:01 gmon.out.3
      
    5. The gmon.out files contain the execution information for each MPI task; however, they are not human readable. You must use the gprof utility to produce a readable profile report.

      Try any/all of the following commands to generate a gprof report for selected gmon.out.* file(s):

      gprof gprof.prime gmon.out.0 > gprof.prime.rpt
      gprof gprof.prime gmon.out.1 gmon.out.2 > gprof.prime.rpt
      gprof gprof.prime gmon.out.* > gprof.prime.rpt
      
      Note that when multiple gmon.out.* files are specified, their results are summed into a single report.

    6. Examine your gprof.prime.rpt. The report header explains most of what the report contains. Several things to note:
      • Both user and library functions appear in the report. There are usually quite a number of library functions of little interest.
      • Every function has a unique index number associated with it to aid in cross-referencing its information elsewhere in the report
      • Every function shows its parent (calling) function and any children (called) functions.
      • The routines that consume the most CPU time appear at the top of the report.
      • Several different timing statistics are provided for each function.
      • The report consists of three sections, in the following order:
        1. Call Graph section
        2. Flat profile (similar to what prof produces)
        3. Index summary

        Note that you will have to scroll down quite a ways to get from one section of the report to another.

  7. xprofiler

    The xprofiler tool provides a graphical representation of the information produced by the gprof utility. In fact, it uses the same gmon.out.* files produced when a program is compiled with the -pg flag. Additionally, xprofiler provides text reports of gprof information and can profile "ticks" at the source code line level.

    1. Using either the C or Fortran version of the "tuned" program, compile it with both the -pg and -g flags to produce an a.out executable file.
      xlc_r -pg -g -lm -lessl tuned.c 
      xlf_r -pg -g -lessl tuned.f
      
    2. Run your a.out program as usual. It should produce a gmon.out file after it has completed.

    3. Make sure your X-Windows environment is set up correctly, and then start xprofiler:
      xprofiler a.out gmon.out &
      
    4. If everything is set up correctly, the xprofiler GUI will appear loaded with your code.

    5. Notice that functions/routines are "clustered" into boxes. In the typical case, system and other libraries will comprise a significant part of the display. For the purposes of this exercise, we are not interested in these. To remove the libraries' information from the display, select the "Hide All Library Calls" option from the "Filter" pull-down menu.

    6. After hiding all library information, your xprofiler display should show only the clustered function box for a.out (your program).

    7. To view detailed information about a.out, you will usually need to do the following:
      • Select "Uncluster Functions" from the "Filter" pull-down menu. Example here.
      • This will remove the "box" from around a.out routines. Example here.
      • Select "Zoom In" from the "View" pull-down menu. Example here.
      • Your mouse icon will turn into a pointing hand. Just left-click anywhere within your a.out display area. Detailed information about the functions and routines in a.out will then appear. Example here.

    8. Review the functions and arcs displayed by xprofiler. Functions are represented by green boxes and arcs by blue lines that connect functions. Some points to note:
      • The size/shape of a function box represents its execution times, which are also shown as text. The width/first number represents the total execution time of the function plus its descendants. The height/second number represents the execution time spent in the function itself. If the function has no descendants, the box will be square and both timing numbers will be the same.
      • Arcs show which function calls another function and how many times.
      • Example 1: The main routine takes 42.300 seconds to execute, all of which is due to the four routines it calls.
      • Example 2: The block routine takes 0.070 seconds to execute, all of which is due to itself. The strength routine takes 1.990 seconds to execute, all due to itself even though it calls another routine. The pipe routine takes 3.760 seconds to execute, all due to itself.
      • Example 3: The sin routine is called 50000 times by the strength routine, but takes virtually no recorded time to execute. The big unroll routine takes 36.480 seconds to execute, all of which is due to itself.

    9. Functions and arcs have "hidden" menus, which pop open when clicked with the right mouse button. Right mouse click on any function box to view its hidden menu.

    10. Select "Show Source Code" from the unroll function menu. A new window will appear showing the source code. Example here

    11. Scroll through the source code and view the "ticks" wherever they appear. Each "tick" represents 1/100th of a second, so dividing the number of ticks by 100 gives the approximate number of seconds spent on a given source line.

    12. Right mouse click on the strength routine and then select "Statistics Report" from its hidden menu.

    13. A new window will appear showing execution statistics about the strength function.

    14. Left mouse click on xprofiler's "Report" pull-down menu to see what plain text reports are available.

    15. Try viewing any/all of the available reports.

    16. Review xprofiler's other features as desired, or exit it whenever you wish.

  8. mpiP

    mpiP is a lightweight profiling library for MPI routines, providing statistical information about the performance of your program's MPI routine calls. mpiP is LLNL-developed beta software installed on ASCI systems under /usr/local/mpiP.

    1. Review the README file located with the software under /usr/local/mpiP.

    2. Compile either the C or Fortran example file so that it can be used with mpiP:
      mpxlc_r -g -o array mpi_array.c -L/usr/local/mpiP/lib -lmpiP -lbfd -liberty -lintl -lm
      mpxlf_r -o array mpi_array.f -g -L/usr/local/mpiP/lib -lmpiP -lbfd -liberty -lintl -lm
      

    3. Set the necessary POE environment variables:
      setenv MP_RMPOOL 1
      setenv MP_PROCS 8
      setenv MP_NODES 2
      
    4. Run the array program. If mpiP is linked in correctly, you will notice mpiP messages intermixed with the program output. In the example output below, the mpiP messages are the lines containing "mpiP:".

      smurf01% array
         0:mpiP:
         0:mpiP:
         0:mpiP: mpiP V2.3 (Build Aug 28 2001/11:55:57)
         0:mpiP: Direct questions and errors to Jeffrey Vetter 
         0:mpiP:
         0:MPI task 0 has started...
         4:MPI task 4 has started...
         1:MPI task 1 has started...
         5:MPI task 5 has started...
         2:MPI task 2 has started...
         6:MPI task 6 has started...
         3:MPI task 3 has started...
         7:MPI task 7 has started...
         0:Initialized array sum = 1.335708e+14
         0:Sent 2000000 elements to task 1 offset= 2000000
         0:Sent 2000000 elements to task 2 offset= 4000000
         0:Sent 2000000 elements to task 3 offset= 6000000
         1:Task 1 mysum = 1.196314e+13
         0:Sent 2000000 elements to task 4 offset= 8000000
         2:Task 2 mysum = 2.023766e+13
         0:Sent 2000000 elements to task 5 offset= 10000000
         3:Task 3 mysum = 2.784929e+13
         0:Sent 2000000 elements to task 6 offset= 12000000
         4:Task 4 mysum = 3.593385e+13
         0:Sent 2000000 elements to task 7 offset= 14000000
         5:Task 5 mysum = 4.467700e+13
         6:Task 6 mysum = 5.218496e+13
         0:Task 0 mysum = 4.000093e+12
         7:Task 7 mysum = 5.942945e+13
         0:Sample results:
         0:  0.000000e+00  2.000000e+00  4.000000e+00  6.000000e+00  8.000000e+00
         0:  4.000000e+06  4.000002e+06  4.000004e+06  4.000006e+06  4.000008e+06
         0:  8.000000e+06  8.000002e+06  8.000004e+06  8.000006e+06  8.000008e+06
         0:  1.200000e+07  1.200000e+07  1.200000e+07  1.200001e+07  1.200001e+07
         0:  1.600000e+07  1.600000e+07  1.600000e+07  1.600001e+07  1.600001e+07
         0:  2.000000e+07  2.000000e+07  2.000000e+07  2.000001e+07  2.000001e+07
         0:  2.400000e+07  2.400000e+07  2.400000e+07  2.400001e+07  2.400001e+07
         0:  2.800000e+07  2.800000e+07  2.800000e+07  2.800001e+07  2.800001e+07
         0:*** Final sum= 2.562754e+14 ***
         0:mpiP:
         0:mpiP: found 21575 symbols in file [array]
         0:mpiP:
         0:mpiP: Storing mpiP output in [./array.8.43046.mpiP].
         0:mpiP:
      

    5. After the example code completes, review the mpiP report file. The output file name has the format filename.N.XXXXX.mpiP, where filename is the executable name, N is the number of MPI tasks, and XXXXX is the collector task's process id.

    6. Examine the output file and notice what information is profiled for the MPI routines. Example C and Fortran mpiP output reports are provided in your Solutions subdirectory.

  9. hpmcount

    The hpmcount command-line utility is used to launch a specified program, capture run-time hardware performance / utilization information, and provide summary statistical output. It is based upon an architecture's physical hardware counters and hardware events.

    For this exercise, we will use the version of hpmcount that works with the IBM 604e architecture, since the machines we are using are of that type. Note that using hpmcount on ASCI White differs from what is shown here (but not by much).

    1. First, set up an alias to use the correct version of the hpmcount utility for the IBM 604e architecture:
      alias hpmcount /usr/local/HPM/hpmcount604
      

    2. Compile your untuned code, either C or Fortran, to produce an a.out file:
      xlc_r -lm untuned.c
      xlf_r untuned.f
      
    3. Run your a.out file using hpmcount with the default set of hardware counters:
      hpmcount a.out > untuned.hpmcount.rpt
      
    4. Examine your output report. It should resemble that shown below:

       adding counter 0 event 15 FPU instructions
       adding counter 1 event 6 Data Cache Misses
       adding counter 2 event 2 Instructions completed
       adding counter 3 event 1 Cycles
      
      Running pipe()...
      s = 7.980000e+10
      Running unroll()...
      s= 3.349600e+10
      Running strength()...
      s= 8.429910e+04
      Running block()...
      c[511][511] = 6.684706e+07
      
       hpmcount (V 2.1.0) summary
      
       Total execution time (wall clock time): 116.626656 seconds
      
       ########  Resource Usage Statistics  ########
      
       Total amount of time in user mode            : 116.420000 seconds
       Total amount of time in system mode          : 0.180000 seconds
       Maximum resident set size                    : 7552 Kbytes
       Average shared memory use in text segment    : 139884 Kbytes*sec
       Average unshared memory use in data segment  : 45828508 Kbytes*sec
       Number of page faults without I/O activity   : 1898
       Number of page faults with I/O activity      : 0
       Number of times process was swapped out      : 0
       Number of times file system performed INPUT  : 0
       Number of times file system performed OUTPUT : 0
       Number of IPC messages sent                  : 0
       Number of IPC messages received              : 0
       Number of signals delivered                  : 0
       Number of voluntary context switches         : 41
       Number of involuntary context switches       : 290
      
       #######  End of Resource Statistics  ########
      
        PM_FPU_CMPL (FPU instructions)             :      1735013853
        PM_DC_MISS (Data Cache Misses)             :       386777959
        PM_INST_CMPL (Instructions completed)      :     21418551093
        PM_CYC (Cycles)                            :     38050190097
      
        Cycles per instruction                     :           1.777
        Instructions per cycle                     :           0.563
      

    5. Now, compile your tuned version of the code and run hpmcount with it.
      xlc_r -lm -lessl tuned.c
      xlf_r -lessl tuned.f
      
      then
      hpmcount a.out > tuned.hpmcount.rpt
      

    6. Compare the two reports (untuned versus tuned) and notice the more efficient use of the 604e hardware by the tuned code. (The derived metrics at the bottom of each report are simply ratios of the raw counters; for the untuned run above, cycles per instruction = PM_CYC / PM_INST_CMPL = 38050190097 / 21418551093 ≈ 1.777, the value shown.) An example tuned hpmcount report is available. In particular, notice:
      • Less execution time
      • Fewer floating point instructions
      • Fewer data cache misses
      • Fewer overall instructions required
      • Fewer overall cycles used
      • Lower cycles per instruction ratio
      • Higher instructions per cycle ratio


This completes the exercise.

Evaluation Form: Please complete the online evaluation form.
