Performance Analysis Tools Exercise

This brief exercise covers only a few of the tools presented in the tutorial. Individual workshops are offered for the more sophisticated tools, such as VampirGuideView, Paraver, and Dimemas.

  1. Log in to the SP machine

    Workshops differ in how this is done. The instructor will go over this beforehand.

  2. Copy the example files

    In your home directory, create a subdirectory for the example codes and cd to it. Then copy the example codes.

    mkdir ~/performance_tools 
    cd  ~/performance_tools
    cp -R /usr/local/spclass/blaise/performance_tools/samples/*   ~/performance_tools
    

  3. List the contents of your performance_tools subdirectory

    You should notice the following files:

    File Name     Description
    ------------  -------------------------------------------------------------
    Solutions     Example solutions and reports
    mpi_array.f   Simple MPI program used to demonstrate mpiP. Fortran version.
    mpi_array.c   Simple MPI program used to demonstrate mpiP. C version.
    mpi_prime.f   Simple MPI program used to demonstrate gprof. Also used to demonstrate the MPI timing routines. Fortran version.
    mpi_prime.c   Simple MPI program used to demonstrate gprof. Also used to demonstrate the MPI timing routines. C version.
    tuned.c       Tuned version of a simple four-routine code used to demonstrate prof, xprofiler and hpmcount. C version.
    tuned.f       Tuned version of a simple four-routine code used to demonstrate prof, xprofiler and hpmcount. Fortran version.
    untuned.c     Untuned version of a simple four-routine code used to demonstrate prof, xprofiler and hpmcount. C version.
    untuned.f     Untuned version of a simple four-routine code used to demonstrate prof, xprofiler and hpmcount. Fortran version.

  4. MPI timing routines

    If you've never used the MPI_Wtime timing routine before, try using it to time the mpi_prime code. Determine the overall time it takes to calculate the primes within the given range. Print out the overall time and also the timer resolution. A minimal sketch of this timing pattern appears after the steps below.

    1. Modify the mpi_prime.c or mpi_prime.f code. You'll need to do at least the following:
      • Declare double precision variables for use with the MPI timing routines
      • Make a call to the MPI_Wtime routine before calculations begin
      • Make a call to the MPI_Wtime routine after all results are collected (from within task 0)
      • Determine overall time by subtracting the value of the first MPI_Wtime call from the second MPI_Wtime call.
      • Make a call to MPI_Wtick for the timer resolution

    2. Compile your modified code - either C or Fortran:
      mpxlc_r -lm mpi_prime.c -o mpi_prime
      mpxlf_r mpi_prime.f -o mpi_prime
      
    3. Set the necessary POE environment variables
      setenv MP_RMPOOL 1
      setenv MP_PROCS 4
      setenv MP_NODES 1
      
    4. Run the program:
      mpi_prime
      
    5. Compare your timing results against those from the same code in the Solutions directory.
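
    For reference, here is a minimal sketch of the timing pattern described in step 1. It is not the mpi_prime code itself; the variable names and the placement of the calls are illustrative only.

      #include <stdio.h>
      #include <mpi.h>

      int main(int argc, char *argv[])
      {
          int rank;
          double start, end, resolution;    /* double precision variables for the timers */

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          start = MPI_Wtime();              /* call MPI_Wtime before calculations begin */

          /* ... scan the range for primes and collect the results in task 0 ... */

          end = MPI_Wtime();                /* call MPI_Wtime after all results are collected */
          resolution = MPI_Wtick();         /* timer resolution in seconds */

          if (rank == 0) {
              printf("Wallclock time elapsed: %e seconds\n", end - start);
              printf("Timer resolution= %e seconds\n", resolution);
          }

          MPI_Finalize();
          return 0;
      }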

  5. prof

    This example uses two versions of the same simple code: one untuned and one tuned. The prof utility is used to demonstrate how easy it is to measure the effect of your program optimization efforts.

    1. First, examine the untuned version of the program (untuned.c or untuned.f) and note that four routines are called: pipe, unroll, strength, block

    2. Compile either the C or Fortran untuned program being sure to use the -p flag, which permits profiling with prof:
      xlc_r -p -lm untuned.c -o untuned
      xlf_r -p untuned.f -o untuned
      
    3. Run the untuned version.
      untuned
      
    4. After the program completes, list the contents of your directory. You should have a mon.out file. This file contains the profile statistics from your run; however, it is not human readable. You need to use the prof utility to produce a human readable profile report:
      prof > untuned.profile
      
    5. Examine your profile report (untuned.profile) and note the timings for each routine contained in the untuned program. Also note that system library routines appear in the report; the Fortran report shows considerably more of them than the C report. Example reports are in your Solutions subdirectory.

    6. Now, compile either the C or Fortran tuned version of the code:
      xlc_r -p -lm -lessl tuned.c -o tuned
      xlf_r -p -lessl tuned.f -o tuned
      
    7. Run the tuned version. As before, a mon.out file will be produced after your program completes. Then, use prof to produce a human readable profile report:
      tuned
      prof > tuned.profile
      
    8. Compare the timings of the routines in the tuned report versus the untuned report. With the exception of the pipe routine, the tuned version's routines should all perform better. The comments in the tuned version of the example code briefly explain what was done to achieve the performance improvement (a generic sketch of two such transformations appears after these steps). Example tuned profile reports are also in your Solutions directory.

    9. The compiler is generally your friend when it comes to program optimization. In this step, compile the untuned version of the code with moderate optimization:
      xlc_r -p -lm -O2 untuned.c -o untuned.opt
      xlf_r -p -O2 untuned.f -o untuned.opt
      
    10. Run the compiler optimized untuned version and produce a profile report:
      untuned.opt
      prof > untuned.opt.profile
      
      *NOTE: The C compiler optimizer currently produces wrong results for the strength and block routines.

    11. Compare your untuned.opt.profile against your previous two reports. Notice that all four routines in the example code demonstrate improved performance - in some cases even better than the tuned version without optimization.

    12. Finally, compile and run the tuned version with optimization and produce a profile report:
      xlc_r -p -lm -lessl -O2 tuned.c -o tuned.opt
      xlf_r -p -lessl -O2 tuned.f -o tuned.opt
      
      then
      tuned.opt
      prof > tuned.opt.profile
      
    13. The optimized tuned profile report should show the best timings of all. Example reports are in your Solutions subdirectory.
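
    To make the tuning routine names more concrete, here is a rough, generic C sketch of two of the transformations they suggest - loop unrolling and strength reduction. It is illustrative only and is not taken from the example codes.

      #include <stdio.h>

      #define N 1000

      int main(void)
      {
          double a[N], b[N], sum = 0.0, x = 0.0, dx = 0.5;
          int i;

          for (i = 0; i < N; i++)
              a[i] = (double)i;

          /* Loop unrolling: do several iterations' worth of work per pass,
             reducing loop overhead (assumes N is a multiple of 4). */
          for (i = 0; i < N; i += 4)
              sum += a[i] + a[i+1] + a[i+2] + a[i+3];

          /* Strength reduction: replace the multiply inside the loop
             (b[i] = i * dx) with a cheaper running addition. */
          for (i = 0; i < N; i++) {
              b[i] = x;
              x += dx;
          }

          printf("sum = %e   b[N-1] = %e\n", sum, b[N-1]);
          return 0;
      }

    The actual pipe, unroll, strength, and block routines presumably apply related changes; the comments in tuned.c and tuned.f describe exactly what was done.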

  6. gprof

    The gprof utility provides all of the information of the prof utility, plus information about which routines call/are called by other routines (the call graph). For large complex programs, the gprof report can be intimidating. In this exercise, we'll use the simple mpi_prime parallel program to highlight the main features of gprof.

    1. Get a fresh copy of the mpi_prime code by using the copy in your Solutions subdirectory. For either C or Fortran:
      cd ~/performance_tools
      cp Solutions/mpi_prime.c  gprof.prime.c 
      cp Solutions/mpi_prime.f  gprof.prime.f
      
    2. Compile your C or Fortran gprof.prime code with the -pg flag to permit use of gprof profiling:
      mpxlc_r -lm -pg gprof.prime.c -o gprof.prime
      mpxlf_r -pg gprof.prime.f -o gprof.prime
      
    3. Set the necessary POE environment variables
      setenv MP_RMPOOL 1
      setenv MP_PROCS 4
      setenv MP_NODES 1
      
    4. Run the program. When it completes, you should have 4 gmon.out.* files - one for each MPI task. Your execution and output should look similar to that shown below (lines beginning with the smurf01% prompt show the commands you enter):
      smurf01% gprof.prime
         0:Using 4 tasks to scan 2500000 numbers
         0:Done. Largest prime is 2499997 Total primes 183072
         0:Wallclock time elapsed: 3.454344e+00 seconds
         0:Timer resolution= 1.333333e-08 seconds
      smurf01% ls -l gmon*
      -rw-------   1 blaise   blaise      2083444 Jul 17 17:01 gmon.out.0
      -rw-------   1 blaise   blaise      2083396 Jul 17 17:01 gmon.out.1
      -rw-------   1 blaise   blaise      2083396 Jul 17 17:01 gmon.out.2
      -rw-------   1 blaise   blaise      2083396 Jul 17 17:01 gmon.out.3
      
    5. The gmon.out files contain the execution information for each MPI task; however, they are not human readable. You must use the gprof utility to produce a readable profile report.

      Try any/all of the following commands to generate a gprof report for selected gmon.out.* file(s):

      gprof gprof.prime gmon.out.0 > gprof.prime.rpt
      gprof gprof.prime gmon.out.1 gmon.out.2 > gprof.prime.rpt
      gprof gprof.prime gmon.out.* > gprof.prime.rpt
      
      Note that when multiple gmon.out.* files are specified, their results are summed into a single report.

    6. Examine your gprof.prime.rpt. The report header explains most of what the report contains. Several things to note:
      • Both user and library functions appear in the report. There are usually quite a number of library functions of little interest.
      • Every function has a unique index number associated with it to aid in cross-referencing its information elsewhere in the report
      • Every function shows its parent (calling) function and any children (called) functions.
      • The routines that consume the most CPU time appear at the top of the report.
      • Several different timing statistics are provided for each function.
      • The report consists of three sections, in the following order:
        1. Call Graph section
        2. Flat profile (similar to what prof produces)
        3. Index summary

        Note that you will have to scroll down quite a ways to get from one section of the report to another.

  7. xprofiler

    The xprofiler tool provides a graphical representation of the information produced by the gprof utility. In fact, it uses the same gmon.out.* files produced when a program is compiled with the -pg flag. Additionally, xprofiler provides text reports of gprof information and can profile "ticks" at the source code line level.

    1. Using either the C or Fortran version of the "tuned" program, compile it with both the -pg and -g flags to produce an a.out executable file.
      xlc_r -pg -g -lm -lessl tuned.c 
      xlf_r -pg -g -lessl tuned.f
      
    2. Run your a.out program as usual. It should produce a gmon.out file after it has completed.

    3. Make sure your X-Windows environment is set up correctly, and then start xprofiler:
      xprofiler a.out gmon.out &
      
    4. If everything is set up correctly, the xprofiler GUI will appear loaded with your code.

    5. Notice that functions/routines are "clustered" into boxes. In the typical case, system and other libraries will comprise a significant part of the display. For the purposes of this exercise, we are not interested in these. To remove the libraries' information from the display, select the "Hide All Library Calls" option from the "Filter" pull-down menu.

    6. After hiding all library information, your xprofiler display should show only the clustered function box for a.out (your program).

    7. To view detailed information about a.out, you will usually need to do the following:
      • Select "Uncluster Functions" from the "Filter" pull-down menu. Example here.
      • This will remove the "box" from around a.out routines. Example here.
      • Select "Zoom In" from the "View" pull-down menu. Example here.
      • Your mouse icon will turn into a pointing hand. Just left-click anywhere within your a.out display area. Detailed information about the functions and routines in a.out will then appear. Example here.

    8. Review the functions and arcs displayed by xprofiler. Functions are represented by green boxes and arcs by blue lines that connect functions. Some points to note:
      • The size/shape of a function box represents its execution times, which are also shown as text. The width/first number represents the total execution time of the function plus its descendants. The height/second number represents the execution time spent in the function itself. If the function has no descendants, the box will be square and both timing numbers will be the same.
      • Arcs show which function calls another function and how many times.
      • Example 1: The main routine takes 42.300 seconds to execute, all of which is due to the four routines it calls.
      • Example 2: The block routine takes 0.070 seconds to execute, all of which is due to itself. The strength routine takes 1.990 seconds to execute, all due to itself even though it calls another routine. The pipe routine takes 3.760 seconds to execute, all due to itself.
      • Example 3: The sin routine is called 50000 times by the strength routine, but takes virtually no recorded time to execute. The big unroll routine takes 36.480 seconds to execute, all of which is due to itself.

    9. Functions and arcs have "hidden" menus, which pop open when clicked with the right mouse button. Right mouse click on any function box to view its hidden menu.

    10. Select "Show Source Code" from the unroll function menu. A new window will appear showing the source code. Example here

    11. Scroll through the source code and view the "ticks" wherever they appear. Each "tick" represents 1/100th of a second, so dividing the number of ticks by 100 gives the approximate number of seconds spent on a given source line.

    12. Right mouse click on the strength routine and then select "Statistics Report" from its hidden menu.

    13. A new window will appear showing execution statistics about the strength function.

    14. Left mouse click on xprofiler's "Report" pull-down menu to see what plain text reports are available.

    15. Try viewing any/all of the available reports.

    16. Review xprofiler's other features as desired, or exit it whenever you wish.

  8. mpiP

    mpiP is a lightweight profiling library for MPI routines, providing statistical information about the performance of your program's MPI routine calls. mpiP is LLNL-developed beta software installed on ASCI systems under /usr/local/mpiP.

    1. Review the README file located with the software under /usr/local/mpiP.

    2. Compile either the C or Fortran example file so that it can be used with mpiP:
      mpxlc_r -g -o array mpi_array.c -L/usr/local/mpiP/lib -lmpiP -lbfd -liberty -lintl -lm
      mpxlf_r -o array mpi_array.f -g -L/usr/local/mpiP/lib -lmpiP -lbfd -liberty -lintl -lm
      

    3. Set the necessary POE environment variables:
      setenv MP_RMPOOL 1
      setenv MP_PROCS 8
      setenv MP_NODES 2
      
    4. Run the array program. If mpiP is linked in correctly, you will notice mpiP messages intermixed with the program output. In the example output below, the mpiP messages are the lines containing "mpiP:".

      smurf01% array
         0:mpiP:
         0:mpiP:
         0:mpiP: mpiP V2.3 (Build Aug 28 2001/11:55:57)
         0:mpiP: Direct questions and errors to Jeffrey Vetter 
         0:mpiP:
         0:MPI task 0 has started...
         4:MPI task 4 has started...
         1:MPI task 1 has started...
         5:MPI task 5 has started...
         2:MPI task 2 has started...
         6:MPI task 6 has started...
         3:MPI task 3 has started...
         7:MPI task 7 has started...
         0:Initialized array sum = 1.335708e+14
         0:Sent 2000000 elements to task 1 offset= 2000000
         0:Sent 2000000 elements to task 2 offset= 4000000
         0:Sent 2000000 elements to task 3 offset= 6000000
         1:Task 1 mysum = 1.196314e+13
         0:Sent 2000000 elements to task 4 offset= 8000000
         2:Task 2 mysum = 2.023766e+13
         0:Sent 2000000 elements to task 5 offset= 10000000
         3:Task 3 mysum = 2.784929e+13
         0:Sent 2000000 elements to task 6 offset= 12000000
         4:Task 4 mysum = 3.593385e+13
         0:Sent 2000000 elements to task 7 offset= 14000000
         5:Task 5 mysum = 4.467700e+13
         6:Task 6 mysum = 5.218496e+13
         0:Task 0 mysum = 4.000093e+12
         7:Task 7 mysum = 5.942945e+13
         0:Sample results:
         0:  0.000000e+00  2.000000e+00  4.000000e+00  6.000000e+00  8.000000e+00
         0:  4.000000e+06  4.000002e+06  4.000004e+06  4.000006e+06  4.000008e+06
         0:  8.000000e+06  8.000002e+06  8.000004e+06  8.000006e+06  8.000008e+06
         0:  1.200000e+07  1.200000e+07  1.200000e+07  1.200001e+07  1.200001e+07
         0:  1.600000e+07  1.600000e+07  1.600000e+07  1.600001e+07  1.600001e+07
         0:  2.000000e+07  2.000000e+07  2.000000e+07  2.000001e+07  2.000001e+07
         0:  2.400000e+07  2.400000e+07  2.400000e+07  2.400001e+07  2.400001e+07
         0:  2.800000e+07  2.800000e+07  2.800000e+07  2.800001e+07  2.800001e+07
         0:*** Final sum= 2.562754e+14 ***
         0:mpiP:
         0:mpiP: found 21575 symbols in file [array]
         0:mpiP:
         0:mpiP: Storing mpiP output in [./array.8.43046.mpiP].
         0:mpiP:
      

    5. After the example code completes, review the mpiP report file. The output file name has the format filename.N.XXXXX.mpiP, where filename is the executable name, N is the number of MPI tasks, and XXXXX is the collector task's process id.

    6. Examine the output file and notice what information is profiled for the MPI routines. Example C and Fortran mpiP output reports are provided in your Solutions subdirectory.

  9. hpmcount

    The hpmcount command-line utility is used to launch a specified program, capture run-time hardware performance / utilization information, and provide summary statistical output. It is based upon an architecture's physical hardware counters and hardware events.

    For this exercise, we will use the version of hpmcount that works with the IBM 604e architecture, since the machines we are using are of that type. Note that using hpmcount on ASCI White differs from what is shown here (but not by much).

    1. First, set up an alias to use the correct version of the hpmcount utility for the IBM 604e architecture:
      alias hpmcount /usr/local/HPM/hpmcount604
      

    2. Compile your untuned code, either C or Fortran, to produce an a.out file:
      xlc_r -lm untuned.c
      xlf_r untuned.f
      
    3. Run your a.out file using hpmcount with the default set of hardware counters:
      hpmcount a.out > untuned.hpmcount.rpt
      
    4. Examine your output report. It should resemble that shown below:

       adding counter 0 event 15 FPU instructions
       adding counter 1 event 6 Data Cache Misses
       adding counter 2 event 2 Instructions completed
       adding counter 3 event 1 Cycles
      
      Running pipe()...
      s = 7.980000e+10
      Running unroll()...
      s= 3.349600e+10
      Running strength()...
      s= 8.429910e+04
      Running block()...
      c[511][511] = 6.684706e+07
      
       hpmcount (V 2.1.0) summary
      
       Total execution time (wall clock time): 116.626656 seconds
      
       ########  Resource Usage Statistics  ########
      
       Total amount of time in user mode            : 116.420000 seconds
       Total amount of time in system mode          : 0.180000 seconds
       Maximum resident set size                    : 7552 Kbytes
       Average shared memory use in text segment    : 139884 Kbytes*sec
       Average unshared memory use in data segment  : 45828508 Kbytes*sec
       Number of page faults without I/O activity   : 1898
       Number of page faults with I/O activity      : 0
       Number of times process was swapped out      : 0
       Number of times file system performed INPUT  : 0
       Number of times file system performed OUTPUT : 0
       Number of IPC messages sent                  : 0
       Number of IPC messages received              : 0
       Number of signals delivered                  : 0
       Number of voluntary context switches         : 41
       Number of involuntary context switches       : 290
      
       #######  End of Resource Statistics  ########
      
        PM_FPU_CMPL (FPU instructions)             :      1735013853
        PM_DC_MISS (Data Cache Misses)             :       386777959
        PM_INST_CMPL (Instructions completed)      :     21418551093
        PM_CYC (Cycles)                            :     38050190097
      
        Cycles per instruction                     :           1.777
        Instructions per cycle                     :           0.563
      

    5. Now, compile your tuned version of the code and run hpmcount with it.
      xlc_r -lm -lessl tuned.c
      xlf_r -lessl tuned.f
      
      then
      hpmcount a.out > tuned.hpmcount.rpt
      

    6. Compare the two reports (untuned versus tuned) and notice the more efficient use of the 604e hardware by the tuned code. (The derived metrics at the bottom of each report are simply ratios of the raw counters; for the untuned run above, cycles per instruction = PM_CYC / PM_INST_CMPL = 38050190097 / 21418551093 ≈ 1.777, the value shown.) An example tuned hpmcount report is available. In particular, notice:
      • Less execution time
      • Fewer floating point instructions
      • Fewer data cache misses
      • Fewer overall instructions required
      • Fewer overall cycles used
      • Lower cycles per instruction ratio
      • Higher instructions per cycle ratio


This completes the exercise.

Evaluation Form: Please complete the online evaluation form.
