This brief exercise covers only a few of the tools presented in the
tutorial. Individual workshops are offered for the more sophisticated
tools, such as VampirGuideView, Paraver, and Dimemas.
- Log in to the SP machine
Workshops differ in how this is done. The instructor will go over
this beforehand.
- Copy the example files
In your home directory, create a subdirectory for the example codes
and cd to it. Then copy the example codes.
mkdir ~/performance_tools
cd ~/performance_tools
cp -R /usr/local/spclass/blaise/performance_tools/samples/* ~/performance_tools
- List the contents of your performance_tools
subdirectory
You should notice the following files:
File Name      Description
---------      -----------
Solutions      Example solutions and reports
mpi_array.f    Simple MPI program used to demonstrate mpiP. Fortran version.
mpi_array.c    Simple MPI program used to demonstrate mpiP. C version.
mpi_prime.f    Simple MPI program used to demonstrate gprof. Also used to
               demonstrate use of MPI timing routines. Fortran version.
mpi_prime.c    Simple MPI program used to demonstrate gprof. Also used to
               demonstrate use of MPI timing routines. C version.
tuned.c        Tuned version of a simple four routine code used to demonstrate
               prof, xprofiler and hpmcount. C version.
tuned.f        Tuned version of a simple four routine code used to demonstrate
               prof, xprofiler and hpmcount. Fortran version.
untuned.c      Untuned version of a simple four routine code used to demonstrate
               prof, xprofiler and hpmcount. C version.
untuned.f      Untuned version of a simple four routine code used to demonstrate
               prof, xprofiler and hpmcount. Fortran version.
- MPI timing routines
If you've never used the MPI_Wtime timing routine before,
try using it to time the mpi_prime code. Determine the overall
time it takes to calculate the primes within the given range.
Print out the overall time and also the timer resolution.
- Modify the mpi_prime.c or mpi_prime.f code. You'll need to do
at least the following (a C sketch of where these calls go appears
after this list):
- Declare double precision variables for use with the MPI
timing routines
- Make a call to the MPI_Wtime routine before calculations
begin
- Make a call to the MPI_Wtime routine after all results are
collected (from within task 0)
- Determine overall time by subtracting the value of the first
MPI_Wtime call from the second MPI_Wtime call.
- Make a call to MPI_Wtick for the timer resolution
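The minimal C sketch below shows only where the timing calls go; it is not
the mpi_prime.c source, and the prime calculation and result collection are
reduced to a placeholder comment.
/* Sketch: placement of MPI_Wtime/MPI_Wtick calls.  The actual prime
   calculation and result collection from mpi_prime.c are omitted. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    double t_start, t_end;      /* double precision for the MPI timers */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    t_start = MPI_Wtime();      /* before calculations begin */

    /* ... calculate primes; collect all results in task 0 ... */

    if (rank == 0) {
        t_end = MPI_Wtime();    /* after all results are collected */
        printf("Wallclock time elapsed: %e seconds\n", t_end - t_start);
        printf("Timer resolution= %e seconds\n", MPI_Wtick());
    }

    MPI_Finalize();
    return 0;
}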
- Compile your modified code - either C or Fortran:
mpxlc_r -lm mpi_prime.c -o mpi_prime
mpxlf_r mpi_prime.f -o mpi_prime
- Set the necessary POE environment variables
setenv MP_RMPOOL 1
setenv MP_PROCS 4
setenv MP_NODES 1
- Run the program:
mpi_prime
- Compare your timing results against those from the same code in the
Solutions directory.
- prof
This example uses two versions of the same simple program - one version is
untuned and one is tuned. The prof utility is used to demonstrate how
easily you can measure the effect of your optimization efforts.
- First, examine the untuned version of the program (untuned.c or untuned.f)
and note that four routines are called:
pipe, unroll, strength, block
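To get a feel for these transformations before examining the source, the
generic C fragment below illustrates two of the techniques the routine names
suggest: strength reduction and loop unrolling. It is not the code from
untuned.c or tuned.c - the function names and loops are invented purely for
illustration; the comments in tuned.c/tuned.f describe what was actually done.
#include <stdio.h>
#include <stddef.h>

/* Strength reduction: replace the multiply in the index calculation
   with a running addition. */
void fill_untuned(double *a, size_t n, size_t stride)
{
    size_t i;
    for (i = 0; i < n; i++)
        a[i * stride] = (double) i;      /* one multiply per iteration */
}

void fill_tuned(double *a, size_t n, size_t stride)
{
    size_t i, k = 0;
    for (i = 0; i < n; i++) {
        a[k] = (double) i;               /* multiply replaced by ...   */
        k += stride;                     /* ... a cheaper addition     */
    }
}

/* Loop unrolling: four elements per loop trip cuts loop overhead and gives
   the hardware independent adds to overlap (n is assumed to be a multiple
   of 4 purely to keep the illustration short). */
double sum_unrolled(const double *a, size_t n)
{
    size_t i;
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}

int main(void)
{
    double a[8];
    fill_untuned(a, 8, 1);
    fill_tuned(a, 8, 1);
    printf("sum = %e\n", sum_unrolled(a, 8));   /* 0+1+...+7 = 28 */
    return 0;
}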
- Compile either the C or Fortran untuned program being sure to use the
-p flag, which permits profiling with prof:
xlc_r -p -lm untuned.c -o untuned
xlf_r -p untuned.f -o untuned
- Run the untuned version.
untuned
- After the program completes, list the contents of your directory. You should
have a mon.out file. This file contains the profile statistics from
your run; however, it is not human readable. Use the
prof utility to produce a readable profile report:
prof > untuned.profile
- Examine your profile report (untuned.profile) and note the timings for each
routine contained in the untuned program. Also note that system library
routines appear in the report; the Fortran report shows considerably more of
them than the C report.
Example reports are in your Solutions subdirectory.
- Now, compile either the C or Fortran tuned version of the code:
xlc_r -p -lm -lessl tuned.c -o tuned
xlf_r -p -lessl tuned.f -o tuned
- Run the tuned version. As before, a mon.out file will be produced after
your program completes. Then, use prof to produce a human
readable profile report:
tuned
prof > tuned.profile
- Compare the timings of the routines in the tuned report versus the untuned
report. With the exception of the pipe routine, the tuned version
routines should all perform better. The comments in the
tuned version of the example code briefly explain what was done to achieve
the performance improvement. Example tuned profile reports are also in your
Solutions directory.
- The compiler is generally your friend when it comes to program optimization.
In this step, compile the untuned version of the code with moderate
optimization:
xlc_r -p -lm -O2 untuned.c -o untuned.opt
xlf_r -p -O2 untuned.f -o untuned.opt
- Run the compiler optimized untuned version and produce a profile report:
untuned.opt
prof > untuned.opt.profile
*NOTE: The C compiler optimizer currently produces
wrong results for the strength and block routines.
- Compare your untuned.opt.profile against your previous two reports.
Notice that all four routines in the example code demonstrate improved
performance - in some cases even better than the tuned version without
optimization.
- Finally, compile and run the tuned version with optimization and produce a
profile report:
xlc_r -p -lm -lessl -O2 tuned.c -o tuned.opt
xlf_r -p -lessl -O2 tuned.f -o tuned.opt
then
tuned.opt
prof > tuned.opt.profile
- The optimized tuned profile report should show the best timings of all.
Example reports are in your Solutions subdirectory.
- gprof
The gprof utility provides all of the information of the prof
utility, plus information about which routines call/are called by other
routines (the call graph). For large complex programs, the gprof report can be
intimidating. In this exercise, we'll use the simple mpi_prime
parallel program to highlight the main features of gprof.
- Get a fresh copy of the mpi_prime code by using the copy in your Solutions
subdirectory. For either C or Fortran:
cd ~/performance_tools
cp Solutions/mpi_prime.c gprof.prime.c
cp Solutions/mpi_prime.f gprof.prime.f
- Compile your C or Fortran gprof.prime code with the -pg flag
to permit use of gprof profiling:
mpxlc_r -lm -pg gprof.prime.c -o gprof.prime
mpxlf_r -pg gprof.prime.f -o gprof.prime
- Set the necessary POE environment variables
setenv MP_RMPOOL 1
setenv MP_PROCS 4
setenv MP_NODES 1
- Run the program. When it completes, you should have 4 gmon.out.* files -
one for each MPI task. Your execution and output should look similar
to that shown below (commands typed by the user follow the smurf01% prompt):
smurf01% gprof.prime
0:Using 4 tasks to scan 2500000 numbers
0:Done. Largest prime is 2499997 Total primes 183072
0:Wallclock time elapsed: 3.454344e+00 seconds
0:Timer resolution= 1.333333e-08 seconds
smurf01% ls -l gmon*
-rw------- 1 blaise blaise 2083444 Jul 17 17:01 gmon.out.0
-rw------- 1 blaise blaise 2083396 Jul 17 17:01 gmon.out.1
-rw------- 1 blaise blaise 2083396 Jul 17 17:01 gmon.out.2
-rw------- 1 blaise blaise 2083396 Jul 17 17:01 gmon.out.3
- The gmon.out.* files contain the execution information for each MPI task;
however, they are not human readable. You must use the gprof
utility to produce a readable profile report.
Try any or all of the following commands to generate a gprof report
from selected gmon.out.* file(s):
gprof gprof.prime gmon.out.0 > gprof.prime.rpt
gprof gprof.prime gmon.out.1 gmon.out.2 > gprof.prime.rpt
gprof gprof.prime gmon.out.* > gprof.prime.rpt
Note that for reports produced by multiple gmon.out.* files, the results are
summed into a single report.
- Examine your gprof.prime.rpt. The report header explains most of what the
report contains. In particular, note that it includes both the call graph
and the flat, prof-style timing profile for the mpi_prime routines.
- xprofiler
The xprofiler tool provides a graphical representation of the information
produced by the gprof utility. In fact, it uses the same gmon.out.*
files produced by running a program that was compiled with the -pg flag.
Additionally, xprofiler provides text reports of gprof information
and can profile "ticks" at the source code line level.
- Using either the C or Fortran version of the "tuned" program, compile
it with both the -pg and -g flags to
produce an a.out executable file.
xlc_r -pg -g -lm -lessl tuned.c
xlf_r -pg -g -lessl tuned.f
- Run your a.out program as usual. It should produce a gmon.out file
after it has completed.
- Make sure your X-Windows environment is set up correctly, and then start
xprofiler:
xprofiler a.out gmon.out &
- If everything is set up correctly, the xprofiler GUI will appear loaded with
your code.
- Notice that functions/routines are "clustered" into boxes. In the typical
case, system and other libraries will comprise a significant part of the
display. For the purposes of this exercise, we are not interested in these.
To remove the libraries' information from the display, select the
"Hide All Library Calls" option from the "Filter" pull-down menu.
- After hiding all library information, your xprofiler display should show
only the clustered function box for a.out (your program).
- To view detailed information about a.out, you will usually need to do the
following:
- Select "Uncluster Functions" from the "Filter" pull-down menu.
- This will remove the "box" from around a.out routines.
- Select "Zoom In" from the "View" pull-down menu.
- Your mouse icon will turn into a pointing hand. Just left-click anywhere
within your a.out display area. Detailed information about the functions
and routines in a.out will then appear.
- Review the functions and arcs displayed by xprofiler. Functions are
represented by green boxes and arcs by blue lines that connect functions.
Some points to note:
- The size/shape of a function box represents its execution times, which
are also shown as text. The width/first number represents execution time
due to descendants. The height/second number represents execution time
due to that function by itself. If the function has no descendants, then
it will be square and both timing numbers will be the same.
- Arcs show which function calls another function and how many times.
- Example 1: The
main routine takes 42.300 seconds to execute, all of which is due to
the four routines it calls.
- Example 2: The
block routine takes 0.070 seconds to execute, all of which is due to
itself. The strength routine takes 1.990 seconds to execute, all due
to itself even though it calls another routine. The pipe routine takes
3.760 seconds to execute, all due to itself.
- Example 3: The
sin routine is called 50000 times by the strength routine, but takes
virtually no recorded time to execute. The big unroll routine takes
36.480 seconds to execute, all of which is due to itself.
- Functions and arcs have "hidden" menus, which pop open when clicked on with
the right mouse button. Right mouse click on any function box to view its
hidden menu.
- Select "Show Source Code" from the unroll function menu. A new window will
appear showing the source code.
- Scroll through the source code and view the "ticks" wherever they appear.
Each "tick" represents 1/100th of a second, so dividing the number of
ticks by 100 approximates the number of seconds spent on a given source
line (for example, 250 ticks is roughly 2.5 seconds).
- Right mouse click on the strength routine and then select "Statistics
Report" from its hidden menu.
- A new window will appear showing execution statistics about the strength
function.
- Left mouse click on xprofiler's "Report" pull-down menu to see what
plain text reports are available.
- Try viewing any or all of the available reports.
- Review xprofiler's other features as desired, or exit it whenever you wish.
- mpiP
mpiP is a lightweight profiling library for MPI, providing statistical
information about the performance of your program's MPI routine calls. mpiP is
LLNL-developed beta software installed on ASCI systems under /usr/local/mpiP.
- Review the README file located with the software under /usr/local/mpiP.
- Compile either the C or Fortran example file so that it can be used with
mpiP:
mpxlc_r -g -o array mpi_array.c -L/usr/local/mpiP/lib -lmpiP -lbfd -liberty -lintl -lm
mpxlf_r -o array mpi_array.f -g -L/usr/local/mpiP/lib -lmpiP -lbfd -liberty -lintl -lm
- Set the necessary POE environment variables:
setenv MP_RMPOOL 1
setenv MP_PROCS 8
setenv MP_NODES 2
- Run the array program. If mpiP is linked in correctly, you will notice mpiP
messages intermixed with the program output. In the example output below,
the mpiP messages are the lines tagged with "mpiP:". A sketch of the
communication pattern suggested by this output follows it.
smurf01% array
0:mpiP:
0:mpiP:
0:mpiP: mpiP V2.3 (Build Aug 28 2001/11:55:57)
0:mpiP: Direct questions and errors to Jeffrey Vetter
0:mpiP:
0:MPI task 0 has started...
4:MPI task 4 has started...
1:MPI task 1 has started...
5:MPI task 5 has started...
2:MPI task 2 has started...
6:MPI task 6 has started...
3:MPI task 3 has started...
7:MPI task 7 has started...
0:Initialized array sum = 1.335708e+14
0:Sent 2000000 elements to task 1 offset= 2000000
0:Sent 2000000 elements to task 2 offset= 4000000
0:Sent 2000000 elements to task 3 offset= 6000000
1:Task 1 mysum = 1.196314e+13
0:Sent 2000000 elements to task 4 offset= 8000000
2:Task 2 mysum = 2.023766e+13
0:Sent 2000000 elements to task 5 offset= 10000000
3:Task 3 mysum = 2.784929e+13
0:Sent 2000000 elements to task 6 offset= 12000000
4:Task 4 mysum = 3.593385e+13
0:Sent 2000000 elements to task 7 offset= 14000000
5:Task 5 mysum = 4.467700e+13
6:Task 6 mysum = 5.218496e+13
0:Task 0 mysum = 4.000093e+12
7:Task 7 mysum = 5.942945e+13
0:Sample results:
0: 0.000000e+00 2.000000e+00 4.000000e+00 6.000000e+00 8.000000e+00
0: 4.000000e+06 4.000002e+06 4.000004e+06 4.000006e+06 4.000008e+06
0: 8.000000e+06 8.000002e+06 8.000004e+06 8.000006e+06 8.000008e+06
0: 1.200000e+07 1.200000e+07 1.200000e+07 1.200001e+07 1.200001e+07
0: 1.600000e+07 1.600000e+07 1.600000e+07 1.600001e+07 1.600001e+07
0: 2.000000e+07 2.000000e+07 2.000000e+07 2.000001e+07 2.000001e+07
0: 2.400000e+07 2.400000e+07 2.400000e+07 2.400001e+07 2.400001e+07
0: 2.800000e+07 2.800000e+07 2.800000e+07 2.800001e+07 2.800001e+07
0:*** Final sum= 2.562754e+14 ***
0:mpiP:
0:mpiP: found 21575 symbols in file [array]
0:mpiP:
0:mpiP: Storing mpiP output in [./array.8.43046.mpiP].
0:mpiP:
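For context when reading the mpiP report, the output above suggests a
master/worker pattern: task 0 distributes equal array chunks, every task sums
its own chunk, and the partial sums are collected back on task 0. The sketch
below is only an assumed reconstruction of that pattern - it is not the actual
mpi_array.c, and the array size, message tag, and use of MPI_Reduce for the
final collection are illustrative choices. It is this kind of MPI_Send/MPI_Recv
traffic that mpiP summarizes in its report.
/* Assumed reconstruction of a master/worker array-sum pattern; not the
   actual mpi_array.c.  Sizes, tags, and the final MPI_Reduce are
   illustrative choices only. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define N 16000000                      /* assumed total array size */

int main(int argc, char *argv[])
{
    int        rank, ntasks, i, chunk;
    double     *chunkbuf, mysum = 0.0, total = 0.0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
    chunk = N / ntasks;                 /* assumes N divides evenly */

    chunkbuf = (double *) malloc(chunk * sizeof(double));

    if (rank == 0) {
        /* Master: initialize the full array, send a chunk to each worker,
           and sum the chunk it keeps for itself. */
        double *full = (double *) malloc(N * sizeof(double));
        for (i = 0; i < N; i++)
            full[i] = (double) i;
        for (i = 1; i < ntasks; i++)
            MPI_Send(full + i * chunk, chunk, MPI_DOUBLE, i, 0,
                     MPI_COMM_WORLD);
        for (i = 0; i < chunk; i++)
            mysum += full[i];
        free(full);
    } else {
        /* Worker: receive a chunk from task 0 and sum it. */
        MPI_Recv(chunkbuf, chunk, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        for (i = 0; i < chunk; i++)
            mysum += chunkbuf[i];
        printf("Task %d mysum = %e\n", rank, mysum);
    }

    /* Collect the partial sums on task 0 and print the result. */
    MPI_Reduce(&mysum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("*** Final sum= %e ***\n", total);

    free(chunkbuf);
    MPI_Finalize();
    return 0;
}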
- After the example code completes, review the mpiP report file.
The output file name has the format filename.N.XXXXX.mpiP,
where filename is the executable name, N is the number of MPI tasks, and
XXXXX is the collector task's process id.
- Examine the output file and notice what information is profiled for the
MPI routines. Example C and Fortran mpiP output reports are provided in
your Solutions subdirectory.
- hpmcount
The hpmcount command-line utility launches a specified program, captures run-time hardware performance and utilization information, and provides summary statistics. It is based on the architecture's physical hardware counters and hardware events.
For this exercise, we will use the version of hpmcount that works with the IBM
604e architecture, since the machines we are using are of that type. Note that
using hpmcount on ASCI White differs from what is shown here (but not by much).
- First, set up an alias to use the correct version of the hpmcount utility
for the IBM 604e architecture:
alias hpmcount /usr/local/HPM/hpmcount604
- Compile your untuned code, either C or Fortran, to produce an a.out file:
xlc_r -lm untuned.c
xlf_r untuned.f
- Run your a.out file using hpmcount with the default set of hardware
counters:
hpmcount a.out > untuned.hpmcount.rpt
- Examine your output report. It should resemble that shown below:
adding counter 0 event 15 FPU instructions
adding counter 1 event 6 Data Cache Misses
adding counter 2 event 2 Instructions completed
adding counter 3 event 1 Cycles
Running pipe()...
s = 7.980000e+10
Running unroll()...
s= 3.349600e+10
Running strength()...
s= 8.429910e+04
Running block()...
c[511][511] = 6.684706e+07
hpmcount (V 2.1.0) summary
Total execution time (wall clock time): 116.626656 seconds
######## Resource Usage Statistics ########
Total amount of time in user mode : 116.420000 seconds
Total amount of time in system mode : 0.180000 seconds
Maximum resident set size : 7552 Kbytes
Average shared memory use in text segment : 139884 Kbytes*sec
Average unshared memory use in data segment : 45828508 Kbytes*sec
Number of page faults without I/O activity : 1898
Number of page faults with I/O activity : 0
Number of times process was swapped out : 0
Number of times file system performed INPUT : 0
Number of times file system performed OUTPUT : 0
Number of IPC messages sent : 0
Number of IPC messages received : 0
Number of signals delivered : 0
Number of voluntary context switches : 41
Number of involuntary context switches : 290
####### End of Resource Statistics ########
PM_FPU_CMPL (FPU instructions) : 1735013853
PM_DC_MISS (Data Cache Misses) : 386777959
PM_INST_CMPL (Instructions completed) : 21418551093
PM_CYC (Cycles) : 38050190097
Cycles per instruction : 1.777
Instructions per cycle : 0.563
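The two derived ratios at the bottom of the report are simply quotients of the
raw counters above, as the small C check below illustrates (the counter values
are copied from this untuned report).
#include <stdio.h>

int main(void)
{
    /* Raw counter values copied from the untuned report above */
    double cycles       = 38050190097.0;   /* PM_CYC       */
    double instructions = 21418551093.0;   /* PM_INST_CMPL */

    printf("Cycles per instruction : %.3f\n", cycles / instructions); /* ~1.777 */
    printf("Instructions per cycle : %.3f\n", instructions / cycles); /* ~0.563 */
    return 0;
}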
- Now, compile your tuned version of the code and run hpmcount with it.
xlc_r -lm -lessl tuned.c
xlf_r -lessl tuned.f
then
hpmcount a.out > tuned.hpmcount.rpt
- Compare the two reports (untuned versus tuned) and notice the more efficient
use of the 604e hardware by the tuned code. An
example tuned hpmcount
report is available. In particular, notice:
- Less execution time
- Fewer floating point instructions
- Fewer data cache misses
- Fewer overall instructions required
- Fewer overall cycles used
- Lower cycles per instruction ratio
- Higher instructions per cycle ratio