Blaise Barney, Lawrence Livermore National Laboratory | UCRL-MI-133316
Abstract
An essential prerequisite for optimizing an application is to first understand its execution characteristics. A number of tools are available to the application developer for this purpose, ranging from simple shell utilities, timers, and profilers to trace analysis tools and sophisticated, full-featured graphical toolsets. This tutorial investigates, in varying depths, a number of tools that can be used to analyze an application's performance, with the goals of optimization and troubleshooting in mind. A lab exercise featuring a subset of these tools is provided.
Level/Prerequisites: A basic understanding of parallel programming in C or Fortran is assumed.
Scope and Motivation
Scope of This Tutorial:
The factors that affect performance fall into three categories: application related factors, hardware related factors, and software related factors.
Performance Considerations and Strategies
Procedure | % CPU Time
---|---
main() | 13%
procedure1() | 17%
procedure2() | 20%
procedure3() | 50%
A 20% increase in the performance of procedure3() results in a 10% performance increase overall.
A 20% increase in the performance of main() results in only a 2.6% performance increase overall.
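To see where these numbers come from: if a procedure accounts for a fraction f of total runtime and its time is reduced by a fraction s, the overall runtime shrinks by approximately f x s. For procedure3(), 0.50 x 0.20 = 0.10, a 10% overall improvement; for main(), 0.13 x 0.20 = 0.026, only 2.6%. This is the reasoning behind Amdahl's Law: concentrate tuning effort where the time is actually being spent.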
Timers
Timer | Usage | Wallclock / CPU Time | Resolution (seconds) | Languages | Portable?
---|---|---|---|---|---
time | shell / script | both | 10^-2 / 10^-3 | n/a | yes
timex | shell / script | both | 10^-2 / 10^-3 | n/a | yes
gettimeofday | subroutine | wallclock | 10^-6 | C/C++ | yes
MPI_Wtime | subroutine | wallclock | 10^-6 | C/C++, Fortran | yes
system_clock | subroutine | wallclock | varies | Fortran90 | yes
read_real_time, read_wall_time, time_base_to_time | subroutine | wallclock | 10^-9 | IBM AIX C/C++ | no
rtc | subroutine | wallclock | 10^-6 | IBM Fortran | no
irtc | subroutine | wallclock | 10^-9 | IBM Fortran | no
dtime_ | subroutine | CPU | 10^-2 | IBM Fortran | no
etime_ | subroutine | CPU | 10^-2 | IBM Fortran | no
mclock | subroutine | CPU | 10^-2 | IBM Fortran | no
timef | subroutine | wallclock | 10^-3 | IBM Fortran | no
1.150u 0.020s 0:01.76 66.4% 15+3981k 24+10io 0pf+0w

Explanation (fields from left to right):
1. 1.150u - user CPU time (seconds)
2. 0.020s - system CPU time (seconds)
3. 0:01.76 - elapsed (wallclock) time
4. 66.4% - percentage of CPU used: (user + system) / elapsed
5. 15+3981k - average shared + unshared memory use (Kbytes)
6. 24+10io - block input + output operations
7. 0pf - page faults
8. 0w - swaps
real 0m2.58s
user 0m1.14s
sys  0m0.03s

Explanation:
real - elapsed (wallclock) time
user - CPU time spent in user mode
sys - CPU time spent in system (kernel) mode
Note that much of the timex command's output is not described in the timex man page. Some of it may be understood by reading the sar command man page.
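Usage mirrors the time command. For example (the -s option, which on AIX reports total system activity for the run using sar data, is an assumption worth verifying on your system):

timex ./a.out
timex -s ./a.out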
Example Code Fragment
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>

int main(int argc, char *argv[])
{
    struct timeval start, end;

    gettimeofday(&start, NULL);

    /* do some work */

    gettimeofday(&end, NULL);

    printf("Total time was %ld uSec.\n",
           ((end.tv_sec * 1000000 + end.tv_usec) -
            (start.tv_sec * 1000000 + start.tv_usec)));
    return 0;
}
Example Codes (C and Fortran)
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int n = 0, m;
    double start, end, resolution;

    MPI_Init(&argc, &argv);

    start = MPI_Wtime();          /* start time */
    for (m = 0; m < 2000000; m++)
        n = n + m;
    end = MPI_Wtime();            /* end time */
    resolution = MPI_Wtick();     /* timer resolution */

    printf("Wallclock times(secs): start= %f end= %f \n", start, end);
    printf("elapsed= %e resolution= %e\n", end - start, resolution);

    MPI_Finalize();
    return 0;
}
Example Code Fragment
PROGRAM SYSCLOCK
INTEGER :: T1, T2, max, rate
REAL :: T3

CALL SYSTEM_CLOCK(COUNT=T1)        ! Start timing

! ...Do your calculation here, for example:...
DO M = 1,882000000
   N = N + M
END DO

CALL SYSTEM_CLOCK(COUNT=T2)        ! Stop timing
PRINT *, "Start: ",T1," End: ",T2," Count= ",T2-T1

! Calculate the elapsed time in seconds:
CALL SYSTEM_CLOCK(COUNT_RATE=rate, COUNT_MAX=max)   ! Find the rate and max
PRINT *, "Rate= ",rate," Max= ",max
T3 = (T2-T1) / REAL(rate)
PRINT *, "Elapsed time = ",T3," seconds"
END
Example Code
PROGRAM RTC_TIME
REAL(8) A, B, rtc

A = rtc()
DO M = 1,2000000
   N = N + M
END DO
B = rtc()

PRINT *, 'Seconds elapsed: ',B - A
END
Example Code
PROGRAM IRTC_TIME
INTEGER(8) A, B, irtc

A = irtc()
DO M = 1,2000000
   N = N + M
END DO
B = irtc()

PRINT *, 'Nanoseconds elapsed: ',B - A
END
Example Code
PROGRAM DTIME_TIME
REAL(4) DELTA, dtime_
TYPE TB_TYPE
   SEQUENCE
   REAL(4) USRTIME
   REAL(4) SYSTIME
END TYPE
TYPE (TB_TYPE) DTIME_STRUCT

DELTA = dtime_(DTIME_STRUCT)
DO M = 1,2000000
   N = N + M
END DO
DELTA = dtime_(DTIME_STRUCT)

PRINT *, 'User time: ',DTIME_STRUCT%USRTIME, 'seconds'
PRINT *, 'System time: ',DTIME_STRUCT%SYSTIME, 'seconds'
PRINT *, 'Elapsed time: ',DELTA, 'seconds'
END
Example Code
PROGRAM ETIME_TIME
REAL(4) ELAPSED, etime_
TYPE TB_TYPE
   SEQUENCE
   REAL(4) USRTIME
   REAL(4) SYSTIME
END TYPE
TYPE (TB_TYPE) ETIME_STRUCT

DO M = 1,2000000
   N = N + M
END DO
ELAPSED = etime_(ETIME_STRUCT)

PRINT *, 'User time: ',ETIME_STRUCT%USRTIME, 'seconds'
PRINT *, 'System time: ',ETIME_STRUCT%SYSTIME, 'seconds'
PRINT *, 'Elapsed time: ',ELAPSED, 'seconds'
END
Example Code
PROGRAM MCLOCK_TIME
INTEGER(4) T1, T2, mclock
REAL(4) SECONDS

T1 = mclock()
DO M = 1,2000000
   N = N + M
END DO
T2 = mclock()

SECONDS = REAL(T2 - T1) / 100     ! mclock counts in hundredths of a second
PRINT *, 'Elapsed CPU time: ',SECONDS,' seconds'
END
Example Code
PROGRAM TIMEF_TIME
REAL(8) ELAPSED, timef

ELAPSED = timef()
DO M = 1,2000000
   N = N + M
END DO
ELAPSED = timef()

PRINT *, 'Elapsed time: ',ELAPSED,' milliseconds'
END
Profilers
Sample prof output (abbreviated)
Name               %Time  Seconds  Cumsecs  #Calls  msec/call
.fft                51.8     0.59     0.59    1024      0.576
.main               40.4     0.46     1.05       1    460.
.bit_reverse         7.9     0.09     1.14    1024      0.088
.cos                 0.0     0.00     1.14     256      0.00
.sin                 0.0     0.00     1.14     256      0.00
.catopen             0.0     0.00     1.14       1      0.
.setlocale           0.0     0.00     1.14       1      0.
._doprnt             0.0     0.00     1.14       7      0.
._flsbuf             0.0     0.00     1.14      11      0.0
._xflsbuf            0.0     0.00     1.14       7      0.
._wrtchk             0.0     0.00     1.14       1      0.
._findbuf            0.0     0.00     1.14       1      0.
._xwrite             0.0     0.00     1.14       7      0.
.free                0.0     0.00     1.14       2      0.
.free_y              0.0     0.00     1.14       2      0.
.write               0.0     0.00     1.14       7      0.
.exit                0.0     0.00     1.14       1      0.
.memchr              0.0     0.00     1.14      19      0.0
.atoi                0.0     0.00     1.14       1      0.
.__nl_langinfo_std   0.0     0.00     1.14       4      0.
.gettimeofday        0.0     0.00     1.14       8      0.
.printf              0.0     0.00     1.14       7      0.
prof -m mon.out.0
prof -m mon.out.0 mon.out.1 mon.out.2
prof -m mon.out.*
Note: If you view more than one mon.out file at a time, the results are averaged into a single report.
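For reference, a typical prof workflow looks like the following sketch (assuming an AIX-style compiler, where -p enables prof-compatible profiling and each run writes the mon.out file(s) read above):

cc -p -o myprog myprog.c      # compile and link with profiling enabled
./myprog                      # run: writes mon.out (mon.out.N per task)
prof -m mon.out               # generate the profile report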
Sample gprof output (IBM AIX)
setenv GMON_OUT_PREFIX 'gmon.out'
gprof myprog gmon.out.0
gprof myprog gmon.out.0 gmon.out.1 gmon.out.2
gprof myprog gmon.out.*
Note: If you view more than one gmon.out file at a time, the results are summed into a single report.
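The gprof workflow is analogous: compile and link with -pg, run the program to produce the gmon.out file(s), then invoke gprof as shown above. A sketch:

cc -pg -o myprog myprog.c     # -pg enables gprof-style profiling
./myprog                      # run: writes gmon.out (gmon.out.N per task)
gprof myprog gmon.out         # flat profile plus call graph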
Additional Notes About prof and gprof:
Sample output
AIX Monitor v2.1.9llnl 25Sep2003: up041            Thu Jul 30 13:18:53 2009
Uptime: 2 days, 03:50   Users: 2 of 21 active  21 remote  802:02 sleep time
CPU: User 0.0%  Sys 0.0%  Wait 0.0%  Idle 100.0%           Refresh: 3.43 s
0%          25%          50%          75%          100%
Runnable (Swap-in) processes 0.00 (0.00)   load average: 0.38, 0.21, 0.22
FOR NON-COMMERCIAL USE ONLY   see http://www.mesa.nl/monitor
(MESA Consulting - Unix and Internet Technology Specialists - http://www.mesa.nl)

Memory   Real       Virtual     Paging (4kB)     Process events    File/TTY-IO
free     10034 MB   N/A         91.8 pgfaults     489 pswitch          0 iget
procs    18207 MB   N/A          0.0 pgin         287 syscall          9 namei
files     3245 MB                0.0 pgout         37 read             0 dirblk
total    31488 MB   N/A          0.0 pgsin         11 write        94338 readch
                                 0.0 pgsout         0 fork          2285 writech
                                                    0 exec             0 ttyrawch
                                                    0 rcvint           0 ttycanch
                                                    0 xmtint         244 ttyoutch
                                                    0 mdmint

IO (kB/s)   read  write  busy%     Client  Server  NFS/s     Netw  read write kB/s
hdisk0       0.0    0.0      0       0.0     0.0   calls     en0    0.2   0.1
hdisk1       0.0    0.0      0       0.0     0.0   retry     en1    0.3   1.0
             0.0    0.0      0       0.0     0.0   getattr   en2    7.2   7.7
             0.0    0.0      0       0.0     0.0   lookup    sn0    1.5   1.9
                                     0.0     0.0   read      sn1    9.6   9.5
                                     0.0     0.0   write     en4    0.1   0.0
                                     0.0     0.0   other     ml0    0.0   7.3
                                                             lo0    1.0   1.0
Sample output for monitor command with -smp option
Load averages: 0.14, 0.18, 0.20   up041              Thu Jul 30 13:22:39 2009
Cpu states:  0.1% user  0.6% system  0.0% wait  99.2% idle   For non-commercial
Logged on:   21 users  2 active  21 remote  806:01 sleep time     use only!
Real memory: 18216.7M procs  3244.5M files  10026.8M free  31488.0M total
Virtual memory: N/A used  N/A free  N/A total

CPU         USER        KERN        WAIT       IDLE%  PSW SYSCALL WRITE READ WRITEkb  READkb
#0             0           1           0          98   49    6959     7   35    2.05   90.32
#1             0           0           0         100    0       0     0    0    0.00    0.00
#2             0           0           0         100    0       0     0    0    0.00    0.00
#3   -2147483648 -2147483648 -2147483648 -2147483648    0       0     0    0    0.00    0.00
#4             0           1           0          98   41    8336     6   25    1.00   18.30
#5   -2147483648 -2147483648 -2147483648 -2147483648    0       0     0    0    0.00    0.00
#6             0           0           0         100    0       0     0    0    0.00    0.00
#7   -2147483648 -2147483648 -2147483648 -2147483648    0       0     0    0    0.00    0.00
#8             0           0           0          99   66   11621     2   37    0.20   96.14
#9   -2147483648 -2147483648 -2147483648 -2147483648    0       0     0    0    0.00    0.00
#10            0           0           0         100    0       0     0    0    0.00    0.00
#11  -2147483648 -2147483648 -2147483648 -2147483648    0       0     0    0    0.00    0.00
#12            0           1           0          98  159    3412     0    3    0.09    5.64
#13  -2147483648 -2147483648 -2147483648 -2147483648    0       0     0    0    0.00    0.00
#14            0           0           0         100    0       0     0    0    0.00    0.00
#15  -2147483648 -2147483648 -2147483648 -2147483648    0       0     0    0    0.00    0.00
SUM            0           0           0          99  316   30330    17  101    3.33  210.40

  PID USER  PRI NICE SIZE  RES PFLTS STAT  USER/SYSTIME      CPU%    COMMAND
 1868 root  255   21  40k  40k   0.0  run  0:00/23:18:03   120.0/44.9 Kernel (wait)
 1604 root  255   21  40k  40k   0.0  run  0:00/22:52:49   118.4/44.1 Kernel (wait)
 1076 root  255   21  40k  40k   0.0  run  0:00/22:49:01   116.8/44.0 Kernel (wait)
 1340 root  255   21  40k  40k   0.0  run  0:00/22:46:50   116.8/43.9 Kernel (wait)
  548 root  255   21  40k  40k   0.0  run  0:00/22:20:40   115.2/43.0 Kernel (wait)
  812 root  255   21  40k  40k   0.0  run  0:00/21:52:20   112.0/42.1 Kernel (wait)
  284 root  255   21  40k  40k   0.0  run  0:00/21:31:39   110.4/41.5 Kernel (wait)
13634 root  255   21  40k  40k   0.0  run  0:00/20:52:10   107.2/40.2 Kernel (wait)
13898 root  255   21  40k  40k   0.0  run  0:00/20:45:07   107.2/40.0 Kernel (wait)
13370 root  255   21  40k  40k   0.0  run  0:00/20:43:28   107.2/39.9 Kernel (wait)
14162 root  255   21  40k  40k   0.0  run  0:00/20:29:21   105.6/39.5 Kernel (wait)
12842 root  255   21  40k  40k   0.0  run  0:00/20:19:08   104.0/39.1 Kernel (wait)
13106 root  255   21  40k  40k   0.0  run  0:00/20:07:20   104.0/38.8 Kernel (wait)
12578 root  255   21  40k  40k   0.0  run  0:00/19:56:01   102.4/38.4 Kernel (wait)
26016 root   40   21  16M  14M   0.0  run  1:45:34/ 1:40:09 17.6/ 6.8 mmfsd64
21118 root   60    0 692k 692k   0.0  run  1:46:25/ 0:00     8.0/ 3.4 Kernel (dog)
95602 root   40   21 320k 320k   0.0  run  1:46:06/ 0:00     8.0/ 3.4 Kernel (vsdkp)
29624 root   60    0 292k 292k   0.0  run  25:03/ 0:00       1.6/ 0.8 Kernel (rtcmd)
Sample output
top - 13:27:07 up 13 days, 18:25, 174 users,  load average: 13.37, 12.08, 10.68
Tasks: 1090 total,  15 running, 1068 sleeping,  4 stopped,  3 zombie
Cpu(s): 15.1%us, 4.9%sy, 1.3%ni, 76.8%id, 1.7%wa, 0.1%hi, 0.2%si, 0.0%st
Mem:  32040664k total, 31970212k used,    70452k free,    78464k buffers
Swap:  4096564k total,  3532308k used,   564256k free,  7893264k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
11292 jkyall    25   0 1065m 452m 3280 R 90.2  1.4 165:42.07 mcnpx
23745 jkyall    25   0 1065m 447m 3280 R 90.2  1.4 174:33.67 mcnpx
28090 kuug74    25   0 4826m 4.5g 2764 R 88.5 14.8 214:31.25 nike3d
 7441 s43in     19   0 96260  49m 1504 R 83.4  0.2   2:05.19 metgrid.exe
31310 kuug74    25   0 5344m 4.8g 2764 R 61.3 15.7 204:34.53 nike3d
21057 syyberg   16   0 21004 1072  860 D 39.1  0.0  18:29.97 tar
25645 s43in     18   0  125m  80m 1504 R 37.4  0.3   5:16.51 metgrid.exe
  616 root      11  -5     0    0    0 D 35.7  0.0 163:04.87 kswapd2
  614 root      11  -5     0    0    0 R 32.3  0.0 230:06.93 kswapd0
  617 root      11  -5     0    0    0 D 28.9  0.0 173:51.87 kswapd3
  615 root      11  -5     0    0    0 S 27.2  0.0 159:35.07 kswapd1
12716 neeao     16   0 23168  14m 4172 R 22.1  0.0   0:00.13 cc1plus
10158 h22ringr  15   0  419m  96m  37m R 18.7  0.3   0:04.33 viewer
 5716 root      15   0     0    0    0 S 11.9  0.0 596:52.60 ptlrpcd
12708 blaise    15   0 13396 1932  840 R  8.5  0.0   0:00.08 top
12711 s43in     16   0 40996 3296 2276 S  3.4  0.0   0:00.02 vim
 4361 kuug74    15   0 4091m 3.9g 2764 S  1.7 12.6 210:39.78 nike3d
 4891 root      15   0     0    0    0 S  1.7  0.0  52:02.91 kiblnd_sd_02
10065 h22ringr  15   0  160m  17m  10m S  1.7  0.1   0:01.53 cli
    1 root      15   0 10344  660  604 S  0.0  0.0   1:27.74 init
    2 root      RT  -5     0    0    0 S  0.0  0.0   0:26.31 migration/0
    3 root      34  19     0    0    0 S  0.0  0.0   0:00.34 ksoftirqd/0
    4 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/0
    5 root      RT  -5     0    0    0 S  0.0  0.0   0:11.74 migration/1
    6 root      34  19     0    0    0 S  0.0  0.0   0:00.90 ksoftirqd/1
    7 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/1
    8 root      RT  -5     0    0    0 S  0.0  0.0   0:36.14 migration/2
    9 root      34  19     0    0    0 R  0.0  0.0   0:01.64 ksoftirqd/2
   10 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/2
   11 root      RT  -5     0    0    0 S  0.0  0.0   0:10.40 migration/3
Overview:
Example Displays and Reports:
Opening display of the entire call tree. The initial display is fully zoomed-out, showing the main library clusters, calling arc summaries, and overall routine control flow.
Unclustered functions view. Same as the previous display, but functions are not grouped within their library cluster.
Zoom-in view of the main program. The smaller "Overview Window" allows easy positioning of the desired view area. The zoomed-in area provides function details.
Collapsed library clusters. Zoomed-out view. Demonstrates how undesired information can be removed/summarized.
Flat Profile Report. Similar to the same report produced by the gprof utility. Lists functions according to CPU usage.
Source Code Report showing accumulated ticks for code statement lines.
Call Graph Profile. Similar to the same report produced by the gprof utility.
Library Statistics. Shows execution statistics for the libraries called.
Using xprofiler:
Note: when you compile and link separately, you must use the -pg option with both the compile and link commands.
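For example, a separate compile and link might look like this sketch (using the mpcc_r driver that appears later in this tutorial; any compiler driver is handled the same way):

mpcc_r -pg -c myprog.c        # compile with -pg
mpcc_r -pg -o myprog myprog.o # link with -pg as well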
xprofiler | Starts without a file loaded. Load a file by using xprofiler's File pull-down menu.
xprofiler myprog gmon.out | Loads the serial program with its stat file
xprofiler myprog gmon.out.N | Loads a parallel program with a selected stat file
xprofiler myprog gmon.out.* | Loads a parallel program with combined/merged stat files
Note that there are also several command line flags available to define certain xprofiler characteristics and behaviors.
Important note: it is often necessary to uncluster functions and zoom in to get at important detailed information. It is also usually useful to collapse/hide library information that isn't needed (such as system libraries).
Documentation:
mpcc_r -g -o myprog myprog.c -L/usr/local/tools/mpiP/lib -lmpiP -lbfd -liberty -lintl -lm
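mpiP is built on the standard MPI profiling (PMPI) layer, so no source code changes are required - relinking against the library as shown is sufficient. Compiling with -g is what allows mpiP to map each callsite back to a source file and line number in the reports that follow.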
Environment Information Section
@ mpiP
@ Command : sphot
@ Version              : 0.9
@ Build date           : Mar 8 2001, 16:22:46
@ Start time           : 2001 04 11 16:04:23
@ Stop time            : 2001 04 11 16:04:51
@ Number of tasks      : 4
@ Collector Rank       : 0
@ Collector PID        : 30088
@ Event Buffer Size    : 100000
@ Final Trace Dir      : .
@ Local Trace Dir      : /usr/tmp
@ Task Map             : 0 blue333.pacific.llnl.gov 0
@ Task Map             : 1 blue334.pacific.llnl.gov 0
@ Task Map             : 2 blue335.pacific.llnl.gov 0
@ Task Map             : 3 blue336.pacific.llnl.gov 0
MPI Time Per Task
-------------------------------------------------------------------------------
@--- MPI Time (seconds) -------------------------------------------------------
-------------------------------------------------------------------------------
Task    AppTime    MPITime     MPI%
   0       27.9       7.18    25.73
   1       27.9        7.5    26.89
   2       27.9       7.78    27.90
   3       27.9       7.73    27.72
   *        112       30.2    27.06
Callsites
-------------------------------------------------------------------------------
@--- Callsites: 38 ------------------------------------------------------------
-------------------------------------------------------------------------------
ID  MPICall    ParentFunction    Filename     Line  PC
 1  Barrier    copyglob          copyglob.f     65  10000b9c
 2  Barrier    copypriv@OL@1     copypriv.f    195  10001cd4
 3  Barrier    copypriv@OL@2     copypriv.f    237  1000213c
 4  Barrier    copypriv@OL@3     copypriv.f    279  10002624
 5  Barrier    copypriv@OL@4     copypriv.f    324  10002b04
 6  Barrier    sphot             sphot.f       269  10008f2c
 7  Bcast      rdopac            rdopac.f       49  10008638
 8  Comm_rank  copyglob          copyglob.f     13  100003a8
 9  Comm_rank  copypriv          copypriv.f     75  10000c38
10  Comm_rank  genxsec           genxsec.f      37  1000503c
11  Comm_rank  rdinput           rdinput.f      17  100071d4
12  Comm_rank  rdopac            rdopac.f       29  1000806c
13  Comm_rank  sphot             sphot.f        67  10008a04
14  Comm_size  copyglob          copyglob.f     12  10000390
15  Comm_size  copypriv          copypriv.f     76  10000c50
16  Comm_size  sphot             sphot.f        68  10008a1c
17  Finalize   sphot             sphot.f       279  10008f60
18  Init       sphot             sphot.f        66  100089e8
19  Irecv      copyglob          copyglob.f     47  10000a70
20  Irecv      copypriv@OL@1     copypriv.f    162  1000184c
21  Irecv      copypriv@OL@2     copypriv.f    205  10001db4
22  Irecv      copypriv@OL@3     copypriv.f    254  1000221c
23  Irecv      copypriv@OL@4     copypriv.f    297  10002704
24  Irecv      sphot@OL@1@OL@2   sphot.f       201  1000974c
25  Reduce     sphot             sphot.f       258  10008eac
26  Reduce     sphot             sphot.f       262  10008eec
27  Send       copyglob          copyglob.f     60  10000b78
28  Send       copypriv@OL@1     copypriv.f    191  10001cb8
29  Send       copypriv@OL@2     copypriv.f    233  10002120
30  Send       copypriv@OL@3     copypriv.f    275  10002608
31  Send       copypriv@OL@4     copypriv.f    320  10002ae8
32  Send       sphot@OL@1@OL@3   sphot.f       234  100099d8
33  Waitall    copyglob          copyglob.f     51  10000ac4
34  Waitall    copypriv@OL@1     copypriv.f    168  10001928
35  Waitall    copypriv@OL@2     copypriv.f    212  10001e94
36  Waitall    copypriv@OL@3     copypriv.f    260  100022fc
37  Waitall    copypriv@OL@4     copypriv.f    304  100027e4
38  Waitall    sphot@OL@1@OL@2   sphot.f       206  100097b4
Aggregate Times of Top 20 Callsites
-------------------------------------------------------------------------------
@--- Aggregate Time (top twenty, descending, milliseconds) --------------------
-------------------------------------------------------------------------------
Call        Site      Time    App%    MPI%
Bcast          7  1.54e+04   13.79   50.95
Barrier        1  1.42e+04   12.73   47.03
Barrier        2       563    0.50    1.87
Waitall       34      25.7    0.02    0.09
Reduce        25       7.4    0.01    0.02
Barrier        5      2.54    0.00    0.01
Barrier        6      1.55    0.00    0.01
Barrier        4      1.44    0.00    0.00
Comm_rank     13      1.22    0.00    0.00
Barrier        3      1.01    0.00    0.00
Comm_rank      9     0.967    0.00    0.00
Send          27     0.755    0.00    0.00
Send          31     0.694    0.00    0.00
Waitall       37      0.42    0.00    0.00
Send          28     0.336    0.00    0.00
Waitall       35      0.21    0.00    0.00
Waitall       36     0.202    0.00    0.00
Waitall       38       0.2    0.00    0.00
Irecv         19       0.2    0.00    0.00
Reduce        26     0.185    0.00    0.00
Waitall       33     0.161    0.00    0.00
Comm_size     15     0.132    0.00    0.00
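In this report, App% is a callsite's aggregate time expressed as a percentage of total application time, and MPI% as a percentage of total MPI time. Checking the top entry against the per-task table above: Bcast site 7 accounts for 1.54e+04 ms = 15.4 s, so App% = 15.4 / 112 = 13.8% and MPI% = 15.4 / 30.2 = 51%, matching the reported 13.79 and 50.95.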
Callsite Statistics
-------------------------------------------------------------------------------
@--- Callsite statistics (all, milliseconds): 102 -----------------------------
-------------------------------------------------------------------------------
Name      Site  Rank  Count       Max      Mean       Min   App%   MPI%
Barrier      1     0      1     0.087     0.087     0.087   0.00   0.00
Barrier      1     1      1      12.7      12.7      12.7   0.05   0.17
Barrier      1     2      1  7.09e+03  7.09e+03  7.09e+03  25.44  91.17
Barrier      1     3      1  7.09e+03  7.09e+03  7.09e+03  25.44  91.75
Barrier      1     *      4  7.09e+03  3.55e+03     0.087  12.73  47.03
Barrier      2     0      1      0.12      0.12      0.12   0.00   0.00
Barrier      2     1      1      0.29      0.29      0.29   0.00   0.00
Barrier      2     2      1       307       307       307   1.10   3.95
Barrier      2     3      1       255       255       255   0.92   3.31
Barrier      2     *      4       307       141      0.12   0.50   1.87
..... [snip] .....
Send        31     1      1     0.169     0.169     0.169   0.00   0.00
Send        31     2      1     0.341     0.341     0.341   0.00   0.00
Send        31     3      1     0.184     0.184     0.184   0.00   0.00
Send        31     *      3     0.341     0.231     0.169   0.00   0.00
Send        32     1      1     0.027     0.027     0.027   0.00   0.00
Send        32     2      1     0.042     0.042     0.042   0.00   0.00
Send        32     3      1     0.043     0.043     0.043   0.00   0.00
Send        32     *      3     0.043    0.0373     0.027   0.00   0.00
Waitall     33     0      1     0.161     0.161     0.161   0.00   0.00
Waitall     33     *      1     0.161     0.161     0.161   0.00   0.00
Waitall     34     0      1      25.7      25.7      25.7   0.09   0.36
Waitall     34     *      1      25.7      25.7      25.7   0.02   0.09
Waitall     35     0      1      0.21      0.21      0.21   0.00   0.00
Waitall     35     *      1      0.21      0.21      0.21   0.00   0.00
Waitall     36     0      1     0.202     0.202     0.202   0.00   0.00
Waitall     36     *      1     0.202     0.202     0.202   0.00   0.00
Waitall     37     0      1      0.42      0.42      0.42   0.00   0.01
Waitall     37     *      1      0.42      0.42      0.42   0.00   0.00
Waitall     38     0      1       0.2       0.2       0.2   0.00   0.00
Waitall     38     *      1       0.2       0.2       0.2   0.00   0.00
Performance Analysis Tools
Event Set 1 | Event Set 2 | Event Set 3 | Event Set 4 |
---|---|---|---|
Cycles | Cycles | Cycles | Cycles |
Inst. Completed | Inst. Completed | Inst. Completed | TLB Misses |
TLB Misses | TLB Misses | Inst. Cache Misses | Loads Completed |
Stores Completed | Stores Dispatched | FXU0 Operations | Stores Completed |
Loads Completed | L1 Store Misses | FXU1 Operations | L2 Load Misses |
FPU0 Operations | Loads Dispatched | FXU2 Operations | L2 Store Misses |
FPU1 Operations | L1 Load Misses | FPU0 Operations | Branches |
FMAs Executed | Load/Store Unit Idle | FPU1 Operations | Branches Mispredicted |
Using hpmcount:
hpmcount [-h] [-o <filename>] [-s <set>] [-e ev[,ev]*] program

where:
-h | Prints help message
-o <filename> | Sets the output file name. For parallel jobs, each process gets a unique output file.
-s <set> | Defines one of four default event sets to measure. Event set 1 is the default.
-e ev[,ev]* | Manually defines a list of hardware counters to monitor. (Only for the truly adventurous.)
Sample hpmcount output
% hpmcount a.out -s 1
Running pipe()...      s=  80200000000.0000000
Running unroll()...    s=  51840000160.0000000
Running strength()...  s=  84147.0984807570785
Running block()...     c(N,N) =  67239936.0000000000
adding counter 5 event 12 Cycles
adding counter 0 event 1 Instructions completed
adding counter 7 event 0 TLB misses
adding counter 2 event 9 Stores completed
adding counter 3 event 5 Loads completed
adding counter 4 event 5 FPU 0 instructions
adding counter 1 event 35 FPU 1 instructions
adding counter 6 event 9 FMAs executed

hpmcount (V 2.3.1) summary
Total execution time (wall clock time): 85.366748 seconds

######## Resource Usage Statistics ########
Total amount of time in user mode            : 75.190000 seconds
Total amount of time in system mode          : 0.090000 seconds
Maximum resident set size                    : 7648 Kbytes
Average shared memory use in text segment    : 120432 Kbytes*sec
Average unshared memory use in data segment  : 26017908 Kbytes*sec
Number of page faults without I/O activity   : 1912
Number of page faults with I/O activity      : 3
Number of times process was swapped out      : 0
Number of times file system performed INPUT  : 0
Number of times file system performed OUTPUT : 0
Number of IPC messages sent                  : 0
Number of IPC messages received              : 0
Number of signals delivered                  : 0
Number of voluntary context switches         : 47
Number of involuntary context switches       : 1321
####### End of Resource Statistics ########

PM_CYC (Cycles)                        : 27845219686
PM_INST_CMPL (Instructions completed)  : 20517718852
PM_TLB_MISS (TLB misses)               : 138272209
PM_ST_CMPL (Stores completed)          : 1722228911
PM_LD_CMPL (Loads completed)           : 6770324612
PM_FPU0_CMPL (FPU 0 instructions)      : 1485346931
PM_FPU1_CMPL (FPU 1 instructions)      : 200209450
PM_EXEC_FMA (FMAs executed)            : 1259225176

Utilization rate                       : 86.968 %
Avg number of loads per TLB miss       : 48.964
Load and store operations              : 8492.554 M
Instructions per load/store            : 2.416
MIPS                                   : 240.348
Instructions per cycle                 : 0.737
HW Float points instructions per Cycle : 0.061
Floating point instructions + FMAs     : 2944.782 M
Float point instructions + FMA rate    : 34.496 Mflip/s
FMA percentage                         : 85.522 %
Computation intensity                  : 0.347
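The derived metrics at the bottom of the report follow directly from the raw counters and the timing data. For example: Instructions per cycle = PM_INST_CMPL / PM_CYC = 20517718852 / 27845219686 = 0.737, and MIPS = PM_INST_CMPL / (wallclock time x 10^6) = 20517718852 / (85.366748 x 10^6) = 240.3, both matching the reported values.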
Two output files are produced for each task:
perfhpm[taskID].[pid] - contains a text version of the performance data
hpm[taskID]_[progName]_[pid].viz - contains data for the hpmviz program
For example, a 4 task run of an executable called "swim" produces:
-rw------- 1 blaise blaise 31840 Jul 18 18:38 hpm0000_swim_53912.viz
-rw------- 1 blaise blaise 31832 Jul 18 18:38 hpm0001_swim_53042.viz
-rw------- 1 blaise blaise 31840 Jul 18 18:38 hpm0002_swim_51346.viz
-rw------- 1 blaise blaise 31840 Jul 18 18:38 hpm0003_swim_56720.viz
-rw------- 1 blaise blaise 52662 Jul 18 18:38 perfhpm0000.53912
-rw------- 1 blaise blaise 52633 Jul 18 18:38 perfhpm0001.53042
-rw------- 1 blaise blaise 52684 Jul 18 18:38 perfhpm0002.51346
-rw------- 1 blaise blaise 52673 Jul 18 18:38 perfhpm0003.56720
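For orientation, here is a minimal sketch of how a code region is instrumented with the HPM Toolkit's libhpm calls. The header name and exact signatures are assumptions to be checked against the product documentation:

#include "libhpm.h"    /* assumed header name for the libhpm API */

int main(void)
{
    int taskID = 0;                /* typically the MPI rank in parallel codes */

    hpmInit(taskID, "myprog");     /* initialize the library */

    hpmStart(1, "work loop");      /* begin instrumented section 1 */
    /* ... code region to be counted ... */
    hpmStop(1);                    /* end instrumented section 1 */

    hpmTerminate(taskID);          /* write the perfhpm and .viz output files */
    return 0;
}

Each hpmStart/hpmStop pair is reported as a separate instrumented section, which is how per-section data ends up in the .viz files listed above.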
Example libhpm report
hpmviz hpm0003_swim_56720.viz
pct - starts the GUI (graphical mode)
pct -c - starts the command line interface
pct -c -s - starts the command line interface, reading commands from a script file
pvt - starts the GUI
pvt filename(s) - starts the GUI with one or more input files
pvt -c - starts the command line interface
pvt -c filename(s) - starts the command line interface with one or more input files
IBM Parallel Environment for AIX: Operation and Use Volume 2 Tools Reference. Complete product documentation.
Overview:
Example Displays:
Performance Toolbox monitor view 1
Performance Toolbox monitor view 2
Performance Toolbox 3D monitor view 1: multiple hosts
Performance Toolbox 3D monitor view 2: single host
Using DPCL:
Documentation:
vmstat [options] n [m]

where n = interval in seconds and m = number of reports
Sample output
% vmstat 2 3
kthr     memory             page              faults        cpu
----- ----------- ------------------------ ------------ -----------
 r  b    avm     fre  re pi po fr sr cy   in   sy   cs us sy id wa
 2  1  504038 3489832  0  0  0  0  0  0 2760 8705 1730  4  3 92  0
 1  0  503080 3490975  0  0  0  0  0  0 5676 8619 4277  0 15 85  0
 0  0  503073 3490982  0  0  0  0  0  0 1937 2707  232  0  0 99  0

* 1 page = 4096 bytes
r = kernel threads placed on the run queue
b = kernel threads blocked
avm = active virtual memory pages
fre = free memory pages
re = pager input/output list
pi = page-ins from paging space
po = page-outs to paging space
fr = pages freed
sr = pages scanned
cy = scan cycles
in = device interrupts
sy = system calls
cs = context switches
us = % user cpu utilization
sy = % system cpu utilization
id = % cpu idle time
wa = % cpu time waiting on I/O
netstat [options] n

where n = interval in seconds
Sample output
% netstat -i      (shows state of all configured interfaces)
Name  Mtu   Network  Address            Ipkts      Ierrs  Opkts      Oerrs  Coll
en0   1500  link#2   0.4.ac.ec.11.e     319849     0      2787430    0      0
en0   1500  134.9.46 efrost067          319849     0      2787430    0      0
en1   1500  link#3   0.4.ac.7c.d6.3a    60416480   0      119342413  0      0
en1   1500  134.9.45 frost067-ge0.llnl  60416480   0      119342413  0      0
en2   1500  link#4   0.4.ac.7c.d3.1d    111043029  0      131259660  3      0
en2   1500  134.9.6  frost067-ge1.llnl  111043029  0      131259660  3      0
en3   9000  link#5   0.4.ac.7c.d6.43    12557524   0      14013408   0      0
en3   9000  134.9.60 frost067-ge2.llnl  12557524   0      14013408   0      0
en4   9000  link#6   0.4.ac.7c.d5.37    12555933   0      13775758   0      0
en4   9000  134.9.61 frost067-ge3.llnl  12555933   0      13775758   0      0
en5   9000  link#7   0.4.ac.7c.d5.36    12711192   0      13896158   0      0
en5   9000  134.9.62 frost067-ge4.llnl  12711192   0      13896158   0      0
en6   9000  link#8   0.4.ac.7c.d5.21    12408381   0      12936330   0      0
en6   9000  134.9.63 frost067-ge5.llnl  12408381   0      12936330   0      0
css0  65504 link#9                      20162383   0      47045300   0      0
css0  65504 134.9.48 frost067-css0      20162383   0      47045300   0      0
css1  65504 link#10                     20155508   0      47041708   0      0
css1  65504 134.9.49 frost067-css1      20155508   0      47041708   0      0
ml0   65504 link#11                     0          0      75604311   4367   0
ml0   65504 134.9.47 frost067           0          0      75604311   4367   0
ml0   65504 134.9.47 frostgw            0          0      75604311   4367   0
lo0   16896 link#1                      1902914    0      1903901    0      0
lo0   16896 127      localhost          1902914    0      1903901    0      0
lo0   16896 ::1                         1902914    0      1903901    0      0
iostat [options] n [m]

where n = interval in seconds and m = number of reports
Sample output
% iostat 10 3
tty:    tin    tout   avg-cpu:  % user  % sys  % idle  % iowait
        1.9   569.4              4.1     3.4    92.2      0.3

Disks:      % tm_act   Kbps   tps   Kb_read  Kb_wrtn
hdisk0         1.2     79.3   6.2    497100  9818957
hdisk1         0.4     56.7   1.8     48840  7328280

tty:    tin    tout   avg-cpu:  % user  % sys  % idle  % iowait
        0.3  1183.6              0.4     0.2    99.3      0.0

Disks:      % tm_act   Kbps   tps   Kb_read  Kb_wrtn
hdisk0         0.0      1.6   0.1         0       16
hdisk1         0.0      1.6   0.1         0       16

tty:    tin    tout   avg-cpu:  % user  % sys  % idle  % iowait
        0.1   418.2              0.1     0.7    99.2      0.0

Disks:      % tm_act   Kbps   tps   Kb_read  Kb_wrtn
hdisk0         3.9     67.7  13.0         0      677
hdisk1         0.2     10.4   1.5         0      104
Sample output
% ps ux
USER      PID   %CPU  %MEM   SZ  RSS  TTY     STAT  STIME     TIME  COMMAND
userjoe   41728 62.9   0.0  352  396  pts/15  A     18:39:14  5:02  a.out
userjoe   41728 22.3  33.0  412  478  pts/15  A     18:39:14  5:02  b.out
userjoe   41502  0.0   0.0  696  828  pts/15  A     17:25:27  0:03  -tcsh
userjoe   13334  0.0   0.0  200  252  pts/15  A     18:40:14  0:00  ps ux
mpi_trace:
MPIMap:
And More...
This completes the tutorial.
Please complete the online evaluation form - unless you are doing the exercise, in which case please complete it at the end of the exercise.
Where would you like to go now?
References and More Information