Blaise Barney, Lawrence Livermore National Laboratory | UCRL-MI-133316
Abstract
An essential prerequisite for optimizing an application is to first understand its execution characteristics. A number of tools are available to the application developer for this purpose, ranging from simple shell utilities, timers, and profilers to trace analysis tools and sophisticated, full-featured graphical toolsets. This tutorial investigates, in varying depths, a number of tools that can be used to analyze an application's performance, with the goals of optimization and troubleshooting in mind. A lab exercise featuring a subset of these tools is provided.
Level/Prerequisites: A basic understanding of parallel programming in C or Fortran is assumed.
Scope and Motivation
Scope of This Tutorial:
The factors that affect performance fall into three categories: application related factors, hardware related factors, and software related factors.
Performance Considerations and Strategies
Procedure | % CPU Time
---|---
main() | 13%
procedure1() | 17%
procedure2() | 20%
procedure3() | 50%
A 20% increase in the performance of procedure3() results in a 10% performance increase overall.
A 20% increase in the performance of main() results in only a 2.6% performance increase overall.
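To see where these numbers come from: if a procedure accounts for a fraction f of total runtime and its time is reduced by a fraction s, the overall runtime shrinks by approximately f x s. For procedure3(), 0.50 x 0.20 = 0.10, a 10% overall improvement; for main(), 0.13 x 0.20 = 0.026, only 2.6%. This is the reasoning behind Amdahl's Law: concentrate tuning effort where the time is actually being spent.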
Timers
Timer | Usage | Wallclock / CPU Time | Resolution (seconds) | Languages | Portable?
---|---|---|---|---|---
time | shell / script | both | 10^-2 / 10^-3 | n/a | yes
timex | shell / script | both | 10^-2 / 10^-3 | n/a | yes
gettimeofday | subroutine | wallclock | 10^-6 | C/C++ | yes
MPI_Wtime | subroutine | wallclock | 10^-6 | C/C++, Fortran | yes
system_clock | subroutine | wallclock | varies | Fortran90 | yes
read_real_time, read_wall_time, time_base_to_time | subroutine | wallclock | 10^-9 | IBM AIX C/C++ | no
rtc | subroutine | wallclock | 10^-6 | IBM Fortran | no
irtc | subroutine | wallclock | 10^-9 | IBM Fortran | no
dtime_ | subroutine | CPU | 10^-2 | IBM Fortran | no
etime_ | subroutine | CPU | 10^-2 | IBM Fortran | no
mclock | subroutine | CPU | 10^-2 | IBM Fortran | no
timef | subroutine | wallclock | 10^-3 | IBM Fortran | no
1.150u 0.020s 0:01.76 66.4% 15+3981k 24+10io 0pf+0w

Explanation (fields from left to right):
1. 1.150u - user CPU time (seconds)
2. 0.020s - system CPU time (seconds)
3. 0:01.76 - elapsed (wallclock) time
4. 66.4% - percentage of CPU used: (user + system) / elapsed
5. 15+3981k - average shared + unshared memory use (Kbytes)
6. 24+10io - block input + output operations
7. 0pf - page faults
8. 0w - swaps
real 0m2.58s
user 0m1.14s
sys  0m0.03s

Explanation:
real - elapsed (wallclock) time
user - CPU time spent in user mode
sys - CPU time spent in system (kernel) mode
Note that much of the timex command's output is not described in the timex man page. Some of it may be understood by reading the sar command man page.
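Usage mirrors the time command. For example (the -s option, which on AIX reports total system activity for the run using sar data, is an assumption worth verifying on your system):

timex ./a.out
timex -s ./a.out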
Example Code Fragment
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>

int main(int argc, char *argv[])
{
    struct timeval start, end;

    gettimeofday(&start, NULL);

    /* do some work */

    gettimeofday(&end, NULL);

    printf("Total time was %ld uSec.\n",
           ((end.tv_sec * 1000000 + end.tv_usec) -
            (start.tv_sec * 1000000 + start.tv_usec)));
    return 0;
}
Example Codes (C and Fortran)
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int n = 0, m;
    double start, end, resolution;

    MPI_Init(&argc, &argv);

    start = MPI_Wtime();          /* start time */
    for (m = 0; m < 2000000; m++)
        n = n + m;
    end = MPI_Wtime();            /* end time */
    resolution = MPI_Wtick();     /* timer resolution */

    printf("Wallclock times(secs): start= %f end= %f \n", start, end);
    printf("elapsed= %e resolution= %e\n", end - start, resolution);

    MPI_Finalize();
    return 0;
}
Example Code Fragment
PROGRAM SYSCLOCK
INTEGER :: T1, T2, max, rate
REAL :: T3

CALL SYSTEM_CLOCK(COUNT=T1)        ! Start timing

! ...Do your calculation here, for example:...
DO M = 1,882000000
   N = N + M
END DO

CALL SYSTEM_CLOCK(COUNT=T2)        ! Stop timing
PRINT *, "Start: ",T1," End: ",T2," Count= ",T2-T1

! Calculate the elapsed time in seconds:
CALL SYSTEM_CLOCK(COUNT_RATE=rate, COUNT_MAX=max)   ! Find the rate and max
PRINT *, "Rate= ",rate," Max= ",max
T3 = (T2-T1) / REAL(rate)
PRINT *, "Elapsed time = ",T3," seconds"
END
Example Code
PROGRAM RTC_TIME
REAL(8) A, B, rtc

A = rtc()
DO M = 1,2000000
   N = N + M
END DO
B = rtc()

PRINT *, 'Seconds elapsed: ',B - A
END
Example Code
PROGRAM IRTC_TIME
INTEGER(8) A, B, irtc

A = irtc()
DO M = 1,2000000
   N = N + M
END DO
B = irtc()

PRINT *, 'Nanoseconds elapsed: ',B - A
END
Example Code
PROGRAM DTIME_TIME
REAL(4) DELTA, dtime_
TYPE TB_TYPE
   SEQUENCE
   REAL(4) USRTIME
   REAL(4) SYSTIME
END TYPE
TYPE (TB_TYPE) DTIME_STRUCT

DELTA = dtime_(DTIME_STRUCT)
DO M = 1,2000000
   N = N + M
END DO
DELTA = dtime_(DTIME_STRUCT)

PRINT *, 'User time: ',DTIME_STRUCT%USRTIME, 'seconds'
PRINT *, 'System time: ',DTIME_STRUCT%SYSTIME, 'seconds'
PRINT *, 'Elapsed time: ',DELTA, 'seconds'
END
Example Code
PROGRAM ETIME_TIME
REAL(4) ELAPSED, etime_
TYPE TB_TYPE
   SEQUENCE
   REAL(4) USRTIME
   REAL(4) SYSTIME
END TYPE
TYPE (TB_TYPE) ETIME_STRUCT

DO M = 1,2000000
   N = N + M
END DO
ELAPSED = etime_(ETIME_STRUCT)

PRINT *, 'User time: ',ETIME_STRUCT%USRTIME, 'seconds'
PRINT *, 'System time: ',ETIME_STRUCT%SYSTIME, 'seconds'
PRINT *, 'Elapsed time: ',ELAPSED, 'seconds'
END
Example Code
PROGRAM MCLOCK_TIME
INTEGER(4) T1, T2, mclock
REAL(4) SECONDS

T1 = mclock()
DO M = 1,2000000
   N = N + M
END DO
T2 = mclock()

SECONDS = REAL(T2 - T1) / 100     ! mclock counts in hundredths of a second
PRINT *, 'Elapsed CPU time: ',SECONDS,' seconds'
END
Example Code
PROGRAM TIMEF_TIME
REAL(8) ELAPSED, timef

ELAPSED = timef()
DO M = 1,2000000
   N = N + M
END DO
ELAPSED = timef()

PRINT *, 'Elapsed time: ',ELAPSED,' milliseconds'
END
Profilers
Sample prof output (abbreviated)
Name               %Time  Seconds  Cumsecs  #Calls  msec/call
.fft                51.8     0.59     0.59    1024      0.576
.main               40.4     0.46     1.05       1    460.
.bit_reverse         7.9     0.09     1.14    1024      0.088
.cos                 0.0     0.00     1.14     256      0.00
.sin                 0.0     0.00     1.14     256      0.00
.catopen             0.0     0.00     1.14       1      0.
.setlocale           0.0     0.00     1.14       1      0.
._doprnt             0.0     0.00     1.14       7      0.
._flsbuf             0.0     0.00     1.14      11      0.0
._xflsbuf            0.0     0.00     1.14       7      0.
._wrtchk             0.0     0.00     1.14       1      0.
._findbuf            0.0     0.00     1.14       1      0.
._xwrite             0.0     0.00     1.14       7      0.
.free                0.0     0.00     1.14       2      0.
.free_y              0.0     0.00     1.14       2      0.
.write               0.0     0.00     1.14       7      0.
.exit                0.0     0.00     1.14       1      0.
.memchr              0.0     0.00     1.14      19      0.0
.atoi                0.0     0.00     1.14       1      0.
.__nl_langinfo_std   0.0     0.00     1.14       4      0.
.gettimeofday        0.0     0.00     1.14       8      0.
.printf              0.0     0.00     1.14       7      0.
prof -m mon.out.0
prof -m mon.out.0 mon.out.1 mon.out.2
prof -m mon.out.*
Note: If you view more than one mon.out file at a time, the results are averaged into a single report.
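For reference, a typical prof workflow looks like the following sketch (assuming an AIX-style compiler, where -p enables prof-compatible profiling and each run writes the mon.out file(s) read above):

cc -p -o myprog myprog.c      # compile and link with profiling enabled
./myprog                      # run: writes mon.out (mon.out.N per task)
prof -m mon.out               # generate the profile report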
Sample gprof output (IBM AIX)
setenv GMON_OUT_PREFIX 'gmon.out'
gprof myprog gmon.out.0
gprof myprog gmon.out.0 gmon.out.1 gmon.out.2
gprof myprog gmon.out.*
Note: If you view more than one gmon.out file at a time, the results are summed into a single report.
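The gprof workflow is analogous: compile and link with -pg, run the program to produce the gmon.out file(s), then invoke gprof as shown above. A sketch:

cc -pg -o myprog myprog.c     # -pg enables gprof-style profiling
./myprog                      # run: writes gmon.out (gmon.out.N per task)
gprof myprog gmon.out         # flat profile plus call graph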
Additional Notes About prof and gprof:
Sample output
AIX Monitor v2.1.9llnl 25Sep2003: up041            Thu Jul 30 13:18:53 2009
Uptime: 2 days, 03:50   Users: 2 of 21 active  21 remote  802:02 sleep time
CPU: User 0.0%  Sys 0.0%  Wait 0.0%  Idle 100.0%           Refresh: 3.43 s
0%          25%          50%          75%          100%
Runnable (Swap-in) processes 0.00 (0.00)   load average: 0.38, 0.21, 0.22
FOR NON-COMMERCIAL USE ONLY   see http://www.mesa.nl/monitor
(MESA Consulting - Unix and Internet Technology Specialists - http://www.mesa.nl)

Memory   Real       Virtual     Paging (4kB)     Process events    File/TTY-IO
free     10034 MB   N/A         91.8 pgfaults     489 pswitch          0 iget
procs    18207 MB   N/A          0.0 pgin         287 syscall          9 namei
files     3245 MB                0.0 pgout         37 read             0 dirblk
total    31488 MB   N/A          0.0 pgsin         11 write        94338 readch
                                 0.0 pgsout         0 fork          2285 writech
                                                    0 exec             0 ttyrawch
                                                    0 rcvint           0 ttycanch
                                                    0 xmtint         244 ttyoutch
                                                    0 mdmint

IO (kB/s)   read  write  busy%     Client  Server  NFS/s     Netw  read write kB/s
hdisk0       0.0    0.0      0       0.0     0.0   calls     en0    0.2   0.1
hdisk1       0.0    0.0      0       0.0     0.0   retry     en1    0.3   1.0
             0.0    0.0      0       0.0     0.0   getattr   en2    7.2   7.7
             0.0    0.0      0       0.0     0.0   lookup    sn0    1.5   1.9
                                     0.0     0.0   read      sn1    9.6   9.5
                                     0.0     0.0   write     en4    0.1   0.0
                                     0.0     0.0   other     ml0    0.0   7.3
                                                             lo0    1.0   1.0
Sample output for monitor command with -smp option
Load averages: 0.14, 0.18, 0.20   up041              Thu Jul 30 13:22:39 2009
Cpu states:  0.1% user  0.6% system  0.0% wait  99.2% idle   For non-commercial
Logged on:   21 users  2 active  21 remote  806:01 sleep time     use only!
Real memory: 18216.7M procs  3244.5M files  10026.8M free  31488.0M total
Virtual memory: N/A used  N/A free  N/A total

CPU         USER        KERN        WAIT       IDLE%  PSW SYSCALL WRITE READ WRITEkb  READkb
#0             0           1           0          98   49    6959     7   35    2.05   90.32
#1             0           0           0         100    0       0     0    0    0.00    0.00
#2             0           0           0         100    0       0     0    0    0.00    0.00
#3   -2147483648 -2147483648 -2147483648 -2147483648    0       0     0    0    0.00    0.00
#4             0           1           0          98   41    8336     6   25    1.00   18.30
#5   -2147483648 -2147483648 -2147483648 -2147483648    0       0     0    0    0.00    0.00
#6             0           0           0         100    0       0     0    0    0.00    0.00
#7   -2147483648 -2147483648 -2147483648 -2147483648    0       0     0    0    0.00    0.00
#8             0           0           0          99   66   11621     2   37    0.20   96.14
#9   -2147483648 -2147483648 -2147483648 -2147483648    0       0     0    0    0.00    0.00
#10            0           0           0         100    0       0     0    0    0.00    0.00
#11  -2147483648 -2147483648 -2147483648 -2147483648    0       0     0    0    0.00    0.00
#12            0           1           0          98  159    3412     0    3    0.09    5.64
#13  -2147483648 -2147483648 -2147483648 -2147483648    0       0     0    0    0.00    0.00
#14            0           0           0         100    0       0     0    0    0.00    0.00
#15  -2147483648 -2147483648 -2147483648 -2147483648    0       0     0    0    0.00    0.00
SUM            0           0           0          99  316   30330    17  101    3.33  210.40

  PID USER  PRI NICE SIZE  RES PFLTS STAT  USER/SYSTIME      CPU%    COMMAND
 1868 root  255   21  40k  40k   0.0  run  0:00/23:18:03   120.0/44.9 Kernel (wait)
 1604 root  255   21  40k  40k   0.0  run  0:00/22:52:49   118.4/44.1 Kernel (wait)
 1076 root  255   21  40k  40k   0.0  run  0:00/22:49:01   116.8/44.0 Kernel (wait)
 1340 root  255   21  40k  40k   0.0  run  0:00/22:46:50   116.8/43.9 Kernel (wait)
  548 root  255   21  40k  40k   0.0  run  0:00/22:20:40   115.2/43.0 Kernel (wait)
  812 root  255   21  40k  40k   0.0  run  0:00/21:52:20   112.0/42.1 Kernel (wait)
  284 root  255   21  40k  40k   0.0  run  0:00/21:31:39   110.4/41.5 Kernel (wait)
13634 root  255   21  40k  40k   0.0  run  0:00/20:52:10   107.2/40.2 Kernel (wait)
13898 root  255   21  40k  40k   0.0  run  0:00/20:45:07   107.2/40.0 Kernel (wait)
13370 root  255   21  40k  40k   0.0  run  0:00/20:43:28   107.2/39.9 Kernel (wait)
14162 root  255   21  40k  40k   0.0  run  0:00/20:29:21   105.6/39.5 Kernel (wait)
12842 root  255   21  40k  40k   0.0  run  0:00/20:19:08   104.0/39.1 Kernel (wait)
13106 root  255   21  40k  40k   0.0  run  0:00/20:07:20   104.0/38.8 Kernel (wait)
12578 root  255   21  40k  40k   0.0  run  0:00/19:56:01   102.4/38.4 Kernel (wait)
26016 root   40   21  16M  14M   0.0  run  1:45:34/ 1:40:09 17.6/ 6.8 mmfsd64
21118 root   60    0 692k 692k   0.0  run  1:46:25/ 0:00     8.0/ 3.4 Kernel (dog)
95602 root   40   21 320k 320k   0.0  run  1:46:06/ 0:00     8.0/ 3.4 Kernel (vsdkp)
29624 root   60    0 292k 292k   0.0  run  25:03/ 0:00       1.6/ 0.8 Kernel (rtcmd)
Sample output
top - 13:27:07 up 13 days, 18:25, 174 users,  load average: 13.37, 12.08, 10.68
Tasks: 1090 total,  15 running, 1068 sleeping,  4 stopped,  3 zombie
Cpu(s): 15.1%us, 4.9%sy, 1.3%ni, 76.8%id, 1.7%wa, 0.1%hi, 0.2%si, 0.0%st
Mem:  32040664k total, 31970212k used,    70452k free,    78464k buffers
Swap:  4096564k total,  3532308k used,   564256k free,  7893264k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
11292 jkyall    25   0 1065m 452m 3280 R 90.2  1.4 165:42.07 mcnpx
23745 jkyall    25   0 1065m 447m 3280 R 90.2  1.4 174:33.67 mcnpx
28090 kuug74    25   0 4826m 4.5g 2764 R 88.5 14.8 214:31.25 nike3d
 7441 s43in     19   0 96260  49m 1504 R 83.4  0.2   2:05.19 metgrid.exe
31310 kuug74    25   0 5344m 4.8g 2764 R 61.3 15.7 204:34.53 nike3d
21057 syyberg   16   0 21004 1072  860 D 39.1  0.0  18:29.97 tar
25645 s43in     18   0  125m  80m 1504 R 37.4  0.3   5:16.51 metgrid.exe
  616 root      11  -5     0    0    0 D 35.7  0.0 163:04.87 kswapd2
  614 root      11  -5     0    0    0 R 32.3  0.0 230:06.93 kswapd0
  617 root      11  -5     0    0    0 D 28.9  0.0 173:51.87 kswapd3
  615 root      11  -5     0    0    0 S 27.2  0.0 159:35.07 kswapd1
12716 neeao     16   0 23168  14m 4172 R 22.1  0.0   0:00.13 cc1plus
10158 h22ringr  15   0  419m  96m  37m R 18.7  0.3   0:04.33 viewer
 5716 root      15   0     0    0    0 S 11.9  0.0 596:52.60 ptlrpcd
12708 blaise    15   0 13396 1932  840 R  8.5  0.0   0:00.08 top
12711 s43in     16   0 40996 3296 2276 S  3.4  0.0   0:00.02 vim
 4361 kuug74    15   0 4091m 3.9g 2764 S  1.7 12.6 210:39.78 nike3d
 4891 root      15   0     0    0    0 S  1.7  0.0  52:02.91 kiblnd_sd_02
10065 h22ringr  15   0  160m  17m  10m S  1.7  0.1   0:01.53 cli
    1 root      15   0 10344  660  604 S  0.0  0.0   1:27.74 init
    2 root      RT  -5     0    0    0 S  0.0  0.0   0:26.31 migration/0
    3 root      34  19     0    0    0 S  0.0  0.0   0:00.34 ksoftirqd/0
    4 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/0
    5 root      RT  -5     0    0    0 S  0.0  0.0   0:11.74 migration/1
    6 root      34  19     0    0    0 S  0.0  0.0   0:00.90 ksoftirqd/1
    7 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/1
    8 root      RT  -5     0    0    0 S  0.0  0.0   0:36.14 migration/2
    9 root      34  19     0    0    0 R  0.0  0.0   0:01.64 ksoftirqd/2
   10 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/2
   11 root      RT  -5     0    0    0 S  0.0  0.0   0:10.40 migration/3
Overview:
Example Displays and Reports:
Opening display of the entire call tree. The initial display is fully zoomed-out, showing the main library clusters, calling arc summaries, and overall routine control flow.
Unclustered functions view. Same as the previous display, but functions are not grouped within their library cluster.
Zoom-in view of the main program. The smaller "Overview Window" allows easy positioning of the desired view area. The zoomed-in area provides function details.
Collapsed library clusters. Zoomed-out view. Demonstrates how undesired information can be removed/summarized.
Flat Profile Report. Similar to the same report produced by the gprof utility. Lists functions according to CPU usage.
Source Code Report showing accumulated ticks for code statement lines.
Call Graph Profile. Similar to the same report produced by the gprof utility.
Library Statistics. Shows execution statistics for the libraries called.
Using xprofiler:
Note: when you compile and link separately, you must use the -pg option with both the compile and link commands.
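For example, a separate compile and link might look like this sketch (using the mpcc_r driver that appears later in this tutorial; any compiler driver is handled the same way):

mpcc_r -pg -c myprog.c        # compile with -pg
mpcc_r -pg -o myprog myprog.o # link with -pg as well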
xprofiler | Starts without a file loaded. Load a file by using xprofiler's File pull-down menu.
xprofiler myprog gmon.out | Loads the serial program with its stat file
xprofiler myprog gmon.out.N | Loads a parallel program with a selected stat file
xprofiler myprog gmon.out.* | Loads a parallel program with combined/merged stat files
Note that there are also several command line flags available to define certain xprofiler characteristics and behaviors.
Important note: it is often necessary to uncluster functions and zoom in to get at important detailed information. It is also usually useful to collapse/hide library information that isn't needed (such as system libraries).
Documentation:
mpcc_r -g -o myprog myprog.c -L/usr/local/tools/mpiP/lib -lmpiP -lbfd -liberty -lintl -lm
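mpiP is built on the standard MPI profiling (PMPI) layer, so no source code changes are required - relinking against the library as shown is sufficient. Compiling with -g is what allows mpiP to map each callsite back to a source file and line number in the reports that follow.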
Environment Information Section
@ mpiP
@ Command : sphot
@ Version              : 0.9
@ Build date           : Mar 8 2001, 16:22:46
@ Start time           : 2001 04 11 16:04:23
@ Stop time            : 2001 04 11 16:04:51
@ Number of tasks      : 4
@ Collector Rank       : 0
@ Collector PID        : 30088
@ Event Buffer Size    : 100000
@ Final Trace Dir      : .
@ Local Trace Dir      : /usr/tmp
@ Task Map             : 0 blue333.pacific.llnl.gov 0
@ Task Map             : 1 blue334.pacific.llnl.gov 0
@ Task Map             : 2 blue335.pacific.llnl.gov 0
@ Task Map             : 3 blue336.pacific.llnl.gov 0
MPI Time Per Task
-------------------------------------------------------------------------------
@--- MPI Time (seconds) -------------------------------------------------------
-------------------------------------------------------------------------------
Task    AppTime    MPITime     MPI%
   0       27.9       7.18    25.73
   1       27.9        7.5    26.89
   2       27.9       7.78    27.90
   3       27.9       7.73    27.72
   *        112       30.2    27.06
Callsites
-------------------------------------------------------------------------------
@--- Callsites: 38 ------------------------------------------------------------
-------------------------------------------------------------------------------
ID  MPICall    ParentFunction    Filename     Line  PC
 1  Barrier    copyglob          copyglob.f     65  10000b9c
 2  Barrier    copypriv@OL@1     copypriv.f    195  10001cd4
 3  Barrier    copypriv@OL@2     copypriv.f    237  1000213c
 4  Barrier    copypriv@OL@3     copypriv.f    279  10002624
 5  Barrier    copypriv@OL@4     copypriv.f    324  10002b04
 6  Barrier    sphot             sphot.f       269  10008f2c
 7  Bcast      rdopac            rdopac.f       49  10008638
 8  Comm_rank  copyglob          copyglob.f     13  100003a8
 9  Comm_rank  copypriv          copypriv.f     75  10000c38
10  Comm_rank  genxsec           genxsec.f      37  1000503c
11  Comm_rank  rdinput           rdinput.f      17  100071d4
12  Comm_rank  rdopac            rdopac.f       29  1000806c
13  Comm_rank  sphot             sphot.f        67  10008a04
14  Comm_size  copyglob          copyglob.f     12  10000390
15  Comm_size  copypriv          copypriv.f     76  10000c50
16  Comm_size  sphot             sphot.f        68  10008a1c
17  Finalize   sphot             sphot.f       279  10008f60
18  Init       sphot             sphot.f        66  100089e8
19  Irecv      copyglob          copyglob.f     47  10000a70
20  Irecv      copypriv@OL@1     copypriv.f    162  1000184c
21  Irecv      copypriv@OL@2     copypriv.f    205  10001db4
22  Irecv      copypriv@OL@3     copypriv.f    254  1000221c
23  Irecv      copypriv@OL@4     copypriv.f    297  10002704
24  Irecv      sphot@OL@1@OL@2   sphot.f       201  1000974c
25  Reduce     sphot             sphot.f       258  10008eac
26  Reduce     sphot             sphot.f       262  10008eec
27  Send       copyglob          copyglob.f     60  10000b78
28  Send       copypriv@OL@1     copypriv.f    191  10001cb8
29  Send       copypriv@OL@2     copypriv.f    233  10002120
30  Send       copypriv@OL@3     copypriv.f    275  10002608
31  Send       copypriv@OL@4     copypriv.f    320  10002ae8
32  Send       sphot@OL@1@OL@3   sphot.f       234  100099d8
33  Waitall    copyglob          copyglob.f     51  10000ac4
34  Waitall    copypriv@OL@1     copypriv.f    168  10001928
35  Waitall    copypriv@OL@2     copypriv.f    212  10001e94
36  Waitall    copypriv@OL@3     copypriv.f    260  100022fc
37  Waitall    copypriv@OL@4     copypriv.f    304  100027e4
38  Waitall    sphot@OL@1@OL@2   sphot.f       206  100097b4
Aggregate Times of Top 20 Callsites
-------------------------------------------------------------------------------
@--- Aggregate Time (top twenty, descending, milliseconds) --------------------
-------------------------------------------------------------------------------
Call        Site      Time    App%    MPI%
Bcast          7  1.54e+04   13.79   50.95
Barrier        1  1.42e+04   12.73   47.03
Barrier        2       563    0.50    1.87
Waitall       34      25.7    0.02    0.09
Reduce        25       7.4    0.01    0.02
Barrier        5      2.54    0.00    0.01
Barrier        6      1.55    0.00    0.01
Barrier        4      1.44    0.00    0.00
Comm_rank     13      1.22    0.00    0.00
Barrier        3      1.01    0.00    0.00
Comm_rank      9     0.967    0.00    0.00
Send          27     0.755    0.00    0.00
Send          31     0.694    0.00    0.00
Waitall       37      0.42    0.00    0.00
Send          28     0.336    0.00    0.00
Waitall       35      0.21    0.00    0.00
Waitall       36     0.202    0.00    0.00
Waitall       38       0.2    0.00    0.00
Irecv         19       0.2    0.00    0.00
Reduce        26     0.185    0.00    0.00
Waitall       33     0.161    0.00    0.00
Comm_size     15     0.132    0.00    0.00
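In this report, App% is a callsite's aggregate time expressed as a percentage of total application time, and MPI% as a percentage of total MPI time. Checking the top entry against the per-task table above: Bcast site 7 accounts for 1.54e+04 ms = 15.4 s, so App% = 15.4 / 112 = 13.8% and MPI% = 15.4 / 30.2 = 51%, matching the reported 13.79 and 50.95.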
Callsite Statistics
-------------------------------------------------------------------------------
@--- Callsite statistics (all, milliseconds): 102 -----------------------------
-------------------------------------------------------------------------------
Name      Site  Rank  Count       Max      Mean       Min   App%   MPI%
Barrier      1     0      1     0.087     0.087     0.087   0.00   0.00
Barrier      1     1      1      12.7      12.7      12.7   0.05   0.17
Barrier      1     2      1  7.09e+03  7.09e+03  7.09e+03  25.44  91.17
Barrier      1     3      1  7.09e+03  7.09e+03  7.09e+03  25.44  91.75
Barrier      1     *      4  7.09e+03  3.55e+03     0.087  12.73  47.03
Barrier      2     0      1      0.12      0.12      0.12   0.00   0.00
Barrier      2     1      1      0.29      0.29      0.29   0.00   0.00
Barrier      2     2      1       307       307       307   1.10   3.95
Barrier      2     3      1       255       255       255   0.92   3.31
Barrier      2     *      4       307       141      0.12   0.50   1.87
..... [snip] .....
Send        31     1      1     0.169     0.169     0.169   0.00   0.00
Send        31     2      1     0.341     0.341     0.341   0.00   0.00
Send        31     3      1     0.184     0.184     0.184   0.00   0.00
Send        31     *      3     0.341     0.231     0.169   0.00   0.00
Send        32     1      1     0.027     0.027     0.027   0.00   0.00
Send        32     2      1     0.042     0.042     0.042   0.00   0.00
Send        32     3      1     0.043     0.043     0.043   0.00   0.00
Send        32     *      3     0.043    0.0373     0.027   0.00   0.00
Waitall     33     0      1     0.161     0.161     0.161   0.00   0.00
Waitall     33     *      1     0.161     0.161     0.161   0.00   0.00
Waitall     34     0      1      25.7      25.7      25.7   0.09   0.36
Waitall     34     *      1      25.7      25.7      25.7   0.02   0.09
Waitall     35     0      1      0.21      0.21      0.21   0.00   0.00
Waitall     35     *      1      0.21      0.21      0.21   0.00   0.00
Waitall     36     0      1     0.202     0.202     0.202   0.00   0.00
Waitall     36     *      1     0.202     0.202     0.202   0.00   0.00
Waitall     37     0      1      0.42      0.42      0.42   0.00   0.01
Waitall     37     *      1      0.42      0.42      0.42   0.00   0.00
Waitall     38     0      1       0.2       0.2       0.2   0.00   0.00
Waitall     38     *      1       0.2       0.2       0.2   0.00   0.00
Performance Analysis Tools
Event Set 1 | Event Set 2 | Event Set 3 | Event Set 4 |
---|---|---|---|
Cycles | Cycles | Cycles | Cycles |
Inst. Completed | Inst. Completed | Inst. Completed | TLB Misses |
TLB Misses | TLB Misses | Inst. Cache Misses | Loads Completed |
Stores Completed | Stores Dispatched | FXU0 Operations | Stores Completed |
Loads Completed | L1 Store Misses | FXU1 Operations | L2 Load Misses |
FPU0 Operations | Loads Dispatched | FXU2 Operations | L2 Store Misses |
FPU1 Operations | L1 Load Misses | FPU0 Operations | Branches |
FMAs Executed | Load/Store Unit Idle | FPU1 Operations | Branches Mispredicted |
Using hpmcount:
hpmcount [-h] [-o <filename>] [-s <set>] [-e ev[,ev]*] program

where:
-h | Prints help message
-o <filename> | Sets the output file name. For parallel jobs, each process gets a unique output file.
-s <set> | Defines one of four default event sets to measure. Event set 1 is the default.
-e ev[,ev]* | Manually defines a list of hardware counters to monitor. (Only for the truly adventurous.)
Sample hpmcount output
% hpmcount a.out -s 1
Running pipe()...      s=  80200000000.0000000
Running unroll()...    s=  51840000160.0000000
Running strength()...  s=  84147.0984807570785
Running block()...     c(N,N) =  67239936.0000000000
adding counter 5 event 12 Cycles
adding counter 0 event 1 Instructions completed
adding counter 7 event 0 TLB misses
adding counter 2 event 9 Stores completed
adding counter 3 event 5 Loads completed
adding counter 4 event 5 FPU 0 instructions
adding counter 1 event 35 FPU 1 instructions
adding counter 6 event 9 FMAs executed

hpmcount (V 2.3.1) summary
Total execution time (wall clock time): 85.366748 seconds

######## Resource Usage Statistics ########
Total amount of time in user mode            : 75.190000 seconds
Total amount of time in system mode          : 0.090000 seconds
Maximum resident set size                    : 7648 Kbytes
Average shared memory use in text segment    : 120432 Kbytes*sec
Average unshared memory use in data segment  : 26017908 Kbytes*sec
Number of page faults without I/O activity   : 1912
Number of page faults with I/O activity      : 3
Number of times process was swapped out      : 0
Number of times file system performed INPUT  : 0
Number of times file system performed OUTPUT : 0
Number of IPC messages sent                  : 0
Number of IPC messages received              : 0
Number of signals delivered                  : 0
Number of voluntary context switches         : 47
Number of involuntary context switches       : 1321
####### End of Resource Statistics ########

PM_CYC (Cycles)                        : 27845219686
PM_INST_CMPL (Instructions completed)  : 20517718852
PM_TLB_MISS (TLB misses)               : 138272209
PM_ST_CMPL (Stores completed)          : 1722228911
PM_LD_CMPL (Loads completed)           : 6770324612
PM_FPU0_CMPL (FPU 0 instructions)      : 1485346931
PM_FPU1_CMPL (FPU 1 instructions)      : 200209450
PM_EXEC_FMA (FMAs executed)            : 1259225176

Utilization rate                       : 86.968 %
Avg number of loads per TLB miss       : 48.964
Load and store operations              : 8492.554 M
Instructions per load/store            : 2.416
MIPS                                   : 240.348
Instructions per cycle                 : 0.737
HW Float points instructions per Cycle : 0.061
Floating point instructions + FMAs     : 2944.782 M
Float point instructions + FMA rate    : 34.496 Mflip/s
FMA percentage                         : 85.522 %
Computation intensity                  : 0.347
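The derived metrics at the bottom of the report follow directly from the raw counters and the timing data. For example: Instructions per cycle = PM_INST_CMPL / PM_CYC = 20517718852 / 27845219686 = 0.737, and MIPS = PM_INST_CMPL / (wallclock time x 10^6) = 20517718852 / (85.366748 x 10^6) = 240.3, both matching the reported values.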
Two output files are produced for each task:
perfhpm[taskID].[pid] - contains a text version of the performance data
hpm[taskID]_[progName]_[pid].viz - contains data for the hpmviz program
For example, a 4 task run of an executable called "swim" produces:
-rw------- 1 blaise blaise 31840 Jul 18 18:38 hpm0000_swim_53912.viz
-rw------- 1 blaise blaise 31832 Jul 18 18:38 hpm0001_swim_53042.viz
-rw------- 1 blaise blaise 31840 Jul 18 18:38 hpm0002_swim_51346.viz
-rw------- 1 blaise blaise 31840 Jul 18 18:38 hpm0003_swim_56720.viz
-rw------- 1 blaise blaise 52662 Jul 18 18:38 perfhpm0000.53912
-rw------- 1 blaise blaise 52633 Jul 18 18:38 perfhpm0001.53042
-rw------- 1 blaise blaise 52684 Jul 18 18:38 perfhpm0002.51346
-rw------- 1 blaise blaise 52673 Jul 18 18:38 perfhpm0003.56720
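For orientation, here is a minimal sketch of how a code region is instrumented with the HPM Toolkit's libhpm calls. The header name and exact signatures are assumptions to be checked against the product documentation:

#include "libhpm.h"    /* assumed header name for the libhpm API */

int main(void)
{
    int taskID = 0;                /* typically the MPI rank in parallel codes */

    hpmInit(taskID, "myprog");     /* initialize the library */

    hpmStart(1, "work loop");      /* begin instrumented section 1 */
    /* ... code region to be counted ... */
    hpmStop(1);                    /* end instrumented section 1 */

    hpmTerminate(taskID);          /* write the perfhpm and .viz output files */
    return 0;
}

Each hpmStart/hpmStop pair is reported as a separate instrumented section, which is how per-section data ends up in the .viz files listed above.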
Example libhpm report
hpmviz hpm0003_swim_56720.viz
pct - starts the GUI (graphical mode)
pct -c - starts the command line interface
pct -c -s - starts the command line interface, reading commands from a script file
pvt - starts the GUI
pvt filename(s) - starts the GUI with one or more input files
pvt -c - starts the command line interface
pvt -c filename(s) - starts the command line interface with one or more input files
IBM Parallel Environment for AIX: Operation and Use Volume 2 Tools Reference. Complete product documentation.
Overview:
Example Displays:
Performance Toolbox monitor view 1
Performance Toolbox monitor view 2
Performance Toolbox 3D monitor view 1: multiple hosts
Performance Toolbox 3D monitor view 2: single host
Using DPCL:
Documentation:
vmstat [options] n [m]

where n = interval in seconds and m = number of reports
Sample output
% vmstat 2 3
kthr     memory             page              faults        cpu
----- ----------- ------------------------ ------------ -----------
 r  b    avm     fre  re pi po fr sr cy   in   sy   cs us sy id wa
 2  1  504038 3489832  0  0  0  0  0  0 2760 8705 1730  4  3 92  0
 1  0  503080 3490975  0  0  0  0  0  0 5676 8619 4277  0 15 85  0
 0  0  503073 3490982  0  0  0  0  0  0 1937 2707  232  0  0 99  0

* 1 page = 4096 bytes
r = kernel threads placed on the run queue
b = kernel threads blocked
avm = active virtual memory pages
fre = free memory pages
re = pager input/output list
pi = page-ins from paging space
po = page-outs to paging space
fr = pages freed
sr = pages scanned
cy = scan cycles
in = device interrupts
sy = system calls
cs = context switches
us = % user cpu utilization
sy = % system cpu utilization
id = % cpu idle time
wa = % cpu time waiting on I/O
netstat [options] n

where n = interval in seconds
Sample output
% netstat -i      (shows state of all configured interfaces)
Name  Mtu   Network  Address            Ipkts      Ierrs  Opkts      Oerrs  Coll
en0   1500  link#2   0.4.ac.ec.11.e     319849     0      2787430    0      0
en0   1500  134.9.46 efrost067          319849     0      2787430    0      0
en1   1500  link#3   0.4.ac.7c.d6.3a    60416480   0      119342413  0      0
en1   1500  134.9.45 frost067-ge0.llnl  60416480   0      119342413  0      0
en2   1500  link#4   0.4.ac.7c.d3.1d    111043029  0      131259660  3      0
en2   1500  134.9.6  frost067-ge1.llnl  111043029  0      131259660  3      0
en3   9000  link#5   0.4.ac.7c.d6.43    12557524   0      14013408   0      0
en3   9000  134.9.60 frost067-ge2.llnl  12557524   0      14013408   0      0
en4   9000  link#6   0.4.ac.7c.d5.37    12555933   0      13775758   0      0
en4   9000  134.9.61 frost067-ge3.llnl  12555933   0      13775758   0      0
en5   9000  link#7   0.4.ac.7c.d5.36    12711192   0      13896158   0      0
en5   9000  134.9.62 frost067-ge4.llnl  12711192   0      13896158   0      0
en6   9000  link#8   0.4.ac.7c.d5.21    12408381   0      12936330   0      0
en6   9000  134.9.63 frost067-ge5.llnl  12408381   0      12936330   0      0
css0  65504 link#9                      20162383   0      47045300   0      0
css0  65504 134.9.48 frost067-css0      20162383   0      47045300   0      0
css1  65504 link#10                     20155508   0      47041708   0      0
css1  65504 134.9.49 frost067-css1      20155508   0      47041708   0      0
ml0   65504 link#11                     0          0      75604311   4367   0
ml0   65504 134.9.47 frost067           0          0      75604311   4367   0
ml0   65504 134.9.47 frostgw            0          0      75604311   4367   0
lo0   16896 link#1                      1902914    0      1903901    0      0
lo0   16896 127      localhost          1902914    0      1903901    0      0
lo0   16896 ::1                         1902914    0      1903901    0      0
iostat [options] n [m]

where n = interval in seconds and m = number of reports
Sample output
% iostat 10 3
tty:    tin    tout   avg-cpu:  % user  % sys  % idle  % iowait
        1.9   569.4              4.1     3.4    92.2      0.3

Disks:      % tm_act   Kbps   tps   Kb_read  Kb_wrtn
hdisk0         1.2     79.3   6.2    497100  9818957
hdisk1         0.4     56.7   1.8     48840  7328280

tty:    tin    tout   avg-cpu:  % user  % sys  % idle  % iowait
        0.3  1183.6              0.4     0.2    99.3      0.0

Disks:      % tm_act   Kbps   tps   Kb_read  Kb_wrtn
hdisk0         0.0      1.6   0.1         0       16
hdisk1         0.0      1.6   0.1         0       16

tty:    tin    tout   avg-cpu:  % user  % sys  % idle  % iowait
        0.1   418.2              0.1     0.7    99.2      0.0

Disks:      % tm_act   Kbps   tps   Kb_read  Kb_wrtn
hdisk0         3.9     67.7  13.0         0      677
hdisk1         0.2     10.4   1.5         0      104
Sample output
% ps ux
USER      PID   %CPU  %MEM   SZ  RSS  TTY     STAT  STIME     TIME  COMMAND
userjoe   41728 62.9   0.0  352  396  pts/15  A     18:39:14  5:02  a.out
userjoe   41728 22.3  33.0  412  478  pts/15  A     18:39:14  5:02  b.out
userjoe   41502  0.0   0.0  696  828  pts/15  A     17:25:27  0:03  -tcsh
userjoe   13334  0.0   0.0  200  252  pts/15  A     18:40:14  0:00  ps ux
mpi_trace:
MPIMap:
And More...
This completes the tutorial.
Please complete the online evaluation form - unless you are doing the exercise, in which case please complete it at the end of the exercise.
Where would you like to go now?
References and More Information