Displaying performance statistics

You can collect and display simple performance statistics on a program without instrumenting the program code. Collecting more complex statistics requires program instrumentation.

The following steps demonstrate how to collect and display simple performance statistics. The example PPE program starts ("spawns") the same thread on three SPEs.

When an SPE thread is spawned, its SPE number (any number between 0 and 7) is passed in a data structure as a parameter to the main function.
The SPE program contains a for-loop that is executed zero or more times.
The number of times it is executed is equal to three times the value passed to its main function.

The names of the PPE and SPE programs are tpa1 and tpa1_spu, respectively. The most important sections of the programs are shown in Example program: tpa1.

Each of the following steps notes whether it is performed in the simulator's command window or its console window. To collect and display simple performance statistics, do the following:

Start the simulator. Enter the following command:
```
	PATH=/opt/ibm/systemsim-cell/bin:$PATH; systemsim
```
This command starts the simulator in command-line mode, and displays the simulator prompt.
```
	systemsim %
```
In the command window, set the SPUs to pipeline mode. An SPU must be in pipeline mode for the simulator to collect performance statistics from it. An SPU in instruction mode reports only the total instruction count. Use the mysim spu command to set the SPUs to pipeline mode, as follows:
```
	mysim spu 0 set model pipeline
	mysim spu 1 set model pipeline
	mysim spu 2 set model pipeline
```
Note: The specific SPU numbers are examples only. The operating system may assign the SPU programs to execute on a different set of SPUs. You can also use the SPU Modes button or the folder under each SPE labeled Model to set the model to pipeline mode.
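Because the operating system may place the SPU programs on SPUs other than 0, 1, and 2, one option is to put every SPU into pipeline mode before booting. The following is a minimal sketch that only generates the command text to paste into the command window. The helper name `pipeline_mode_commands` is hypothetical, and the valid range of 0 through 7 is assumed from the SPE numbering described above.

```python
def pipeline_mode_commands(spu_numbers):
    """Return one 'set model pipeline' simulator command per SPU number."""
    commands = []
    for n in spu_numbers:
        # Assumption: SPUs are numbered 0 through 7, as stated for SPE numbers.
        if not 0 <= n <= 7:
            raise ValueError(f"SPU number out of range: {n}")
        commands.append(f"mysim spu {n} set model pipeline")
    return commands


# Print the commands for all eight SPUs so that any assignment is covered.
for line in pipeline_mode_commands(range(8)):
    print(line)
```

Passing `[0, 1, 2]` instead of `range(8)` reproduces the three commands shown above.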
In the command window, boot Linux. Boot the Linux operating system on the simulated PPE by entering:
```
	mysim go
```
In the console window, load the executables. Load the PPE and SPE executables from the base environment into the simulated environment, and set their file permissions to executable, as follows:
```
	callthru source tpa1 > tpa1
	callthru source tpa1_spu > tpa1_spu
	chmod +x tpa1
	chmod +x tpa1_spu
```
In the console window, run the PPE program. Run the PPE program in the simulation by entering the name of the executable file, as follows:
```
	tpa1
```
In the command window, pause the simulation and display statistics. When the program finishes execution, select the simulator command window. Pause the simulator by pressing Ctrl-C. To display the performance statistics for the three SPEs, enter the following commands:
```
	mysim spu 0 display statistics
	mysim spu 1 display statistics
	mysim spu 2 display statistics
```

As each command is entered, the simulator displays the performance statistics in the simulator command window. Figure 1 shows a screen image of the SPE 0 performance statistics.

Figure 1. tpa1 statistics for SPE 0

SPU DD3.0
***
Total Cycle count               35185
Total Instruction count         643
Total CPI                       54.72
***
Performance Cycle count         35185
Performance Instruction count   1701 (1502)
Performance CPI                 20.68 (23.43)

Branch instructions             135
Branch taken                    120
Branch not taken                15

Hint instructions               9
Hint hit                        31

Contention at LS between Load/Store and Prefetch 49

Single cycle                                              1108 (  3.1%)
Dual cycle                                                 197 (  0.6%)
Nop cycle                                                  137 (  0.4%)
Stall due to branch miss                                  1655 (  4.7%)
Stall due to prefetch miss                                   0 (  0.0%)
Stall due to dependency                                    826 (  2.3%)
Stall due to fp resource conflict                            0 (  0.0%)
Stall due to waiting for hint target                        11 (  0.0%)
Issue stalls due to pipe hazards                             6 (  0.0%)
Channel stall cycle                                      31236 ( 88.8%)
SPU Initialization cycle                                     9 (  0.0%)
-----------------------------------------------------------------------
Total cycle                                              35185 (100.0%)

Stall cycles due to dependency on each pipelines
 FX2        62 (  7.5% of all dependency stalls)
 SHUF       322 ( 39.0% of all dependency stalls)
 FX3        2 (  0.2% of all dependency stalls)
 LS         413 ( 50.0% of all dependency stalls)
 BR         0 (  0.0% of all dependency stalls)
 SPR        21 (  2.5% of all dependency stalls)
 LNOP       0 (  0.0% of all dependency stalls)
 NOP        0 (  0.0% of all dependency stalls)
 FXB        0 (  0.0% of all dependency stalls)
 FP6        0 (  0.0% of all dependency stalls)
 FP7        0 (  0.0% of all dependency stalls)
 FPD        6 (  0.7% of all dependency stalls)

The number of used registers are 128, the used ratio is 100.00
dumped pipeline stats
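The figures in this report can be cross-checked with a little arithmetic: each CPI value is the cycle count divided by an instruction count, and the per-category cycles add up to the total. The following is a minimal sketch that parses the cycle-breakdown rows of Figure 1 and checks both relationships. The names `ROW`, `parse_breakdown`, and `cpi` are hypothetical helpers, not part of the simulator, and the row format is assumed from the output shown above. The meaning of the parenthesized instruction count is not stated here, so the sketch only reproduces the division.

```python
import re

# Cycle-breakdown rows copied from Figure 1 (SPE 0).
FIGURE_1_BREAKDOWN = """\
Single cycle                                              1108 (  3.1%)
Dual cycle                                                 197 (  0.6%)
Nop cycle                                                  137 (  0.4%)
Stall due to branch miss                                  1655 (  4.7%)
Stall due to prefetch miss                                   0 (  0.0%)
Stall due to dependency                                    826 (  2.3%)
Stall due to fp resource conflict                            0 (  0.0%)
Stall due to waiting for hint target                        11 (  0.0%)
Issue stalls due to pipe hazards                             6 (  0.0%)
Channel stall cycle                                      31236 ( 88.8%)
SPU Initialization cycle                                     9 (  0.0%)
"""

# Assumed row layout: "<name>  <count> (<percent>%)".
ROW = re.compile(r"^(.+?)\s+(\d+) \(\s*([\d.]+)%\)$")


def parse_breakdown(text):
    """Map each row name to its (cycle count, reported percentage)."""
    rows = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            name, count, percent = match.groups()
            rows[name] = (int(count), float(percent))
    return rows


def cpi(cycles, instructions):
    """Cycles per instruction, rounded to two places as in the report."""
    return round(cycles / instructions, 2)


total_cycles = 35185  # "Total Cycle count" in Figure 1
rows = parse_breakdown(FIGURE_1_BREAKDOWN)

# The per-category cycle counts should add up to the reported total.
assert sum(count for count, _ in rows.values()) == total_cycles

# Each reported percentage should match count / total to one decimal place.
for name, (count, percent) in rows.items():
    assert abs(100 * count / total_cycles - percent) <= 0.05 + 1e-9, name

print("Total CPI:", cpi(total_cycles, 643))
print("Performance CPI:", cpi(total_cycles, 1701), cpi(total_cycles, 1502))
```

In this run, channel stalls account for 31236 of the 35185 cycles, which is why the CPI values are so far above 1.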

Although SPE 0 and SPE 2 run the same program, SPE 0 executed the loop zero times, whereas SPE 2 executed it six times (three times its SPE number).

You can compare the performance statistics of SPE 0 (Figure 1) with those of SPE 2, which are shown in Figure 2.

Note: The statistics collected in this manner include the SPU cycles required to load the SPE thread, start the SPE thread, and clean up the SPE thread upon completion.

Figure 2. tpa1 statistics for SPE 2

SPU DD3.0
***
Total Cycle count               35537
Total Instruction count         643
Total CPI                       55.27
***
Performance Cycle count         35537
Performance Instruction count   1802 (1590)
Performance CPI                 19.72 (22.35)

Branch instructions             153
Branch taken                    136
Branch not taken                17

Hint instructions               15
Hint hit                        37

Contention at LS between Load/Store and Prefetch 49

Single cycle                                              1170 (  3.3%)
Dual cycle                                                 210 (  0.6%)
Nop cycle                                                  150 (  0.4%)
Stall due to branch miss                                  1854 (  5.2%)
Stall due to prefetch miss                                   0 (  0.0%)
Stall due to dependency                                    879 (  2.5%)
Stall due to fp resource conflict                            0 (  0.0%)
Stall due to waiting for hint target                        23 (  0.1%)
Issue stalls due to pipe hazards                             6 (  0.0%)
Channel stall cycle                                      31236 ( 87.9%)
SPU Initialization cycle                                     9 (  0.0%)
-----------------------------------------------------------------------
Total cycle                                              35537 (100.0%)

Stall cycles due to dependency on each pipelines
 FX2        86 (  9.8% of all dependency stalls)
 SHUF       348 ( 39.6% of all dependency stalls)
 FX3        2 (  0.2% of all dependency stalls)
 LS         413 ( 47.0% of all dependency stalls)
 BR         3 (  0.3% of all dependency stalls)
 SPR        21 (  2.4% of all dependency stalls)
 LNOP       0 (  0.0% of all dependency stalls)
 NOP        0 (  0.0% of all dependency stalls)
 FXB        0 (  0.0% of all dependency stalls)
 FP6        0 (  0.0% of all dependency stalls)
 FP7        0 (  0.0% of all dependency stalls)
 FPD        6 (  0.7% of all dependency stalls)

The number of used registers are 128, the used ratio is 100.00
dumped pipeline stats
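One way to compare Figure 1 with Figure 2 is to subtract the cycle breakdowns category by category. The following is a minimal sketch using the values copied from the two figures; the variable names are illustrative only. Attributing the difference to the six extra loop iterations on SPE 2 is an inference from the text above, not something the simulator reports, and because both totals include thread load, start, and cleanup cycles, the result is only approximate.

```python
# Cycle-breakdown categories, in the order the simulator prints them.
CATEGORIES = [
    "Single cycle",
    "Dual cycle",
    "Nop cycle",
    "Stall due to branch miss",
    "Stall due to prefetch miss",
    "Stall due to dependency",
    "Stall due to fp resource conflict",
    "Stall due to waiting for hint target",
    "Issue stalls due to pipe hazards",
    "Channel stall cycle",
    "SPU Initialization cycle",
]

# Cycle counts copied from Figure 1 (SPE 0) and Figure 2 (SPE 2).
SPE0 = dict(zip(CATEGORIES, [1108, 197, 137, 1655, 0, 826, 0, 11, 6, 31236, 9]))
SPE2 = dict(zip(CATEGORIES, [1170, 210, 150, 1854, 0, 879, 0, 23, 6, 31236, 9]))

# Per-category difference: positive means SPE 2 spent more cycles there.
delta = {name: SPE2[name] - SPE0[name] for name in CATEGORIES}
extra_cycles = sum(delta.values())

# The category deltas should account for the whole difference in total cycles.
assert extra_cycles == 35537 - 35185

# List the categories that changed, largest increase first.
for name, cycles in sorted(delta.items(), key=lambda kv: kv[1], reverse=True):
    if cycles:
        print(f"{name:40s} {cycles:+d}")
print(f"{'Total':40s} {extra_cycles:+d}")
```

The channel stall count is identical in both figures, so the additional cycles on SPE 2 fall in the execution and stall categories, with branch-miss stalls contributing the largest share.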