Dynamic analysis of optimizations

The listing below shows a dynamic timing analysis, run on the IBM Full System Simulator for the Cell Broadband Engine, of the optimized SPE thread (process buffer only). It reports that 78 of the SPU's 128 registers are in use, a utilization of 60.94%.

  SPU DD1.0
  ***
  Total Cycle count               7134843
  Total Instruction count         10602009
  Total CPI                       0.67
  ***
  Performance Cycle count         7134843
  Performance Instruction count   10602009 (9839265)
  Performance CPI                 0.67 (0.73)
  
  Branch instructions             253940
  Branch taken                    251967
  Branch not taken                1973
  
  Hint instructions               2952
  Hint hit                        250980
  
  Contention at LS between Load/Store and Prefetch 6871
  
  Single cycle                                           3815689 ( 53.5%)
  Dual cycle                                             3011788 ( 42.2%)
  Nop cycle                                                 5898 (  0.1%)
  Stall due to branch miss                                 34655 (  0.5%)
  Stall due to prefetch miss                                   0 (  0.0%)
  Stall due to dependency                                 266732 (  3.7%)
  Stall due to fp resource conflict                            0 (  0.0%)
  Stall due to waiting for hint target                        72 (  0.0%)
  Stall due to dp pipeline                                     0 (  0.0%)
  Channel stall cycle                                          0 (  0.0%)
  SPU Initialization cycle                                     9 (  0.0%)
  -----------------------------------------------------------------------
  Total cycle                                            7134843 (100.0%)
  
  Stall cycles due to dependency on each pipelines
   FX2        8808
   SHUF       1971
   FX3        5870
   LS         32
   BR         0
   SPR        1
   LNOP       0
   NOP        0
   FXB        0
   FP6        250050
   FP7        0
   FPD        0

  The number of used registers are 78, the used ratio is 60.94
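
The derived figures in the listing follow directly from its raw counters. The sketch below recomputes them as a sanity check. The helper names (cpi, percent, scaled_round, print_derived_metrics) are illustrative and are not part of the simulator or the SDK; the counter values are copied from the listing.

```c
#include <assert.h>
#include <stdio.h>

#define SPU_REGISTER_COUNT 128  /* size of the SPU register file */

/* Cycles per instruction. */
double cpi(long cycles, long instructions)
{
    assert(instructions > 0);
    return (double)cycles / (double)instructions;
}

/* part / whole, expressed as a percentage. */
double percent(long part, long whole)
{
    assert(whole > 0);
    return 100.0 * (double)part / (double)whole;
}

/* Scale a non-negative value and round to the nearest integer,
 * e.g. scaled_round(0.673, 100) gives 67 (hundredths). */
long scaled_round(double value, long scale)
{
    return (long)(value * (double)scale + 0.5);
}

/* Print the derived figures from counters copied out of the listing. */
void print_derived_metrics(void)
{
    const long cycles            = 7134843L;   /* Total Cycle count */
    const long instructions      = 10602009L;  /* Total Instruction count */
    const long perf_instructions = 9839265L;   /* parenthesized count */
    const long single_cycles     = 3815689L;
    const long dual_cycles       = 3011788L;
    const long dependency_stalls = 266732L;
    const long fp6_stalls        = 250050L;
    const long registers_used    = 78L;

    printf("CPI                       %.2f\n", cpi(cycles, instructions));
    printf("CPI (parenthesized count) %.2f\n", cpi(cycles, perf_instructions));
    printf("Single cycle              %.1f%%\n", percent(single_cycles, cycles));
    printf("Dual cycle                %.1f%%\n", percent(dual_cycles, cycles));
    printf("Dependency stalls         %.1f%%\n", percent(dependency_stalls, cycles));
    printf("FP6 share of dep. stalls  %.1f%%\n", percent(fp6_stalls, dependency_stalls));
    printf("Register usage            %.2f%%\n",
           percent(registers_used, SPU_REGISTER_COUNT));
}
```

Calling print_derived_metrics() from a small main() should reproduce the listing's CPI values of 0.67 and 0.73 and the register usage of 60.94%. It also shows that the FP6 pipeline accounts for roughly 94% of the dependency stall cycles.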

The static and dynamic timing analyses of the optimized SPE code reveal the following:

A significant increase in the dual-issue rate and a reduction in dependency stalls. The static analysis shows that the process_buffer inner loop still contains a single-cycle stall and some instructions that are not dual-issued. Further performance improvements could likely be achieved by either more loop unrolling or software loop-pipelining.
The number of instructions has decreased by 41% from the initial instruction count.
The CPI has dropped from 2.39 to a more typical 0.73.
The performance of the SPE code, measured in total cycle count, has gone from approximately 43 M cycles to 7 M cycles, an improvement of more than 6x. This improvement does not take into account the DMA latency-hiding (stall elimination) provided by double buffering.
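
As an illustration of the loop unrolling mentioned above, the sketch below contrasts a loop whose floating-point additions form one dependency chain with a version unrolled by four that keeps four independent accumulators. It is a generic, portable C example and not the tutorial's process_buffer code; the function names and the multiply-accumulate workload are assumptions chosen for illustration. Because unrolling reorders floating-point additions, the two versions can differ in rounding for general inputs.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: each addition depends on the previous result, so every
 * iteration waits for the preceding floating-point operation. */
float sum_products_simple(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    assert(n == 0 || (a && b));
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Unrolled by four: the four accumulators do not depend on each other,
 * which gives the compiler independent work to schedule between
 * dependent floating-point operations. */
float sum_products_unrolled(const float *a, const float *b, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;
    assert(n == 0 || (a && b));

    for (; i + 4 <= n; i += 4) {
        acc0 += a[i]     * b[i];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }
    /* Handle the remaining 0 to 3 elements. */
    for (; i < n; i++)
        acc0 += a[i] * b[i];

    return (acc0 + acc1) + (acc2 + acc3);
}
```

Whether this particular transformation helps the process_buffer loop would need to be confirmed by repeating the static and dynamic timing analyses on the modified code.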

For details about performance simulation, including examples of coding for simulations, see The simulator. The IBM Full System Simulator for the Cell Broadband Engine described in that chapter supports performance simulation for a full system, including the MFCs, caches, bus, and memory controller.