This section contains a short summary of general tips for optimizing
the performance of SPE programs.
- Local Store
- Design for the LS size. The LS holds up to 256 KB for the program, stack,
local data structures, and DMA buffers. One can do a lot with 256 KB, but
be aware of this size.
- Use overlays (program kernels downloaded at run time) to build complex
function servers in the LS (see SPE overlays).
- DMA Transfers
- Use SPE-initiated DMA transfers rather than PPE-initiated DMA transfers.
There are more SPEs than the one PPE, and the PPE can enqueue only eight DMA
requests whereas each SPE can enqueue 16.
- Overlap DMA transfers with computation by double buffering or multibuffering
(see Moving double-buffered data). Either code or (more typically) data can
be multibuffered.
- Use double buffering to hide memory latency.
- Use fence command options to order DMA transfers within
a tag group.
- Use barrier command options to order DMA transfers within
the queue.
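The double-buffering pattern above can be sketched in portable C. This is not SPE code: `dma_get` is a hypothetical stand-in (on a real SPE it would be an `mfc_get` with a tag, and the transfer would genuinely overlap the compute loop), but the buffer-swap structure is the same.

```c
#include <string.h>
#include <stddef.h>

#define CHUNK 4

/* Hypothetical stand-in for an SPE-initiated DMA get; on a real SPE this
 * would be mfc_get() with a tag ID, completing asynchronously. */
static void dma_get(int *dst, const int *src, size_t n) {
    memcpy(dst, src, n * sizeof(int));
}

/* Double-buffered sum: start fetching chunk i+1 into one buffer, then
 * compute on chunk i in the other, then swap. */
long sum_double_buffered(const int *src, size_t nchunks) {
    int buf[2][CHUNK];
    long total = 0;
    int cur = 0;

    dma_get(buf[cur], src, CHUNK);            /* prime the pipeline */
    for (size_t i = 0; i < nchunks; i++) {
        int next = cur ^ 1;
        if (i + 1 < nchunks)                  /* kick off the next transfer */
            dma_get(buf[next], src + (i + 1) * CHUNK, CHUNK);
        for (size_t j = 0; j < CHUNK; j++)    /* compute on current buffer */
            total += buf[cur][j];
        cur = next;                           /* swap buffers */
    }
    return total;
}
```

On a real SPE, a `mfc_write_tag_mask`/`mfc_read_tag_status_all` wait would sit between issuing the next get and consuming the current buffer.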
- Loops
- Unroll loops to reduce dependencies and increase dual-issue rates. This
exploits the large SPU register file.
- Compiler auto-unrolling is not perfect, but pretty good.
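A minimal sketch of manual unrolling, in portable C: four independent accumulators break the loop-carried dependency on a single sum, which is the kind of transformation the large SPU register file makes profitable. The multiple-of-4 restriction is an assumption for brevity.

```c
/* Dot product unrolled by 4. The four partial sums are independent, so
 * the multiply-adds can issue back-to-back instead of waiting on one
 * accumulator. Assumes n is a multiple of 4. */
float dot_unrolled4(const float *a, const float *b, int n) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);   /* tree-shaped final reduction */
}
```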
- SIMD Strategy
- Choose an SIMD strategy appropriate for your algorithm. For example:
- Evaluate array-of-structures (AOS) organization. For graphics vertices,
this organization (also called vector-across) can have more-efficient code
size and simpler DMA needs, but less-efficient computation unless the code
is unrolled.
- Evaluate structure-of-arrays (SOA) organization. For graphics vertices,
this organization (also called parallel-array) can be easier to SIMDize,
but the data must be maintained in separate arrays or the SPU must shuffle
AOS data into an SOA form.
- Consider the effects of unrolling when choosing an SIMD strategy.
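The two layouts, and the AOS-to-SOA shuffle mentioned above, can be sketched in plain C. The struct names and the fixed four-vertex group are illustrative assumptions; on the SPU the per-component copy loop would be done with shuffle-byte instructions over quadwords.

```c
#define NVERT 4

/* AOS (vector-across): one struct per vertex. */
struct vertex { float x, y, z, w; };

/* SOA (parallel-array): one array per component; each array maps
 * naturally onto SIMD lanes, one vertex per lane. */
struct vertices_soa { float x[NVERT], y[NVERT], z[NVERT], w[NVERT]; };

/* Scalar sketch of the AOS-to-SOA transpose the SPU would otherwise
 * perform with shuffle instructions. */
void aos_to_soa(const struct vertex *in, struct vertices_soa *out) {
    for (int i = 0; i < NVERT; i++) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
        out->w[i] = in[i].w;
    }
}
```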
- Load/Store
- Scalar loads and stores are slow, with long latency.
- The SPU supports only quadword loads and stores.
- Consider making scalars into quadword integer vectors.
- Load or store scalar arrays as quadwords, and perform your own extraction
and insertion to eliminate load and store instructions.
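A portable sketch of the quadword-at-a-time idea: load 16 bytes at once and do the scalar extraction yourself, amortizing one wide load across four elements. The `quadword` struct is a hypothetical stand-in for a vector register; on the SPU the extraction would be rotates and shuffles rather than field reads.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for a 128-bit vector register holding four 32-bit words. */
typedef struct { uint32_t w[4]; } quadword;

/* Sum a uint32_t array one quadword (four words) at a time, extracting
 * the scalars manually. Assumes n is a multiple of 4 and the array is
 * suitably aligned (16-byte alignment on a real SPU). */
uint64_t sum_by_quadwords(const uint32_t *a, int n) {
    uint64_t total = 0;
    for (int i = 0; i < n; i += 4) {
        quadword q;
        memcpy(&q, a + i, sizeof q);               /* one wide load */
        total += (uint64_t)q.w[0] + q.w[1]         /* manual extraction */
               + q.w[2] + q.w[3];
    }
    return total;
}
```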
- Branches
- Eliminate nonpredicted branches.
- Use feedback-directed optimization.
- Use the __builtin_expect language directive when you
can explicitly direct branch prediction.
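`__builtin_expect` is a real GCC/Clang builtin; a common way to use it is through `LIKELY`/`UNLIKELY` wrapper macros (the macro names here are a convention, not part of the compiler). The hint lets the compiler lay out code so the expected path is the fall-through path.

```c
#ifdef __GNUC__
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define LIKELY(x)   (x)
#define UNLIKELY(x) (x)
#endif

/* Count nonzero error codes in a stream where errors are assumed rare,
 * so the branch is hinted as almost always not taken. */
int count_errors(const int *codes, int n) {
    int errors = 0;
    for (int i = 0; i < n; i++) {
        if (UNLIKELY(codes[i] != 0))
            errors++;
    }
    return errors;
}
```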
- Multiplies
- Avoid integer multiplies on operands greater than 16 bits in size. The
SPU supports only a 16-bit x 16-bit multiply. A 32-bit multiply requires
five instructions (three 16-bit multiplies and two adds).
- Keep array elements sized to a power of 2 to avoid multiplies when indexing.
- Cast operands to unsigned short prior to multiplying.
Constants are of type int and also require casting. Use a
macro to perform 16-bit multiplies explicitly. This can avoid inadvertent
introduction of sign extensions and masks due to casting.
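A sketch of both points above: a `MUL16` macro (the macro name is an illustrative choice) that forces a single 16-bit multiply, and the 16-bit decomposition a 32-bit multiply reduces to. Wraparound in the middle term is harmless because the final result is taken modulo 2^32 anyway.

```c
#include <stdint.h>

/* Explicit 16-bit multiply: casting both operands to 16 bits tells the
 * compiler a single SPU mpy-class instruction suffices. */
#define MUL16(a, b) ((uint32_t)(uint16_t)(a) * (uint32_t)(uint16_t)(b))

/* Low 32 bits of a 32-bit multiply built from 16-bit multiplies, as the
 * SPU must do it: three multiplies plus adds. The high*high partial
 * product contributes only above bit 31 and is dropped. */
uint32_t mul32_from_16(uint32_t a, uint32_t b) {
    uint32_t alo = a & 0xFFFF, ahi = a >> 16;
    uint32_t blo = b & 0xFFFF, bhi = b >> 16;
    uint32_t cross = MUL16(alo, bhi) + MUL16(ahi, blo); /* middle terms */
    return MUL16(alo, blo) + (cross << 16);             /* low 32 bits */
}
```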
- Pointers
- Use the PPE's load/store with update instructions. These
allow sequential indexing through an array without the need for additional
instructions to increment the array pointer.
- For the SPEs (which do not support load/store with update instructions),
use the d-form instructions to specify an immediate offset from a base array
pointer.
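A C-level sketch of the SPE-friendly access pattern: keep a base pointer fixed across a group of accesses with small constant offsets, so the compiler can fold the offsets into d-form (base + immediate) loads and advance the pointer once per group rather than once per element. The function is illustrative, not SPU-specific code.

```c
/* Sum an int array four elements per pointer update. Each p[k] inside
 * the group is a constant offset from the same base register, which is
 * exactly what d-form addressing encodes. */
long sum_dform(const int *p, int n) {
    long total = 0;
    for (; n >= 4; n -= 4, p += 4) {   /* one pointer update per group */
        total += p[0];                 /* base + 0   */
        total += p[1];                 /* base + 4B  */
        total += p[2];                 /* base + 8B  */
        total += p[3];                 /* base + 12B */
    }
    for (; n > 0; n--, p++)            /* remainder, one at a time */
        total += p[0];
    return total;
}
```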
- Dual-Issue
- Choose intrinsics carefully to maximize dual-issue rates or reduce latencies.
- Dual issue will occur if a pipe-0 instruction is even-addressed,
a pipe-1 instruction is odd-addressed, and there are no dependencies
(operands are available).
- Code generators use nops to align instructions for dual-issue.
- Use software-pipelined loops to improve dual-issue rates.
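The software-pipelining idea can be sketched in portable C: issue the load for iteration i+1 before the compute and store for iteration i, so on an in-order machine like the SPU the load latency overlaps useful work. The example function is an illustrative sketch, not SPU code.

```c
/* Software-pipelined scale-by-2 copy. The prologue issues the first
 * load; inside the loop, each iteration's load is started before the
 * previous iteration's result is stored; the epilogue drains the last
 * value. */
void scale2_pipelined(const int *src, int *dst, int n) {
    if (n <= 0)
        return;
    int cur = src[0];                 /* prologue: first load */
    for (int i = 0; i < n - 1; i++) {
        int next = src[i + 1];        /* load for the NEXT iteration */
        dst[i] = cur * 2;             /* compute + store for THIS one */
        cur = next;
    }
    dst[n - 1] = cur * 2;             /* epilogue: last store */
}
```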