This section contains a short summary of general tips for optimizing
the performance of SPE programs.
- Local Store
- Design for the LS size. The LS holds up to 256 KB for the program, stack,
local data structures, and DMA buffers. One can do a lot with 256 KB, but
be aware of this size.
- Use overlays (program kernels downloaded at run time) to build complex
function servers in the LS (see SPE overlays).
- DMA Transfers
- Use SPE-initiated DMA transfers rather than PPE-initiated DMA transfers.
There are more SPEs than the one PPE, and the PPE can enqueue only eight DMA
requests whereas each SPE can enqueue 16.
- Overlap DMA transfers with computation by double buffering or multibuffering
(see Moving double-buffered data). Either code or (more typically) data can
be multibuffered.
- Use double buffering to hide memory latency.
- Use fence command options to order DMA transfers within
a tag group.
- Use barrier command options to order DMA transfers within
the queue.
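The double-buffering pattern above can be sketched in portable C. This is not SPE code: `dma_get` is a hypothetical stand-in (on a real SPE it would be an `mfc_get` with a tag, and the transfer would genuinely overlap the compute loop), but the buffer-swap structure is the same.

```c
#include <string.h>
#include <stddef.h>

#define CHUNK 4

/* Hypothetical stand-in for an SPE-initiated DMA get; on a real SPE this
 * would be mfc_get() with a tag ID, completing asynchronously. */
static void dma_get(int *dst, const int *src, size_t n) {
    memcpy(dst, src, n * sizeof(int));
}

/* Double-buffered sum: start fetching chunk i+1 into one buffer, then
 * compute on chunk i in the other, then swap. */
long sum_double_buffered(const int *src, size_t nchunks) {
    int buf[2][CHUNK];
    long total = 0;
    int cur = 0;

    dma_get(buf[cur], src, CHUNK);            /* prime the pipeline */
    for (size_t i = 0; i < nchunks; i++) {
        int next = cur ^ 1;
        if (i + 1 < nchunks)                  /* kick off the next transfer */
            dma_get(buf[next], src + (i + 1) * CHUNK, CHUNK);
        for (size_t j = 0; j < CHUNK; j++)    /* compute on current buffer */
            total += buf[cur][j];
        cur = next;                           /* swap buffers */
    }
    return total;
}
```

On a real SPE, a `mfc_write_tag_mask`/`mfc_read_tag_status_all` wait would sit between issuing the next get and consuming the current buffer.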
- Loops
- Unroll loops to reduce dependencies and increase dual-issue rates. This
exploits the large SPU register file.
- Compiler auto-unrolling is not perfect, but pretty good.
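A minimal sketch of manual unrolling, in portable C: four independent accumulators break the loop-carried dependency on a single sum, which is the kind of transformation the large SPU register file makes profitable. The multiple-of-4 restriction is an assumption for brevity.

```c
/* Dot product unrolled by 4. The four partial sums are independent, so
 * the multiply-adds can issue back-to-back instead of waiting on one
 * accumulator. Assumes n is a multiple of 4. */
float dot_unrolled4(const float *a, const float *b, int n) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);   /* tree-shaped final reduction */
}
```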
- SIMD Strategy
- Choose an SIMD strategy appropriate for your algorithm. For example:
- Evaluate array-of-structures (AOS) organization. For graphics vertices,
this organization (also called vector-across) can have more-efficient code
size and simpler DMA needs, but less-efficient computation unless the code
is unrolled.
- Evaluate structure-of-arrays (SOA) organization. For graphics vertices,
this organization (also called parallel-array) can be easier to SIMDize,
but the data must be maintained in separate arrays or the SPU must shuffle
AOS data into an SOA form.
- Consider the effects of unrolling when choosing an SIMD strategy.
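The two layouts, and the AOS-to-SOA shuffle mentioned above, can be sketched in plain C. The struct names and the fixed four-vertex group are illustrative assumptions; on the SPU the per-component copy loop would be done with shuffle-byte instructions over quadwords.

```c
#define NVERT 4

/* AOS (vector-across): one struct per vertex. */
struct vertex { float x, y, z, w; };

/* SOA (parallel-array): one array per component; each array maps
 * naturally onto SIMD lanes, one vertex per lane. */
struct vertices_soa { float x[NVERT], y[NVERT], z[NVERT], w[NVERT]; };

/* Scalar sketch of the AOS-to-SOA transpose the SPU would otherwise
 * perform with shuffle instructions. */
void aos_to_soa(const struct vertex *in, struct vertices_soa *out) {
    for (int i = 0; i < NVERT; i++) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
        out->w[i] = in[i].w;
    }
}
```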
- Load/Store
- Scalar loads and stores are slow, with long latency.
- The SPU supports only quadword loads and stores.
- Consider making scalars into quadword integer vectors.
- Load or store scalar arrays as quadwords, and perform your own extraction
and insertion to eliminate load and store instructions.
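A portable sketch of the quadword-at-a-time idea: load 16 bytes at once and do the scalar extraction yourself, amortizing one wide load across four elements. The `quadword` struct is a hypothetical stand-in for a vector register; on the SPU the extraction would be rotates and shuffles rather than field reads.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for a 128-bit vector register holding four 32-bit words. */
typedef struct { uint32_t w[4]; } quadword;

/* Sum a uint32_t array one quadword (four words) at a time, extracting
 * the scalars manually. Assumes n is a multiple of 4 and the array is
 * suitably aligned (16-byte alignment on a real SPU). */
uint64_t sum_by_quadwords(const uint32_t *a, int n) {
    uint64_t total = 0;
    for (int i = 0; i < n; i += 4) {
        quadword q;
        memcpy(&q, a + i, sizeof q);               /* one wide load */
        total += (uint64_t)q.w[0] + q.w[1]         /* manual extraction */
               + q.w[2] + q.w[3];
    }
    return total;
}
```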
- Branches
- Eliminate nonpredicted branches.
- Use feedback-directed optimization.
- Use the __builtin_expect language directive when you
can explicitly direct branch prediction.
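`__builtin_expect` is a real GCC/Clang builtin; a common way to use it is through `LIKELY`/`UNLIKELY` wrapper macros (the macro names here are a convention, not part of the compiler). The hint lets the compiler lay out code so the expected path is the fall-through path.

```c
#ifdef __GNUC__
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define LIKELY(x)   (x)
#define UNLIKELY(x) (x)
#endif

/* Count nonzero error codes in a stream where errors are assumed rare,
 * so the branch is hinted as almost always not taken. */
int count_errors(const int *codes, int n) {
    int errors = 0;
    for (int i = 0; i < n; i++) {
        if (UNLIKELY(codes[i] != 0))
            errors++;
    }
    return errors;
}
```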
- Multiplies
- Avoid integer multiplies on operands greater than 16 bits in size. The
SPU supports only a 16-bit x 16-bit multiply. A 32-bit multiply requires
five instructions (three 16-bit multiplies and two adds).
- Keep array elements sized to a power of 2 to avoid multiplies when indexing.
- Cast operands to unsigned short prior to multiplying.
Constants are of type int and also require casting. Use a
macro to perform 16-bit multiplies explicitly. This can avoid inadvertent
introduction of sign extensions and masks due to casting.
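A sketch of both points above: a `MUL16` macro (the macro name is an illustrative choice) that forces a single 16-bit multiply, and the 16-bit decomposition a 32-bit multiply reduces to. Wraparound in the middle term is harmless because the final result is taken modulo 2^32 anyway.

```c
#include <stdint.h>

/* Explicit 16-bit multiply: casting both operands to 16 bits tells the
 * compiler a single SPU mpy-class instruction suffices. */
#define MUL16(a, b) ((uint32_t)(uint16_t)(a) * (uint32_t)(uint16_t)(b))

/* Low 32 bits of a 32-bit multiply built from 16-bit multiplies, as the
 * SPU must do it: three multiplies plus adds. The high*high partial
 * product contributes only above bit 31 and is dropped. */
uint32_t mul32_from_16(uint32_t a, uint32_t b) {
    uint32_t alo = a & 0xFFFF, ahi = a >> 16;
    uint32_t blo = b & 0xFFFF, bhi = b >> 16;
    uint32_t cross = MUL16(alo, bhi) + MUL16(ahi, blo); /* middle terms */
    return MUL16(alo, blo) + (cross << 16);             /* low 32 bits */
}
```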
- Pointers
- Use the PPE's load/store with update instructions. These
allow sequential indexing through an array without the need for additional
instructions to increment the array pointer.
- For the SPEs (which do not support load/store with update instructions),
use the d-form instructions to specify an immediate offset from a base array
pointer.
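A C-level sketch of the SPE-friendly access pattern: keep a base pointer fixed across a group of accesses with small constant offsets, so the compiler can fold the offsets into d-form (base + immediate) loads and advance the pointer once per group rather than once per element. The function is illustrative, not SPU-specific code.

```c
/* Sum an int array four elements per pointer update. Each p[k] inside
 * the group is a constant offset from the same base register, which is
 * exactly what d-form addressing encodes. */
long sum_dform(const int *p, int n) {
    long total = 0;
    for (; n >= 4; n -= 4, p += 4) {   /* one pointer update per group */
        total += p[0];                 /* base + 0   */
        total += p[1];                 /* base + 4B  */
        total += p[2];                 /* base + 8B  */
        total += p[3];                 /* base + 12B */
    }
    for (; n > 0; n--, p++)            /* remainder, one at a time */
        total += p[0];
    return total;
}
```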
- Dual-Issue
- Choose intrinsics carefully to maximize dual-issue rates or reduce latencies.
- Dual issue will occur if a pipe-0 instruction is even-addressed,
a pipe-1 instruction is odd-addressed, and there are no dependencies
(operands are available).
- Code generators use nops to align instructions for dual-issue.
- Use software-pipelined loops to improve dual-issue rates.
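The software-pipelining idea can be sketched in portable C: issue the load for iteration i+1 before the compute and store for iteration i, so on an in-order machine like the SPU the load latency overlaps useful work. The example function is an illustrative sketch, not SPU code.

```c
/* Software-pipelined scale-by-2 copy. The prologue issues the first
 * load; inside the loop, each iteration's load is started before the
 * previous iteration's result is stored; the epilogue drains the last
 * value. */
void scale2_pipelined(const int *src, int *dst, int n) {
    if (n <= 0)
        return;
    int cur = src[0];                 /* prologue: first load */
    for (int i = 0; i < n - 1; i++) {
        int next = src[i + 1];        /* load for the NEXT iteration */
        dst[i] = cur * 2;             /* compute + store for THIS one */
        cur = next;
    }
    dst[n - 1] = cur * 2;             /* epilogue: last store */
}
```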