Local Store

The local store (LS) can be regarded as a software-controlled cache that is filled and emptied by DMA transfers.

Key features of the LS include:
Competition might occur for access to the LS by:
The SPU arbitrates access to the LS according to the following priorities (with the highest priority first):
  1. DMA reads and writes by the PPE or an I/O device.
  2. SPU loads and stores.
  3. Instruction prefetch.
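
The priority list above can be illustrated with a minimal sketch of a fixed-priority arbiter. It models only the three requester classes in the list and grants the LS to one of them per cycle. The type and function names (`ls_requester`, `ls_arbitrate`) are hypothetical and do not come from any SPU hardware or software interface.

```c
#include <assert.h>
#include <stdbool.h>

/* Requester classes, ordered from highest to lowest priority,
   following the three-level list above. Names are illustrative only. */
typedef enum {
    LS_REQ_DMA = 0,        /* DMA reads and writes */
    LS_REQ_LOAD_STORE = 1, /* SPU loads and stores */
    LS_REQ_IFETCH = 2,     /* instruction prefetch */
    LS_REQ_NONE = 3        /* nothing pending this cycle */
} ls_requester;

/* Return the requester that is granted the LS this cycle.
   The first pending class in priority order wins; the others wait. */
ls_requester ls_arbitrate(bool dma_pending,
                          bool load_store_pending,
                          bool ifetch_pending)
{
    if (dma_pending)        return LS_REQ_DMA;
    if (load_store_pending) return LS_REQ_LOAD_STORE;
    if (ifetch_pending)     return LS_REQ_IFETCH;
    return LS_REQ_NONE;
}
```

For example, `ls_arbitrate(false, true, true)` grants the cycle to the load or store, and the instruction prefetch is deferred to a later cycle.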

Table 1 summarizes the LS-arbitration priorities and transfer sizes. DMA reads and writes always have highest priority. Because hardware supports 128-byte DMA reads and writes, these operations occupy, at most, one of every eight cycles (one of sixteen for DMA reads, and one of sixteen for DMA writes) to the LS. Thus, except for highly optimized code, the impact of DMA reads and writes on LS availability for loads, stores, and instruction fetches can be ignored.
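
The occupancy figures in this paragraph can be checked with a few lines of arithmetic. The sketch below takes the 128-byte DMA transfer size from Table 1 and the one-in-sixteen rates for reads and writes from the text; the function names are hypothetical, and the bytes-per-cycle figure is only the upper bound these numbers imply, not a measured bandwidth.

```c
#include <assert.h>

enum {
    DMA_TRANSFER_BYTES      = 128, /* DMA transfer size, from Table 1 */
    DMA_READ_PERIOD_CYCLES  = 16,  /* at most one DMA read per 16 cycles */
    DMA_WRITE_PERIOD_CYCLES = 16   /* at most one DMA write per 16 cycles */
};

/* Upper bound on bytes moved per cycle in one direction,
   implied by one 128-byte access every 16 cycles. */
int dma_peak_bytes_per_cycle(void)
{
    return DMA_TRANSFER_BYTES / DMA_READ_PERIOD_CYCLES;
}

/* Worst case: out of `window` cycles, how many are left for loads,
   stores, and instruction fetches once DMA takes its maximum share. */
int cycles_left_for_spu(int window)
{
    int dma_cycles = window / DMA_READ_PERIOD_CYCLES
                   + window / DMA_WRITE_PERIOD_CYCLES;
    return window - dma_cycles;
}
```

Under these assumptions, DMA takes at most 2 of every 16 cycles, so at least 14 of every 16 (seven-eighths) remain available to the SPU.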

Table 1. LS-Access Arbitration Priority and Transfer Size
Transaction                      Transfer Size (Bytes)  Priority   Maximum Local Store Occupancy (SPU Cycles)  Access Path
MMIO                             16                     1-Highest  1/8                                         Line Interface
DMA                              128                    1
DMA-List Transfer-Element Fetch  128                    1          1/4                                         Quadword Interface
ECC Scrub                        16                     2          1/10
SPU Load/Store                   16                     3          1
Hint Fetch                       128                    3          1                                           Line Interface
Inline Fetch                     128                    4-Lowest   1/16 for inline code

After DMA reads and writes, the next-highest user-initiated priority is given to load and store instructions. The rationale for doing so is that load and store instructions usually help a program's progress, whereas instruction fetches are often speculative. The SPE supports only 16-byte load and store operations that are 16-byte-aligned. It uses a second instruction (byte shuffle) to place bytes in a different order if, for example, the program requires only a 4-byte quantity or a quantity with a different data alignment. To store something that is not aligned, use a read-modify-write operation.
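
The read-modify-write pattern for unaligned stores can be sketched in portable C. This is a host-side simulation, not SPU code: the `ls` array, the `quadword` type, and the `ls_*` helpers are hypothetical stand-ins. The sketch assumes that a 16-byte access ignores the low four address bits and that bytes are in big-endian order. Ordinary array indexing stands in for the byte-shuffle instruction.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define QW_BYTES 16u   /* size of one aligned LS access */
#define LS_BYTES 256u  /* tiny stand-in for the real local store */

static uint8_t ls[LS_BYTES];               /* simulated local store */
typedef struct { uint8_t b[QW_BYTES]; } quadword;

/* 16-byte load; the address is rounded down to a 16-byte boundary. */
static quadword ls_load_qw(uint32_t addr)
{
    quadword q;
    assert(addr < LS_BYTES);
    memcpy(q.b, &ls[addr & ~(QW_BYTES - 1u)], QW_BYTES);
    return q;
}

/* 16-byte store; the address is rounded down to a 16-byte boundary. */
static void ls_store_qw(uint32_t addr, quadword q)
{
    assert(addr < LS_BYTES);
    memcpy(&ls[addr & ~(QW_BYTES - 1u)], q.b, QW_BYTES);
}

/* Store a 4-byte value at any address using only 16-byte accesses:
   read the enclosing quadword, modify the target bytes, write it back.
   A second quadword is touched only if the value straddles a boundary. */
void ls_store_u32(uint32_t addr, uint32_t value)
{
    uint8_t bytes[4] = { (uint8_t)(value >> 24), (uint8_t)(value >> 16),
                         (uint8_t)(value >> 8),  (uint8_t)value };
    uint32_t base = addr & ~(QW_BYTES - 1u);
    uint32_t off  = addr - base;
    uint32_t n_lo = (off + 4u <= QW_BYTES) ? 4u : QW_BYTES - off;

    quadword lo = ls_load_qw(base);          /* read   */
    memcpy(&lo.b[off], bytes, n_lo);         /* modify */
    ls_store_qw(base, lo);                   /* write  */

    if (n_lo < 4u) {                         /* remainder in next quadword */
        quadword hi = ls_load_qw(base + QW_BYTES);
        memcpy(hi.b, bytes + n_lo, 4u - n_lo);
        ls_store_qw(base + QW_BYTES, hi);
    }
}

/* Load a 4-byte value: fetch the enclosing quadword for each byte and
   pick that byte out. Simple rather than efficient. */
uint32_t ls_load_u32(uint32_t addr)
{
    uint32_t v = 0;
    for (uint32_t i = 0; i < 4u; i++) {
        quadword q = ls_load_qw(addr + i);
        v = (v << 8) | q.b[(addr + i) % QW_BYTES];
    }
    return v;
}
```

Calling `ls_store_u32(6, 0x11223344)` changes only bytes 6 through 9 of the simulated store and leaves the other twelve bytes of that quadword unchanged, which is the purpose of the read-modify-write sequence.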

The lowest priority for LS access is given to instruction fetches, of which there are three types: flush-initiated fetches, inline prefetches, and hint fetches. Instruction fetches load 32 instructions per SPU request by accessing all banks of the LS simultaneously. Because the LS is single-ported, it is important that DMA and instruction-fetch activity transfer as much useful data as possible in each LS request.
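
The figure of 32 instructions per fetch follows from the 128-byte fetch size in Table 1, as the short calculation below shows. It assumes fixed 4-byte instructions and an issue rate of at most two instructions per cycle; neither assumption is stated in this section, and the function names are hypothetical. Under those assumptions, straight-line code needs a new fetch no more often than once every 16 cycles, which is one plausible reading of the 1/16 occupancy listed for inline fetches.

```c
#include <assert.h>

enum {
    LS_FETCH_BYTES          = 128, /* fetch transfer size, from Table 1 */
    SPU_INSN_BYTES          = 4,   /* assumption: fixed 32-bit instructions */
    SPU_MAX_ISSUE_PER_CYCLE = 2    /* assumption: at most two issued per cycle */
};

/* Instructions delivered by one instruction-fetch request. */
int insns_per_fetch(void)
{
    return LS_FETCH_BYTES / SPU_INSN_BYTES;
}

/* Fewest cycles in which straight-line code can consume one fetch;
   the reciprocal bounds the LS occupancy of inline fetching. */
int min_cycles_per_fetch(void)
{
    return insns_per_fetch() / SPU_MAX_ISSUE_PER_CYCLE;
}
```

Here `insns_per_fetch()` evaluates to 32 and `min_cycles_per_fetch()` to 16.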