The local store (LS) can be regarded as a software-controlled cache that is filled and emptied by DMA transfers.
Table 1 summarizes the LS-arbitration priorities and transfer sizes. DMA reads and writes always have highest priority. Because hardware supports 128-byte DMA reads and writes, these operations occupy, at most, one of every eight cycles (one of sixteen for DMA reads, and one of sixteen for DMA writes) to the LS. Thus, except for highly optimized code, the impact of DMA reads and writes on LS availability for loads, stores, and instruction fetches can be ignored.
Transaction | Transfer Size (Bytes) | Priority | Maximum Local Store Occupancy (SPU Cycles) | Access Path |
---|---|---|---|---|
MMIO | ≤ 16 | 1-Highest | 1/8 | Line Interface |
DMA | ≤ 128 | 1 | 1/8 | |
DMA-List | 128 | 1 | 1/4 | Quadword Interface |
ECC Scrub | 16 | 2 | 1/10 | |
SPU Load/Store | 16 | 3 | 1 | |
Hint Fetch | 128 | 3 | 1 | Line Interface |
Inline Fetch | 128 | 4-Lowest | 1/16 for inline code | |
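The occupancy figures translate directly into peak bandwidth and into the share of LS cycles left for other requesters. The following is a minimal arithmetic sketch, not a hardware model: the function names are hypothetical, and it considers only DMA reads and writes, ignoring MMIO, DMA-list, and ECC-scrub traffic.

```c
#include <assert.h>

/* Bytes moved by one full-size DMA access to the LS (from the table). */
enum { LS_LINE_BYTES = 128 };

/* Peak bytes per SPU cycle for a transaction that may claim one LS
 * cycle out of every `period` cycles (hypothetical helper). */
static double ls_peak_bytes_per_cycle(int bytes_per_access, int period)
{
    assert(period > 0);
    return (double)bytes_per_access / period;
}

/* Fraction of LS cycles still available to loads, stores, and
 * instruction fetches when DMA uses its full share (hypothetical helper). */
static double ls_cycles_left(int dma_period)
{
    assert(dma_period > 0);
    return 1.0 - 1.0 / dma_period;
}
```

Under these assumptions, `ls_peak_bytes_per_cycle(LS_LINE_BYTES, 16)` gives 8 bytes per cycle for each DMA direction, and `ls_cycles_left(8)` gives 0.875, which is why DMA traffic rarely matters for LS availability.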
After DMA reads and writes, the next-highest user-initiated priority is given to load and store instructions. The rationale for doing so is that load and store instructions usually help a program's progress, whereas instruction fetches are often speculative. The SPE supports only 16-byte load and store operations that are 16-byte-aligned. It uses a second instruction (byte shuffle) to place bytes in a different order if, for example, the program requires only a 4-byte quantity or a quantity with a different data alignment. To store something that is not aligned, use a read-modify-write operation.
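The aligned-load-plus-shuffle and read-modify-write patterns described above can be illustrated with a portable C simulation. This is a sketch under stated assumptions, not SPU code: the LS is modeled as a plain byte array, `ls_load` and `ls_store` are hypothetical stand-ins for the 16-byte load and store instructions, the byte selection stands in for the shuffle instruction, and the 4-byte value is assumed to be 4-byte-aligned so that it never straddles two quadwords.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { QWORD = 16 };                      /* size of every LS load/store */

typedef struct { uint8_t b[QWORD]; } qword_t;

/* Hypothetical stand-in for a 16-byte, 16-byte-aligned load. */
static qword_t ls_load(const uint8_t *ls, uint32_t addr)
{
    qword_t q;
    assert(addr % QWORD == 0);
    memcpy(q.b, ls + addr, QWORD);
    return q;
}

/* Hypothetical stand-in for a 16-byte, 16-byte-aligned store. */
static void ls_store(uint8_t *ls, uint32_t addr, qword_t q)
{
    assert(addr % QWORD == 0);
    memcpy(ls + addr, q.b, QWORD);
}

/* Read a 4-byte value: load the enclosing quadword, then select the
 * four bytes of interest (the role the byte shuffle plays). */
static uint32_t read_u32(const uint8_t *ls, uint32_t addr)
{
    uint32_t off = addr % QWORD;
    qword_t q;
    assert(addr % 4 == 0);
    q = ls_load(ls, addr - off);
    /* Big-endian byte order is assumed in this simulation. */
    return ((uint32_t)q.b[off] << 24) | ((uint32_t)q.b[off + 1] << 16) |
           ((uint32_t)q.b[off + 2] << 8) | (uint32_t)q.b[off + 3];
}

/* Write a 4-byte value with read-modify-write of the enclosing quadword. */
static void write_u32(uint8_t *ls, uint32_t addr, uint32_t v)
{
    uint32_t off = addr % QWORD;
    qword_t q;
    assert(addr % 4 == 0);
    q = ls_load(ls, addr - off);          /* read   */
    q.b[off]     = (uint8_t)(v >> 24);    /* modify */
    q.b[off + 1] = (uint8_t)(v >> 16);
    q.b[off + 2] = (uint8_t)(v >> 8);
    q.b[off + 3] = (uint8_t)v;
    ls_store(ls, addr - off, q);          /* write  */
}
```

Writing to address 20, for example, loads the quadword at address 16, replaces bytes 4 through 7 of it, and stores all 16 bytes back, leaving the neighboring bytes unchanged.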
The lowest priority for LS access is given to instruction fetches, of which there are three types: flush-initiated fetches, inline prefetches, and hint fetches. Instruction fetches load 32 instructions per SPU request by accessing all banks of the LS simultaneously. Because the LS is single-ported, it is important that DMA and instruction-fetch activity transfer as much useful data as possible in each LS request.
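To make "useful data per LS request" concrete, the sketch below counts how many instructions in a fetched line lie at or after a branch target. It assumes 4-byte instructions (128 bytes divided by 32 instructions) and, as an assumption the text above does not state, that each fetch covers one 128-byte-aligned line. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum {
    FETCH_BYTES      = 128,                       /* bytes per instruction fetch */
    INSTR_BYTES      = 4,                         /* 128 bytes / 32 instructions */
    INSTRS_PER_FETCH = FETCH_BYTES / INSTR_BYTES  /* 32 */
};

/* Number of instructions from `target` to the end of its 128-byte line.
 * Assumes fetches are 128-byte-aligned (an assumption of this sketch). */
static unsigned useful_instrs_in_first_fetch(uint32_t target)
{
    assert(target % INSTR_BYTES == 0);
    return (FETCH_BYTES - target % FETCH_BYTES) / INSTR_BYTES;
}
```

Under that assumption, a branch target at the start of a line makes all 32 fetched instructions usable, whereas a target in the last slot of a line yields only one, so aligning frequently taken branch targets gets more out of each LS request.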