Local Store

The local store (LS) can be regarded as a software-controlled cache that is filled and emptied by DMA transfers.

Key features of the LS include:

Holds instructions and data
16-bytes-per-cycle load and store bandwidth, quadword aligned only
128-bytes-per-cycle DMA-transfer bandwidth
128-byte instruction prefetch per cycle

The following operations can compete for access to the LS:

loads,
stores,
DMA reads,
DMA writes,
instruction fetches.

The SPU arbitrates access to the LS according to the following priorities (with the highest priority first):

1. DMA reads and writes by the PPE or an I/O device.
2. SPU loads and stores.
3. Instruction prefetch.
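
The fixed-priority scheme above can be sketched as a small arbiter in portable C. This is only an illustrative model of the three-level list, not the SPU hardware logic; the names ls_requester and ls_arbitrate and the bitmask encoding are assumptions made for the example.

```c
#include <assert.h>

/* Requesters in the priority order of the list above (lower value wins).
 * Hypothetical encoding, chosen only for this sketch. */
enum ls_requester {
    LS_DMA = 0,        /* DMA reads and writes */
    LS_LOAD_STORE = 1, /* SPU loads and stores */
    LS_IFETCH = 2,     /* instruction prefetch */
    LS_NONE = 3        /* nothing pending this cycle */
};

/* Bit i of `pending` is set when requester i wants the LS this cycle.
 * Returns the single requester that is granted the LS. */
static enum ls_requester ls_arbitrate(unsigned pending)
{
    assert((pending & ~0x7u) == 0); /* only three request bits are defined */
    for (int r = LS_DMA; r <= LS_IFETCH; ++r) {
        if (pending & (1u << r)) {
            return (enum ls_requester)r;
        }
    }
    return LS_NONE;
}
```

For example, when a load and an instruction prefetch are both pending (mask 0x6), the sketch grants the load, and the prefetch has to retry in a later cycle.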

Table 1 summarizes the LS-arbitration priorities and transfer sizes. DMA reads and writes always have highest priority. Because hardware supports 128-byte DMA reads and writes, these operations occupy, at most, one of every eight cycles (one of sixteen for DMA reads, and one of sixteen for DMA writes) to the LS. Thus, except for highly optimized code, the impact of DMA reads and writes on LS availability for loads, stores, and instruction fetches can be ignored.
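
The occupancy bound in the preceding paragraph is simple arithmetic: 1/16 + 1/16 = 1/8 of the cycles at most, which leaves at least 7/8 for loads, stores, and instruction fetches. A minimal sketch in C, where the function and constant names are hypothetical and the periods are taken directly from the text:

```c
/* Worst case from the text: DMA reads take 1 of every 16 LS cycles,
 * and DMA writes take another 1 of every 16. */
enum { DMA_READ_PERIOD = 16, DMA_WRITE_PERIOD = 16 };

/* Upper bound on LS cycles taken by DMA in a window of `cycles` SPU cycles.
 * Integer division rounds down for windows that are not multiples of 16. */
static unsigned long dma_cycles_max(unsigned long cycles)
{
    return cycles / DMA_READ_PERIOD + cycles / DMA_WRITE_PERIOD;
}

/* Lower bound on LS cycles left for loads, stores, and instruction fetches. */
static unsigned long spu_cycles_min(unsigned long cycles)
{
    return cycles - dma_cycles_max(cycles);
}
```

In a window of 1600 cycles, for instance, DMA can claim at most 200 cycles, so at least 1400 remain available to the SPU.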

Table 1. LS-Access Arbitration Priority and Transfer Size
Transaction	Transfer Size (Bytes)	Priority	Maximum Local Store Occupancy (Fraction of SPU Cycles)	Access Path
MMIO	≤ 16	1-Highest	1/8	Line Interface
DMA	≤ 128	1	1/8	Line Interface
DMA-List Transfer-Element Fetch	128	1	1/4	Quadword Interface
ECC Scrub	16	2	1/10
SPU Load/Store	16	3	1
Hint Fetch	128	3	1	Line Interface
Inline Fetch	128	4-Lowest	1/16 for inline code	Line Interface

After DMA reads and writes, the next-highest user-initiated priority is given to load and store instructions. The rationale for doing so is that load and store instructions usually help a program's progress, whereas instruction fetches are often speculative. The SPE supports only 16-byte load and store operations that are 16-byte-aligned. It uses a second instruction (byte shuffle) to place bytes in a different order if, for example, the program requires only a 4-byte quantity or a quantity with a different data alignment. To store a value that is smaller than a quadword or is not quadword-aligned, use a read-modify-write sequence: load the containing quadword, merge the new bytes into it, and store the whole quadword back.
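
The quadword-only access pattern and the read-modify-write store can be illustrated with a portable C emulation. This is a sketch under stated assumptions, not SPU code: the ls array is a tiny hypothetical stand-in for the local store, all function names are invented for the example, memcpy stands in for the byte-shuffle step, values use the host byte order, and a 4-byte value is assumed not to straddle a quadword boundary.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LS_SIZE  256u /* tiny stand-in for the local store (assumption) */
#define QUADWORD 16u

static uint8_t ls[LS_SIZE]; /* hypothetical LS image */

/* Aligned 16-byte load: the low four address bits are dropped,
 * mirroring quadword-only access. */
static void ls_load_qw(uint32_t addr, uint8_t out[QUADWORD])
{
    assert(addr < LS_SIZE);
    memcpy(out, &ls[addr & ~(uint32_t)(QUADWORD - 1u)], QUADWORD);
}

/* Aligned 16-byte store. */
static void ls_store_qw(uint32_t addr, const uint8_t in[QUADWORD])
{
    assert(addr < LS_SIZE);
    memcpy(&ls[addr & ~(uint32_t)(QUADWORD - 1u)], in, QUADWORD);
}

/* 4-byte load: fetch the containing quadword, then pick out the bytes
 * (the memcpy plays the role of the byte shuffle). */
static uint32_t ls_load_u32(uint32_t addr)
{
    uint8_t qw[QUADWORD];
    uint32_t value;
    assert((addr & (QUADWORD - 1u)) <= QUADWORD - sizeof value);
    ls_load_qw(addr, qw);
    memcpy(&value, &qw[addr & (QUADWORD - 1u)], sizeof value);
    return value;
}

/* 4-byte store as read-modify-write: load the quadword, merge, store back. */
static void ls_store_u32(uint32_t addr, uint32_t value)
{
    uint8_t qw[QUADWORD];
    assert((addr & (QUADWORD - 1u)) <= QUADWORD - sizeof value);
    ls_load_qw(addr, qw);                                       /* read   */
    memcpy(&qw[addr & (QUADWORD - 1u)], &value, sizeof value);  /* modify */
    ls_store_qw(addr, qw);                                      /* write  */
}
```

Calling ls_store_u32(20, v) rewrites the whole quadword at bytes 16 through 31 but changes only bytes 20 through 23, which is why a sub-quadword store costs both a load and a store to the LS.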

The lowest priority for LS access is given to instruction fetches, of which there are three types: flush-initiated fetches, inline prefetches, and hint fetches. Instruction fetches load 32 instructions per SPU request by accessing all banks of the LS simultaneously. Because the LS is single-ported, it is important that DMA and instruction-fetch activity transfer as much useful data as possible in each LS request.
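
The figure of 32 instructions per request follows from the 128-byte fetch width and the 4-byte SPU instruction size. The sketch below also estimates how much of a fetched line is useful when execution enters it partway through, for example at a branch target. It assumes that fetches cover 128-byte-aligned lines, which the text above does not state explicitly, and its names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { LS_LINE_BYTES = 128, SPU_INSN_BYTES = 4 };

/* Instructions delivered by one instruction fetch: 128 / 4 = 32. */
static unsigned insns_per_fetch(void)
{
    return LS_LINE_BYTES / SPU_INSN_BYTES;
}

/* Instructions in the fetched line at or after `entry_addr`.
 * Assumes the fetch covers the 128-byte-aligned line containing entry_addr. */
static unsigned useful_insns(uint32_t entry_addr)
{
    assert(entry_addr % SPU_INSN_BYTES == 0); /* instructions are word-aligned */
    return (LS_LINE_BYTES - entry_addr % LS_LINE_BYTES) / SPU_INSN_BYTES;
}
```

Under that assumption, entering a line at its first instruction makes all 32 fetched instructions potentially useful, whereas entering at the last word of a line yields a single useful instruction for the same LS request.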