Although the Cell Broadband Engine is initially intended for application in game consoles and media-rich consumer-electronics devices such as high-definition televisions, the architecture and the Cell Broadband Engine implementation have been designed to enable fundamental advances in processor performance. A much broader use of the architecture is envisioned.
The PPE (the first type of processor element) is a 64-bit PowerPC Architecture core. It is fully compliant with the 64-bit PowerPC Architecture and can run 32-bit and 64-bit operating systems and applications.
The SPE (the second type of processor element) is optimized for running compute-intensive applications, and it is not optimized for running an operating system. The SPEs are independent processors, each running its own individual application programs. Each SPE has full access to coherent shared memory, including the memory-mapped I/O space.
The designation synergistic for this processor was chosen carefully because there is a mutual dependence between the PPE and the SPEs. The SPEs depend on the PPE to run the operating system, and, in many cases, the top-level control thread of an application. The PPE depends on the SPEs to provide the bulk of the application performance.
The SPEs are designed to be programmed in high-level languages and support a rich instruction set that includes extensive single-instruction, multiple-data (SIMD) functionality. However, just like conventional processors with SIMD extensions, use of SIMD data types is preferred, not mandatory. For programming convenience, the PPE also supports the PowerPC Architecture Vector/SIMD Multimedia Extension.
To an application programmer, the Cell Broadband Engine looks like a 9-way coherent multiprocessor. The PPE is more adept at control-intensive tasks and quicker at task switching. The SPEs are more adept at compute-intensive tasks and slower at task switching. However, either processor is capable of both types of functions. This specialization has allowed increased efficiency in the implementation of both the PPE and especially the SPEs. It is a significant factor in the approximate order-of-magnitude improvement in peak computational performance and area-and-power efficiency that the Cell Broadband Engine achieves over conventional PC processors.
The reason for this radical change is that memory latency, measured in processor cycles, has gone up several hundredfold in the last 20 years. The result is that application performance is, in most cases, limited by memory latency rather than by peak compute capability or peak bandwidth. When a sequential program on a conventional architecture performs a load instruction that misses in the caches, program execution now comes to a halt for several hundred cycles. Compared to this penalty, the few cycles it takes to set up a DMA transfer for an SPE is quite small. Conventional processors, even with deep and costly speculation, manage to get, at best, a handful of independent memory accesses in flight. The result can be compared to a bucket brigade in which a hundred people are required to cover the distance to the water needed to put the fire out, but only a few buckets are available. In contrast, the explicit DMA model allows each SPE to have many concurrent memory accesses in flight, without the need for speculation.
The most productive SPE memory-access model appears to be the one in which a list (such as a scatter-gather list) of DMA transfers is constructed in an SPE’s local store, so that the SPE’s DMA controller can process the list asynchronously while the SPE operates on previously transferred data. In several cases, this new approach to accessing memory has led to application performance exceeding that of conventional processors by almost two orders of magnitude, significantly more than one would expect from the peak performance ratio (about 10x) between the Cell Broadband Engine and conventional PC processors.
It is also possible to write compilers that manage an SPE’s local Store as a very large second-level register file or to automatically bring in code when needed and present a conventional symmetric multiprocessing (SMP) model. Although such a compiler exists, at least in prototype form, it does not today result in the most optimal application performance. Hence, this tutorial focuses on approaches to programming the Cell Broadband Engine that expose the local store and the asynchronous DMA-transfer commands.