Programming Tutorial
Note
Edition notice
Preface
Overview of the Cell Broadband Engine
Introduction
Background and motivations
Scaling the three performance-limiting walls
Scaling the power-limitation wall
Scaling the memory-limitation wall
Scaling the frequency-limitation wall
How the Cell Broadband Engine overcomes performance limitations
Architecture overview
The PowerPC Processor Element
Synergistic Processor Elements
Programming overview
Byte ordering and bit numbering
SIMD vectorization
SIMD C-language intrinsics
Threads and tasks
The runtime environment
Application partitioning
The software development kit
The PPE and the programming process
PPE registers
PPE instruction sets
PowerPC instructions
Addressing modes
Instruction types
Compatibility with existing PowerPC code
Vector/SIMD Multimedia Extension instructions
Addressing modes
Instruction types
C/C++ language extensions (intrinsics)
Scalar intrinsics
Vector data types
Vector intrinsics
Programming with Vector/SIMD Multimedia Extension intrinsics
Example: incorporating Vector instructions into a PPE program
Example: array-summing
The PPE and the SPEs
Storage domains
Issuing DMA commands from the PPE
Creating threads for the SPEs
Communication between the PPE and SPEs
Developing code for the Cell Broadband Engine
Producing a simple multi-threaded CBE program
Running the program in the simulator
Debugging programs
Programming the SPEs
SPE configuration
Synergistic Processor Unit
SPE registers
Floating-point operations
Local Store
Pipelines and dual-issue rules
Memory flow controller
Channels
Channel instructions
Mailboxes
Signal notification
SPU instruction set
Data layout in registers
Instruction types
SPU C/C++ language extensions (intrinsics)
Assembly language versus intrinsics comparison: an example
Intrinsic classes
Specific intrinsics
Generic intrinsics
Composite SPU intrinsics
Promoting scalar data types to vector data types
Differences between PPE and SPE SIMD support
Architectural differences between PPE and SPE SIMD support
Language-extension differences between PPE and SPE SIMD support
Compiler directives
MFC commands
DMA-command tag groups
Synchronizing DMA transfers
MFC input and output macros
Coding methods and examples
DMA transfers
DMA-list transfers
Creating the DMA list
Initiating the transfers specified in the DMA list
DMA-list transfers: programming example
Moving double-buffered data
Vectorizing a loop
Reducing the impact of branches
Function-inlining and loop-unrolling
Predication using select-bits instruction
Reducing branch mispredicts with branch hint
Porting SIMD code from the PPE to the SPEs
Code-mapping considerations
Code-mapping performance considerations
Unmappable constructs considerations
Limited size of LS considerations
Equivalent precision considerations
Simple macro translation
Example 1: Euler particle-system simulation
Initial scalar code
Step 1: SIMDize the code for execution on the PPE
Step 2: Port the PPE code for execution on the SPE
Step 3: Parallelize code for execution across multiple SPEs
Performance analysis
Performance issues
Example 1: Tuning SPE performance with static and dynamic timing analysis
Static analysis of SPE threads
Dynamic analysis of SPE threads
Optimizations
Static analysis of optimizations
Dynamic analysis of optimizations
General SPE programming tips
Programming models
Function-Offload Model
Remote procedure call
Device-Extension Model
Computation-Acceleration Model
Streaming model
Shared-Memory Multiprocessor Model
Asymmetric-Thread Runtime Model
User-mode thread model
Cell application frameworks
SPE overlays
The simulator
Simulator basics
Operating-system modes
Linux mode
Standalone mode
Interacting with the simulator
Command-line interface
Graphical User Interface
The simulation panel
PPE components
SPE components
GUI buttons
Performance monitoring
Displaying performance statistics
SPE performance profile checkpoints
Example program: tpa1
Emitters
SPU performance and semantics
Notices
Trademarks
Glossary
Index