## **Why Massively Parallel Processor** A quiet revolution and potential build-up (G80 numbers) • Calculation: 544 GFLOPS vs. 264.96 GFLOPS (FP-64) Memory Bandwidth: 153.6GB/s vs. 25.6 GB/s • Until recently, programmed through graphics API GPU in every PC and workstation - massive volume and potential #### **Future Apps in Concurrent World** - Exciting applications in future mass computing market - Molecular dynamics simulation - Video and audio coding and manipulation - 3D imaging and visualization Consumer game physics Virtual reality products - Various granularities of parallelism exist, but... programming model must not hinder parallel implementation - data delivery needs careful management - Introducing domain-specific architecture CUDA for GPGPU #### What is GPGPU? - General Purpose computation using GPU in applications (other than 3D graphics) - GPU accelerates critical path of application - Data parallel algorithms leverage GPU attributes - Large data arrays, streaming throughput - Fine-grain SIMD (single-instruction multiple-data) parallelism - Low-latency floating point (FP) computation - Applications see //GPGPU.org - Game effects (FX) physics, image processing - Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting #### **GPU and CPU: The Differences** - GPU - More transistors devoted to computation, instead of caching - Suitable for data-intensive computation - -High arithmetic/memory operation ratio #### **CUDA** - "Compute Unified Device Architecture" - General purpose programming model User kicks off batches of threads on the GPU - GPU = dedicated super-threaded, massively data parallel co-processor - Targeted software stack - Compute oriented drivers, language, and tools - Driver for loading computation programs into GPU - grants into ero Standalone Driver Optimized for computation Guaranteed maximum download & readback speeds Explicit GPU memory - management #### **CUDA Programming Model** - The GPU is viewed as a compute device that: - Is a coprocessor to the CPU or host - Has its own DRAM (device memory) - Runs many threads in parallel Hardware switching between threads (in 1 cycle) on long-latency memory reference - Overprovision (1000s of threads) $\rightarrow$ hide latencies - Data-parallel portions of an application are executed on the device as kernels which run in parallel on many threads - Differences between GPU and CPU threads - GPU threads are extremely lightweight - Very little creation overhead - GPU needs 1000s of threads for full efficiency - Multi-core CPU needs only a few #### Thread Batching: Grids and Blocks Kernel executed as a grid of thread All threads share data memory space Thread block is a batch of threads, can cooperate with each other by: cooperate with each other by: Synchronizing their execution: For hazard-free shared memory accesses Efficiently sharing data through a low latency shared memory Two threads from two different blocks cannot cooperate (Unless thru slow global memory) Threads and blocks have IDs Courtesy: NDVIA #### **Extended C** - Declspecs global, device, shared, local, constant - Keywords threadIdx, blockIdx - Intrinsics - \_\_syncthreads - Runtime API - Memory, symbol, execution management - Function launch - \_\_device\_\_ float filter[N]; \_\_global\_\_ void convolve (float \*image) { \_\_shared\_\_ float region[M]; - region[threadIdx] = image[i]; - \_\_syncthreads() image[j] = result; - // Allocate GPU memory void \*myimage = cudaMalloc(bytes) - // 100 blocks, 10 threads per block convolve<<<100, 10>>> (myimage); #### **CUDA Function Declarations** | | Executed on the: | Only callable<br>from the: | |---------------------------|------------------|----------------------------| | device float DeviceFunc() | device | device | | global void KernelFunc() | device | Host | | host float HostFunc() | host | Host | - \_global\_ defines a kernel function - Must return void memories • \_\_device\_\_ and \_\_host\_\_ can be used together # Global, Constant, and Texture Memories (Long Latency Accesses) • Global memory — Main means of communicating R/W Data between host and device — Contents visible to all threads • Texture and Constant Memories — Constants initialized by host — Contents visible to all threads #### Calling Kernel Function - Thread Creation • A kernel function must be called with an execution configuration: ``` __global__ void KernelFunc(...); dim3 DimGrid(100, 50); // 5000 thread blocks dim3 DimBlock(4, 8, 8); // 256 threads per block size_t SharedMemBytes = 64; // 64 bytes of shared memory KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...); ``` - Any call to a kernel function is asynchronous (CUDA 1.0 & later), explicit synch needed for blocking - Recursion in kernels supported (in 5.0/Kepler+) ### Sample Code: Increment Array ``` main() { float *a,h, *a_d; int i, N=10; size_t size = N*sizeof(float); a.h = (float *)malloc(size); for (i=0; i<N; i++) a_h[i] = (float)i; blockOmx _____ blockldx.x 0 // allocate array on device cudaMalloc((void **) &a_d, size); // copy data from host to device cudaMemcpy(a_d, a_h, sizeof(float)*N, cudaMemcpyHostToDevice); // do calculation on device: // Part l of 2. Compute execution configuration **Reseduc: // Part l of 2. Compute execution configuration **Reseduc: // cleanup free(a_h); cudaFree(a d); int idx = blockIdx.x*blockDim.x + threadIdx.x; if (idx<N) a[idx] = a[idx]+1.f;</pre> ``` #### **Execution model** #### $\hbox{Multiple levels of parallelism}$ - Thread block - Max. 1024 threads/block - Communication through shared memory (fast) - Thread guaranteed to be resident - \_\_syncthreads() → barrier for this block only! avoid RAW/WAR/WAW hazards when ref' shared/global memory - · Grid of thread blocks - F<<<nblocks, nthreads>>>(a, b, c) ## Hardware Implementation: Execution Model - Each thread block of a grid is split into warps, each gets executed by one multiprocessor (SM) - The device processes only one grid at a time - Each thread block is executed by one multiprocessor - So that the shared memory space resides in the on-chip shared memory - A multiprocessor can execute multiple blocks concurrently - Shared memory and registers are partitioned among the threads of all concurrent blocks - So, decreasing shared memory usage (per block) and register usage (per thread) increases number of blocks that can run concurrently #### Threads, Warps, Blocks - There are (up to) 32 threads in a Warp - Only <32 when there are fewer than 32 total threads - There are (up to) 32 Warps in a Block - Each Block (and thus, each Warp) executes on a single SM - GF110 has 16 SMs - At least 16 Blocks required to "fill" the device - More is better - If resources (registers, thread space, shared memory) allow, more than 1 Block can occupy each SM ... #### **More Terminology Review** - device = GPU = set of multiprocessors - Multiprocessor = set of processors & shared memory - Kernel = GPU program - Grid = array of thread blocks that execute a kernel - Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory | Memory | Location | Cached | Access | Who | |----------|----------|----------------|------------|------------------------| | Local | Off-chip | No | Read/write | One thread | | Shared | On-chip | N/A - resident | Read/write | All threads in a block | | Global | Off-chip | No | Read/write | All threads + host | | Constant | Off-chip | Yes | Read | All threads + host | | Texture | Off-chip | Yes | Read | All threads + host | 20 #### **Access Times** - Register dedicated HW single cycle - Shared Memory dedicated HW single cycle - Local Memory DRAM, no cache \*slow\* - Global Memory DRAM, no cache \*slow\* - Constant Memory DRAM, cached, 1...10s...100s of cycles, depending on cache locality - Texture Memory DRAM, cached, 1...10s...100s of cycles, depending on cache locality - Instruction Memory (invisible) DRAM, cached | | | | _ | |--|--|--|-------| | | | | _ | | | | | _ | | | | | <br>_ | | | | | _ | | | | | | | | | | | | | | | | | | | | | # Using per-block shared memory • Variables shared across block int \*begin, \*end; • Scratchpad memory \_\_shared\_\_ int scratch[blocksize]; scratch[threadIdx.x] = begin[threadIdx.x]; // ... compute on scratch values ... begin[threadIdx.x] = scratch[threadIdx.x]; • Communicating values between threads scratch[threadIdx.x] = begin[treadIdx.x]; \_\_syncthreads(); int left = scratch[threadIdx.x - 1]; ## Summing up a sequence with 1 thread: int sum = 0; for (int i=0; i<N; ++i) sum += x[i]; Parallel reduction builds a summation tree — each thread holds 1 element — stepwise partial sums — N threads need log N steps — one possible approach: Butterfly pattern #### **Parallel Reduction for 1 Block** // INPUT: Thread i holds value x i int i = threadIdx.x: shared\_\_ int sum[blocksize]; // One thread per element for(int bit=blocksize/2; bit>0; bit/=2) int t=sum[i]+sum[i^bit]; \_\_syncthreads(); sum[i]=t; \_\_syncthreads(); // OUTPUT: Every thread now holds sum in sum[i] #### **Parallel Reduction Across Blocks** - Code lets B-thread block reduce B-element array - For larger sequences: - reduce each B-element subsequence with 1 block - write N/B partial sums to temporary array - repeat until done Level 0: 8 blocks eKernel<<<8,512>> Level 1: 1 block SameKernel<<<1,8>>>(...) - P.S. this works for min, max, \*, and friends too - as written requires associative & commutative function - can restructure to work with any associative functio #### **Language Extensions** Built-in Variables - dim3 gridDim; - Dimensions of the grid in blocks (gridDim.z unused) - dim3 blockDim; - Dimensions of the block in - dim3 blockIdx; - Block index within the grid - dim3 threadIdx; - Thread index within the block - Math Functions: sin, cos, tan, asin, ... - Math device functions: - \_\_sin, ... (faster, less accurate) - Atomic device functions: atomicAdd(), atomicCAS(),... - Can implement locks In Kernel Memory Management - malloc() - free() | _ | | | |---|--|--| | - | | | | - | | | | _ | | | | _ | | | | | | | | | | | | | | | | _ | | | | | | | | - | | | | - | | | | - | | | | _ | | | | _ | | | | | | | | - | | | | | | | | | | | | | | | | - | | | | - | | | | - | | | | _ | | | | _ | | | | _ | | | | - | | | | - | | | #### **Tesla Architecture** - Used for Technical and Scientific Computing - L1/L2 Data Cache - Allows for caching of global and local data Same on-chip memory used for Shared and L1 Configurable at kernel invocation #### Fermi Architecture - L1 cache for each SM - Shared memory/L1: use same memory Configurable partitions at kernel invocation 48KB shared/16KB L1 or 16KB shared/48KB L1 - Unified 768KB L2 Data Cache - Services all load, store, and texture requests #### **Kepler Architecture** - GK104/K10 early 2012) - Configurable shared memory access bank width: 4 / 8 bytes cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);... - GK110/K20 (late 2012) Dynamic parallelism, HyperQ, more regs/thread & DP throughput #### **CUDA Toolkit Libraries** NVIDIA GPU-accelerated math libraries: - cuFFT Fast Fourier Transforms Library - cuBLAS Complete BLAS library - cuSPARSE Sparse Matrix library - cuRAND Random Number Generation (RNG) Library - Performance improved since 3.1 - For more info see - CULA linear algebra library (commercial add-on) - Single precision version free, double costs \$s - Thrust: C++ template lib $\rightarrow$ STL-like — Boost-like saxpy: thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), a \* \_1 + \_2); #### **Libraries & More** - Object linking - Plug-ins, libraries - Dynamic parallelism GPU threads can launch new kernels - RDMA from GPU(node1) → GPU(node2) #### **Tools** - Visual Profiler - Where is the time spent? - CUDA-gdb: debugger - Parallel Nsight + Eclipse - Debugger - Memory checker - Traces (CPU vs. GPU activity) - Profiler (memory, instruction throughput, stall) - Nvidia-smi - Turn off ECC - Read performance counters | - | | | |---|--|--| | | | | | | | | | | | | | | | | | | | | #### **Timing CUDA Kernels** Real-time Event API cudaEvent\_t cstart, cstop; float cdiff; cudaEventCreate(&cstart); cudaEventRecord( cstart, 0 ); kernel kernel y, z>>>(a,b,c,); cudaEventRecord( cstop, 0 ); cudaEventRecord( cstop, 0 ); cudaEventElapsedTime(&cdiff, cstart, cstop); Printf("CUDA time is %.3f usec\n", cdiff); cudaEventDestroy( cstart); cudaEventDestroy( cstop); 34 #### **Device Capabilities** - Need to compile for specific capability when needed - Flags in Makefile - Capability levels: - \_ 1.0: basic GPU (e.g., 8800 GTX) - 1.1: 32-bit atomics in global memory (e.g., GTX 280) - 1.2: 64-bit atomics in global+shared memory, warp voting - 1.3: double precision floating point -e.g., GTX 280/GTX 480, C1060/C1070, C2050/C2070 - 2.0: caches for global+shared memory - -e.g., GTX 480, C2050/C2070 - 3.0: more wraps, threads, blocks, registers... - -E.g., GTX 680 - 3.5: Dynamic parallelism, HyperQ - -E.g., Tesla K20? 35 #### **OpenACC** - Pragma-based Industry standard, the "OpenMP for GPUs", V1.0 - #pragma acc [clause] - For GPUs but also other accelera - For CUDA but also OpenCL... - Data movement: sync/async - Parallelism - Data layout and caching - Scheduling - Mixes w/ MPI, OpenMP - Works with C, Fortran | Open <b>ACC</b> | |-----------------------------| | DIRECTIVES FOR ACCELERATORS | | | #### **OpenACC Kernel Example** ``` • CPU void domany(...){ \texttt{\#pragma acc data} \ \backslash copy(x[0:n],y[0:n]) saxpy( n, a, x, y ); ``` ``` void saxpy( int n, float a, float* x, float* restrict y ) { int i; present(x[0:n], y[0:n]) for( i = 1; i < n; ++i ) y[i] += a*x[i];</pre> ``` #### **OpenACC Execution Constructs** - kernels [clauses...] \n { structured block} Wait: barrier - Run kernel on GPU - if (cond): only exec if cond is true - async: do not block when done - Loop [clauses...] - run iterations of loop on GPU - collapse(n): for next n loop nests - seq: sequential execution! - private ( list ): private copy of vars - firstprivate ( list ): copyin private - reduction (op:list): =\*|^&,&&,||,min/max - gang/worker: scheduling options - vector: SIMD mode - independent: iterations w/o hazards - update [clauses...] - - host ( list ): copy $\rightarrow$ CPU - device ( list ): copy $\rightarrow$ GPU - if/async: as before #### **OpenACC Data Constructs** - data [clauses...] \n {structure block} - Declare data for GPU memory - if/async: as before - copy( list ): Allocates list on GPU, copies data CPU→GPU when entering kernel and GPU→CPU when done - copyin( list ): same but only CPU→GPU copyout( list ): same but only GPU→CPU - create( list ): only allocate present( list ): data already on GPU (no copy) • present\_or\_copy[in/out[( list ): if not - present then copy [in/out] present\_or\_create( list ): if not present - then allocate - deviceptr( list ): lists pointers of device addresses, such as from acc\_malloc. #### **OpenACC Update** #### OpenACC Async #### **OpenACC Data Caching** • Uses shared memory (SM / scratch pad memory) \*pragma acc kernels loop present (a[:][js-1:je+1],b[:][js-1:js+1]) for(j = js; j <= js+) for (i = 2; i <= n-1; i++) \*pragma acc cache(b[i-1:i+1][j-1:j+1]) a[i][j] = b[i][j] + \* (b[i-1][j] + b[i+1][j] + b[i][j-1] + b[i][j+1]) #### OpenACC Parallel / Loop (for) ``` • GPU Parallel #pragma acc parallel \ copy(x[0:n],y[0:n]) { saxpy( n, a, x, y ); ``` ``` • GPU Loop ``` ``` protopy void saxpy( int n, float a, float* x, float* restrict y ) { int i; pragma acc loop for( i = 1; i < n; ++i ) y[i] += a*x[i]; }</pre> ``` 43 #### **OpenACC Runtime Constructs** - #include "openacc.h" - acc\_malloc( size\_t ) - acc\_free( void\* ) - acc\_async\_test( expression ) - acc\_async\_test\_all() - acc\_async\_wait( expression ) - acc\_async\_wait\_all() 44 # Cray OpenACC Directives, Options, Restructuring HPC Code CFT Performance Vectorization User This Feedback Loop Unique to Compilers! We can use this same methodology to enable effective migration of applications to Multi-core and Accelerators