Homework 2

Deadline: see web page
Assignments: All parts are to be solved individually and turned in electronically. Written parts must be plain ASCII text (NO Word, PostScript, etc. permitted unless explicitly stated).

Please use the OS lab machines or the henry2 cluster (Linux) depending on the assignment. All programs have to be written in C, compiled with mpicc/gcc, and turned in with a corresponding Makefile.

  1. (0 points) Learn how to compile and execute a CUDA program.

  2. (100 points, 20*4+20) Modify the pi program from HW1 (see lecture slides) to parallelize for CUDA. Follow the steps below.

    1. V1
      • Create a subdirectory projects/pi.
      • Create the kernel (GPU) file pi_kernel.cu. In Version 1 (V1), this should be just the pi calculation using double precision (excluding the reduction step). Consider where to store the results.
      • Create the host (x86) file pi.cu. For V1, this includes all other code (initialization, memory transfers with CUDA, CUDA kernel call) as well as a reduction step of the individual results from each CUDA thread. You should time the entire calculation (from after input to before output) using gettimeofday(). You should also time the CUDA kernel using the cudaEventXXX() API (from before the first CUDA DMA/memcopy call to after the last CUDA DMA/memcopy call).
      • Create this Makefile in your directory.
      • Compile your program: type "make" in the top directory.
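    The V1 steps above could be sketched roughly as follows. This is a hedged sketch, not the required solution: the file layout, the variable names, and the fixed blocks/threads-per-block values are all assumptions; the kernel integrates 4/(1+x*x) over [0,1] with the midpoint rule, as in the HW1 pi program.

    ```cuda
    /* pi.cu / pi_kernel.cu sketch (V1): per-thread partial sums on the GPU,
     * reduction on the host, both timers as required. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    __global__ void pi_kernel(double *partial, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int nthreads = gridDim.x * blockDim.x;
        double h = 1.0 / (double)n, sum = 0.0;
        for (int i = tid; i < n; i += nthreads) {   /* strided loop over intervals */
            double x = h * ((double)i + 0.5);       /* midpoint of interval i */
            sum += 4.0 / (1.0 + x * x);
        }
        partial[tid] = sum;                         /* host reduces these in V1 */
    }

    int main(void)
    {
        int n = 10000000, blocks = 32, tpb = 32;    /* assumed defaults */
        int nthreads = blocks * tpb;
        double *h_partial = (double *)malloc(nthreads * sizeof(double));
        double *d_partial;
        struct timeval t0, t1;
        cudaEvent_t e0, e1;
        float kernel_ms;

        gettimeofday(&t0, NULL);                    /* overall timer: after input */
        cudaMalloc(&d_partial, nthreads * sizeof(double));
        cudaEventCreate(&e0); cudaEventCreate(&e1);
        cudaEventRecord(e0, 0);                     /* before first DMA/memcpy */
        pi_kernel<<<blocks, tpb>>>(d_partial, n);
        cudaMemcpy(h_partial, d_partial, nthreads * sizeof(double),
                   cudaMemcpyDeviceToHost);
        cudaEventRecord(e1, 0);                     /* after last DMA/memcpy */
        cudaEventSynchronize(e1);
        cudaEventElapsedTime(&kernel_ms, e0, e1);

        double pi = 0.0;
        for (int i = 0; i < nthreads; i++)          /* host-side reduction (V1) */
            pi += h_partial[i];
        pi /= (double)n;                            /* multiply by h = 1/n */
        gettimeofday(&t1, NULL);                    /* overall timer: before output */

        printf("pi = %.15f\n", pi);
        printf("overall: %.3f ms, kernel+copies: %.3f ms\n",
               (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_usec - t0.tv_usec) / 1e3,
               kernel_ms);
        cudaFree(d_partial); free(h_partial);
        return 0;
    }
    ```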
    2. V2
      • Introduce a command line flag "-b" that, when set, performs a per-block reduction on the GPU. The results of each block are then aggregated on the host (x86) side. Consult the slides on how to perform CUDA reductions for this purpose.
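      A per-block reduction along the lines of V2 might look like the sketch below (names and layout are assumptions; it follows the standard shared-memory tree reduction from the slides and assumes blockDim.x is a power of two):

      ```cuda
      /* Each block reduces its threads' partial sums in shared memory and
       * writes one value; the host then sums gridDim.x values instead of
       * gridDim.x * blockDim.x values. */
      __global__ void pi_kernel_blockred(double *block_sums, int n)
      {
          extern __shared__ double cache[];       /* one slot per thread */
          int tid = blockIdx.x * blockDim.x + threadIdx.x;
          int nthreads = gridDim.x * blockDim.x;
          double h = 1.0 / (double)n, sum = 0.0;
          for (int i = tid; i < n; i += nthreads) {
              double x = h * ((double)i + 0.5);
              sum += 4.0 / (1.0 + x * x);
          }
          cache[threadIdx.x] = sum;
          __syncthreads();
          for (int s = blockDim.x / 2; s > 0; s >>= 1) {  /* tree reduction */
              if (threadIdx.x < s)
                  cache[threadIdx.x] += cache[threadIdx.x + s];
              __syncthreads();
          }
          if (threadIdx.x == 0)
              block_sums[blockIdx.x] = cache[0];  /* one result per block */
      }
      ```

      The launch would pass the shared-memory size explicitly, e.g. `pi_kernel_blockred<<<blocks, tpb, tpb * sizeof(double)>>>(d_block_sums, n);`, and the host loop then only sums `blocks` values.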
    3. V3
      • Introduce a command line flag "-g" that, when set, performs a cross-block reduction on the GPU assuming that the per-block reductions have already been performed. No aggregation should be performed on the host (x86) side.
        Hint: Use a second CUDA kernel call that copies the global result of each block into a static array and perform the reduction within a single block on this array. Where do you store the result?
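        The hint above could be realized as a second kernel like this sketch (names are assumptions; it assumes the number of blocks fits in one block's worth of threads and blockDim.x is a power of two):

        ```cuda
        /* Second kernel, launched with a single block, e.g.
         * pi_reduce_final<<<1, tpb, tpb * sizeof(double)>>>(d_block_sums, d_result, blocks);
         * The final sum lands in result[0], so the host copies back one double. */
        __global__ void pi_reduce_final(double *block_sums, double *result, int nblocks)
        {
            extern __shared__ double cache[];
            int t = threadIdx.x;
            cache[t] = (t < nblocks) ? block_sums[t] : 0.0;  /* load per-block sums */
            __syncthreads();
            for (int s = blockDim.x / 2; s > 0; s >>= 1) {   /* tree reduction */
                if (t < s)
                    cache[t] += cache[t + s];
                __syncthreads();
            }
            if (t == 0)
                result[0] = cache[0];            /* single GPU-side result */
        }
        ```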
    4. V4
      • Create an MPI version of your program that parallelizes the pi calculation across p processors, each with one GPU. Each host/GPU pair is responsible for at least one interval of the pi calculation. Within its intervals, the GPU performs the pi calculation. Reductions may occur with/without the flags from V2/V3.
        • Add to your ~/.bashrc and also enter on the command line (1st time only):
              export PATH="/usr/local/mpich/bin:$PATH"
              export LD_LIBRARY_PATH="/usr/local/mpich/lib:$LD_LIBRARY_PATH"
              
        • Modify your common/common.mk:
              LINK        := mpicc -fPIC
              
        • Follow the MPICH instructions (steps 1-3 only) on creating a login environment without a need to enter passwords (when going between the osXX machines).
        • Create a file hosts.
        • Compile: make
        • Run: mpirun -np 4 -machinefile hosts bin/linux/release/pi
        • Hint: If you have problems getting MPICH to work, try a simple "hello, world" MPI program first.
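        Such a test program is only a few lines; this sketch (file name is an assumption) can be compiled with mpicc and launched exactly like the pi binary above:

        ```c
        /* hello.c -- minimal MPICH sanity check:
         *   mpicc hello.c -o hello
         *   mpirun -np 4 -machinefile hosts ./hello */
        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            int rank, size, len;
            char name[MPI_MAX_PROCESSOR_NAME];

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
            MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total process count */
            MPI_Get_processor_name(name, &len);     /* which osXX machine */
            printf("hello, world from rank %d of %d on %s\n", rank, size, name);
            MPI_Finalize();
            return 0;
        }
        ```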
    5. Report the overall time and the CUDA kernel time for each version as part of your README file for 10,000,000 intervals, different numbers of GPU threads (16, 1024, 32768), and different numbers of threads per block (16, 32, 512).

    Turn in README, pi.cu, pi_kernel.cu, common.mk.

What to turn in for programming assignments: