Homework 2

Deadline: see web page

Assignments: All parts are to be solved individually. Turn them in
electronically; written parts must be plain ASCII text (no Word,
PostScript, etc., unless explicitly stated). Please use the OS lab
machines or the henry2 cluster (Linux), depending on the assignment.
All programs have to be written in C, compiled with mpicc/gcc, and
turned in with a corresponding Makefile.
- (0 points) Learn how to compile and execute a CUDA program.
 
- (100 points, 20*4+20) Modify the pi program from HW1 (see lecture
  slides) to parallelize it for CUDA. Follow the steps below.
 
- V1
-  Create a subdirectory projects/pi.
-  Create the kernel (GPU) file pi_kernel.cu. In Version 1 (V1), this
  should contain just the pi calculation using double precision
  (excluding the reduction step). Consider where to store the results.
-  Create the host (x86) file pi.cu. For V1, this includes all other
  code (initialization, CUDA memory transfers, the CUDA kernel call)
  as well as a reduction step over the individual results from each
  CUDA thread. Time the entire calculation (from after input to
  before output) using gettimeofday(). Also time the CUDA kernel
  using the cudaEventXXX() API (from before the first CUDA
  DMA/memcopy call to after the last CUDA DMA/memcopy call).
-  Create this Makefile in your directory.
-  Compile your program: type "make" in the top directory.
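
  The V1 structure described above can be sketched as follows. All
  names (pi_kernel, h_res/d_res, the thread/block counts) are
  illustrative assumptions, not prescribed by the assignment; error
  checking is omitted for brevity, and the sketch assumes the total
  thread count is a multiple of the threads-per-block value.

  ```cuda
  /* pi_kernel.cu -- V1 sketch: per-thread partial sums, no reduction.
   * Each thread integrates 4/(1+x^2) over a strided subset of the n
   * intervals and writes its partial sum to results[tid]. */
  __global__ void pi_kernel(int n, double step, double *results)
  {
      int tid = blockIdx.x * blockDim.x + threadIdx.x;
      int nthreads = gridDim.x * blockDim.x;
      double sum = 0.0;
      for (int i = tid; i < n; i += nthreads) {
          double x = (i + 0.5) * step;   /* midpoint of interval i */
          sum += 4.0 / (1.0 + x * x);
      }
      results[tid] = sum;                /* one slot per thread */
  }

  /* pi.cu -- host-side outline with both timers from the spec. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/time.h>

  int main(int argc, char **argv)
  {
      int n = 10000000, threads = 1024, tpb = 32;
      double step = 1.0 / n;
      size_t bytes = threads * sizeof(double);
      double *h_res = (double *)malloc(bytes), *d_res;

      struct timeval t0, t1;
      gettimeofday(&t0, NULL);              /* overall: after input */

      cudaEvent_t start, stop;
      cudaEventCreate(&start); cudaEventCreate(&stop);
      cudaMalloc(&d_res, bytes);

      cudaEventRecord(start, 0);            /* before first DMA/memcopy */
      pi_kernel<<<threads / tpb, tpb>>>(n, step, d_res);
      cudaMemcpy(h_res, d_res, bytes, cudaMemcpyDeviceToHost);
      cudaEventRecord(stop, 0);             /* after last DMA/memcopy */
      cudaEventSynchronize(stop);

      double pi = 0.0;
      for (int i = 0; i < threads; i++)     /* V1 host-side reduction */
          pi += h_res[i];
      pi *= step;

      gettimeofday(&t1, NULL);              /* overall: before output */
      float kernel_ms;
      cudaEventElapsedTime(&kernel_ms, start, stop);
      double total_s = (t1.tv_sec - t0.tv_sec)
                     + (t1.tv_usec - t0.tv_usec) / 1e6;
      printf("pi=%.15f total=%fs kernel=%fms\n", pi, total_s, kernel_ms);

      cudaFree(d_res); free(h_res);
      return 0;
  }
  ```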
 
-  V2
-  Introduce a command line flag "-b" that, when set, performs a
  per-block reduction on the GPU. The results of each block are then
  aggregated on the host (x86) side. Consult the slides on how to
  perform CUDA reductions for this purpose.
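
  One possible shape for the V2 kernel, following the standard
  shared-memory tree reduction from the slides (names are assumptions;
  it assumes blockDim.x is a power of two and must be launched with
  blockDim.x * sizeof(double) bytes of dynamic shared memory):

  ```cuda
  /* V2 sketch ("-b"): per-block reduction in shared memory.
   * blockSums has one slot per block; the host sums those. */
  __global__ void pi_kernel_block(int n, double step, double *blockSums)
  {
      extern __shared__ double sdata[];
      int tid = blockIdx.x * blockDim.x + threadIdx.x;
      int nthreads = gridDim.x * blockDim.x;
      double sum = 0.0;
      for (int i = tid; i < n; i += nthreads) {
          double x = (i + 0.5) * step;
          sum += 4.0 / (1.0 + x * x);
      }
      sdata[threadIdx.x] = sum;
      __syncthreads();
      for (int s = blockDim.x / 2; s > 0; s >>= 1) {  /* tree reduction */
          if (threadIdx.x < s)
              sdata[threadIdx.x] += sdata[threadIdx.x + s];
          __syncthreads();
      }
      if (threadIdx.x == 0)
          blockSums[blockIdx.x] = sdata[0];  /* one sum per block */
  }
  ```

  With "-b" set, the host then copies back only gridDim.x values
  instead of one value per thread, and sums those.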
 
-  V3
-  Introduce a command line flag "-g" that, when set, performs a
  cross-block reduction on the GPU, assuming that the per-block
  reductions have already been performed. No aggregation should be
  performed on the host (x86) side.
 Hint: Use a second CUDA kernel call that copies the global result of
  each block into a static array and performs the reduction within a
  single block on this array. Where do you store the result?
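
  The second kernel from the hint might look like the sketch below.
  It is one possible answer, not the required one: here the final
  value is left in device memory and copied back with a single
  cudaMemcpy. It assumes the block count fits in one block's worth of
  threads (nblocks <= blockDim.x) and that blockDim.x is a power of
  two.

  ```cuda
  /* V3 sketch ("-g"): launched with ONE block; reduces the per-block
   * sums entirely on the GPU, so the host does no aggregation. */
  __global__ void reduce_blocks(double *blockSums, int nblocks,
                                double *result)
  {
      extern __shared__ double sdata[];
      sdata[threadIdx.x] =
          (threadIdx.x < nblocks) ? blockSums[threadIdx.x] : 0.0;
      __syncthreads();
      for (int s = blockDim.x / 2; s > 0; s >>= 1) {
          if (threadIdx.x < s)
              sdata[threadIdx.x] += sdata[threadIdx.x + s];
          __syncthreads();
      }
      if (threadIdx.x == 0)
          *result = sdata[0];   /* final sum stays in device memory */
  }
  ```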
 
-  V4
-  Create an MPI version of your program that parallelizes the pi
  calculation over p processors, each with one GPU. Each host/GPU
  pair is responsible for at least one interval of pi. Within its
  intervals, the GPU performs the pi calculation. Reductions may
  occur with or without the flags from V2/V3.
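
  A minimal V4 outline, assuming a block decomposition of the n
  intervals across ranks and a hypothetical helper gpu_partial_pi()
  that wraps the V1-V3 kernels and returns a rank's unscaled partial
  sum (both the helper and the decomposition are illustrative
  choices, not requirements):

  ```cuda
  /* V4 outline: MPI across p ranks, each rank drives one GPU over its
   * own slice [lo, hi) of the n intervals; MPI_Reduce combines the
   * per-rank partial sums on rank 0. Compile with mpicc/nvcc. */
  #include <mpi.h>
  #include <stdio.h>

  /* hypothetical wrapper around the CUDA kernel(s) from V1-V3 */
  double gpu_partial_pi(int lo, int hi, double step);

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      int rank, p, n = 10000000;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &p);

      double step = 1.0 / n;
      int chunk = n / p;
      int lo = rank * chunk;
      int hi = (rank == p - 1) ? n : lo + chunk; /* last rank: remainder */

      double local = gpu_partial_pi(lo, hi, step); /* this rank's GPU */
      double pi = 0.0;
      MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0,
                 MPI_COMM_WORLD);
      if (rank == 0)
          printf("pi = %.15f\n", pi * step); /* scale the summed values */
      MPI_Finalize();
      return 0;
  }
  ```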
  
 
-  Report the overall time and the CUDA kernel time for each version
  as part of your README file for 10,000,000 intervals, different
  numbers of GPU threads (16, 1024, 32768), and different numbers of
  threads per block (16, 32, 512).
 
Turn in README, pi.cu, pi_kernel.cu, common.mk.
 
 
What to turn in for programming assignments:
- commented program(s) as source code; comments count for 15% of the
  points (see class policy for guidelines on comments)
- Makefiles (if required)
- test programs as source (and input files, if required)
- README (documentation to outline the solution and list commands to
  install/execute)
- in each file, include the following information as a comment at the
  top of the file, where "username" is your unity login name and the
  single author is the person who wrote this file:

  Single Author info:
  username FirstName MiddleInitial LastName