Homework 2
Deadline: see web page
Assignments: All parts are to be solved individually (turned in
electronically; written parts in ASCII text, no Word, Postscript, etc.
permitted unless explicitly stated). Please use the OS lab machines or
the henry2 cluster (Linux), depending on the assignment. All programs
have to be written in C, compiled with mpicc/gcc, and turned in with a
corresponding Makefile.
- (0 points) Learn how to compile and execute a CUDA program.
- (100 points, 20*4+20)
Modify the pi program from HW1 (see lecture slides) to parallelize it
for CUDA. Follow the steps below.
- V1
- Create a subdirectory projects/pi.
- Create the kernel (GPU) file pi_kernel.cu. In Version 1 (V1),
this should contain just the pi calculation using double precision
(excluding the reduction step). Consider where to store the results.
- Create the host (x86) file pi.cu. For V1, this includes all other
code (initialization, memory transfers with CUDA, CUDA kernel call)
as well as a reduction step of the individual results from each CUDA
thread. You should time the entire calculation (from after input to
before output) using gettimeofday(). You should also time the CUDA
kernel using the cudaEventXXX() API (from before the first CUDA
DMA/memcopy call to after the last CUDA DMA/memcopy call).
- Create this Makefile in your directory.
- Compile your program: type "make" in the top directory.
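The V1 split between pi_kernel.cu and pi.cu could look roughly as
follows. This is a minimal sketch only, assuming the midpoint-rule pi
program from the lecture slides; names such as d_part, blocks, and
threads are illustrative choices, not prescribed by the assignment,
and the grid shape is hard-coded here instead of being read as input.

```cuda
/* pi_kernel.cu -- V1 kernel: per-thread partial sums, NO reduction.
   Each thread writes its partial sum to one global-memory slot.     */
__global__ void pi_kernel(double *d_part, long n, double step)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int nthreads = gridDim.x * blockDim.x;
    double sum = 0.0;
    for (long i = tid; i < n; i += nthreads) {  /* cyclic distribution  */
        double x = (i + 0.5) * step;            /* midpoint of interval */
        sum += 4.0 / (1.0 + x * x);
    }
    d_part[tid] = sum;       /* one result slot per thread (V1 answer
                                to "where to store the results")      */
}

/* pi.cu -- V1 host side: launch, copy back, reduce on the CPU, and
   time both the whole calculation and the CUDA portion.             */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
/* in the common.mk setup, the kernel file is typically #included:   */
/* #include "pi_kernel.cu"                                           */

int main(int argc, char **argv)
{
    long n = 10000000;                 /* number of intervals          */
    int  threads = 512, blocks = 64;   /* grid shape: vary per run     */
    int  nthreads = threads * blocks;
    double step = 1.0 / (double)n;

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);           /* overall timer: after input   */

    double *d_part;
    double *h_part = (double *)malloc(nthreads * sizeof(double));
    cudaMalloc((void **)&d_part, nthreads * sizeof(double));

    cudaEvent_t e0, e1;                /* CUDA timer: first to last DMA */
    cudaEventCreate(&e0); cudaEventCreate(&e1);
    cudaEventRecord(e0, 0);            /* before first memcopy/kernel  */

    pi_kernel<<<blocks, threads>>>(d_part, n, step);
    cudaMemcpy(h_part, d_part, nthreads * sizeof(double),
               cudaMemcpyDeviceToHost);  /* last DMA call              */

    cudaEventRecord(e1, 0);
    cudaEventSynchronize(e1);
    float kernel_ms;
    cudaEventElapsedTime(&kernel_ms, e0, e1);

    double pi = 0.0;                   /* V1: reduction on the host    */
    for (int i = 0; i < nthreads; i++)
        pi += h_part[i];
    pi *= step;

    gettimeofday(&t1, NULL);           /* overall timer: before output */
    double total_s = (t1.tv_sec - t0.tv_sec) +
                     (t1.tv_usec - t0.tv_usec) / 1e6;

    printf("pi = %.15f  total = %f s  kernel = %f ms\n",
           pi, total_s, kernel_ms);
    cudaFree(d_part); free(h_part);
    return 0;
}
```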
- V2
- Introduce a command line flag "-b" that, when set, performs a
per-block reduction on the GPU. The results of each block are then
aggregated on the host (x86) side. Consult the slides on how to
perform CUDA reductions for this purpose.
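With "-b" set, the kernel can end in a shared-memory tree reduction
per block, following the standard CUDA reduction pattern from the
slides. A sketch, assuming blockDim.x is a power of two (the name
d_block is illustrative):

```cuda
/* V2 kernel: per-thread sums reduced to one value per block. */
__global__ void pi_kernel_b(double *d_block, long n, double step)
{
    extern __shared__ double s_part[];  /* one slot per thread in block */
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int nthreads = gridDim.x * blockDim.x;

    double sum = 0.0;
    for (long i = tid; i < n; i += nthreads) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    s_part[threadIdx.x] = sum;
    __syncthreads();

    /* tree reduction in shared memory; needs blockDim.x = power of 2 */
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            s_part[threadIdx.x] += s_part[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        d_block[blockIdx.x] = s_part[0];  /* one result per block */
}
```

Launched as pi_kernel_b<<<blocks, threads, threads * sizeof(double)>>>
(d_block, n, step); the host then copies back only "blocks" doubles
and sums those instead of one value per thread.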
- V3
- Introduce a command line flag "-g" that, when set, performs a
cross-block reduction on the GPU, assuming that the per-block
reductions have already been performed. No aggregation should be
performed on the host (x86) side.
Hint: Use a second CUDA kernel call that copies the global result of
each block into a static array and performs the reduction within a
single block on this array. Where do you store the result?
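The hint above can be sketched as a second kernel launched with a
single block. This assumes the "-b" kernel already left one partial
sum per block in global memory, that the number of blocks fits in one
block's worth of threads (here at most 512), and that blockDim.x is a
power of two; one answer to "where do you store the result" is slot 0
of the same global array, from which the host copies back a single
double:

```cuda
/* V3: second kernel call; one block reduces all per-block results.  */
__global__ void reduce_blocks(double *d_block, int nblocks)
{
    __shared__ double s_part[512];   /* static array, >= blockDim.x  */
    int t = threadIdx.x;
    /* copy each block's global result into the static array         */
    s_part[t] = (t < nblocks) ? d_block[t] : 0.0;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (t < s)
            s_part[t] += s_part[t + s];
        __syncthreads();
    }
    if (t == 0)
        d_block[0] = s_part[0];      /* final sum stays on the GPU   */
}
```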
- V4
- Create an MPI version of your program that parallelizes the
pi calculation across p processors, each with one GPU. Each host/GPU
pair is responsible for at least one interval of pi. Within its
intervals, the GPU performs the pi calculation. Reductions may occur
with/without the flags from V2/V3.
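The MPI layer on top of the GPU versions can be sketched as below.
This is an outline only: gpu_pi() is a hypothetical wrapper standing
in for whichever V1/V2/V3 launch path your program uses (the kernel
would take the rank's sub-range [lo, hi) instead of [0, n)), and the
even interval split is one simple assignment of intervals to ranks.

```cuda
/* V4: MPI across p host/GPU pairs; each rank integrates its own
   sub-range of intervals on its GPU, then MPI_Reduce combines them. */
#include <mpi.h>
#include <stdio.h>

/* hypothetical wrapper around the single-GPU kernel launch + copy   */
double gpu_pi(long lo, long hi, double step);

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    long n = 10000000;
    long per = n / p;                  /* intervals per rank          */
    long lo  = rank * per;
    long hi  = (rank == p - 1) ? n : lo + per;  /* last rank: slack   */
    double step = 1.0 / (double)n;

    /* each host/GPU pair handles its own intervals of pi            */
    double local_sum = gpu_pi(lo, hi, step);

    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE,
               MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("pi = %.15f\n", global_sum * step);
    MPI_Finalize();
    return 0;
}
```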
- Report the overall time and the CUDA kernel time for each version
as part of your README file for 10,000,000 intervals,
different numbers of GPU threads (16, 1024, 32768), and different
numbers of threads per block (16, 32, 512).
Turn in README, pi.cu, pi_kernel.cu, common.mk.
What to turn in for programming assignments:
- commented program(s) as source code; comments count for 15% of the
points (see class policy for guidelines on comments)
- Makefiles (if required)
- test programs as source (and input files, if required)
- README (documentation to outline the solution and list commands to
install/execute)
- in each file, include the following information as a comment at the
top of the file, where "username" is your unity login name and the
single author is the person who wrote this file:
Single Author info:
username FirstName MiddleInitial LastName