Homework 2

Deadline: September 19, 2012
Assignments: All parts are to be solved individually (turned in electronically, written parts in ASCII text, NO Word, postscript, etc. permitted unless explicitly stated).

Please use the ARC cluster for this assignment. All programs have to be written in C, compiled with mpicc/gcc and turned in with a corresponding Makefile.

(50 points) Custom message passing using sockets. In this assignment, you will:
- Use shell scripting to spawn several remote processes via ssh. You may use bash, csh or perl for spawning processes remotely on nodes.
- Create a custom "message passing" library using C socket programming.
- Test your library using HW1 #2 to measure the RTT.
If you are having trouble with scripting and qsub, take a look at this simple implementation of using qsub, bash scripting, and ssh to launch programs on several nodes: simple_mpi.tar
Hints:
- The environment variable $PBS_NODEFILE points to a file that contains the node names on which you should spawn via ssh. Should that file not exist, then just spawn on localhost.
- When spawning to other nodes, add additional parameters to the call of the binary (myrtt in this case). They should indicate who the root node is (machine name, port number). Children can then connect to the root.
- Also include the number of MPI tasks N (from mympirun -np N) as an argument. This way, the root can determine when all children have reported back. It can then let the children know about their hostnames/port numbers for communication. You may also use files for this purpose (much easier).
- You need to strip out these additional, hidden parameters as part of MPI_Init().
- Job Control in Bash
- C Socket programming in linux
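The hint about stripping hidden parameters could be sketched as follows. This is only an illustration: it assumes that mympirun appends exactly three trailing arguments (root host, root port, task count) to the command line. The names mympi_launch_info and mympi_strip_args and this argument layout are hypothetical and not prescribed by the assignment.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical record of the hidden launcher arguments. */
typedef struct {
    char root_host[256];
    int  root_port;
    int  ntasks;
} mympi_launch_info;

/*
 * Assumes the launcher appended "<root_host> <root_port> <ntasks>" to the
 * command line. Copies these values into *info, removes them from argv,
 * and returns 0. Returns -1 if fewer than three extra arguments exist.
 */
int mympi_strip_args(int *argc, char ***argv, mympi_launch_info *info)
{
    if (*argc < 4)              /* program name + 3 hidden arguments */
        return -1;

    char **av = *argv;
    int n = *argc;

    strncpy(info->root_host, av[n - 3], sizeof info->root_host - 1);
    info->root_host[sizeof info->root_host - 1] = '\0';
    info->root_port = atoi(av[n - 2]);
    info->ntasks    = atoi(av[n - 1]);

    av[n - 3] = NULL;           /* keep argv NULL-terminated */
    *argc = n - 3;              /* the application no longer sees them */
    return 0;
}
```

A call such as mympi_strip_args(argc, argv, &info) from inside your MPI_Init() would then leave the application with only its own arguments.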
Discussion information:
- OpenMPI TCP FAQ (lots of gritty details on how OpenMPI does network communication)
- MPI Performance Topics (general performance characteristics of MPI)
- "Comparing Ethernet and Myrinet for MPI Communication" (an analysis of MPI over different network architectures)
Run the same setup as HW1 #2. Compare your results.
```
mympirun -np 4 myrtt
```
Turn in mympirun (a script), mympi.c/mympi.h (module containing the subset of MPI functionality required) and myrtt.c (same as in HW1 but referencing mympi.h).
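One detail worth keeping in mind for mympi.c: on a stream socket, send() and recv() may transfer fewer bytes than requested, so a message-passing layer usually loops until the whole buffer has moved. The following is a minimal sketch of such helpers. The names mympi_send_all and mympi_recv_all are hypothetical, and the sketch assumes an already connected, blocking socket.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Send exactly len bytes on a connected socket. Returns 0 on success, -1 on error. */
int mympi_send_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = send(fd, p, len, 0);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            return -1;
        }
        p   += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Receive exactly len bytes. Returns 0 on success, -1 on error or early close. */
int mympi_recv_all(int fd, void *buf, size_t len)
{
    char *p = buf;
    while (len > 0) {
        ssize_t n = recv(fd, p, len, 0);
        if (n == 0)
            return -1;          /* peer closed the connection */
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p   += n;
        len -= (size_t)n;
    }
    return 0;
}
```

Your MPI_Send/MPI_Recv subset could be layered on top of helpers like these, for example by sending a small fixed-size header (source, tag, byte count) before the payload.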
(0 points) Learn how to compile and execute a CUDA program.
- See "Running CUDA Programs" on ARC
(50 points) Group problem (3 per group)
We will extend the methods of the last HW into two dimensions.
Download, extract, compile the code lake.tar
This program models the surface of a lake, where some pebbles have been thrown onto the surface. The program works as follows. In the spatial domain, a centered finite difference is used to tell each zone how to update itself using information from its neighbors.

The time domain is handled similarly, but using information from the previous two time steps.
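For orientation only, here is a sketch of the textbook explicit scheme for the 2D wave equation with a source term, written in plain C. It is an assumption that this resembles the update in lake.tar; the actual stencil, coefficients, and boundary handling in the provided code may differ, and the function name wave_step and its parameters are hypothetical.

```c
#include <stddef.h>

/*
 * One explicit time step of the textbook 2D wave equation scheme
 * (an assumption; the code in lake.tar may differ):
 *
 *   un = 2*uc - uo + (c*dt/h)^2 * (5-point Laplacian of uc) + dt^2 * src
 *
 * un, uc, uo are the next, current, and old grids. All arrays are n x n,
 * stored row-major. Boundary zones are simply held at zero here.
 */
void wave_step(double *un, const double *uc, const double *uo,
               const double *src, int n, double c, double h, double dt)
{
    double r2 = (c * dt / h) * (c * dt / h);

    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            size_t k = (size_t)i * n + j;

            if (i == 0 || i == n - 1 || j == 0 || j == n - 1) {
                un[k] = 0.0;
                continue;
            }

            /* spatial part: west + east + north + south - 4 * center */
            double lap = uc[k - 1] + uc[k + 1] + uc[k - n] + uc[k + n]
                       - 4.0 * uc[k];

            /* temporal part: uses the previous two time levels */
            un[k] = 2.0 * uc[k] - uo[k] + r2 * lap + dt * dt * src[k];
        }
    }
}
```

Note how each zone reads only its four nearest neighbors and its own two previous values, which is what makes the update easy to map to one GPU thread per zone.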

The program runs two versions of the algorithm, a CPU version, and a skeleton GPU version. Your task is to fill in the GPU algorithm to solve the same problem.
Instructions:
V0:
- Run the lake program
```
./lake {npoints} {npebbles} {end_time} {nthreads}
```
  npoints defines the grid size (npoints x npoints), npebbles is the number of pebbles that are generated in the program, end_time is the final time of the simulation, and nthreads will be used with the GPU implementation.
  The following runs on a grid of (128 x 128), with 5 pebbles, for 1.0 seconds, using 8 GPU threads (implemented later):
```
./lake 128 5 1.0 8
Running ./lake with (128 x 128) grid, until 1.000000, with 8 threads
CPU took 0.294668 seconds
GPU computation: 0.001568 msec
GPU end-to-end: 0.000000 sec
```
- View the output in a heatmap with gnuplot:
  You will download the output files
```
lake_i.dat
lake_f.dat
```
  along with the gnuplot script heatmap.gnu to a machine that has gnuplot installed. Then, run
```
gnuplot heatmap.gnu
```
  This will create the files lake_i.png (the initial configuration) and lake_f.png (the final configuration) in the directory.
V1:
- Fill in the function run_gpu in the file lakegpu.cu to run the same algorithm as the CPU version, but using CUDA kernels. The grid will be decomposed on the GPU into 2D blocks.
  The program takes as an argument nthreads. This will be the number of threads per block used on the GPU. So, for instance, with nthreads=8, and a domain of grid points (npoints=128 x 128), you will create (npoints/nthreads)x(npoints/nthreads) = (16 x 16) blocks, with (8 x 8) threads on each block.
- You will time your CUDA implementation using the cudaEventXXX() API. Be sure to start timing before the first memcpy to the GPU and stop after the last memcpy off of the GPU.
- Compare the CPU/GPU runs for varying grid sizes (16, 32, 64, 128, ..., 1024, etc.)
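The block arithmetic described above can be checked on the host with a few lines of C. This is a sketch with hypothetical helper names. Note that it rounds the block count up, which goes slightly beyond the npoints/nthreads formula in the text; for grid sizes that are multiples of nthreads (such as 128 and 8) both give the same result.

```c
/*
 * Number of blocks along one grid dimension. Rounds up so that a grid
 * whose size is not a multiple of nthreads is still fully covered
 * (the kernel would then need a bounds check).
 */
int blocks_per_dim(int npoints, int nthreads)
{
    return (npoints + nthreads - 1) / nthreads;
}

/*
 * Global grid coordinate handled by a given thread of a given block along
 * one dimension. Mirrors the usual CUDA expression
 * blockIdx.x * blockDim.x + threadIdx.x.
 */
int global_index(int block, int nthreads, int thread)
{
    return block * nthreads + thread;
}
```

With npoints=128 and nthreads=8, blocks_per_dim returns 16, and the last thread of the last block maps to grid coordinate 127.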
V2:
- Create an MPI version of your program that further decomposes your grid based on processor rank.
- Use 4 CUDA-enabled nodes in your implementation. Each node should communicate boundary information to the appropriate neighbor, then run the CUDA kernel during a time-step (one iteration of evolve).
- Have each node output its own data file, labeled as
```
lake_f_0.dat //node 0
lake_f_1.dat //node 1
//etc.
```
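A minimal sketch of building the per-node file name from the rank, assuming the rank comes from MPI_Comm_rank(). The helper name lake_output_name is hypothetical.

```c
#include <stdio.h>

/*
 * Writes "lake_f_<rank>.dat" into buf.
 * Returns 0 on success, -1 if buf is too small to hold the name.
 */
int lake_output_name(char *buf, size_t buflen, int rank)
{
    int n = snprintf(buf, buflen, "lake_f_%d.dat", rank);
    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```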
Include in README a discussion of your results. Your discussion should include answering the following questions:
- How well does your algorithm scale on the GPU? Do you find cases (grid size, thread number, etc.) where the GPU implementation does not scale well? Why?
- In the serial code, compare your CPU and GPU runtimes for different grid sizes. When is the GPU better, and when is it worse?
- Integrating CUDA and MPI involves more sophisticated code. What problems did you encounter? How did you fix them?
Hints:
- If you are interested in the particular math behind this algorithm, here is a good introduction. In particular, we are solving the 2D wave equation with sources using finite differencing.
- The code stores the initial configurations of u^0 and u^1 in the variables
```
double *u_i0; //u^0
double *u_i1; //u^1
```
  These are passed to both the run_cpu and run_gpu routines; both routines should produce the same results.
- Create the proper device memory spaces; uc, un, uo should all exist on the device and be copied on the device at each iteration. In addition, all data in u_i0, u_i1, as well as pebbles should exist on the device, along with anything the algorithm needs for each iteration.
- In the MPI implementation, consider ways to speed up each iteration. Do you have to wait on the GPU kernel to finish executing before you exchange boundary information? Consider how you will update the boundary grid zones - do you do it on the CPU after the kernel updates the inner zones or copy over that information to the GPU before the kernel runs? You are encouraged to experiment with different methods to find the most efficient.
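To decide which neighbor receives which boundary, each rank first needs to know who its neighbors are. The sketch below assumes a 2 x 2 arrangement of the four ranks. This layout is only one possibility (four horizontal strips would also work), and the names PX, PY, direction, and neighbor_rank are hypothetical.

```c
/*
 * Hypothetical 2 x 2 layout of the four ranks:
 *
 *   rank 0 | rank 1
 *   -------+-------
 *   rank 2 | rank 3
 */
enum { PX = 2, PY = 2 };

typedef enum { NORTH, SOUTH, WEST, EAST } direction;

/*
 * Returns the rank of the neighbor in direction d, or -1 if the rank
 * sits on the edge of the global grid in that direction.
 */
int neighbor_rank(int rank, direction d)
{
    int row = rank / PX;
    int col = rank % PX;

    switch (d) {
    case NORTH: row--; break;
    case SOUTH: row++; break;
    case WEST:  col--; break;
    case EAST:  col++; break;
    }

    if (row < 0 || row >= PY || col < 0 || col >= PX)
        return -1;              /* no neighbor: physical boundary */
    return row * PX + col;
}
```

A rank would then exchange its edge rows or columns only with neighbors for which neighbor_rank does not return -1.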
Turn in README, lake.cu, lakegpu.cu, Makefile
Peer evaluation: Each group member has to submit a peer evaluation form.

What to turn in for programming assignments:

commented program(s) as source code, comments count 15% of the points (see class policy on guidelines on comments)
Makefiles (if required)
test programs as source (and input files, if required)
README (documentation to outline solution and list commands to install/execute)
in each file, include the following information as a comment at the top of the file, where "username" is your unity login name and the single author is the person who wrote this file:

Single Author info:

username FirstName MiddleInitial LastName

Group info:

username FirstName MiddleInitial LastName

username FirstName MiddleInitial LastName

username FirstName MiddleInitial LastName