Homework 3

Deadline: see web page
Assignments: All parts are to be solved individually (turned in electronically; written parts in ASCII text, no Word, PostScript, etc. permitted unless explicitly stated).

All programs have to be written in C, compiled with mpicc/gcc, and turned in with a corresponding Makefile.

  1. (50 points) Analyze MPI call statistics and devise a method, through PMPI instrumentation, to generate a communication matrix.

    Turn in pmpi.c, matrix.data, README.pmpi
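A minimal sketch of the PMPI interposition idea, counting only MPI_Send for illustration; the MAX_RANKS bound and the output format are assumptions, and a real pmpi.c would wrap the other communication routines and write the gathered matrix to matrix.data:

```c
/* pmpi.c (sketch): intercept MPI calls via the PMPI profiling interface.
 * Only MPI_Send is counted here; a full solution would wrap the other
 * point-to-point and collective calls as well. */
#include <mpi.h>
#include <stdio.h>

#define MAX_RANKS 64                 /* assumed upper bound, for the sketch */

static long sends[MAX_RANKS];        /* this rank's row of the comm matrix */
static int  my_rank = -1;

int MPI_Init(int *argc, char ***argv)
{
    int ret = PMPI_Init(argc, argv); /* forward to the real MPI_Init */
    PMPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    return ret;
}

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    if (dest >= 0 && dest < MAX_RANKS)
        sends[dest]++;               /* record one sender -> receiver edge */
    return PMPI_Send(buf, count, type, dest, tag, comm);
}

int MPI_Finalize(void)
{
    /* dump this rank's row; rank 0 could instead collect all rows with
       PMPI_Gather and write the full matrix to matrix.data */
    printf("rank %d:", my_rank);
    for (int d = 0; d < MAX_RANKS; d++)
        printf(" %ld", sends[d]);
    printf("\n");
    return PMPI_Finalize();
}
```

Because the wrappers have the same names as the real MPI functions, the linker resolves the application's calls to them, and each wrapper forwards to the underlying implementation through its PMPI_ twin.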

  2. (35 points) Modify the LAKE program from HW2 to parallelize for OpenMP.

    An updated version of the serial lake program can be found here. Please use it for this assignment.

    Set up for the PGI Compiler

    The Makefile is set up to use the PGI compiler; you must update your environment to use it. Follow the instructions at http://moss.csc.ncsu.edu/~mueller/cluster/arc/ under the section Using the PGI compilers V12.5.

    OpenMP

    Report the time for the serial code (1 thread) to run with a grid of 1024 and 4 pebbles, up to a time of 2.0 seconds. (NOTE: this will take a good amount of time on one processor, roughly 1 hour in the worst case. Be sure the wall-clock time you request is sufficient.)

     ./lake 1024 4 2.0 1 
    Report the same time for your OpenMP parallel code using the same parameters, with 16 threads:
     ./lake 1024 4 2.0 16 
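To give the flavor of what this involves, here is a minimal sketch of OpenMP applied to a five-point stencil update of the kind lake.c performs; the function name, array names, and update formula are illustrative, not the actual lake.c code:

```c
/* Sketch only: a lake-style wave update, parallelized across rows.
 * evolve, un/uc/uo, and the formula are illustrative, not lake.c's. */
void evolve(double *un, const double *uc, const double *uo,
            int n, double h, double dt)
{
    #pragma omp parallel for       /* each thread takes a band of rows */
    for (int i = 1; i < n - 1; i++) {
        for (int j = 1; j < n - 1; j++) {
            int idx = i * n + j;
            /* five-point Laplacian of the current field */
            double lap = (uc[idx - 1] + uc[idx + 1] +
                          uc[idx - n] + uc[idx + n] - 4.0 * uc[idx]) / (h * h);
            un[idx] = 2.0 * uc[idx] - uo[idx] + dt * dt * lap;
        }
    }
}
```

Under PGI the directive is activated by the -mp option already in the Makefile; the thread count from the command line would presumably be applied with omp_set_num_threads (or via OMP_NUM_THREADS).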

    When you turn in your README.openmp, consider the following questions:

    As always, your grader prefers answers that include sample timing data.

    Keep this code as is for the next problem; we will remove the compiler option for OpenMP and replace it with OpenACC, so you only need to carry around one set of code.

  3. (15 points) Modify the LAKE program from HW2 to parallelize for OpenACC.

    OpenACC

    We will use the code from the previous problem. First, the Makefile must be updated. In the Makefile there is a variable called ACC; this defines the type of acceleration to use.

    Change from using OpenMP:

    ACC=-mp
    
    To using OpenACC:
    ACC=-acc -ta=nvidia
    
    Finally, the submission script must be updated to submit to a CUDA queue. In lake.qsub, change
    #PBS -q default
    
    to
    #PBS -q cuda
    

    Update lake.c to include simple loop accelerators, i.e.

    #pragma acc kernels loop
    

    These can be put, for now, on the outer for loops (you will experiment later to determine the best setup for OpenACC).

    Keep the OpenMP directives in place; the option used to compile them (should have) been removed, so the compiler will simply ignore them. This way we keep the code portable.
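For instance, the same illustrative stencil sketched for the OpenMP part can carry both directives at once; whichever backend ACC does not enable simply ignores the foreign pragma (names and formula are again illustrative, not lake.c's):

```c
/* Sketch only: one loop nest carrying both OpenMP and OpenACC directives.
 * Only the directive matching the enabled compiler option takes effect. */
void evolve(double *un, const double *uc, const double *uo,
            int n, double h, double dt)
{
    #pragma omp parallel for    /* active under -mp, ignored under -acc */
    #pragma acc kernels loop    /* active under -acc, ignored under -mp */
    for (int i = 1; i < n - 1; i++) {
        for (int j = 1; j < n - 1; j++) {
            int idx = i * n + j;
            double lap = (uc[idx - 1] + uc[idx + 1] +
                          uc[idx - n] + uc[idx + n] - 4.0 * uc[idx]) / (h * h);
            un[idx] = 2.0 * uc[idx] - uo[idx] + dt * dt * lap;
        }
    }
}
```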

    Here, you will attempt to do the following

    If you haven't already, compile your program. Notice that PGI dumps a hefty amount of information to the screen.

    Much of this can be ignored. If you want to clean up this output, update the -Minfo compiler flag in your Makefile:

    CFLAGS=-I$(IDIR) $(ACC) -fast -Minfo=accel -Minline -Msafeptr
    
    This will only keep the OpenACC compiler information.

    If you haven't already, run your program. Below is a sample of a serial run and our naive OpenACC run:

    1. Serial lake.log:
      running /home/cmmauney/hw3/lake/lake with (256 x 256) grid, until 1.000000, with 1 threads
      ...
      Simulation took 1.661175 seconds
      
    2. OpenACC lake.log:
      running /home/cmmauney/hw3/lake/lake with (256 x 256) grid, until 1.000000, with 1 threads
      ...
      Simulation took 3.469350 seconds
      

    The naive implementation of OpenACC was slower than the serial code! You should notice something similar with your runs.

    At this point, you should have code ready to optimize. From here on out, points are awarded according to how well you optimize your code. Some things to consider (these are not questions you need to discuss, but guides for your optimizations):
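To illustrate the kind of change that usually dominates: in a naive version, every kernel launch copies the grids to the GPU and back, once per time step. A data region hoisted around the time loop keeps them resident on the device instead. A sketch under assumed names (run_sim and the arrays are illustrative, not lake.c's):

```c
/* Sketch only: hoist data movement out of the time loop.
 * Grids live on the GPU for the whole simulation; only host pointers
 * rotate between steps. */
void run_sim(double *uo, double *uc, double *un, int n,
             double h, double dt, int nsteps)
{
    #pragma acc data copy(uo[0:n*n], uc[0:n*n]) create(un[0:n*n])
    {
        for (int t = 0; t < nsteps; t++) {
            #pragma acc kernels loop present(uo, uc, un)
            for (int i = 1; i < n - 1; i++) {
                for (int j = 1; j < n - 1; j++) {
                    int idx = i * n + j;
                    double lap = (uc[idx - 1] + uc[idx + 1] +
                                  uc[idx - n] + uc[idx + n] - 4.0 * uc[idx]) / (h * h);
                    un[idx] = 2.0 * uc[idx] - uo[idx] + dt * dt * lap;
                }
            }
            /* rotate host pointers; present() matches device buffers by
               host address, so no transfers happen between steps. Note a
               real version must take care that the final field is copied
               back (e.g. with an update directive), since after an odd
               number of swaps the result sits in the create'd buffer. */
            double *tmp = uo; uo = uc; uc = un; un = tmp;
        }
    }
}
```

With this structure, host-device transfers occur only at entry and exit of the data region rather than once per time step, which is typically where most of the naive version's time goes.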

    As an example, here are results after a few optimizations are carried out for both OpenMP and OpenACC:

    1. OpenMP lake.log:
      running /home/cmmauney/hw3/lake/lake with (1024 x 1024) grid, until 1.000000, with 16 threads
      ...
      Simulation took 15.543285 seconds
      
    2. OpenACC lake.log:
      running /home/cmmauney/hw3/lake/lake with (1024 x 1024) grid, until 1.000000, with 1 threads
      ...
      Simulation took 1.589567 seconds
      

    Your code is expected, at the very least (that is, for any points on this question), to be faster than the serial version. As a benchmark, use the parameters

    ./lake 512 4 4.0 1
    

    Your OpenACC code should be at least 20x faster than the serial code in this benchmark to receive full credit. Extra points (up to 2) will be given for code that is 40x faster (+1) or 50x or more faster (+2).

    Note: The code must be consistently faster; once is not enough. Use the compiler options provided.
    As a reference, the serial code had an average runtime of 56 s for me.

    Report on your results in your README.openacc file. Please discuss your optimizations in detail: the effect of problem size (smaller vs. larger grids, shorter vs. longer simulation times), where your biggest optimization came from (e.g., thread scheduling? memory management? screaming at the screen louder?), possible improvements to the code, etc.

    Hints

    Turn in lake.c (with both OpenMP and OpenACC directives), Makefile, README.openacc, README.openmp
    (Optional) If you've implemented an optimized OpenMP version, please include lake_opt.c, README.opt.openmp

What to turn in for programming assignments: