All programs must be written in C, compiled with mpicc/gcc, and turned in with a corresponding Makefile.
OBJS = mg.o ${COMMON}/print_results.o \
       ${COMMON}/${RAND}.o ${COMMON}/timers.o pmpi.o

pmpi.o: pmpi.c
	${CCOMPILE} pmpi.c
Hints:
[rank i] [rank j] [comm count] [comm time]
into a file called matrix.data
Explain your graph, and put your discussion in the README file. A (non-exhaustive) list of questions to consider is:
Hints
MG - Multi-Grid on a sequence of meshes, long- and short-distance communication, memory intensive
Multi-grid works by using a coarse grid (that is, fewer points) to give an initial approximation, and then progressively refining the grid to update that approximation.
Analogous to HW1, each iteration of the algorithm uses a numerical method to solve a differential equation. Thus, each node must do boundary communications.
Because MG changes the grid resolution at each iteration of the algorithm, these boundaries change. Because the algorithm goes coarser to finer, we get more boundaries (and thus more communications) at each iteration.
Keep these facts in mind when discussing this assignment.
Turn in pmpi.c, matrix.data, README.pmpi
An updated version of the serial lake program can be found here. Please use it for this assignment.
Set up for the PGI Compiler
The Makefile is set up to use the PGI compiler; you must update your environment to use it. Follow the instructions at http://moss.csc.ncsu.edu/~mueller/cluster/arc/ under the section Using the PGI compilers V12.5
OpenMP
run_sim(...)
to run using OpenMP directives. For this code, use the nthreads parameter to define the number of threads to use.
#pragma omp parallel for ... num_threads(nthreads)
#pragma omp parallel for ... schedule(dynamic)
#pragma omp parallel for ... schedule(static)
Time and record each method; use your times when discussing your results in the README.openmp file (see below)
/* update the calculation arrays for the next time step */
for( i = 0; i < n; i++ )
{
  for( j = 0; j < n; j++ )
  {
    uo[i][j] = uc[i][j];
    uc[i][j] = un[i][j];
  }
}
Report the time for the serial code (1 thread) to run with a grid of 1024 and 4 pebbles, up to a time of 2.0 seconds. (NOTE: this will take a good amount of time on one processor, roughly 1 hr in the worst case. Be sure your requested time is enough)
./lake 1024 4 2.0 1
Report this same time for your OpenMP parallel code using the same parameters, with 16 threads
./lake 1024 4 2.0 16
When you turn in your README.openmp, consider the following questions:
As always, including sample timing data in your answers is preferred by your grader.
Keep this code as-is for the next problem; we will remove the compiler option for OpenMP and replace it with OpenACC, so we only need to carry around one set of code.
OpenACC
We will use the code from the previous problem. First, the Makefile must be updated. In the Makefile there is a variable called ACC; this defines the type of acceleration to use.
Change from using OpenMP:
ACC=-mp
to using OpenACC:
ACC=-acc -ta=nvidia
Finally, the submission script must be updated to submit to a CUDA queue. In lake.qsub, change
#PBS -q default
to
#PBS -q cuda
Update lake.c to include simple loop accelerators, e.g.
#pragma acc kernels loop
These can be put, for now, on the outer for loops (you will experiment later to determine the best setup for OpenACC).
Keep the OpenMP directives in place; the options used to compile them (should have) been removed, so the compiler will simply ignore them. This keeps the code portable.
Here, you will attempt to do the following:
If you haven't by now, you should compile your program. Notice that PGI will dump a hefty amount of information onto the screen.
Much of this can be ignored. If you want to clean up this output, update the Minfo compiler flag in your makefile:
CFLAGS=-I$(IDIR) $(ACC) -fast -Minfo=accel -Minline -Msafeptr
This will keep only the OpenACC compiler information.
If you haven't by now, run your program. Below is a sample of a serial run, and our naive OpenACC run:
running /home/cmmauney/hw3/lake/lake with (256 x 256) grid, until 1.000000, with 1 threads ... Simulation took 1.661175 seconds
running /home/cmmauney/hw3/lake/lake with (256 x 256) grid, until 1.000000, with 1 threads ... Simulation took 3.469350 seconds
The naive implementation of OpenACC was slower than the serial code! You should notice something similar with your runs.
At this point, you should have code ready to optimize. From here on out, points are given to the degree you are able to optimize your code. Some things to consider (these are not questions you need to discuss, but to guide your optimizations):
As an example, here are results after a few optimizations are carried out for both OpenMP and OpenACC:
running /home/cmmauney/hw3/lake/lake with (1024 x 1024) grid, until 1.000000, with 16 threads ... Simulation took 15.543285 seconds
running /home/cmmauney/hw3/lake/lake with (1024 x 1024) grid, until 1.000000, with 1 threads ... Simulation took 1.589567 seconds
Your code is expected, at the very least (that is, for any points on this question), to be faster than the serial version. As a benchmark, use the parameters
./lake 512 4 4.0 1
Your OpenACC code should be at least 20x faster than the serial code in this benchmark to receive full credit. Extra points (up to 2) will be given for code that is 40x faster (+1) or 50x+ faster (+2).
Note: The code must be consistently faster. Once is not enough. Use the compiler options provided.
As a reference, the serial code had an average runtime of 56 s for me.
Report on your results in your README.openacc file. Please discuss your optimizations in detail: the effect of the problem size (smaller vs. larger grids, shorter vs. longer simulation times), where your biggest optimization came from (e.g. thread scheduling? memory management? screaming at the screen louder?), possible improvements in the code, etc.
Hints
start_lake_log(char* logfile);
lake_log(char *msg, ...);
stop_lake_log();
These routines output a series of logging messages to the file logfile. Running times and debugging information are placed in this file.
./lake 128 5 1.0 8
The output will go into ./. If you run with
/home/mydir/otherdir/lake 128 5 1.0 8
The output will go to /home/mydir/otherdir/.
The provided QSUB script uses a hard-coded directory path, so that the output files all go where the program resides.
Be aware of this if you insist on using custom qsub scripts and/or interactive qsub.
If you're not aware, using qsub creates the shell variable
$PBS_O_WORKDIR
which is assigned the directory the parallel application is called from.
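For example, a custom qsub script can change into the submission directory before running, so output files land next to the program (the walltime and arguments here are illustrative):

```shell
#!/bin/bash
#PBS -q cuda
#PBS -l walltime=00:30:00
# $PBS_O_WORKDIR is the directory qsub was invoked from;
# cd there so the log and data files go where the program resides
cd $PBS_O_WORKDIR
./lake 512 4 4.0 1
```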
For example, any output like:
Accelerator region ignored
Loop carried dependence...
Complex loop carried dependence...
indicates a problem with your acceleration region. Your compiler should explicitly indicate that GPU code was generated:
Accelerator kernel generated
    239, #pragma acc loop gang, vector(16) /* blockIdx.y threadIdx.y */
    241, #pragma acc loop gang, vector(16) /* blockIdx.x threadIdx.x */
    CC 1.3 : 11 registers; 72 shared, 8 constant, 0 local memory bytes
    CC 2.0 : 14 registers; 8 shared, 80 constant, 0 local memory bytes
#define _OPENMP
So you can check for OpenMP by using the preprocessor conditionals
#ifdef _OPENMP
  // code for omp
#else
  // code for openacc
#endif
Note: This is only necessary for code that is not a preprocessor directive (#pragmas). For instance, you may want to use functions from "omp.h" or "openacc.h" in your code, and this method lets you check which you should include.
Turn in lake.c (with both OpenMP and OpenACC directives), Makefile, README.openacc, README.openmp
(Optional) If you've implemented an optimized OpenMP version, please include lake_opt.c and README.opt.openmp
Single Author info:
username FirstName MiddleInitial LastName