Homework 3

Deadline: see web page
Assignments: All parts are to be solved individually (turned in electronically; written parts in ASCII text, no Word, PostScript, etc. permitted unless explicitly stated).

All programs have to be written in C, compiled with mpicc/gcc, and turned in with a corresponding Makefile.

  1. (50 points) Analyze MPI call statistics and devise a method, through PMPI instrumentation, to generate a communication matrix.

    Turn in pmpi.c, matrix.data, README.pmpi
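A minimal sketch of the PMPI interposition idea, counting only MPI_Send for illustration; the MAX_RANKS bound and the output format are assumptions, and a real pmpi.c would wrap the other communication routines and write the gathered matrix to matrix.data:

```c
/* pmpi.c (sketch): intercept MPI calls via the PMPI profiling interface.
 * Only MPI_Send is counted here; a full solution would wrap the other
 * point-to-point and collective calls as well. */
#include <mpi.h>
#include <stdio.h>

#define MAX_RANKS 64                 /* assumed upper bound, for the sketch */

static long sends[MAX_RANKS];        /* this rank's row of the comm matrix */
static int  my_rank = -1;

int MPI_Init(int *argc, char ***argv)
{
    int ret = PMPI_Init(argc, argv); /* forward to the real MPI_Init */
    PMPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    return ret;
}

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    if (dest >= 0 && dest < MAX_RANKS)
        sends[dest]++;               /* record one sender -> receiver edge */
    return PMPI_Send(buf, count, type, dest, tag, comm);
}

int MPI_Finalize(void)
{
    /* dump this rank's row; rank 0 could instead collect all rows with
       PMPI_Gather and write the full matrix to matrix.data */
    printf("rank %d:", my_rank);
    for (int d = 0; d < MAX_RANKS; d++)
        printf(" %ld", sends[d]);
    printf("\n");
    return PMPI_Finalize();
}
```

Because the wrappers have the same names as the real MPI functions, the linker resolves the application's calls to them, and each wrapper forwards to the underlying implementation through its PMPI_ twin.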

  2. (35 points) Modify the LAKE program from HW2 to parallelize for OpenMP.

    An updated version of the serial lake program can be found here. Please use it for this assignment.

    Set up for the PGI Compiler

    The Makefile is set up to use the PGI compiler; you must update your environment to use it. Follow the instructions at http://moss.csc.ncsu.edu/~mueller/cluster/arc/ under the section Using the PGI compilers V12.5.

    OpenMP

    Report the time for the serial code (1 thread) to run with a grid of 1024 and 4 pebbles, up to a time of 2.0 seconds. (NOTE: this will take a good amount of time on one processor, roughly 1 hour in the worst case. Be sure the wall-clock time you request is sufficient.)

     ./lake 1024 4 2.0 1 
    Report the same time for your OpenMP parallel code using the same parameters, with 16 threads:
     ./lake 1024 4 2.0 16 
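To give the flavor of what this involves, here is a minimal sketch of OpenMP applied to a five-point stencil update of the kind lake.c performs; the function name, array names, and update formula are illustrative, not the actual lake.c code:

```c
/* Sketch only: a lake-style wave update, parallelized across rows.
 * evolve, un/uc/uo, and the formula are illustrative, not lake.c's. */
void evolve(double *un, const double *uc, const double *uo,
            int n, double h, double dt)
{
    #pragma omp parallel for       /* each thread takes a band of rows */
    for (int i = 1; i < n - 1; i++) {
        for (int j = 1; j < n - 1; j++) {
            int idx = i * n + j;
            /* five-point Laplacian of the current field */
            double lap = (uc[idx - 1] + uc[idx + 1] +
                          uc[idx - n] + uc[idx + n] - 4.0 * uc[idx]) / (h * h);
            un[idx] = 2.0 * uc[idx] - uo[idx] + dt * dt * lap;
        }
    }
}
```

Under PGI the directive is activated by the -mp option already in the Makefile; the thread count from the command line would presumably be applied with omp_set_num_threads (or via OMP_NUM_THREADS).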

    When you turn in your README.openmp, consider the following questions:

    As always, your grader prefers answers that include sample timing data.

    Keep this code as is for the next problem; we will remove the compiler option for OpenMP and replace it with OpenACC, so you only need to carry around one set of code.

  3. (15 points) Modify the LAKE program from HW2 to parallelize for OpenACC.

    OpenACC

    We will use the code from the previous problem. First, the Makefile must be updated. In the Makefile there is a variable called ACC; this defines the type of acceleration to use.

    Change from using OpenMP:

    ACC=-mp
    
    To using OpenACC:
    ACC=-acc -ta=nvidia
    
    Finally, the submission script must be updated to submit to a CUDA queue. In lake.qsub, change
    #PBS -q default
    
    to
    #PBS -q cuda
    

    Update lake.c to include simple loop accelerators, i.e.

    #pragma acc kernels loop
    

    These can be put, for now, on the outer for loops (you will experiment later to determine the best setup for OpenACC).

    Keep the OpenMP directives in place; the option used to compile them (should have) been removed, so the compiler will simply ignore them. This way we keep the code portable.
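For instance, the same illustrative stencil sketched for the OpenMP part can carry both directives at once; whichever backend ACC does not enable simply ignores the foreign pragma (names and formula are again illustrative, not lake.c's):

```c
/* Sketch only: one loop nest carrying both OpenMP and OpenACC directives.
 * Only the directive matching the enabled compiler option takes effect. */
void evolve(double *un, const double *uc, const double *uo,
            int n, double h, double dt)
{
    #pragma omp parallel for    /* active under -mp, ignored under -acc */
    #pragma acc kernels loop    /* active under -acc, ignored under -mp */
    for (int i = 1; i < n - 1; i++) {
        for (int j = 1; j < n - 1; j++) {
            int idx = i * n + j;
            double lap = (uc[idx - 1] + uc[idx + 1] +
                          uc[idx - n] + uc[idx + n] - 4.0 * uc[idx]) / (h * h);
            un[idx] = 2.0 * uc[idx] - uo[idx] + dt * dt * lap;
        }
    }
}
```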

    Here, you will attempt to do the following

    If you haven't already, compile your program. Notice that PGI dumps a hefty amount of information to the screen.

    Much of this can be ignored. If you want to clean up this output, update the -Minfo compiler flag in your Makefile:

    CFLAGS=-I$(IDIR) $(ACC) -fast -Minfo=accel -Minline -Msafeptr
    
    This will only keep the OpenACC compiler information.

    If you haven't already, run your program. Below is a sample of a serial run and our naive OpenACC run:

    1. Serial lake.log:
      running /home/cmmauney/hw3/lake/lake with (256 x 256) grid, until 1.000000, with 1 threads
      ...
      Simulation took 1.661175 seconds
      
    2. OpenACC lake.log:
      running /home/cmmauney/hw3/lake/lake with (256 x 256) grid, until 1.000000, with 1 threads
      ...
      Simulation took 3.469350 seconds
      

    The naive implementation of OpenACC was slower than the serial code! You should notice something similar with your runs.

    At this point, you should have code ready to optimize. From here on out, points are awarded according to how well you optimize your code. Some things to consider (these are not questions you need to discuss, but guides for your optimizations):
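To illustrate the kind of change that usually dominates: in a naive version, every kernel launch copies the grids to the GPU and back, once per time step. A data region hoisted around the time loop keeps them resident on the device instead. A sketch under assumed names (run_sim and the arrays are illustrative, not lake.c's):

```c
/* Sketch only: hoist data movement out of the time loop.
 * Grids live on the GPU for the whole simulation; only host pointers
 * rotate between steps. */
void run_sim(double *uo, double *uc, double *un, int n,
             double h, double dt, int nsteps)
{
    #pragma acc data copy(uo[0:n*n], uc[0:n*n]) create(un[0:n*n])
    {
        for (int t = 0; t < nsteps; t++) {
            #pragma acc kernels loop present(uo, uc, un)
            for (int i = 1; i < n - 1; i++) {
                for (int j = 1; j < n - 1; j++) {
                    int idx = i * n + j;
                    double lap = (uc[idx - 1] + uc[idx + 1] +
                                  uc[idx - n] + uc[idx + n] - 4.0 * uc[idx]) / (h * h);
                    un[idx] = 2.0 * uc[idx] - uo[idx] + dt * dt * lap;
                }
            }
            /* rotate host pointers; present() matches device buffers by
               host address, so no transfers happen between steps. Note a
               real version must take care that the final field is copied
               back (e.g. with an update directive), since after an odd
               number of swaps the result sits in the create'd buffer. */
            double *tmp = uo; uo = uc; uc = un; un = tmp;
        }
    }
}
```

With this structure, host-device transfers occur only at entry and exit of the data region rather than once per time step, which is typically where most of the naive version's time goes.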

    As an example, here are results after a few optimizations are carried out for both OpenMP and OpenACC:

    1. OpenMP lake.log:
      running /home/cmmauney/hw3/lake/lake with (1024 x 1024) grid, until 1.000000, with 16 threads
      ...
      Simulation took 15.543285 seconds
      
    2. OpenACC lake.log:
      running /home/cmmauney/hw3/lake/lake with (1024 x 1024) grid, until 1.000000, with 1 threads
      ...
      Simulation took 1.589567 seconds
      

    Your code is expected, at the very least (that is, for any points on this question), to be faster than the serial version. As a benchmark, use the parameters

    ./lake 512 4 4.0 1
    

    Your OpenACC code should be at least 20x faster than the serial code in this benchmark to receive full credit. Extra points (up to 2) will be given for code that is 40x faster (+1) or 50x or more faster (+2).

    Note: The code must be consistently faster; once is not enough. Use the compiler options provided.
    As a reference, the serial code had an average runtime of 56 s for me.

    Report on your results in your README.openacc file. Please discuss your optimizations in detail: the effect of problem size (smaller vs. larger grids, shorter vs. longer simulation times), where your biggest optimization came from (e.g., thread scheduling? memory management? screaming at the screen louder?), possible improvements to the code, etc.

    Hints

    Turn in lake.c (with both OpenMP and OpenACC directives), Makefile, README.openacc, README.openmp
    (Optional) If you've implemented an optimized OpenMP version, please include lake_opt.c, README.opt.openmp

What to turn in for programming assignments: