Please use the ARC cluster (Linux). All programs must be written in C, compiled with mpicc/gcc, and turned in with a corresponding Makefile.
ssh remote.eos.ncsu.edu -l <your-unity-username> (use your Unity password).
ssh arc.csc.ncsu.edu -l <your-temp-username> (use your assigned password).
mpicc -g -o mpi_hello mpi_hello.c
mpirun -np 2 mpi_hello
Try again with a different number of processors.
qsub mpi_hello.sub
Notice: "Because the job asks for 4 or fewer processors and less than 15 minutes of time, it goes into the high priority debug queue, so that turnaround is fast and mistakes can be quickly corrected."
Example job script for mpi_hello here
qstat
Enter the command repeatedly until the job is done. Then, inspect the output/error files.
qdel [jobid]
where [jobid] is the job id obtained from qstat.
Hints:
Nothing to turn in, this is just a warm-up exercise.
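If you do not already have mpi_hello.c from class, a minimal version along the following lines should be enough for the warm-up (the file handed out may differ):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    /* Each process reports its rank, the total process count, and its host */
    printf("Hello from rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}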
Using a total of 8 nodes, you will find and plot the average point-to-point message round-trip time (rtt) to each node as a function of message size, with the min/max times used as error bars. Repeat the procedure for 16 nodes. (You may try more node counts if you can find space in the job queue.)
Use the root node as the sender; each rtt measurement should be repeated several times, with the average, minimum, and maximum of those trials recorded as the result for that particular node (a minimal ping-pong sketch appears below, after the hints).
Using the topology file, in conjunction with a call to MPI_Get_processor_name on every node in your program, determine the distribution of nodes used by your run. (This can be done after the run by inspection.)
Your plot should include the average rtt as a function of the message size for at least three distinct receive nodes, as well as error bars for the min/max values of each rtt at that message size. Look for interesting plots; do not just plot the first three nodes in order. In particular, try to plot nodes that lie off the L2 switch of the root sender.
In a README file, explain your plots. Do some nodes take longer than others? In particular, could your results be indicative of the underlying network configuration? Explain. Also discuss message size as it relates to latency; are there any odd data points in your graphs?
Hints:
#PBS -l nodes=<# of nodes>:ppn=1
Compute nodes on the cluster are connected under a switch directly to 15 other compute nodes, and from there out into the wider network. The identifier of each switch is given by the lid field:
Switch 36 "S-0002c90200423e70" # "Infiniscale-IV Mellanox Technologies" base port 0 lid 3 lmc 0
Here, lid 3 directly connects to compute nodes:
[1] "H-0002c903000c26f2"[1](2c903000c26f3) # "compute-0-61 HCA-1" lid 49 4xQDR [2] "H-0002c903000bf36e"[1](2c903000bf36f) # "compute-0-60 HCA-1" lid 1 4xQDR [3] "H-0002c903000bf35a"[1](2c903000bf35b) # "compute-0-63 HCA-1" lid 28 4xQDR [4] "H-0002c903000c2646"[1](2c903000c2647) # "compute-0-62 HCA-1" lid 25 4xQDR [5] "H-0002c903000bf35e"[1](2c903000bf35f) # "compute-0-64 HCA-1" lid 31 4xQDR [6] "H-0002c903000c26de"[1](2c903000c26df) # "compute-0-65 HCA-1" lid 47 4xQDR ... [25] "H-0002c903000c26fa"[1](2c903000c26fb) # "compute-0-66 HCA-1" lid 112 4xQDR [26] "H-0002c903000c26e2"[1](2c903000c26e3) # "compute-0-67 HCA-1" lid 63 4xQDR [27] "H-0002c903000c263a"[1](2c903000c263b) # "compute-0-68 HCA-1" lid 59 4xQDR [28] "H-0002c903000c27c2"[1](2c903000c27c3) # "compute-0-69 HCA-1" lid 117 4xQDR [29] "H-0002c903000c27a6"[1](2c903000c27a7) # "compute-0-71 HCA-1" lid 34 4xQDR [30] "H-0002c903000c2732"[1](2c903000c2733) # "compute-0-70 HCA-1" lid 22 4xQDR [31] "H-0002c903000c265e"[1](2c903000c265f) # "compute-0-72 HCA-1" lid 29 4xQDR [32] "H-0002c903000c266a"[1](2c903000c266b) # "compute-0-75 HCA-1" lid 32 4xQDR [33] "H-0002c903000c264e"[1](2c903000c264f) # "compute-0-74 HCA-1" lid 26 4xQDR [34] "H-0002c903000c26ee"[1](2c903000c26ef) # "compute-0-76 HCA-1" lid 48 4xQDR [35] "H-0002c903000bf246"[1](2c903000bf247) # "compute-0-77 HCA-1" lid 33 4xQDR [36] "H-0002c903000c27ca"[1](2c903000c27cb) # "compute-0-73 HCA-1" lid 44 4xQDRUse this information in your discussion.
Example output, read as follows: [message size] [node_1_avg node_1_min node_1_max] [node_2_avg node_2_min node_2_max] ...
8 8.276531e-06 5.960464e-06 1.788139e-05 5.517687e-06 3.814697e-06 1.096725e-05 5.994524e-06 5.006790e-06 1.096725e-05 6.267003e-06 5.006790e-06 1.192093e-05 5.858285e-06 5.006790e-06 1.096725e-05 5.994524e-06 5.006790e-06 1.192093e-05 6.709780e-06 4.768372e-06 1.215935e-05
Example output graph here
Turn in the files rtt.c, rtt.out, rtt8.png, rtt16.png, and rtt.README
(rtt8.png is the graph using 8 nodes, rtt16.png is the graph using 16 nodes).
Download the code for this problem: hw1.c
This code implements the serial version of the numerical techniques described below. You will update this code to run in parallel.
Given some function, the code will numerically compute the derivative at all grid points, the integral of the function over the grid, and the error and standard deviation of the error of the numerical computation.
IMPORTANT: You MUST supply a function that is continuous and smooth - that is, one that has an analytic first derivative. Examples include fn(x) = exp(x), or fn(x) = sqrt(x) on a positive grid range.
IMPORTANT: Be sure to choose a function that does not misbehave between the user-defined grid values XI and XF. That is, do not perform illegal math operations such as fn(x) = 1/x on a range containing the grid value x = 0.0, or fn(x) = sqrt(x) with a grid value x = -1.0.
Compile the code with:
gcc -g -o hw1 hw1.c -lm
The program will output a file err.dat which gives:
A common approach to implementing a parallel program is data decomposition. The data the program operates on is split into chunks, and a copy of the program runs on each individual chunk.
Consider an array of N grid points, x_1 to x_N:
For our program, we will split this grid into evenly sized "slices", one per processor rank:
IMPORTANT: Be sure that the number of grid points is evenly divisible by the number of processors, or make sure your final code handles the case where the number of processors does not evenly divide the number of grid points.
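As a sketch of the index arithmetic (the names NGRID, local_n, istart, and iend are illustrative, not taken from hw1.c), each rank can work out its own slice as follows; the remainder handling covers the case where the point count is not evenly divisible:

#include <stdio.h>
#include <mpi.h>

#define NGRID 1000   /* illustrative grid size */

int main(int argc, char *argv[])
{
    int nprocs, rank, base, rem, local_n, istart, iend;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Spread any remainder points over the first (NGRID % nprocs) ranks */
    base    = NGRID / nprocs;
    rem     = NGRID % nprocs;
    local_n = base + (rank < rem ? 1 : 0);
    istart  = rank * base + (rank < rem ? rank : rem);
    iend    = istart + local_n;           /* one past the last owned point */

    printf("rank %d owns global points [%d, %d)\n", rank, istart, iend);

    MPI_Finalize();
    return 0;
}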
It is often the case that we cannot analytically calculate the derivative of some data. However, we can approximate it using a finite differencing scheme.
The method comes from calculus. Recall the definition of the derivative of the function f(x): f'(x) = lim as dx -> 0 of (f(x + dx) - f(x)) / dx
We can assert, with reasonable justification, that if dx is small enough, a good approximation to the derivative is given by: f'(x) ≈ (f(x + dx) - f(x)) / dx
This is a good approximation, but a better one is to use the symmetric form of the above: f'(x) ≈ (f(x + dx) - f(x - dx)) / (2 * dx)
In our program, let xc[i] be our grid points, and let yc[i] be the value of our function at grid point xc[i]. Using the above, we can approximate the derivative at the grid point xc[i] as:
dydx[i] = (yc[i+1] - yc[i-1]) / (2.0 * dx)
We should notice two things: (i.) we need to know the boundary conditions of our full grid domain, that is, the value of the function at the left and right edges of the grid, and (ii.) for our decomposed grid, we will need to communicate the boundary values of our function to the processors to the "left" and "right". The diagram below represents one side of this process: the process of rank i will receive the value fn(x_4) (fn(x) represents the value of our function at point x) from the process of rank i+1, and send the value of the function fn(x_3) to the process of rank i+1.
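One way to program this exchange is sketched below. It assumes the local function values live in yc[1..local_n] with ghost cells at yc[0] and yc[local_n + 1]; these names and this layout are illustrative assumptions, not the layout used in hw1.c, and the non-blocking variant you are asked to compare would use MPI_Isend/MPI_Irecv/MPI_Wait instead.

#include <mpi.h>

/* Sketch: exchange one ghost value with each neighboring rank.
 * yc[0] and yc[local_n + 1] are ghost cells; yc[1..local_n] are owned values. */
void exchange_ghosts(double *yc, int local_n, int rank, int nprocs)
{
    int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;
    MPI_Status status;

    /* Send my last owned value right, receive my left ghost from the left */
    MPI_Sendrecv(&yc[local_n], 1, MPI_DOUBLE, right, 0,
                 &yc[0],       1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, &status);

    /* Send my first owned value left, receive my right ghost from the right */
    MPI_Sendrecv(&yc[1],           1, MPI_DOUBLE, left,  1,
                 &yc[local_n + 1], 1, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, &status);

    /* The first and last ranks exchange with MPI_PROC_NULL, so their outer
     * ghost cells are untouched; fill those from the known boundary conditions. */
}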
Similar to derivatives, integrals of data values must usually be carried out numerically. Here we will use a simple method to approximate our integral: the trapezoidal rule.
Recall that the integral of a function from x_1 to x_n can be written as a sum of integrals over the subintervals: the integral from x_1 to x_n of f(x) dx equals the sum, for i from 1 to n-1, of the integral from x_i to x_(i+1) of f(x) dx.
That is, we can compute one integral for each pair of grid points x_i and x_i+1, then add them together at the end.
To compute each integral, recall that the integral of a function can be described as the area under the curve. As with the finite difference, if the distance between two grid points is small, we can make a good approximation to the integral between the two points by assuming the function is a straight line between them. Then, we just have to compute the area of a trapezoid: area ≈ dx * (f(x_i) + f(x_(i+1))) / 2
Actually, we are computing the area of a rectangle whose top edge is centered on the interval we are integrating over, and whose height is the average value of the function evaluated at the endpoints. This is equivalent to a trapezoid, but calculating the area of a rectangle is easier. Graphically, the procedure looks like this:
The green area is our approximate value for the area under the curve, that is, the integral. Notice that it has some gaps, and the biggest gaps occur where the curve is very steep.
For our parallel implementation, we again need to know the values across domain boundaries. Notice that we have already exchanged this information when calculating the derivatives, so we can simply use it again here. Also notice that, if there are N grid points on the domain, we calculate N - 1 integrals.
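A sketch of the parallel integral is shown below; the function and variable names are illustrative, and the single MPI_Reduce call stands in for one of the two communication methods you are asked to compare (the manual version would gather the partial sums with point-to-point sends to rank 0 instead).

#include <mpi.h>

/* Sketch: trapezoidal rule over this rank's slice, then a global sum.
 * yc[1..local_n] holds this rank's function values and yc[local_n + 1] is the
 * right-hand ghost value already exchanged for the derivative step. */
double parallel_integral(const double *yc, int local_n, int rank,
                         int nprocs, double dx)
{
    double local_sum = 0.0, global_sum = 0.0;
    int i, last;

    /* One trapezoid per pair of adjacent points; the last rank has no
     * right-hand neighbor, so it stops one interval early (N - 1 in total). */
    last = (rank == nprocs - 1) ? local_n - 1 : local_n;
    for (i = 1; i <= last; i++)
        local_sum += 0.5 * (yc[i] + yc[i + 1]) * dx;

    /* Single-call reduction of the partial sums onto rank 0 */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    return global_sum;   /* meaningful only on rank 0 */
}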
By now we know how to calculate good approximations of derivatives and integrals. However, we should recall that these are approximations, and the approximations we've been using are only good, not great. Also note that these approximations only work well for a specific class of functions that fall within narrow parameters. You can think about what happens to the finite difference method if the function is defined piece-wise (instead of continuously), or if it varies rapidly (e.g. f(x) = sin(1000 * x)).
As with most numerical methods, determining the error is important. There are many procedures for doing so, and we will look at a simple but powerful error indicator: the standard deviation.
Standard deviation is simply a measure of how much the data points are spread out from their "expected" values. A small standard deviation means the data is "good", while a large standard deviation means the data is, maybe not necessarily poor, but inconsistent.
We will look at the standard deviation of the relative error for the derivatives: error_i = (dydx[i] - exact derivative at xc[i]) / (exact derivative at xc[i]). Note that we will need the actual values of the derivatives here, so our function must have an exact expression for the derivative.
IMPORTANT: Be careful not to divide by zero here; choose a function whose derivative does not evaluate to zero on any of the grid points.
To find the standard deviation, we must first calculate the mean (or average) error over the entire domain. So first we calculate the error at each grid point.
The standard deviation is then given as: stddev = sqrt( (1/N) * sum over i of (error_i - avg_error)^2 )
where error_i is the error in the derivative at the point x_i and avg_error is the mean error over the entire domain.
The calculation then involves three steps: (i.) calculate the average error for the local domain, (ii.) communicate the local average error to a single process and calculate the average error for the entire domain, and (iii.) use the global average error to calculate the standard deviation.
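Those three steps might look like the sketch below using collective calls (names are illustrative; it accumulates sums rather than local averages so that unequal slice sizes still give the correct global mean, and the assignment also asks you to try a manual point-to-point version of the reduction):

#include <math.h>
#include <mpi.h>

/* Sketch: standard deviation of the derivative error.  err[] is assumed to
 * already hold this rank's local_n error values; nglobal is the total number
 * of grid points.  Names are illustrative, not taken from hw1.c. */
double error_stddev(const double *err, int local_n, int nglobal, int rank)
{
    double local_sum = 0.0, global_sum = 0.0;
    double local_sq = 0.0, global_sq = 0.0, mean = 0.0, stddev = 0.0;
    int i;

    /* (i) local contribution to the global mean error */
    for (i = 0; i < local_n; i++)
        local_sum += err[i];

    /* (ii) combine the local sums on rank 0, form the global mean, share it */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        mean = global_sum / nglobal;
    MPI_Bcast(&mean, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* (iii) squared deviations from the global mean, reduced onto rank 0 */
    for (i = 0; i < local_n; i++)
        local_sq += (err[i] - mean) * (err[i] - mean);
    MPI_Reduce(&local_sq, &global_sq, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        stddev = sqrt(global_sq / nglobal);

    return stddev;   /* meaningful only on rank 0 */
}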
Some guidelines:
Compare the performance (using MPI_Wtime) for long-running inputs (a large number of grid points) for each calculation (finite difference, integral, error) and each communication method (blocking/non-blocking point-to-point, single call/manual reduction), with submitted jobs (to ensure low contention). Show your results and comment on the outcome in the README file. A minimal timing sketch is shown after these guidelines.
Comment on how the errors change for i.) the number of grid points and ii.) the function used.
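A simple timing pattern for one phase is sketched below. The barrier and the MAX reduction are assumptions about how you might report a per-phase time (the slowest rank); the dummy loop merely stands in for the phase being measured.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    long i;
    double t_start, t_elapsed, t_max;
    double work = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Start all ranks together so the interval reflects the whole phase */
    MPI_Barrier(MPI_COMM_WORLD);
    t_start = MPI_Wtime();

    /* ... the phase being measured (finite difference, integral, or error);
     * a dummy loop stands in for it here ... */
    for (i = 0; i < 10000000L; i++)
        work += 1.0 / (i + 1.0);

    t_elapsed = MPI_Wtime() - t_start;

    /* Report the slowest rank's time for this phase */
    MPI_Reduce(&t_elapsed, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("phase took %e seconds (max over ranks, work=%g)\n", t_max, work);

    MPI_Finalize();
    return 0;
}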
Example output of err.dat here.
Example output of hw1_mpi.out here.
NOTE: The above outputs were created using the function fn(x) = sqrt(x).
Turn in the files hw1_mpi.c, err.dat, hw1_mpi.out, README
Single Author info:
username FirstName MiddleInitial LastName
Group info:
username FirstName MiddleInitial LastName
username FirstName MiddleInitial LastName
username FirstName MiddleInitial LastName
Remember: If you submit a file for the second time, it will overwrite the original file.