Please use the ARC cluster (Linux). All programs must be written in C, compiled with mpicc/gcc, and turned in with a corresponding Makefile.
ssh remote.eos.ncsu.edu -l <your-unity-username> (use your Unity password).
ssh arc.csc.ncsu.edu -l <your-temp-username> (use your assigned password).
mpicc -g -o mpi_hello mpi_hello.c
mpirun -np 2 mpi_hello
Try again with a different number of processors.
qsub mpi_hello.sub
Notice: "Because the job asks for 4 or fewer processors and less than 15 minutes of time, it goes into the high priority debug queue, so that turnaround is fast and mistakes can be quickly corrected."
Example job script for mpi_hello here
qstat
Enter the command repeatedly until the job is done. Then, inspect the output/error files.
qdel [jobid]
where [jobid] is the job id obtained from qstat.
Hints:
Nothing to turn in, this is just a warm-up exercise.
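If you do not already have mpi_hello.c from class, a minimal version along the following lines should be enough for the warm-up (the file handed out may differ):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    /* Each process reports its rank, the total process count, and its host */
    printf("Hello from rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}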
Using a total of 8 nodes, you will find and plot the average point-to-point message round-trip time (rtt) to each node as a function of message size, with the min/max times used as error bars. Repeat the procedure for 16 nodes. (You may try more node counts if you can find space in the job queue.)
Use the root node as the sender; each rtt measurement should be repeated several times, with the average, minimum, and maximum of those trials recorded as the result for that particular node (a minimal ping-pong sketch appears below, after the hints).
Using the topology file, in conjunction with a call to MPI_Get_processor_name on every node in your program, determine the distribution of nodes used by your run. (This can be done after the run by inspection.)
Your plot should include the average rtt as a function of the message size for at least three distinct receive nodes, as well as error bars for the min/max values of each rtt at that message size. Look for interesting plots; do not just plot the first three nodes in order. In particular, try to plot nodes that lie off the L2 switch of the root sender.
In a README file, explain your plots. Do some nodes take longer than others? In particular, could your results be indicative of the underlying network configuration? Explain. Also discuss message size as it relates to latency; are there any odd data points in your graphs?
Hints:
#PBS -l nodes=<# of nodes>:ppn=1
Compute nodes on the cluster are connected under a switch directly to 15 other compute nodes, and from there out into the wider network. The identifier of each switch is given by the lid field:
Switch 36 "S-0002c90200423e70" # "Infiniscale-IV Mellanox Technologies" base port 0 lid 3 lmc 0
Here, lid 3 directly connects to compute nodes:
[1] "H-0002c903000c26f2"[1](2c903000c26f3) # "compute-0-61 HCA-1" lid 49 4xQDR [2] "H-0002c903000bf36e"[1](2c903000bf36f) # "compute-0-60 HCA-1" lid 1 4xQDR [3] "H-0002c903000bf35a"[1](2c903000bf35b) # "compute-0-63 HCA-1" lid 28 4xQDR [4] "H-0002c903000c2646"[1](2c903000c2647) # "compute-0-62 HCA-1" lid 25 4xQDR [5] "H-0002c903000bf35e"[1](2c903000bf35f) # "compute-0-64 HCA-1" lid 31 4xQDR [6] "H-0002c903000c26de"[1](2c903000c26df) # "compute-0-65 HCA-1" lid 47 4xQDR ... [25] "H-0002c903000c26fa"[1](2c903000c26fb) # "compute-0-66 HCA-1" lid 112 4xQDR [26] "H-0002c903000c26e2"[1](2c903000c26e3) # "compute-0-67 HCA-1" lid 63 4xQDR [27] "H-0002c903000c263a"[1](2c903000c263b) # "compute-0-68 HCA-1" lid 59 4xQDR [28] "H-0002c903000c27c2"[1](2c903000c27c3) # "compute-0-69 HCA-1" lid 117 4xQDR [29] "H-0002c903000c27a6"[1](2c903000c27a7) # "compute-0-71 HCA-1" lid 34 4xQDR [30] "H-0002c903000c2732"[1](2c903000c2733) # "compute-0-70 HCA-1" lid 22 4xQDR [31] "H-0002c903000c265e"[1](2c903000c265f) # "compute-0-72 HCA-1" lid 29 4xQDR [32] "H-0002c903000c266a"[1](2c903000c266b) # "compute-0-75 HCA-1" lid 32 4xQDR [33] "H-0002c903000c264e"[1](2c903000c264f) # "compute-0-74 HCA-1" lid 26 4xQDR [34] "H-0002c903000c26ee"[1](2c903000c26ef) # "compute-0-76 HCA-1" lid 48 4xQDR [35] "H-0002c903000bf246"[1](2c903000bf247) # "compute-0-77 HCA-1" lid 33 4xQDR [36] "H-0002c903000c27ca"[1](2c903000c27cb) # "compute-0-73 HCA-1" lid 44 4xQDRUse this information in your discussion.
Example output, read as follows: [message size] [node_1_avg node_1_min node_1_max] [node_2_avg node_2_min node_2_max] ...
8 8.276531e-06 5.960464e-06 1.788139e-05 5.517687e-06 3.814697e-06 1.096725e-05 5.994524e-06 5.006790e-06 1.096725e-05 6.267003e-06 5.006790e-06 1.192093e-05 5.858285e-06 5.006790e-06 1.096725e-05 5.994524e-06 5.006790e-06 1.192093e-05 6.709780e-06 4.768372e-06 1.215935e-05
Example output graph here
Turn in the files rtt.c, rtt.out, rtt8.png, rtt16.png, and rtt.README
(rtt8.png is the graph using 8 nodes, rtt16.png is the graph using 16 nodes).
Download the code for this problem: hw1.c
This code implements the serial version of the numerical techniques described below. You will update this code to run in parallel.
Given some function, the code will numerically compute the derivative at all grid points, the integral of the function over the grid, and the error and standard deviation of the error of the numerical computation.
IMPORTANT: You MUST supply a function that is continuous and smooth - that is, one that has an analytic first derivative. Examples include fn(x) = exp(x), or fn(x) = sqrt(x) on a positive grid range.
IMPORTANT: Be sure to choose a function that does not misbehave between the user-defined grid values XI and XF. That is, do not perform illegal math operations such as fn(x) = 1/x on a range containing the grid value x = 0.0, or fn(x) = sqrt(x) with a grid value x = -1.0.
Compile the code with:
gcc -g -o hw1 hw1.c -lm
The program will output a file err.dat which gives:
A common approach to implementing a parallel program is data decomposition. The data the program operates on is split into chunks, and a copy of the program runs on each individual chunk.
Consider an array of N grid points, x_1 to x_N:
For our program, we will split this grid into evenly sized "slices", one per processor rank:
IMPORTANT: Be sure that the number of grid points is evenly divisible by the number of processors, or make sure your final code handles the case where the number of processors does not evenly divide the number of grid points.
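As a sketch of the index arithmetic (the names NGRID, local_n, istart, and iend are illustrative, not taken from hw1.c), each rank can work out its own slice as follows; the remainder handling covers the case where the point count is not evenly divisible:

#include <stdio.h>
#include <mpi.h>

#define NGRID 1000   /* illustrative grid size */

int main(int argc, char *argv[])
{
    int nprocs, rank, base, rem, local_n, istart, iend;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Spread any remainder points over the first (NGRID % nprocs) ranks */
    base    = NGRID / nprocs;
    rem     = NGRID % nprocs;
    local_n = base + (rank < rem ? 1 : 0);
    istart  = rank * base + (rank < rem ? rank : rem);
    iend    = istart + local_n;           /* one past the last owned point */

    printf("rank %d owns global points [%d, %d)\n", rank, istart, iend);

    MPI_Finalize();
    return 0;
}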
It is often the case that we cannot analytically calculate the derivative of some data. However, we can approximate it using a finite differencing scheme.
The method comes from calculus. Recall the definition of the derivative of the function f(x): f'(x) = lim as dx -> 0 of (f(x + dx) - f(x)) / dx
We can assert, with reasonable justification, that if dx is small enough, a good approximation to the derivative is given by: f'(x) ≈ (f(x + dx) - f(x)) / dx
This is a good approximation, but a better one is to use the symmetric form of the above: f'(x) ≈ (f(x + dx) - f(x - dx)) / (2 * dx)
In our program, let xc[i] be our grid points, and let yc[i] be the value of our function at grid point xc[i]. Using the above, we can approximate the derivative at the grid point xc[i] as:
dydx[i] = (yc[i+1] - yc[i-1]) / (2.0 * dx)
We should notice two things: (i.) we need to know the boundary conditions of our full grid domain, that is, the value of the function at the left and right edges of the grid, and (ii.) for our decomposed grid, we will need to communicate the boundary values of our function to the processors to the "left" and "right". The diagram below represents one side of this process: the process of rank i will receive the value fn(x_4) (fn(x) represents the value of our function at point x) from the process of rank i+1, and send the value of the function fn(x_3) to the process of rank i+1.
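One way to program this exchange is sketched below. It assumes the local function values live in yc[1..local_n] with ghost cells at yc[0] and yc[local_n + 1]; these names and this layout are illustrative assumptions, not the layout used in hw1.c, and the non-blocking variant you are asked to compare would use MPI_Isend/MPI_Irecv/MPI_Wait instead.

#include <mpi.h>

/* Sketch: exchange one ghost value with each neighboring rank.
 * yc[0] and yc[local_n + 1] are ghost cells; yc[1..local_n] are owned values. */
void exchange_ghosts(double *yc, int local_n, int rank, int nprocs)
{
    int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;
    MPI_Status status;

    /* Send my last owned value right, receive my left ghost from the left */
    MPI_Sendrecv(&yc[local_n], 1, MPI_DOUBLE, right, 0,
                 &yc[0],       1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, &status);

    /* Send my first owned value left, receive my right ghost from the right */
    MPI_Sendrecv(&yc[1],           1, MPI_DOUBLE, left,  1,
                 &yc[local_n + 1], 1, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, &status);

    /* The first and last ranks exchange with MPI_PROC_NULL, so their outer
     * ghost cells are untouched; fill those from the known boundary conditions. */
}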
Similar to derivatives, integrals of data values must usually be carried out numerically. Here we will use a simple method to approximate our integral: the trapezoidal rule.
Recall that the integral of a function from x_1 to x_n can be written as a sum of integrals over the subintervals: the integral from x_1 to x_n of f(x) dx equals the sum, for i from 1 to n-1, of the integral from x_i to x_(i+1) of f(x) dx.
That is, we can compute one integral for each pair of grid points x_i and x_i+1, then add them together at the end.
To compute each integral, recall that the integral of a function can be described as the area under the curve. As with the finite difference, if the distance between two grid points is small, we can make a good approximation to the integral between the two points by assuming the function is a straight line between them. Then, we just have to compute the area of a trapezoid: area ≈ dx * (f(x_i) + f(x_(i+1))) / 2
Actually, we are computing the area of a rectangle whose top edge is centered on the interval we are integrating over, and whose height is the average value of the function evaluated at the endpoints. This is equivalent to a trapezoid, but calculating the area of a rectangle is easier. Graphically, the procedure looks like this:
The green area is our approximate value for the area under the curve, that is, the integral. Notice that it has some gaps, and the biggest gaps occur where the curve is very steep.
For our parallel implementation, we again need to know the values across domain boundaries. Notice that we have already exchanged this information when calculating the derivatives, so we can simply use it again here. Also notice that, if there are N grid points on the domain, we calculate N - 1 integrals.
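A sketch of the parallel integral is shown below; the function and variable names are illustrative, and the single MPI_Reduce call stands in for one of the two communication methods you are asked to compare (the manual version would gather the partial sums with point-to-point sends to rank 0 instead).

#include <mpi.h>

/* Sketch: trapezoidal rule over this rank's slice, then a global sum.
 * yc[1..local_n] holds this rank's function values and yc[local_n + 1] is the
 * right-hand ghost value already exchanged for the derivative step. */
double parallel_integral(const double *yc, int local_n, int rank,
                         int nprocs, double dx)
{
    double local_sum = 0.0, global_sum = 0.0;
    int i, last;

    /* One trapezoid per pair of adjacent points; the last rank has no
     * right-hand neighbor, so it stops one interval early (N - 1 in total). */
    last = (rank == nprocs - 1) ? local_n - 1 : local_n;
    for (i = 1; i <= last; i++)
        local_sum += 0.5 * (yc[i] + yc[i + 1]) * dx;

    /* Single-call reduction of the partial sums onto rank 0 */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    return global_sum;   /* meaningful only on rank 0 */
}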
By now we know how to calculate good approximations of derivatives and integrals. However, we should recall that these are approximations, and the approximations we've been using are only good, not great. Also note that these approximations only work well for a specific class of functions that fall within narrow parameters. You can think about what happens to the finite difference method if the function is defined piece-wise (instead of continuously), or if it varies rapidly (e.g. f(x) = sin(1000 * x)).
As with most numerical methods, determining the error is important. There are many procedures for doing so, and we will look at a simple but powerful error indicator: the standard deviation.
Standard deviation is simply a measure of how much the data points are spread out from their "expected" values. A small standard deviation means the data is "good", while a large standard deviation means the data is, maybe not necessarily poor, but inconsistent.
We will look at the standard deviation of the relative error for the derivatives: error_i = (dydx[i] - exact derivative at xc[i]) / (exact derivative at xc[i]). Note that we will need the actual values of the derivatives here, so our function must have an exact expression for the derivative.
IMPORTANT: Be careful not to divide by zero here; choose a function whose derivative does not evaluate to zero on any of the grid points.
To find the standard deviation, we must first calculate the mean (or average) error over the entire domain. So first we calculate the error at each grid point.
The standard deviation is then given as: stddev = sqrt( (1/N) * sum over i of (error_i - avg_error)^2 )
where error_i is the error in the derivative at the point x_i and avg_error is the mean error over the entire domain.
The calculation then involves three steps: (i.) calculate the average error for the local domain, (ii.) communicate the local average error to a single process and calculate the average error for the entire domain, and (iii.) use the global average error to calculate the standard deviation.
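Those three steps might look like the sketch below using collective calls (names are illustrative; it accumulates sums rather than local averages so that unequal slice sizes still give the correct global mean, and the assignment also asks you to try a manual point-to-point version of the reduction):

#include <math.h>
#include <mpi.h>

/* Sketch: standard deviation of the derivative error.  err[] is assumed to
 * already hold this rank's local_n error values; nglobal is the total number
 * of grid points.  Names are illustrative, not taken from hw1.c. */
double error_stddev(const double *err, int local_n, int nglobal, int rank)
{
    double local_sum = 0.0, global_sum = 0.0;
    double local_sq = 0.0, global_sq = 0.0, mean = 0.0, stddev = 0.0;
    int i;

    /* (i) local contribution to the global mean error */
    for (i = 0; i < local_n; i++)
        local_sum += err[i];

    /* (ii) combine the local sums on rank 0, form the global mean, share it */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        mean = global_sum / nglobal;
    MPI_Bcast(&mean, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* (iii) squared deviations from the global mean, reduced onto rank 0 */
    for (i = 0; i < local_n; i++)
        local_sq += (err[i] - mean) * (err[i] - mean);
    MPI_Reduce(&local_sq, &global_sq, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        stddev = sqrt(global_sq / nglobal);

    return stddev;   /* meaningful only on rank 0 */
}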
Some guidelines:
Compare the performance (using MPI_Wtime) for long-running inputs (a large number of grid points) for each calculation (finite difference, integral, error) and each communication method (blocking/non-blocking point-to-point, single call/manual reduction), with submitted jobs (to ensure low contention). Show your results and comment on the outcome in the README file. A minimal timing sketch is shown after these guidelines.
Comment on how the errors change for i.) the number of grid points and ii.) the function used.
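A simple timing pattern for one phase is sketched below. The barrier and the MAX reduction are assumptions about how you might report a per-phase time (the slowest rank); the dummy loop merely stands in for the phase being measured.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    long i;
    double t_start, t_elapsed, t_max;
    double work = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Start all ranks together so the interval reflects the whole phase */
    MPI_Barrier(MPI_COMM_WORLD);
    t_start = MPI_Wtime();

    /* ... the phase being measured (finite difference, integral, or error);
     * a dummy loop stands in for it here ... */
    for (i = 0; i < 10000000L; i++)
        work += 1.0 / (i + 1.0);

    t_elapsed = MPI_Wtime() - t_start;

    /* Report the slowest rank's time for this phase */
    MPI_Reduce(&t_elapsed, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("phase took %e seconds (max over ranks, work=%g)\n", t_max, work);

    MPI_Finalize();
    return 0;
}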
Example output of err.dat here.
Example output of hw1_mpi.out here.
NOTE: The above outputs were created using the function fn(x) = sqrt(x).
Turn in the files hw1_mpi.c, err.dat, hw1_mpi.out, README
Single Author info:
username FirstName MiddleInitial LastName
Group info:
username FirstName MiddleInitial LastName
username FirstName MiddleInitial LastName
username FirstName MiddleInitial LastName
Remember: If you submit a file for the second time, it will overwrite the original file.