Homework 1
Deadline: see web page
Assignments: All parts are
to be solved individually and turned in electronically; written parts
must be plain ASCII text (no Word, PostScript, etc. permitted unless
explicitly stated).
Please use the henry2 cluster (Linux). All programs
have to be written in C, translated with mpicc/gcc and turned in with a
corresponding Makefile.
-
(0 points) Learn how to compile and execute an MPI program.
-
Log in to a 32-bit login node of henry2:
ssh login.hpc.ncsu.edu -l <your-unity-username>
Use your unity password.
Notice: To login to the 64-bit login node, you would need to type:
ssh login64.hpc.ncsu.edu -l <your-unity-username>
-
Choose one of the three compilers: Gnu / Intel / Portland:
add gnu
add gnu_64 (for the 64-bit version)
add intel
add pgi
-
Write a simple MPI program, such as for calculating Pi (see class
notes or browse the web).
--- Notice: This is one of the few times you are
allowed to use program code from the web for a homework. See class policies.
-
Compile the program:
mpicc -g -o pi pi.c
-
Execute the program on 2 processors:
mpirun -np 2 pi
Try again with a different number of processors.
-
Create a job script pi.bsub (using mpiexec instead of mpirun) and
submit a batch job for 2, 4, 8,
... processors with the LSF command
bsub < pi.bsub
Notice: "Because the job asks for 4 or fewer processors and less than
15 minutes of time, it goes into the high priority debug queue, so
that turnaround is fast and mistakes can be quickly corrected."
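A minimal pi.bsub might look like the sketch below (the wall-time, task count, and output file names are assumptions; adjust them for your runs):

```shell
#!/bin/bash
#BSUB -n 2              # number of MPI tasks
#BSUB -W 10             # wall-clock limit in minutes (under 15 => debug queue)
#BSUB -o pi.out.%J      # stdout file; %J expands to the job id
#BSUB -e pi.err.%J      # stderr file
mpiexec ./pi
```

Resubmit with -n 4, -n 8, ... to cover the requested processor counts.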
-
Monitor the job's progress with
bjobs
Enter the command repeatedly until the job is done. Then, inspect the
output/error files.
-
If you ever want to kill a job, issue
bkill <jobid>
where <jobid> is the job ID obtained from bjobs.
Other useful LSF commands: bpeek
(see output of running job), bhist, bqueues, bhosts, bmod, bbot/btop,
bswitch, bstop/bresume, bkill (see
LSF for
Users for details).
Other useful bsub options: -R "span[ptile=1]" (run 1 task per node).
-
Enhance your program with printf() statements and submit another job.
Check the output file for the printf output. The printf() debugging
technique is your best friend for batch jobs.
Notice: You have very limited disk space in your home directory on
henry2. However, there is more disk space at
/gpfs_share/csc548. Utilize it wisely as it is shared between all 548
students.
Hints:
Nothing to turn in; this is just a warm-up exercise.
-
(50 points) Write an MPI program that determines the
point-to-point message latency for pairs of nodes. You should exchange
point-to-point messages with short message volume (less than 1KB)
between any two nodes and time the round-trip time (rtt). Also report
min/max times. The result/output should be three matrices with node
names (rows/columns) and min/max/rtt values in microseconds. Matrices
are preceded by their respective description: min/max/rtt (in a single
line). Report numbers for at least 16 different nodes. (You may try
larger values if you can get your job through the queues.)
In a README file, try to explain different values in the matrices in
reference to the possible network configuration of nodes on the
cluster.
Hints:
- man PMPI_Get_processor_name
- man gettimeofday
- Use exclusive execution and 1 processor per node resource bsub options:
#BSUB -n 16
#BSUB -x
#BSUB -R "span[ptile=1]"
- Average the rtt over 8 exchanges (skipping the first exchange) -- why?
- Ensure that only two nodes are exchanging messages at any time -- why?
(You could also compare with results not observing this hint.)
- A good message payload is your rank + your hostname (also handy for debugging).
- You may leave the diagonal zero in the matrices.
- Sample output:
AVG:
blade30-5 blade13-10 blade11-7 blade26-10 blade12-2 blade27-11 blade32-8 blade11-1 blade28-3 blade26-5 blade30-11 blade13-14 blade35-5 blade35-3 blade35-13 blade35-12
blade30-5 0 165 162 158 163 151 153 163 149 154 101 176 151 149 148 163
...
Turn in the files rtt.c, Makefile.rtt, rtt.out, rtt.bsub and rtt.README.
-
(50 points) Implement the Pi approximation algorithms in three
different ways: (c) with collective communication (Broadcast/Reduce,
see lecture nodes), (b) with blocking point-to-point communication
(Send/Receive) and (n) with nonblocking communication
(Isend/Irecv/Waitall/Wait). Options (b) and (n) should have two variants:
(r) rooted centralized approach (communicate with rank zero) and (t)
tree-based approach (manually create a binary reduction tree rooted in
rank zero and communicate along the edges to simulate the broadcast
and reduction).
Compare the performance (using MPI_Wtime) for long-running inputs
(large number of intervals) for each approach with submitted jobs (to
ensure low contention). Show your results and comment on the outcome
in the README file.
Turn in the files pic/pibr/pibt/pinr/pint.c, Makefile.pi, and pi.README.
Hints:
What to turn in for programming assignments:
-
commented program(s) as source code; comments count for 15% of the
points (see class policy for guidelines on comments)
-
Makefiles (if required)
-
test programs as source (and input files, if required)
-
README (documentation to outline solution and list commands to
install/execute)
-
in each file, include the following information as a comment at the top
of the file, where "username" is your unity login name and the single author
is the person who wrote this file:
Single Author info:
username FirstName MiddleInitial LastName
How to turn in:
Use the "Submit Homework" link on the course web page. Please
upload all files individually (no zip/tar balls).
Remember: If you submit a file for the second time, it will overwrite
the original file.
Additional references: