Main system funded in part by NSF through CRI grant #0958311.
Cooling door equipment and installation funded by NCSU CSC; GPUs funded in part by a grant from NCSU ETF funds and by NVIDIA and HP donations.
gcc -fopenmp -o fn fn.c
g++ -fopenmp -o fn fn.cpp
gfortran -fopenmp -o fn fn.f
export OMP_PROC_BIND="true"
export OMP_NUM_THREADS=4
export MV2_ENABLE_AFFINITY=0
unset GOMP_CPU_AFFINITY
mpirun -bind-to numa ...

export OMP_PROC_BIND="true"
export OMP_NUM_THREADS=8
export MV2_ENABLE_AFFINITY=0
unset GOMP_CPU_AFFINITY
mpirun -bind-to numa ...
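For reference, a minimal fn.c could look like the sketch below (a hypothetical example, not the course's code; any OpenMP program works here). It prints which core each thread runs on, which is a quick way to check that the OMP_PROC_BIND settings above are taking effect:

/* hypothetical fn.c: prints each thread's core so binding can be verified */
#define _GNU_SOURCE
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    {
        /* sched_getcpu() is glibc-specific; it reports the core this thread is on */
        printf("thread %d of %d on core %d\n",
               omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}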
module load cuda

Notice: Module cuda is activated by default.
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/5_Simulations/nbody
make CCFLAGS=-I${CUDA_HOME}/include LDFLAGS=-L${CUDA_HOME}/lib64 SMS="61 75 86 87"
./nbody -benchmark
cd ../../1_Utilities/bandwidthTest
make CCFLAGS=-I${CUDA_HOME}/include LDFLAGS=-L${CUDA_HOME}/lib64 SMS="61 75 86 87"
./bandwidthTest
cd ../../0_Simple/matrixMulCUBLAS/
make CCFLAGS=-I${CUDA_HOME}/include LDFLAGS=-L${CUDA_HOME}/lib64 SMS="61 75 86 87"
./matrixMulCUBLAS
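As a quick sanity check before building the samples, a small host-only C program using the CUDA runtime API can list the visible GPUs (hypothetical sketch; the bundled deviceQuery sample does the same thing more thoroughly). Compile it with something like gcc devquery.c -I${CUDA_HOME}/include -L${CUDA_HOME}/lib64 -lcudart -o devquery:

/* hypothetical devquery.c: list visible GPUs and their compute capabilities */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        fprintf(stderr, "no CUDA driver/device visible\n");
        return 1;
    }
    for (int d = 0; d < count; d++) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        /* prop.major/minor is the compute capability (cf. the SMS list above) */
        printf("GPU %d: %s (sm_%d%d, %.1f GiB)\n", d, prop.name,
               prop.major, prop.minor,
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}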
module switch openmpi4 mvapich2
mpicc -O3 -o pi pi.c
mpic++ -O3 -o pi pi.cpp
mpifort -O3 -o pi pi.f
prun ./pi
mpiexec.hydra -n 16 -bootstrap slurm ./pi
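The pi.c compiled above is not shown here; a hypothetical version, using the classic midpoint-rule estimate of pi as the integral of 4/(1+x^2) over [0,1], might look like:

/* hypothetical pi.c: each rank sums a strided subset of the intervals */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long n = 100000000;          /* number of intervals */
    const double h = 1.0 / n;
    double local = 0.0;

    for (long i = rank; i < n; i += size) {
        double x = h * (i + 0.5);      /* midpoint of interval i */
        local += 4.0 / (1.0 + x * x);
    }
    local *= h;

    double pi;
    MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("pi ~= %.12f\n", pi);

    MPI_Finalize();
    return 0;
}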
module switch mvapich2 openmpi4
mpicc -O3 -o pi pi.c
mpic++ -O3 -o pi pi.cpp
mpifort -O3 -o pi pi.f
prun ./pi
mpirun --oversubscribe -np 32 ./pi
module unload cuda
module load nvhpc
module unload openmpi4
module load nvhpc
# compile for C (similar for C++/Fortran)
mpicc ...
# use mpirun or prun to execute
prun ./a.out
mpirun ./a.out

Notice that this OpenMPI version has support for CUDA pointers, RDMA, and GPU Direct.
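To illustrate what CUDA-pointer support means in practice, here is a hedged sketch (not the course's code) that passes a device buffer straight to MPI_Send/MPI_Recv; a CUDA-aware OpenMPI stages the transfer or uses GPUDirect RDMA on its own. The compile line is an assumption; something like mpicc cuda_aware.c -I${CUDA_HOME}/include -L${CUDA_HOME}/lib64 -lcudart -o cuda_aware, run on two ranks, should work:

/* hypothetical cuda_aware.c: send GPU memory directly through CUDA-aware MPI */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    double *dev_buf;                              /* device pointer */
    cudaMalloc((void **)&dev_buf, n * sizeof(double));

    if (rank == 0) {
        cudaMemset(dev_buf, 0, n * sizeof(double));
        /* the device pointer is passed directly to MPI */
        MPI_Send(dev_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(dev_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d doubles into GPU memory\n", n);
    }

    cudaFree(dev_buf);
    MPI_Finalize();
    return 0;
}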
mlogger -p 0 -o
wattsup ttyUSB0 watts
export OMPI_MCA_pml="^ucx"
export OMPI_MCA_btl_openib_if_include="mlx5_0:1"
mpirun python3.6 your-code.py
salloc -N 4
source hadoop-setup.sh

Every time you do a new salloc, you should re-run the hadoop-setup script! Always make sure to use source to run it; otherwise, the environment variables that the script sets will not be picked up by your current shell.
Inspect the script's output: you should see the Hadoop components being started and the HDFS filesystem being formatted. You can check with jps which components are running; they should be NodeManager, SecondaryNameNode, DataNode, ResourceManager, and NameNode (and Jps). On any other node of your allocation, it would be just NodeManager and DataNode (and Jps); you can check with ssh cXX jps (where XX depends on the nodes you have in echo $SLURM_NODELIST).
If you do NOT see these components active, then something went wrong. (There is also a log in logs/main.log you can check; it contains many log messages.) To check whether everything was set up correctly, run:
hdfs dfs -ls /user

This should print out something like:
Found 1 items
drwxr-xr-x   - UNITYID supergroup          0 TIMESTAMP /user/UNITYID
The above hadoop-setup script creates a "hadoop" directory. It then links common Hadoop files into this directory and sets some environment variables. Next, it creates some personalized files in the hadoop/etc/hadoop directory, which contain your Unity ID and the nodes you currently have reserved (this is why you have to re-run the script for every new salloc). These files tell Hadoop where to store the Hadoop Distributed File System (HDFS). Temporary folders on each node are also created to hold the HDFS data. Finally, it configures a blank HDFS (starting the NameNode with the -format argument), starts the HDFS (the name/data/secondary nodes), and creates a /user/UNITY-ID directory inside the HDFS (the one shown by the hdfs dfs -ls command).
Before running YOUR code, you must copy the input to the HDFS:
hdfs dfs -put input0 /user/$USER/input0
hdfs dfs -put input1 /user/$USER/input1
Compile and run YOUR code:
javac YOUR.java
jar cf YOUR.jar YOUR*.class
hadoop jar YOUR.jar YOUR input0 input1 &> hadoop_output.txt
rm -rf output*
hdfs dfs -get /user/$USER/output* .

The hadoop_output.txt file will contain the Hadoop output. Inspect this output to find any runtime errors/exceptions. As you are developing your code, you can repeatedly run the above five commands to compile/run your code and get the output.
When you are done, before releasing your reserved nodes, run the following to shut down YARN and the HDFS file system:
hadoop/sbin/stop-yarn.sh
hadoop/sbin/stop-dfs.sh
salloc -N4
tar xvf YOUR.tar
cd YOUR
source spark-hadoop-setup.sh
hdfs dfs -put input /user/UNITYID/input
javac YOUR.java
jar cf YOUR.jar YOUR*.class
spark-submit --class YOUR YOUR.jar input &> spark_output.txt
grep -v '^24\|^(\|^-' spark_output.txt > output.txt
diff -s solution.txt output.txt
When you are done, before releasing your reserved nodes, run the following to shut down the HDFS file system:
hadoop/sbin/stop-dfs.sh
pip3 install jupyter seaborn pydot pydotplus graphviz -U --user
# set a password for your sessions (for security!!)
jupyter notebook password
# start the server
jupyter-notebook --NotebookApp.token='' --no-browser --ip=cXX
# from your VPN/campus machine, assuming port 8888 in the printed URL, issue:
ssh <your-unity-id>@arc.csc.ncsu.edu -L 8889:cXX:8888
# point your local browser at http://localhost:8889 and enter the password
import torch
print(torch.__version__)
print(torch.cuda.is_available())
print(torch.cuda.current_device())
print(torch.cuda.get_device_name(0))
print(torch._C._cuda_getCompiledVersion())
print(torch.rand(2, 3).cuda())  # fails on sm 3.5 GPUs or earlier
module avail                         # show which modules are available
module load X
export | grep X                      # shows what has been defined
gcc/mpicc -I${X_INC} -L${X_LIB} -lx  # for a library (see the worked FFTW example after the module list below)
./X                                  # for a tool/program; may be some variant of 'X' depending on the toolkit
module switch X Y                    # for mutually exclusive modules, if X is already loaded
module unload X
module info                          # learn how to use modules

Current list of available modules (with openmpi4 active; similar lists for the other MPI variants):
------------------- /opt/ohpc/pub/moduledeps/gnu12-openmpi4 --------------------
adios/1.13.1        netcdf-fortran/4.6.0    scalapack/2.2.0
boost/1.80.0        netcdf/4.9.0            scalasca/2.5
dimemas/5.4.2       omb/6.1                 scorep/7.1
extrae/3.8.3        opencoarrays/2.10.0     sionlib/1.7.7
fftw/3.3.10         petsc/3.18.1            slepc/3.18.0
hypre/2.18.1        phdf5/1.10.8            superlu_dist/6.4.0
imb/2021.3          pnetcdf/1.12.3          tau/2.31.1
mfem/4.4            ptscotch/7.0.1          trilinos/13.4.0
mumps/5.2.1         py3-mpi4py/3.1.3
netcdf-cxx/4.3.1    py3-scipy/1.5.4

------------------------ /opt/ohpc/pub/moduledeps/gnu12 ------------------------
R/4.2.1        mpich/3.4.3-ofi         pdtoolkit/3.25.1
gsl/2.7.1      mpich/3.4.3-ucx (D)     plasma/21.8.29
hdf5/1.10.8    mvapich2/2.3.7          py3-numpy/1.19.5
likwid/5.2.2   openblas/0.3.21         scotch/6.0.6
metis/5.1.0    openmpi4/4.1.4 (L)      superlu/5.2.1

-------------------------- /opt/ohpc/pub/modulefiles ---------------------------
EasyBuild/4.6.2            nvhpc-hpcx-cuda12/23.7
autotools (L)              nvhpc-hpcx/23.7
charliecloud/0.15          nvhpc-nompi/23.7
cmake/3.24.2               nvhpc/23.7
cuda (L)                   ohpc (L)
gnu12/12.2.0 (L)           os
gnu9/9.4.0                 papi/6.0.0
hwloc/2.7.0 (L)            prun/2.2 (L)
libfabric/1.13.0 (L)       singularity/3.7.1
magpie/2.5                 ucx/1.11.2 (L)
nvhpc-byo-compiler/23.7    valgrind/3.19.0
nvhpc-hpcx-cuda11/23.7
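As a concrete instance of the -I${X_INC}/-L${X_LIB} pattern above, the sketch below uses the fftw module. This is a hypothetical example; the FFTW_INC and FFTW_LIB variable names are an assumption based on the usual OpenHPC convention, so confirm them with export | grep -i FFTW. After module load fftw, compile with something like gcc fftw_demo.c -I${FFTW_INC} -L${FFTW_LIB} -lfftw3 -o fftw_demo:

/* hypothetical fftw_demo.c: 1-D complex DFT of a small ramp signal */
#include <stdio.h>
#include <fftw3.h>

int main(void) {
    const int n = 8;
    fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);
    fftw_plan plan = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);

    for (int i = 0; i < n; i++) {   /* fill the input after planning */
        in[i][0] = i;               /* real part */
        in[i][1] = 0.0;             /* imaginary part */
    }
    fftw_execute(plan);
    printf("out[0] = %g + %gi (should equal the sum 0+1+...+7 = 28)\n",
           out[0][0], out[0][1]);

    fftw_destroy_plan(plan);
    fftw_free(in);
    fftw_free(out);
    return 0;
}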
This applies to: