Old ARC Cluster
Looking for the documentation of the old ARC cluster V2b, the older ARC cluster V2, or the oldest ARC cluster V1?
Hardware & Software Status
READ THIS BEFORE YOU RUN ANYTHING: Running Jobs (via Slurm)
(1) Notice: Store large data files in the BeeGFS file
system and not in your home directory. The home directory is for
programs and small data files, which should not exceed 40GB
altogether. This upper limit is quota controlled, i.e., if you
exceed it, you will have to remove files before you can create
new ones. Usage can be checked as shown below.
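For example (generic Linux commands; the exact quota tooling on ARC may differ):
- du -sh $HOME # total size of your home directory
- quota -s # per-user quota report, if the file system reports quotas to this command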
(2) Once logged into ARC, immediately obtain access to a compute node
(interactively) or schedule batch jobs as shown below. Do not
execute any other commands on the login node!
- interactively ("srun" has been depricated for interactive mode):
- salloc -n 16 # get 16 cores (1 node) in interactive mode
- mpicc -O3 /opt/ohpc/pub/examples/mpi/hello.c # compile MPI program
- prun ./a.out # execute an MPI program over all allocated nodes/cores
- in batch mode:
- compile programs interactively beforehand (see above)
- cp /opt/ohpc/pub/examples/slurm/job.mpi . # script for job
- cat /opt/ohpc/pub/examples/slurm/job.mpi # have a look at it, executes a.out
- sbatch job.mpi # submit the job and wait for it to finish:
creates a job.%j.out file, where %j is the job number (a sketch of such a script is shown below)
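- For reference, the job script is roughly of the following form (a sketch modeled on typical OpenHPC examples; the installed job.mpi may differ in details):
#!/bin/bash
#SBATCH -J test          # job name
#SBATCH -o job.%j.out    # stdout/stderr file, %j expands to the job number
#SBATCH -N 2             # number of nodes
#SBATCH -n 32            # total number of MPI tasks
#SBATCH -t 00:30:00      # wall-clock time limit (hh:mm:ss)
prun ./a.out             # launch the MPI program on the allocated cores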
- more slurm options:
- salloc -n 16 -N 1 # get 1 interactive node with 16 cores
- salloc -n 32 -N 2 -w c[90,91] #run on nodes 90+91
- salloc -n 64 -N 4 -w c[90-93] #run on nodes 90-93
- salloc -n 64 -N 4 -p broadwell #run on any 4 broadwell nodes
- salloc -n 64 -N 4 -p rtx2070 #run on any 4 nodes with RTX 2070 GPUs
- sinfo #available nodes in various queues, queues are listed in "Hardware" section
- squeue # queued jobs
- scontrol show job=16 # show details for job 16
- scancel 16 # cancel job 16
- Slurm documentation
- Slurm command summary
- Editors are available once on a compute node:
- inside a terminal: vi, vim, emacs -nw
- using a separate window: evim, emacs
Hardware
1280 cores on 80 compute nodes integrated by
Advanced HPC. All machines are
2-way SMPs (except for the single-socket AMD Rome/Milan machines) with either
AMD or Intel processors (see below) and a total of 16 physical cores per node (32 for c[30-31]).
Nodes:
- AMD Opteron: nodes c[32], queue: -p opteron
- Intel Sandy Bridge: nodes c[67-70], queue: -p sandy
- Intel Ivy Bridge: nodes c[71-72], queue: -p ivy
- Intel Broadwell: nodes c[50-66], queue: -p broadwell
- Intel Skylake Silver: nodes c[0-19,26-29], queue: -p skylake
- AMD Epyc Rome: nodes c[20-25,33,37-49,73-79], queue: -p rome
- Intel Cascade Lake: nodes c[35-36], queue: -p cascade
- AMD Epyc Milan: nodes c[30-31,34], queue: -p milan
- login node: arcl (1TB HDD + Intel X520-DA2 PCI Express 2.0 Network Adapter E10G42BTDABLK)
- nodes: cXXX, XXX=0..107 (1TB HDD)
- 16 nodes with NVIDIA Quadro P4000 (8 GB, sm 6.1): nodes c[0-3,8-19], queue: -p p4000
- 12 nodes with NVIDIA RTX 2060 (6 GB, sm 7.5): nodes c[26,29,51-57,60-62], queue: -p rtx2060
- 15 nodes with NVIDIA RTX 2070 (8 GB, sm 7.5): nodes c[50,63-67,69-76,79], queue: -p rtx2070
- 2 nodes with NVIDIA RTX 2080 (8 GB, sm 7.5): nodes c[24,28], queue: -p rtx2080
- 21 nodes with NVIDIA RTX 2060 Super (8 GB, sm 7.5): nodes c[21,25,32,34,37-49,58-59,68,78], queue: -p rtx2060super
- 2 nodes with NVIDIA RTX 2080 Super (8 GB, sm 7.5): nodes c[22-23], queue: -p rtx2080super
- 2 nodes with NVIDIA RTX 3060 Ti (8 GB, sm 8.6): nodes c[27,77], queue: -p rtx3060ti
- 6 nodes with NVIDIA RTX A4000 (16 GB, sm 8.6): nodes c[4-7,35-36], queue: -p a4000
- 2 nodes with 4 NVIDIA RTX A6000 GPUs each (48 GB, sm 8.6): nodes c[30-31], queue: -p a6000
- 1 node with NVIDIA A100 (80 GB, sm 8.0): node c[33], queue: -p a100
- 7 nodes with NVIDIA RTX 4060 Ti (8 GB, sm 8.9): nodes c[20-22,26-29], queue: -p rtx4060ti
- Altera Arria 10 FPGA on c82
- Altera Stratix 10 FPGA on c27,c28
- head node: arcs (has 90TB RAID6 using 12xSAS3 10TB HDs (Seagate Exos X16 ST10000NM002G 10TB 7200 RPM) with a
Supermicro H12SSW-NTR Motherboard, Broadcom 3916 RAID controller AOC-S3916L-H16IR-32DD+ and a
Transport ASG-1014S-ACR12N4H, plus a
10GEther 4xSFP+ Broadcom BCM57840S card, 20Gbps bonded (dynamic link aggregation) to the internal GEther switches)
- backup node: arcm (same configuration as arcs, except no 10GEther card)
Networking, Power and Cooling:
Pictures
System Status
Software
All software is 64 bit unless marked otherwise.
Obtaining an Account
- for NCSU students/faculty/staff in Computer Science:
- Send an email to your advisor asking for ARC access and indicate your unity ID.
- Have your advisor endorse and forward the email
to Subhendu Behera.
- If approved, you will be sent a secure link to upload your
public RSA key (with a 4096-bit key length) for SSH access.
- for NCSU students/faculty/staff outside of Computer Science:
- Send a 1-paragraph project description with estimated compute
requirements (number of processors and compute hours per job per
week) in an email to your advisor asking for ARC access and indicate your unity ID.
- Have your advisor endorse and forward the email
to Subhendu Behera.
- If approved, you will be sent a secure link to upload your
public RSA key (with a 4096-bit key length) for SSH access.
- for non-NCSU users:
- Send a 1-paragraph project description with estimated compute
requirements (number of processors and compute hours per job per
week) in an email to your advisor asking for ARC access. Indicate
the hostname and domain name that you will login from (e.g., sys99.csc.ncsu.edu).
- Have your advisor endorse and forward the email
to Subhendu Behera.
- If approved, you will be sent a secure link to upload your
public RSA key (with a 4096-bit key length) for SSH access.
Accessing the Cluster
- Login for NCSU users:
- Login to a machine in the .ncsu.edu domain (or use NCSU's VPN).
- Then issue the ssh command shown in the example after this list (optional -X for X11 display).
- Or use your favorite ssh client under Windows from an .ncsu.edu
machine.
- Login for users outside of NCSU:
- Login to the machine that your public key was generated on.
Non-NCSU access will only work for IP addresses that have been
added as firewall exceptions, so please use only the computer
(IP) you indicated to us; any other computer will not work.
- Then issue the ssh command shown in the example after this list (optional -X for X11 display).
- Or use your favorite ssh client under Windows.
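- In both cases, the login command has roughly this form (a sketch; the host name matches the port-forwarding example in the Tensorflow section below):
ssh -X <your-unity-id>@arc.csc.ncsu.edu # -X (X11 forwarding) is optional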
Using OpenMP (via gcc/g++/gfortran)
-
The "#pragma omp" directive in C/C++ programs works.
gcc -fopenmp -o fn fn.c
g++ -fopenmp -o fn fn.cpp
gfortran -fopenmp -o fn fn.f
-
To run under MVAPICH2 on Opteron nodes (4 NUMA domains over 16 cores),
it's best to use 4 MPI tasks per node, each with 4 OpenMP threads:
export OMP_PROC_BIND="true"
export OMP_NUM_THREADS=4
export MV2_ENABLE_AFFINITY=0
unset GOMP_CPU_AFFINITY
mpirun -bind-to numa ...
-
To run under MVAPICH2 on Sandy/Ivy/Broadwell nodes (2 NUMA domains over 16 cores),
it's best to use 2 MPI tasks per node, each with 8 OpenMP threads:
export OMP_PROC_BIND="true"
export OMP_NUM_THREADS=8
export MV2_ENABLE_AFFINITY=0
unset GOMP_CPU_AFFINITY
mpirun -bind-to numa ...
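-
Putting it together for, e.g., two Broadwell nodes (a sketch; adjust node count, task count, and binary name to your job):
salloc -n 32 -N 2 -p broadwell     # 2 nodes with 16 cores each
export OMP_PROC_BIND="true"
export OMP_NUM_THREADS=8
export MV2_ENABLE_AFFINITY=0
unset GOMP_CPU_AFFINITY
mpirun -np 4 -bind-to numa ./a.out # 2 MPI tasks per node x 8 threads = 32 cores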
Running CUDA Programs (Version 12.2)
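- A minimal workflow sketch (assumptions: a CUDA source file hello.cu, a GPU queue picked from the Hardware section, and the cuda module being loaded, as it is by default; see the module list below):
salloc -n 16 -N 1 -p rtx2070     # allocate a node with an RTX 2070 (sm 7.5)
nvcc -O3 -arch=sm_75 -o hello hello.cu
./hello
nvidia-smi                       # check GPU visibility/utilization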
Running MPI Programs with MVAPICH2 and gcc/g++/gfortran (Default)
- Issue
module switch openmpi4 mvapich2
-
Compile MPI programs written in C/C++/Fortran:
mpicc -O3 -o pi pi.c
mpic++ -O3 -o pi pi.cpp
mpifort -O3 -o pi pi.f
-
Execute the program on, e.g., 2 processors (disabled on the login node, use
compute nodes via "salloc -n 2 ..." instead):
prun ./pi
-
Execute on a subset of the processors allocated by salloc (e.g., "salloc -n 32 ..."):
mpiexec.hydra -n 16 -bootstrap slurm ./pi
-
Switch back to openmpi4:
module switch mvapich2 openmpi4
Running MPI Programs with Open MPI and gcc/g++/gfortran (Alternative)
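- If the openmpi4 module is the one currently loaded (it is marked (L) in the module list below), no module switch is needed; the workflow mirrors the MVAPICH2 case (a sketch):
module list            # verify that openmpi4 is loaded
mpicc -O3 -o pi pi.c   # mpic++ / mpifort for C++ / Fortran
prun ./pi              # run on the cores allocated via salloc or sbatch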
Using the NVHPC/PGI compilers (V23.7 for CUDA 12.2)
(includes OpenMP and CUDA support via pragmas, even for Fortran)
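- Typical compiler invocations (a sketch; the flags shown are standard NVHPC options, see nvc -help for target-specific GPU flags):
module load nvhpc                # provides nvc/nvc++/nvfortran; may swap out the gnu12 compiler family
nvc -O3 -mp -o fn fn.c           # C with OpenMP
nvc++ -O3 -acc -o fn fn.cpp      # C++ with OpenACC directives
nvfortran -O3 -cuda -o fn fn.f90 # CUDA Fortran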
Dynamic Voltage and Frequency Scaling (DVFS)
- Change the frequency/voltage of a core to save energy (with no
or only a minor loss of performance, depending on how memory-bound
an application is)
- Use cpupower and its utilities to change processor frequencies.
Notice: When hyperthreading (ht) is off, you can change one core
at a time; but when ht is on, you will need to change corresponding
pairs of cores, e.g., (0,16) or (1,17) etc. for 32 visible cores
(adapt as needed).
- Example 1 for core 0 without hyperthreading (requires sudo rights):
- grep " ht " /proc/cpuinfo #should not produce any output; o/w see example 2
- cpupower frequency-info
- sudo cpupower -c 0 frequency-set -f 1500Mhz #set to userspace 1.5GHz
- watch cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq #should be nearly constant
- cpupower frequency-info
- sudo cpupower frequency-set -g ondemand #revert to original settings
- Example 2 for core (0,16) pair with hyperthreading (requires sudo rights):
- grep " ht " /proc/cpuinfo #should produce output
- cpupower frequency-info
- sudo cpupower -c 0 frequency-set -f 1500Mhz #set to userspace 1.5GHz
- sudo cpupower -c 16 frequency-set -f 1500Mhz #set to userspace 1.5GHz
- watch cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq #should be nearly constant
- cpupower frequency-info
- sudo cpupower frequency-set -g ondemand #revert to original settings
Power monitoring
Sets of three compute nodes share a power meter; in such a set,
the lowest numbered node has the meter attached (either on the serial
port or via USB). In addition, two individual compute nodes have power
meters (with different GPUs). See
this power wiring diagram to identify
which nodes belong to a set. The diagram also indicates if a meter
uses serial or USB for a given node. We recommend explicitly
requesting a reservation for all nodes in a monitored set (see the
salloc commands with the host name option). Monitoring at 1Hz is
accomplished with the following software tools (on the respective
nodes where meters are attached):
Virtualization
BeeGFS
- cd /mnt/beegfs #to access it from compute nodes
- mkdir $USER #to create your subdirectory (only needs to be done once)
- chmod 700 $USER #to ensure others cannot access your data (only done once)
- cd $USER #go to directory where you should place your large files
- about 145TB of storage over 16 servers (9TB each)
- via 100Gbps IB switch RDMA connection (ib0)
- Not protected by RAID, not backed up!
PAPI
- module load papi
- Reads hardware performance counters
- Check supported counters: papi_avail
- Edit your source file to define performance counter events,
read them and then print or process them, see
PAPI API
- Add to the Makefile compile options: -I${PAPI_INC}
- Add to the Makefile linker options: -L${PAPI_LIB} -lpapi
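- A minimal build/run workflow (a sketch, assuming a C source app.c that already calls the PAPI API as described above):
module load papi
papi_avail | less                    # check which counters this node supports
gcc -O2 -I${PAPI_INC} -o app app.c -L${PAPI_LIB} -lpapi
./app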
likwid
- module load likwid
- Pins threads to specific cores, avoids Linux-based thread
migration, and may increase NUMA performance; see the likwid
project for a complete list of tools (power, pinning, etc.)
- print NUMA core topology: likwid-topology -c -g
- Use likwid-pin to pin threads to specific cores
- Example: likwid-pin myapp
- Example: mpirun -np 2 /usr/local/bin/likwid-pin ./myapp
- Use likwid-perfctr or likwid-mpirun to measure performance counters, optionally with pinned threads
- Others: likwid-mpirun, likwid-powermeter, likwid-setfreq, ...
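- Example invocations (a sketch; performance group names such as MEM vary by CPU, list the available ones with likwid-perfctr -a):
likwid-topology -c -g                  # inspect the core/cache/NUMA layout first
likwid-pin -c 0-7 ./myapp              # pin the application's threads to cores 0-7
likwid-perfctr -C 0-7 -g MEM ./myapp   # measure a counter group with pinned threads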
Big Data software: Hadoop, Spark, Hbase, Storm, Pig, Phoenix, Kafka, Zeppelin, Zookeeper, and Alluxio
Python
Tensorflow
- Tensorflow
- Option 1: install locally
- python3 -m pip install --upgrade --user pip #to upgrade pip
- pip3 install tensorflow
- Option 2: see
OpenHPC Exercise 4 -- Tensorflow under Horovod with Charliecloud
- for jupyter-notebook to work, issue
pip3 install jupyter seaborn pydot pydotplus graphviz -U --user
#set a password for your sessions (for security!!)
jupyter notebook password
#start the server
jupyter-notebook --NotebookApp.token='' --ip=cXX
#from your VPN/campus machine, assuming a port 8888 in the printed URL, issue:
ssh <your-unity-id>@arc.csc.ncsu.edu -L 8889:cXX:8888
#point your local browser at https://localhost:8889 and enter the password
- Notice: Anyone on ARC can now connect to cXX:8888, not just
you, and could potentially write to your notebook or open other files in
this directory and its subdirectories!
- Either use passwords (see above) or read about secure
authentication for jupyter-notebook for stronger protection.
PyTorch
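- Analogous to the Tensorflow instructions above, a local user install can be sketched as follows (plain PyPI wheel; see pytorch.org for CUDA-version-specific install commands):
python3 -m pip install --upgrade --user pip   # upgrade pip first
pip3 install --user torch                     # add torchvision/torchaudio as needed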
Other Packages
A number of packages have been installed; please check out their
location (via: rpm -ql pkg-name) and documentation (see URLs) in this
PDF if you need them. (Notice: only the mvapich2/openmpi/gnu variants
are installed.) Typically, you can get access to them via:
module avail # show which modules are available
module load X
export | grep X #shows what has been defined
gcc/mpicc -I${X_INC} -L${X_LIB} -lX #for a library
./X #for a tool/program, may be some variant of 'X' depending on toolkit
module switch X Y #for mutually exclusive modules if X is already loaded
module unload X
module info #learn how to use modules
Current list of available modules (w/ openmpi4 active, similar lists
for other MPI variants):
------------------- /opt/ohpc/pub/moduledeps/gnu12-openmpi4 --------------------
adios/1.13.1 netcdf-fortran/4.6.0 scalapack/2.2.0
boost/1.80.0 netcdf/4.9.0 scalasca/2.5
dimemas/5.4.2 omb/6.1 scorep/7.1
extrae/3.8.3 opencoarrays/2.10.0 sionlib/1.7.7
fftw/3.3.10 petsc/3.18.1 slepc/3.18.0
hypre/2.18.1 phdf5/1.10.8 superlu_dist/6.4.0
imb/2021.3 pnetcdf/1.12.3 tau/2.31.1
mfem/4.4 ptscotch/7.0.1 trilinos/13.4.0
mumps/5.2.1 py3-mpi4py/3.1.3
netcdf-cxx/4.3.1 py3-scipy/1.5.4
------------------------ /opt/ohpc/pub/moduledeps/gnu12 ------------------------
R/4.2.1 mpich/3.4.3-ofi pdtoolkit/3.25.1
gsl/2.7.1 mpich/3.4.3-ucx (D) plasma/21.8.29
hdf5/1.10.8 mvapich2/2.3.7 py3-numpy/1.19.5
likwid/5.2.2 openblas/0.3.21 scotch/6.0.6
metis/5.1.0 openmpi4/4.1.4 (L) superlu/5.2.1
-------------------------- /opt/ohpc/pub/modulefiles ---------------------------
EasyBuild/4.6.2 nvhpc-hpcx-cuda12/23.7
autotools (L) nvhpc-hpcx/23.7
charliecloud/0.15 nvhpc-nompi/23.7
cmake/3.24.2 nvhpc/23.7
cuda (L) ohpc (L)
gnu12/12.2.0 (L) os
gnu9/9.4.0 papi/6.0.0
hwloc/2.7.0 (L) prun/2.2 (L)
libfabric/1.13.0 (L) singularity/3.7.1
magpie/2.5 ucx/1.11.2 (L)
nvhpc-byo-compiler/23.7 valgrind/3.19.0
nvhpc-hpcx-cuda11/23.7
Advanced topics (pending)
For all other topics, access is restricted. Request a root password.
Also, read this documentation, which is only
accessible from selected NCSU labs.
This applies to:
- booting your own kernel
- installing your own OS
Known Problems
Consult the FAQ. If this does not help, then
please report your problem.
References:
- A User's Guide to MPI by Peter Pacheco
- Debugging: gdb only works on one task with MPI; you need to
"attach" to the other tasks on the respective nodes. We don't have
TotalView (an MPI-aware debugger). You can also use
printf debugging, of course. If your program SEGVs, you can
set "ulimit -c unlimited" and run the MPI program again, which
will create one or more core dump files (one per rank)
named "core.PID", which you can then debug: run "gdb ./binary" and
then "core core.PID" at the gdb prompt (or directly "gdb ./binary core.PID").
Additional references: