Old ARC Cluster
Looking for the documentation of the old ARC Cluster?
Hardware Status
- c[28,29,66,82] IB disabled due to maintenance.
- c[0-29,58-107] are using a new 100Gbps IB switch,
creating their own IB domain; the others remain on the old QDR IB
switch stack. This means MPI programs should remain within one set
(100Gbps effective) or the other (40Gbps theoretical/32Gbps
effective), i.e., you cannot run a single MPI program using
InfiniBand on nodes mixed from both sets.
- c[30-57] will be retired soon.
READ THIS BEFORE YOU RUN ANYTHING: Running Jobs (via Slurm)
(1) Notice: Store large data files in the BeeGFS file
system, not in your home directory. The home directory is for
programs and small data files, which should not exceed 10GB altogether.
(2) Once logged into ARC, immediately obtain access to a compute node
(interactively) or schedule batch jobs as shown below. Do not
execute any other commands on the login node!
- interactively:
- srun -n 16 --pty /bin/bash # get 16 cores (1 node) in interactive mode
- mpicc -O3 /opt/ohpc/pub/examples/mpi/hello.c # compile MPI program
- prun ./a.out # execute an MPI program over all allocated nodes/cores
- in batch mode:
- compile programs interactively beforehand (see above)
- cp /opt/ohpc/pub/examples/slurm/job.mpi . # script for job
- cat /opt/ohpc/pub/examples/slurm/job.mpi # have a look at it; it executes a.out (a hedged sketch of such a script appears after this list)
- sbatch job.mpi # submit the job and wait for it to finish:
creates a job.%j.out file, where %j is the job number
- more slurm options:
- srun -n 32 -X --pty /bin/bash # get 32 cores (2 nodes) in interactive mode with X11 graphical output
- srun -n 16 -N 1 --pty /bin/bash # get 1 interactive node with 16 cores
- srun -n 32 -N 2 -w c[30,31] --pty /bin/bash #run on nodes 30+31
- srun -n 64 -N 4 -w c[30-33] --pty /bin/bash #run on nodes 30-33
- srun -n 64 -N 4 -p opteron --pty /bin/bash #run on any 4 opteron nodes
- srun -n 64 -N 4 -p gtx480 --pty /bin/bash #run on any 4 nodes with GTX 480 GPUs
- sinfo #available nodes in various queues, queues are listed in "Hardware" section
- squeue # queued jobs
- scontrol show job=16 # show details for job 16
- scancel 16 # cancel job 16
- Slurm documentation
- Slurm command summary
- Editors are available once on a compute node:
- inside a terminal: vi, vim, emacs -nw
- using a separate window: evim, emacs
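As referenced above, a minimal sketch of what a batch script like job.mpi might contain (the actual /opt/ohpc/pub/examples/slurm/job.mpi on ARC may use different options and paths):
#!/bin/bash
#SBATCH -J test           # job name
#SBATCH -o job.%j.out     # output file; %j expands to the job number
#SBATCH -N 2              # number of nodes
#SBATCH -n 32             # total number of MPI tasks
#SBATCH -t 00:30:00       # wall-clock time limit
prun ./a.out              # run the MPI program over all allocated cores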
Hardware
1728 cores on 108 compute nodes integrated by
Advanced HPC. All machines are
2-way SMPs (except for the single-socket AMD Rome/Milan machines) with either
AMD or Intel processors (see below) and a total of 16 physical cores per node.
Nodes:
- AMD Opteron: nodes c[30-60], queue: -p opteron
- Intel Sandy Bridge: nodes c[95-98], queue: -p sandy
- Intel Ivy Bridge: nodes c[99-100], queue: -p ivy
- Intel Broadwell: nodes c[78-94], queue: -p broadwell
- Intel Skylake Silver: nodes c[0-19,26-29], queue: -p skylake
- AMD Epyc Rome: nodes c[20-25,61,65-77,101-107], queue: -p rome
- Intel Cascade Lake: nodes c[63-64], queue: -p cascade
- AMD Epyc Milan: nodes c[62], queue: -p milan
- login node: arcl (1TB HDD + Intel X520-DA2 PCI Express 2.0 Network Adapter E10G42BTDABLK)
- nodes: cXXX, XXX=0..107 (1TB HDD)
- 3 nodes with NVIDIA C/M2070 (6 GB, sm 2.0): nodes c[30-32], queue: -p c2070
- 10 nodes with NVIDIA GTX480 (1.5 GB, sm 2.0): nodes c[37-43,45-46,53], queue: -p gtx480
- 1 node with NVIDIA GTX680 (2 GB, sm 3.0): node c55, queue: -p gtx680
- 5 nodes with NVIDIA GTX780 (3 GB, sm 3.5): nodes c[33-36,49], queue: -p gtx780
- 2 nodes with NVIDIA GTX Titan X (12 GB, sm 5.2): nodes c[51-52], queue: -p gtxtitanx
- 2 nodes with NVIDIA GTX 1080 (8 GB, sm 6.1): nodes c[44,47], queue: -p gtx1080
- 1 node with NVIDIA Titan X (12 GB, sm 6.1): node c50, queue: -p titanx
- 16 nodes with NVIDIA Quadro P4000 (8 GB, sm 6.1): nodes c[0-3,8-19], queue: -p p4000
- 13 nodes with NVIDIA RTX 2060 (6 GB, sm 7.5): nodes c[26,29,48,79-85,88-90], queue: -p rtx2060
- 15 nodes with NVIDIA RTX 2070 (8 GB, sm 7.5): nodes c[78,91-95,97-104,107], queue: -p rtx2070
- 2 nodes with NVIDIA RTX 2080 (8 GB, sm 7.5): nodes c[24,28], queue: -p rtx2080
- 27 nodes with NVIDIA RTX 2060 Super (8 GB, sm 7.5): nodes c[20-21,25,54,56-60,62,65-77,86-87,96,106], queue: -p rtx2060super
- 2 nodes with NVIDIA RTX 2080 Super (8 GB, sm 7.5): nodes c[22-23], queue: -p rtx2080super
- 2 nodes with NVIDIA RTX 3060 Ti (8 GB, sm 8.6): nodes c[27,105], queue: -p rtx3060ti
- 6 nodes with NVIDIA A4000 (16 GB, sm 8.6): nodes c[4-7,63-64], queue: -p a4000
- 1 node with NVIDIA A100 (80 GB, sm 8.0): node c[61], queue: -p a100
- Altera Arria 10 FPGA on c82
- Altera Stratix 10 FPGA on c27,c28
- head node: arch (has 8TB SSD RAID5 using 10x Samsung 960 PM863a) with Supermicro X10DRU-i Motherboard and Transport SYS-1028U-TRT, plus a 10GEther 4xSFP+ Broadcom BCM57840S card, 20Gbps bonded (dynamic link aggregation) to internal GEther switches
- backup node: arcb (same configuration as arch, except no 10GEther card)
Networking, Power and Cooling:
Pictures
System Status
Software
All software is 64 bit unless marked otherwise.
Obtaining an Account
- for NCSU students/faculty/staff in Computer Science:
- Send an email to your advisor asking for ARC access and indicate your unity ID.
- Have your advisor endorse and forward the email
to Subhendu Behera.
- If approved, you will be sent a secure link to upload your
public RSA key (with a 4096-bit key length) for SSH access.
- for NCSU students/faculty/staff outside of Computer Science:
- Send a 1-paragraph project description with estimated compute
requirements (number of processors and compute hours per job per
week) in an email to your advisor asking for ARC access and indicate your unity ID.
- Have your advisor endorse and forward the email
to Subhendu Behera.
- If approved, you will be sent a secure link to upload your
public RSA key (with a 4096-bit key length) for SSH access.
- for non-NCSU users:
- Send a 1-paragraph project description with estimated compute
requirements (number of processors and compute hours per job per
week) in an email to your advisor asking for ARC access. Indicate
the hostname and domain name that you will login from (e.g., sys99.csc.ncsu.edu).
- Have your advisor endorse and forward the email
to Utsab Ray.
- If approved, you will be sent a secure link to upload your
public RSA key (with a 4096-bit key length) for SSH access.
Accessing the Cluster
- Login for NCSU users:
- Login to a machine in the .ncsu.edu domain (or use NCSU's VPN).
- Then issue the ssh login command (a hedged example appears after this list).
- Or use your favorite ssh client under Windows from an .ncsu.edu
machine.
- Login for users outside of NCSU:
- Login to the machine that your public key was generated on.
Non-NCSU access will only work for IP numbers that have been
added as firewall exceptions, so please use only the computer
(IP) you indicated to us; any other computer will not work.
- Then issue the ssh login command (see the hedged example after this list).
- Or use your favorite ssh client under Windows.
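A hedged example of the login command referenced above (the exact login hostname is assumed here to be arc.csc.ncsu.edu; substitute your own user name):
ssh your-user-name@arc.csc.ncsu.edu          # plain login
ssh -X your-user-name@arc.csc.ncsu.edu       # with X11 forwarding for graphical programs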
Using OpenMP (via gcc/g++/gfortran)
- The "#pragma omp" directive in C/C++ programs works (a minimal example program is sketched after this list):
gcc -fopenmp -o fn fn.c
g++ -fopenmp -o fn fn.cpp
gfortran -fopenmp -o fn fn.f
- To run under MVAPICH2 on Opteron nodes (4 NUMA domains over 16 cores), it's best to use 4 MPI tasks per node, each with 4 OpenMP threads:
export OMP_PROC_BIND="true"
export OMP_NUM_THREADS=4
export MV2_ENABLE_AFFINITY=0
unset GOMP_CPU_AFFINITY
mpirun -bind-to numa ...
- To run under MVAPICH2 on Sandy/Ivy/Broadwell nodes (2 NUMA domains over 16 cores), it's best to use 2 MPI tasks per node, each with 8 OpenMP threads:
export OMP_PROC_BIND="true"
export OMP_NUM_THREADS=8
export MV2_ENABLE_AFFINITY=0
unset GOMP_CPU_AFFINITY
mpirun -bind-to numa ...
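As referenced above, a minimal OpenMP example (hypothetical file omp_hello.c, not part of the cluster's examples):
#include <stdio.h>
#include <omp.h>
int main(void) {
    /* each thread in the parallel region prints its own ID */
    #pragma omp parallel
    printf("hello from thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
    return 0;
}
Compile and run with: gcc -fopenmp -o omp_hello omp_hello.c && OMP_NUM_THREADS=4 ./omp_hello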
Running CUDA Programs (Versions 8.0, 10.0, 11.1)
- Load your paths:
module load cuda
Notice: The cuda module is activated by default. Capability 3.5 and later devices
use CUDA 11.0, capability 3.0 devices use CUDA 10.0 (the last supported
driver is 410 for the GTX 680), and capability 2.0 devices use CUDA 8.0 (the last
supported driver is 375 for the GTX 480 and C2050/C2070). This means that MPI
programs using CUDA should run on nodes whose devices all have the same
capability, NOT on a mix of capabilities! Use srun -p ... to ensure this is the case.
If a program is compiled for CUDA 11.1, it will fail to run on a
capability 3.5 or 2.0 device with the message:
-> CUDA driver version is insufficient for CUDA runtime version
Simply recompile on that node and it should work; each node has a
CUDA version matching its GPU.
- Install/compile/run SDK samples in your directory (a small standalone kernel sketch appears at the end of this section):
cuda-install-samples-11.1.sh .
#older version: cuda-install-samples-10.0.sh .
cd NVIDIA_CUDA-*_Samples/5_Simulations/nbody
make CCFLAGS=-I${CUDA_HOME}/include LDFLAGS=-L${CUDA_HOME}/lib64 SMS="35 37 50 52 60 61 75 86"
#older: make CCFLAGS=-I${CUDA_HOME}/include LDFLAGS=-L${CUDA_HOME}/lib64 SMS="30 35 37 50 52 60 61"
./nbody -benchmark
cd ../../1_Utilities/bandwidthTest
make CCFLAGS=-I${CUDA_HOME}/include LDFLAGS=-L${CUDA_HOME}/lib64 SMS="35 37 50 52 60 61 75 86"
./bandwidthTest
cd ../../0_Simple/matrixMulCUBLAS/
make CCFLAGS=-I${CUDA_HOME}/include LDFLAGS=-L${CUDA_HOME}/lib64 SMS="35 37 50 52 60 61 75 86"
./matrixMulCUBLAS
- Tools for Developing/Debugging CUDA Programs (most of them only for CUDA 11+).
- cuda-gdb (CUDA debugger, any CUDA version)
- nsight-sys (Eclipse GUI for CUDA)
- ncu -o profile a.out (Profile CUDA behavior)
- export TMPDIR=/tmp/nsight-compute-lock-MY-UNITY-ID
- mkdir /tmp/nsight-compute-lock-MY-UNITY-ID
- This will ensure that you don't get the error: ==ERROR== Error: Failed to prepare kernel for profiling
- ncu --import profile* (print out profiling info)
- ncu-ui (profiler GUI)
- Nsight profiling guide
- NVML API for GPU device monitoring
- NVIDIA MIG (for the A100: sudo nvidia-smi is enabled, but make sure to return
the GPU to its initial state before releasing the node, i.e., deactivate MIG
with: sudo nvidia-smi -i 0 -mig 0)
- GPUDirect
(MVAPICH2 and Mellanox ConnectX-3+ required)
- see Keeneland's GPUDirect documentation on how to enhance your program/compile/run
- see the Jacobi example for OpenACC directives exploiting GPUDirect
- example with 2 nodes GPUDv3 device-to-device RDMA
- set paths for CUDA and MVAPICH2 (using gcc or pgi)
- export MV2_USE_CUDA=1
- mpirun -np 2 /usr/mpi/pgi/mvapich2-1.9/tests/osu-micro-benchmarks-4.0.1/osu_latency -d cuda D D
- Also installed (check Nvidia docs for more details):
- cudnn
- TensorRT (using Tensor Cores on RTX GPUs) for cudnn
- nccl
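As referenced above, a small standalone CUDA kernel outside the SDK samples (hypothetical file vecadd.cu; managed memory requires a capability 3.0+ GPU, and the -arch flag should match the node's GPU):
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void vecadd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    float *a, *b, *c;
    /* unified (managed) memory keeps the example short */
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    vecadd<<<(n + 255) / 256, 256>>>(a, b, c, n);   /* 256 threads per block */
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);   /* expect 3.0 */
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
Compile on a GPU node with, e.g.: nvcc -arch=sm_61 -o vecadd vecadd.cu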
Running MPI Programs with MVAPICH2 and gcc/g++/gfortran (Default)
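With the default gnu and mvapich2 modules already loaded, compile with mpicc/mpic++/mpifort and launch with prun, as in the interactive example at the top of this page. A minimal sketch (hello.c is hypothetical here and may differ from /opt/ohpc/pub/examples/mpi/hello.c):
#include <stdio.h>
#include <mpi.h>
int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);                 /* start MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this task's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of tasks */
    printf("hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
mpicc -O3 -o hello hello.c   # compile with the MVAPICH2 wrapper
prun ./hello                 # run over all allocated nodes/cores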
Running MPI Programs with Open MPI and gcc/g++/gfortran (Alternative)
- Issue
module switch mvapich2 openmpi
or, for the new version:
module switch gnu gnu8
module switch mvapich2 openmpi3
- Compile MPI programs:
mpicc -O3 -o pi pi.c
mpic++ -O3 -o pi pi.cpp
mpifort -O3 -o pi pi.f
- Execute the program on 2 processors (using Open MPI):
prun ./pi
- Execute the program on 32 (virtual) processors using 16 (physical) cores (using Open MPI):
mpirun --oversubscribe -np 32 ./pi
- switch back to MVAPICH2:
module switch openmpi mvapich2
or, for the new version:
module switch gnu8 gnu
module switch openmpi3 mvapich2
Using the PGI compilers (V16.7 for CUDA 8.0 capable GPU nodes, V19.10 for all others)
(includes OpenMP and CUDA support via pragmas, even for Fortran)
- Issue
module unload cuda
module load pgi
- For Fortran 77, use: pgf77 -V x.f
- For Fortran 95, use: pgf95 -V x.f
- For HPF, use: pghpf -V x.f
- For C++, use: pgCC -V x.c
- For ANSI C, use: pgcc -V x.c
- For debugging, use: pgdbg
- For more compile output, add option: -Minfo=all
- For AMD 64-bit, add option: -tp=barcelona-64
- For OpenMP, add option: -mp
- For OpenACC/CUDA (default: 7.0), add options: -acc
- For OpenACC/CUDA 7.5: -acc -Mcuda=cuda7.5,rdc
- For OpenACC/CUDA 8.0: -acc -Mcuda=cuda8.0,rdc
- For OpenACC/CUDA on specific GPUs, run: pgaccelinfo, then use the respective -ta option in the output for compilation, e.g., -acc -ta=tesla,cc20
- with filename.f: supports Fortran ACC pragmas (for CUDA), e.g., !$acc parallel
- with filename.c: supports C ACC pragmas (for CUDA), e.g., #pragma acc parallel (a small sketch appears at the end of this section)
- Slides and exercises on MPI+GPU programming with CUDA and OpenACC
- PGI Documentation
- OpenACC Documentation
- PGI+Open MPI issue:
module switch mvapich2 openmpi/1.10.4
export LMOD_FAMILY_MPI=openmpi
#compile for C (similar for C++/Fortran)
mpicc ...
#use mpirun or prun to execute
prun ./a.out
mpirun ./a.out
Notice that this OpenMPI version has support for
CUDA pointers,
RDMA, and GPU Direct
- PGI+MVAPICH2: not supported
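As referenced above, a small sketch of an OpenACC-annotated C program (hypothetical file saxpy.c, not part of the cluster examples):
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int n = 1 << 20, i;
    float *x = malloc(n * sizeof(float));
    float *y = malloc(n * sizeof(float));
    for (i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* offload the loop to the GPU; copyin/copy manage data movement */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (i = 0; i < n; i++)
        y[i] = 2.0f * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);   /* expect 4.0 */
    free(x); free(y);
    return 0;
}
Compile with, e.g.: pgcc -acc -Minfo=all -o saxpy saxpy.c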
Dynamic Voltage and Frequency Scaling (DVFS)
- Change the frequency/voltage of a core to save energy (with no or
only minor loss of performance, depending on how memory-bound
an application is)
- Use cpupower
and its
utilities
to change processor frequencies
- Example for core 0 (requires sudo rights):
- cpupower frequency-info
- sudo cpupower -c 0 frequency-set -f 1200MHz #set to userspace 1.2GHz
- cpupower frequency-info
- sudo cpupower frequency-set -g ondemand #revert to original settings
Power monitoring
Sets of three compute nodes share a power meter; in such a set,
the lowest numbered node has the meter attached (either on the serial
port or via USB). In addition, two individual compute nodes have power
meters (with different GPUs). See this power wiring diagram to identify
which nodes belong to a set. The diagram also indicates whether a meter
uses serial or USB for a given node. We recommend explicitly requesting
a reservation for all nodes in a monitored set (see the srun commands
with the host name option). Monitoring at 1Hz is accomplished with
software tools on the respective nodes where the meters are attached.
Virtualization with LXD (optionally with X11, VirtualBox, Docker inside)
Container virtualization support is realized
via LXD. Please try to
use CentOS images as they will take much less space than any other
ones since only the differences to the host image need to be stored
in the container. Also, do NOT deploy LXD on nodes c[0-19] as
they host BeeGFS. LXD/docker has been known to lock up nodes, and if
this happens on nodes c[0-19], it would affect other users on other
nodes as the BeeGFS file system would no longer be
operational. Finally, stop and delete images before you release
a node reserved by srun!
- lxd init #press enter to select defaults/empty password
(encouraged), or choose specific settings (discouraged)
If this does not work, send us an email (see above); lxd is sometimes
problematic in its setup.
- lxc image list images:|grep -i centos #list of centos images
- lxc launch images:centos/7/amd64 my-centos #create and start new image
- lxc list #see installed/running images
- lxc exec my-centos -- /bin/bash #get a shell for running image
- yum install openssh-server
- systemctl start sshd
- passwd #enter root password
- #install other useful packages (see CentOS 7 docs), e.g., gcc compiler:
- yum group install "Development Tools"
- #from login node, create another session to your compute node, say cXX:
- ssh cXX
- #using the IP from "lxc list", transfer files over the virtual bridge to the lxc image:
- scp some-file root@10.196.17.XXX: #or use sftp
- lxc stop my-centos #stop the image
- lxc config device add my-centos gpu gpu #optionally add GPU support, then you need to install CUDA
- lxc start my-centos #start the image
- lxc delete my-centos #delete all files of the image
- Further
instructions
for Ubuntu, skip install steps and just look at user commands (lxc)
Notice: Images are installed locally on the node you are
running on. If you need identical images on multiple nodes, then write
a script to create an image from scratch. You cannot simply copy
images as they are in a protected directory.
X11 inside LXD:
- lxc exec my-centos -- /bin/bash
- yum -y install openssh-server xauth xeyes
- systemctl start sshd
- useradd myuser
- passwd myuser
- exit
- lxc info my-centos|grep eth0 #write down your IP addr, e.g., 10.169.173.239
- ssh -X myuser@10.169.173.239
- xeyes #should display on your desktop
- exit
VirtualBox inside LXD (requires X11, see above):
- lxc exec my-centos -- /bin/bash
- cd /etc/yum.repos.d
- wget http://download.virtualbox.org/virtualbox/rpm/rhel/virtualbox.repo
- #edit virtualbox.repo
- repo_gpgcheck=0
- yum install VirtualBox-5.0
- useradd myuser
- passwd myuser
- usermod -a -G vboxusers myuser
- exit
- lxc info my-centos|grep eth0 #write down your IP addr, e.g., 10.169.173.239
- ssh -X myuser@10.169.173.239
- VirtualBox #should display on your desktop
- exit
Docker inside of LXD:
- lxc launch ubuntu-daily:16.04 docker
- lxc exec docker -- apt update
- lxc exec docker -- apt dist-upgrade -y
- lxc exec docker -- apt install docker.io -y
- lxc exec docker -- docker run --detach --name app carinamarina/hello-world-app
- lxc exec docker -- docker run --detach --name web --link app:helloapp -p 80:5000 carinamarina/hello-world-web
- lxc list #copy IP for eth0, say 10.178.150.73
- curl http://10.178.150.73 #output: The linked container said... "Hello World!"
- lxc stop docker
- lxc delete docker #if you don't need it anymore
BeeGFS
- cd /mnt/beegfs #to access it from compute nodes
- mkdir $USER #to create your subdirectory (only needs to be done once)
- chmod 700 $USER #to ensure others cannot access your data (only done once)
- cd $USER #go to directory where you should place your large files
- about 160TB of storage over 16 servers (10TB each)
- Currently limited by 1Gbps switch connection (eth0)
- Not protected by RAID, not backed up!
PVFS2 is being retired; please use BeeGFS instead
cd /pvfs2/$USER@oss-storage-0-108/pvfs2 #to access it
- ls -l
- mkdir $USER #to create your subdirectory (only needs to be done once)
- about 36TB of storage over 4 servers (9.2TB each) under software RAID0
- Currently limited by 1Gbps switch connection (eth1)
- Notice: There appears to be a bug in slurm that sometimes makes this fail.
This seems to only happen when you specify "srun -n=X..." for X>1.
If you run into this, find out the nodes allocated to your srun job via
"echo $SLURM_NODELIST", exit from srun, and issue a new "srun -w c[XYZ]",
where c[XYZ] is the nodelist. Then exit srun again and issue your original
"srun -n=X..." command. After that, you should be able to access pvfs2.
PAPI
- module load papi
- Reads hardware performance counters
- Check supported counters: papi_avail
- Edit your source file to define performance counter events, read them,
and then print or process them (a minimal sketch follows this list); see the
PAPI API
- Add to the Makefile compile options: -I${PAPI_INC}
- Add to the Makefile linker options: -L${PAPI_LIB} -lpapi
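As referenced above, a minimal sketch of PAPI instrumentation (the chosen events are only examples; check papi_avail on the node you use):
#include <stdio.h>
#include <papi.h>

int main(void) {
    int events[2] = { PAPI_TOT_CYC, PAPI_TOT_INS };   /* cycles and instructions */
    long long counts[2];
    int i;
    volatile double s = 0.0;

    if (PAPI_start_counters(events, 2) != PAPI_OK) {  /* start counting */
        fprintf(stderr, "PAPI_start_counters failed\n");
        return 1;
    }
    for (i = 0; i < 1000000; i++) s += i * 0.5;       /* work to measure */
    if (PAPI_stop_counters(counts, 2) != PAPI_OK) {   /* stop and read */
        fprintf(stderr, "PAPI_stop_counters failed\n");
        return 1;
    }
    printf("cycles: %lld, instructions: %lld\n", counts[0], counts[1]);
    return 0;
}
Compile with: gcc -I${PAPI_INC} papi_test.c -L${PAPI_LIB} -lpapi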
likwid V5.2.0
- Pins threads to specific cores, avoids Linux-based thread migration,
and may increase NUMA performance; see the likwid project for a complete
list of tools (power, pinning, etc.)
- print NUMA core topology: likwid-topology -c -g
- Use likwid-pin
to pin threads to specific cores
- Example: likwid-pin myapp
- Example: mpirun -np 2 /usr/local/bin/likwid-pin ./myapp
- Use likwid-perfctr or likwid-mpirun to measure performance counters,
optionally with pinned threads (example invocations follow this list)
- Others: likwid-mpirun, likwid-powermeter, likwid-setfreq, ...
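As referenced above, hedged example invocations (group names such as FLOPS_DP depend on the architecture; list the available groups with likwid-perfctr -a):
likwid-pin -c 0-3 ./myapp                  # pin 4 threads to cores 0-3
likwid-perfctr -C 0-3 -g FLOPS_DP ./myapp  # pin to cores 0-3 and measure the FLOPS_DP group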
Hadoop Map-Reduce and Spark
Simple setup of multi-node Hadoop map-reduce with HDFS; see also the free
AWS setup as an alternative and the original single-node and cluster setup,
but follow the instructions below for ARC. Other components, e.g., YARN,
can be added to the setup below as well (not covered). We'll set up a
Hadoop instance with nodes cXXX and cYYY (optionally more), so you should
have gotten at least 2 nodes with srun.
#append to your ~/.bashrc:
module load java
#then issue the command from a shell:
module load java
#distr config, substitute MY-UNITY-ID with your login ID
mkdir hadoop
cd hadoop
mkdir -p etc/hadoop
cd etc/hadoop
#create file core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://cXXX:9000</value>
</property>
</configuration>
#create file hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/tmp/MY-UNITY-ID/name/data</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/tmp/MY-UNITY-ID/name</value>
</property>
</configuration>
#create mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>cXXX:9001</value>
</property>
</configuration>
#create file masters
cXXX
#create file slaves
cXXX
cYYY
#etc.
#for each cXXX/Y/..., create directories
ssh cXXX rm -fr /tmp/MY-UNITY-ID
ssh cXXX mkdir -p /tmp/MY-UNITY-ID
ssh cYYY rm -fr /tmp/MY-UNITY-ID
ssh cYYY mkdir -p /tmp/MY-UNITY-ID
...
cd ../..
mkdir bin
cd bin
ln -s /usr/local/hadoop/bin/* .
cd ..
mkdir libexec
cd libexec
ln -s /usr/local/hadoop/libexec/* .
cd ..
mkdir sbin
cd sbin
ln -s /usr/local/hadoop/sbin/* .
cd ..
ln -s /usr/local/hadoop/* .
export HADOOP_HOME=`pwd`
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib/native"
export PATH="$PATH:$HADOOP_HOME/bin"
export CLASSPATH=$CLASSPATH:`hadoop classpath`
#distr test: You will get warnings and ssh errors for some commands, ignore them for now
hdfs getconf -namenodes
hdfs namenode -format
sbin/start-dfs.sh
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/MY-UNITY-ID
hdfs dfs -put /usr/local/hadoop/etc/hadoop /user/MY-UNITY-ID/input
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep /user/MY-UNITY-ID/input /user/MY-UNITY-ID/output 'dfs[a-z.]+'
hdfs dfs -get /user/MY-UNITY-ID/output output
cat output/*
sbin/stop-dfs.sh
To get rid of the ssh errors, you need to add a secondary namenode and
other optional services. This is not required; it's optional.
You can also run Spark on top of
Hadoop as follows, which will also default to the HDFS file system:
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
export SPARK_HOME=/usr/local/spark
export CLASSPATH="$CLASSPATH:$SPARK_HOME/lib/*"
export PATH="$PATH:$SPARK_HOME/bin"
run-example SparkPi 10
Tensorflow (2.4)
- Tensorflow (compiled to work
w/ Nvidia capability 3.5 or later GPUs, except for Titan X/GTX 1080)
- Notice: Do not pip install your own tensorflow; it will not work!
The same goes for keras: use tensorflow.keras instead (already installed).
- module load cuda
- python3
import tensorflow as tf
msg = tf.constant('TensorFlow 2.0 Hello World')
tf.print(msg)
- python3 -m pip install --upgrade pip #to upgrade pip
- export PYTHONPATH=$PYTHONPATH:$HOME/.local #to include user local packages
- pip3 install --user pkg-name #to install other python packages as user
- python3 setup.py [install] --user #to install python packages as user via setup scripts
- Tensorflow 1.12 with python2 (legacy, being phased out):
- export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH
- python2
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))
- Tensorflow 1.12 with python3 (legacy, only on c2070/gtx480/gtx680 nodes):
- export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64:$LD_LIBRARY_PATH
- python3
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))
- Also available: python/python2 (version 2.7), pip/pip2: use same user install procedure as above
- Also available: python3.4 (version 3.4), pip3.4: use same user install procedure as above
- Also available: python3 (version 3.6), pip3: use same user install procedure as above
- for jupyter-notebook to work, issue
pip3 install jupyter seaborn pydot pydotplus graphviz -U --user
jupyter-notebook --ip=cXX *.ipynb
#from your VPN/campus machine, assuming a port 8888 in the printed URL, issue:
ssh arc -L 8888:cXX:8888
#point your local browser at the localhost:8888 URL returned by jupyter-notebook
PyTorch
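No further details are listed here; a minimal sketch for checking the installation and GPU visibility from python3 (assuming a system-wide torch install, analogous to the tensorflow example above):
python3
import torch
print(torch.__version__)              # installed PyTorch version
print(torch.cuda.is_available())      # True if the node's GPU is usable
x = torch.rand(2, 3)
if torch.cuda.is_available():
    x = x.to("cuda")                  # move the tensor to the GPU
print(x)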
Other Packages
A number of packages have been installed; please check out their
location (via: rpm -ql pkg-name) and documentation (see the URLs) in this
PDF if you need them. (Notice: only the mvapich2/openmpi/gnu variants
are installed.) Typically, you can get access to them via:
module avail # show which modules are available
module load X
export |grep X #shows what has been defined
gcc/mpicc -I${X_INC} -L${X_LIB} -lx #for a library
./X #for a tool/program, may be some variant of 'X' depending on toolkit
module switch X Y #for mutually exclusive modules if X is already loaded
module unload X
module info #learn how to use modules
Current list of available modules:
-------------------- /opt/ohpc/pub/moduledeps/gnu-mvapich2 ---------------------
adios/1.10.0 mpiP/3.4.1 petsc/3.7.0 scorep/3.0
boost/1.61.0 mumps/5.0.2 phdf5/1.8.17 sionlib/1.7.0
fftw/3.3.4 netcdf/4.4.1 scalapack/2.0.2 superlu_dist/4.2
hypre/2.10.1 netcdf-cxx/4.2.1 scalasca/2.3.1 tau/2.26
imb/4.1 netcdf-fortran/4.4.4 scipy/0.18.0 trilinos/12.6.4
------------------------- /opt/ohpc/pub/moduledeps/gnu -------------------------
R_base/3.3.1 metis/5.1.0 ocr/1.0.1 pdtoolkit/3.22
gsl/2.2.1 mvapich2/2.2 (L) openblas/0.2.19 superlu/5.2.1
hdf5/1.8.17 numpy/1.11.1 openmpi/1.10.4
------------------------- /opt/ohpc/admin/modulefiles --------------------------
spack/0.8.17
-------------------------- /opt/ohpc/pub/modulefiles ---------------------------
EasyBuild/2.9.0 java pgi-llvm
autotools (L) ohpc (L) pgi-nollvm
cuda (L) openmpi3/3.1.4 prun/1.1
gnu/5.4.0 (L) papi/5.4.3 prun/1.3 (L,D)
gnu8/8.3.0 pgi/19.10 valgrind/3.11.0
Advanced topics (pending)
For all other topics, access is restricted. Request a root password.
Also, read this documentation, which is only
accessible from selected NCSU labs.
This applies to:
- booting your own kernel
- installing your own OS
Known Problems
Consult the FAQ. If this does not help, then
please report your problem.
References:
- A User's Guide to MPI by Peter Pacheco
- Debugging: gdb only works on one task with MPI; you need to
"attach" to the other tasks on the respective nodes. We don't have
totalview (an MPI-aware debugger). You can also use printf debugging,
of course. If your program SEGVs, you can set ulimit -c unlimited and
run the MPI program again, which will create one or more core dump files
(one per rank) named "core.PID", which you can then debug: run gdb binary
and then issue core core.PID (a sketch follows).
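A minimal sketch of that core-dump workflow (core.12345 is a hypothetical file name):
ulimit -c unlimited        # allow core dumps in this shell
prun ./a.out               # run the MPI program; a crashing rank writes core.PID
gdb ./a.out core.12345     # load the binary together with a rank's core file
bt                         # inside gdb: show the backtrace of the crashed rank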
Additional references: