ARC: A Root Cluster for Research into Scalable Computer Systems

Official Announcement of the ARC Cluster (local copy)
NCSU write-up on the ARC Cluster (local copy)
TechNewsDaily story (local copy)


Main system funded in part by NSF through CRI grant #0958311

Cooling door equipment and installation funded by NCSU CSC; GPUs funded in part by a grant from NCSU ETF funds and by NVIDIA and HP donations


Old ARC Cluster

Looking for the documentation of the old ARC cluster V2b, or the older ARC cluster V2 or the oldest ARC Cluster V1?

Hardware & Software Status


READ THIS BEFORE YOU RUN ANYTHING: Running Jobs (via Slurm)

(1) Notice: Store large data files in the BeeGFS file system, not in your home directory. The home directory is for programs and small data files, which should not exceed 40GB altogether. This upper limit is quota controlled, i.e., once you exceed it, you will have to remove files before you can create new ones. Check via
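The exact quota-check command is not spelled out above; a common way to see how much of the 40GB you are using is sketched below (standard Linux tools are assumed; whether `quota` reports anything depends on how quotas are configured on the cluster):

```shell
# Total size of your home directory (compare against the 40GB limit)
du -sh "$HOME"

# If filesystem quotas are enabled, report your limits directly;
# harmless (silent) where quotas are not configured
quota -s 2>/dev/null || true
```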

(2) Once logged into ARC, immediately obtain access to a compute node (interactively) or schedule batch jobs as shown below. Do not execute any other commands on the login node!
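A minimal sketch of both workflows follows; the time limit, rank count, and script name are examples, not ARC-specific values (check `sinfo` for the real partitions and limits):

```shell
# Interactive: request one compute node, then run commands on it
salloc -N 1 --time=00:30:00
srun hostname            # executes on the allocated compute node, not the login node

# Batch: submit a job script instead of working interactively
sbatch myjob.sh          # hypothetical script name
squeue -u "$USER"        # check your queued/running jobs
```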


Hardware

1280 cores on 80 compute nodes integrated by Advanced HPC. All machines are 2-way SMPs (except for the single-socket AMD Rome/Milan machines) with either AMD or Intel processors (see below) and a total of 16 physical cores per node (32 for c[30-31]).
Nodes:







Networking, Power and Cooling:

Pictures


System Status


Software

All software is 64 bit unless marked otherwise.

Obtaining an Account


Accessing the Cluster


Using OpenMP (via gcc/g++/gfortran)


Running CUDA Programs (Version 12.3)
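A minimal CUDA build-and-run sketch (the `cuda` module appears in the module list below; compiling and running requires a node with an NVIDIA GPU, so this only works inside an allocation on such a node):

```shell
module load cuda
cat > axpy.cu <<'EOF'
#include <cstdio>

/* y[i] = a*x[i] + y[i], one thread per element */
__global__ void axpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1024;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    axpy<<<(n + 255) / 256, 256>>>(3.0f, x, y, n);
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", y[0]);    /* expect 5.0 */
    cudaFree(x); cudaFree(y);
    return 0;
}
EOF
nvcc -o axpy axpy.cu
./axpy
```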


Running MPI Programs with MVAPICH2 and gcc/g++/gfortran (Default)

  • switch back to openmpi4:
    module switch mvapich2 openmpi4
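A minimal MPI hello-world, built with the default mvapich2 toolchain (the rank count and the use of `srun` inside a Slurm allocation are examples; `mpirun` works similarly):

```shell
cat > mpi_hello.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */
    printf("rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF
mpicc -O2 mpi_hello.c -o mpi_hello
# Launch inside a Slurm allocation; 4 ranks here is just an example
srun -n 4 ./mpi_hello
```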
    

Running MPI Programs with Open MPI and gcc/g++/gfortran (Alternative)


Using the NVHPC/PGI compilers (V23.7 for CUDA 12.3)

(includes OpenMP and CUDA support via pragmas, even for Fortran)
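Typical NVHPC invocations for directive-based parallelism are sketched below; the flags are standard NVHPC compiler options, while the source file names are hypothetical:

```shell
module load nvhpc
nvc       -mp     -O2 omp_code.c   -o omp_code   # OpenMP on the CPU
nvc       -mp=gpu -O2 offload.c    -o offload    # OpenMP target offload to the GPU
nvfortran -acc    -O2 acc_code.f90 -o acc_code   # OpenACC, also available for Fortran
```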


Dynamic Voltage and Frequency Scaling (DVFS)
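The current DVFS state of a core can be inspected through the standard Linux cpufreq sysfs interface, sketched below; changing the governor or frequency typically requires root or a site-provided mechanism, and which files exist depends on the node's cpufreq driver:

```shell
# Read the cpufreq state of core 0 (other cores: cpu1, cpu2, ...)
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
# Only present with some drivers/governors:
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies 2>/dev/null
```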


Power monitoring

Sets of three compute nodes share a power meter; in such a set, the lowest-numbered node has the meter attached (either on the serial port or via USB). In addition, two individual compute nodes have power meters (with different GPUs). See this power wiring diagram to identify which nodes belong to a set. The diagram also indicates whether a meter uses serial or USB for a given node. We recommend explicitly requesting a reservation for all nodes in a monitored set (see the salloc commands with the host name option). Monitoring at 1 Hz is accomplished with the following software tools (on the respective nodes where meters are attached):

Virtualization


BeeGFS


PAPI


likwid


Big Data software: Hadoop, Spark, HBase, Storm, Pig, Phoenix, Kafka, Zeppelin, Zookeeper, and Alluxio


Python
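For Python work, a per-project virtual environment keeps package installs isolated from the system Python; a minimal sketch (the venv path is just an example, and the system `python3` is assumed to have the venv module available):

```shell
# Create and use a project-local virtual environment
python3 -m venv "$HOME/venvs/myproject"      # path is an example
source "$HOME/venvs/myproject/bin/activate"
python -c "import sys; print(sys.prefix)"    # confirms the venv is active
deactivate
```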


TensorFlow


PyTorch


Other Packages

A number of packages have been installed; please check their locations (via: rpm -ql pkg-name) and documentation (see URLs) in this PDF if you need them. (Note: only the mvapich2/openmpi/gnu variants are installed.) Typically, you can get access to them via:
  module avail                          # show which modules are available
  module load X
  export | grep X                       # show what has been defined
  gcc/mpicc -I${X_INC} -L${X_LIB} -lX   # for a library
  ./X                                   # for a tool/program; may be some variant of 'X' depending on the toolkit
  module switch X Y                     # for mutually exclusive modules, if X is already loaded
  module unload X
  module help                           # learn how to use modules
    
Current list of available modules (with openmpi4 active; similar lists for other MPI variants):
    ------------------- /opt/ohpc/pub/moduledeps/gnu12-openmpi4 --------------------
       adios/1.13.1        netcdf-fortran/4.6.0    scalapack/2.2.0
       boost/1.80.0        netcdf/4.9.0            scalasca/2.5
       dimemas/5.4.2       omb/6.1                 scorep/7.1
       extrae/3.8.3        opencoarrays/2.10.0     sionlib/1.7.7
       fftw/3.3.10         petsc/3.18.1            slepc/3.18.0
       hypre/2.18.1        phdf5/1.10.8            superlu_dist/6.4.0
       imb/2021.3          pnetcdf/1.12.3          tau/2.31.1
       mfem/4.4            ptscotch/7.0.1          trilinos/13.4.0
       mumps/5.2.1         py3-mpi4py/3.1.3
       netcdf-cxx/4.3.1    py3-scipy/1.5.4
    
    ------------------------ /opt/ohpc/pub/moduledeps/gnu12 ------------------------
       R/4.2.1         mpich/3.4.3-ofi        pdtoolkit/3.25.1
       gsl/2.7.1       mpich/3.4.3-ucx (D)    plasma/21.8.29
       hdf5/1.10.8     mvapich2/2.3.7         py3-numpy/1.19.5
       likwid/5.2.2    openblas/0.3.21        scotch/6.0.6
       metis/5.1.0     openmpi4/4.1.4  (L)    superlu/5.2.1
    
    -------------------------- /opt/ohpc/pub/modulefiles ---------------------------
       EasyBuild/4.6.2                nvhpc-hpcx-cuda12/23.7
       autotools               (L)    nvhpc-hpcx/23.7
       charliecloud/0.15              nvhpc-nompi/23.7
       cmake/3.24.2                   nvhpc/23.7
       cuda                    (L)    ohpc                   (L)
       gnu12/12.2.0            (L)    os
       gnu9/9.4.0                     papi/6.0.0
       hwloc/2.7.0             (L)    prun/2.2               (L)
       libfabric/1.13.0        (L)    singularity/3.7.1
       magpie/2.5                     ucx/1.11.2             (L)
       nvhpc-byo-compiler/23.7        valgrind/3.19.0
       nvhpc-hpcx-cuda11/23.7
    
    

Advanced topics (pending)

For all other topics, access is restricted. Request a root password. Also, read this documentation, which is only accessible from selected NCSU labs.

This applies to:


Known Problems

Consult the FAQ. If that does not help, please report your problem.

References:

Additional references: