ARC: A Root Cluster for Research into Scalable Computer Systems
Useful Information
- Pin threads to cores (only with GNU compilers): export GOMP_CPU_AFFINITY="0-15:1", see GOMP_CPU_AFFINITY in the libgomp documentation
- More explicit thread pinning: use sched_setaffinity, see man sched_setaffinity and "Take charge of processor affinity"
- Pin Open MPI tasks to cores (can be used in conjunction with OpenMP threads):
Open MPI core pinning: mpirun -rankfile rankfile ...
- Pin Open MPI tasks to cores (should NOT be used in conjunction with OpenMP threads):
Open MPI core pinning: mpirun --bind-to-core ...
- Note: MVAPICH2 binds threads automatically
- Monitor thread/process pinning: ps -L -eo pid,tid,nlwp,tty,comm,psr
- IP over Infiniband (IPoIB): use cXXX-ib as hostname
- SDP over Infiniband (faster than IPoIB): use cXXX-ub as hostname.
Then execute your a.out binary as: LD_PRELOAD=libsdp.so a.out ...
Or with MPI: LD_PRELOAD=libsdp.so mpirun -mca btl_tcp_if_include ib0 -mca btl tcp,self env LD_PRELOAD=libsdp.so ... (works with MPI_THREAD_MULTIPLE)
- Not working: all communication via TCP over Infiniband with Open MPI:
- sed 's/compute-/compute\-ibnet-/' $PBS_NODEFILE >hostfile
- mpirun -mca btl_tcp_if_include ib0 -mca btl tcp,self -hostfile hostfile ...
- MPI via TCP over Infiniband, MPI runtime via TCP over Ethernet with Open MPI (works with MPI_THREAD_MULTIPLE):
- mpirun -mca btl_tcp_if_include ib0 -mca btl tcp,self ...
- all communication via TCP over Ethernet with Open MPI (required
for MPI_THREAD_MULTIPLE as Infiniband natively does not work with it):
- mpirun -mca btl_tcp_if_include eth0 -mca btl tcp,self ...
- IB monitoring (user: monitor, password: monitor)
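The -rankfile option above needs a file that maps each MPI rank to a host and a core. The following is a minimal sketch, assuming Open MPI's "rank N=host slot=core" rankfile syntax and a hostfile with one hostname per line. The helper name make_rankfile, the host names c0 and c1, and the choice of consecutive core numbers starting at 0 are illustrative assumptions, not part of the cluster setup.

```shell
# make_rankfile HOSTFILE RANKS_PER_NODE   (hypothetical helper)
# Prints one "rank N=host slot=core" line per MPI rank, placing
# RANKS_PER_NODE consecutive ranks on cores 0..RANKS_PER_NODE-1 of each host.
make_rankfile() {
    hostfile=$1
    per_node=$2
    # sort -u collapses hosts that are listed more than once and
    # orders them alphabetically.
    sort -u "$hostfile" | {
        rank=0
        while read -r host; do
            [ -n "$host" ] || continue      # skip empty lines
            core=0
            while [ "$core" -lt "$per_node" ]; do
                echo "rank $rank=$host slot=$core"
                rank=$((rank + 1))
                core=$((core + 1))
            done
        done
    }
}
```

A possible use inside a batch job would be make_rankfile "$PBS_NODEFILE" 16 > rankfile, followed by the mpirun -rankfile rankfile ... call listed above.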
Hardware Specs
Sandybridge: 4 compute nodes with 16 cores (32 hyper-threaded) each, integrated by Advanced HPC. Machines are 2-way SMPs with Xeon Sandy Bridge EP E5-2650 processors with 8 cores per socket (16 cores per node, 32 hyper-threaded cores).
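When choosing core numbers for pinning, it helps to know which logical CPU numbers belong to which socket and physical core. The following is a minimal sketch, assuming the standard Linux sysfs layout under /sys/devices/system/cpu; the helper name list_topology is hypothetical, and the optional argument exists only so the function can be pointed at a different directory.

```shell
# list_topology [CPU_ROOT]   (hypothetical helper)
# Prints "cpu=N socket=S core=C" for every logical CPU, sorted by CPU number.
# Logical CPUs that share both socket and core are hyper-thread siblings.
list_topology() {
    root=${1:-/sys/devices/system/cpu}
    for dir in "$root"/cpu[0-9]*; do
        # Skip entries without topology information (for example offline CPUs).
        [ -r "$dir/topology/core_id" ] || continue
        cpu=${dir##*/cpu}
        socket=$(cat "$dir/topology/physical_package_id")
        core=$(cat "$dir/topology/core_id")
        echo "cpu=$cpu socket=$socket core=$core"
    done | sort -t= -k2,2n
}
```

Running list_topology on a compute node shows how the logical CPU numbers are distributed over the two sockets, which can guide the values used in GOMP_CPU_AFFINITY or a rankfile.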
E5-2650 Xeon Processor:
- 64KB split (32KB+32KB) I+D L1 caches, 8-way associative, 64B/line (private)
- 256KB L2 cache, 8-way associative, 64B/line (private, non-inclusive)
- 20MB L3 cache, 20-way associative, 64B/line (shared, inclusive)
- Data TLB0: 2-MB or 4-MB pages, 4-way associative, 32 entries
- Data TLB: 4-KB Pages, 4-way set associative, 64 entries
- Instruction TLB: 4-KB pages, 4-way set associative, 64 entries
- L2 TLB: 1-MB, 4-way set associative, 64-byte line size
- Shared 2nd-level TLB: 4 KB pages, 4-way set associative, 512 entries
- 800MHz-2GHz Core speed / 2.8GHz turbo (1-2 cores)
- 2x 4GHz bus speed (QPI)
- 1x 1600MHz DDR3 memory controller, 4 channels, ECC
- PCIe 3.0
- 95 Watts peak power (0.6-1.35V)
- 32nm SOI CMOS fab
- FCLGA2011 socket
Nodes:
Ivybridge: 2 compute nodes with 16 cores (32 hyper-threaded) each, integrated by Advanced HPC. Machines are 2-way SMPs with Ivy Bridge E5-2667v2 processors with 8 cores per socket (16 cores per node, 32 hyper-threaded cores).
E5-2667v2 Xeon Processor:
- 64KB split (32KB+32KB) I+D L1 caches, 8-way associative, 64B/line (private)
- 256KB L2 cache, 8-way associative, 64B/line (private, non-inclusive)
- 25MB L3 cache, 20-way associative, 64B/line (shared, inclusive)
- Data TLB0: 2-MB or 4-MB pages, 4-way associative, 32 entries
- Data TLB: 4-KB Pages, 4-way set associative, 64 entries
- Instruction TLB: 4-KB pages, 4-way set associative, 64 entries
- L2 TLB: 1-MB, 4-way set associative, 64-byte line size
- Shared 2nd-level TLB: 4 KB pages, 4-way set associative, 512 entries
- 800MHz-3.3GHz Core speed / 4.0GHz turbo (1 core)
- 2x 4GHz bus speed (QPI)
- 1x 1866MHz DDR3 memory controller, 4 channels
- PCIe 3.0
- 130 Watts peak power (0.65-1.3V)
- 22nm SOI CMOS fab
- FCLGA2011 socket
Nodes:
Broadwell: 17 compute nodes with 16 cores (32 hyper-threaded) each, integrated by Advanced HPC. Machines are 2-way SMPs with Broadwell E5-2620v4 processors with 8 cores per socket (16 cores per node, 32 hyper-threaded cores).
E5-2620v4 Xeon Processor:
- 64KB split (32KB+32KB) I+D L1 caches, 8-way associative, 64B/line (private)
- 256KB L2 cache, 8-way associative, 64B/line (private, non-inclusive)
- 20MB L3 cache, 20-way associative, 64B/line (shared, inclusive)
- Data TLB0: 1-GB, 4-way associative, 4 entries
- Data TLB: 4-KB Pages, 4-way set associative, 64 entries
- Instruction TLB: 4-KB pages, 8-way set associative, 64 entries
- L2 TLB: 1-MB, 4-way set associative, 64-byte line size
- Shared 2nd-level TLB: 4 KB / 2 MB pages, 6-way set associative, 1536 entries
- 800MHz-2.1GHz Core speed / 3.0GHz turbo
- 2x 4GHz bus speed (QPI)
- 1x 2133MHz DDR4 memory controller, 4 channels, ECC
- PCIe 3.0 (40 lanes)
- 85 Watts peak power (0.65-1.3V)
- 14nm SOI CMOS fab
- LGA2011-3 socket
Nodes:
Skylake: 4 compute nodes with 16 cores (32 hyper-threaded) each, integrated by Advanced HPC. Machines are 2-way SMPs with Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz processors with 8 cores per socket (16 cores per node, 32 hyper-threaded cores).
4110 Xeon Processor:
- 64KB split (32KB+32KB) I+D L1 caches, 8-way associative, 64B/line (private)
- 1MB L2 cache, 8-way associative, 64B/line (private, non-inclusive)
- 11MB L3 cache, 20-way associative, 64B/line (shared, inclusive)
- ?Data TLB0: 1-GB, 4-way associative, 4 entries
- ?Data TLB: 4-KB Pages, 4-way set associative, 64 entries
- ?Instruction TLB: 4-KB pages, 8-way set associative, 64 entries
- ?L2 TLB: 1-MB, 4-way set associative, 64-byte line size
- ?Shared 2nd-level TLB: 4 KB / 2 MB pages, 6-way set associative, 1536 entries
- 800MHz-2.1GHz Core speed / 3.0GHz turbo
- ?2x 4GHz bus speed (QPI)
- 1x 2400MHz DDR4 memory controller, 4 channels, ECC
- PCIe 3.0 (48 lanes)
- 85 Watts peak power (0.65-1.3V)
- 14nm SOI CMOS fab
- FCLGA3647 socket
Nodes:
Rome: 2 compute nodes with 16 cores (32 hyper-threaded) each, integrated by Advanced HPC. Machines are single-socket systems with an AMD EPYC Rome 7302P 3.0 GHz processor with 16 cores (16 cores per node, 32 hyper-threaded cores).
7302P AMD Processor:
- 1MB split (512KB+512KB) I+D L1 caches, 8-way associative, 64-byte line size
- 8MB L2 cache, 8-way associative, 64-byte line size, write-back
- 128MB L3 cache, 64-byte line size
- TLB: 3072 4K page entries
- ?Data TLB:
- ?Instruction TLB:
- ?L2 TLB:
- ?Shared 2nd-level TLB:
- 1.5GHz-3.0GHz Core speed
- 3.2GHz bus speed?
- 8x 3200MHz DDR4 memory controller, 8 channels, ECC
- PCIe 4.0 (128 lanes)
- 155 Watts peak power
- 7/14nm SOI CMOS fab
- SP3 socket
Nodes:
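The memory entries in the processor lists above can be turned into a theoretical peak memory bandwidth per socket. The following is a rough sketch, assuming that the listed "MHz" figure is the DDR transfer rate in mega-transfers per second and that each channel is 64 bits (8 bytes) wide, as is standard for DDR3 and DDR4. The helper name peak_bw_mbs is hypothetical, and sustained bandwidth in practice is lower than this upper bound.

```shell
# peak_bw_mbs CHANNELS TRANSFER_RATE   (hypothetical helper)
# Theoretical peak bandwidth in MB/s: channels x transfers per second x 8 bytes.
peak_bw_mbs() {
    echo $(( $1 * $2 * 8 ))
}

# Sandybridge socket: 4 channels of DDR3 at 1600 -> 51200 MB/s (51.2 GB/s)
peak_bw_mbs 4 1600
```

The same call with the channel count and memory speed of another node type gives a comparable upper bound for that socket.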