ARC: A Root Cluster for Research into Scalable Computer Systems
Useful Information
- Pin threads to cores (GNU compilers only): export GOMP_CPU_AFFINITY="0-15:1", see GOMP_CPU_AFFINITY
- More explicit thread pinning: use sched_setaffinity, see man sched_setaffinity and Take charge of processor affinity (a minimal C sketch appears after this list)
- Pin Open MPI tasks to cores (can be used in conjunction with OpenMP threads), see Open MPI core pinning: mpirun -rankfile rankfile ... (an example rankfile appears after this list)
- Pin Open MPI tasks to cores (should NOT be used in conjunction with OpenMP threads), see Open MPI core pinning: mpirun --bind-to-core ...
- Note: MVAPICH2 binds threads automatically
- Monitor thread/process pinning: ps -L -eo pid,tid,nlwp,tty,comm,psr
- IP over Infiniband (IPoIB): use cXXX-ib as hostname
- SDP over Infiniband (faster than IPoIB): use cXXX-ub as hostname.
Then execute your a.out binary as: LD_PRELOAD=libsdp.so a.out ...
Or with MPI: LD_PRELOAD=libsdp.so mpirun -mca btl_tcp_if_include ib0 -mca btl tcp,self env LD_PRELOAD=libsdp.so ...
(works with MPI_THREAD_MULTIPLE)
- Not working: all communication via TCP over Infiniband with Open MPI:
- sed 's/compute-/compute\-ibnet-/' $PBS_NODEFILE >hostfile
- mpirun -mca btl_tcp_if_include ib0 -mca btl tcp,self -hostfile hostfile ...
- MPI via TCP over Infiniband, MPI runtime via TCP over Ethernet with Open MPI (works with MPI_THREAD_MULTIPLE):
- mpirun -mca btl_tcp_if_include ib0 -mca btl tcp,self ...
- All communication via TCP over Ethernet with Open MPI (required for MPI_THREAD_MULTIPLE, since native Infiniband does not support it; see the sketch after this list):
- mpirun -mca btl_tcp_if_include eth0 -mca btl tcp,self ...
- IB monitoring (user: monitor, password: monitor)
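A minimal C sketch of explicit pinning with sched_setaffinity; the choice of core 3 is an illustrative assumption, adapt it to your core layout:

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>

  int main(void) {
      cpu_set_t set;
      CPU_ZERO(&set);
      CPU_SET(3, &set);  /* illustrative: pin the calling thread to core 3 */
      /* pid 0 means "the calling thread" */
      if (sched_setaffinity(0, sizeof(set), &set) != 0) {
          perror("sched_setaffinity");
          return 1;
      }
      printf("now running on core %d\n", sched_getcpu());
      return 0;
  }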
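An illustrative rankfile for the -rankfile option above; the hostnames c001/c002 and slot numbers are placeholders, see mpirun(1) of the installed Open MPI for the exact syntax:

  rank 0=c001 slot=0
  rank 1=c001 slot=1
  rank 2=c002 slot=0
  rank 3=c002 slot=1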
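Before relying on MPI_THREAD_MULTIPLE, it is worth verifying what thread level the library actually granted; a minimal C sketch:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      int provided;
      MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
      if (provided < MPI_THREAD_MULTIPLE) {
          fprintf(stderr, "MPI_THREAD_MULTIPLE not granted (level %d)\n", provided);
          MPI_Abort(MPI_COMM_WORLD, 1);
      }
      /* from here on, MPI calls from concurrent threads are permitted */
      MPI_Finalize();
      return 0;
  }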
Hardware Specs
AMD Opteron Magny-Cours: 16 cores on 108+ compute nodes integrated by Advanced HPC. Machines are 2-way SMPs with Opteron 6128 processors, 8 cores per socket (16 cores per node).
6128 Opteron Processor:
- 128KB split (64KB+64KB) I+D L1 caches, 2-way associative, 64B/line (private)
- 512KB L2 cache, 16-way associative, 64B/line (private)
- 12MB L3 cache, 96-way associative, 64B/line (shared)
- 800MHz-2GHz Core speed
- 3200MHz bus speed (HT)
- 1x1600MHz DDR3 memory controller, 4 channels, ECC (1800MHz Northbridge)
- PCIe 2.0 (16 lanes)
- 80 Watts peak power (1.3V)
- 45nm SOI CMOS fab
- G34 socket
Nodes:
Sandybridge: 16 cores (32 hyper-threaded) on 4 compute nodes integrated by Advanced HPC. Machines are 2-way SMPs with Xeon Sandy Bridge EP E5-2650 processors, 8 cores per socket (16 cores per node, 32 hyper-threaded cores).
E5-2650 Xeon Processor:
- 64KB split (32KB+32KB) I+D L1 caches, 8-way associative, 64B/line (private)
- 256KB L2 cache, 8-way associative, 64B/line (private, non-inclusive)
- 20MB L3 cache, 20-way associative, 64B/line (shared, inclusive)
- Data TLB0: 2-MB or 4-MB pages, 4-way associative, 32 entries
- Data TLB: 4-KB Pages, 4-way set associative, 64 entries
- Instruction TLB: 4-KB pages, 4-way set associative, 64 entries
- L2 TLB: 1-MB, 4-way set associative, 64-byte line size
- Shared 2nd-level TLB: 4 KB pages, 4-way set associative, 512 entries
- 800MHz-2GHz Core speed / 2.8GHz turbo (1-2 cores)
- 2x 4GHz bus speed (QPI)
- 1x 1600MHz DDR3 memory controller, 4 channels, ECC
- PCIe 3.0
- 95 Watts peak power (0.6-1.35V)
- 32nm SOI CMOS fab
- FCLGA2011 socket
Nodes:
Ivybridge: 16 cores (32 hyper-threaded) on 2 compute nodes integrated by Advanced HPC. Machines are 2-way SMPs with Ivy Bridge E5-2667v2 processors, 8 cores per socket (16 cores per node, 32 hyper-threaded cores).
E5-2667v2 Xeon Processor:
- 64KB split (32KB+32KB) I+D L1 caches, 8-way associative, 64B/line (private)
- 256KB L2 cache, 8-way associative, 64B/line (private, non-inclusive)
- 25MB L3 cache, 20-way associative, 64B/line (shared, inclusive)
- Data TLB0: 2-MB or 4-MB pages, 4-way associative, 32 entries
- Data TLB: 4-KB Pages, 4-way set associative, 64 entries
- Instruction TLB: 4-KB pages, 4-way set associative, 64 entries
- L2 TLB: 1-MB, 4-way set associative, 64-byte line size
- Shared 2nd-level TLB: 4 KB pages, 4-way set associative, 512 entries
- 800MHz-3.3GHz Core speed / 4.0GHz turbo (1 core)
- 2x 4GHz bus speed (QPI)
- 1x 1866MHz DDR3 memory controller, 4 channels
- PCIe 3.0
- 130 Watts peak power (0.65-1.3V)
- 22nm SOI CMOS fab
- FCLGA2011 socket
Nodes:
Broadwell: 16 cores (32 hyper-threaded) on 17 compute nodes integrated by Advanced HPC. Machines are 2-way SMPs with Broadwell E5-2620v4 processors, 8 cores per socket (16 cores per node, 32 hyper-threaded cores).
E5-2620v4 Xeon Processor:
- 64KB split (32KB+32KB) I+D L1 caches, 8-way associative, 64B/line (private)
- 256KB L2 cache, 8-way associative, 64B/line (private, non-inclusive)
- 20MB L3 cache, 20-way associative, 64B/line (shared, inclusive)
- Data TLB0: 1-GB pages, 4-way associative, 4 entries
- Data TLB: 4-KB Pages, 4-way set associative, 64 entries
- Instruction TLB: 4-KB pages, 8-way set associative, 64 entries
- L2 TLB: 1-MB, 4-way set associative, 64-byte line size
- Shared 2nd-level TLB: 4 KB / 2 MB pages, 6-way set associative, 1536 entries
- 800MHz-2.1GHz Core speed / 3.0GHz turbo
- 2x 4GHz bus speed (QPI)
- 1x 2133MHz DDR4 memory controller, 4 channels, ECC
- PCIe 3.0 (40 lanes)
- 85 Watts peak power (0.65-1.3V)
- 14nm SOI CMOS fab
- LGA2011-3 socket
Nodes:
Skylake: 16 cores (32 hyper-threaded) on 4 compute nodes integrated by Advanced HPC. Machines are 2-way SMPs with Intel Xeon Silver 4110 2.10GHz processors, 8 cores per socket (16 cores per node, 32 hyper-threaded cores).
4110 Xeon Processor:
- 64KB split (32KB+32KB) I+D L1 caches, 8-way associative, 64B/line (private)
- 1MB L2 cache, 8-way associative, 64B/line (private, non-inclusive)
- 11MB L3 cache, 20-way associative, 64B/line (shared, inclusive)
- ?Data TLB0: 1-GB pages, 4-way associative, 4 entries
- ?Data TLB: 4-KB Pages, 4-way set associative, 64 entries
- ?Instruction TLB: 4-KB pages, 8-way set associative, 64 entries
- ?L2 TLB: 1-MB, 4-way set associative, 64-byte line size
- ?Shared 2nd-level TLB: 4 KB / 2 MB pages, 6-way set associative, 1536 entries
- 800MHz-2.1GHz Core speed / 3.0GHz turbo
- ?2x 4GHz bus speed (QPI)
- 1x 2400MHz DDR4 memory controller, 4 channels, ECC
- PCIe 3.0 (48 lanes)
- 85 Watts peak power (0.65-1.3V)
- 14nm SOI CMOS fab
- FCLGA3647 socket
Nodes:
Rome: 16 cores (32 hyper-threaded) on 2 compute nodes integrated by Advanced HPC. Machines are single-socket systems with an AMD EPYC Rome 7302P 3.0GHz processor with 16 cores (16 cores per node, 32 hyper-threaded cores).
7302P AMD Processor:
- 1MB split (512KB+512KB) I+D L1 caches, 8-way associative, 64-byte line size
- 8MB L2 cache, 8-way associative, 64-byte line size, write-back
- 128MB L3 cache, 64-byte line size
- TLB: 3072 4K page entries
- ?Data TLB:
- ?Instruction TLB:
- ?L2 TLB:
- ?Shared 2nd-level TLB:
- 1.5GHz-3.0GHz Core speed
- 3.2GHz bus speed?
- 8x 3200MHz DDR4 memory controller, 8 channels, ECC
- PCIe 4.0 (128 lanes)
- 155 Watts peak power
- 7/14nm SOI CMOS fab
- SP3 socket
Nodes:
KNL: 64 cores (256 hyper-threaded) on 2 compute nodes. Intel Xeon Phi X200 KNL (64 cores, 4x hyper-threaded), see Ninja Developer platform and configurations.
Xeon Phi 7210 Processor:
- 32KB split (32KB+32KB) I+D L1 caches, 8-way associative, 64B/line (private)
- 32MB (32x1MB) L2 cache, 16-way associative, 64B/line (shared)
- 800MHz-1.3GHz Core speed / 1.4GHz turbo (all cores) / 1.5GHz (1-2 cores)
- 1x 2133MHz memory controller, 6 channels
- PCIe 3.0 (36 lanes)
- 215 Watts peak power (0.55-1.125V)
- 14nm SOI CMOS fab
- LGA3647 socket
Nodes: compute-0-XXX, XXX=124..125 via old arc (use ssh -Y... for X11)
- Supermicro SuperServer 5038K-i pedestal, liquid cooled
- 96GB DRAM + 16GB MCDRAM
- 223.6GB Intel SSDSC2BB24 SSD
- 4TB Hitachi HGST Ultrastar 7K6000 SAS, 128MB cache, 12Gbps
- CentOS 7.2, Linux kernel 3.10.0-327.18.2
- Intel® Parallel Studio XE Cluster Edition through the Developer Access Program, see slides
- To access icc/ifort/openmpi/advixe-cl/amplxe-cl/inspxe-cl/etc.
- export PATH=/opt/intel/bin:/opt/intel/compilers_and_libraries/linux/mpi/bin64:/usr/lib64/openmpi/bin:/opt/intel/advisor/bin64:/opt/intel/vtune_amplifier_xe/bin64:/opt/intel/inspector/bin64:$PATH
- export LD_LIBRARY_PATH=/opt/intel/lib/mic:/opt/intel/lib/intel64:/opt/intel/compilers_and_libraries/linux/mpi/lib64:/usr/lib64/openmpi/lib:/opt/intel/advisor/lib64:/opt/intel/vtune_amplifier_xe/lib64:/opt/intel/inspector/lib64:$LD_LIBRARY_PATH
- export MANPATH=/opt/intel/man/common:/opt/intel/compilers_and_libraries/linux/mpi/man:/opt/intel/advisor/man:/opt/intel/vtune_amplifier_xe/man:/opt/intel/inspector/man:$MANPATH
- Also check other components under /opt/intel
Altera Arria 10 FPGA on c82
Intel/Altera DE5a-Net-DDR4
- source /opt/intelFPGA_pro/17.1/source
- mkdir -p ~/linux/altera/de5a-net-ddr4/17.1/
- cd ~/linux/altera/de5a-net-ddr4/17.1/
- rsync -va --exclude bin/channelizer --exclude bin/fft1d --exclude bin/jpeg_decoder --exclude bin/matrix_mult --exclude bin/sobel_filter --exclude bin/vector_add --exclude bin/mandelbrot --exclude bin/hello_world --exclude bin/video_downscaling /opt/intelFPGA_pro/17.1/hld/board/de5a_net_ddr4/tests .
- #or copy everything, including the prebuilt example binaries excluded above: cp -R /opt/intelFPGA_pro/17.1/hld/board/de5a_net_ddr4/tests .
- cd tests/hello_world
- #edit Makefile to change g++ -> g++44
- make
- bin/host #a minimal OpenCL host sketch appears after this section
- cd ../fft1d/
- aoc device/fft1d.cl -o bin/fft1d.aocx -fpc -no-interleaving=default -board=de5a_net_ddr4 -v
- #edit Makefile to change g++ -> g++44
- make
- bin/host
- quartus #SDK GUI
- #for OpenVINO
- source /opt/intel/openvino/bin/setupvars.sh
- cd ~/
- tar xzf /opt/intel/openvino/openvino-user.tgz
- cd ~/openvino/deployment_tools/demo
- bash demo_squeezenet_download_convert_run.sh
- bash demo_security_barrier_camera.sh
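For orientation, a minimal C sketch of what the bin/host programs above do: load a precompiled .aocx image via clCreateProgramWithBinary (the FPGA flow has no online compilation from source). The file name bin/hello_world.aocx and kernel name hello are illustrative assumptions; the shipped host programs differ in detail.

  #include <stdio.h>
  #include <stdlib.h>
  #include <CL/cl.h>

  int main(void) {
      /* read the precompiled FPGA image (file name is illustrative) */
      FILE *f = fopen("bin/hello_world.aocx", "rb");
      if (!f) { perror("aocx"); return 1; }
      fseek(f, 0, SEEK_END);
      size_t len = ftell(f);
      rewind(f);
      unsigned char *bin = malloc(len);
      if (fread(bin, 1, len, f) != len) { perror("read"); return 1; }
      fclose(f);

      cl_platform_id plat;
      cl_device_id dev;
      clGetPlatformIDs(1, &plat, NULL);  /* first platform (Intel FPGA SDK) */
      clGetDeviceIDs(plat, CL_DEVICE_TYPE_ACCELERATOR, 1, &dev, NULL);

      cl_int err;
      cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
      cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

      /* FPGA flow: program is created from the binary image */
      const unsigned char *bins[] = { bin };
      cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev, &len, bins, NULL, &err);
      clBuildProgram(prog, 1, &dev, "", NULL, NULL);  /* no-op for prebuilt images */

      cl_kernel k = clCreateKernel(prog, "hello", &err);  /* kernel name is illustrative */
      size_t gsz = 1;
      clEnqueueNDRangeKernel(q, k, 1, NULL, &gsz, NULL, 0, NULL, NULL);
      clFinish(q);

      clReleaseKernel(k); clReleaseProgram(prog);
      clReleaseCommandQueue(q); clReleaseContext(ctx);
      free(bin);
      return 0;
  }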
Altera Stratix 10 FPGA on c27 (8GB DRAM) and on c28 (4GB DRAM)
Intel/Altera DE10-Pro
- source /opt/intelFPGA_pro/18.1/source
- mkdir -p ~/linux/altera/de10_pro/18.1/
- cd ~/linux/altera/de10_pro/18.1/
- rsync -va --exclude bin/vector_add --exclude bin/mandelbrot_kernel --exclude bin/hello_world /opt/intelFPGA_pro/18.1/hld/board/de10_pro/tests .
- cd tests/hello_world
- #edit Makefile to change g++ -> g++44
- make
- bin/host
- cd ../vector_add/
- aoc -seed=2 device/vector_add.cl -o bin/vector_add.aocx -board=s10_gh2e2 -v
- #also working: aoc -fast-compile device/vector_add.cl -o bin/vector_add.aocx -board=s10_gh2e2 -v
- #also working: aoc -bsp-flow=flat device/vector_add.cl -o bin/vector_add.aocx -board=s10_gh2e2 -v
- #not working: aoc device/vector_add.cl -o bin/vector_add.aocx -board=s10_gh2e2 -v
- #omitting -bsp-flow=flat / -fast-compile causes:
#00:14:55 Internal Error: Sub-system: QHD, File: /quartus/comp/qhd/qhd_database_model_utils.cpp, Line: 947
#Routing preservation failure!
#This appears to be a BSP bug that Terasic cannot reproduce and has not fixed, and that the Intel patch does not fix.
- #edit Makefile to change g++ -> g++44
- make
- bin/host
- quartus #SDK GUI
- #for OpenVINO
- source /opt/intel/openvino/bin/setupvars.sh
- cd ~/
- tar xzf /opt/intel/openvino/openvino-user.tgz
- cd ~/openvino/deployment_tools/demo
- bash demo_squeezenet_download_convert_run.sh
- bash demo_security_barrier_camera.sh
Useful links:
Related papers: