ARC: A Root Cluster for Research into Scalable Computer Systems


  1. To detect broken nodes, check your error file (a.out.e[job#]) after a job has finished or after you have deleted a hanging job with qdel. Alternatively, start an interactive job:
    qsub -l nodes=64:ppn=16 -I
    mpirun -np 800 xwu1
    If a node is broken, the output may contain something like:
    [compute-0-100.local:11353] [[INVALID],INVALID] ORTE_ERROR_LOG: Error
      in file runtime/orte_init.c at line 120
    In this case, avoid node compute-0-100 and let us know so that we can fix it.

    There are two ways to avoid a node:

    1. List all hosts explicitly (very tedious).
    2. Run an empty job that keeps the bad node busy, then start your real job:
      echo sleep 600 | qsub -l host=compute-0-100,walltime=1000
  2. MPI_THREAD_MULTIPLE does not work over OpenIB under Open MPI; it only works over TCP.
  3. CUDA and mpirun are disabled on the login node.
  4. Error/log files are missing, or ssh asks for a passphrase: you entered a passphrase during your first login after the account was created. Fix:
    rm -fr ~/.ssh
    log in to arc again
    enter an EMPTY passphrase (just hit return)
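
For item 1, checking error files by hand gets tedious when a job spans many nodes. The following is a minimal sketch of how the check might be scripted, assuming error lines have the form shown in item 1 ([compute-0-100.local:11353] ... ORTE_ERROR_LOG). The function names bad_nodes and hold_cmds are hypothetical helpers, not tools installed on the cluster, and hold_cmds only prints the placeholder-job commands rather than submitting them.

```shell
# Hypothetical helper: list the nodes that reported ORTE_ERROR_LOG.
# Usage: bad_nodes a.out.e12345 [more error files...]
bad_nodes() {
    # Lines of interest look like:
    #   [compute-0-100.local:11353] [[INVALID],INVALID] ORTE_ERROR_LOG: Error
    # Keep only the host name before ".local" and drop duplicates.
    grep -h 'ORTE_ERROR_LOG' "$@" 2>/dev/null \
        | sed -n 's/^[[:space:]]*\[\([^]:]*\)\.local:[0-9]*\].*/\1/p' \
        | sort -u
}

# Hypothetical helper: print (do NOT run) one placeholder-job command per
# bad node, in the same form as the qsub line in item 1.
hold_cmds() {
    bad_nodes "$@" | while read -r node; do
        printf 'echo sleep 600 | qsub -l host=%s,walltime=1000\n' "$node"
    done
}
```

Review the output of hold_cmds before use; piping it to sh would submit the placeholder jobs. Please still report any node found this way so that it can be fixed.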