[gridengine users] error in parallel run openmpi for gridengine

Yong Wu wuy069 at gmail.com
Fri Apr 7 07:42:30 UTC 2017


Hi all,
  I submitted a parallel ORCA (quantum chemistry program) job on multiple
nodes in Rocks SGE and got the following error message:
--------------------------------------------------------------------------
A hostfile was provided that contains at least one node not
present in the allocation:

  hostfile:  test.nodes
  node:      compute-0-67

If you are operating in a resource-managed environment, then only
nodes that are in the allocation can be used in the hostfile. You
may find relative node syntax to be a useful alternative to
specifying absolute node names see the orte_hosts man page for
further information.
--------------------------------------------------------------------------
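
To see which entries the message refers to, the machine file can be compared with the hosts SGE actually granted. The snippet below is only a sketch: the function name nodes_not_in_allocation is hypothetical, it assumes the first field of each $PE_HOSTFILE line is the host name, and it compares names as plain strings, which may not be how Open MPI itself matches hosts.

```shell
#!/bin/bash
# Hypothetical helper (not part of SGE or Open MPI): print each name in a
# machine file that does not appear verbatim in the first column of an
# SGE-style PE hostfile. Plain string comparison is an assumption.
nodes_not_in_allocation() {
    local machinefile=$1 pe_hostfile=$2
    awk 'FILENAME == ARGV[1] { allocated[$1] = 1; next }   # hosts granted by SGE
         NF && !($1 in allocated) && !seen[$1]++ { print $1 }' \
        "$pe_hostfile" "$machinefile"
}
```

Inside the job script it could be called as: nodes_not_in_allocation test.nodes "$PE_HOSTFILE"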

The ORCA program was compiled with Open MPI, and I used the orte parallel
environment in Rocks SGE:
$ qconf -sp orte
pe_name            orte
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE

The submitted SGE script:
  #!/bin/bash
  # Job submission script:
  # Usage: qsub <this_script>
  #
  #$ -cwd
  #$ -j y
  #$ -o test.sge.o$JOB_ID
  #$ -S /bin/bash
  #$ -N test
  #$ -pe orte 24
  #$ -l h_vmem=3.67G
  #$ -l h_rt=1240:00:00

  # go to work dir
  cd $SGE_O_WORKDIR

  # load the module env for ORCA
  source /usr/share/Modules/init/sh
  module load intel/compiler/2011.7.256
  source /share/apps/mpi/openmpi2.0.2-ifort/bin/mpivars.sh
  export orcapath=/share/apps/orca4.0.0
  export RSH_COMMAND="ssh"

  # create scratch dir on NFS
  tdir=/home/data/$SGE_O_LOGNAME/$JOB_ID
  mkdir -p $tdir

  #cat $PE_HOSTFILE

  PeHostfile2MachineFile()
  {
     cat $1 | while read line; do
        # echo $line
        host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`
        nslots=`echo $line|cut -f2 -d" "`
        i=1
        while [ $i -le $nslots ]; do
           # add here code to map regular hostnames into ATM hostnames
           echo $host
           i=`expr $i + 1`
        done
     done
  }

  PeHostfile2MachineFile $PE_HOSTFILE >> $tdir/test.nodes

  cp ${SGE_O_WORKDIR}/test.inp $tdir

  cd $tdir

  echo "ORCA job start at" `date`

  time $orcapath/orca test.inp > ${SGE_O_WORKDIR}/test.log

  rm ${tdir}/test.inp
  rm ${tdir}/test.*tmp 2>/dev/null
  rm ${tdir}/test.*tmp.* 2>/dev/null
  mv ${tdir}/test.* $SGE_O_WORKDIR

  echo "ORCA job finished at" `date`

  echo "Work Dir is : $SGE_O_WORKDIR"

  rm -rf $tdir
  rm $SGE_O_WORKDIR/test.sge
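
For reference, the conversion done by PeHostfile2MachineFile in the script can be sketched as a standalone function. It assumes each $PE_HOSTFILE line starts with the host name followed by the slot count, and, like the script, it keeps only the part of the host name before the first dot. The function name pe_hostfile_to_machinefile is made up for this illustration.

```shell
#!/bin/bash
# Sketch: expand "host nslots ..." lines into one short host name per slot,
# mirroring what PeHostfile2MachineFile does in the job script.
pe_hostfile_to_machinefile() {
    awk '{
        split($1, parts, ".")                      # drop everything after the first dot
        for (i = 0; i < $2; i++) print parts[1]    # one line per granted slot
    }' "$1"
}
```

In the job script this would be used as: pe_hostfile_to_machinefile "$PE_HOSTFILE" > "$tdir/test.nodes"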


However, the same job runs normally on multiple nodes under Torque.

Can someone help me? Thanks very much!

Best regards!
Yong Wu