[gridengine users] I can't run mpi jobs correctly

mahbube rustaee rustaee at gmail.com
Wed Nov 23 08:07:24 UTC 2011


Sorry to take up your time, Mr. Reuti; I appreciate your kindness.

I compiled openmpi-1.4.2 with the Intel compiler and --with-sge, and it works
correctly from the command line (Open MPI is integrated with SGE).
I modified the PE mpifillamd as follows:

pe_name            mpifillamd
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     FALSE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
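
For reference, the values above can be read back from the live configuration with qconf -sp mpifillamd. The helper below is only a sketch: pe_setting is a hypothetical name, and it assumes the two-column "key value" layout shown above. The control_slaves line may be worth comparing in particular, since (as far as I know) the example PE in the Open MPI FAQ for SGE sets it to TRUE; whether that matters for this setup is an assumption, not something verified here.

```shell
#!/bin/bash
# pe_setting KEY: read PE configuration text on stdin ("key value" per line,
# as printed by qconf -sp) and print the value for KEY.
# pe_setting is a hypothetical helper name, not part of SGE.
pe_setting() {
    awk -v key="$1" '$1 == key { $1 = ""; sub(/^[ \t]+/, ""); print; exit }'
}

# Intended use on a submit host (needs a working SGE installation):
#   qconf -sp mpifillamd | pe_setting control_slaves
```

The function only parses text, so it can also be pointed at a saved copy of the PE definition.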

I compiled my program with the new Open MPI,
and my script is:
#!/bin/bash
#$ -S /bin/bash
#$ -N Det2
#$ -cwd
#$ -j y
#$ -pe mpifillamd 100
. $HOME/.intelbash
. ~/openmpi_intel_1.4.2.sh
which mpirun
echo $LD_LIBRARY_PATH
mpirun -n $NSLOTS  mpi-integ-sge-intel.comp
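
To see which hosts and how many slots the PE actually granted, a job script like the one above could print a summary of $PE_HOSTFILE before calling mpirun. This is a sketch under assumptions: summarize_pe_hostfile is a hypothetical helper, and it assumes the usual SGE layout in which each line starts with a host name followed by its slot count.

```shell
#!/bin/bash
# summarize_pe_hostfile FILE: print "host slots" for each line, then a total.
# Assumes each line looks like: <host> <slots> <queue> <processor-range>
# (hypothetical helper, not part of SGE or Open MPI).
summarize_pe_hostfile() {
    awk 'NF >= 2 { print $1, $2; total += $2 }
         END     { print "total", total + 0 }' "$1"
}

# Inside a job submitted with -pe, one might call:
#   summarize_pe_hostfile "$PE_HOSTFILE"
```

The total should match $NSLOTS. An SGE-aware mpirun is expected to pick up this allocation on its own, so the summary is only a diagnostic aid.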

Output is:

/home/mrustaee/PF/openmpi-1.4.2/intel/bin/mpirun
/home/mrustaee/PF/openmpi-1.4.2/intel/lib:/opt/intel/Compiler/11.1/069/lib/intel64:/opt/intel/Compiler/11.1/069/ipp/em64t/sharedlib:/opt/intel/Compiler/11.1/069/mkl/lib/em64t:/opt/intel/Compiler/11.1/069/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib:/opt/intel/Compiler/11.1/069/lib/intel64:/opt/intel/Compiler/11.1/069/ipp/em64t/sharedlib:/opt/intel/Compiler/11.1/069/mkl/lib/em64t:/opt/intel/Compiler/11.1/069/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib
error: executing task of job 1227 failed: execution daemon on host
"amd-7-4.local" didn't accept task
--------------------------------------------------------------------------
A daemon (pid 31144) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
error: executing task of job 1227 failed: execution daemon on host
"amd-7-3.local" didn't accept task
mpirun: clean termination accomplished

-----------------------------------------
The library paths do not conflict with each other: .intelbash sets the Intel
library path and openmpi_intel_1.4.2.sh sets the Open MPI library path.

--------------------------------------------------------------------------------------
When I qsub a script without the -pe option and run my job with a hostfile, like this:
#!/bin/bash
#$ -S /bin/bash
#$ -N Det2
#$ -cwd
#$ -j y
. $HOME/.intelbash
. ~/openmpi_intel_1.4.2.sh
which mpirun
echo $LD_LIBRARY_PATH
mpirun -n 300 --hostfile machines  mpi-integ-sge-intel.comp

everything is OK. The file machines lists the hosts on which qsub could not
run this program.

What is happening?
I can't track down that error.

Thanks so much.