[gridengine users] execution daemon on host * didn't accept task

Reuti reuti at staff.uni-marburg.de
Wed Nov 16 10:03:40 UTC 2011


Hi,

On 16.11.2011 at 04:29, Vang Le wrote:

> Hello GridUsers, 
> My grid is running and can deliver jobs, but they only run on one node at a time.
> When I run mpirun in a batch script, I get errors like "execution daemon on host <hostname> didn't accept task", as shown at the bottom of this email.

Can you please check whether your Open MPI was built with proper support for SGE:

$ ompi_info | grep grid
                 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.4.3)

A simple `hostname` should work. Did you install this version of Open MPI on all machines? What does your PE definition look like: is "control_slaves TRUE" set?
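
The PE check above can be scripted. The following is only a minimal sketch: it assumes the PE is named test_pe (as in the quoted batch script) and that `qconf -sp` prints one whitespace-separated "attribute value" pair per line. The helper name pe_has_control_slaves is made up for illustration and is not part of SGE.

```shell
#!/bin/bash
# Hypothetical helper (not part of SGE): reads a PE definition, i.e. the
# output of `qconf -sp <pe_name>`, on stdin and succeeds only if the
# control_slaves attribute is set to TRUE.
pe_has_control_slaves() {
    awk '$1 == "control_slaves" { found = (toupper($2) == "TRUE") }
         END { exit !found }'
}

# Query the live configuration only when qconf is available on this host.
# "test_pe" is the PE name taken from the quoted batch script.
if command -v qconf >/dev/null 2>&1; then
    if qconf -sp test_pe | pe_has_control_slaves; then
        echo "test_pe: control_slaves is TRUE"
    else
        echo "test_pe: control_slaves is not TRUE (or the PE does not exist)"
    fi
fi
```

If the check fails, the attribute can be changed with `qconf -mp test_pe`, which opens the PE definition in an editor.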

-- Reuti


> I can run mpirun outside of sge without any problems. 
> I suspect that when mpirun is run inside the SGE batch script, it cannot communicate with the exec nodes successfully.
> 
> 
> My system information:
> 3 servers running Ubuntu Lucid Lynx, with Open MPI recompiled to support gridengine. SGE was installed from the Ubuntu repository, with the correct environment variables set up.
> I also set up passwordless ssh access for the openmpi user account, which is the same account that I use to submit SGE batch jobs.
> 
> 
> Any help is very much appreciated. 
> 
> Vang. 
> 
> 
> 
> 
> ============ERROR================
> error: executing task of job 63 failed: execution daemon on host "node1" didn't accept task
> error: executing task of job 63 failed: execution daemon on host "submithost" didn't accept task
> --------------------------------------------------------------------------
> A daemon (pid 13317) died unexpectedly with status 1 while attempting
> to launch so we are aborting.
> 
> There may be more information reported by the environment (see above).
> 
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> 
> 
> ============CONTENT OF SGE BATCH SUBMIT==============
> 
> #!/bin/bash
> 
> # run at current working directory
> #$ -cwd
> #$ -V
> # Specify the shell for this job
> #$ -S /bin/bash
> #$ -pe test_pe 5
> #$ -P test1
> 
> # Merge the standard output and standard error
> #$ -j y
> 
> # Specify the location of the output messages
> #$ -o messages.txt
> 
> #---------Customization part starts below -------------
> # Customization 
> # Which email address should be notified when this job starts and ends
> # 
> #$ -M <my_gmail_id>@gmail.com
> #$ -m be
> 
> export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH
> 
> mpirun -np $NSLOTS hostname
> mpirun -np $NSLOTS ~/hello
> 