[gridengine users] execution daemon on host * didn't accept task

Reuti reuti at staff.uni-marburg.de
Wed Nov 16 17:13:01 UTC 2011


On 16.11.2011 at 17:25, William Hay wrote:

> On 16 November 2011 13:52, Vang Le <lqvang79 at gmail.com> wrote:
>> Hi William and Reuti,
>> Thank you for your suggestions and your time. They are really helpful. I
>> solved almost all of my problems.
>> 
>> I installed rsh-redone-client and rsh-redone-server, and I also modified my PE so
>> that "control_slaves TRUE" is set. I can run this part now:
>> 
>> mpirun -np $NSLOTS hostname
>> mpirun -np $NSLOTS ~/hello
>> 
>> However, I still cannot start an interactive PE session with qsh or qrsh. Both
>> report:
>> ---------
>> $ qrsh -pe test_pe 5
>> Your "qrsh" request could not be scheduled, try again later.
>> ---------
>> qsh -pe test_pe 5
>> Your job 50 ("INTERACTIVE") has been submitted
>> waiting for interactive job to be scheduled ...
>> 
>> Your "qsh" request could not be scheduled, try again later.
>> ---------
>> 
>> I searched and found a suggestion to edit the /etc/hosts.equiv file to
>> permit rsh and rlogin without a password. However, running "qconf
>> -mconf" on the management host shows this:
>> ----
>> rlogin_daemon                /usr/sbin/sshd -i
>> rlogin_command               /usr/bin/ssh
>> ----
>> 
>> Do I need to change something in the queue and PE to run interactive PE?

Depends. Parallel jobs are always enabled for BATCH; you can only enable or disable INTERACTIVE by setting qtype accordingly, as William outlined. As a result, it's not possible to have an INTERACTIVE-only parallel queue (but you could set up a check in a JSV and allow only -now y).


> Check qtype in the queue_conf is either INTERACTIVE or BATCH

Sorry to correct this: qtype can list both entries at the same time, or be set to NONE to allow only parallel batch jobs.
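
A minimal sketch of how the qtype setting could be inspected from the shell. The helper names `conf_value` and `allows_interactive` are hypothetical, and it assumes the usual `qconf -sq` / `qconf -sp` layout of one attribute name followed by its value per line; `all.q` is an assumed queue name.

```sh
#!/bin/sh
# Print the value of one attribute from `qconf -sq` or `qconf -sp` output
# read on stdin (lines are assumed to look like "qtype    BATCH INTERACTIVE").
conf_value() {
    awk -v key="$1" '$1 == key { $1 = ""; sub(/^[ \t]+/, ""); print; exit }'
}

# Succeed if the queue configuration on stdin lists INTERACTIVE in qtype.
allows_interactive() {
    conf_value qtype | grep -qw INTERACTIVE
}

# Usage on a live cluster (all.q is an assumed queue name):
#   qconf -sq all.q | conf_value qtype
#   qconf -sq all.q | allows_interactive || echo "interactive jobs disabled"
#   qconf -sp test_pe | conf_value control_slaves
```

If INTERACTIVE is missing, something like `qconf -mattr queue qtype "BATCH INTERACTIVE" all.q` should add it without opening an editor; adjust the queue name to your setup.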


NB: Checkpoint jobs behave similarly: once defined and requested, you can submit them even if `qtype NONE` is set.

-- Reuti


> INTERACTIVE if you want to run without -now n
> 
> William
> 
>> 
>> Regards
>> Vang.
>> 
>> On 11/16/11 11:03 AM, Reuti wrote:
>> 
>> Hi,
>> 
>> On 16.11.2011 at 04:29, Vang Le wrote:
>> 
>> Hello GridUsers,
>> My grid is running and can deliver jobs, but they only run on one node at a
>> time.
>> When I tried running with mpirun in a batch script, I got errors like
>> "execution daemon on host <hostname> didn't accept task", as shown at the
>> bottom of this email.
>> 
>> Can you please check whether your Open MPI was properly built with support
>> for SGE:
>> 
>> $ ompi_info | grep grid
>>                 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.4.3)
>> 
>> A simple `hostname` should work. Did you install this version of Open MPI on
>> all machines? What does your PE definition look like: is "control_slaves TRUE"
>> set?
>> 
>> -- Reuti
>> 
>> 
>> I can run mpirun outside of sge without any problems.
>> I suspect that when mpirun is put inside the sge batch script, it cannot
>> communicate with the exec nodes successfully.
>> 
>> 
>> My system information:
>> 3 servers running Ubuntu Lucid Lynx, with openmpi recompiled to support
>> gridengine. SGE was installed from the Ubuntu repository, and the correct
>> environment variables are set up.
>> I also set up passwordless ssh access for the openmpi user account, which is the
>> same account that I use to submit sge batch jobs.
>> 
>> 
>> Any help is very much appreciated.
>> 
>> Vang.
>> 
>> 
>> 
>> 
>> ============ERROR================
>> error: executing task of job 63 failed: execution daemon on host "node1"
>> didn't accept task
>> error: executing task of job 63 failed: execution daemon on host
>> "submithost" didn't accept task
>> --------------------------------------------------------------------------
>> A daemon (pid 13317) died unexpectedly with status 1 while attempting
>> to launch so we are aborting.
>> 
>> There may be more information reported by the environment (see above).
>> 
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> 
>> 
>> ============CONTENT OF SGE BATCH SUBMIT==============
>> 
>> #!/bin/bash
>> 
>> # run at current working directory
>> #$ -cwd
>> #$ -V
>> # Specify the shell for this job
>> #$ -S /bin/bash
>> #$ -pe test_pe 5
>> #$ -P test1
>> 
>> # Merge the standard output and standard error
>> #$ -j y
>> 
>> # Specify the location of the output messages
>> #$ -o messages.txt
>> 
>> #---------Customization part starts below -------------
>> # Customization
>> # Email address to notify when this job begins and ends
>> #
>> #$ -M <my_gmail_id>@gmail.com
>> #$ -m be
>> 
>> export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH
>> 
>> mpirun -np $NSLOTS hostname
>> mpirun -np $NSLOTS ~/hello
>> 
>> 
>> 
>> 
>> 
>> 
> 



