[gridengine users] jobs stuck in transitioning state
reuti at staff.uni-marburg.de
Fri Sep 27 21:32:43 UTC 2019
On 27.09.2019 at 22:21, bergman at merctech.com wrote:
> We're having a problem with submit scripts not being transferred to exec
> nodes and jobs being stuck in the [t]ransitioning state.
Did this issue start out of the blue?
> The issue is present with SoGE 8.1.6 and 8.1.9, under CentOS7.
Are these separate clusters, are you using both versions in one and the same cluster, or did you just try both versions on one cluster?
> We are using classic spooling. On the compute nodes, the spool directory
> exists, is owned by user 'sge' (running the execd), is writeable, and
> has space.
Is the execd running as sge, or was it initially started as root? It must be started as root to be able to switch to any user, but it then switches to the admin user for its own work:
$ ps -e f -o user,ruser,group,rgroup,command
sgeadmin root gridware root /usr/sge/bin/lx24-em64t/sge_execd
root root root root \_ /bin/sh /usr/sge/cluster/tmpspace.sh
sgeadmin root gridware root \_ sge_shepherd-311391 -bg
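
To make the comparison on your nodes easier, here is a minimal sketch that classifies the effective and real user of the execd. The function name check_execd_users is made up for this mail, "sgeadmin" is only the admin user name on my cluster (yours seems to be "sge"), and the commented ps call assumes the procps ps that ships with CentOS 7:

```shell
# Hypothetical helper: classify one "effective user / real user" pair of sge_execd.
# $1 = effective user, $2 = real user, $3 = expected admin user name.
check_execd_users() {
    euser=$1; ruser=$2; admin=$3
    if [ "$ruser" = "root" ] && [ "$euser" = "$admin" ]; then
        echo "ok: started as root, running as $admin"
    elif [ "$ruser" = "root" ]; then
        echo "check: started as root, but effective user is $euser"
    else
        echo "problem: started as $ruser, cannot switch to the job owner"
    fi
}

# On an exec node the pair could be fed in like this (procps ps assumed):
#   ps -C sge_execd -o user=,ruser= | while read -r e r; do
#       check_execd_users "$e" "$r" sgeadmin
#   done

# Example with the values from the ps output above:
check_execd_users sgeadmin root sgeadmin
```

If the real user is not root, the daemon was most likely started directly as the admin user.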
> There is successful communication between the qmaster and execd hosts:
> qping works in both directions
> jobs submitted as binaries (-b y) run correctly
> directives from the master to the execd (for example, to delete jobs) work
> If I read the qmaster debug logs correctly, it looks like the qmaster isn't able to send the submit script to the compute node:
> 1 worker001 debiting 8589934592.000000 of h_vmem on host 2115fmn001.foobar.local for 1 slots
> 2 worker001 debiting 4000000000.000000 of tmpfree on host 2115fmn001.foobar.local for 1 slots
> 3 worker001 debiting 1.000000 of jobs on queue all.q for 1 slots
> 4 worker001 debiting 1.000000 of slots on queue all.q for 1 slots
> 5 worker001 user doesn't match
> 6 worker001 user doesn't match
> 7 worker001 queue doesn't match
> 8 worker001 queue doesn't match
> 9 worker001 user doesn't match
> 10 worker001 user doesn't match
> 11 worker001 spooling job 9899430.1 <null>
> 12 worker001 Making dir "jobs/00/0989/9430/1-4096/1"
> 13 worker001 retval = 0
> 14 worker001 spooling job 9899430.1 <null>
> 15 worker001 Making dir "jobs/00/0989/9430"
> 16 worker001 retval = 0
> 17 worker001 TRIGGER JOB RESEND 9899430/1 in 300 seconds
> 18 worker001 successfully handed off job "9899430" to queue "all.q at 2115fmn001.foobar.local"
> 19 worker001 NO TICKET DELIVERY
> We don't see corresponding log messages on the client.
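
To check by hand whether the job was really spooled, it may help to compute the directory from the job ID. Judging only from the "Making dir" lines in your log, the ID appears to be zero-padded to ten digits and split into groups of 2/4/4. This is an inference from that output, not something I looked up in the source, and the function name job_spool_dir is made up for this mail:

```shell
# Hypothetical helper: derive the relative spool directory for a job ID,
# assuming a zero-padded 10-digit ID split 2/4/4 as the log lines suggest.
job_spool_dir() {
    padded=$(printf '%010d' "$1")                 # 9899430 -> 0009899430
    a=$(printf '%s' "$padded" | cut -c1-2)        # first 2 digits
    b=$(printf '%s' "$padded" | cut -c3-6)        # next 4 digits
    c=$(printf '%s' "$padded" | cut -c7-10)       # last 4 digits
    printf 'jobs/%s/%s/%s\n' "$a" "$b" "$c"
}

# Job ID taken from the log above:
job_spool_dir 9899430
```

For job 9899430 this gives the same path as in your debug output, relative to the spool directory.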
> What mechanism is used by SGE to transfer submit scripts (something
> specific to GDI over the $SGE_EXECD_PORT, ssh, scp, something else)?
It uses its own protocol. No SSH inside the cluster is necessary.
> What are the system-level requirements for successfully sending the
> submit scripts (for example: same UID for sge across the cluster, same
> UID<->username for the user submitting the job across the cluster, etc)?