[gridengine users] jobs stuck in transitioning state

Reuti reuti at staff.uni-marburg.de
Fri Sep 27 21:32:43 UTC 2019


On 27.09.2019 at 22:21, bergman at merctech.com wrote:

> We're having a problem with submit scripts not being transferred to exec
> nodes and jobs being stuck in the [t]ransitioning state.

Did this issue start out of the blue?

> The issue is present with SoGE 8.1.6 and 8.1.9, under CentOS7.

Are these separate clusters, are you using both versions in one and the same cluster, or did you just try both versions in turn on one cluster?

> We are using classic spooling. On the compute nodes, the spool directory
> 	/var/tmp/gridengine/$SGE_VER/default/spool/$HOSTNAME/
> exists, is owned by user 'sge' (running the execd), is writeable, and
> has space.
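
The four conditions listed above (exists, owner, writeable, free space) can be re-checked in one go on an exec node. This is only a sketch: the function name check_spool_dir is made up for illustration, `stat -c` assumes GNU coreutils (as on CentOS 7), and the write probe only tests the user who runs it, so it should be run as the admin user.

```shell
# Hypothetical helper, not part of SGE: sanity-check an execd spool directory.
# $1 = spool directory, $2 = expected owner (e.g. sge)
check_spool_dir() {
    dir=$1
    owner=$2
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    actual=$(stat -c %U "$dir")
    if [ "$actual" != "$owner" ]; then
        echo "owner: $actual (expected $owner)"
        return 1
    fi
    # Attempt a real write instead of trusting the mode bits.
    probe="$dir/.write_probe.$$"
    if ! ( : > "$probe" ) 2>/dev/null; then
        echo "not writable by $(id -un): $dir"
        return 1
    fi
    rm -f "$probe"
    # Column 4 of POSIX-format df output is the available space in KB.
    avail_kb=$(df -P "$dir" | awk 'NR == 2 { print $4 }')
    echo "ok: $dir owner=$actual avail_kb=$avail_kb"
}
```

For example: check_spool_dir "/var/tmp/gridengine/$SGE_VER/default/spool/$HOSTNAME" sge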

Is the execd running as sge, or was it initially started as root? It must be started as root to be able to switch to any job owner, but it then switches its effective user to the admin user:

$ ps -e f -o user,ruser,group,rgroup,command
sgeadmin root     gridware root     /usr/sge/bin/lx24-em64t/sge_execd
root     root     root     root      \_ /bin/sh /usr/sge/cluster/tmpspace.sh
sgeadmin root     gridware root      \_ sge_shepherd-311391 -bg
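
The same check can be scripted. In the ps output above, the "user" column is the effective user and "ruser" the real user. The sketch below classifies such a pair; the function name check_execd_users and the message wording are made up for illustration, and `ps -C` assumes the procps ps found on Linux.

```shell
# Hypothetical helper: interpret the effective/real user pair of sge_execd.
# $1 = effective user (ps "user"), $2 = real user (ps "ruser")
check_execd_users() {
    euser=$1
    ruser=$2
    if [ "$ruser" = "root" ] && [ "$euser" != "root" ]; then
        echo "ok: started as root, running as admin user $euser"
    elif [ "$ruser" = "root" ]; then
        echo "note: started as root and still running as root"
    else
        echo "problem: started as $ruser, cannot switch to job owners"
    fi
}

# Example against the live process (prints nothing if no execd is running):
#   ps -C sge_execd -o user=,ruser= | while read -r euser ruser; do
#       check_execd_users "$euser" "$ruser"
#   done
```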

> There is successful communication between the qmaster and execd hosts:
> 	qping works in both directions
> 	jobs submitted as binaries (-b y) run correctly
> 	directives from the master to the execd (for example, to delete jobs) work
> If I read the qmaster debug logs correctly, it looks like the qmaster isn't able to send the submit script to the compute node:
>     worker001     debiting 8589934592.000000 of h_vmem on host 2115fmn001.foobar.local for 1 slots
>     worker001     debiting 4000000000.000000 of tmpfree on host 2115fmn001.foobar.local for 1 slots
>     worker001     debiting 1.000000 of jobs on queue all.q for 1 slots
>     worker001     debiting 1.000000 of slots on queue all.q for 1 slots
>     worker001     user doesn't match
>     worker001     user doesn't match
>     worker001     queue doesn't match
>     worker001     queue doesn't match
>     worker001     user doesn't match
>     worker001     user doesn't match
>     worker001     spooling job 9899430.1 <null>
>     worker001     Making dir "jobs/00/0989/9430/1-4096/1"
>     worker001     retval = 0
>     worker001     spooling job 9899430.1 <null>
>     worker001     Making dir "jobs/00/0989/9430"
>     worker001     retval = 0
>     worker001     TRIGGER JOB RESEND 9899430/1 in 300 seconds
>     worker001     successfully handed off job "9899430" to queue "all.q@2115fmn001.foobar.local"
>     worker001     NO TICKET DELIVERY
> We don't see corresponding log messages on the client.
> What mechanism is used by SGE to transfer submit scripts (something
> specific to GDI over the $SGE_EXECD_PORT, ssh, scp, something else)?

It uses its own protocol. No SSH inside the cluster is necessary.

> What are the system-level requirements for successfully sending the
> submit scripts (for example: same UID for sge across the cluster, same
> UID<->username for the user submitting the job across the cluster, etc)?
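
Which of these are hard requirements is not settled in this thread. To at least rule out UID drift between the qmaster and an exec host, the account databases of the two machines can be compared. This is a sketch only: the function name uid_mismatches is made up, and it assumes the input files are in passwd(5) format, e.g. the saved output of `getent passwd` from each host.

```shell
# Hypothetical helper: list users present in both files whose UID differs.
# $1, $2 = files in passwd(5) format (name:passwd:uid:gid:...)
# Output: one line per mismatch, "name uid_in_first uid_in_second"
uid_mismatches() {
    awk -F: '
        NR == FNR { uid[$1] = $3; next }          # first file: remember UIDs
        ($1 in uid) && uid[$1] != $3 {            # second file: compare
            print $1, uid[$1], $3
        }
    ' "$1" "$2"
}
```

For example, after saving `getent passwd` on each host to master.passwd and node.passwd: uid_mismatches master.passwd node.passwd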


-- Reuti
