[gridengine users] jobs stuck in transitioning state

bergman at merctech.com
Fri Sep 27 20:21:35 UTC 2019


We're having a problem with submit scripts not being transferred to exec
nodes and jobs being stuck in the [t]ransitioning state.

The issue is present with SoGE (Son of Grid Engine) 8.1.6 and 8.1.9, under CentOS 7.

We are using classic spooling. On the compute nodes, the spool directory
	/var/tmp/gridengine/$SGE_VER/default/spool/$HOSTNAME/
exists, is owned by user 'sge' (running the execd), is writeable, and
has space.
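
For anyone who wants to repeat the spool directory checks on their own nodes, they amount to roughly the following. This is only a sketch: the function name check_spool_dir and the min_free_bytes threshold are made up for illustration, and the script has to be run as the execd user ('sge' here) on the compute node for the writability test to mean anything.

```python
import os
import pwd
import shutil


def check_spool_dir(path, expected_owner="sge", min_free_bytes=1 << 20):
    """Return a list of problems found with the spool directory (empty if none).

    The name, the default owner and the free-space threshold are illustrative
    assumptions, not anything defined by Grid Engine itself.
    """
    if not os.path.isdir(path):
        return [f"{path} does not exist or is not a directory"]

    problems = []

    # Ownership: map the directory's numeric UID back to a user name.
    st = os.stat(path)
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        owner = str(st.st_uid)  # UID has no passwd entry on this host
    if owner != expected_owner:
        problems.append(f"{path} is owned by {owner}, expected {expected_owner}")

    # Writability is tested for the user running this script, so run it
    # as the execd user to get a meaningful answer.
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append(f"{path} is not writable by the current user")

    # Free space on the filesystem that holds the spool directory.
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        problems.append(f"{path} has only {free} bytes free")

    return problems
```

Calling check_spool_dir() with the spool path above (after expanding $SGE_VER and $HOSTNAME by hand) should return an empty list on a healthy node.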


There is successful communication between the qmaster and execd hosts:
		
	qping works in both directions

	jobs submitted as binaries (-b y) run correctly

	directives from the master to the execd (for example, to delete jobs) work

If I read the qmaster debug logs correctly, it looks like the qmaster isn't able to send the submit script to the compute node:

    worker001     debiting 8589934592.000000 of h_vmem on host 2115fmn001.foobar.local for 1 slots
    worker001     debiting 4000000000.000000 of tmpfree on host 2115fmn001.foobar.local for 1 slots
    worker001     debiting 1.000000 of jobs on queue all.q for 1 slots
    worker001     debiting 1.000000 of slots on queue all.q for 1 slots
    worker001     user doesn't match
    worker001     user doesn't match
    worker001     queue doesn't match
    worker001     queue doesn't match
    worker001     user doesn't match
    worker001     user doesn't match
    worker001     spooling job 9899430.1 <null>
    worker001     Making dir "jobs/00/0989/9430/1-4096/1"
    worker001     retval = 0
    worker001     spooling job 9899430.1 <null>
    worker001     Making dir "jobs/00/0989/9430"
    worker001     retval = 0
    worker001     TRIGGER JOB RESEND 9899430/1 in 300 seconds
    worker001     successfully handed off job "9899430" to queue "all.q@2115fmn001.foobar.local"
    worker001     NO TICKET DELIVERY
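
The "TRIGGER JOB RESEND" line appears to be the qmaster arming a timer to resend the job if the execd does not acknowledge it, so a job that keeps showing up in such lines is probably never being accepted. A small sketch for pulling those lines out of a saved debug log follows; the regular expression is based only on the single message shown above, and the helper names are hypothetical.

```python
import re
from collections import Counter

# Pattern taken from the one log line shown above:
#   TRIGGER JOB RESEND <job id>/<task id> in <n> seconds
RESEND_RE = re.compile(r"TRIGGER JOB RESEND (\d+)/(\d+) in (\d+) seconds")


def find_resends(lines):
    """Yield (job_id, task_id, delay_seconds) for each resend trigger line."""
    for line in lines:
        match = RESEND_RE.search(line)
        if match:
            yield int(match.group(1)), int(match.group(2)), int(match.group(3))


def resend_counts(lines):
    """Count how often each (job_id, task_id) pair had a resend scheduled."""
    return Counter((job, task) for job, task, _delay in find_resends(lines))
```

Feeding it the lines of the debug output (for example from an open file) gives one tuple per trigger; a count that keeps growing for the same job suggests the execd never confirms receipt.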


We don't see any corresponding log messages on the client (the exec node).


What mechanism is used by SGE to transfer submit scripts (something
specific to GDI over the $SGE_EXECD_PORT, ssh, scp, something else)?

What are the system-level requirements for successfully sending the
submit scripts (for example: same UID for sge across the cluster, same
UID<->username for the user submitting the job across the cluster, etc)?
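
Whether or not matching UIDs turn out to be a hard requirement, the mapping is cheap to compare. The sketch below assumes the output of "getent passwd sge <submitting user>" has been collected from each host into a string by some means; the function names are made up for illustration.

```python
def parse_passwd(text):
    """Map username -> UID from passwd-format lines (name:passwd:uid:gid:...)."""
    uids = {}
    for line in text.splitlines():
        fields = line.strip().split(":")
        if len(fields) >= 3 and fields[2].isdigit():
            uids[fields[0]] = int(fields[2])
    return uids


def uid_mismatches(per_host):
    """Report users whose UID differs between hosts or who are missing on a host.

    per_host maps a host name to the passwd-format text gathered on that host.
    Returns {user: {host: uid}} for every inconsistent user.
    """
    seen = {}
    for host, text in per_host.items():
        for user, uid in parse_passwd(text).items():
            seen.setdefault(user, {})[host] = uid
    return {
        user: hosts
        for user, hosts in seen.items()
        if len(set(hosts.values())) > 1 or len(hosts) < len(per_host)
    }
```

An empty result means every listed account resolves to the same UID on the qmaster and on the exec nodes that were checked.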

Any troubleshooting suggestions?

Thanks,

Mark


