[gridengine users] Subordinate queue suspended jobs not restarting

Andrew Joplin ajoplin at arlut.utexas.edu
Thu Feb 27 23:28:54 UTC 2014


New member here with a couple of questions.  They're unrelated, so I'll 
make separate posts.

First off, we're running Grid Engine version OGS/GE 2011.11.  I recently 
finished setting up a hierarchy of three queues - high, medium, and low 
priority.  Medium is subordinate to high, and low to medium.  The queues 
span multiple hosts, but are all configured identically except for the 
subordination (and a complex that I use to specify which queue to get 
into).

For the most part, this works great - I can submit a large number of 
long jobs to the low priority queue, and they get suspended whenever 
someone else uses the medium priority queue.  But the first problem I'm 
running into is that occasionally, the suspended jobs don't seem to be 
restarted.  According to qstat, they have been (status "r"), but when I 
check the corresponding process on the execute host, I see a process 
status "T", as if the SIGCONT signal was never sent.  I can manually 
send a SIGCONT to the job, and it finishes processing, but otherwise it 
does nothing until I notice it (usually the next day).  Other times a job 
will show a status "r" in qstat, but I can't even find the process on 
the host it's supposed to be on.
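
For reference, the manual check described above can be scripted.  This is 
only a rough sketch, assuming a Linux execute host with /proc mounted; the 
function names (parse_state, stopped_pids, resume) are made up for 
illustration and are not part of Grid Engine.  It lists processes in state 
"T" and leaves the decision to send SIGCONT to the caller.

```python
import os
import signal


def parse_state(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')', so split on the LAST ')' before reading the state field.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def stopped_pids(uid=None, proc_root="/proc"):
    """Yield PIDs in state 'T' (stopped), optionally filtered by owner UID."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        path = os.path.join(proc_root, entry)
        try:
            if uid is not None and os.stat(path).st_uid != uid:
                continue
            with open(os.path.join(path, "stat")) as fh:
                state = parse_state(fh.read())
        except (FileNotFoundError, ProcessLookupError, PermissionError):
            continue  # process exited meanwhile, or is not readable
        if state == "T":
            yield int(entry)


def resume(pid):
    """Send SIGCONT to one process (hypothetical helper; use with care)."""
    os.kill(pid, signal.SIGCONT)


if __name__ == "__main__":
    # Only report; resuming is left as an explicit, manual decision.
    for pid in stopped_pids(uid=os.getuid()):
        print("stopped:", pid)
```

Run as the job owner on the execute host, this prints the PIDs that are 
still stopped.  Note that "T" also covers processes stopped for other 
reasons (for example Ctrl-Z in a shell), so the list should be compared 
against the job's actual PIDs before calling resume().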

Has anyone seen this behavior before?  I've tried recreating the 
problem, but I can't seem to reliably reproduce it.  It seems to just 
happen "sometimes" when one of my long jobs gets suspended.

Thanks!

-- 
Andrew Joplin

