[gridengine users] Subordinate queue suspended jobs not restarting
ajoplin at arlut.utexas.edu
Thu Feb 27 23:28:54 UTC 2014
New member here with a couple questions - they're unrelated, so I'll
make separate posts.
First off, we're running Grid Engine version OGS/GE 2011.11. I recently
finished setting up a hierarchy of three queues - high, medium, and low
priority. Medium is subordinate to high, and low to medium. The queues
span multiple hosts, but are all configured identically except for the
subordination (and a complex that I use to specify which queue to get
For the most part, this works great - I can submit a large number of
long jobs to the low priority queue, and they get suspended whenever
someone else uses the medium priority queue. But the first problem I'm
running into is that occasionally, the suspended jobs don't seem to be
restarted. According to qstat, they have been (status "r"), but when I
check the corresponding process on the execute host, I see a process
status "T", as if the SIGCONT signal was never sent. I can manually
send a SIGCONT to the job, and it finishes processing, but otherwise it
does nothing until I notice it (usually the next day). Other times a job
will show a status "r" in qstat, but I can't even find the process on
the host it's supposed to be on.
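
For anyone who wants to check an execute host for the same symptom, here
is a minimal Python sketch of the manual check described above. It is
only an illustration, not part of Grid Engine: the function names
find_stopped and resume are made up for this example, it assumes a Linux
"ps" that accepts "-eo pid,stat,args", and it cannot tell which stopped
processes actually belong to grid engine jobs, so the list has to be
compared against qstat by hand.

```python
import os
import signal
import subprocess


def find_stopped(ps_output):
    """Return (pid, stat, command) for every process whose state starts with 'T'.

    Expects text in the layout produced by `ps -eo pid,stat,args`
    (assumed here): one header row, then one process per line.
    """
    stopped = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        # Skip the header row and any malformed or empty lines.
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        pid, stat, command = fields
        if stat.startswith("T"):  # 'T' = stopped by a job control signal
            stopped.append((int(pid), stat, command.strip()))
    return stopped


def resume(pid):
    """Send SIGCONT to one process, as in the manual workaround."""
    os.kill(pid, signal.SIGCONT)


if __name__ == "__main__":
    out = subprocess.run(
        ["ps", "-eo", "pid,stat,args"],
        capture_output=True, text=True, check=True,
    ).stdout
    for pid, stat, command in find_stopped(out):
        print(pid, stat, command)
```

The script only lists stopped processes. Calling resume() is left as a
deliberate manual step, since some processes in state "T" may be
legitimately suspended by a higher priority queue.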
Has anyone seen this behavior before? I've tried recreating the
problem, but I can't seem to reliably reproduce it. It seems to just
happen "sometimes" when one of my long jobs gets suspended.