[gridengine users] Clarification for archives: message: ... reports running job ... that was not supposed to be there - killing
stuartb at 4gh.net
Fri Aug 12 15:52:56 UTC 2011
On Thu, 11 Aug 2011 at 10:31 -0000, Reuti wrote:
> I think the message in the subject happens when there is something
> in the spool directory of the node like
> "$SGE_ROOT/default/spool/node01/jobs/00/0000/515" while there is
> nothing in "active_jobs" any longer. So it can't kill anything.
> Clearing the node's "jobs" directory may resolve it.
Just to be clear (for the archives and future users), the message in
the subject of this thread occurs when the following are set in
qmaster_params:

    ENABLE_RESCHEDULE_KILL=true \
Unrelated jobs get incorrectly killed on many/most/all other nodes
when a single node hits the 15 minute reschedule_unknown time limit.
The node may have been powered off or may have locked up for other
reasons.
My "solution" was to just turn these settings back off, and this is
probably the simplest fix for anyone else seeing this problem.
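
For anyone wanting to check whether a cluster is exposed to this, here
is a minimal sketch that scans a saved configuration dump (for example
the text printed by "qconf -sconf") for the flag. It assumes the dump
uses "key value" lines, backslash line continuations, and a comma- or
space-separated qmaster_params list. The helper names and the
SOME_OTHER_PARAM entry are hypothetical and only there for illustration.

```python
import re

def parse_sge_conf(text):
    """Parse 'key value' lines, joining backslash-continued lines first."""
    joined = re.sub(r"\\\s*\n\s*", " ", text)
    conf = {}
    for line in joined.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf

def qmaster_flags(conf):
    """Split qmaster_params into a {name: value} dict ('none' means empty)."""
    raw = conf.get("qmaster_params", "none")
    flags = {}
    for item in re.split(r"[,\s]+", raw):
        if not item or item.lower() == "none":
            continue
        name, _, value = item.partition("=")
        flags[name] = value
    return flags

def reschedule_kill_enabled(text):
    """True if ENABLE_RESCHEDULE_KILL is switched on in the dump.

    Assumption: 'true' or '1' counts as enabled.
    """
    value = qmaster_flags(parse_sge_conf(text)).get("ENABLE_RESCHEDULE_KILL", "")
    return value.lower() in ("true", "1")

# Illustrative dump; SOME_OTHER_PARAM is a made-up placeholder.
sample = """\
reschedule_unknown           00:15:00
qmaster_params               ENABLE_RESCHEDULE_KILL=true, \\
                             SOME_OTHER_PARAM=1
"""

if __name__ == "__main__":
    print(reschedule_kill_enabled(sample))
```

If the check reports the flag as enabled, removing it from
qmaster_params (for example via "qconf -mconf") is the change described
above.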
I've never been lost; I was once bewildered for three days, but never lost!
-- Daniel Boone