[gridengine users] Partial Solution: message: ... reports running job ... that was not supposed to be there - killing
stuartb at 4gh.net
Thu Aug 4 20:02:57 UTC 2011
For the mailing list archives: I have more information about this
problem and a workaround which seems to be helping.
I'm seeing an issue where SGE appears to kill all running jobs, logging
the following in the qmaster messages file:
07/09/2011 02:14:07|worker|betsy-qmaster|E|execd@bc098.fda.gov reports running job (16648.32/master) in queue "green@bc098.fda.gov" that was not supposed to be there - killing
All jobs are killed on all nodes in the cluster. This occurs about 15
minutes after a node dies.
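For anyone who wants to see how widespread the kills were, here is a rough sketch of a script that pulls the host, job, and queue out of messages lines like the one quoted above. It assumes the messages file uses the pipe-separated layout seen in that line (timestamp, thread, qmaster host, level, message) and that the parenthesised part is job ID, task ID, and PE task. The names parse_kill_line and kills_per_host are my own, not part of SGE.

```python
import re
from collections import Counter

# Matches the "not supposed to be there" message. Accepts "@" or the
# list archive's " at " between "execd" and the host name.
KILL_RE = re.compile(
    r'execd(?:@| at )(?P<host>\S+) reports running job '
    r'\((?P<job>\d+)\.(?P<task>\d+)/(?P<pe_task>[^)]*)\) '
    r'in queue "(?P<queue>[^"]+)" that was not supposed to be there'
)


def parse_kill_line(line):
    """Return a dict of fields for one messages line, or None if no match."""
    # Assumed layout: timestamp|thread|qmaster host|level|message
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, _thread, qmaster, level, message = parts
    match = KILL_RE.search(message)
    if match is None:
        return None
    return {
        "timestamp": timestamp,
        "qmaster": qmaster,
        "level": level,
        "host": match.group("host"),
        "job": int(match.group("job")),
        "task": int(match.group("task")),
        "queue": match.group("queue"),
    }


def kills_per_host(lines):
    """Count matching kill messages per execution host."""
    counts = Counter()
    for line in lines:
        record = parse_kill_line(line)
        if record is not None:
            counts[record["host"]] += 1
    return counts


# The line quoted in this post, as it appears in the archive.
SAMPLE = ('07/09/2011 02:14:07|worker|betsy-qmaster|E|execd at bc098.fda.gov '
          'reports running job (16648.32/master) in queue '
          '"green at bc098.fda.gov" that was not supposed to be there - killing')

if __name__ == "__main__":
    print(parse_kill_line(SAMPLE))
```

To scan a whole file, pass an open file object to kills_per_host; the path to the qmaster messages file depends on the local installation.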
I have (qconf -sconf) settings:
qmaster_params ENABLE_RESCHEDULE_KILL=true \
Running SUN SGE 6.2u5.
Compute nodes are diskless and do not mount a shared sge_root.
My partial solution was to restore reschedule_unknown and
qmaster_params to their default values:
This seems to have solved my immediate problem. I changed both
variables and didn't attempt to see which specific setting was causing
the problem.
What remains is still the original problem which caused me to set
these variables in the first place.
If a node dies or is rebooted, SGE does not do anything about hung jobs
when the node comes back online. The jobs continue to appear in
the queue as if they were running.
This may be related to my using diskless nodes where the local spool
directory is cleared on reboot. I will be looking into putting the
execd spool files on a shared directory in the future which may
address this problem.