[gridengine users] Partial Solution: message: ... reports running job ... that was not supposed to be there - killing

Stuart Barkley stuartb at 4gh.net
Thu Aug 4 20:02:57 UTC 2011


For the mailing list archives:  I have more information about this
problem and a workaround which seems to be helping.

To summarize:

I'm seeing an issue where SGE appears to kill all running jobs,
logging the following in the qmaster messages file:

  07/09/2011 02:14:07|worker|betsy-qmaster|E|execd at bc098.fda.gov reports running job (16648.32/master) in queue "green at bc098.fda.gov" that was not supposed to be there - killing

All jobs are killed on all nodes in the cluster.  This occurs about 15
minutes after a node dies.
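
For anyone wanting to check their own qmaster messages file for the
same symptom, here is a minimal sketch in Python.  It is not part of
SGE.  The field layout is inferred only from the single log line quoted
above, and the group names (job, task, pe_task) are my assumed reading
of "(16648.32/master)".  The pattern accepts both "execd@host" and the
"execd at host" form that appears in this archived copy.

```python
import re

# Pattern inferred from the one log line quoted in this post; other SGE
# versions may format the message differently (assumption).
KILL_LINE = re.compile(
    r'^(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2})\|'
    r'[^|]*\|[^|]*\|E\|'
    r'execd(?:@| at )(?P<host>\S+) reports running job '
    r'\((?P<job>\d+)\.(?P<task>\d+)/(?P<pe_task>[^)]+)\)'
    r'.*that was not supposed to be there - killing'
)

def find_unexpected_job_kills(lines):
    """Return one dict per matching log line (date, time, host, job, ...)."""
    hits = []
    for line in lines:
        match = KILL_LINE.match(line.strip())
        if match:
            hits.append(match.groupdict())
    return hits
```

Passing an open messages file (or any iterable of lines) to
find_unexpected_job_kills() gives a list that can be grouped by host or
time to see whether the kills cluster around a node failure.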

My relevant settings (from qconf -sconf) are:
  load_report_time             00:00:40
  max_unheard                  00:05:00
  reschedule_unknown           00:15:00
  qmaster_params               ENABLE_RESCHEDULE_KILL=true \
                               ENABLE_RESCHEDULE_SLAVE=true

Other notes:
  Running Sun SGE 6.2u5.
  Compute nodes are diskless and do not mount a shared sge_root.

My partial solution was to restore reschedule_unknown and
qmaster_params to their default values:

  reschedule_unknown           00:00:00
  qmaster_params               none

This seems to have solved my immediate problem.  I changed both
variables and didn't attempt to see which specific setting was causing
the problem.
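
To confirm which of these two settings differ from the defaults on a
given cluster, a small sketch like the following could be run over
saved "qconf -sconf" output.  It is an illustration only.  The function
names are hypothetical, the defaults are the two values quoted above,
and the parser assumes the "name value" layout with backslash
continuation lines shown in this post.

```python
# Defaults for the two settings, as quoted in this post.
DEFAULTS = {"reschedule_unknown": "00:00:00", "qmaster_params": "none"}

def parse_sconf(text):
    """Parse 'qconf -sconf' style text into a dict of name -> value."""
    settings = {}
    pending = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.endswith("\\"):
            # Continuation: hold this piece and join it with the next line.
            pending += line[:-1].strip() + " "
            continue
        line = pending + line
        pending = ""
        name, _, value = line.partition(" ")
        settings[name] = value.strip()
    return settings

def non_default(settings):
    """Return the watched settings whose values differ from DEFAULTS."""
    return {
        name: settings[name]
        for name, default in DEFAULTS.items()
        if name in settings and settings[name] != default
    }
```

Running non_default(parse_sconf(text)) on the original configuration
would report both reschedule_unknown and qmaster_params; after the
change described above it should return an empty dict.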

Remaining issue:

What remains is the original problem that led me to set these
variables in the first place.

If a node dies or is rebooted, SGE does not do anything about hung jobs
when the node comes back online.  The jobs continue to appear in
the queue as if they were running.

This may be related to my use of diskless nodes, where the local spool
directory is cleared on reboot.  In the future I will look into putting
the execd spool files on a shared directory, which may address this
problem.

Stuart Barkley


