[gridengine users] SGE-6.2u5: Sudden death of *all* cluster jobs.
Mark Dixon
m.c.dixon at leeds.ac.uk
Tue Mar 15 15:17:25 UTC 2011
On Tue, 15 Mar 2011, Erik Soyez wrote:
> Mark, thanks a lot for your reply!
>
> Do you have any idea, under which circumstances that happens or what
> configuration details could be responsible? Did you use tight mpi
> integration (the problem has never occurred before with loose mpi
> integration)? Although (re-soft-)starting the execds helped, it could
> also be a qmaster problem, because it hit the entire cluster within a
> few hours. Or maybe each execd had just run the 200th job after some
> time (which means that it will happen again after the next 200 jobs
> on each node). I might experiment with smaller gid ranges and see
> if it happens any sooner.
>
> Erik Soyez.
As I said, I've been meaning to look at it more closely before making a
proper bug report.
On our system, we're using tight integration. We're also making users
specify an h_rt value. Once the h_rt value has expired, regardless of
whether the job has completed or not, the log on the relevant execd starts
logging messages like:
failed to deliver signal 20 to job 1460921.1 task 40.c1s3b11n1 for KILL (shepherd with pid 3863): No such file or directory
(Note the "40.c1s3b11n1". This is a slave task of a tightly-integrated
parallel job.)
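For anyone wanting to see how many of these messages have piled up on a node, here is a rough Python sketch that picks the fields out of lines shaped like the one above. The regular expression is modelled on that single quoted line only, so treating it as the general message format is an assumption; the function names are made up for illustration and are not part of SGE.

```python
import re
from collections import Counter

# Pattern modelled on the one execd log line quoted above (assumption:
# other SGE versions or message variants may be worded differently).
FAILED_SIGNAL = re.compile(
    r"failed to deliver signal (?P<signal>\d+) to job (?P<job>\d+\.\d+)"
    r"(?: task (?P<task>\S+))? for (?P<action>\w+)"
    r" \(shepherd with pid (?P<pid>\d+)\)"
)


def parse_failed_signal(line):
    """Return the fields of a failed-delivery message, or None if no match."""
    match = FAILED_SIGNAL.search(line)
    if match is None:
        return None
    record = match.groupdict()
    record["signal"] = int(record["signal"])
    record["pid"] = int(record["pid"])
    return record


def count_stale_shepherds(lines):
    """Count failed-delivery messages per (job, task, shepherd pid)."""
    counts = Counter()
    for line in lines:
        record = parse_failed_signal(line)
        if record is not None:
            counts[(record["job"], record["task"], record["pid"])] += 1
    return counts
```

Feeding it the lines of an execd message file (wherever that lives on your installation) gives a per-task tally, which makes it easier to see how quickly the messages mount up between restarts.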
As jobs continue to execute on the system, these messages mount up. GID
starvation is only part of it: you also start playing Russian roulette
with those ex-shepherd PIDs that the execd keeps on trying to kill. This
means that simply increasing the GID range isn't a good answer.
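To make the PID risk concrete: once a shepherd has exited, the kernel is free to hand its PID to some unrelated process. A minimal sketch of how one might check whether a stale PID is currently in use on a POSIX system follows. It uses the standard "signal 0" probe, which delivers nothing; the function name is hypothetical, and this is only a diagnostic aid, not something SGE itself does as far as I know.

```python
import os


def pid_in_use(pid):
    """Report whether some process currently owns this PID (POSIX only).

    Signal 0 performs the existence and permission checks without
    actually delivering a signal to the target process.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        # No process has this PID right now.
        return False
    except PermissionError:
        # A process exists but belongs to another user.
        return True
    return True
```

If this returns True for a shepherd PID that the execd is still trying to signal, the PID has most likely been recycled, and whatever now owns it is the process in the firing line.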
There seems to be a more severe version of the problem that happens as our
execds get close to their GID limits: a job ends and the shepherd creates
the usage file in the client spool, but doesn't send it to the qmaster. You
end up with an unkillable job reported in qstat. Again, the kludge is to
(soft) restart the execd, which notices the usage file and cleans up the
job and accounting data.
To keep a lid on these problems, we (soft) restart the execds on all the
compute nodes a couple of times a week.
Mark
--
-----------------------------------------------------------------
Mark Dixon Email : m.c.dixon at leeds.ac.uk
HPC/Grid Systems Support Tel (int): 35429
Information Systems Services Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------