[gridengine users] SGE-6.2u5: Sudden death of *all* cluster jobs.

Mark Dixon m.c.dixon at leeds.ac.uk
Tue Mar 15 15:17:25 UTC 2011


On Tue, 15 Mar 2011, Erik Soyez wrote:

> Mark, thanks a lot for your reply!
>
> Do you have any idea, under which circumstances that happens or what
> configuration details could be responsible?  Did you use tight mpi
> integration (the problem has never occurred before with loose mpi
> integration)?  Although (re-soft-)starting the execds helped, it could
> also be a qmaster problem, because it hit the entire cluster within a
> few hours.  Or maybe each execd had just run the 200th job after some
> time (which means that it will happen again after the next 200 jobs
> on each node).  I might experiment with smaller gid ranges and see
> if it happens any sooner.
>
> Erik Soyez.

As I said, I've been meaning to look at it more closely before making a 
proper bug report.

On our system, we're using tight integration. We're also making users 
specify an h_rt value. Once the h_rt value has expired, regardless of 
whether the job has completed or not, the log on the relevant execd starts 
logging messages like:

failed to deliver signal 20 to job 1460921.1 task 40.c1s3b11n1 for KILL (shepherd with pid 3863): No such file or directory

(Note the "40.c1s3b11n1". This is a slave task of a tightly-integrated 
parallel job.)
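
To get a feel for how fast these messages are mounting up on a node, 
something like the following could be run over the execd messages file. 
This is only a rough sketch: the regular expression is modelled on the one 
message quoted above (other versions may word it differently), and the 
names parse_failed_signal and count_stale_shepherds are made up for 
illustration.

```python
import re
from collections import Counter

# Pattern modelled on the single log message quoted above (an assumption:
# other Grid Engine versions may format this message differently).
FAILED_SIGNAL = re.compile(
    r"failed to deliver signal (?P<signal>\d+) "
    r"to job (?P<job>\d+)\.(?P<ja_task>\d+) "
    r"task (?P<pe_task>\S+) for (?P<action>\S+) "
    r"\(shepherd with pid (?P<pid>\d+)\)"
)


def parse_failed_signal(line):
    """Return the fields of a 'failed to deliver signal' line, or None."""
    match = FAILED_SIGNAL.search(line)
    return match.groupdict() if match else None


def count_stale_shepherds(lines):
    """Count failed deliveries per (job, slave task, shepherd pid)."""
    counts = Counter()
    for line in lines:
        record = parse_failed_signal(line)
        if record:
            counts[(record["job"], record["pe_task"], record["pid"])] += 1
    return counts
```

Feeding it the lines of a node's messages file gives one counter entry per 
ex-shepherd that the execd is still trying to signal.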

As jobs continue to execute on the system, these messages mount up. GID 
starvation is only part of it: you also start playing Russian roulette 
with those ex-shepherd PIDs that the execd keeps on trying to kill. This 
means that simply increasing the GID range isn't a good answer.
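
To illustrate the PID reuse risk, here is a minimal sketch of a check for 
whether a PID taken from one of those messages still belongs to a shepherd. 
It assumes a Linux /proc filesystem and that shepherd process names start 
with "sge_shepherd"; both are assumptions about the local setup, and 
pid_is_shepherd is a hypothetical helper, not part of Grid Engine.

```python
import os


def pid_is_shepherd(pid, proc_root="/proc", name_prefix="sge_shepherd"):
    """Return True if `pid` currently names a process whose command name
    starts with `name_prefix`.

    Assumes Linux, where /proc/<pid>/comm holds the command name, and
    assumes the shepherd's process name starts with "sge_shepherd".
    """
    comm_path = os.path.join(proc_root, str(pid), "comm")
    try:
        with open(comm_path) as handle:
            return handle.read().strip().startswith(name_prefix)
    except OSError:
        # No such process (or not readable): certainly not a live shepherd.
        return False
```

If this returns False for a PID that the execd is still signalling, then 
either the signal has nowhere to go or the PID has been recycled by an 
unrelated process, which is the Russian roulette case. The check is itself 
racy (the PID can be recycled between the check and any signal), so it is 
a diagnostic aid rather than a fix.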

There seems to be a more severe version of the problem that happens as our 
execds get close to their GID limits, where a job ends, the shepherd 
creates the usage file in the client spool, but doesn't send it to the 
qmaster. You end up with an unkillable job reported in qstat. Again, the 
kludge is to (soft) restart the execd, which notices the usage file and 
cleans up the job and accounting data.

To keep a lid on these problems, we (soft) restart the execds on all the 
compute nodes a couple of times a week.

Mark
-- 
-----------------------------------------------------------------
Mark Dixon                       Email    : m.c.dixon at leeds.ac.uk
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------

