[gridengine users] sgemaster crash

William Hay w.hay at ucl.ac.uk
Fri Feb 5 15:52:47 UTC 2016


On Fri, Feb 05, 2016 at 03:02:52PM +0000, Manfred Selz wrote:
>    Hi,
> 
>    this week I have observed the (6.2u5) sgemaster crashing several times on
>    one of our sites.
> 
>    The last message in the "messages" file was always like this:
> 
>    02/05/2016 14:37:30|worker|mnsrvgems-02v|C|Removing element from other list !!!
> 
>    Automatic migration to the alternate master hosts (as defined in the shadow
>    host list) also failed, with the new sge_qmaster also crashing (after one
>    minute or less).
> 
>    Only after several attempts was I able to start the master again, but not
>    without having some queues damaged (jobs being lost).
> 
>    This has never happened before since I took over the SGE admin role in our
>    company more than four years ago, and the messages file does not provide
>    an obvious reason. Sometimes I see a line like this before crashing:
> 
>    02/05/2016 14:37:12|  main|mnsrvgems-02v|W|removing reference to no longer existing job 5335536 of user ...
Is the job ID the same each time?  In my experience the most common cause of
qmaster crashes is a corrupted job spool.  The normal procedure is to stop the
qmaster, manually delete the job from the spool (this applies to the traditional
spool), and then restart the qmaster.
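
A minimal sketch for checking whether the same job ID keeps turning up,
assuming the warning lines in the messages file look like the one quoted
above (in the file itself each entry should be a single line, not wrapped as
in the mail).  The function name and the script invocation are made up for
illustration; this only reads the log and does not touch the spool:

```python
import re
import sys
from collections import Counter

# Matches the warning quoted above. The field layout
# (timestamp|thread|host|level|text) is inferred from the two sample
# lines in this thread, so treat it as an assumption.
JOB_WARNING = re.compile(
    r"\|W\|removing reference to no longer existing job (\d+)"
)


def count_missing_job_warnings(lines):
    """Return a Counter mapping job ID -> number of matching warnings."""
    counts = Counter()
    for line in lines:
        match = JOB_WARNING.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python count_job_warnings.py <path to messages file>
    with open(sys.argv[1], errors="replace") as fh:
        for job_id, seen in count_missing_job_warnings(fh).most_common():
            print(job_id, seen)
```

If one job ID dominates the output, that job is the first candidate to
remove from the spool while the qmaster is stopped.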


>    If anybody has a good idea what I could look into, I'd appreciate this a
>    lot.
> 
>    Is there an efficient way to trace (strace?) the master process?
You could enable the built-in debugging (see man sge_dl).

William



