[gridengine users] sgemaster crash

Manfred Selz Manfred.Selz at diasemi.com
Wed Feb 10 07:33:34 UTC 2016


Hi William, Alex,

thank you very much for your replies and suggestions in Vol. 62, Issue 2.
Following your suggestion, I deleted the potentially offending jobs from SGE's internal job list (in fact the entire job spool, since most jobs were gone anyway). Since then (Monday morning), the SGE cluster has been stable again.

I will continue to observe and follow up if more incidents come up.

Regards,
Manfred


Message: 2
Date: Fri, 5 Feb 2016 15:52:47 +0000
From: William Hay <w.hay at ucl.ac.uk>
To: <users at gridengine.org>
Subject: Re: [gridengine users] sgemaster crash
Message-ID: <20160205155247.GA7017 at hylic.rits-isd.ucl.ac.uk>
Content-Type: text/plain; charset="us-ascii"

On Fri, Feb 05, 2016 at 03:02:52PM +0000, Manfred Selz wrote:
>    Hi,
>
>
>
>    this week I have observed the (6.2u5) sgemaster crashing several times on
>    one of our sites.
>
>    The last message in the "messages" file was always like this:
>
>
>
>    02/05/2016 14:37:30|worker|mnsrvgems-02v|C|Removing element from other
>    list !!!
>
>
>
>    Automatic migration to the alternate master hosts (as defined in the shadow
>    host list) also failed, with the new sge_qmaster also crashing (after one
>    minute or less).
>
>    Only after several attempts was I able to start the master again, but not
>    without having some queues damaged (jobs being lost).
>
>
>
>    This has never happened before since I took over the SGE admin role in our
>    company more than four years ago, and the messages file does not provide
>    an obvious reason. Sometimes I see a line like this before crashing:
>
>
>
>    02/05/2016 14:37:12|  main|mnsrvgems-02v|W|removing reference to no longer
>    existing job 5335536 of user ...
Is the jobid consistent?  The most common cause of qmaster crashes in my experience is a corrupted job spool.  Normal procedure is to stop the qmaster and manually delete the job from the spool (traditional spool) before restarting.
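
One way to check whether the job ID is consistent is to tally the warnings in the qmaster "messages" file. The following is only a minimal sketch: it assumes each warning sits on a single line in the file (the wrapping in the quoted text above comes from the mail client) and that the wording matches the line quoted in this thread. The function name count_job_references is made up for illustration, and the path to the messages file is site-specific and passed on the command line.

```python
import re
import sys
from collections import Counter

# Pattern taken from the warning quoted in this thread, e.g.
#   02/05/2016 14:37:12|  main|host|W|removing reference to no longer existing job 5335536 of user ...
# Assumption: the wording is the same in your messages file.
JOB_REF = re.compile(r"removing reference to no longer existing job (\d+)")


def count_job_references(lines):
    """Count how often each job ID appears in 'removing reference' warnings."""
    counts = Counter()
    for line in lines:
        match = JOB_REF.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python count_job_refs.py /path/to/qmaster/messages
    with open(sys.argv[1], errors="replace") as fh:
        for job_id, hits in count_job_references(fh).most_common():
            print(job_id, hits)
```

If one job ID dominates the output, that job is the first candidate to remove from the spool while the qmaster is stopped.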


>    If anybody has a good idea what I could look into, I'd appreciate this a
>    lot.
>
>    Is there an efficient way to trace (strace?) the master process?
You could enable the built-in debugging (man sge_dl).

William



------------------------------

Message: 3
Date: Fri, 5 Feb 2016 13:41:38 -0800
From: Alex Chekholko <chekh at stanford.edu>
To: users at gridengine.org
Subject: Re: [gridengine users] sgemaster crash
Message-ID: <56B51712.5070602 at stanford.edu>
Content-Type: text/plain; charset=windows-1252; format=flowed

IME you are hitting some kind of rare bug.

Last time we had a thing like this it was because a user was specifying many hundreds of jobids in the hold_jid parameter.
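
To spot submissions of that kind, something like the following could be used in a submission wrapper. This is only a sketch, not part of SGE: it assumes you have the qsub argument list at hand and that -hold_jid is followed by a comma-separated list. The function name count_hold_jids and the limit of 100 are hypothetical.

```python
def count_hold_jids(qsub_args):
    """Return the number of entries passed to -hold_jid in a qsub argument list."""
    total = 0
    # Stop one short of the end so that qsub_args[i + 1] always exists.
    for i, arg in enumerate(qsub_args[:-1]):
        if arg == "-hold_jid":
            # Assumption: the value is a comma-separated list of job IDs or names.
            total += len([j for j in qsub_args[i + 1].split(",") if j])
    return total


HOLD_LIMIT = 100  # hypothetical threshold; pick whatever suits your site

args = ["qsub", "-hold_jid", "101,102,103", "job.sh"]
if count_hold_jids(args) > HOLD_LIMIT:
    print("suspiciously long -hold_jid list")
```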

Before that, it had something to do with parallel jobs not cleaning up quite right, and IIRC disabling the scheduling reporting parameters fixed it.

In each case, the "easiest" way is to delete your job spool, restart your qmaster, and then monitor closely to figure out which user's jobs make it crash.  And then get the user to modify their job parameters till your qmaster doesn't crash anymore :)




--
Alex Chekholko chekh at stanford.edu 347-401-4860
