[gridengine users] sgemaster crash

Manfred Selz Manfred.Selz at diasemi.com
Fri Feb 5 15:02:52 UTC 2016


This week I have seen the sgemaster (sge_qmaster, version 6.2u5) crash several times at one of our sites.
Each time, the last entry in the "messages" file looked like this:

02/05/2016 14:37:30|worker|mnsrvgems-02v|C|Removing element from other list !!!

Automatic migration to the alternate master hosts (as defined in the shadow host list) also failed: the new sge_qmaster crashed as well, after one minute or less.
Only after several attempts was I able to start the master again, and even then some queues were damaged and jobs were lost.

This has never happened in the more than four years since I took over the SGE admin role in our company, and the messages file does not give an obvious reason. Sometimes I see a line like this before the crash:

02/05/2016 14:37:12|  main|mnsrvgems-02v|W|removing reference to no longer existing job 5335536 of user ...
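
To make it easier to see what precedes each crash, here is a minimal sketch (not an official SGE tool) that pulls the entries logged just before every critical line. It assumes the pipe-delimited layout visible in the two log lines quoted above (timestamp|thread|host|level|text) and that level "C" marks a critical entry; the names parse_line and context_before_critical are made up for this illustration.

```python
from collections import deque, namedtuple
from datetime import datetime

# Field names are an assumption based on the two quoted log lines.
Entry = namedtuple("Entry", "time thread host level text")


def parse_line(line):
    """Split one messages-file line into fields; return None if it does not fit."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    try:
        stamp = datetime.strptime(parts[0].strip(), "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    # The thread column is padded with spaces ("  main"), so strip it.
    return Entry(stamp, parts[1].strip(), parts[2].strip(), parts[3].strip(), parts[4])


def context_before_critical(lines, keep=20):
    """Yield (critical_entry, preceding_entries) for every level-C line."""
    recent = deque(maxlen=keep)  # rolling window of the last `keep` entries
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # skip lines that do not match the assumed layout
        if entry.level == "C":
            yield entry, list(recent)
        recent.append(entry)


# Usage sketch; the path is a placeholder for the qmaster messages file:
#
#   with open("/path/to/qmaster/messages") as fh:
#       for crit, before in context_before_critical(fh, keep=10):
#           print(crit.time, crit.thread, crit.text)
#           for e in before:
#               print("   ", e.time, e.thread, e.level, e.text)
```

Comparing the windows from several crashes might show whether the same job, user, or thread turns up each time.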

I have also looked at the host-specific local messages files.
If anybody has an idea of what else I could look into, I would appreciate it a lot.
Is there an efficient way to trace (strace?) the master process?


Manfred Selz
Senior CAD Engineer
Direct Dial: +49 (0)7021 805-562
Manfred.Selz at diasemi.com | www.diasemi.com
Dialog Semiconductor GmbH, Neue Strasse 95, 73230 Kirchheim/Teck-Nabern, Germany


