[gridengine users] sge_execd dies

Daniel Povey dpovey at gmail.com
Fri Nov 9 05:17:26 UTC 2018


OK, well there's your problem.  You need to raise the start of gid_range
to a value larger than your largest possible 'real' group ID: for instance,
10000.
The name is a little confusing.  SGE assigns each job an additional group
ID from this range to keep track of the job's processes, so it needs to be
a range that's disjoint from the group IDs actually in use on your systems.
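
A quick way to check for this kind of overlap is sketched below in Python.
It is only an illustration: the helper names (parse_gid_range,
conflicting_gids, local_gids) are made up for this example and are not part
of SGE, and it assumes gid_range is a comma-separated list of "m-n" ranges or
single numbers.  It also assumes that the group database on the node where you
run it (as seen through Python's grp module) is representative of the GIDs
your users really have; if your directory service does not allow enumeration,
feed it a list of GIDs obtained some other way.

```python
import grp


def parse_gid_range(spec):
    """Parse a gid_range value such as '200-700000' into (low, high) pairs."""
    ranges = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        low, sep, high = part.partition("-")
        # A bare number is treated as a one-element range.
        ranges.append((int(low), int(high) if sep else int(low)))
    return ranges


def conflicting_gids(spec, real_gids):
    """Return the sorted real GIDs that fall inside any gid_range interval."""
    ranges = parse_gid_range(spec)
    return sorted(
        {gid for gid in real_gids if any(low <= gid <= high for low, high in ranges)}
    )


def local_gids():
    """GIDs visible in this host's group database."""
    return [entry.gr_gid for entry in grp.getgrall()]


if __name__ == "__main__":
    # Value reported by `qconf -sconf | grep gid_range` in this thread.
    spec = "200-700000"
    clashes = conflicting_gids(spec, local_gids())
    if clashes:
        print(f"{len(clashes)} real GIDs fall inside gid_range {spec}, e.g. {clashes[:5]}")
    else:
        print(f"no real GIDs fall inside gid_range {spec}")
```

With the numbers from this thread, conflicting_gids("200-700000", [3135])
reports 3135 as a clash, while a range starting at 10000 would not.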


On Fri, Nov 9, 2018 at 12:12 AM Joseph Farran <jfarran at uci.edu> wrote:

> Hi Dan.
>
> Thank you for the suggestion.   Here is what I have:
>
> # qconf -sconf | grep gid_range
> gid_range                    200-700000
>
> The highest gid is 3135.
>
> Best,
> Joseph
>
> On 11/8/2018 8:58 PM, Daniel Povey wrote:
>
> Run
> qconf -sconf | grep gid_range
> and check whether any of your users have group IDs in that range.  That
> can lead to things being killed.
> Dan
>
>
> On Thu, Nov 8, 2018 at 10:33 PM Joseph Farran <jfarran at uci.edu> wrote:
>
>> Greetings.
>>
>> I am running SGE 8.1.9 on a cluster with some 10k cores, CentOS 6.9.
>>
>> I am seeing job failures on nodes where the node's sge_execd
>> unexpectedly dies.
>>
>> I ran strace on the nodes' sge_execd and it's not of much help.   It
>> always ends with
>>
>>     +++ killed by SIGKILL +++
>>
>> But I cannot tell what killed it.  dmesg shows no segfaults or
>> memory issues.  The sge_qmaster on the head node is never affected and
>> runs just fine.  The issue is with sge_execd on the compute nodes, and
>> only about 20% of the nodes are affected.
>>
>> Here are some sge settings:
>>
>> qmaster_params               MONITOR_TIME=0:1:00  LOG_Monitor_Message=0
>> execd_params                 ENABLE_ADDGRP_KILL=TRUE,S_DESCRIPTORS=9096, \
>>                              H_DESCRIPTORS=50240,H_MEMORYLOCKED=infinity, \
>>                              S_MEMORYLOCKED=infinity S_MAXPROC=infinity, \
>>                              H_MAXPROC=infinity,S_LOCKS=infinity, \
>>                              H_LOCKS=infinity,USE_SMAPS=yes,ENABLE_BINDING=TRUE
>>
>> max_aj_instances             2000
>> max_aj_tasks                 0
>> max_u_jobs                   900000
>> max_jobs                     900000
>> max_advance_reservations     300
>>
>> I also tried playing with vm settings to:
>>
>>     /sbin/sysctl vm.overcommit_ratio=100
>>     /sbin/sysctl vm.overcommit_memory=2
>>
>> But it has not been of much help - sge_execd keeps dying.
>>
>> Any help on how I can track down what is causing the node client
>> sge_execd to die?
>>
>> Joseph
>> _______________________________________________
>> users mailing list
>> users at gridengine.org
>> https://gridengine.org/mailman/listinfo/users
>>
>
>