[gridengine users] sge_execd dies
dpovey at gmail.com
Fri Nov 9 05:17:26 UTC 2018
OK, well there's your problem. You need to increase the start of gid_range
to a value larger than your largest possible 'real' group ID: for instance,
The name is a little confusing. It needs to be a range that's disjoint
from the range of possible group IDs.
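To check a node for an overlap, something along these lines might help. This is only a minimal sketch and not part of SGE: the function names parse_gid_range and conflicting_gids are made up for illustration, and 20000-20100 is a hypothetical replacement range. It assumes gid_range is a comma-separated list of m-n ranges or single numbers, and that grp.getgrall() can enumerate the groups you care about (it may miss LDAP/NIS groups if enumeration is disabled).

```python
def parse_gid_range(spec):
    """Parse a gid_range value such as '200-700000' or '20000-20100,30000'
    into a list of inclusive (start, end) tuples."""
    ranges = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            ranges.append((int(lo), int(hi)))
        else:
            # A single number m is treated as the range m-m.
            n = int(part)
            ranges.append((n, n))
    return ranges


def conflicting_gids(spec, gids):
    """Return the sorted 'real' GIDs that fall inside any gid_range interval."""
    ranges = parse_gid_range(spec)
    return sorted({g for g in gids for lo, hi in ranges if lo <= g <= hi})


if __name__ == "__main__":
    import grp  # Unix-only standard library module

    spec = "200-700000"  # the value reported in this thread
    # Assumption: getgrall() lists every group that matters on this node.
    real_gids = [g.gr_gid for g in grp.getgrall()]
    clashes = conflicting_gids(spec, real_gids)
    print("GIDs inside gid_range:", clashes if clashes else "none")
```

With the numbers from this thread (gid_range 200-700000, highest gid 3135), conflicting_gids("200-700000", [3135]) reports 3135 as a conflict, while the hypothetical range 20000-20100 would not.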
On Fri, Nov 9, 2018 at 12:12 AM Joseph Farran <jfarran at uci.edu> wrote:
> Hi Dan.
> Thank you for the suggestion. Here is what I have:
> # qconf -sconf | grep gid_range
> gid_range 200-700000
> The highest gid is 3135.
> On 11/8/2018 8:58 PM, Daniel Povey wrote:
> Run
> qconf -sconf | grep gid_range
> and check whether any of your users have group IDs in that range. That
> can lead to things being killed.
> On Thu, Nov 8, 2018 at 10:33 PM Joseph Farran <jfarran at uci.edu> wrote:
>> I am running SGE 8.1.9 on a cluster with some 10k cores, CentOS 6.9.
>> I am seeing job failures on nodes where the node's sge_execd
>> unexpectedly dies.
>> I ran strace on the node's sge_execd and it's not of much help. It
>> always ends with
>> +++ killed by SIGKILL +++
>> But I cannot tell what killed it. dmesg shows no segfaults or
>> memory issues. The sge_qmaster on the head node is never affected and
>> it runs just fine. The issue is with sge_execd on the nodes, and only
>> some 20% of the nodes are affected; the other 80% are fine.
>> Here are some sge settings:
>> qmaster_params MONITOR_TIME=0:1:00 LOG_Monitor_Message=0
>> execd_params ENABLE_ADDGRP_KILL=TRUE,S_DESCRIPTORS=9096, \
>> S_MEMORYLOCKED=infinity S_MAXPROC=infinity, \
>> H_MAXPROC=infinity,S_LOCKS=infinity, \
>> max_aj_instances 2000
>> max_aj_tasks 0
>> max_u_jobs 900000
>> max_jobs 900000
>> max_advance_reservations 300
>> I also tried playing with vm settings to:
>> /sbin/sysctl vm.overcommit_ratio=100
>> /sbin/sysctl vm.overcommit_memory=2
>> But it has not been of much help - sge_execd keeps dying.
>> Any help on how I can track down what is causing the node client
>> sge_execd to die?