[gridengine users] sge_execd dies

Daniel Povey dpovey at gmail.com
Fri Nov 9 04:58:09 UTC 2018


Do
qconf -sconf | grep gid_range
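and check whether any of your users have group IDs in that range.  SGE
tags each job's processes with an additional group ID taken from
gid_range, and with ENABLE_ADDGRP_KILL=TRUE (which you have set) it uses
that ID to find the processes to kill when a job ends.  If a real group
overlaps the range, that can lead to unrelated things being killed.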
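
Something along these lines could do the overlap check.  It is only a
rough sketch: range_bounds and gids_in_range are made-up helper names
(not part of SGE), and it assumes gid_range holds a single LO-HI range
rather than a comma-separated list.

```shell
# Sketch only: range_bounds and gids_in_range are hypothetical helpers,
# not part of SGE.

# Split a gid_range value such as "20000-20100" into "LO HI".
# Assumes a single LO-HI range, not a comma-separated list.
range_bounds() {
    echo "$1" | awk -F- '{ print $1, $2 }'
}

# Read `getent group`-style lines (name:passwd:gid:members) on stdin and
# print "name gid" for every group whose GID lies within [LO, HI].
gids_in_range() {
    awk -F: -v lo="$1" -v hi="$2" \
        '$3 + 0 >= lo + 0 && $3 + 0 <= hi + 0 { print $1, $3 }'
}

# Example use on a host with SGE configured (not run here):
#   range=$(qconf -sconf | awk '$1 == "gid_range" { print $2 }')
#   set -- $(range_bounds "$range")
#   getent group | gids_in_range "$1" "$2"
```

Any group it prints is a candidate for the overlap.  If your groups come
from LDAP or similar, getent may not enumerate all of them, so an empty
result is not conclusive.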
Dan


On Thu, Nov 8, 2018 at 10:33 PM Joseph Farran <jfarran at uci.edu> wrote:

> Greetings.
>
> I am running SGE 8.1.9 on a cluster with some 10k cores, CentOS 6.9.
>
> I am seeing job failures on nodes where the node's sge_execd unexpectedly
> dies.
>
> I ran strace on the node's sge_execd and it's not of much help.  It
> always ends with
>
>     +++ killed by SIGKILL +++
>
> But I cannot tell what killed it.  dmesg shows no segfaults or memory
> issues.  The sge_qmaster on the head node is never affected and runs
> just fine.  The issue is with sge_execd on the execution nodes, and only
> about 20% of the nodes are affected.
>
> Here are some sge settings:
>
> qmaster_params               MONITOR_TIME=0:1:00  LOG_Monitor_Message=0
> execd_params                 ENABLE_ADDGRP_KILL=TRUE,S_DESCRIPTORS=9096, \
>                              H_DESCRIPTORS=50240,H_MEMORYLOCKED=infinity, \
>                              S_MEMORYLOCKED=infinity S_MAXPROC=infinity, \
>                              H_MAXPROC=infinity,S_LOCKS=infinity, \
>                              H_LOCKS=infinity,USE_SMAPS=yes, \
>                              ENABLE_BINDING=TRUE
>
> max_aj_instances             2000
> max_aj_tasks                 0
> max_u_jobs                   900000
> max_jobs                     900000
> max_advance_reservations     300
>
> I also tried playing with vm settings to:
>
>     /sbin/sysctl vm.overcommit_ratio=100
>     /sbin/sysctl vm.overcommit_memory=2
>
> But it has not been of much help - sge_execd keeps dying.
>
> Any help on how I can track down what is causing sge_execd on the nodes
> to die?
>
> Joseph