[gridengine users] SoGE file descriptor limit and MAX_DYN_EC

Daniel Povey dpovey at gmail.com
Fri Feb 21 03:55:51 UTC 2020


That's a GridEngine bug whereby the event client ids (or whatever they are
called) don't get cleaned up properly.
The workaround is to restart the qmaster when that happens.
Be careful: sometimes restarting the service doesn't work, and you may need
to kill the process.
At the cluster I used to manage at JHU, we had a process which checked the
output of
qconf -secl
and restarted the qmaster if it returned a number greater than 900.
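
A check like the one described above could be sketched roughly as follows. This is not the actual script from that cluster. It assumes that each event client appears in the qconf -secl output as one row whose first field is a numeric ID, and that the site supplies its own restart command through a hypothetical RESTART_CMD variable; the function names are made up for illustration.

```shell
#!/bin/sh
# Sketch of a qmaster watchdog; an illustration, not the original script.

THRESHOLD=900   # the value mentioned in this post

# Count event clients in `qconf -secl` output read from stdin.
# Assumption: every client is one row whose first field is a numeric ID,
# so header and separator lines are skipped.
count_event_clients() {
    awk '$1 ~ /^[0-9]+$/ { n++ } END { print n + 0 }'
}

# Exit status 0 when the count ($1) exceeds the threshold ($2).
needs_restart() {
    [ "$1" -gt "$2" ]
}

# Hypothetical entry point, meant to be run periodically (e.g. from cron).
# RESTART_CMD is a placeholder for the site's own qmaster restart command.
check_qmaster() {
    count=$(qconf -secl | count_event_clients)
    if needs_restart "$count" "$THRESHOLD"; then
        echo "event clients: $count > $THRESHOLD, restarting qmaster"
        ${RESTART_CMD:?set RESTART_CMD to your qmaster restart command}
    fi
}
```

Calling check_qmaster from a cron job would run the check. Since a service restart sometimes fails, whatever RESTART_CMD points at may also need to verify that the old qmaster process is really gone.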


On Fri, Feb 21, 2020 at 1:35 AM Lana Deere <lana.deere at gmail.com> wrote:

> On CentOS 7 using SoGE 8.1.9, I'm getting an error using qsub:
> QSUB:Unable to initialize environment because of error: cannot register
> event client. Only 979 event clients are allowed in the system
>
> Supposedly I have this limit configured much higher:
> root# qconf -sconf | grep MAX_DYN_EC
> qmaster_params               MAX_DYN_EC=25000,gdi_retries=5
>
> However, the qmaster at startup is reporting that it is not honoring the
> limit:
> |nr of dynamic event clients exceeds max file descriptor limit, setting
> MAX_DYN_EC=979
> |qmaster hard descriptor limit is set to 4096
> |qmaster soft descriptor limit is set to 1024
> |qmaster will use max. 1004 file descriptors for communication
> |qmaster will accept max. 979 dynamic event clients
> |starting up SGE 8.1.9 (lx-amd64)
>
> This is surprising to me since my system's file descriptor limit is set
> much higher than 1024/4096:
> root# pwd
> /etc/security/limits.d
> root# cat 99*nofile*conf
> * soft nofile 100000
> * hard nofile 100000
> root# ulimit -a -S | grep 'open files'
> open files                      (-n) 100000
>
> I hacked the script in /etc/init.d which starts the qmaster and it shows
> the higher limit.  However, if I look at /proc/<qmaster pid>/limits I can
> see that it has the lower limits it reports.  What I can't figure out is
> why it is seeing the lower limit.  Anyone know whether there's a
> configuration parameter somewhere overriding the system limit?  Any
> suggestions on how to make it get the system's limit?
>
> Thanks.
>
> .. Lana (lana.deere at gmail.com)