[gridengine users] jobs randomly die

Daniel Povey dpovey at gmail.com
Tue May 14 18:58:07 UTC 2019


I have observed apparently random failures when users had gids inside the
range given by `gid_range` (see below; gid_range should lie outside the
range of gids that any user actually has).
But usually this kind of thing is due to OOM.
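
For the OOM case, note that the exit_status 137 reported by qacct is 128 + 9, i.e. the job died from SIGKILL, which is also the signal the kernel's OOM killer sends. A rough way to check is to scan the kernel log of the execution host for OOM-killer messages. The Python sketch below is only an illustration: the function name `find_oom_kills` is made up, and it assumes the kernel log contains a line with the text "Killed process <pid> (<name>)", whose exact wording can differ between kernel versions.

```python
import re
import sys

# Assumed message shape; the exact wording varies between kernel versions.
OOM_PATTERN = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(lines):
    """Return (pid, process_name) for each line that looks like an OOM kill."""
    hits = []
    for line in lines:
        match = OOM_PATTERN.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2)))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: save the kernel log first, e.g. dmesg > kernel.log,
    # then pass the file name as the only argument.
    with open(sys.argv[1], errors="replace") as handle:
        for pid, name in find_oom_kills(handle):
            print("OOM killer hit pid %d (%s)" % (pid, name))
```

If this prints nothing for the time the job died, OOM becomes less likely and the gid_range check is worth a look.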

qconf -sconf | grep gid_range
gid_range                    50000-51000
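
To check whether any existing group collides with that range, something like the following Python sketch could be run on an execution host. It rests on assumptions: the helper names `parse_gid_range` and `gids_in_range` are made up for illustration, the value is assumed to have the form shown above (`m-n`, possibly several comma-separated), and `grp.getgrall()` only sees groups that the local name service enumerates, so LDAP/NIS groups may be missing.

```python
import grp


def parse_gid_range(spec):
    """Parse a gid_range value such as '50000-51000' into (low, high) pairs."""
    ranges = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        low, _, high = part.partition("-")
        # A bare number is treated as a one-element range (an assumption).
        ranges.append((int(low), int(high or low)))
    return ranges


def gids_in_range(gids, spec):
    """Return the sorted gids that fall inside any range in spec."""
    ranges = parse_gid_range(spec)
    return sorted(
        gid for gid in set(gids)
        if any(low <= gid <= high for low, high in ranges)
    )


if __name__ == "__main__":
    # Paste the value printed by: qconf -sconf | grep gid_range
    spec = "50000-51000"
    local_gids = [entry.gr_gid for entry in grp.getgrall()]
    for gid in gids_in_range(local_gids, spec):
        print("gid %d is inside gid_range %s" % (gid, spec))
```

No output means no enumerated group falls inside the range; any gid that is printed would be a candidate for the kind of failure described above.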


On Tue, May 14, 2019 at 10:42 AM Reuti <reuti at staff.uni-marburg.de> wrote:

> AFAICS the kill sent by SGE happens after a task has already returned with an
> error. In this case SGE would use the kill signal to make sure all child
> processes are killed. Hence the question would be: what was the initial command
> in the job script, and what output/error did it generate?
>
> -- Reuti
>
> > On 14.05.2019 at 11:36, hiller <hiller at mpia-hd.mpg.de> wrote:
> >
> > Dear all,
> > I have a problem: jobs sent to gridengine randomly die.
> > The gridengine version is 8.1.9
> > The OS is opensuse 15.0
> > The gridengine messages file says:
> > 05/13/2019 18:31:45|worker|karun|E|master task of job 635659.1 failed - killing job
> > 05/13/2019 18:31:46|worker|karun|W|job 635659.1 failed on host karun10 assumedly after job because: job 635659.1 died through signal KILL (9)
> >
> > qacct -j 635659 says:
> > failed       100 : assumedly after job
> > exit_status  137                  (Killed)
> >
> >
> > There was no kill triggered by the user. Also, there are no other
> > limitations, neither via ulimit nor in the gridengine queue.
> > The 'qconf -sq all.q' command gives:
> > s_rt                  INFINITY
> > h_rt                  INFINITY
> > s_cpu                 INFINITY
> > h_cpu                 INFINITY
> > s_fsize               INFINITY
> > h_fsize               INFINITY
> > s_data                INFINITY
> > h_data                INFINITY
> > s_stack               INFINITY
> > h_stack               INFINITY
> > s_core                INFINITY
> > h_core                INFINITY
> > s_rss                 INFINITY
> > h_rss                 INFINITY
> > s_vmem                INFINITY
> > h_vmem                INFINITY
> >
> > Years ago there were some threads about the same issue, but I did not
> > find a solution.
> >
> > Does somebody have a hint about what I can do or check/debug?
> >
> > With kind regards and many thanks for any help, ulrich
> > _______________________________________________
> > users mailing list
> > users at gridengine.org
> > https://gridengine.org/mailman/listinfo/users