[gridengine users] jobs randomly die

MacMullan IV, Hugh hughmac at wharton.upenn.edu
Tue May 14 13:28:04 UTC 2019


It looks like a limit of some sort is being reached. Do you have an RQS of any kind (qconf -srqs)? We see this with job-requested or system-set RAM exhaustion (the OOM killer, as mentioned; 'dmesg -T' on the compute nodes is often useful), as well as when time limits are reached. What is the whole output from 'qacct -j JOBID'?
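
For reference, the exit_status of 137 reported by qacct matches the KILL (9) signal in the messages file: by the usual shell convention, a status above 128 means the process died from signal (status - 128). A minimal Python sketch of that arithmetic follows; the helper name describe_exit_status is hypothetical, and it only uses the standard signal module.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status (hypothetical helper)."""
    if status > 128:
        # Shell convention: 128 + signal number for a signal death.
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Number does not map to a named signal on this platform.
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with code {status}"


print(describe_exit_status(137))
```

This only says which signal ended the job, not who sent it; the sender still has to be found from the node's logs or the scheduler's limits.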

Cheers,
-Hugh

-----Original Message-----
From: users-bounces at gridengine.org <users-bounces at gridengine.org> On Behalf Of hiller
Sent: Tuesday, May 14, 2019 9:02 AM
To: users at gridengine.org
Subject: Re: [gridengine users] jobs randomly die

Hi,
nope, there are no oom messages in the journal.
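
For anyone repeating this check, here is a small sketch of scanning saved kernel log text for OOM-killer entries. The search patterns are assumptions about typical Linux kernel wording and may not match every kernel version; find_oom_lines is a hypothetical helper.

```python
import re
from typing import List

# Assumed patterns: common wording in Linux OOM-killer kernel messages.
OOM_PATTERN = re.compile(r"out of memory|oom-killer|oom_kill", re.IGNORECASE)


def find_oom_lines(log_text: str) -> List[str]:
    """Return the lines of log_text that look like OOM-killer entries."""
    return [line for line in log_text.splitlines() if OOM_PATTERN.search(line)]
```

The text could come from 'dmesg -T' output saved to a file on the execution host; an empty result only means that none of the assumed patterns matched.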
Regards, ulrich


On 5/14/19 12:49 PM, Arnau wrote:
> Hi,
> 
> _maybe_ the OOM killer killed the job? A look at messages will give you an answer (I've seen this in my cluster).
> 
> HTH,
> Arnau
> 
> On Tue, 14 May 2019 at 12:37, hiller (<hiller at mpia-hd.mpg.de>) wrote:
> 
>     Dear all,
>     i have a problem that jobs sent to gridengine randomly die.
>     The gridengine version is 8.1.9
>     The OS is opensuse 15.0
>     The gridengine messages file says:
>     05/13/2019 18:31:45|worker|karun|E|master task of job 635659.1 failed - killing job
>     05/13/2019 18:31:46|worker|karun|W|job 635659.1 failed on host karun10 assumedly after job because: job 635659.1 died through signal KILL (9)
> 
>     qacct -j 635659 says:
>     failed       100 : assumedly after job
>     exit_status  137                  (Killed)
> 
> 
>     There was no kill triggered by the user. Also, there are no other limits, neither via ulimit nor in the gridengine queue.
>     The 'qconf -sq all.q' command gives:
>     s_rt                  INFINITY
>     h_rt                  INFINITY
>     s_cpu                 INFINITY
>     h_cpu                 INFINITY
>     s_fsize               INFINITY
>     h_fsize               INFINITY
>     s_data                INFINITY
>     h_data                INFINITY
>     s_stack               INFINITY
>     h_stack               INFINITY
>     s_core                INFINITY
>     h_core                INFINITY
>     s_rss                 INFINITY
>     h_rss                 INFINITY
>     s_vmem                INFINITY
>     h_vmem                INFINITY
> 
>     Years ago there were some threads about the same issue, but i did not find a solution.
> 
>     Does somebody have a hint what i can do or check/debug?
> 
>     With kind regards and many thanks for any help, ulrich
>     _______________________________________________
>     users mailing list
>     users at gridengine.org
>     https://gridengine.org/mailman/listinfo/users
> 