[gridengine users] jobs randomly die
hiller at mpia-hd.mpg.de
Tue May 14 13:01:37 UTC 2019
Nope, there are no OOM messages in the journal.
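
For anyone repeating this check, here is a minimal Python sketch of scanning saved kernel log text for OOM-killer traces. The pattern list is an assumption based on strings the Linux kernel commonly logs, whose exact wording differs between kernel versions. `find_oom_lines` is a hypothetical helper name, not part of gridengine or systemd.

```python
import re

# Substrings the Linux kernel commonly logs around an OOM kill.
# ASSUMPTION: the exact wording varies between kernel versions, so this
# list may need adjusting for a given system.
OOM_PATTERNS = re.compile(
    r"invoked oom-killer"
    r"|Out of memory: Kill(ed)? process"
    r"|oom-kill:"
    r"|Memory cgroup out of memory",
    re.IGNORECASE,
)


def find_oom_lines(log_lines):
    """Return the log lines that look like OOM-killer activity."""
    return [line.rstrip("\n") for line in log_lines if OOM_PATTERNS.search(line)]
```

One way to use it is to save the kernel messages from the execution host (for example the output of `journalctl -k`) to a file and pass the open file object to `find_oom_lines`. An empty result matches the observation above, but it does not rule out a kill by something other than the kernel's OOM killer.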
On 5/14/19 12:49 PM, Arnau wrote:
> _maybe_ the OOM killer killed the job? A look at messages will give you an answer (I've seen this in my cluster).
> On Tue, 14 May 2019 at 12:37, hiller (<hiller at mpia-hd.mpg.de>) wrote:
> Dear all,
> I have a problem: jobs sent to gridengine randomly die.
> The gridengine version is 8.1.9
> The OS is openSUSE 15.0
> The gridengine messages file says:
> 05/13/2019 18:31:45|worker|karun|E|master task of job 635659.1 failed - killing job
> 05/13/2019 18:31:46|worker|karun|W|job 635659.1 failed on host karun10 assumedly after job because: job 635659.1 died through signal KILL (9)
> qacct -j 635659 says:
> failed 100 : assumedly after job
> exit_status 137 (Killed)
> There was no kill triggered by the user. There are also no other limits in place, neither via ulimit nor in the gridengine queue.
> The 'qconf -sq all.q' command gives:
> s_rt INFINITY
> h_rt INFINITY
> s_cpu INFINITY
> h_cpu INFINITY
> s_fsize INFINITY
> h_fsize INFINITY
> s_data INFINITY
> h_data INFINITY
> s_stack INFINITY
> h_stack INFINITY
> s_core INFINITY
> h_core INFINITY
> s_rss INFINITY
> h_rss INFINITY
> s_vmem INFINITY
> h_vmem INFINITY
> Years ago there were some threads about the same issue, but I did not find a solution.
> Does anybody have a hint as to what I can do or check/debug?
> With kind regards and many thanks for any help, Ulrich
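
The exit_status 137 reported by qacct agrees with the "signal KILL (9)" entry in the messages file. By the usual shell convention, a status above 128 is 128 plus the signal number, so 137 = 128 + 9. A minimal Python sketch of that arithmetic follows; `decode_exit_status` is a hypothetical helper name, not part of gridengine.

```python
import signal


def decode_exit_status(status):
    """Interpret a shell-style exit status.

    ASSUMPTION: the status follows the common convention that values
    above 128 mean "terminated by signal (status - 128)".
    """
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return ("signal", name)
    return ("exit", status)


print(decode_exit_status(137))  # on Linux: ('signal', 'SIGKILL')
```

This only confirms that the job received SIGKILL. It does not say which process sent it.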