[gridengine users] jobs randomly die

Reuti reuti at staff.uni-marburg.de
Tue May 14 14:41:15 UTC 2019


AFAICS the kill sent by SGE happens after a task has already returned with an error. In that case SGE uses the KILL signal to make sure all child processes are terminated. Hence the question is: what was the initial command in the job script, and what output/error did it generate?
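
As a side note on the exit_status of 137 reported by qacct in the quoted message: by the usual shell convention, a status above 128 means "128 plus the signal number", so 137 corresponds to signal 9 (KILL), which matches the messages file. A minimal sketch of that arithmetic follows; the helper name describe_exit_status is hypothetical and not part of gridengine, and the signal names come from Python's standard signal module on the host where it runs.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status such as the one shown by qacct.

    Assumption: the status follows the common shell convention in which
    values above 128 encode 128 + signal number. A program may also
    return such a value on its own, so treat the result as a hint only.
    """
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    if status == 0:
        return "exited normally"
    return f"exited with error code {status}"


if __name__ == "__main__":
    # 137 is the exit_status from the qacct output quoted below.
    print(describe_exit_status(137))
```

This only explains how the job ended, not why; the exit code and output of the first failing command in the job script are still the relevant piece of information.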

-- Reuti

> On 14.05.2019 at 11:36, hiller <hiller at mpia-hd.mpg.de> wrote:
> 
> Dear all,
> I have a problem: jobs submitted to gridengine die at random.
> The gridengine version is 8.1.9.
> The OS is openSUSE 15.0.
> The gridengine messages file says:
> 05/13/2019 18:31:45|worker|karun|E|master task of job 635659.1 failed - killing job
> 05/13/2019 18:31:46|worker|karun|W|job 635659.1 failed on host karun10 assumedly after job because: job 635659.1 died through signal KILL (9)
> 
> qacct -j 635659 says:
> failed       100 : assumedly after job
> exit_status  137                  (Killed)
> 
> 
> There was no kill triggered by the user. Also, there are no other limits in place, neither via ulimit nor in the gridengine queue.
> The 'qconf -sq all.q' command gives:
> s_rt                  INFINITY
> h_rt                  INFINITY
> s_cpu                 INFINITY
> h_cpu                 INFINITY
> s_fsize               INFINITY
> h_fsize               INFINITY
> s_data                INFINITY
> h_data                INFINITY
> s_stack               INFINITY
> h_stack               INFINITY
> s_core                INFINITY
> h_core                INFINITY
> s_rss                 INFINITY
> h_rss                 INFINITY
> s_vmem                INFINITY
> h_vmem                INFINITY
> 
> Years ago there were some threads about the same issue, but I did not find a solution in them.
> 
> Does somebody have a hint as to what I can check or debug?
> 
> With kind regards and many thanks for any help, ulrich



