[gridengine users] jobs randomly die

hiller hiller at mpia-hd.mpg.de
Tue May 14 13:52:05 UTC 2019


~> qconf -srqs
No resource quota set found

'dmesg -T' shows no OOM messages or anything else unusual.

'free -h' looks fine, and it also looked fine at the time the job was killed:

~> free -h
              total        used        free      shared  buff/cache   available
Mem:           188G        1.0G        185G        2.6M        2.0G        186G
Swap:           49G          0B         49G

Full output of qacct:
~>  qacct -j 635659
==============================================================
qname        all.q               
hostname     karun10             
group        users               
owner        calj                
project      NONE                
department   defaultdepartment   
jobname      dsc_gdr2            
jobnumber    635659              
taskid       undefined
account      sge                 
priority     0                   
qsub_time    Mon May 13 13:06:58 2019
start_time   Mon May 13 13:06:56 2019
end_time     Mon May 13 18:31:42 2019
granted_pe   make                
slots        1                   
failed       100 : assumedly after job
exit_status  137                  (Killed)
ru_wallclock 19486s
ru_utime     0.048s
ru_stime     0.006s
ru_maxrss    11.566KB
ru_ixrss     0.000B
ru_ismrss    0.000B
ru_idrss     0.000B
ru_isrss     0.000B
ru_minflt    7885                
ru_majflt    0                   
ru_nswap     0                   
ru_inblock   0                   
ru_oublock   8                   
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     142                 
ru_nivcsw    3                   
cpu          19305.760s
mem          7.463TBs
io           70.435GB
iow          0.000s
maxvmem      532.004MB
arid         undefined
ar_sub_time  undefined
category     -l hostname=karun10 -pe make 1
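
A note on the exit_status above: 137 is consistent with the "died through signal KILL (9)" line in the messages file, assuming the usual shell convention that an exit status above 128 means 128 plus the signal number. A minimal Python sketch of that decoding follows; decode_exit_status is a hypothetical helper name for illustration, not part of gridengine:

```python
import signal


def decode_exit_status(status: int) -> str:
    """Interpret a shell-style exit status such as qacct's exit_status.

    Assumption: values above 128 encode "128 + signal number".
    """
    if status > 128:
        signum = status - 128
        try:
            # Map the number to its symbolic name on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    return f"exited normally with code {status}"


# exit_status value taken from the qacct output above
print(decode_exit_status(137))
```

This only says which signal ended the job (SIGKILL on Linux), not who sent it.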


Thanks, ulrich


On 5/14/19 3:28 PM, MacMullan IV, Hugh wrote:
> It's a limit of some sort being reached. Do you have an RQS of any kind (qconf -srqs)? We see this with RAM exhaustion, either job-requested or system-set (OOM killer; as mentioned, 'dmesg -T' on the compute nodes is often useful), as well as when time limits are reached. What is the whole output of 'qacct -j JOBID'?
> 
> Cheers,
> -Hugh
> 
> -----Original Message-----
> From: users-bounces at gridengine.org <users-bounces at gridengine.org> On Behalf Of hiller
> Sent: Tuesday, May 14, 2019 9:02 AM
> To: users at gridengine.org
> Subject: Re: [gridengine users] jobs randomly die
> 
> Hi,
> nope, there are no oom messages in the journal.
> Regards, ulrich
> 
> 
> On 5/14/19 12:49 PM, Arnau wrote:
>> Hi,
>>
>> _maybe_ the OOM killer killed the job? A look at messages will give you an answer (I've seen this on my cluster).
>>
>> HTH,
>> Arnau
>>
>> On Tue, 14 May 2019 at 12:37, hiller (<hiller at mpia-hd.mpg.de>) wrote:
>>
>>     Dear all,
>>     I have a problem: jobs sent to gridengine randomly die.
>>     The gridengine version is 8.1.9
>>     The OS is opensuse 15.0
>>     The gridengine messages file says:
>>     05/13/2019 18:31:45|worker|karun|E|master task of job 635659.1 failed - killing job
>>     05/13/2019 18:31:46|worker|karun|W|job 635659.1 failed on host karun10 assumedly after job because: job 635659.1 died through signal KILL (9)
>>
>>     qacct -j 635659 says:
>>     failed       100 : assumedly after job
>>     exit_status  137                  (Killed)
>>
>>
>>     There was no kill triggered by the user. There are also no other limits, neither via ulimit nor in the gridengine queue.
>>     The 'qconf -sq all.q' command gives:
>>     s_rt                  INFINITY
>>     h_rt                  INFINITY
>>     s_cpu                 INFINITY
>>     h_cpu                 INFINITY
>>     s_fsize               INFINITY
>>     h_fsize               INFINITY
>>     s_data                INFINITY
>>     h_data                INFINITY
>>     s_stack               INFINITY
>>     h_stack               INFINITY
>>     s_core                INFINITY
>>     h_core                INFINITY
>>     s_rss                 INFINITY
>>     h_rss                 INFINITY
>>     s_vmem                INFINITY
>>     h_vmem                INFINITY
>>
>>     Years ago there were some threads about the same issue, but I did not find a solution.
>>
>>     Does anybody have a hint as to what I can do or check/debug?
>>
>>     With kind regards and many thanks for any help, ulrich
>>     _______________________________________________
>>     users mailing list
>>     users at gridengine.org
>>     https://gridengine.org/mailman/listinfo/users
>>

