[gridengine users] jobs randomly die

Feng Zhang prod.feng at gmail.com
Tue May 14 14:03:07 UTC 2019


Looks like your job used a lot of RAM:

mem          7.463TBs
io           70.435GB
iow          0.000s
maxvmem      532.004MB
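
A note on reading these figures: as far as I recall from the Grid Engine accounting(5) man page, "mem" is an integral (memory multiplied by CPU seconds), not a peak value, so it grows with run time. The sketch below divides it by "cpu" to get an implied average. The numbers are copied from the qacct output quoted further down; the function name and the binary unit conversion (1 TB = 1024 * 1024 MB) are assumptions for illustration.

```python
# Values copied from the qacct output quoted in this thread.
MEM_INTEGRAL_TB_S = 7.463   # "mem": integral memory usage, TB * CPU-seconds
CPU_SECONDS = 19305.760     # "cpu"
MAXVMEM_MB = 532.004        # "maxvmem": peak virtual memory


def average_memory_mb(mem_tb_s, cpu_s, mb_per_tb=1024.0 * 1024.0):
    """Average memory in MB implied by an integral given in TB * seconds.

    mb_per_tb is an assumption (binary units); pass 1e6 for decimal units.
    """
    return mem_tb_s * mb_per_tb / cpu_s


avg_mb = average_memory_mb(MEM_INTEGRAL_TB_S, CPU_SECONDS)
print(f"average ~ {avg_mb:.0f} MB, peak (maxvmem) = {MAXVMEM_MB:.0f} MB")
```

Under these assumptions the average comes out at about 400 MB, which is below the reported maxvmem, so the large "mem" value by itself may simply reflect the long run time.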

Do you have cgroups set up to limit the resources of jobs?
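
One way to check for a cgroup memory limit on an execution host is to read the limit file in the relevant cgroup directory. This is a minimal sketch, not a Grid Engine feature: the helper name `read_memory_limit` and the 2**60 "unlimited" threshold are illustrative assumptions. The file names are the standard kernel ones (`memory.max` for cgroup v2, `memory.limit_in_bytes` for cgroup v1). Which cgroup directory a job lands in depends on the local setup and is not stated in this thread.

```python
from pathlib import Path


def read_memory_limit(cgroup_dir):
    """Return the memory limit in bytes for a cgroup directory.

    Returns None if no limit file is found or the limit is unlimited.
    """
    # cgroup v2 file first, then the cgroup v1 file.
    for name in ("memory.max", "memory.limit_in_bytes"):
        limit_file = Path(cgroup_dir) / name
        if not limit_file.is_file():
            continue
        raw = limit_file.read_text().strip()
        if raw == "max":  # cgroup v2 spelling of "no limit"
            return None
        value = int(raw)
        # cgroup v1 reports a huge number when unlimited; the exact
        # threshold used here is an assumption for illustration.
        if value >= 2**60:
            return None
        return value
    return None


# Example call (the path is hypothetical and site-specific):
# print(read_memory_limit("/sys/fs/cgroup/memory"))
```

A result of None means no limit was found at that path. It does not rule out limits set elsewhere in the cgroup hierarchy.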

Best,

Feng

On Tue, May 14, 2019 at 9:53 AM hiller <hiller at mpia-hd.mpg.de> wrote:
>
> ~> qconf -srqs
> No resource quota set found
>
> 'dmesg -T' does not show any OOM or other unusual messages.
>
> 'free -h' looks good and also looked good at 'kill time':
>
> ~> free -h
>               total        used        free      shared  buff/cache   available
> Mem:           188G        1.0G        185G        2.6M        2.0G        186G
> Swap:           49G          0B         49G
>
> Full output of qacct:
> ~>  qacct -j 635659
> ==============================================================
> qname        all.q
> hostname     karun10
> group        users
> owner        calj
> project      NONE
> department   defaultdepartment
> jobname      dsc_gdr2
> jobnumber    635659
> taskid       undefined
> account      sge
> priority     0
> qsub_time    Mon May 13 13:06:58 2019
> start_time   Mon May 13 13:06:56 2019
> end_time     Mon May 13 18:31:42 2019
> granted_pe   make
> slots        1
> failed       100 : assumedly after job
> exit_status  137                  (Killed)
> ru_wallclock 19486s
> ru_utime     0.048s
> ru_stime     0.006s
> ru_maxrss    11.566KB
> ru_ixrss     0.000B
> ru_ismrss    0.000B
> ru_idrss     0.000B
> ru_isrss     0.000B
> ru_minflt    7885
> ru_majflt    0
> ru_nswap     0
> ru_inblock   0
> ru_oublock   8
> ru_msgsnd    0
> ru_msgrcv    0
> ru_nsignals  0
> ru_nvcsw     142
> ru_nivcsw    3
> cpu          19305.760s
> mem          7.463TBs
> io           70.435GB
> iow          0.000s
> maxvmem      532.004MB
> arid         undefined
> ar_sub_time  undefined
> category     -l hostname=karun10 -pe make 1
>
>
> Thanks, ulrich
>
>
> On 5/14/19 3:28 PM, MacMullan IV, Hugh wrote:
> > It's a limit of some sort being reached. Do you have an RQS of any kind (qconf -srqs)? We see this with RAM exhaustion, whether the limit is job-requested or system-set (OOM killer; as mentioned, 'dmesg -T' on the compute nodes is often useful), as well as when time limits are reached. What is the whole output from 'qacct -j JOBID'?
> >
> > Cheers,
> > -Hugh
> >
> > -----Original Message-----
> > From: users-bounces at gridengine.org <users-bounces at gridengine.org> On Behalf Of hiller
> > Sent: Tuesday, May 14, 2019 9:02 AM
> > To: users at gridengine.org
> > Subject: Re: [gridengine users] jobs randomly die
> >
> > Hi,
> > Nope, there are no OOM messages in the journal.
> > Regards, ulrich
> >
> >
> > On 5/14/19 12:49 PM, Arnau wrote:
> >> Hi,
> >>
> >> _maybe_ the OOM killer killed the job? A look at the messages log will give you an answer (I've seen this on my cluster).
> >>
> >> HTH,
> >> Arnau
> >>
> >> On Tue, 14 May 2019 at 12:37, hiller (<hiller at mpia-hd.mpg.de>) wrote:
> >>
> >>     Dear all,
> >>     I have a problem: jobs sent to gridengine randomly die.
> >>     The gridengine version is 8.1.9.
> >>     The OS is openSUSE 15.0.
> >>     The gridengine messages file says:
> >>     05/13/2019 18:31:45|worker|karun|E|master task of job 635659.1 failed - killing job
> >>     05/13/2019 18:31:46|worker|karun|W|job 635659.1 failed on host karun10 assumedly after job because: job 635659.1 died through signal KILL (9)
> >>
> >>     qacct -j 635659 says:
> >>     failed       100 : assumedly after job
> >>     exit_status  137                  (Killed)
> >>
> >>
> >>     There was no kill triggered by the user. There are also no other limits, neither via ulimit nor in the gridengine queue.
> >>     The 'qconf -sq all.q' command gives:
> >>     s_rt                  INFINITY
> >>     h_rt                  INFINITY
> >>     s_cpu                 INFINITY
> >>     h_cpu                 INFINITY
> >>     s_fsize               INFINITY
> >>     h_fsize               INFINITY
> >>     s_data                INFINITY
> >>     h_data                INFINITY
> >>     s_stack               INFINITY
> >>     h_stack               INFINITY
> >>     s_core                INFINITY
> >>     h_core                INFINITY
> >>     s_rss                 INFINITY
> >>     h_rss                 INFINITY
> >>     s_vmem                INFINITY
> >>     h_vmem                INFINITY
> >>
> >>     Years ago there were some threads about the same issue, but I did not find a solution.
> >>
> >>     Does somebody have a hint about what I can do or check/debug?
> >>
> >>     With kind regards and many thanks for any help, ulrich
> >>     _______________________________________________
> >>     users mailing list
> >>     users at gridengine.org
> >>     https://gridengine.org/mailman/listinfo/users
> >>


