[gridengine users] jobs randomly die

Hay, William w.hay at ucl.ac.uk
Fri May 17 10:03:03 UTC 2019


On Tue, 2019-05-14 at 10:03 -0400, Feng Zhang wrote:
> looks like your job used a lot of ram:
> 
> mem          7.463TBs
> io           70.435GB
> iow          0.000s
> maxvmem      532.004MB

Not really; 532 MB isn't a lot of memory these days.  The mem figure is
in terabyte-seconds (TBs), which accumulate fairly quickly.  At 512 MB
you get one TBs every 2000 seconds or so.  However, the fact that it is
reporting these numbers indicates that some sort of built-in memory
limit was enabled.  Grid Engine won't measure memory usage unless it has
some sort of limit to enforce.
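
As a rough sanity check of that arithmetic, here is a minimal Python
sketch. It assumes decimal units (1 TB = 1,000,000 MB) and, as a
simplification, that the mem integral accrues over the ru_wallclock time
from the qacct output; Grid Engine may use binary units or CPU seconds
internally, so treat the results as approximate. The function names are
made up for illustration.

```python
# Back-of-the-envelope check of the qacct "mem" figure.
# Assumption: decimal units (1 TB = 1,000,000 MB); binary units would
# shift the results by a few percent.

MB_PER_TB = 1_000_000


def seconds_per_tb_second(resident_mb: float) -> float:
    """Seconds a job holding resident_mb must run to accumulate 1 TB*s."""
    return MB_PER_TB / resident_mb


def average_mb(mem_tb_seconds: float, seconds: float) -> float:
    """Average memory (MB) implied by an integrated mem value over a period."""
    return mem_tb_seconds * MB_PER_TB / seconds


# A job sitting at 512 MB gains one TB*s roughly every 2000 seconds.
print(round(seconds_per_tb_second(512)))   # 1953

# mem = 7.463 TBs over ru_wallclock = 19486 s (values from the qacct output).
print(round(average_mb(7.463, 19486)))     # 383
```

An average of a few hundred MB sits comfortably below the reported
maxvmem of 532 MB, so the 7.463 TBs figure reflects the long run time
rather than heavy memory use.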

William
> 
> Do you use cgroups to limit the resources of jobs?
> 
> Best,
> 
> Feng
> 
> On Tue, May 14, 2019 at 9:53 AM hiller <hiller at mpia-hd.mpg.de> wrote:
> > 
> > ~> qconf -srqs
> > No resource quota set found
> > 
> > 'dmesg -T' does not give an oom or other weird messages.
> > 
> > 'free -h' looks good and also looked good at 'kill time':
> > 
> > ~> free -h
> >               total        used        free      shared  buff/cache   available
> > Mem:           188G        1.0G        185G        2.6M        2.0G        186G
> > Swap:           49G          0B         49G
> > 
> > Full output of qacct:
> > ~>  qacct -j 635659
> > ==============================================================
> > qname        all.q
> > hostname     karun10
> > group        users
> > owner        calj
> > project      NONE
> > department   defaultdepartment
> > jobname      dsc_gdr2
> > jobnumber    635659
> > taskid       undefined
> > account      sge
> > priority     0
> > qsub_time    Mon May 13 13:06:58 2019
> > start_time   Mon May 13 13:06:56 2019
> > end_time     Mon May 13 18:31:42 2019
> > granted_pe   make
> > slots        1
> > failed       100 : assumedly after job
> > exit_status  137                  (Killed)
> > ru_wallclock 19486s
> > ru_utime     0.048s
> > ru_stime     0.006s
> > ru_maxrss    11.566KB
> > ru_ixrss     0.000B
> > ru_ismrss    0.000B
> > ru_idrss     0.000B
> > ru_isrss     0.000B
> > ru_minflt    7885
> > ru_majflt    0
> > ru_nswap     0
> > ru_inblock   0
> > ru_oublock   8
> > ru_msgsnd    0
> > ru_msgrcv    0
> > ru_nsignals  0
> > ru_nvcsw     142
> > ru_nivcsw    3
> > cpu          19305.760s
> > mem          7.463TBs
> > io           70.435GB
> > iow          0.000s
> > maxvmem      532.004MB
> > arid         undefined
> > ar_sub_time  undefined
> > category     -l hostname=karun10 -pe make 1
> > 
> > 
> > Thanks, ulrich
> > 
> > 
> > On 5/14/19 3:28 PM, MacMullan IV, Hugh wrote:
> > > It's a limit of some sort being reached. Do you have an RQS of
> > > any kind (qconf -srqs)? We see this with job-requested or
> > > system-set RAM exhaustion (the OOM killer; as mentioned, 'dmesg -T'
> > > on compute nodes is often useful), as well as when time limits are
> > > reached. What is the whole output from 'qacct -j JOBID'?
> > > 
> > > Cheers,
> > > -Hugh
> > > 
> > > -----Original Message-----
> > > From: users-bounces at gridengine.org <users-bounces at gridengine.org>
> > > On Behalf Of hiller
> > > Sent: Tuesday, May 14, 2019 9:02 AM
> > > To: users at gridengine.org
> > > Subject: Re: [gridengine users] jobs randomly die
> > > 
> > > Hi,
> > > nope, there are no oom messages in the journal.
> > > Regards, ulrich
> > > 
> > > 
> > > On 5/14/19 12:49 PM, Arnau wrote:
> > > > Hi,
> > > > 
> > > > _maybe_ the OOM killer killed the job? A look at messages will
> > > > give you an answer (I've seen this in my cluster).
> > > > 
> > > > HTH,
> > > > Arnau
> > > > 
> > > > On Tue, 14 May 2019 at 12:37, hiller
> > > > (<hiller at mpia-hd.mpg.de>) wrote:
> > > > 
> > > >     Dear all,
> > > > >     I have a problem: jobs submitted to gridengine randomly die.
> > > >     The gridengine version is 8.1.9
> > > >     The OS is opensuse 15.0
> > > >     The gridengine messages file says:
> > > >     05/13/2019 18:31:45|worker|karun|E|master task of job
> > > > 635659.1 failed - killing job
> > > >     05/13/2019 18:31:46|worker|karun|W|job 635659.1 failed on
> > > > host karun10 assumedly after job because: job 635659.1 died
> > > > through signal KILL (9)
> > > > 
> > > >     qacct -j 635659 says:
> > > >     failed       100 : assumedly after job
> > > >     exit_status  137                  (Killed)
> > > > 
> > > > 
> > > > >     There was no kill triggered by the user. Also, there are no
> > > > >     other limits, neither via ulimit nor in the gridengine queue.
> > > >     The 'qconf -sq all.q' command gives:
> > > >     s_rt                  INFINITY
> > > >     h_rt                  INFINITY
> > > >     s_cpu                 INFINITY
> > > >     h_cpu                 INFINITY
> > > >     s_fsize               INFINITY
> > > >     h_fsize               INFINITY
> > > >     s_data                INFINITY
> > > >     h_data                INFINITY
> > > >     s_stack               INFINITY
> > > >     h_stack               INFINITY
> > > >     s_core                INFINITY
> > > >     h_core                INFINITY
> > > >     s_rss                 INFINITY
> > > >     h_rss                 INFINITY
> > > >     s_vmem                INFINITY
> > > >     h_vmem                INFINITY
> > > > 
> > > > >     Years ago there were some threads about the same issue, but
> > > > >     I did not find a solution.
> > > > > 
> > > > >     Does somebody have a hint as to what I can do or check/debug?
> > > > 
> > > >     With kind regards and many thanks for any help, ulrich
> > > >     _______________________________________________
> > > >     users mailing list
> > > >     users at gridengine.org <mailto:users at gridengine.org>
> > > > >     https://gridengine.org/mailman/listinfo/users
> > > > 
> > > 


