[gridengine users] SoGE 8.1.8 - Job IDs getting reset very fast 9999999 ==> 1 - 6-7 times in a month

Yuri Burmachenko yuribu at mellanox.com
Tue Mar 15 07:11:53 UTC 2016


Hello Reuti,

Yes, the spool directory is shared over NFS between the qmaster and shadow servers.

Thanks.

-----Original Message-----
From: Reuti [mailto:reuti at staff.uni-marburg.de] 
Sent: Monday, March 14, 2016 11:43 AM
To: Yuri Burmachenko <yuribu at mellanox.com>
Cc: users at gridengine.org; Dmitry Leibovich <dmitryl at mellanox.com>
Subject: Re: [gridengine users] SoGE 8.1.8 - Job IDs getting reset very fast 9999999 ==> 1 - 6-7 times in a month

Hi,

> On 13.03.2016 at 08:53, Yuri Burmachenko <yuribu at mellanox.com> wrote:
> 
> Hello Reuti,
> 
> We will try that, but we have also found another issue.
> 
> We also see that our SoGE fails and fails over from master to shadow, and vice versa, at the same time as this switch in job IDs occurs:

Is the spool directory shared between the qmaster and shadow daemons?

-- Reuti


> 03/13/2016 08:04:21|  main|mtlxsge001|I|qmaster hard descriptor limit is set to 8192
> 03/13/2016 08:04:21|  main|mtlxsge001|I|qmaster soft descriptor limit is set to 8192
> 03/13/2016 08:04:21|  main|mtlxsge001|I|qmaster will use max. 8172 file descriptors for communication
> 03/13/2016 08:04:21|  main|mtlxsge001|I|qmaster will accept max. 950 dynamic event clients
> 03/13/2016 08:04:21|  main|mtlxsge001|I|starting up SGE 8.1.8 (lx-amd64)
> 
> From qacct:
> jobnumber    351331              
> start_time   Sun Mar 13 08:04:28 2016
> end_time     Sun Mar 13 08:05:04 2016
> jobnumber    351488              
> start_time   Sun Mar 13 08:04:34 2016
> end_time     Sun Mar 13 08:05:05 2016
> jobnumber    351511              
> start_time   Sun Mar 13 08:04:54 2016
> end_time     Sun Mar 13 08:05:05 2016
> jobnumber    351410              
> start_time   Sun Mar 13 08:04:29 2016
> end_time     Sun Mar 13 08:05:07 2016
> jobnumber    351355              
> start_time   Sun Mar 13 08:04:28 2016
> end_time     Sun Mar 13 08:05:07 2016
> jobnumber    351502              
> start_time   Sun Mar 13 08:04:49 2016
> end_time     Sun Mar 13 08:05:08 2016
> jobnumber    9999253             
> start_time   Sun Mar 13 08:04:56 2016
> end_time     Sun Mar 13 08:05:08 2016
> start_time   Sun Mar 13 08:04:28 2016
> end_time     Sun Mar 13 08:05:53 2016
> jobnumber    9999337             
> start_time   Sun Mar 13 08:05:43 2016
> end_time     Sun Mar 13 08:05:53 2016
> jobnumber    9999254             
> start_time   Sun Mar 13 08:04:56 2016
> end_time     Sun Mar 13 08:05:57 2016
> 
> The job ID switch correlates in time with the SoGE failure and the subsequent failover to another node.
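
To make this correlation checkable, here is a minimal Python sketch (not part of the original exchange) that lines up the qmaster startup line with the qacct records quoted above. The function names, the 1000-ID threshold, and the 120-second window are illustrative assumptions, and the date parsing assumes an English/C locale.

```python
from datetime import datetime

# Startup line and qacct records copied from this message.
MESSAGES = "03/13/2016 08:04:21|  main|mtlxsge001|I|starting up SGE 8.1.8 (lx-amd64)"

QACCT = """\
jobnumber    351331
start_time   Sun Mar 13 08:04:28 2016
jobnumber    351488
start_time   Sun Mar 13 08:04:34 2016
jobnumber    351511
start_time   Sun Mar 13 08:04:54 2016
jobnumber    351410
start_time   Sun Mar 13 08:04:29 2016
jobnumber    351355
start_time   Sun Mar 13 08:04:28 2016
jobnumber    351502
start_time   Sun Mar 13 08:04:49 2016
jobnumber    9999253
start_time   Sun Mar 13 08:04:56 2016
start_time   Sun Mar 13 08:04:28 2016
jobnumber    9999337
start_time   Sun Mar 13 08:05:43 2016
jobnumber    9999254
start_time   Sun Mar 13 08:04:56 2016
"""


def startup_times(log_text):
    """Timestamps of 'starting up SGE' lines in a qmaster messages log."""
    times = []
    for line in log_text.splitlines():
        fields = line.split("|")
        if len(fields) >= 5 and fields[4].startswith("starting up SGE"):
            times.append(datetime.strptime(fields[0], "%m/%d/%Y %H:%M:%S"))
    return times


def parse_qacct(text):
    """Return sorted (start_time, jobnumber) pairs; records without a jobnumber are skipped."""
    jobs, current = [], None
    for line in text.splitlines():
        key, _, value = line.strip().partition(" ")
        value = value.strip()
        if key == "jobnumber":
            current = int(value)
        elif key == "start_time" and current is not None:
            jobs.append((datetime.strptime(value, "%a %b %d %H:%M:%S %Y"), current))
            current = None
    return sorted(jobs)


def jumps_after_restart(jobs, restarts, max_step=1000, window=120):
    """Pair each large job ID step with the latest restart at most `window` seconds before it."""
    hits = []
    for (_, a), (t1, b) in zip(jobs, jobs[1:]):
        if abs(b - a) <= max_step:
            continue
        recent = [r for r in restarts if 0 <= (t1 - r).total_seconds() <= window]
        hits.append((a, b, max(recent) if recent else None))
    return hits


for prev_id, next_id, restart in jumps_after_restart(parse_qacct(QACCT), startup_times(MESSAGES)):
    print(f"job ID step {prev_id} -> {next_id}, preceding qmaster start: {restart}")
```

On the records quoted here it reports a single step, from 351511 to 9999253, 35 seconds after the 08:04:21 startup. It only shows that the two events coincide; it says nothing about why the qmaster restarted.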
> 
> Basically, we now need to understand why SoGE fails...
> 
> We would appreciate any tips and advice on this.
> Thank you.
> 
> 
> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Tuesday, March 08, 2016 2:25 PM
> To: Yuri Burmachenko <yuribu at mellanox.com>
> Cc: users at gridengine.org
> Subject: Re: [gridengine users] SoGE 8.1.8 - Job IDs getting reset 
> very fast 9999999 ==> 1 - 6-7 times in a month
> 
> 
>> On 08.03.2016 at 10:59, Yuri Burmachenko <yuribu at mellanox.com> wrote:
>> 
>> Hello Reuti,
>> 
>> See below:
>> 
>> Job ID		Job schedule time
>> 97453		29-02-2016_03:18:55
>> 97454		29-02-2016_03:18:57
>> 9999563	29-02-2016_03:23:44
>> 9999564	29-02-2016_03:23:44
>> 9999565	29-02-2016_03:23:44
>> ....
>> 9999999	29-02-2016_03:27:34
>> 1		29-02-2016_03:27:35
>> 
>> Any idea what could be the root cause and/or where to look?
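
For anyone who wants to scan a longer listing for the same pattern, here is a minimal Python sketch (not part of the original exchange) that flags the discontinuities in the rows above. The 1000-ID threshold and the function name are illustrative assumptions; the rows elided by "...." are simply left out.

```python
from datetime import datetime

# Rows copied from the table above: (job ID, schedule time).
ROWS = [
    (97453, "29-02-2016_03:18:55"),
    (97454, "29-02-2016_03:18:57"),
    (9999563, "29-02-2016_03:23:44"),
    (9999564, "29-02-2016_03:23:44"),
    (9999565, "29-02-2016_03:23:44"),
    (9999999, "29-02-2016_03:27:34"),
    (1, "29-02-2016_03:27:35"),
]

MAX_STEP = 1000  # assumed: the largest ID step still considered normal


def find_discontinuities(rows, max_step=MAX_STEP):
    """Return (kind, prev_id, next_id, gap_seconds) for each suspicious step."""
    parsed = [(jid, datetime.strptime(ts, "%d-%m-%Y_%H:%M:%S")) for jid, ts in rows]
    events = []
    for (a, ta), (b, tb) in zip(parsed, parsed[1:]):
        step = b - a
        if step < 0:
            kind = "wrap"   # the counter went backwards
        elif step > max_step:
            kind = "jump"   # the counter skipped ahead
        else:
            continue
        events.append((kind, a, b, int((tb - ta).total_seconds())))
    return events


for event in find_discontinuities(ROWS):
    print(event)
```

On these rows it reports the jump from 97454 to 9999563 (287 seconds apart) and the wrap from 9999999 to 1 one second after the counter reached its maximum.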
> 
> Interesting. One could try `incron` to spot any access to the file "jobseqnum".
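
If `incron` is not available, a polling loop is a rough stand-in. The following is a minimal Python sketch (not part of the original exchange): it assumes that jobseqnum holds a single plain integer, uses the path mentioned later in this thread with a hypothetical `/opt/sge` fallback for `$SGE_ROOT`, and uses an arbitrary 1000-ID threshold. Unlike `incron`, it only records what changed and when, not which process touched the file.

```python
import os
import time
from datetime import datetime

# Assumed location; the /opt/sge fallback is hypothetical. Adjust to the real spool directory.
JOBSEQNUM = os.path.join(
    os.environ.get("SGE_ROOT", "/opt/sge"), "default/spool/qmaster/jobseqnum"
)


def read_seqnum(path):
    """Return the integer stored in the file, or None if it is missing or malformed."""
    try:
        with open(path) as fh:
            return int(fh.read().strip())
    except (OSError, ValueError):
        return None


def classify(prev, cur, max_step=1000):
    """Describe the transition between two consecutive readings."""
    if prev is None or cur is None:
        return "unreadable"
    if cur == prev:
        return "unchanged"
    if cur < prev:
        return "wrapped-or-reset"
    if cur - prev > max_step:
        return "jump"
    return "normal"


def watch(path, interval=1.0):
    """Poll the file forever and print every transition that is not a normal increment."""
    prev = read_seqnum(path)
    while True:
        time.sleep(interval)
        cur = read_seqnum(path)
        state = classify(prev, cur)
        if state not in ("unchanged", "normal"):
            print(f"{datetime.now().isoformat()} {state}: {prev} -> {cur}", flush=True)
        prev = cur


# To run it on the qmaster host, call:
# watch(JOBSEQNUM)
```

Polling also has the advantage of working when the file lives on NFS, where inotify-based tools do not see changes made from other hosts.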
> 
> -- Reuti
> 
> 
>> 
>> Thanks.
>> 
>> -----Original Message-----
>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: Sunday, March 06, 2016 7:27 PM
>> To: Yuri Burmachenko <yuribu at mellanox.com>
>> Cc: users at gridengine.org
>> Subject: Re: [gridengine users] SoGE 8.1.8 - Job IDs getting reset 
>> very fast 9999999 ==> 1 - 6-7 times in a month
>> 
>> Hi,
>> 
>> On 06.03.2016 at 18:04, Yuri Burmachenko wrote:
>> 
>>> Hello to the distinguished forum members,
>>> 
>>> Recently we found that something is wrong with the SGE job IDs: they are being reset very quickly, 6-7 times a month.
>>> We don't really run that many jobs in such a short period of time.
>>> 
>>> We use the job ID (via qacct) as a primary key in various home-made analytics tools, and these rapid job ID resets impair the reliability of the tools.
>>> 
>>> This started after a full power shutdown, during which we halted all our systems, including the SGE master/shadow and its execution hosts.
>> 
>> To elaborate on this: when it suddenly jumps to 9999999, what was the highest JOB_ID recorded in the accounting file before that skip?
>> 
>> -- Reuti
>> 
>> 
>>> Perhaps something sets $SGE_ROOT/default/spool/qmaster/jobseqnum to "9999999" and then something (related or not) restarts SGE, which then starts from that job ID.
>>> 
>>> Any tips and advice on where to look for the root cause will be greatly appreciated.
>>> Thank you.
>>> 
>>> 
>>> 
>>> Yuri Burmachenko | Sr. Engineer | IT | Mellanox Technologies Ltd.
>>> Work: +972 74 7236386 | Cell +972 54 7542188 |Fax: +972 4 959 3245 
>>> 
>>> _______________________________________________
>>> users mailing list
>>> users at gridengine.org
>>> https://gridengine.org/mailman/listinfo/users
>> 
>> 
> 
> 




