[gridengine users] SoGE 8.1.8 - Job IDs getting reset very fast 9999999 ==> 1 - 6-7 times in a month

Yuri Burmachenko yuribu at mellanox.com
Sun Mar 13 07:53:47 UTC 2016


Hello Reuti,

We will try that, but we have also found another issue.

We also see that our SoGE fails and fails over from master to shadow (and vice versa) at the same time as the jump in job IDs occurs:

03/13/2016 08:04:21|  main|mtlxsge001|I|qmaster hard descriptor limit is set to 8192
03/13/2016 08:04:21|  main|mtlxsge001|I|qmaster soft descriptor limit is set to 8192
03/13/2016 08:04:21|  main|mtlxsge001|I|qmaster will use max. 8172 file descriptors for communication
03/13/2016 08:04:21|  main|mtlxsge001|I|qmaster will accept max. 950 dynamic event clients
03/13/2016 08:04:21|  main|mtlxsge001|I|starting up SGE 8.1.8 (lx-amd64)

From qacct:
jobnumber    351331              
start_time   Sun Mar 13 08:04:28 2016
end_time     Sun Mar 13 08:05:04 2016
jobnumber    351488              
start_time   Sun Mar 13 08:04:34 2016
end_time     Sun Mar 13 08:05:05 2016
jobnumber    351511              
start_time   Sun Mar 13 08:04:54 2016
end_time     Sun Mar 13 08:05:05 2016
jobnumber    351410              
start_time   Sun Mar 13 08:04:29 2016
end_time     Sun Mar 13 08:05:07 2016
jobnumber    351355              
start_time   Sun Mar 13 08:04:28 2016
end_time     Sun Mar 13 08:05:07 2016
jobnumber    351502              
start_time   Sun Mar 13 08:04:49 2016
end_time     Sun Mar 13 08:05:08 2016
jobnumber    9999253             
start_time   Sun Mar 13 08:04:56 2016
end_time     Sun Mar 13 08:05:08 2016
start_time   Sun Mar 13 08:04:28 2016
end_time     Sun Mar 13 08:05:53 2016
jobnumber    9999337             
start_time   Sun Mar 13 08:05:43 2016
end_time     Sun Mar 13 08:05:53 2016
jobnumber    9999254             
start_time   Sun Mar 13 08:04:56 2016
end_time     Sun Mar 13 08:05:57 2016

The job ID jump coincides in time with the SoGE failure and the subsequent failover to the other node.
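
To make that correlation easier to check, here is a minimal sketch (not part of SoGE) that parses qacct-style text like the excerpt above and reports where the job ID jumps. The function names `parse_records` and `find_jumps` and the `min_gap` threshold are assumptions made for illustration; it expects the `jobnumber` / `start_time` layout shown above and skips any `start_time` line that has no preceding `jobnumber`.

```python
from datetime import datetime

# Timestamp layout as printed in the qacct excerpt, e.g. "Sun Mar 13 08:04:28 2016"
QACCT_TIME = "%a %b %d %H:%M:%S %Y"


def parse_records(text):
    """Return a list of (jobnumber, start_time) pairs from qacct-style text."""
    records = []
    job = None
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts[0], parts[1].strip()
        if key == "jobnumber":
            job = int(value)
        elif key == "start_time" and job is not None:
            records.append((job, datetime.strptime(value, QACCT_TIME)))
            job = None  # a start_time without its own jobnumber is ignored
    return records


def find_jumps(records, min_gap=1_000_000):
    """Return (previous_id, next_id, start_time) wherever consecutive jobs,
    ordered by start time, differ in ID by at least min_gap (hypothetical threshold)."""
    ordered = sorted(records, key=lambda rec: rec[1])
    jumps = []
    for (prev_id, _), (cur_id, cur_time) in zip(ordered, ordered[1:]):
        if abs(cur_id - prev_id) >= min_gap:
            jumps.append((prev_id, cur_id, cur_time))
    return jumps
```

Feeding it the text of a `qacct -j` run and comparing the reported timestamps with the "starting up SGE" lines in the qmaster messages file would show whether every jump follows a qmaster restart.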

So now we need to understand why SoGE fails in the first place.

We would appreciate any tips or advice on this.
Thank You.


-----Original Message-----
From: Reuti [mailto:reuti at staff.uni-marburg.de] 
Sent: Tuesday, March 08, 2016 2:25 PM
To: Yuri Burmachenko <yuribu at mellanox.com>
Cc: users at gridengine.org
Subject: Re: [gridengine users] SoGE 8.1.8 - Job IDs getting reset very fast 9999999 ==> 1 - 6-7 times in a month


> On 08.03.2016 at 10:59, Yuri Burmachenko <yuribu at mellanox.com> wrote:
> 
> Hello Reuti,
> 
> See below:
> 
> Job ID     Job schedule time
> 97453      29-02-2016_03:18:55
> 97454      29-02-2016_03:18:57
> 9999563    29-02-2016_03:23:44
> 9999564    29-02-2016_03:23:44
> 9999565    29-02-2016_03:23:44
> ....
> 9999999    29-02-2016_03:27:34
> 1          29-02-2016_03:27:35
> 
> Any idea what could be the root cause and/or where to look?

Interesting. One could try `incron` to spot any access to the file "jobseqnum".
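
If `incron` is not available, a rough stand-in is to poll the file and log every change. The sketch below is not incron and, unlike incron, only sees changes to the file, not who made them. All function names, the `max_step` threshold, and the polling interval are assumptions for illustration, and it assumes the file holds a single integer.

```python
import os
import time


def read_seqnum(path):
    """Return the integer stored in the file, or None if it cannot be read or parsed."""
    try:
        with open(path) as fh:
            return int(fh.read().strip())
    except (OSError, ValueError):
        return None


def snapshot(path):
    """Return (mtime, value) for the file, or (None, None) if it does not exist."""
    try:
        mtime = os.stat(path).st_mtime
    except OSError:
        return (None, None)
    return (mtime, read_seqnum(path))


def is_suspicious(old_value, new_value, max_step=10_000):
    """Flag a change that goes backwards or jumps forward by more than max_step."""
    if old_value is None or new_value is None:
        return False
    step = new_value - old_value
    return step < 0 or step > max_step


def watch(path, interval=1.0, rounds=None, log=print):
    """Poll the file every `interval` seconds and log each observed change.

    rounds=None polls until interrupted; an integer limits the number of polls.
    """
    last = snapshot(path)
    done = 0
    while rounds is None or done < rounds:
        time.sleep(interval)
        current = snapshot(path)
        if current != last:
            flag = "SUSPICIOUS" if is_suspicious(last[1], current[1]) else "ok"
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log(f"{stamp} {path}: {last[1]} -> {current[1]} [{flag}]")
            last = current
        done += 1
```

It could be started with something like `watch(os.path.expandvars("$SGE_ROOT/default/spool/qmaster/jobseqnum"))`, using the path mentioned elsewhere in this thread. A legitimate wraparound from 9999999 to 1 is also flagged, since it moves backwards.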

-- Reuti


> 
> Thanks.
> 
> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Sunday, March 06, 2016 7:27 PM
> To: Yuri Burmachenko <yuribu at mellanox.com>
> Cc: users at gridengine.org
> Subject: Re: [gridengine users] SoGE 8.1.8 - Job IDs getting reset 
> very fast 9999999 ==> 1 - 6-7 times in a month
> 
> Hi,
> 
> On 06.03.2016 at 18:04, Yuri Burmachenko wrote:
> 
>> Hello to the distinguished forum members,
>> 
>> Recently we found that something is wrong with our SGE job IDs: they wrap around very quickly, 6-7 times a month.
>> We do not run nearly enough jobs in such a short period to account for this.
>> 
>> We use the job ID (via qacct) as a primary key in various home-made analytics tools, and this rapid job ID wraparound undermines the tools' reliability.
>> 
>> This started after a full power shutdown, during which we halted all our systems, including the SGE master/shadow and its execution hosts.
> 
> To elaborate on this: when it suddenly jumps to 9999999, what was the highest JOB_ID recorded in the accounting file before that skip?
> 
> -- Reuti
> 
> 
>> Perhaps something sets $SGE_ROOT/default/spool/qmaster/jobseqnum to "9999999" and then something (related or not) restarts SGE setting that jobid.
>> 
>> Any tips or advice on where to look for the root cause would be greatly appreciated.
>> Thank You.
>> 
>> 
>> 
>> Yuri Burmachenko | Sr. Engineer | IT | Mellanox Technologies Ltd.
>> Work: +972 74 7236386 | Cell +972 54 7542188 |Fax: +972 4 959 3245 
>> 
>> _______________________________________________
>> users mailing list
>> users at gridengine.org
>> https://gridengine.org/mailman/listinfo/users
> 
> 




