[gridengine users] Jobs are not being Terminated ( Job should have finished since )

Reuti reuti at staff.uni-marburg.de
Tue Oct 30 19:07:08 UTC 2012


On 30.10.2012 at 20:02, Joseph Farran wrote:

> Hi Reuti.
> 
> Yes, I had that already set:
> 
> qconf -sconf|fgrep execd_params
> execd_params                 ENABLE_ADDGRP_KILL=TRUE
> 
> What is strange is that about 1 in 10 jobs does get killed just fine when it goes past the hard wall clock limit.
> 
> However, the majority of the jobs are not killed when they go past their wall clock limit.
> 
> How can I investigate this further?

ps -e f -o ruid,euid,rgid,egid,stat,command --cols=500

(note the `f` without a leading dash) and please post the relevant lines for the application.
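
If the full listing is too long, it may help to narrow it down to the processes that carry the job's additional group id. A minimal sketch, assuming Linux with /proc; the function name `pids_with_group` and the group id 20005 in the example are made up for illustration, and the real id has to be looked up for the job in question:

```shell
# Print the PID of every process whose supplementary group list
# (the "Groups:" line of /proc/PID/status) contains the given numeric gid.
# The second argument is optional and defaults to /proc.
pids_with_group() {
    gid="$1"
    procdir="${2:-/proc}"
    for status in "$procdir"/[0-9]*/status; do
        # Processes can exit while we iterate; skip files that vanished.
        [ -r "$status" ] || continue
        awk -v gid="$gid" '
            /^Pid:/    { pid = $2 }
            /^Groups:/ {
                for (i = 2; i <= NF; i++)
                    if ($i == gid) { print pid; exit }
            }
        ' "$status" 2>/dev/null
    done
}
```

For example, `pids_with_group 20005` prints one PID per line, and those PIDs can then be looked up in the `ps` output. Processes that still carry the id but no longer sit below the job's shepherd in the tree would be the likely candidates for having jumped out of the process group.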

-- Reuti


> 
> 
> On 10/30/2012 11:44 AM, Reuti wrote:
>> Hi,
>> 
>> On 30.10.2012 at 19:31, Joseph Farran wrote:
>> 
>>> I googled this issue but did not find much help on the subject.
>>> 
>>> I have several queues with hard wall clock limits like this one:
>>> 
>>> # qconf -sq queue  | grep h_rt
>>> h_rt                  96:00:00
>>> 
>>> I am running Son of Grid Engine 8.1.2, and many jobs keep running past the hard wall clock limit.
>>> 
>>> Looking at the GE qmaster logs, I see dozens and dozens of these entries:
>>> 
>>>    10/30/2012 11:23:10|schedu|hpc|W|job 13179.1 should have finished since 42318s
>> Maybe they jumped out of the process tree (usually jobs are killed by `kill -9 -- -pgrp`). You can kill them by their additional group id, which is attached to all started processes even if they executed something like `setsid`:
>> 
>> $ qconf -sconf
>> ...
>> execd_params                 ENABLE_ADDGRP_KILL=TRUE
>> 
>> If it's still not working, we have to investigate the process tree.
>> 
>> HTH - Reuti
>> 
>> 
>>> These entries correspond to the running jobs that should have ended 96 hours ago, but they keep on running.
>>> 
>>> Why is GE not killing these jobs correctly when they run past the 96 hour limit, even though it complains that they should have ended?
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> users mailing list
>>> users at gridengine.org
>>> https://gridengine.org/mailman/listinfo/users
>> 
> 




