[gridengine users] difference between a task reschedule and a task kill in the epilog?

Reuti reuti at staff.uni-marburg.de
Wed Apr 4 15:50:14 UTC 2012


On 04.04.2012 at 17:42, Lars van der bijl wrote:

> Hey Reuti
> 
> On 4 April 2012 17:14, Reuti <reuti at staff.uni-marburg.de> wrote:
>> Well, in both cases it is killed of course. You could set loglevel to log_info and search the messages file of the qmaster for entries like:
>> 
>> 04/04/2012 17:03:07|worker|pc15370|W|job 3963.1 failed on host pc15370 rescheduling because: manual/auto rescheduling
>> 04/04/2012 17:03:07|worker|pc15370|W|rescheduling job 3963.1
>> 04/04/2012 17:03:46|worker|pc15370|I|reuti has deleted job 396
> 
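A minimal sketch in Python of how one could act on such entries. It scans lines from the qmaster messages file and reports whether the last matching entry for a job was a reschedule or a deletion. The patterns are derived only from the three sample lines above. The function name classify_job is hypothetical, and the sketch assumes the messages file is readable from wherever the script runs.

```python
import re

# Patterns taken from the sample log lines; other SGE versions may word them differently.
RESCHEDULED = re.compile(r"\|rescheduling job (\d+)(?:\.\d+)?")
DELETED = re.compile(r"\|\S+ has deleted job (\d+)\b")


def classify_job(lines, job_id):
    """Return 'rescheduled', 'deleted', or None for the last matching log entry."""
    wanted = str(job_id)
    result = None
    for line in lines:
        match = RESCHEDULED.search(line)
        if match and match.group(1) == wanted:
            result = "rescheduled"
            continue
        match = DELETED.search(line)
        if match and match.group(1) == wanted:
            result = "deleted"
    return result
```

It could be called as classify_job(open(path_to_messages), job_id), where path_to_messages is whatever location your qmaster spool directory uses.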
> I might have to rotate the file before I try something like that;
> it's currently 117 MB.
> 
>> 
>> Then you can act on this. Do you often need to reschedule a job? I wonder whether using a checkpointing environment would help (even if we don't intend to use any checkpointing at all). There you can define a migration procedure in migr_command.
> 
> No, it's not something I want to happen often, but it happens. One thing
> I'm still struggling with, on a related note, is that a task will keep
> running even after it is rescheduled, making both of the outputs
> useless.
> 
> Would we be able to make sure the task is kill -9'd (and its sub

The default behavior in SGE is:

# kill -9 -- -pid

Because the PID argument is negative, this kills the complete process group. The problem of surviving child processes should have been fixed as of 6.2u3, as I found out recently, but it sometimes still occurs.
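To illustrate the process-group kill outside of SGE, here is a small Python sketch. os.killpg(pgid, SIGKILL) has the same effect as `kill -9 -- -pgid` in the shell. The helper names and the demo command are made up for illustration, and this is not how SGE itself implements it.

```python
import os
import signal
import subprocess


def kill_process_group(pgid, sig=signal.SIGKILL):
    """Send sig to every process in the group pgid (like `kill -9 -- -pgid`)."""
    os.killpg(pgid, sig)


def demo_kill_group():
    """Start a shell with a background child in its own group, then kill the group."""
    # start_new_session=True makes the shell the leader of a new process group,
    # so the background sleep it spawns belongs to the same group.
    proc = subprocess.Popen(["sh", "-c", "sleep 60 & wait"], start_new_session=True)
    pgid = os.getpgid(proc.pid)
    kill_process_group(pgid)
    proc.wait()
    # A negative return code is the number of the signal that ended the shell.
    return proc.returncode
```

A child that has moved itself into a different process group (for example by calling setsid) is not reached by this signal, which is one way processes can survive.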


> pids) if it's rescheduled using checkpointing?

In fact, you have to do it on your own. SGE will start the migr_command; you then have to checkpoint by whatever means you have and kill all processes yourself. You can have a look at my Howto:

http://arc.liv.ac.uk/SGE/howto/checkpointing.html

and example5 therein. Rescheduling a job would then mean suspending it from the command line, which will start the migr_command.

-- Reuti


>> -- Reuti
>> 
>> 
>> On 04.04.2012 at 16:33, Lars van der bijl wrote:
>> 
>>> Is there a way to tell the difference?
>>> 
>>> If I reschedule a job I get these values in the usage file in the epilog:
>>> 
>>> wait_status=3727362
>>> exit_status=137
>>> signal=9
>>> start_time=1333549517
>>> end_time=1333549565
>>> ru_wallclock=48
>>> ru_utime=0.226965
>>> ru_stime=0.306953
>>> ru_maxrss=5408
>>> ru_ixrss=0
>>> ru_idrss=0
>>> ru_isrss=0
>>> ru_minflt=40792
>>> ru_majflt=5
>>> ru_nswap=0
>>> ru_inblock=7992
>>> ru_oublock=232
>>> ru_msgsnd=0
>>> ru_msgrcv=0
>>> ru_nsignals=0
>>> ru_nvcsw=3489
>>> ru_nivcsw=113
>>> 
>>> If I kill the job I get this:
>>> 
>>> wait_status=3727362
>>> exit_status=137
>>> signal=9
>>> start_time=1333549704
>>> end_time=1333549719
>>> ru_wallclock=15
>>> ru_utime=0.196970
>>> ru_stime=0.196970
>>> ru_maxrss=5412
>>> ru_ixrss=0
>>> ru_idrss=0
>>> ru_isrss=0
>>> ru_minflt=40459
>>> ru_majflt=0
>>> ru_nswap=0
>>> ru_inblock=0
>>> ru_oublock=232
>>> ru_msgsnd=0
>>> ru_msgrcv=0
>>> ru_nsignals=0
>>> ru_nvcsw=705
>>> ru_nivcsw=149
>>> 
>>> Does anyone know of a way to tell the difference from the epilog?
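A short Python sketch that compares the two listings above. It parses the key=value lines and extracts the fields that describe how the job ended. In both cases they come out identical (exit_status 137 is 128 plus signal 9), so the usage file alone does not seem to separate a reschedule from a kill. The function names are hypothetical.

```python
def parse_usage(text):
    """Parse key=value lines from a usage file into a dict of strings."""
    usage = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:
            usage[key] = value
    return usage


def termination_fields(usage):
    """Return only the fields that describe how the job ended."""
    return {key: usage.get(key) for key in ("wait_status", "exit_status", "signal")}
```

Comparing termination_fields() of both files gives equal dicts; only the timing and resource counters differ, and those say nothing about the cause.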
>> 
> 