[gridengine users] PE Job Suspend / Resume

Reuti reuti at staff.uni-marburg.de
Wed Jun 13 09:23:39 UTC 2012


Hi,

On 13.06.2012 at 11:11, Erik Soyez wrote:

> Hi Reuti, that's why it is SIGTSTP, not SIGSTOP.  Erik Soyez.

Aha, so that signal can then be set in suspend_method.
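
To illustrate the difference, here is a minimal Python sketch. It is not taken from HP-MPI or Grid Engine, and the names on_tstp, can_trap and demo are made up for this example. It only shows that a process may install its own handler for SIGTSTP, whereas a handler for SIGSTOP is rejected. What a real MPI master does inside such a handler is an assumption here.

```python
import os
import signal

received = []  # signal numbers seen by the handler


def on_tstp(signum, frame):
    # Hypothetical hook: an MPI master could forward the suspend
    # request to its remote ranks here before stopping itself.
    received.append(signum)


def can_trap(sig):
    """Return True if this process may install a handler for sig."""
    try:
        previous = signal.signal(sig, on_tstp)
    except OSError:
        # The kernel rejects handlers for SIGSTOP (and SIGKILL).
        return False
    signal.signal(sig, previous)  # put the old disposition back
    return True


def demo():
    results = {
        "SIGTSTP": can_trap(signal.SIGTSTP),
        "SIGSTOP": can_trap(signal.SIGSTOP),
    }
    # Send SIGTSTP to ourselves with the handler installed: the
    # handler runs instead of the default "stop the process" action.
    signal.signal(signal.SIGTSTP, on_tstp)
    os.kill(os.getpid(), signal.SIGTSTP)
    return results


if __name__ == "__main__":
    print(demo())
    print("handler saw:", received)
```

Run from the main thread on a POSIX system, the process keeps running after the self-sent SIGTSTP because the handler replaces the default action. A job that really wants to stop would still have to stop itself afterwards.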

-- Reuti


> On Wed, 13 Jun 2012, Reuti wrote:
> 
>> On 13.06.2012 at 08:39, Erik Soyez wrote:
>> 
>>> Rayson, yes, it kind of worked with 6.2u5, but we used it mainly with
>>> HP-MPI which only needs a SIGTSTP for the master process in order to
>>> suspend the entire job.  Regards, Erik.
>> 
>> How does this work? Usually SIGSTOP can't be trapped. So do the other processes on the slave nodes stop themselves because some kind of heartbeat goes missing once the master process is stopped? Later on, at a SIGCONT, the master process will of course have to wake them up again by distributing the signal.
>> 
>> 
>>> On Wed, 13 Jun 2012, Rayson Ho wrote:
>>> 
>>>> On Wed, Jun 13, 2012 at 1:47 AM, Erik Soyez
>>>> <E.Soyez at science-computing.de> wrote:
>>>>> You probably need some kind of cronjob to suspend and unsuspend your
>>>>> parallel jobs correctly.  Or does anyone have a patch for this?
>>>> 
>>>> Erik,
>>>> 
>>>> So was it really working when you tried it with SGE 6.2u5?
>>>> 
>>>> I have not looked into the code that handles parallel job suspension
>>>> in detail (we were working on "near-by" code in 2008 and Shannon was
>>>> also looking into suspending parallel jobs at that time, and thus
>>>> we just relied on him to debug the code :-D ).
>>>> 
>>>> However, in order to properly handle the case you mentioned, the
>>>> qmaster will need to keep track of the number of times subordination
>>>> happens to a job. And I can already think of issues if the accounting
>>>> code is not accurate enough.
>>>> 
>>>> Do you know if other batch systems handle the case you mentioned correctly?
>>>> 
>>>> 
>>>>> On Tue, 12 Jun 2012, Joseph Farran wrote:
>>>>> 
>>>>>> Well, for our needs, we *REALLY* need Parallel Job suspension.  It's
>>>>>> not even a choice for us.
>>>>>> 
>>>>>> If Torque/Maui can do it, I am sure OGE can do it without issues.
>>>>>> 
>>>>>> Can someone please tell me what patch I need to install to un-break /
>>>>>> turn-on Parallel job suspension?
>>>>>> 
>>>>>> If you guys are that paranoid about PE suspension, how about adding an
>>>>>> on/off flag for this since the code is already there and let the admin pick?
>>>>>> 
>>>>>> 
>>>>>> On 06/12/2012 06:52 AM, Dave Love wrote:
>>>>>>> 
>>>>>>> "Joseph A. Farran"<jfarran at uci.edu>  writes:
>>>>>>> 
>>>>>>>> If you guys are taking requests, *please* add suspension and ignore the old
>>>>>>>> Sun recommendation.
>>>>>>> 
>>>>>>> Support for suspension exists, it's just broken (per the issue Reuti
>>>>>>> pointed to).  The use of | is clearly wrong, but the other bit isn't
>>>>>>> clear.  It's one of the available patches I wanted to understand before
>>>>>>> applying (and had forgotten about).  Can anyone cast more light on it?
> 
> 
> -- 
> Vorstandsvorsitzender/Chairman of the board of management:
> Gerd-Lothar Leonhart
> Vorstand/Board of Management:
> Dr. Bernd Finkbeiner, Michael Heinrichs, Dr. Arno Steitz, Dr. Ingrid Zech
> Vorsitzender des Aufsichtsrats/
> Chairman of the Supervisory Board:
> Philippe Miltin
> Sitz/Registered Office: Tuebingen
> Registergericht/Registration Court: Stuttgart
> Registernummer/Commercial Register No.: HRB 382196