[gridengine users] PE Job Suspend / Resume

Erik Soyez E.Soyez at science-computing.de
Wed Jun 13 09:11:54 UTC 2012


Hi Reuti, that's why it is SIGTSTP, not SIGSTOP.  Erik Soyez.
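
To illustrate the difference, here is a minimal Python sketch, assuming a POSIX system. The helper name can_install_handler is made up for this example and has nothing to do with SGE or HP-MPI. It only shows that a process may install its own handler for SIGTSTP, while the kernel refuses one for SIGSTOP:

```python
import signal


def can_install_handler(signum):
    """Return True if this process may install its own handler for signum.

    Hypothetical helper for illustration only. Must run in the main
    thread, because signal.signal() is only allowed there.
    """
    try:
        previous = signal.signal(signum, lambda s, frame: None)
    except OSError:
        # The kernel rejects handlers for SIGSTOP (and SIGKILL).
        return False
    if previous is not None:
        # Put back whatever handler was installed before the probe.
        signal.signal(signum, previous)
    return True


if __name__ == "__main__":
    print("SIGTSTP can be trapped:", can_install_handler(signal.SIGTSTP))
    print("SIGSTOP can be trapped:", can_install_handler(signal.SIGSTOP))
```

A master process that traps SIGTSTP therefore gets a chance to act (for example, to notify its remote processes) before it stops itself. How HP-MPI does this internally is not shown here.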


On Wed, 13 Jun 2012, Reuti wrote:

> On 13.06.2012 at 08:39, Erik Soyez wrote:
>
>> Rayson, yes, it kind of worked with 6.2u5, but we used it mainly with
>> HP-MPI which only needs a SIGTSTP for the master process in order to
>> suspend the entire job.  Regards, Erik.
>
> How does this work? Usually SIGSTOP can't be trapped. So, are the
> other processes on the slave nodes stopping themselves because some kind
> of heartbeat is missing once the master process is already stopped? Later
> on, at a SIGCONT, the master process will of course have to wake them up
> again by distributing the signal.
>
>
>> On Wed, 13 Jun 2012, Rayson Ho wrote:
>>
>>> On Wed, Jun 13, 2012 at 1:47 AM, Erik Soyez
>>> <E.Soyez at science-computing.de> wrote:
>>>> You probably need some kind of cronjob to suspend and unsuspend your
>>>> parallel jobs correctly.  Or does anyone have a patch for this?
>>>
>>> Erik,
>>>
>>> So is/was it really working when you tried it with SGE 6.2u5?
>>>
>>> I have not looked into the code that handles parallel job suspension
>>> in detail (we were working on "near-by" code in 2008 and Shannon was
>>> also looking into suspending parallel jobs at that time, and thus
>>> we just relied on him to debug the code :-D ).
>>>
>>> However, in order to properly handle the case you mentioned, the
>>> qmaster will need to keep track of the number of times subordination
>>> happens to a job. And I can already think of issues if the accounting
>>> code is not accurate enough.
>>>
>>> Do you know if other batch systems handle the case you mentioned correctly?
>>>
>>>
>>>> On Tue, 12 Jun 2012, Joseph Farran wrote:
>>>>
>>>>> Well, for our needs, we *REALLY* need Parallel Job suspension.    It's
>>>>> not even a choice for us.
>>>>>
>>>>> If Torque/Maui can do it, I am sure OGE can do it without issues.
>>>>>
>>>>> Can someone please tell me what patch I need to install to un-break /
>>>>> turn-on Parallel job suspension?
>>>>>
>>>>> If you guys are that paranoid about PE suspension, how about adding an
>>>>> on/off flag for this since the code is already there and let the admin pick?
>>>>>
>>>>>
>>>>> On 06/12/2012 06:52 AM, Dave Love wrote:
>>>>>>
>>>>>> "Joseph A. Farran"<jfarran at uci.edu>  writes:
>>>>>>
>>>>>>> If you guys are taking requests, *please* add suspension and ignore old
>>>>>>> Sun recommendation.
>>>>>>
>>>>>> Support for suspension exists, it's just broken (per the issue Reuti
>>>>>> pointed to).  The use of | is clearly wrong, but the other bit isn't
>>>>>> clear.  It's one of the available patches I wanted to understand before
>>>>>> applying (and had forgotten about).  Can anyone cast more light on it?


-- 
Vorstandsvorsitzender/Chairman of the board of management:
Gerd-Lothar Leonhart
Vorstand/Board of Management:
Dr. Bernd Finkbeiner, Michael Heinrichs, 
Dr. Arno Steitz, Dr. Ingrid Zech
Vorsitzender des Aufsichtsrats/
Chairman of the Supervisory Board:
Philippe Miltin
Sitz/Registered Office: Tuebingen
Registergericht/Registration Court: Stuttgart
Registernummer/Commercial Register No.: HRB 382196



