[gridengine users] PE Job Suspend / Resume

Erik Soyez E.Soyez at science-computing.de
Wed Jun 13 06:39:55 UTC 2012


Rayson, yes, it more or less worked with 6.2u5, but we used it mainly
with HP-MPI, which only needs a SIGTSTP sent to the master process in
order to suspend the entire job.  Regards, Erik.
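
To illustrate the signalling described above, here is a minimal sketch,
assuming the PID of the MPI master process is already known.  The names
suspend_job, resume_job, master_pid and the send parameter are
hypothetical and not part of SGE or HP-MPI; whether the remaining ranks
actually stop depends on the MPI implementation reacting to the signal.

```python
import os
import signal


def suspend_job(master_pid, send=os.kill):
    """Send SIGTSTP to the master process only (assumed to propagate the stop)."""
    # SIGTSTP can be caught by the receiver, unlike SIGSTOP, so the
    # master has a chance to react before it stops.
    send(master_pid, signal.SIGTSTP)


def resume_job(master_pid, send=os.kill):
    """Send SIGCONT to the master process to continue the job."""
    send(master_pid, signal.SIGCONT)
```

The send argument defaults to os.kill and exists only so the functions
can be exercised without signalling a real process.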


On Wed, 13 Jun 2012, Rayson Ho wrote:

> On Wed, Jun 13, 2012 at 1:47 AM, Erik Soyez
> <E.Soyez at science-computing.de> wrote:
>> You probably need some kind of cronjob to suspend and unsuspend your
>> parallel jobs correctly.  Or does anyone have a patch for this?
>
> Erik,
>
> So was it really working when you tried it with SGE 6.2u5?
>
> I have not looked into the code that handles parallel job suspension
> in detail (we were working on "near-by" code in 2008 and Shannon was
> also looking into suspending parallel jobs at that time, and thus
> we just relied on him to debug the code :-D ).
>
> However, in order to properly handle the case you mentioned, the
> qmaster will need to keep track of the number of times subordination
> happens to a job. And I can already think of issues if the accounting
> code is not accurate enough.
>
> Do you know if other batch systems handle the case you mentioned correctly?
>
>
>> On Tue, 12 Jun 2012, Joseph Farran wrote:
>>
>>> Well, for our needs, we *REALLY* need Parallel Job suspension.  It's
>>> not even a choice for us.
>>>
>>> If Torque/Maui can do it, I am sure OGE can do it without issues.
>>>
>>> Can someone please tell me what patch I need to install to un-break /
>>> turn-on Parallel job suspension?
>>>
>>> If you guys are that paranoid about PE suspension, how about adding an
>>> on/off flag for this since the code is already there and let the admin pick?
>>>
>>>
>>> On 06/12/2012 06:52 AM, Dave Love wrote:
>>>>
>>>> "Joseph A. Farran"<jfarran at uci.edu>  writes:
>>>>
>>>>> If you guys are taking requests, *please* add suspension and ignore old
>>>>> Sun recommendation.
>>>>
>>>> Support for suspension exists, it's just broken (per the issue Reuti
>>>> pointed to).  The use of | is clearly wrong, but the other bit isn't
>>>> clear.  It's one of the available patches I wanted to understand before
>>>> applying (and had forgotten about).  Can anyone cast more light on it?


