[gridengine users] SGE-6.2u5: express resource ignored -> job suspends itself

Erik Soyez E.Soyez at science-computing.de
Wed Mar 6 12:19:13 UTC 2013


Thanks Reuti for your quick-as-always answer!

I have added an additional express-PE for each existing PE which
should ensure that express jobs really stick to the express queue;
users submit with wildcard-PEs anyway, so they don't need to change
anything.
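
As an illustration only, here is a minimal sketch of what such an extra PE
could look like. The PE name "mpi_express" and every field value are
hypothetical and not taken from the cluster described in this thread; the
field layout follows what "qconf -sp" prints in 6.2.

```
# hypothetical example PE, all names and values are assumptions
pe_name            mpi_express
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE

# hypothetical pe_list entry in the configuration of queue "express":
# pe_list            mpi_express
```

With an existing PE called "mpi" (also an assumed name), a wildcard request
such as qsub -l express -pe "mpi*" 8 would match both PEs. A job is granted
exactly one PE, so once it gets "mpi_express" it can only take slots from
queues that list this PE. The comment lines are annotations and may need to
be dropped before loading the definition with "qconf -Ap".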

Erik Soyez.


On Wed, 6 Mar 2013, Reuti wrote:

> Hi,
>
> On 06.03.2013 at 08:09, Erik Soyez wrote:
>
>> Good morning,
>>
>> we have a little cluster with the basic queue setup of 3 queues:
>>
>> 	long
>> 	regular
>> 	express
>>
>> We have 25 nodes, 23 of them have the queues long and regular, 2 of them
>> have the queues regular and express.  "Long" is subordinate to "regular",
>> "regular" is subordinate to "express".
>>
>> There is a boolean resource called "express" attached to the express
>> queue on the 2 "express nodes" and to the regular queue on the 23
>> "long nodes".
>>
>> Express jobs are submitted with "qsub -l express".
>
> I remember seeing it with parallel jobs going to the wrong queue for
> some slots despite the fact that the h_rt wasn't met and they were
> aborted as a result. But as suddenly as I observed it, it was gone again.
>
>
>> This setup works fine most of the time but it happens once in a while
>> that a parallel express job runs in queue "regular" _and_ "express" on
>
> For node03-06 this can indeed happen, but I thought it was fixed for 6.2u5 already.
>
>
>> the same node and suspends itself, even though queue "regular" has no
>> "express" resource (on that node):
>>
>> ------------------------------------------------------------------------
>> Complex values:
>> #name      shortcut  type  relop  requestable  consumable  default  urgency
>> express    prio      BOOL  ==     YES          NO          0        50000
>
> What about a PE "express" instead of/in addition to the express complex?
> It should stay in the queue to which this PE is attached then. In case
> you want to use the urgency, the BOOL complex can stay attached in
> addition of course.
>
>
>> Queue "long":
>> hostlist          @long
>> complex_values    express=0
>
> It's not necessary to set it to express=0, unless you want to submit
> explicitly with the request "-l express=0". If you don't specify the
> express complex, it's not considered as a condition which needs to be
> matched.
>
>
>> Queue "express":
>> hostlist          @express
>> complex_values    express=1
>>
>> Queue "regular":
>> hostlist          @allhosts
>> complex_values    express=1,[@express=express=0]
>
> Well, it's not possible to "unset" the complex again. Maybe it would
> help to define it only for node03-node06 as being TRUE.
>
> -- Reuti
>
>
>> Hostgroup "@allhosts":
>> @allhosts
>>   @express
>>      host01
>>      host02
>>   @long
>>      host03
>>      host04
>>      [ ... ]
>>      host25
>> ------------------------------------------------------------------------
>>
>>
>> - Why do jobs with "-l express" run in "regular@host01" even though it
>>  does not have the express resource attached?
>>
>> - Any ideas on how to work around this problem?


-- 
Vorstandsvorsitzender/Chairman of the board of management:
Gerd-Lothar Leonhart
Vorstand/Board of Management:
Dr. Bernd Finkbeiner, Michael Heinrichs, 
Dr. Arno Steitz, Dr. Ingrid Zech
Vorsitzender des Aufsichtsrats/
Chairman of the Supervisory Board:
Philippe Miltin
Sitz/Registered Office: Tuebingen
Registergericht/Registration Court: Stuttgart
Registernummer/Commercial Register No.: HRB 382196