[gridengine users] SGE-6.2u5: express resource ignored -> job suspends itself

Reuti reuti at staff.Uni-Marburg.DE
Wed Mar 6 11:30:22 UTC 2013


Hi,

On 06.03.2013 at 08:09, Erik Soyez wrote:

> Good morning,
> 
> we have a little cluster with the basic queue setup of 3 queues:
> 
> 	long
> 	regular
> 	express
> 
> We have 25 nodes, 23 of them have the queues long and regular, 2 of them
> have the queues regular and express.  "Long" is subordinate to "regular",
> "regular" is subordinate to "express".
> 
> There is a boolean resource called "express" attached to the express
> queue on the 2 "express nodes" and to the regular queue on the 23
> "long nodes".
> 
> Express jobs are submitted with "qsub -l express".

I remember seeing something similar with parallel jobs: some of their slots went to the wrong queue even though the h_rt request wasn't met there, and the jobs were aborted as a result. But as suddenly as I observed it, it was gone again.


> This setup works fine most of the time but it happens once in a while
> that a parallel express job runs in queue "regular" _and_ "express" on

For node03-06 this can indeed happen, but I thought it was fixed for 6.2u5 already.


> the same node and suspends itself, even though queue "regular" has no
> "express" resource (on that node):
> 
> ------------------------------------------------------------------------
> Complex values:
> #name      shortcut  type  relop  requestable  consumable  default  urgency
> express    prio      BOOL  ==     YES          NO          0        50000

What about a PE "express" instead of, or in addition to, the express complex? The job should then stay in the queue to which this PE is attached. If you want to use the urgency, the BOOL complex can of course stay attached as well.
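
As a rough sketch of the PE idea: the PE name "express" matches the suggestion above, but the slot count, the allocation rule and the other values below are only assumptions for illustration, not taken from your setup. The PE could be created with "qconf -ap express" along these lines:

    pe_name            express
    slots              999
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    /bin/true
    stop_proc_args     /bin/true
    allocation_rule    $fill_up
    control_slaves     TRUE
    job_is_first_task  FALSE
    urgency_slots      min
    accounting_summary FALSE

It would then be attached only to the express queue and requested at submission time (job.sh and the slot count 8 are placeholders):

    qconf -aattr queue pe_list express express
    qsub -pe express 8 -l express job.sh

Since the PE would appear only in the pe_list of the express queue, a job requesting it should not be able to get slots from "regular".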


> Queue "long":
> hostlist          @long
> complex_values    express=0

It's not necessary to set it to express=0, unless you want to submit explicitly with the request "-l express=0". If you don't specify the express complex, it's not considered as a condition which needs to be matched.


> Queue "express":
> hostlist          @express
> complex_values    express=1
> 
> Queue "regular":
> hostlist          @allhosts
> complex_values    express=1,[@express=express=0]

Well, it's not possible to "unset" the complex again. Maybe it would help to define it as TRUE only for node03-node06.
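
A sketch of what that could look like for the "regular" queue, assuming the hostgroup @long from your listing covers exactly the hosts where the complex should be TRUE (adjust to the hosts you actually mean):

    hostlist          @allhosts
    complex_values    NONE,[@long=express=TRUE]

This way the complex is simply not defined for "regular" on the @express hosts, instead of being set to 1 for all hosts and then overridden with 0.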

-- Reuti


> Hostgroup "@allhosts":
> @allhosts
>   @express
>      host01
>      host02
>   @long
>      host03
>      host04
>      [ ... ]
>      host25
> ------------------------------------------------------------------------
> 
> 
> - Why do jobs with "-l express" run in "regular@host01" even though it
>  does not have the express resource attached?
> 
> - Any ideas on how to work around this problem?
> 
> 
> Many thanks!!
> 
> Erik Soyez.