[gridengine users] Advance reservations and disabled nodes/queues - new AR problem

Sabine Kreidl sabine.kreidl at uibk.ac.at
Thu Jun 14 15:36:37 UTC 2012


Thanks very much for all the suggestions, and sorry for my late
follow-up. I have to admit that adding disabled nodes to the AR has its
advantages, e.g. for maintenance windows. It would be a very nice
feature, though, if one could specify the desired behavior
(respecting disabled nodes vs. omitting them from the AR) with an option
to qrsub. A potential RFE? :-)

I recently ran into a (new) problem with a waiting AR for a maintenance
window. Admittedly, this system runs SGE 6.2u3, so this may be a known
issue that is already resolved in newer versions:

We have two queues, only one of which (par.q) accepts parallel jobs,
i.e. is associated with our defined PEs. I submitted the AR via

    qrsub -u XXX,YYY -a 07051000 -e 07091000 -pe openmpi-* 1008

and it was granted within par.q (default job runtimes are 10 days, so we
still have plenty of time).
All of a sudden, the available slots for all instances of par.q were set
to 0 and no parallel jobs were scheduled anymore. Accordingly,
"qstat -g c" showed a negative count of available slots in par.q (some
parallel jobs were still running). Since I suspected the AR, I deleted
it, but a restart of the master was necessary before the default 8 cores
per queue instance were recognized again.
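
For anyone who wants to watch for this symptom, here is a minimal sketch
that scans "qstat -g c" output for cluster queues with a negative AVAIL
count. The column layout is an assumption based on typical 6.2-era
output, so please check it against your version. The sample text below
is hypothetical (made-up numbers, and "ser.q" is only a placeholder name
for our second queue); it is not a capture from our system.

```python
def negative_avail_queues(qstat_text):
    """Return {queue_name: avail} for queues whose AVAIL count is negative."""
    lines = [ln for ln in qstat_text.splitlines() if ln.strip()]
    header = lines[0].split()
    # "CLUSTER QUEUE" is two header tokens but one data field, so the
    # AVAIL column is located by counting from the right-hand side.
    offset = header.index("AVAIL") - len(header)
    flagged = {}
    for ln in lines[1:]:
        if set(ln.strip()) == {"-"}:
            continue  # separator line made of dashes
        fields = ln.split()
        if len(fields) < -offset:
            continue  # not a data row
        avail = int(fields[offset])
        if avail < 0:
            flagged[fields[0]] = avail
    return flagged


# Hypothetical sample output (assumed layout, invented numbers).
SAMPLE = """\
CLUSTER QUEUE                   CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE
--------------------------------------------------------------------------------
par.q                             0.45    496      0   -496      0      0      0
ser.q                             0.10     12      0    116    128      0      0
"""

print(negative_avail_queues(SAMPLE))
```

In real use, the text would come from running "qstat -g c" (for example
via subprocess.run with capture_output=True and text=True) instead of
the SAMPLE string.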

Has anyone seen this behavior before, and do you have any suggestions
on how to avoid the problem?

Thanks again and best regards,
Sabine


On 16.02.2012 01:06, Dave Love wrote:
> William Hay <w.hay at ucl.ac.uk> writes:
>
>> We have a complex associated with every node called status that is
>> normally set to OK.  When a node has a problem we set it to a
>> description of the problem instead.   Our JSV ensures jobs always
>> request status=OK.  With a similar complex you could request status=OK
>> when making the AR.
> Yes, I think that's the only solution currently for disabled queues, but
> I'd guess it's straightforward to avoid them as an option if someone
> would like to try.  We don't currently use AR, so I haven't looked at
> it.
>
>> We also have a script that lists out nodes that aren't OK and their
>> status.  Essentially duplicating the functionality of pbsnodes under
>> Torque.  With this available as a permanent way to disable nodes we've
>> set queues to enabled at startup and use qmod -d to mean "disabled
>> till next reboot" only.
> I tag bad nodes with a comment and put them into a "testing" hostgroup
> with access only for admins (via RQS, which will be ignored for AR for a
> reason I don't follow).  I think if node user_lists were used instead of
> the RQS to restrict access, an AR would exclude the bad nodes for
> non-admins, but I'm not sure.
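
The workaround William Hay describes above (requesting status=OK when
making the AR) might look roughly like this for the reservation from
this thread. This is only a sketch: the "status" complex is specific to
his site and would first have to be defined and set on every node, and
build_qrsub_args is a hypothetical helper name. It only assembles the
argument list and does not call qrsub itself.

```python
import shlex


def build_qrsub_args(users, start, end, pe, slots, status="OK"):
    """Build a qrsub argument list that also requests status=<status>."""
    return [
        "qrsub",
        "-u", ",".join(users),    # users allowed to use the AR
        "-a", start,              # start time
        "-e", end,                # end time
        "-pe", pe, str(slots),    # parallel environment and slot count
        "-l", f"status={status}", # assumed site-defined complex
    ]


args = build_qrsub_args(["XXX", "YYY"], "07051000", "07091000",
                        "openmpi-*", 1008)
# shlex.join quotes the PE wildcard so the shell does not expand it.
print(shlex.join(args))
```

Whether the AR then really skips nodes whose status is not OK depends on
how the complex is configured, so it is worth trying with a small
reservation first.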