[gridengine users] limit slots to core count no longer works

Reuti reuti at staff.uni-marburg.de
Wed Apr 15 10:32:40 UTC 2015


Hi,

> On 14.04.2015 at 21:32, John Young <j.e.young at larc.nasa.gov> wrote:
> 
> Hello,
> 
>   We (fairly) recently upgraded our cluster to Rocks 6.1.1
> and we now seem to be having problems with RQS.  On our old
> cluster, we had an RQS quota set as follows:
> 
> {
>   name         host-slots
>   description  restrict slots to core count
>   enabled      TRUE
>   limit        hosts {*} to slots=$num_proc
> }
> 
> The reason for this was to try to prevent oversubscription
> of the processors on the clients.  Now, if I have this quota
> enabled, jobs that are submitted don't start and if I do a
> 'qstat -j job-number' under "scheduling info" I see things like
> 
> cannot run because it exceeds limit "////compute-0-7/" in rule "host-slots/1"
> cannot run because it exceeds limit "////compute-0-7/" in rule "host-slots/1"
> (-l slots=1) cannot run in queue "compute-0-39.local" because it offers only hc:slots=0.000000
> cannot run because it exceeds limit "////compute-0-78/" in rule "host-slots/1"
> cannot run because it exceeds limit "////compute-0-78/" in rule "host-slots/1"
> cannot run because it exceeds limit "////compute-0-55/" in rule "host-slots/1"
> cannot run because it exceeds limit "////compute-0-55/" in rule "host-slots/1"
> cannot run because it exceeds limit "////compute-0-74/" in rule "host-slots/1"
> cannot run because it exceeds limit "////compute-0-74/" in rule "host-slots/1"
> cannot run because it exceeds limit "////compute-2-7/" in rule "host-slots/1"
> cannot run because it exceeds limit "////compute-2-1/" in rule "host-slots/1"
> cannot run because it exceeds limit "////compute-2-2/" in rule "host-slots/1"
> cannot run because it exceeds limit "////compute-0-22/" in rule "host-slots/1"
> cannot run because it exceeds limit "////compute-0-22/" in rule "host-slots/1"
> cannot run because it exceeds limit "////compute-1-2/" in rule "host-slots/1"
> cannot run in PE "mpich" because it only offers 0 slots
> 
> But as soon as I run 'qconf -mrqs' and change TRUE to FALSE, the job runs.
> 
> Has the process for preventing oversubscription changed?  Any ideas?

Well, I have noticed this too from time to time; it may disappear again at some point. I would judge it a bug in that version of SGE.
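
While the behaviour persists, it may help to see which hosts the scheduler reports as exceeding the quota. The following is only a minimal sketch, assuming the "scheduling info" lines from 'qstat -j' look like the ones quoted above. The function name summarize_rqs_rejections and the SAMPLE text are made up for illustration and are not part of SGE.

```python
import re
from collections import Counter

# Matches scheduling-info lines such as:
#   cannot run because it exceeds limit "////compute-0-7/" in rule "host-slots/1"
LIMIT_RE = re.compile(r'exceeds limit "([^"]*)" in rule "([^"]*)"')


def summarize_rqs_rejections(scheduling_info):
    """Count quota rejections per (rule, non-empty limit fields) pair.

    Assumption: the limit string is a slash-separated filter in which
    empty fields mean "not restricted"; in the output quoted above the
    only non-empty field is the host name.
    """
    counts = Counter()
    for line in scheduling_info.splitlines():
        match = LIMIT_RE.search(line)
        if match is None:
            continue  # e.g. the queue or PE messages, which name no rule
        limit, rule = match.groups()
        fields = tuple(part for part in limit.split("/") if part)
        counts[(rule, fields)] += 1
    return counts


# Hypothetical excerpt, copied from the kind of output shown above.
SAMPLE = '''\
cannot run because it exceeds limit "////compute-0-7/" in rule "host-slots/1"
cannot run because it exceeds limit "////compute-0-7/" in rule "host-slots/1"
(-l slots=1) cannot run in queue "compute-0-39.local" because it offers only hc:slots=0.000000
cannot run because it exceeds limit "////compute-2-7/" in rule "host-slots/1"
cannot run in PE "mpich" because it only offers 0 slots
'''

if __name__ == "__main__":
    for (rule, fields), n in sorted(summarize_rqs_rejections(SAMPLE).items()):
        print(rule, "/".join(fields), n)
```

If every host in the cluster shows up under "host-slots/1" even when idle, that would point to the rule itself being evaluated wrongly rather than to real oversubscription.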

-- Reuti


