[gridengine users] limit slots to core count no longer works

John Young j.e.young at larc.nasa.gov
Tue Apr 14 19:32:17 UTC 2015


Hello,

   We recently upgraded our cluster to Rocks 6.1.1, and we
now seem to be having problems with RQS.  On our old cluster,
we had a resource quota set (RQS) defined as follows:

{
   name         host-slots
   description  restrict slots to core count
   enabled      TRUE
   limit        hosts {*} to slots=$num_proc
}

The purpose of this was to prevent oversubscription of the
processors on the clients.  Now, with this quota enabled,
submitted jobs don't start, and 'qstat -j job-number' shows
messages like these under "scheduling info":

cannot run because it exceeds limit "////compute-0-7/" in rule "host-slots/1"
cannot run because it exceeds limit "////compute-0-7/" in rule "host-slots/1"
(-l slots=1) cannot run in queue "compute-0-39.local" because it offers only hc:slots=0.000000
cannot run because it exceeds limit "////compute-0-78/" in rule "host-slots/1"
cannot run because it exceeds limit "////compute-0-78/" in rule "host-slots/1"
cannot run because it exceeds limit "////compute-0-55/" in rule "host-slots/1"
cannot run because it exceeds limit "////compute-0-55/" in rule "host-slots/1"
cannot run because it exceeds limit "////compute-0-74/" in rule "host-slots/1"
cannot run because it exceeds limit "////compute-0-74/" in rule "host-slots/1"
cannot run because it exceeds limit "////compute-2-7/" in rule "host-slots/1"
cannot run because it exceeds limit "////compute-2-1/" in rule "host-slots/1"
cannot run because it exceeds limit "////compute-2-2/" in rule "host-slots/1"
cannot run because it exceeds limit "////compute-0-22/" in rule "host-slots/1"
cannot run because it exceeds limit "////compute-0-22/" in rule "host-slots/1"
cannot run because it exceeds limit "////compute-1-2/" in rule "host-slots/1"
cannot run in PE "mpich" because it only offers 0 slots

But as soon as I run 'qconf -mrqs' and change TRUE to FALSE, the job runs.
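
To see at a glance which hosts the rule is blocking, the messages above can
be tallied per host.  This is only a minimal sketch in Python, assuming the
"exceeds limit ... in rule ..." wording shown in the output above; the
function name tally_rqs_blocks and the regular expression are illustrative
and not part of Grid Engine.

```python
import re
from collections import Counter

# Matches scheduler messages of the form seen in the output above, e.g.
#   cannot run because it exceeds limit "////compute-0-7/" in rule "host-slots/1"
# Assumption: the message wording is stable across lines of one qstat run.
LIMIT_RE = re.compile(r'exceeds limit "([^"]*)" in rule "([^"]+)"')


def tally_rqs_blocks(scheduling_info):
    """Count 'exceeds limit' messages per (rule, host) pair."""
    counts = Counter()
    for line in scheduling_info.splitlines():
        match = LIMIT_RE.search(line)
        if match is None:
            # Other messages (queue or PE related) are ignored here.
            continue
        limit, rule = match.groups()
        # The limit string is slash-separated; in the output above the only
        # non-empty field is the host name, so take the last non-empty field.
        fields = [field for field in limit.split("/") if field]
        host = fields[-1] if fields else "(unknown)"
        counts[(rule, host)] += 1
    return counts
```

Feeding it the text captured from 'qstat -j job-number' and printing the
resulting counts gives one entry per host and rule, which makes it easier
to check whether every host is affected or only some of them.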

Has the process for preventing oversubscription changed?  Any ideas?

JY



