[gridengine users] Where do the factors for np_load_short come from?

Tim Landscheidt tim at tim-landscheidt.de
Thu May 16 17:18:20 UTC 2013


Hi,

We're using OGS/GE 2011.11 at toolserver.org, and unfortunately
our admins are AWOL.  I'm trying to find out why the grid is
heavily underutilized while jobs pile up in the queues.  A
simple job stays queued, and "qstat -j" reports this scheduling info:

| scheduling info:            queue instance "longrun-sol@willow.toolserver.org" dropped because it is temporarily not available
|                             queue instance "short-sol@willow.toolserver.org" dropped because it is temporarily not available
|                             queue instance "medium-lx@mayapple.toolserver.org" dropped because it is temporarily not available
|                             queue instance "longrun3-sol@willow.toolserver.org" dropped because it is temporarily not available
|                             queue instance "longrun2-sol@clematis.toolserver.org" dropped because it is disabled
|                             queue instance "longrun2-sol@hawthorn.toolserver.org" dropped because it is disabled
|                             queue instance "medium-sol@ortelius.toolserver.org" dropped because it is overloaded: np_load_short=0.845508 (= 0.645508 + 0.8 * 1.000000 with nproc=4) >= 0.75
|                             queue instance "medium-sol@wolfsbane.toolserver.org" dropped because it is overloaded: np_load_short=0.831445 (= 0.231445 + 0.8 * 6.000000 with nproc=8) >= 0.75
|                             queue instance "short-sol@ortelius.toolserver.org" dropped because it is overloaded: np_load_short=1.245508 (= 0.645508 + 0.8 * 3.000000 with nproc=4) >= 1.2
|                             queue instance "short-sol@wolfsbane.toolserver.org" dropped because it is overloaded: np_load_short=1.231445 (= 0.231445 + 0.8 * 10.000000 with nproc=8) >= 1.2
|                             queue instance "medium-lx@yarrow.toolserver.org" dropped because it is overloaded: np_load_short=1.202500 (= 0.002500 + 0.8 * 6.000000 with nproc=4) >= 1.2
|                             queue instance "medium-lx@nightshade.toolserver.org" dropped because it is full
|                             queue instance "longrun-lx@nightshade.toolserver.org" dropped because it is overloaded: mem_free=-173461503.737856 (= 13834.574219M - 500M * 28.000000) <= 500
|                             queue instance "longrun-lx@yarrow.toolserver.org" dropped because it is overloaded: np_load_short=3.202500 (= 0.002500 + 0.8 * 16.000000 with nproc=4) >= 3.1

For example, in queue instance
medium-lx@yarrow.toolserver.org, where do the factors 0.8
and 6.000000 come from?  Neither "qconf -sconf global" nor
"qconf -sconf yarrow" shows anything obvious, and "qconf -sq
medium-lx" only has load_thresholds with:

| [...]
| load_thresholds       np_load_short=1.2,np_load_long=1.5,cpu=98, \
|                       mem_free=1000M, \
|                       [mayapple.toolserver.org=np_load_short=2.1,mem_free=300M]
| [...]

to define the threshold, but not how the value is calculated.
I believe the factors are applied at
source/libs/sched/sge_select_queue.c:2057, but I don't want
to read the whole source :-).  Are these factors some default,
or where should I look?
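
As a sanity check on the arithmetic: the values printed above are
consistent with "reported load + factor * count / nproc" (for
medium-lx on yarrow: 0.002500 + 0.8 * 6 / 4 = 1.2025).  Below is a
minimal Python sketch that replays the np_load_short lines from the
output.  The helper name and the argument names are my own and are
not taken from the Grid Engine source; it only reproduces the
numbers and says nothing about where the factor and the count are
configured.

```python
import math


def adjusted_load(reported, factor, count, nproc):
    """Hypothetical helper: reported load plus factor * count, scaled by CPU count."""
    return reported + factor * count / nproc


# (label, reported load, factor, count, nproc, value shown, threshold),
# copied from the scheduling info output quoted in this message.
samples = [
    ("medium-sol on ortelius",  0.645508, 0.8,  1, 4, 0.845508, 0.75),
    ("medium-sol on wolfsbane", 0.231445, 0.8,  6, 8, 0.831445, 0.75),
    ("short-sol on ortelius",   0.645508, 0.8,  3, 4, 1.245508, 1.2),
    ("short-sol on wolfsbane",  0.231445, 0.8, 10, 8, 1.231445, 1.2),
    ("medium-lx on yarrow",     0.002500, 0.8,  6, 4, 1.202500, 1.2),
    ("longrun-lx on yarrow",    0.002500, 0.8, 16, 4, 3.202500, 3.1),
]

for label, reported, factor, count, nproc, shown, threshold in samples:
    value = adjusted_load(reported, factor, count, nproc)
    # The recomputed value should match the printed one to six decimals.
    matches = math.isclose(value, shown, abs_tol=1e-6)
    overloaded = value >= threshold
    print(f"{label}: {value:.6f} match={matches} overloaded={overloaded}")
```

On yarrow the reported load is only 0.0025, so almost the entire
value that trips the threshold comes from the 0.8 * count term.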

TIA,
Tim


