[gridengine users] Making the fair-share policy/scheduler algorithm "more fair"

Mark Dixon m.c.dixon at leeds.ac.uk
Wed May 1 08:41:46 UTC 2013

On Wed, 1 May 2013, Jake Carroll wrote:

> Mark,
> Thanks for the response. This is opening up a bunch of cool ideas for us.
> We're trying to get our heads around how the scaling factor stuff actually
> "works" however.
> For example, if a host policy says scale factor for mem = 1.0, but we
> could perhaps set it to 0.50, what does that actually *mean*? How does it
> change the "scale" factor and what impact does it have on the way the
> scheduler works to utilise memory on that node? Trying to get a
> better handle on the semantics of this thing.

The scheduler keeps the notion of the "usage" by a particular job. This is 
what is injected into the share tree and, presumably, the functional tree. 
It is just a number (visible from within qmon). What's important is the 
relative usage between your users (for functional) or relative 
cumulative usage (for share tree).

Usage is defined as a bunch of weightings multiplied by usage of cpu/mem/io 
in the units I mentioned in my last email:

   usage = (wcpu * cpu) + (wmem * mem) + (wio * io)

If you have wcpu = 0.5, wmem = 0.5, wio = 0 and you have a job on 4 slots, 
lasting for 1 day and consuming 2GB of RAM per slot, it would generate a 
usage of:

   usage = (0.5 * 4*1*24*60*60) + (0.5 * 2*4*1*24*60*60) = 518400
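
To play with the numbers, here is a minimal Python sketch of the same calculation. It assumes cpu is accounted in slot-seconds and mem in GB-seconds, which is how the worked example above is laid out; the function and parameter names are made up for illustration and are not part of Grid Engine.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def job_usage(slots, days, gb_per_slot, wcpu=0.5, wmem=0.5, wio=0.0, io=0.0):
    """Weighted usage of one job: (wcpu * cpu) + (wmem * mem) + (wio * io)."""
    wallclock = days * SECONDS_PER_DAY
    cpu = slots * wallclock                # assumed unit: slot-seconds
    mem = gb_per_slot * slots * wallclock  # assumed unit: GB-seconds
    return wcpu * cpu + wmem * mem + wio * io


# The job from the example: 4 slots, 1 day, 2GB per slot.
example = job_usage(slots=4, days=1, gb_per_slot=2)
print(example)
```

Changing the weights or the job shape and re-running gives a quick feel for how strongly memory-heavy jobs are charged relative to cpu-heavy ones.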

What the usage actually "means" in practice depends on the weights you 
plug into the usage calculation. The weight numbers can be defined very 
precisely and so it can be difficult to decide on _exact_ values to put 
in, particularly if you have an inquisitive user or manager looking over 
your shoulder asking questions.

Personally, I put together a simple spreadsheet to play around and 
investigate the usage generated by different types of job and weights. I 
then came up with a simple model which gave a precise answer for the 
weights. I don't really care about the number of decimal places in the 
answer, but it means I can point to the spreadsheet if I'm challenged :) 
It also means anyone who wants it changed has to come up with a better 
model first :)

> For example, we have "small" node and a "large" node in the same queue,
> like so:
> complex_values        virtual_free=92G,h_vmem=92G
> complex_values        virtual_free=373G,h_vmem=373G
> So - how does the scale factor etc actually impact the scheduler's use of
> the node?

usage_scaling allows you to tell the scheduler that not all slots, or all 
RAM, in the cluster should be considered equal.

For example, if you think that it is effective _occupancy_ of nodes that 
is important, you might want to scale the memory usage value before it 
feeds into the main usage calculation, so that you generate the same usage 
if you fill up a node, no matter how much memory that node has. In the 
case above, your second node has roughly 4 times as much memory as the 
first, so you might want to use a usage_scaling of mem=0.25 for the 
second node.
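
As a rough sketch of that occupancy idea: take the small node as the reference and scale each node's memory usage by reference size over node size. The helper name is hypothetical, and the sketch only models the arithmetic, not how the scheduler applies it internally.

```python
def mem_scaling(node_mem_gb, reference_mem_gb):
    """Scaling factor that makes a full node count like a full reference node."""
    return reference_mem_gb / node_mem_gb


SMALL_GB, LARGE_GB = 92, 373  # the two nodes from the question
SECONDS = 60 * 60             # fill each node for one hour

scale_small = mem_scaling(SMALL_GB, SMALL_GB)  # 1.0, the reference node
scale_large = mem_scaling(LARGE_GB, SMALL_GB)  # about 0.25

# Memory usage (GB-seconds) of a job that fills each node, after scaling.
scaled_small = SMALL_GB * SECONDS * scale_small
scaled_large = LARGE_GB * SECONDS * scale_large

print(f"usage_scaling mem={scale_large:.6f}")
print(scaled_small, scaled_large)
```

Both scaled values come out the same (up to floating-point rounding), which is the point: filling either node is charged equally.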

Alternatively, if you have a number of clusters and don't want to bother 
mucking around with working out good values for your usage weightings all 
the time, you can use usage_scaling to normalise all your node memory 
sizes to the same value and then use the same weightings on all clusters.

Or, if you have a mixture of generally available nodes that share tree 
calculations should be done for, and nodes dedicated to specific users 
that shouldn't (e.g. they bought them), you can just stop the dedicated 
nodes from contributing to a particular user's usage via a usage_scaling 
of cpu=0.000000,mem=0.000000,io=0.000000.
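
To illustrate the effect, here is a small sketch that applies a per-host scaling to the raw cpu/mem/io values before the weights, as described above. The host names and raw numbers are invented for the example, and the real scheduler's bookkeeping is more involved than this.

```python
WEIGHTS = {"cpu": 0.5, "mem": 0.5, "io": 0.0}

# Hypothetical host names; the values mirror a per-host usage_scaling setting.
HOST_SCALING = {
    "shared01": {"cpu": 1.0, "mem": 1.0, "io": 1.0},
    "owned01": {"cpu": 0.0, "mem": 0.0, "io": 0.0},
}


def scaled_usage(host, raw):
    """Scale raw cpu/mem/io by the host's factors, then apply the weights."""
    scaling = HOST_SCALING[host]
    return sum(WEIGHTS[k] * scaling[k] * raw[k] for k in WEIGHTS)


# The same raw consumption, run on a shared node and on a dedicated node.
raw = {"cpu": 3600.0, "mem": 7200.0, "io": 0.0}
shared = scaled_usage("shared01", raw)
owned = scaled_usage("owned01", raw)
print(shared, owned)
```

The job on the dedicated node contributes nothing to the user's usage, so it does not count against them in the share tree.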

All the best,

Mark Dixon                       Email    : m.c.dixon at leeds.ac.uk
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
