[gridengine users] Understanding load_formula and load calculations for queue overloads..

Ben Daniel Pere ben.pere at gmail.com
Mon Feb 29 22:27:25 UTC 2016


>
> It's the other way round. The load used in the load_formula is already
> adjusted. You adjust individual values, not the result of any computation
> already made with them.
>
> The computed load_formula will then be used to sort the machines.
>

Oh, so the load formula is only used to rank the machines? Then I do see the
sense in normalizing the load by the number of cores (otherwise we'd kill the
machines with 24 cores while the machines with 56 cores are barely doing
anything), and I suppose that's exactly what the default "np_load_avg" does.
Awesome!
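
A minimal sketch of that idea, assuming np_load_avg is simply the load average
divided by the core count and that hosts are then ordered by that value alone.
The host names and load figures are made up for illustration, and the sketch
ignores load adjustments and any other scheduler settings that influence the
real ordering:

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str        # hypothetical host name
    load_avg: float  # raw load average reported by the OS
    num_proc: int    # number of cores on the host


def np_load_avg(host: Host) -> float:
    """Load average normalized by the number of cores."""
    return host.load_avg / host.num_proc


def sort_by_load(hosts: list[Host]) -> list[Host]:
    """Least-loaded host (per core) first."""
    return sorted(hosts, key=np_load_avg)


# Made-up numbers: the 56-core host has the higher raw load,
# but the lower load per core.
hosts = [Host("node24", 12.0, 24), Host("node56", 14.0, 56)]
ordered = [h.name for h in sort_by_load(hosts)]
print(ordered)
```

Sorting on the raw load average would prefer the 24-core host here (12 < 14);
sorting on the normalized value prefers the 56-core host (0.25 < 0.5).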

>> We basically have 2 kinds of queue: a workhorse queue "all.q", which has
>> 1 slot per core, and an interactive queue, which also has 1 slot per core
>> but gets a better priority. We set the load_thresholds to 1.3 to allow 30%
>> oversubscription to ensure interactive jobs can always run. We never ever
>> put our nodes in alarm mode; we use zabbix to monitor each machine's
>> health and we automatically take a machine out of the cluster (by
>> disabling all of its queues) in cases of "mess" (disk failures, out of
>> space, mounting issues, stuff like that).
>
> Are these interactive jobs generating load, or are they used only to allow
> users to peek on a machine?
>

Yes, they're generating load, but there aren't many of them and they are
usually very short (seconds to a minute or so). Absolutely all our tasks are
single-threaded and take 100% CPU. We work super hard to relieve other
bottlenecks (filesystem, databases, etc.); that doesn't always work perfectly,
but for most of our tasks CPU is our only bound.

Our cluster is 50 execution hosts, each with 128-256GB RAM and 24-56 cores,
and we have some "support" hardware like an fhgfs cluster for information
not on local disks, mysql servers, etc. We intend to double the size of the
cluster this year, and we're preparing by making our use of "shared"
resources (database, fhgfs storage) more efficient and by looking at our
sge configuration and trying to figure out what we're doing wrong =)

The most common complaint in our halls is that the cluster isn't responsive
enough, so we've created a cluster task force that tried to tackle some
issues. I'm a software engineer but I'm helping with the fhgfs and sge
configuration as well, so you're probably going to hear a lot from me soon
;)
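
As a rough illustration of what a load_thresholds value of 1.3 on np_load_avg
means for hosts of the sizes mentioned above: the sketch below assumes the
threshold counts as reached when load average divided by cores is greater than
or equal to 1.3 (the exact comparison is an assumption), and that with
single-threaded, CPU-bound jobs the load average roughly tracks the number of
running jobs. The function names are hypothetical.

```python
def raw_load_limit(num_proc: int, threshold: float = 1.3) -> float:
    """Raw load average at which the normalized load reaches the threshold."""
    return num_proc * threshold


def exceeds_threshold(load_avg: float, num_proc: int,
                      threshold: float = 1.3) -> bool:
    """True once the per-core load is at or above the threshold.

    The >= comparison is an assumption made for this sketch.
    """
    return load_avg / num_proc >= threshold


# Smallest and largest core counts mentioned in the message.
for cores in (24, 56):
    limit = raw_load_limit(cores)
    print(f"{cores} cores: threshold reached at a raw load of about {limit:.1f}")
```

Under these assumptions a fully subscribed host (one busy job per core) sits at
a normalized load of about 1.0, which leaves roughly 30% of the core count as
headroom before the threshold is reached.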