[gridengine users] Over-subscription of hosts -- the role of slots and queues
andrew.j.pearson at gmail.com
Tue Jun 5 18:23:09 UTC 2012
I'm having an oversubscription problem on my cluster. I'll describe the
problem and my proposed solution. I can't implement my solution yet since
there are some several-day jobs running right now, so I thought I'd run it
past everyone on the mailing list.
My problem is simple - parallel jobs submitted to the cluster are using
processor cores that are already occupied by a previously submitted batch
job. It's not clear that the parallel/batch distinction is important, but
I've been running multiple simultaneous parallel jobs for a while on my
current configuration and this problem has never come up before.
My solution assumes that in fact the problem has nothing to do with
parallel/batch. Rather, it is happening because I have two overlapping
queues: all.q that uses nodes 0 through 10, and all_small.q that uses
nodes 9 and 10. The parallel job runs in all.q, while the batch job runs
in all_small.q. Since both queues have slots=16 (16 processors per node),
nodes 9 and 10 effectively have 32 slots each. If this is true
(that's the question), then all I have to do is change my queues so that
they don't overlap. The fact that the problem doesn't come up with
multiple parallel jobs may be because of load thresholds.
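
To make the arithmetic behind my hypothesis concrete, here is a minimal
Python sketch. It only models the slot counting I described above, not the
scheduler itself. The host names (node0 ... node10) are placeholders I made
up for illustration, and the assumption that slots simply add up across
overlapping queues is exactly the part I'm asking about.

```python
from collections import Counter

# Physical cores per node, and the slots=16 setting used in both queues.
CORES_PER_NODE = 16
SLOTS_PER_QUEUE_INSTANCE = 16

# Hypothetical host names; the real node names on the cluster may differ.
queues = {
    "all.q": [f"node{i}" for i in range(0, 11)],  # nodes 0 through 10
    "all_small.q": ["node9", "node10"],           # nodes 9 and 10
}


def effective_slots(queues, slots_per_instance):
    """Sum the slots each host offers across every queue it belongs to.

    Assumes (unverified) that per-queue slot counts are independent,
    so a host in two queues offers the sum of both.
    """
    totals = Counter()
    for hosts in queues.values():
        for host in hosts:
            totals[host] += slots_per_instance
    return dict(totals)


totals = effective_slots(queues, SLOTS_PER_QUEUE_INSTANCE)

# Hosts whose summed slots exceed the number of physical cores.
oversubscribed = {h: s for h, s in totals.items() if s > CORES_PER_NODE}

for host, slots in sorted(oversubscribed.items()):
    print(f"{host}: {slots} slots for {CORES_PER_NODE} cores")
```

Under that assumption only nodes 9 and 10 come out above 16 slots, which
matches where I'm seeing the batch and parallel jobs collide.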
What do you think of my solution? If it's nonsense, can anyone suggest
what the problem may be?