[gridengine users] Over-subscription of hosts -- the role of slots and queues

Reuti reuti at staff.uni-marburg.de
Tue Jun 5 18:37:17 UTC 2012


On 05.06.2012 at 20:23, Andrew Pearson wrote:

> Hi all
> 
> I'm having an oversubscription problem on my cluster.  I'll describe the problem and my proposed solution.  I can't implement my solution yet since there are some several-day jobs running right now, so I thought I'd run it past everyone on the mailing list.
> 
> My problem is simple - parallel jobs submitted to the cluster are using processor cores that are already occupied by a previously submitted batch job.  It's not clear that the parallel/batch distinction is important, but I've been running multiple simultaneous parallel jobs for a while on my current configuration and this problem has never come up before.
> 
> My solution assumes that in fact the problem has nothing to do with parallel/batch.  Rather, it is happening because I have two overlapping queues:  all.q that uses nodes 0 through 10, and all_small.q that uses nodes 9 and 10.  The parallel job runs in all.q, while the batch job runs in all_small.q.  Since both queues have slots=16 (16 processors per node), nodes 9 and 10 effectively have 32 slots each.  If this is true (that's the question), then all I have to do is change my queues so that they don't overlap.

Either this, or:

a) define slots=16 under "complex_values" in the exechost definition (`qconf -me node09`, and likewise for node10)

b) define an RQS: limit hosts {node09,node10} to slots=16

to limit the overall consumption across all queues residing on an exechost.
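
As a rough sketch of option b), the rule above could sit in a resource quota set like the one below. The name "max_slots_per_host" and the description text are only illustrative assumptions, not something taken from the existing setup; a set like this is normally created with `qconf -arqs`. Because the host list is in curly braces, the limit applies to each listed host individually rather than to both combined.

```
{
   name         max_slots_per_host
   description  "cap slots across all queues on node09 and node10"
   enabled      TRUE
   limit        hosts {node09,node10} to slots=16
}
```

For option a), the corresponding line in the editor opened by `qconf -me node09` would read "complex_values slots=16", assuming no other complex values are set on that host.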

-- Reuti


>  The fact that the problem doesn't come up with multiple parallel jobs may be because of load thresholds.
> 
> What do you think of my solution?  If it's nonsense, can anyone suggest what the problem may be?
> 
> Thank you.
> _______________________________________________
> users mailing list
> users at gridengine.org
> https://gridengine.org/mailman/listinfo/users



