[gridengine users] HowTo Configure large mem MPI jobs to have priority over short running serial / smp jobs?

Mike Hanby mhanby at uab.edu
Sat May 2 02:39:45 UTC 2015


Howdy,

I'm wondering if anyone in the SGE community has any tips on how to accomplish this on SGE 6.2u5p2?

We would like to improve the balance of competing application profiles currently active on the cluster. In particular, we need to balance queue wait times between jobs that require many cores via MPI and those that use only a single core.

Currently, MPI and SMP jobs tend to wait longer than single-slot serial jobs (the wait time goes up with the number of slots requested). We have a fair share policy in place, but even with that, serial users starve out the MPI users who request large numbers of slots, especially for large-memory MPI jobs (e.g., 64 slots at 13GB per slot).

We currently require users to request h_rt and vf as mandatory resource requests. Based on our analysis of qacct, the vast majority of cluster jobs are serial, run in less than an hour, and use less than 1GB of RAM per core. This means that these jobs, once identified, can be used very effectively to back-fill resource gaps left by larger MPI jobs.
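As a rough illustration of the "once identified" step, the thresholds above (serial, under an hour, under 1GB per core) can be written as a simple filter. This is only a sketch: the JobRecord fields and the sample jobs are made up for illustration and are not qacct's actual output format, so real accounting data would need to be parsed into this shape first.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    # Hypothetical, already-parsed accounting record (not qacct's native format).
    job_id: int
    slots: int
    wallclock_s: int    # elapsed run time in seconds
    max_mem_gb: float   # peak memory of the whole job, in GB


def is_backfill_candidate(job, max_hours=1.0, max_gb_per_slot=1.0):
    """True for a serial job that ran under max_hours and used under max_gb_per_slot."""
    per_slot_gb = job.max_mem_gb / job.slots
    return (
        job.slots == 1
        and job.wallclock_s < max_hours * 3600
        and per_slot_gb < max_gb_per_slot
    )


# Made-up sample data, for illustration only.
jobs = [
    JobRecord(101, 1, 1200, 0.4),     # short, small serial job
    JobRecord(102, 64, 18000, 832.0), # large MPI job
    JobRecord(103, 1, 5400, 0.8),     # serial, but runs 1.5 hours
    JobRecord(104, 1, 300, 2.5),      # serial and short, but uses 2.5GB
]

candidates = [j.job_id for j in jobs if is_backfill_candidate(j)]
share = len(candidates) / len(jobs)
print(candidates, f"{share:.0%}")
```

Running the same filter over real accounting data would give the fraction of jobs that could be steered toward a short queue.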

We have a resource quota set in place to prevent slot oversubscription of the compute nodes:
{
   name         slotcap
   description  Keep slots equal to processor cores for all exec hosts
   enabled      TRUE
   limit        hosts {*} to slots=$num_proc
}

Here's the proposal that was passed down to me and I'm looking for suggestions on how to implement it:

* create a short.q to accept jobs with run times under 2 hours and 2G per slot memory.
This one is easy enough: create a queue, short.q, and change h_rt in the queue definition from INFINITY to 02:00:00 and vf from INFINITY to 2G. Limit the PE list to smp.
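Since SGE time limits such as h_rt are written as hours:minutes:seconds (or as plain seconds), it is easy to end up with minutes where hours were intended. A minimal sketch of the conversion in both directions follows; the helper names are my own, and the parser only handles the two simple forms mentioned here, not every variant SGE accepts.

```python
def to_sge_time(hours=0, minutes=0, seconds=0):
    """Format a duration as the h:m:s string used for limits such as h_rt."""
    total = int(hours * 3600 + minutes * 60 + seconds)
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"


def sge_time_to_seconds(text):
    """Parse either plain seconds ("7200") or h:m:s ("02:00:00") into seconds."""
    parts = [int(p) for p in text.split(":")]
    if len(parts) == 1:
        return parts[0]
    if len(parts) != 3:
        raise ValueError(f"expected seconds or h:m:s, got {text!r}")
    h, m, s = parts
    return h * 3600 + m * 60 + s


print(to_sge_time(hours=2))             # two-hour limit
print(to_sge_time(hours=6))             # six-hour limit
print(sge_time_to_seconds("00:02:00"))  # this string is two minutes, not two hours
```

Round-tripping a limit through both helpers is a quick sanity check before putting the value into a queue definition.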

* create a largempi.q to accept 64-core, large-memory MPI jobs that have a max runtime of 6 hours
Similar to above: largempi.q, with h_rt set to 06:00:00 and vf set to 13G. Limit the PE list to the MPI PEs.
But how do I make sure that jobs request a minimum of 8 slots, to keep out serial jobs (i.e. no PE requested) and small parallel jobs?

* assign both queues to a common hardware pool that satisfies the resource needs of both job types (in our case, we'll use 22 nodes that each have 48GB and 12 slots)
Create a hostgroup containing the 22 compute nodes and assign that hostgroup to short.q and largempi.q using the "hostlist" option
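To sanity-check how the two job types fit on this pool, here is a back-of-the-envelope calculation using the figures above (22 nodes, 48GB and 12 slots each, 13G per slot for the large MPI jobs, 2G per slot for short jobs). It assumes the full 48GB of each node can be requested, which in practice may be slightly less, and it ignores how the PE allocation rule distributes slots across hosts.

```python
NODES = 22
NODE_MEM_GB = 48
NODE_SLOTS = 12


def slots_per_node(node_mem_gb, node_slots, mem_per_slot_gb):
    """Usable slots on one node: limited by core count and by memory."""
    by_memory = int(node_mem_gb // mem_per_slot_gb)
    return min(node_slots, by_memory)


# Large-memory MPI slots (13GB each) that fit on one node, and in the pool.
large_per_node = slots_per_node(NODE_MEM_GB, NODE_SLOTS, 13)
large_in_pool = NODES * large_per_node

# Nodes a single 64-slot job needs at that density (ceiling division).
nodes_for_64 = -(-64 // large_per_node)

# Short-job slots (2GB each) on an otherwise empty node.
short_per_node = slots_per_node(NODE_MEM_GB, NODE_SLOTS, 2)

# Short-job slots left on a node already packed with large-memory slots.
mem_left = NODE_MEM_GB - large_per_node * 13
slots_left = NODE_SLOTS - large_per_node
backfill_per_node = slots_per_node(mem_left, slots_left, 2)

print(large_per_node, large_in_pool, nodes_for_64)
print(short_per_node, backfill_per_node)
```

Under these assumptions memory, not cores, is the limit for the large jobs: only 3 slots of 13G fit per node, so one 64-slot job spans all 22 nodes, while each node it occupies still has room for a few 2G short jobs.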

* set a user limit of 100 slots in short.q to prevent a single user from taking over the queue
Create an RQS for this:
{
   name         short_queue_limits
   description  Limit max slots for short.q
   enabled      TRUE
   limit        users {*} queues short.q to slots=100
}

Now, what am I missing to have jobs submitted to largempi.q get priority, and to ensure that the serial jobs won't squeeze out the parallel large-memory jobs?

Thanks,

Mike