[gridengine users] RQS and scheduler performance (max-slots-on-all-hosts)

Stuart Barkley stuartb at 4gh.net
Thu Apr 26 16:38:44 UTC 2012


While looking at my previous issue with job reservations I have
noticed a large performance issue with RQS and the scheduler.

I had previously noticed my qmaster system often running at 100% CPU
when 1000 or more jobs were in the system.  I had just assumed this
was normal.

When I set "max_reservation 8" the scheduler task takes almost 5
minutes of 100% cpu on 1 core to run (running on a KVM virtual machine
with 2 cores dedicated).  This means that priority changes can take up
to 10 minutes to be fully reflected in qstat output.

I measure the scheduler run time with 'qconf -tsm'.

starting configuration:

max_reservation 8, max-slots-on-all-hosts enabled:
  Tue Apr 24 23:21:07 2012|-------------START-SCHEDULER-RUN-------------
  Tue Apr 24 23:25:36 2012|--------------STOP-SCHEDULER-RUN-------------
  Tue Apr 24 23:25:36 2012|-------------START-SCHEDULER-RUN-------------
  Tue Apr 24 23:30:09 2012|--------------STOP-SCHEDULER-RUN-------------

max_reservation 0, max-slots-on-all-hosts enabled:
  Tue Apr 24 22:58:23 2012|-------------START-SCHEDULER-RUN-------------
  Tue Apr 24 22:59:22 2012|--------------STOP-SCHEDULER-RUN-------------

This is with ~1800 jobs running, ~20 jobs in 'qw' state.

I have just noticed that when I disable my max-slots-on-all-hosts RQS
the scheduling time drops significantly.

max_reservation 8, max-slots-on-all-hosts disabled:
  Thu Apr 26 11:56:24 2012|-------------START-SCHEDULER-RUN-------------
  Thu Apr 26 11:56:30 2012|--------------STOP-SCHEDULER-RUN-------------
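
To compare runs without doing the subtraction by hand, the START/STOP
marker pairs can be turned into durations.  The following is only a
minimal sketch: it assumes the marker lines have been copied into a list
of strings in exactly the format quoted in this message, and the function
name scheduler_run_seconds is made up for illustration (it is not part of
Grid Engine).

```python
from datetime import datetime

# Timestamp layout of the marker lines quoted in this message,
# e.g. "Tue Apr 24 23:21:07 2012".  Parsing %a/%b assumes an
# English (C) locale.
TS_FORMAT = "%a %b %d %H:%M:%S %Y"


def scheduler_run_seconds(lines):
    """Return the duration in seconds of each START/STOP pair, in order."""
    durations = []
    start = None
    for line in lines:
        stamp, sep, marker = line.strip().partition("|")
        if not sep:
            continue  # not a "timestamp|marker" line
        if "START-SCHEDULER-RUN" in marker:
            start = datetime.strptime(stamp, TS_FORMAT)
        elif "STOP-SCHEDULER-RUN" in marker and start is not None:
            stop = datetime.strptime(stamp, TS_FORMAT)
            durations.append(int((stop - start).total_seconds()))
            start = None  # wait for the next START marker
    return durations


# The three measurements from this message.
reservation_8_rqs_on = [
    "Tue Apr 24 23:21:07 2012|-------------START-SCHEDULER-RUN-------------",
    "Tue Apr 24 23:25:36 2012|--------------STOP-SCHEDULER-RUN-------------",
    "Tue Apr 24 23:25:36 2012|-------------START-SCHEDULER-RUN-------------",
    "Tue Apr 24 23:30:09 2012|--------------STOP-SCHEDULER-RUN-------------",
]
reservation_0_rqs_on = [
    "Tue Apr 24 22:58:23 2012|-------------START-SCHEDULER-RUN-------------",
    "Tue Apr 24 22:59:22 2012|--------------STOP-SCHEDULER-RUN-------------",
]
reservation_8_rqs_off = [
    "Thu Apr 26 11:56:24 2012|-------------START-SCHEDULER-RUN-------------",
    "Thu Apr 26 11:56:30 2012|--------------STOP-SCHEDULER-RUN-------------",
]

print(scheduler_run_seconds(reservation_8_rqs_on))   # [269, 273]
print(scheduler_run_seconds(reservation_0_rqs_on))   # [59]
print(scheduler_run_seconds(reservation_8_rqs_off))  # [6]
```

On these numbers, disabling the RQS with max_reservation 8 takes a
scheduler run from roughly 270 seconds down to 6 seconds.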

For the record my RQS is now disabled:
{
   name         max-slots-on-all-hosts
   description  "Don't over commit host slots"
   enabled      FALSE
   limit        hosts {*} to slots=$num_proc
}

<rant>My internal logic says it should not take anywhere near the
original time to schedule 2000 jobs.  But an awful lot of today's code
will just consume resources without good reason.  I come from a time
when compute resources were actually very expensive and people paid
attention to performance.  Nowadays, it seems people are willing to
just throw memory and CPU at problems instead of programming
carefully.</rant>

This restores my belief in the original Grid Engine coders.

(still using sge6.2u5, CentOS 5)

Stuart Barkley
-- 
I've never been lost; I was once bewildered for three days, but never lost!
                                        --  Daniel Boone


