[gridengine users] RQS and scheduler performance (max-slots-on-all-hosts)

Stuart Barkley stuartb at 4gh.net
Thu Apr 26 16:38:44 UTC 2012

While looking at my previous issue with job reservations I have
noticed a large performance issue with RQS and the scheduler.

I have previously noticed my qmaster system often running at 100% cpu
when 1000 or more jobs were in the system.  I had just assumed this was

When I set "max_reservation 8" the scheduler task takes almost 5
minutes of 100% cpu on 1 core to run (running on a KVM virtual machine
with 2 cores dedicated).  This means that priority changes can take up
to 10 minutes to be fully reflected in qstat output.

I measure the scheduler run time with 'qconf -tsm'.

starting configuration:

max_reservation 8, max-slots-on-all-hosts enabled:
  Tue Apr 24 23:21:07 2012|-------------START-SCHEDULER-RUN-------------
  Tue Apr 24 23:25:36 2012|--------------STOP-SCHEDULER-RUN-------------
  Tue Apr 24 23:25:36 2012|-------------START-SCHEDULER-RUN-------------
  Tue Apr 24 23:30:09 2012|--------------STOP-SCHEDULER-RUN-------------

max_reservation 0, max-slots-on-all-hosts enabled:
  Tue Apr 24 22:58:23 2012|-------------START-SCHEDULER-RUN-------------
  Tue Apr 24 22:59:22 2012|--------------STOP-SCHEDULER-RUN-------------

This is with ~1800 jobs running, ~20 jobs in 'qw' state.

I have just noticed that when I disable my max-slots-on-all-hosts RQS
the scheduling time drops significantly.

max_reservation 8, max-slots-on-all-hosts disabled:
  Thu Apr 26 11:56:24 2012|-------------START-SCHEDULER-RUN-------------
  Thu Apr 26 11:56:30 2012|--------------STOP-SCHEDULER-RUN-------------
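The run times above can be read off by subtracting each START timestamp
from the following STOP timestamp.  A minimal sketch of that arithmetic
in Python, assuming the lines are copied verbatim from the scheduler
run log written after 'qconf -tsm' and that the timestamps use English
day and month names (strptime's %a and %b depend on the locale).  The
function name run_durations and the LOG string are illustrative only:

```python
from datetime import datetime

# Timestamp layout used in the log lines quoted in this post.
STAMP_FORMAT = "%a %b %d %H:%M:%S %Y"

# The four scheduler runs quoted above, in the order they appear.
LOG = """\
Tue Apr 24 23:21:07 2012|-------------START-SCHEDULER-RUN-------------
Tue Apr 24 23:25:36 2012|--------------STOP-SCHEDULER-RUN-------------
Tue Apr 24 23:25:36 2012|-------------START-SCHEDULER-RUN-------------
Tue Apr 24 23:30:09 2012|--------------STOP-SCHEDULER-RUN-------------
Tue Apr 24 22:58:23 2012|-------------START-SCHEDULER-RUN-------------
Tue Apr 24 22:59:22 2012|--------------STOP-SCHEDULER-RUN-------------
Thu Apr 26 11:56:24 2012|-------------START-SCHEDULER-RUN-------------
Thu Apr 26 11:56:30 2012|--------------STOP-SCHEDULER-RUN-------------
"""


def run_durations(text):
    """Return the elapsed seconds of each START/STOP pair in the text."""
    durations = []
    start = None
    for line in text.splitlines():
        stamp, sep, marker = line.strip().partition("|")
        if not sep:
            continue  # not a "timestamp|marker" line
        if "START-SCHEDULER-RUN" in marker:
            start = datetime.strptime(stamp, STAMP_FORMAT)
        elif "STOP-SCHEDULER-RUN" in marker and start is not None:
            stop = datetime.strptime(stamp, STAMP_FORMAT)
            durations.append(int((stop - start).total_seconds()))
            start = None
    return durations


print(run_durations(LOG))  # [269, 273, 59, 6]
```

That is roughly 4.5 minutes per run with max_reservation 8 and the RQS
enabled, 59 seconds with max_reservation 0, and 6 seconds with the RQS
disabled.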

For the record my RQS is now disabled:
   name         max-slots-on-all-hosts
   description  "Don't over commit host slots"
   enabled      FALSE
   limit        hosts {*} to slots=$num_proc

<rant>My internal logic says it should not take anywhere near the
original time to schedule 2000 jobs.  But an awful lot of today's code
will just consume resources without good reason.  I come from a time
when compute resources were actually very expensive and people paid
attention to performance.  Nowadays, it seems people are willing to
just throw memory and cpu at problems instead of careful

This restores my belief in the original Grid Engine coders.

(still using sge6.2u5, CentOS 5)

Stuart Barkley
I've never been lost; I was once bewildered for three days, but never lost!
                                        --  Daniel Boone
