[gridengine users] RQS and scheduler performance (max-slots-on-all-hosts)

Reuti reuti at staff.uni-marburg.de
Thu Apr 26 16:57:19 UTC 2012


On 26.04.2012 at 18:38, Stuart Barkley wrote:

> While looking at my previous issue with job reservations I have
> noticed a large performance issue with RQS and the scheduler.
> 
> I have previously noticed my qmaster system often running at 100% when
> 1000 or more jobs were in the system.  I had just assumed this was
> normal.
> 
> When I set "max_reservation 8" the scheduler task takes almost 5
> minutes of 100% cpu on 1 core to run (running on a KVM virtual machine
> with 2 cores dedicated).  This means that priority changes can take up
> to 10 minutes to be fully reflected in qstat output.
> 
> I measure the scheduler run time with 'qconf -tsm'.
> 
> starting configuration:
> 
> max_reservation 8, max-slots-on-all-hosts enabled:
>  Tue Apr 24 23:21:07 2012|-------------START-SCHEDULER-RUN-------------
>  Tue Apr 24 23:25:36 2012|--------------STOP-SCHEDULER-RUN-------------
>  Tue Apr 24 23:25:36 2012|-------------START-SCHEDULER-RUN-------------
>  Tue Apr 24 23:30:09 2012|--------------STOP-SCHEDULER-RUN-------------
> 
> max_reservation 0, max-slots-on-all-hosts enabled:
>  Tue Apr 24 22:58:23 2012|-------------START-SCHEDULER-RUN-------------
>  Tue Apr 24 22:59:22 2012|--------------STOP-SCHEDULER-RUN-------------
> 
> This is with ~1800 jobs running, ~20 jobs in 'qw' state.
> 
> I have just noticed that when I disable my max-slots-on-all-hosts RQS
> the scheduling time drops significantly.
> 
> max_reservation 8, max-slots-on-all-hosts disabled:
>  Thu Apr 26 11:56:24 2012|-------------START-SCHEDULER-RUN-------------
>  Thu Apr 26 11:56:30 2012|--------------STOP-SCHEDULER-RUN-------------
> 
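The run times implied by the quoted START/STOP markers can be worked out mechanically. Below is a minimal Python sketch that does so. It assumes only the marker format shown in the quoted output; the function name `run_durations` and the `LOG` variable are made up for illustration and are not part of Grid Engine.

```python
from datetime import datetime

# Marker lines copied from the quoted scheduler output above.
LOG = """\
Tue Apr 24 23:21:07 2012|-------------START-SCHEDULER-RUN-------------
Tue Apr 24 23:25:36 2012|--------------STOP-SCHEDULER-RUN-------------
Tue Apr 24 23:25:36 2012|-------------START-SCHEDULER-RUN-------------
Tue Apr 24 23:30:09 2012|--------------STOP-SCHEDULER-RUN-------------
Tue Apr 24 22:58:23 2012|-------------START-SCHEDULER-RUN-------------
Tue Apr 24 22:59:22 2012|--------------STOP-SCHEDULER-RUN-------------
Thu Apr 26 11:56:24 2012|-------------START-SCHEDULER-RUN-------------
Thu Apr 26 11:56:30 2012|--------------STOP-SCHEDULER-RUN-------------
"""


def run_durations(text):
    """Return the length in seconds of each START/STOP pair found in text."""
    durations = []
    start = None
    for line in text.splitlines():
        # Tolerate mail quoting ("> ") and surrounding whitespace.
        line = line.lstrip("> ").strip()
        if "|" not in line:
            continue
        stamp, _, marker = line.partition("|")
        when = datetime.strptime(stamp.strip(), "%a %b %d %H:%M:%S %Y")
        if "START-SCHEDULER-RUN" in marker:
            start = when
        elif "STOP-SCHEDULER-RUN" in marker and start is not None:
            durations.append(int((when - start).total_seconds()))
            start = None
    return durations


if __name__ == "__main__":
    print(run_durations(LOG))  # -> [269, 273, 59, 6]
```

That is roughly 4.5 minutes per run with max_reservation 8 and the RQS enabled, 59 seconds with max_reservation 0, and 6 seconds with the RQS disabled.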
> For the record my RQS is now disabled:
> {
>   name         max-slots-on-all-hosts
>   description  "Don't over commit host slots"
>   enabled      FALSE
>   limit        hosts {*} to slots=$num_proc
> }
> 
> <rant>My internal logic says it should not take anywhere near the
> original time to schedule 2000 jobs.  But an awful lot of today's code
> will just consume resources without good reason.  I come from a time
> when compute resources were actually very expensive and people paid
> attention to performance.  Nowadays, it seems people are willing to
> just throw memory and CPU at problems instead of careful
> programming.</rant>

+1

The hardware is getting faster, but the software slower. In the end you get the same speed. ;-)

What was your "schedule_interval" set to?

Was "schedd_job_info true" set by accident?
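Both settings live in the scheduler configuration, which "qconf -ssconf" prints as one "name value" pair per line. Below is a minimal Python sketch for pulling them out. The helper names and the SAMPLE text are hypothetical and only illustrate the layout; the sample values are not taken from Stuart's cluster.

```python
import subprocess


def parse_sched_conf(text):
    """Turn 'name   value' lines into a dict (sketch; assumes one pair per line)."""
    conf = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            conf[parts[0]] = parts[1].strip()
    return conf


def read_sched_conf():
    """Run 'qconf -ssconf' and parse its output; needs a working SGE environment."""
    out = subprocess.run(
        ["qconf", "-ssconf"], capture_output=True, text=True, check=True
    ).stdout
    return parse_sched_conf(out)


# Hypothetical excerpt for illustration only; real output has more parameters.
SAMPLE = """\
schedule_interval                 0:0:15
schedd_job_info                   true
max_reservation                   8
"""

if __name__ == "__main__":
    conf = parse_sched_conf(SAMPLE)
    for key in ("schedule_interval", "schedd_job_info", "max_reservation"):
        print(key, "=", conf.get(key))
```

On a real installation, calling read_sched_conf() instead of parsing SAMPLE would report the live values.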

-- Reuti


> This restores my belief in the original Grid Engine coders.
> 
> (still using sge6.2u5, CentOS 5)
> 
> Stuart Barkley
> -- 
> I've never been lost; I was once bewildered for three days, but never lost!
>                                        --  Daniel Boone




