[gridengine users] Strange behavior with functional scheduling
darose at darose.net
Tue Oct 10 15:02:20 UTC 2017
On 2017-10-09 6:23 pm, Reuti wrote:
> Am 10.10.2017 um 00:00 schrieb David Rosenstrauch:
>> On 2017-10-09 5:45 pm, Reuti wrote:
>>> Am 09.10.2017 um 23:01 schrieb David Rosenstrauch:
>>>> I'm a bit of an SGE noob, so please bear with me. We're in the
>>>> process of a first-time SGE deploy for the users in our department.
>>>> Although we've been able to use SGE, submit jobs to the queues
>>>> successfully, etc., we're running into issues trying to get the
>>>> fair-share scheduling - specifically the functional scheduling - to
>>>> work correctly.
>>>> We have very simple functional scheduling enabled, via the following
>>>> configuration settings:
>>>> enforce_user auto
>>>> auto_user_fshare 100
>>>> weight_tickets_functional 10000
>>>> schedd_job_info true
>>>> (In addition, the "weight_tickets_share" setting is set to 0,
>>>> thereby disabling share tree scheduling.)
>>>> A colleague and I are testing this setup by both of us submitting
>>>> multiple jobs to one of our queues simultaneously, with me first
>>>> submitting a large number of jobs (100) and him submitting a smaller
>>>> number (25) shortly afterwards. Our understanding is that the
>>>> functional scheduling policy should prevent one user from having
>>>> their jobs completely dominate a queue. And so our expectation is
>>>> that even though my jobs were submitted first, and there are more of
>>>> them, the scheduler should wind up giving his jobs a higher priority
>>>> so that he is not forced to wait until all of my jobs complete
>>>> before his run. (If he did have to wait, that would effectively be
>>>> FIFO scheduling, not fair share.)
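To make the expectation above concrete, here is a deliberately simplified model of the functional policy. It assumes (1) all `weight_tickets_functional` tickets go to the user category, (2) users get tickets in proportion to their fshare (`auto_user_fshare 100` for both users here), and (3) each user's tickets are split evenly across that user's pending jobs. The real scheduler has more knobs (for example the `weight_user`/`weight_project`/`weight_department`/`weight_job` split of functional tickets), so treat this as a sketch of the idea, not SGE's exact algorithm. The function name and user names are hypothetical.

```python
def functional_tickets_per_job(jobs_by_user, total_tickets, fshare):
    """Simplified functional-share model (an assumption, not SGE's exact code):
    split total_tickets among users in proportion to their fshare, then
    divide each user's tickets evenly among that user's pending jobs."""
    total_shares = sum(fshare[user] for user in jobs_by_user)
    per_job = {}
    for user, njobs in jobs_by_user.items():
        user_tickets = total_tickets * fshare[user] / total_shares
        per_job[user] = user_tickets / njobs
    return per_job


# Scenario from this thread: 100 jobs vs. 25 jobs, equal fshare of 100,
# weight_tickets_functional 10000. User names are hypothetical.
pending = {"me": 100, "colleague": 25}
shares = {"me": 100, "colleague": 100}
tickets = functional_tickets_per_job(pending, 10000, shares)
for user, t in tickets.items():
    print(f"{user}: {t:.0f} tickets per pending job")
# me: 50 tickets per pending job
# colleague: 200 tickets per pending job
```

Under these assumptions each of the colleague's jobs carries four times the tickets of each of the first user's jobs, which is why the functional policy alone would be expected to pull his jobs forward.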
>>> The display of the pending tickets has to be enabled too to see the
>>> effect (you should see them as being 0 right now in the pending list):
>>> report_pjob_tickets TRUE
>>> In addition you can set the:
>>> policy_hierarchy F
>>> -- Reuti
>> Thanks for the feedback.
>> We do have report_pjob_tickets set to TRUE. However, our
>> policy_hierarchy is set to OFS. Still, shouldn't that be a non-issue
>> if we have weight_tickets_share set to zero? (I.e., if we're not
>> using override or shared tree, then shouldn't this be effectively
>> equivalent to "policy_hierarchy F"?)
> Yes, but it can be streamlined.
> Are you mixing parallel and serial jobs? By default the slots complex
> has an urgency, which has the effect that jobs requesting more slots
> are treated as more important.
> -- Reuti
We were doing our testing with serial jobs, but our production workloads
will be largely parallel (primarily array jobs).
The default behavior you described (jobs requesting more slots being
considered more important) sounds like it explains what we were seeing.
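As a rough illustration of why slot urgency can outweigh tickets, here is a sketch based on my reading of the sge_priority(5) man page, where a job's priority combines normalized urgency and normalized tickets weighted by `weight_urgency` and `weight_ticket` (plus a POSIX-priority term, omitted here by assuming all jobs share the same priority). The min-max normalization, the slots urgency of 1000, the weight values, and the job names are all assumptions for illustration; check the man page and `qconf -ssconf` for your actual values. This does not reproduce the settings changes mentioned in this message.

```python
def normalize(values):
    """Min-max normalize to [0, 1] across pending jobs (assumed scheme);
    if all values are equal, every job gets 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def priorities(jobs, weight_urgency, weight_ticket, slots_urgency=1000):
    """Simplified model: prior = weight_urgency * nurg + weight_ticket * ntckts.
    Urgency here is only the slots contribution (slots_urgency * slots);
    waiting-time and deadline contributions are ignored."""
    urg = [slots_urgency * job["slots"] for job in jobs]
    tck = [job["tickets"] for job in jobs]
    nurg, ntck = normalize(urg), normalize(tck)
    return {job["name"]: weight_urgency * u + weight_ticket * t
            for job, u, t in zip(jobs, nurg, ntck)}


# Hypothetical jobs: a wide job with few tickets, a serial job with many.
jobs = [
    {"name": "wide_job", "slots": 8, "tickets": 50},
    {"name": "serial_job", "slots": 1, "tickets": 200},
]

# Illustrative weights where urgency is weighted above tickets.
p = priorities(jobs, weight_urgency=0.1, weight_ticket=0.01)
print(max(p, key=p.get))  # wide_job

# Same jobs, but with tickets weighted above urgency.
p = priorities(jobs, weight_urgency=0.01, weight_ticket=0.1)
print(max(p, key=p.get))  # serial_job
```

In this toy model the ticket advantage from the functional policy only decides the order once the ticket term carries more weight than the slot-driven urgency term.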
FYI, I also followed the advice in an old post of yours to the list
(http://gridengine.org/pipermail/users/2017-May/009766.html), which Ian K
echoed earlier in this thread, and made the following setting changes:
Changing those settings does seem to provide much more balanced, fair
scheduling now: my colleague's jobs are now much more interleaved with
mine.
Thanks much for the suggestions!