[gridengine users] Large cluster with memory reservation leaving cores idle

Reuti reuti at staff.uni-marburg.de
Tue Mar 8 12:32:08 UTC 2016


Hi,

> On 08.03.2016, at 00:20, Christopher Black <cblack at nygenome.org> wrote:
> 
> Greetings!
> We are running SoGE (mix of 8.1.6 and 8.1.8, soon 8.1.8 everywhere) on a
> ~300 node cluster.
> We utilize RQS and memory reservation via a complex to allow most nodes to
> be shared among multiple queues and run a mix of single core and multi
> core jobs.
> Recently when we hit 10k+ jobs in qw, we are seeing the job dispatch rate
> not keep up with how quickly jobs are finishing and leaving cores idle.
> Our jobs aren't particularly short (avg ~2h).
> We sometimes have a case where there are thousands of jobs not suitable
> for execution due to hitting a per-queue RQS rule, but we still want other
> jobs to get started on idle cores.
> 
> We have tried tuning some parameters but could use some advice as we are
> now having trouble keeping all the cores busy despite there being many
> eligible jobs in qw.
> 
> We have tried tuning max_advance_reservations,
> max_functional_jobs_to_schedule, max_pending_tasks_per_job,
> max_reservation as well as disabling schedd_job_info. We have applied some
> of the scaling best practices such as using local spools. I saw mention of
> MAX_DYN_EC but have not tried that yet, is it fairly safe to do so?
> Any other changes we should consider?
> 
> One thing I am not clear on is whether max_functional_jobs_to_schedule
> being low means that only the first n jobs with the highest calculated
> priority are evaluated for starting. If this were true it would mean high
> priority jobs that are not eligible for execution due to RQS or other
> reasons would prevent other lower priority jobs from starting.
> 
> Any thoughts or suggestions?
> 
> Also, we sometimes see the following in spool/qmaster/messages:
> 03/07/2016 18:00:09|worker|pqmaster|E|resources no longer available for
> start of job 7499766.1
> 03/07/2016 18:00:09|worker|pqmaster|E|debiting 31539000000.000000 of
> h_vmem on host pnode141.nygenome.org for 1 slots would exceed remaining
> capacity of 2986364800.000000
> 03/07/2016 18:00:09|worker|pqmaster|E|resources no longer available for
> start of job 7499767.1
> 03/07/2016 18:00:09|worker|pqmaster|E|debiting 18022000000.000000 of
> h_vmem on host pnode176.nygenome.org for 1 slots would exceed remaining
> capacity of 9887037760.000000
> 
> I expect this is due to the memory reservation, but I'm not sure the exact
> cause, if it is a problem, or if a parameter change might improve
> operations. One theory is that when looking to do reservations on hundreds
> of jobs, by the time it gets part way through the list the memory that
> would have been reserved in consumable resource has been allocated to
> another job, but I'm not sure as I don't see many hits on that log message.
> (update: just found
> http://arc.liv.ac.uk/pipermail/sge-bugs/2016-February.txt)
> I don't know if this is a root cause of our problems leaving cores idle as
> we see some of these even when everything is running fine.

I have seen cases where an RQS blocks further scheduling and shows up in `qstat -j` with a cryptic message. That was in 6.2u5, though, and I don't know whether there has been any work in this area to fix it since.

Often you can spot it in the scheduling output: an RQS is reported as violated although the rule is not actually violated. For me it kicked in when I requested a complex with a load value in the submission command, e.g.:

cannot run because it exceeds limit "////node20/" in rule "general/slots"

AFAICS:

> Thanks,
> Chris
> 
> Some config snippets showing non-default and potentially-relevant values,
> I can put full output to a pastebin if it is useful:
> qconf -srqs:
> {
>   name         slots_per_host
>   description  Limit slots per host
>   enabled      TRUE
>   limit        hosts {@16core} to slots=16
>   limit        hosts {@20core} to slots=20
>   limit        hosts {@28core} to slots=28
>   limit        hosts {!@physicalNodes} to slots=2
> }

The above RQS could instead be expressed as individual complex_values per exechost. Yes - the RQS form is handier, I know.
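As a sketch of what I mean (host names here are placeholders, not taken from your cluster), the per-host slot limit can be fixed directly on each execution host instead of going through the RQS:

```shell
# Add a hard slot limit to each exechost's complex_values.
# -aattr appends to the existing list (your hosts already carry
# h_vmem=240G,exclusive=true there), so the h_vmem entry is kept.
qconf -aattr exechost complex_values slots=16 pnode001
qconf -aattr exechost complex_values slots=20 pnode101
qconf -aattr exechost complex_values slots=28 pnode201
```

The scheduler then checks the consumable on the host itself, which is cheaper than evaluating an RQS rule for every pending job.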


> {
>   name         io
>   description  Limit max concurrent io.q slots
>   enabled      TRUE
>   limit        queues io.q to slots=300
> }
> {
>   name         dev
>   description  Limit max concurrent dev.q slots
>   enabled      TRUE
>   limit        queues dev.q to slots=250
> }
> {
>   name         pipeline
>   description  Limit max concurrent pipeline.q slots
>   enabled      TRUE
>   limit        queues pipeline.q to slots=4000
> }
> ...other queues..

Here one could use a global consumable complex for each type of queue, as long as the users request the particular queue. You lose the possibility that a job may be scheduled to any of several queue types whose resource requests it meets, though.
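A sketch of such a setup (the complex name `io_slots` and the job script are made up for illustration): define a consumable in the complex list, put the capacity on the global host, and have jobs bound for the queue request one unit each.

```shell
# Add to the complex list via `qconf -mc` a line like:
#   io_slots  ios  INT  <=  YES  NO  0  0
#
# Put the cluster-wide capacity on the pseudo host "global":
qconf -aattr exechost complex_values io_slots=300 global
# Each job destined for io.q then requests one unit:
qsub -q io.q -l io_slots=1 job.sh
```

This replaces the `limit queues io.q to slots=300` rule with plain consumable bookkeeping, at the cost of users having to request the complex along with the queue.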

I can't predict whether this would improve anything in the situation you face.
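Regarding the "would exceed remaining capacity" messages: they are consistent with the h_vmem consumable being re-checked at dispatch time, after other jobs have already been debited against the host. A minimal model of that check (figures taken from your quoted log; since h_vmem is a JOB consumable in your complex, it is debited once per job, so slots=1 here):

```python
# Minimal model of the per-host consumable check done at dispatch time.
# All values are in bytes, as in the qmaster messages file.
def can_dispatch(remaining_bytes, h_vmem_request, slots):
    """True if debiting h_vmem_request * slots stays within capacity."""
    return h_vmem_request * slots <= remaining_bytes

# pnode141: 31,539,000,000 requested vs 2,986,364,800 remaining -> refused
print(can_dispatch(2_986_364_800, 31_539_000_000, 1))   # False
# pnode176: 18,022,000,000 requested vs 9,887,037,760 remaining -> refused
print(can_dispatch(9_887_037_760, 18_022_000_000, 1))   # False
```

So the messages themselves are only the scheduler's snapshot being overtaken by reality between scheduling and dispatch; they need not indicate a misconfiguration.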

-- Reuti


> qconf -sc|grep mem (note default mem per job is 8GB and this is
> consumable):
> h_vmem              mem        MEMORY    <=      YES         JOB        8G       0
> 
> A typical exechost qconf -se:
> complex_values        h_vmem=240G,exclusive=true
> 
> qconf -sconf:
> shell_start_mode             unix_behavior
> reporting_params             accounting=true reporting=false \
>                             flush_time=00:00:15 joblog=true
> sharelog=00:00:00
> finished_jobs                100
> gid_range                    20000-20100
> max_aj_instances             3000
> max_aj_tasks                 75000
> max_u_jobs                   0
> max_jobs                     0
> max_advance_reservations     50
> 
> qconf -msconf:
> schedule_interval                 0:0:45
> maxujobs                          0
> queue_sort_method                 load
> 
> schedd_job_info                   false  (this used to be true, as qstat
> -j on a stuck job can be useful)
> params                            monitor=false
> max_functional_jobs_to_schedule   1000
> max_pending_tasks_per_job         50
> max_reservation                   0  (used to be 50 to allow large jobs
> with -R y to have a better chance to run)
> default_duration                  4320:0:0
> 
> _______________________________________________
> users mailing list
> users at gridengine.org
> https://gridengine.org/mailman/listinfo/users




