[gridengine users] Repeated error message in logs from RQS rules

William Hay w.hay at ucl.ac.uk
Mon Jul 17 09:03:11 UTC 2017


On Fri, Jul 14, 2017 at 08:36:06AM +0000, Simon Andrews wrote:
>    Can anyone shed any light on an error I'm getting repeated thousands of
>    times in my grid engine messages log.  This happens when I have a job
>    which is submitted and which is stopped from running by an RQS rule I have
>    set up.  The error I get is:
> 
>     
> 
>    07/14/2017 09:27:08|schedu|rocks1|C|not a single host excluded in rqs_excluded_hosts()
> 
>     
> 
>    The RQS ruleset I have which triggers this looks like:
> 
Not so much a fix as a possible workaround:
send your logs to syslog (rather than having qmaster log directly to files) and rely
on syslog replacing repeated messages with 'last message repeated <n> times'.
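
If routing the logs through syslog is not practical, the same collapsing can be
approximated after the fact when reading the messages file.  The following is a
minimal sketch, not part of grid engine: the function name collapse_repeats is
made up for illustration, and the line layout (timestamp|component|host|level|text)
is assumed from the single example line quoted in this thread.

```python
import re
from typing import Iterable, Iterator

# Assumed layout of a qmaster "messages" line, inferred from the example in
# this thread: "date time|component|host|level|message text".
LINE_RE = re.compile(r"^(?P<stamp>[^|]*)\|(?P<rest>.*)$")


def collapse_repeats(lines: Iterable[str]) -> Iterator[str]:
    """Yield lines with runs of identical messages collapsed.

    Two lines count as identical when everything after the timestamp
    field matches.  This imitates syslog's summary line; it is a
    hypothetical helper, not something SGE provides.
    """
    previous = None  # message body (timestamp stripped) of the current run
    repeats = 0      # how many extra times that body has been seen
    for raw in lines:
        line = raw.rstrip("\n")
        match = LINE_RE.match(line)
        # Lines without a "|" separator are compared as a whole.
        body = match.group("rest") if match else line
        if body == previous:
            repeats += 1
            continue
        if repeats:
            yield f"last message repeated {repeats} times"
        previous, repeats = body, 0
        yield line
    if repeats:
        yield f"last message repeated {repeats} times"
```

Feeding it an open file object, for example
"for line in collapse_repeats(open(path)): print(line)", prints the first line
of each run followed by a single summary line.  Note this only helps when
reading the log; it does not stop the file itself from growing.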

You could also try tweaking the loglevel parameter (see sge_conf(5)).

I don't use RQS myself, but my best guess is that you have two sorts of hosts:
regular hosts with a batch queue, and the hosts in @interactive with an interactive queue.
Because the hosts {@interactive} clause doesn't further restrict where the limit
applies (jobs are already restricted by being batch or interactive), grid engine
complains that you appear to have a no-op in your limit.  I think this complaint by SGE
is spurious.

Possibly:
Give the interactive queue a different name from the regular batch queue.  Make sure the batch 
queue can't run on the interactive hosts and vice versa.  Then apply the limit to the queue
rather than the host.
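
As an untested sketch of what that might look like, assuming the interactive
queue has been renamed to interactive.q (a made-up name) and is the only queue
configured on the @interactive hosts:

```
{
   name         per_user_slot_limit
   description  "limit the number of slots per user"
   enabled      TRUE
   limit        users {*} queues interactive.q to slots=8
   limit        users {andrewss} to slots=2
   limit        users {@bioinf} to slots=616
   limit        users {*} to slots=411
}
```

Note that this is not an exact equivalent of the original first rule:
hosts {@interactive} counts the 8 slots per user on each host separately,
whereas queues interactive.q counts them per user across the whole queue.
Whether it actually silences the rqs_excluded_hosts() message would need
checking on a test system.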

>     
> 
>    {
>       name         per_user_slot_limit
>       description  "limit the number of slots per user"
>       enabled      TRUE
>       limit        users {*} hosts {@interactive} to slots=8
>       limit        users {andrewss} to slots=2
>       limit        users {@bioinf} to slots=616
>       limit        users {*} to slots=411
>    }
> 
>     
> 
>    The rule seems to work, and jobs are held, and then started as expected. 
>    A job which fails to schedule gets a state like this:
> 
>     
> 
>    scheduling info:            cannot run in queue instance "all.q@compute-1-6.local" because it is not of type batch
>                                cannot run in queue instance "all.q@compute-1-5.local" because it is not of type batch
>                                cannot run in queue instance "all.q@compute-1-7.local" because it is not of type batch
>                                cannot run in queue instance "all.q@compute-1-0.local" because it is not of type batch
>                                cannot run in queue instance "all.q@compute-1-3.local" because it is not of type batch
>                                cannot run because it exceeds limit "andrewss/////" in rule "per_user_slot_limit/3"
>                                cannot run in queue instance "all.q@compute-1-4.local" because it is not of type batch
>                                cannot run in queue instance "all.q@compute-1-1.local" because it is not of type batch
>                                cannot run in queue instance "all.q@compute-1-2.local" because it is not of type batch
> 
>     
> 
>    So it's seeing the rule and is applying it correctly, but the spurious
>    errors are causing my messages file to inflate quickly when there are a
>    lot of queued jobs.
> 
>     
> 
>    Can anyone suggest how to debug or fix this?  I can't find anything
>    relevant from googling around for the specific error outside of the
>    library API it comes from.
> 
>     
> 
>    This is using SGE-6.2u5p2-1.x86_64.
> 
>     
> 
>    Thanks for any help you can offer!
> 
>     
> 
>    Simon.
> 
