[gridengine users] lots of jobs for one user

Reuti reuti at staff.uni-marburg.de
Wed Feb 11 20:23:23 UTC 2015


On 11.02.2015 at 20:52, Michael Stauffer wrote:

> On Wed, Feb 11, 2015 at 2:30 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
> Hi,
> 
> > On 11.02.2015 at 20:03, Michael Stauffer <mgstauff at gmail.com> wrote:
> >
> > OGS/GE 2011.11p1
> >
> > Hi again,
> >
> > I've got a user who's got 240+ running jobs (single slot) in the default queue (and 1400 queued and waiting), when the usual slot quota is about 50. I say 'usual' because I'm running a simple script that modifies everyone's slot quota depending on the overall cluster usage. When lots of slots are available, the quota goes up to a max of 100. I checked the logs from the script (it runs every minute) and over the time period that these 240+ jobs were submitted, the max slot quota never went above 97.
> >
> > My script examines the current cluster state, then dumps out a new rqs file, which then gets loaded via 'qconf -Mrqs'. The script gets called every minute. The scheduler's schedule interval is one second:
> >
> >   schedule_interval                 0:0:1
> 
> Are the jobs so short that such a short interval is necessary? It will put some load on the scheduler.
> 
> No, they're not that short. I set it this way just to give users the fastest response possible. I don't notice any overhead on my system; usually there are at most a few hundred jobs in the queue and we have an overpowered head node. But I'll change it to 2 sec for good measure.
>  
> 
> 
> > Anyone have an idea how this might have happened? If the user submits a lot of jobs in the split-second when 'qconf -Mrqs' is updating, could the scheduler get confused and start more jobs than it should? Any suggestions on how to dig around to see what happened? Thanks.
> 
> I can't say for sure, but instead of creating an altered file of the output, it's also possible to change individual lines like:
> 
> $ qconf -mattr resource_quota limit slots=4 general/3
> $ qconf -mattr resource_quota limit slots=4 general/short # here the limit got a name
> $ qconf -mattr resource_quota enabled TRUE general
> 
> for an RQS called "general".
> 
> OK seems like a great idea. By 'can't say for sure' do you mean you don't know for sure if this will avoid the problem?

Exactly. Sometimes RQS do not work although they should. It was never clear to me exactly when they fail.
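
A minimal sketch of how the per-rule update quoted above could be wrapped for a periodic script, instead of reloading the whole file with 'qconf -Mrqs'. The function name set_slot_limit and the DRY_RUN switch are made up for illustration; "general" and "3" are the RQS name and rule number from the example above. With DRY_RUN=1 (the default here) it only prints the command, so it can be tried without touching the cluster. Whether this avoids the over-dispatch is not known.

```shell
#!/bin/sh
# Sketch only: change one rule of a resource quota set in place.
# set_slot_limit and DRY_RUN are hypothetical names, not part of Grid Engine.

set_slot_limit() {
    rqs_name=$1   # name of the resource quota set, e.g. general
    rule=$2       # rule number or rule name inside that set
    slots=$3      # new slot limit

    # Refuse anything that is not a non-negative integer.
    case $slots in
        ''|*[!0-9]*) echo "invalid slot count: $slots" >&2; return 1 ;;
    esac

    if [ "${DRY_RUN:-1}" = 1 ]; then
        # Dry run: show the command that would be executed.
        echo qconf -mattr resource_quota limit "slots=$slots" "$rqs_name/$rule"
    else
        qconf -mattr resource_quota limit "slots=$slots" "$rqs_name/$rule"
    fi
}
```

For example, "set_slot_limit general 3 97" prints the qconf call for a limit of 97 slots; running with DRY_RUN=0 would execute it.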

-- Reuti


> Seems very likely.
>  
> 
> A safety net could also be set up in the scheduler configuration with "maxujobs".
> 
> Yes, good idea. I had that set once but removed it for some reason, can't remember.
> 
> Also, I figure I could disable all queues before I make the changes, then re-enable them.
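
A rough sketch of that disable / update / re-enable sequence, assuming the quota file is still loaded with 'qconf -Mrqs'. The function name, the RUN hook and the file path are hypothetical. RUN defaults to "echo", so the commands are only printed; setting RUN to an empty string would execute them. Disabled queues stop receiving new jobs, while jobs that are already running continue. It is untested whether this closes the window in which extra jobs get started.

```shell
#!/bin/sh
# Sketch only: disable all queues, load the new quota file, re-enable.
# RUN is a hypothetical hook: "echo" means dry run, empty means execute.
RUN=${RUN-echo}

update_rqs_with_queues_disabled() {
    rqs_file=$1   # path to the generated quota file (hypothetical)

    # Disable all queue instances; give up if that fails.
    $RUN qmod -d '*' || return 1

    # Load the new resource quota sets and remember the result.
    $RUN qconf -Mrqs "$rqs_file"
    status=$?

    # Always re-enable the queues, even if loading failed.
    $RUN qmod -e '*'
    return $status
}
```

Called as "update_rqs_with_queues_disabled /tmp/new.rqs", the dry run prints the three commands in order.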
> 
> -M
>  
> 
> -- Reuti
> 




