[gridengine dev] [DRAFT PATCH] Enhancement: exempt certain programs from execd control

Maes, Richard rmaes at ciena.com
Fri Nov 18 00:57:25 UTC 2011


Rayson,
Thanks for the feedback. I'm unfamiliar with qping, so I'll need to look at what it does for me. I've been working on this issue for a while. The file descriptor limit was a problem at one time. About a year ago, I increased the limits to what you see below, which allowed me to reach 400 - 500 concurrent jobs in queue. Does this look reasonable to you?

[waxgridqm.ciena.com(rmaes)]-> ~ 103> limit
cputime      unlimited
filesize     unlimited
datasize     unlimited
stacksize    10240 kbytes
coredumpsize 0 kbytes
memoryuse    unlimited
vmemoryuse   unlimited
descriptors  32768 
memorylocked 32 kbytes
maxproc      147456 
[waxgridqm.ciena.com(rmaes)]-> ~ 104>



-----Original Message-----
From: Rayson Ho [mailto:rayrayson at gmail.com] 
Sent: Thursday, November 17, 2011 4:49 PM
To: Maes, Richard
Cc: Mark Dixon; dev at gridengine.org
Subject: Re: [gridengine dev] [DRAFT PATCH] Enhancement: exempt certain programs from execd control

On Thu, Nov 17, 2011 at 7:37 PM, Maes, Richard <rmaes at ciena.com> wrote:
>  I have seen an issue when running 500+ jobs using this
> mechanism.  Is there a log I can look at to determine if the qmaster is
> killing jobs due to exceeding memory limits?

What version of SGE are you using? And how many execution hosts do you have?

Each qsub -sync uses a socket connection, which means there's a file
descriptor used up. If the system limit is 1024, and if you have close
to 500 execution hosts, then qmaster will run out of descriptors
handling that many qsub -sync connections.

If you want to debug this problem:

- check shell limit (ulimit or limit)
- qping

Rayson





> Rich
>
>
>
> -----Original Message-----
> From: dev-bounces at gridengine.org [mailto:dev-bounces at gridengine.org] On
> Behalf Of Mark Dixon
> Sent: Thursday, November 10, 2011 4:44 AM
> To: dev at gridengine.org
> Subject: [gridengine dev] [DRAFT PATCH] Enhancement: exempt certain
> programs from execd control
>
> In an attempt to lighten people's mood today...
>
> Please find a draft patch attached, allowing a GE admin to specify a
> list of programs that are not counted against a job's resource limits.
> The patch has been prepared against a vanilla ge-6.2u5 and only gives
> this feature to Linux execd's.
>
> This is to address a problem where tightly-integrated parallel jobs can
> be killed due to all the instances of qrsh exceeding h_vmem on the
> MASTER, as documented here:
>
> https://arc.liv.ac.uk/trac/SGE/ticket/694
>
> Is this something the major forks would be interested in?
>
> This is my first GE patch and I'm very rusty (it's been over a decade
> since I last worked on a big C code), so any and all
> help/comments/insults/rotten fruit would be welcomed.
>
>
> How to use once applied and built:
>
> "qconf -mconf" - add EXEMPT_PROGRAMS=<filename> to the execd_params line
> (use ":" as a delimiter, if more than one filename is desired).
>
> Setting this to contain whatever filenames `ls $SGE_ROOT/bin/*/qsh`
> expands to on your system should fix the mpirun-qrsh problem (qrsh is
> symlinked to qsh).
>
>
> Major concerns yet to resolve:
>
> 1) I'm not happy with how I'm configuring this feature. Using
> execd_params doesn't seem to fit nicely, so am still looking for the
> best way to do this (a line in "qconf -se", allowing per-execd config,
> perhaps? Or a separate line in "qconf -sconf"?)
>
> 2) Using execd_params also means that simply removing the variable
> doesn't actually turn off this feature - you need to set
> "EXEMPT_PROGRAMS=" and wait a bit for the config to propagate to the
> execds, before removing the reference completely. Ugh.
>
> 3) I think it will break running a standalone "pdc" binary, as pt_open
> now refers to the cluster configuration. Don't know how big an issue
> this is.
>
> 4) I don't know what "touch_time_stamp" is supposed to achieve, but other
> (similar) bits use it. Commented-out for now.
>
> 5) The name of the feature. I don't like the string "exempt_programs" I
> keep using. "resource_exempt" instead?
>
> 6) I assume this needs to be submitted under the SISSL 1.2?
>
> 7) I'm not happy about the amount of memory allocation error checking
> going on... but seems to be on par with similar routines.
>
>
> Any guidance would be wonderful :)
>
> Thanks,
>
> Mark
> --
> -----------------------------------------------------------------
> Mark Dixon                       Email    : m.c.dixon at leeds.ac.uk
> HPC/Grid Systems Support         Tel (int): 35429
> Information Systems Services     Tel (ext): +44(0)113 343 5429
> University of Leeds, LS2 9JT, UK
> -----------------------------------------------------------------
>
>
> _______________________________________________
> dev mailing list
> dev at gridengine.org
> https://gridengine.org/mailman/listinfo/dev
>
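
For the EXEMPT_PROGRAMS setting described in Mark's notes above, a minimal sketch of how the value might be assembled: it expands $SGE_ROOT/bin/*/qsh and joins the results with ":" as the notes describe. The helper name and the fallback path /opt/sge are hypothetical, and the draft patch itself may parse the value differently.

```python
import glob
import os


def exempt_programs_value(sge_root, program="qsh"):
    """Build an execd_params entry listing every per-architecture copy of `program`.

    Mirrors `ls $SGE_ROOT/bin/*/qsh`, joined with ":" as the delimiter.
    """
    pattern = os.path.join(sge_root, "bin", "*", program)
    paths = sorted(glob.glob(pattern))  # sorted for a stable, repeatable value
    return "EXEMPT_PROGRAMS=" + ":".join(paths)


if __name__ == "__main__":
    # /opt/sge is only a placeholder default for this sketch.
    root = os.environ.get("SGE_ROOT", "/opt/sge")
    print(exempt_programs_value(root))
```

The printed string would then be pasted into the execd_params line via "qconf -mconf". If nothing matches, the result is the bare "EXEMPT_PROGRAMS=", which is the form the notes describe for switching the feature off.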



