[gridengine users] Limit number of jobs by job name
Reuti
reuti at staff.uni-marburg.de
Mon Feb 6 21:44:27 UTC 2012
Hi,
On 06.02.2012 at 22:25, Lane Schwartz wrote:
> I have a large number of jobs that I need to run. Each of these jobs
> kicks off a number of child jobs. The child jobs do most of the actual
> work - the parent jobs mostly sit and wait until the child jobs have
> completed.
>
> Ideally, I would like to kick off all of my parent jobs, and let them
> spawn off all of their respective child jobs, and wait until
> everything finishes. But there's a problem with this. If I kick off
> all of the parent jobs, then the parent jobs take up lots of slots in
> my grid, and it takes far longer than it should for the grid to work
> through all of the child jobs, because the parent jobs are taking up
> so many compute slots.
>
> To solve this problem, it occurred to me that it would be nice if I
> could specify (perhaps by job name) a maximum number of parent jobs
> that can simultaneously be executing.
>
> The way I'm currently working around this problem is the following. I
> launch one or two parent jobs, then wait until they have spawned their
> child jobs. At this point all of the slots in my grid have been
> filled. I then launch the rest of my parent jobs, which don't run,
> because no slots are available. I then use qmon to lower the priority
> of my waiting parent jobs. This works OK, but later on I still
> sometimes end up with too many parent jobs running simultaneously.
>
> I've looked through the documentation to try to find a better
> solution. The closest thing I've found is the -tc flag to qsub, which
> allows me to limit the number of concurrent array jobs executing.
> Unfortunately, the parent jobs are not themselves array jobs, and
> while I suppose I could try to rewrite the parent launch scripts to
> launch as an array job, this would be less than ideal.
>
> I was wondering if anyone has any other ideas on how to specify that
> no more than n instances of jobs with a specified name should be able
> to run simultaneously. I'd be open to other mechanisms, too.
Since the parent jobs are not doing any real work, a dedicated parent.q would do. Attach a forced boolean complex to it, so that only jobs which explicitly request that complex can get in. You could even set an h_cpu limit on this queue to prevent abuse: a job doing real computation there would be killed after 5 minutes or so of CPU time, while a parent that merely waits uses almost none. The overall slot count used in this cluster queue can then be limited with an RQS (resource quota set).
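
A rough sketch of the three pieces, as an illustration only. The complex name "parent", its shortcut "par", the RQS name, the 10 queue slots, and the cap of 2 running parents are all made-up values, and the queue excerpt shows only the attributes relevant here. Adjust everything to your setup.

```text
# 1) Complex entry, added with "qconf -mc"
#    (columns: name shortcut type relop requestable consumable default urgency)
parent    par    BOOL    ==    FORCED    NO    FALSE    0

# 2) Excerpt of the queue definition, created with "qconf -aq parent.q"
qname             parent.q
slots             10
complex_values    parent=TRUE
h_cpu             00:05:00

# 3) Resource quota set, added with "qconf -arqs"
{
   name         max_parent_jobs
   description  "Cap on concurrently running parent jobs"
   enabled      TRUE
   limit        queues parent.q to slots=2
}

# 4) Submitting a parent job (parent.sh is a placeholder script name)
#    qsub -l parent=TRUE parent.sh
```

Because the complex is FORCED, child jobs submitted without "-l parent=TRUE" cannot land in parent.q, and the parents cannot land in the normal queues as long as those do not offer parent=TRUE. The RQS then caps how many parents run at once, whatever the slot count of parent.q itself is.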
-- Reuti