[gridengine users] How to detect "blackhole" host in gridengine?
elauzier at broadinstitute.org
Thu Mar 10 16:58:51 UTC 2011
Thanks for the input...
I like the idea of a boolean load sensor. It could be used to set the value
of a host-specific
boolean complex resource...and a default job submission could say...
This may work...
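
To make the boolean load sensor idea concrete, here is a rough sketch of what I have in mind. It follows the usual load sensor protocol (wait for a line on stdin, answer with a begin/end block, stop on "quit"). The complex name "disks_ok", the CHECK_DIR and SENSOR_HOST variables, and the write-a-probe-file test are all my own assumptions, not anything SGE ships, and I am assuming a boolean complex accepts 1/0 as values.

```shell
#!/bin/sh
# Sketch of a boolean load sensor. "disks_ok" is an assumed complex name,
# CHECK_DIR is an assumed directory to probe, and SENSOR_HOST must match
# the name SGE knows this host by (assumption: uname -n is good enough).

SENSOR_HOST=${SENSOR_HOST:-$(uname -n)}
CHECK_DIR=${CHECK_DIR:-/tmp}

# Succeeds only if a small probe file can be written and removed.
disk_ok() {
    probe="$1/.sge_probe.$$"
    ( echo ok > "$probe" ) 2>/dev/null && rm -f "$probe"
}

# Emit one report in load sensor format: begin, host:name:value, end.
report() {
    if disk_ok "$CHECK_DIR"; then value=1; else value=0; fi
    echo "begin"
    echo "$SENSOR_HOST:disks_ok:$value"
    echo "end"
}

# execd sends a line for every load report interval and "quit" to stop.
sensor_loop() {
    while read -r line; do
        [ "$line" = "quit" ] && break
        report
    done
}

# Demo only: simulate one request from execd followed by "quit".
# A real sensor would end with a plain "sensor_loop" call instead.
printf '\nquit\n' | sensor_loop
```

The complex itself would still have to be defined as a boolean (qconf -mc) and the script registered as load_sensor in the host configuration before the value could be used in a threshold or a resource request.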
On Thu, Mar 10, 2011 at 11:47 AM, Reuti
> On 10.03.2011 at 17:20, Edward Lauzier wrote:
> > This was caused by one host having a scsi disk error...
> > sge_execd was ok, but could not properly fire up the shepherd...
> > ( we could not log into the console...because of disk access errors....)
> > So, the jobs failed with the error message:
> > 03/10/2011 07:14:38|worker|ge-seq-prod|W|job 9548360.1 failed on host
> node1182 invalid execution state because: shepherd exited with exit status
> 127: invalid execution state
> > And, man did it chew through a lot of jobs fast...
> > We set the load adjust to 0.50 per job for one minute and the load formula
> to slots...
> > Things run fine and fast...
> > And the scheduler can really dispatch fast, esp to a blackhole host...
> well, the feature to use the hawking radiation to allow the jobs to pop up
> on other nodes needs precise alignment of the installation - SCNR
> There is a demo script to check the size of e.g. /tmp here
> http://arc.liv.ac.uk/SGE/howto/loadsensor.html and then use
> "load_thresholds tmpfree=1G" in the queue definition, so that the queue
> instance is set to alarm state in case it falls below a certain value.
> A load sensor can also deliver a boolean value, hence checking locally
> something like "all disks fine" and use this as a "load_threshold" can also
> be a solution. How to check something is of course specific to your node
> The last necessary piece would be to inform the admin: this could be done
> by the load sensor too, but as the node is known not to be in a proper state
> I wouldn't recommend this. Better might be a cron-job on the qmaster machine
> checking `qstat -explain a -qs a -u foobar` *) to look for passed load thresholds.
> -- Reuti
> *) There is no switch "show no jobs at all" to `qstat`, so using an unknown
> user "foobar" will help. And OTOH there is no "load_threshold" in the
> exechost definition.
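
A cron job along these lines could be sketched as follows. This is only a sketch: QSTAT_CMD and NOTIFY_CMD are hypothetical hooks I made up so the commands can be swapped, the mail recipient is a placeholder, and I am assuming that the qstat call prints nothing when no queue instance is in alarm state (worth checking on your installation before relying on it).

```shell
#!/bin/sh
# Sketch of a qmaster-side cron check for queue instances in alarm state.
# QSTAT_CMD and NOTIFY_CMD are hypothetical, overridable hooks.

QSTAT_CMD=${QSTAT_CMD:-"qstat -explain a -qs a -u foobar"}
NOTIFY_CMD=${NOTIFY_CMD:-"mail -s SGE-alarm-state root"}

# Returns 0 if nothing is reported, 1 if alarms were reported and mailed,
# 2 if qstat itself failed (which is also mailed).
check_alarms() {
    out=$($QSTAT_CMD 2>&1) || {
        echo "qstat failed: $out" | $NOTIFY_CMD
        return 2
    }
    # Assumption: any output at all means some queue instance is in alarm.
    if [ -n "$out" ]; then
        echo "$out" | $NOTIFY_CMD
        return 1
    fi
    return 0
}

# Run the check only where the qstat binary exists, so the sketch is
# harmless on machines without SGE.
if command -v "${QSTAT_CMD%% *}" >/dev/null 2>&1; then
    check_alarms
fi
```

Since it runs on the qmaster machine, the notification does not depend on the broken node being able to send anything.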
> > -Ed
> > Hi,
> > On 10.03.2011 at 16:50, Edward Lauzier wrote:
> > > I'm looking for best practices and techniques to detect blackhole hosts
> > > and disable them. ( Platform LSF has this already built in...)
> > >
> > > What I see is possible is:
> > >
> > > Using a cron job on a ge client node...
> > >
> > > - tail -n 1000 <qmaster_messages_file> | egrep '<for_desired_string>'
> > > - if detected, use qmod -d '<queue_instance>' to disable
> > > - send email to ge_admin list
> > > - possibly send email of failed jobs to user(s)
> > >
> > > Must be robust enough to time out properly when ge is down or too
> > > busy for qmod to respond...and/or filesystem problems, etc...
> > >
> > > ( perl or php alarm and sig handlers for proc_open work well for
> enforcing timeouts...)
> > >
> > > Any hints would be appreciated before I start on it...
> > >
> > > Won't take long to write the code, just looking for best practices and
> > > a setting I'm missing in the ge config...
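
For reference, the cron job described above might look roughly like this in plain sh. It is a sketch under several assumptions: the MESSAGES path is a placeholder for the real qmaster messages file, THRESHOLD and DISABLE_CMD are names invented here, the pattern only matches the "shepherd exited with exit status" message, and the timeout is enforced with the coreutils `timeout` command rather than perl or php alarm handlers. The email steps are left out.

```shell
#!/bin/sh
# Sketch: find hosts with repeated shepherd failures in the qmaster
# messages file and disable their queue instances.
# MESSAGES is a placeholder path; THRESHOLD and DISABLE_CMD are
# hypothetical knobs for this sketch.

MESSAGES=${MESSAGES:-/path/to/qmaster/messages}
THRESHOLD=${THRESHOLD:-5}
DISABLE_CMD=${DISABLE_CMD:-"timeout 30 qmod -d"}

# Read log lines on stdin, print each host that shows at least
# THRESHOLD shepherd failures.
failing_hosts() {
    sed -n 's/.*failed on host \([^ ]*\) .*shepherd exited with exit status.*/\1/p' |
        sort | uniq -c |
        awk -v min="$THRESHOLD" '$1 >= min { print $2 }'
}

# Read host names on stdin and disable all queue instances on each one.
# stdin of the command is /dev/null so it cannot eat the host list.
disable_hosts() {
    while read -r host; do
        if $DISABLE_CMD "*@$host" </dev/null; then
            echo "disabled queue instances on $host"
        else
            echo "could not disable $host (qmod failed or timed out)" >&2
        fi
    done
}

# Only look at the last 1000 lines, and only if the file is readable.
if [ -r "$MESSAGES" ]; then
    tail -n 1000 "$MESSAGES" | failing_hosts | disable_hosts
fi
```

Counting failures per host before disabling keeps a single failed job from taking a node out of service.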
> > what is causing the blackhole? For example: if it's a full file system on
> a node, you could detect it by a load sensor in SGE and define in the queue
> setup an alarm threshold, so that no more jobs are scheduled to this
> particular node.
> > -- Reuti