[gridengine users] How to detect "blackhole" host in gridengine?
Fritz Ferstl
fferstl at univa.com
Wed Mar 16 15:22:31 UTC 2011
On 16.03.11 16:10, Dave Love wrote:
> Fritz Ferstl<fferstl at univa.com> writes:
>
>> It should actually be quite easy. In a first implementation you'll
>> probably want to introduce a qmaster_param "black_hole_exit_rate" and
>> then keep a statistic of the exit frequency rate for each exec host near
>> the code where qmaster receives job completion information. Qmaster
>> would compare the exit frequency of the hosts against
>> black_hole_exit_rate and disable a host if its exit frequency is higher
>> than allowed by black_hole_exit_rate.
>>
>> A more advanced implementation would provide a black_hole_exit_rate per
>> exec host or even per cluster queue (i.e. per job class) plus per exec
>> host. The checking itself won't get much more complicated. The problem
>> with the more advanced approach is only that you'd have to modify the
>> format of the queue and host configuration. This would make that version
>> incompatible with earlier versions. So the upgrade step will get more
>> "involved".
>
> Understanding this might be more generally useful.
>
> What's the reason for doing it in the qmaster? The way I'd hope to be
> able to do it would be locally in execd, with an error state triggered
> if the rate exceeded that specified by a complex (or more than one). I
> don't know enough about the architecture, though, and maybe the execd
> doesn't have access to the relevant information, for one thing.
It's simply a matter of race conditions and getting back to a sane state
quicker. If you do it in the execd, then while you figure things out
there and report the situation back to the qmaster, the qmaster will
keep sending jobs. Of course, you can catch those in qmaster and send
them back as having been unable to run, but there's probably no other
way to do that than using the exit code 99 feature, and that could have
undesired side effects (such as where in the pending jobs list those
jobs end up).
Doing it in the qmaster is certainly cleaner and it shouldn't be any
more complicated. As long as it is strictly exit-rate-based, the qmaster
is the right place. If you wanted to be smarter, analyze exactly why
a job has failed, and decide whether to call the host a black hole based
on that, then there'd need to be code in the execd.
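To make the strictly exit-rate-based check concrete, here is a minimal
sketch in Python (not the qmaster's C code, and not an existing Grid
Engine feature). It assumes black_hole_exit_rate is expressed in job
exits per second and is measured over a sliding window; the class name,
the window length and the method names are all hypothetical.

```python
from collections import defaultdict, deque


class ExitRateMonitor:
    """Track job exits per exec host and disable hosts that exit jobs too fast.

    Hypothetical sketch: black_hole_exit_rate is assumed to mean
    "maximum allowed job exits per second" over a sliding window.
    """

    def __init__(self, black_hole_exit_rate, window_seconds=60.0):
        self.max_rate = black_hole_exit_rate
        self.window = window_seconds
        self.exits = defaultdict(deque)  # host -> timestamps of recent job exits
        self.disabled = set()            # hosts flagged as suspected black holes

    def job_finished(self, host, now):
        """Record a job exit on `host` at time `now` (seconds).

        Returns True if the host is (now) disabled.
        """
        stamps = self.exits[host]
        stamps.append(now)
        # Forget exits that have fallen out of the sliding window.
        while stamps and stamps[0] <= now - self.window:
            stamps.popleft()
        # Compare the observed exit rate against the configured limit.
        if len(stamps) / self.window > self.max_rate:
            self.disabled.add(host)
        return host in self.disabled


if __name__ == "__main__":
    monitor = ExitRateMonitor(black_hole_exit_rate=0.5, window_seconds=10.0)
    # node01 "finishes" six jobs within half a second.
    for i in range(6):
        monitor.job_finished("node01", i * 0.1)
    print(sorted(monitor.disabled))
```

The point of the sketch is that the check only needs data the qmaster
already sees when it receives job completion information, so no round
trip to the execd is involved. A per-host or per-queue limit would
replace the single max_rate with a lookup, which is where the
configuration format change comes in.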
> If this was implemented, it might be useful to try to ignore classes of
> error that seemed to be due to the user to balance the risks between
> losing jobs and knocking out the whole cluster. When there's a large
> array job -- or a big batch of jobs which should be an array job -- it's
> easy to knock out the cluster with some simple mistake already, as at
> least one sort of user error can put the queue in an error state
> (probably when the working directory disappears, but I can't remember
> off-hand).
Yep, that's a well-known problem. Off the top of my head I don't recall
the exact cause either, nor whether it has since been fixed, but care
needs to be taken there.
Cheers,
Fritz
--
Fritz Ferstl -- CTO and Business Development, EMEA
Univa -- The Data Center Optimization Company
E-Mail: fferstl at univa.com
Web: http://www.univa.com
Phone: +49.9471.200.195
Mobile: +49.170.819.7390