[gridengine users] How to detect "blackhole" host in gridengine?
Mark Dixon
m.c.dixon at leeds.ac.uk
Mon Mar 14 14:31:27 UTC 2011
On Thu, 10 Mar 2011, Rayson Ho wrote:
...
> In LSF, the admin can define the EXIT_RATE for the host & the
> GLOBAL_EXIT_RATE rate for the whole cluster. In SGE the way to do this
> can only be done in the starter_method, as it knows when jobs are
> started & when jobs exit. So a simple one would write to some sort of
> /tmp area, and do some math to come up with the rate. When a job
> exceeds the EXIT_RATE threshold, then it will close the queue/host.
...
Good idea. For what it's worth, I would do this in the prolog or epilog
rather than in the starter_method, for the following reasons:
1) If using the prolog, you should be able to disable the node and then
exit with status 99, which causes the job to be rescheduled to another node.
2) If using the epilog, you have the opportunity to disable a node
immediately after a job has hit a problem (which can aid scheduling
efficiency if you have a lot of jobs requesting Resource Reservations)
rather than detecting it at the next job start. You do lose the odd
sacrificial job if a problem develops spontaneously, though.
3) Writing a starter_method that can cope with all the different ways that
tightly-integrated MPI implementations start can be fiddly. [if anyone's
interested, sticking an eval "$@" at the bottom of the script solves
most of them]
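
To make point 1 concrete, here is a minimal prolog sketch. Treat it as an illustration under assumptions rather than what we actually run: the health test (can a file be created in scratch?), the QMOD and SCRATCH_DIR variables and the function names are all made up for the example. It also assumes the prolog runs as a user who is allowed to run qmod from the execution host, and that the host name matches the one Grid Engine uses for its queue instances.

```shell
#!/bin/sh
# Hypothetical prolog sketch. QMOD, SCRATCH_DIR and the function names
# are illustrative assumptions, not Grid Engine settings.

QMOD="${QMOD:-qmod}"                 # overridable so the logic can be dry-run
SCRATCH_DIR="${SCRATCH_DIR:-/tmp}"   # directory the health test writes to

# Return 0 if the node looks usable. The test here is deliberately simple:
# can we create (and remove) a file in the scratch directory?
node_is_healthy() {
    probe="$SCRATCH_DIR/.prolog_probe.$$"
    if ( : > "$probe" ) 2>/dev/null; then
        rm -f "$probe"
        return 0
    fi
    return 1
}

# Disable every queue instance on this host so no further jobs land here.
disable_node() {
    "$QMOD" -d "*@${HOSTNAME:-$(uname -n)}"
}

# Return 0 to let the job start, or 99 to ask for it to be rescheduled.
prolog_main() {
    if node_is_healthy; then
        return 0
    fi
    disable_node
    return 99
}

# Assumption: JOB_ID is present in the prolog environment, so the script
# only acts when Grid Engine runs it.
if [ -n "${JOB_ID:-}" ]; then
    prolog_main
    exit $?
fi
```

Swap node_is_healthy for whatever check matches the failure you actually see (interconnect, filesystem mounts, and so on); the disable-then-exit-99 shape stays the same.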
On our GE 6.0 cluster, we put in a check/disable at job end. I have lost
count of the number of times this has saved queued jobs from vanishing
down a plughole over the past 5+ years. Bizarrely, it also meant that the
users got the impression that the cluster was very stable, despite it
having relatively flaky Myrinet 2000 hardware/drivers on it :)
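
For the rate idea quoted at the top, an epilog version might look roughly like the sketch below. This is not our production check, just one way to do it: each job end is appended to a file as a timestamp, and the host's queues are disabled if too many jobs have ended within a short window. The file location, WINDOW, MAX_ENDS, QMOD and the function names are all assumptions made for the example, and it counts every job end rather than only failed ones.

```shell
#!/bin/sh
# Hypothetical epilog sketch. RATE_LOG, WINDOW, MAX_ENDS and QMOD are
# illustrative assumptions, not Grid Engine settings.

QMOD="${QMOD:-qmod}"                            # overridable for dry runs
RATE_LOG="${RATE_LOG:-/tmp/sge_job_ends.log}"   # one timestamp per job end
WINDOW="${WINDOW:-60}"                          # look-back period in seconds
MAX_ENDS="${MAX_ENDS:-10}"                      # job ends tolerated per window

# Append the current time (seconds since the epoch) to the log.
record_job_end() {
    date +%s >> "$RATE_LOG"
}

# Print how many recorded job ends fall within the last WINDOW seconds.
recent_job_ends() {
    now=$(date +%s)
    awk -v now="$now" -v win="$WINDOW" \
        'now - $1 <= win { n++ } END { print n + 0 }' "$RATE_LOG"
}

# Record this job end, then disable the host's queue instances if the
# number of recent job ends exceeds the threshold.
epilog_main() {
    record_job_end
    if [ "$(recent_job_ends)" -gt "$MAX_ENDS" ]; then
        "$QMOD" -d "*@${HOSTNAME:-$(uname -n)}"
    fi
    return 0
}

# Assumption: JOB_ID is present in the epilog environment, so the script
# only acts when Grid Engine runs it.
if [ -n "${JOB_ID:-}" ]; then
    epilog_main
fi
```

The log grows without bound as written, so a real version would want to trim old entries, and the threshold needs tuning if you legitimately run lots of very short jobs.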
Mark
--
-----------------------------------------------------------------
Mark Dixon                       Email    : m.c.dixon at leeds.ac.uk
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------