[gridengine users] How to detect "blackhole" host in gridengine?

Mark Dixon m.c.dixon at leeds.ac.uk
Mon Mar 14 14:31:27 UTC 2011


On Thu, 10 Mar 2011, Rayson Ho wrote:
...
> In LSF, the admin can define the EXIT_RATE for the host & the
> GLOBAL_EXIT_RATE rate for the whole cluster. In SGE the way to do this
> can only be done in the starter_method, as it knows when jobs are
> started & when jobs exit. So a simple one would write to some sort of
> /tmp area, and do some math to come up with the rate. When a job
> exceeds the EXIT_RATE threshold, then it will close the queue/host.
...

Good idea. For what it's worth, I would do this in the prolog or epilog 
rather than in the starter_method, for the following reasons:

1) If using the prolog, you should be able to disable the node then exit 
99 to cause the job to reschedule to another node.

2) If using the epilog, you have the opportunity to disable a node 
immediately after a job has caused a problem (which can aid scheduling 
efficiency if you have a lot of jobs requesting Resource Reservations) 
instead of detecting at job start. You do get the odd sacrificial job if a 
problem spontaneously develops, though.

3) Writing a starter_method that can cope with all the different ways that 
tightly-integrated MPI implementations start can be fiddly. [if anyone's 
interested, sticking an  eval "$@"  at the bottom of the script solves 
most of them]
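
A minimal sketch of point 1, assuming the prolog runs as a user with 
operator or manager rights (qmod -d needs them) and that a failed check 
should take every queue instance on the host out of service. The function 
names, the QMOD override and the scratch-directory test are hypothetical 
placeholders for illustration, not part of Grid Engine; substitute 
whatever health checks matter on your site.

```shell
#!/bin/sh
# Hypothetical prolog helpers: disable this host and ask for a reschedule
# when a site-specific health check fails.

# Placeholder health check (an assumption): the scratch directory must
# exist and be writable. Replace with real checks for your hardware.
node_healthy() {
    [ -d "$1" ] && [ -w "$1" ]
}

# Disable every queue instance on host $1.
# QMOD may be set to another command for dry runs; it defaults to qmod.
disable_host() {
    ${QMOD:-qmod} -d "*@$1"
}

# prolog_check HOST SCRATCHDIR
# Returns 0 if the node looks healthy. Otherwise disables the host and
# returns 99, the prolog exit status that makes the job reschedule.
prolog_check() {
    if node_healthy "$2"; then
        return 0
    fi
    disable_host "$1"
    return 99
}

# A real prolog might end with something like:
#   prolog_check "$(hostname)" "${TMPDIR:-/tmp}"
#   exit $?
```

Disabling the node before returning 99 matters: otherwise the rescheduled 
job may simply land on the same host again.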


On our GE 6.0 cluster, we put in a check/disable at job end. I have lost 
count of the number of times this has saved queued jobs from vanishing 
down a plughole over the past 5+ years. Bizarrely, it also meant that the 
users got the impression that the cluster was very stable, despite 
it having relatively flaky Myrinet 2000 hardware/drivers on it :)
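
For anyone wanting to try the exit-rate idea from the epilog, here is a 
rough sketch. It is not the check we run; it only illustrates counting job 
ends per host and comparing them with a threshold. The log path, window 
and limit are assumptions, and it counts every job end rather than only 
failures, on the theory that a blackhole host finishes jobs unusually 
fast.

```shell
#!/bin/sh
# Hypothetical epilog helpers: record when jobs end on this host and report
# when too many have ended within a short window (a rough "exit rate").

# record_exit LOGFILE NOW
# Append the epoch second at which a job ended.
record_exit() {
    echo "$2" >> "$1"
}

# exits_in_window LOGFILE NOW WINDOW
# Print how many recorded job ends fall within the last WINDOW seconds.
exits_in_window() {
    awk -v now="$2" -v win="$3" \
        '$1 > now - win { n++ } END { print n + 0 }' "$1"
}

# exit_rate_exceeded LOGFILE NOW WINDOW LIMIT
# Succeeds (returns 0) when more than LIMIT jobs ended in the window.
exit_rate_exceeded() {
    [ "$(exits_in_window "$1" "$2" "$3")" -gt "$4" ]
}

# A real epilog might wire these together like this:
#   log=/var/tmp/sge_exit_times      # hypothetical path
#   now=$(date +%s)
#   record_exit "$log" "$now"
#   if exit_rate_exceeded "$log" "$now" 60 5; then
#       qmod -d "*@$(hostname)"
#   fi
```

The log grows without bound as written, so a real version would want to 
trim old entries, and the threshold needs tuning against the shortest 
legitimate jobs on the cluster.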

Mark
-- 
-----------------------------------------------------------------
Mark Dixon                       Email    : m.c.dixon at leeds.ac.uk
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------

