[gridengine users] execd load sensors timing
w.hay at ucl.ac.uk
Mon Jul 9 13:54:46 UTC 2012
On 9 July 2012 14:08, Reuti <reuti at staff.uni-marburg.de> wrote:
> Am 09.07.2012 um 14:51 schrieb William Hay:
>> On 9 July 2012 12:50, Reuti <reuti at staff.uni-marburg.de> wrote:
>>> Am 09.07.2012 um 11:42 schrieb William Hay:
>>>> When execd starts is it safe to assume that the load sensors will be
>>>> run and reported back to the qmaster/scheduler before the node is
>>>> contactable/eligible for scheduling again?
>>>> I have a load sensor that reports when the node was last booted and
>>>> would like to be sure that the time used for scheduling decisions is
>>> No. As far as I can see when I start the execd on a particular node, the load sensor is only triggered at the next interval of its usual cycle.
>>> To avoid it, you could also report a BOOLEAN in the load sensor and use it as an entry in load_thresholds in the queue definition. That puts the queue instance into alarm state (i.e. no jobs get scheduled to it) as long as the load sensor doesn't report TRUE to indicate the node is available.
>> Would there not be a similar risk there though where the boolean is
>> cached from before a reboot or do load thresholds work differently?
> If you reboot too fast: yes. So the old values should first vanish from the load report.
How does one determine what is too fast?
> You can set "initial_state" to disabled in the queue configuration, so that the queue on this exechost has to be enabled first after a reboot.
Really want to keep the initial_state at enabled. The point of the
exercise is to let grid engine schedule node reboots for us. We
do this by submitting jobs targeted at specific hosts but it can take
a lot of time this way. We have a lot of checks that run before
sge_execd is started so it is safe for jobs to run immediately
post-reboot. This helps minimise down time of individual nodes.
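
For anyone following the thread, here is a minimal sketch of the kind of load sensor being discussed, assuming the usual load sensor protocol: execd writes a line to the sensor's stdin at each load report interval, the sensor answers with a "begin" / "host:name:value" / "end" block, and it exits on "quit". The complex names boot_time and node_ready, the host name node01, and the sample /proc/stat text are hypothetical and only for illustration. How a BOOL complex value should be written (1 vs TRUE) is also an assumption here and should be checked against complex(5).

```python
import io


def parse_btime(stat_text):
    """Return the boot time (seconds since the epoch) from /proc/stat text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "btime":
            return int(fields[1])
    raise ValueError("no btime line found")


def format_report(host, values):
    """Build one load report: begin, host:name:value lines, end."""
    lines = ["begin"]
    lines += ["%s:%s:%s" % (host, name, value) for name, value in values]
    lines.append("end")
    return "\n".join(lines) + "\n"


def serve(inp, out, host, get_values):
    """Answer each line from execd with a report until 'quit' or EOF."""
    for line in inp:
        if line.strip() == "quit":
            break
        out.write(format_report(host, get_values()))
        out.flush()  # execd reads from a pipe, so do not leave the report buffered


# Simulated exchange: one poll from execd followed by "quit".
# boot_time and node_ready are hypothetical complex names.
sample_stat = "cpu 1 2 3 4\nbtime 1341842000\n"
fake_execd = io.StringIO("\nquit\n")
captured = io.StringIO()
serve(fake_execd, captured, "node01",
      lambda: [("boot_time", parse_btime(sample_stat)), ("node_ready", 1)])
print(captured.getvalue(), end="")
```

On a real node the same serve() would be called with sys.stdin and sys.stdout, the host name as execd knows it, and a callback that reads /proc/stat. Note that this only shows how the values are reported. It does not by itself address the timing question in the thread, since the first report still arrives only at the next load report interval.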