[gridengine users] nodes in alarm

Reuti reuti at staff.uni-marburg.de
Fri Sep 23 22:05:04 UTC 2011


On 23.09.2011 at 23:33, Marty Dippel wrote:

> Thanks, Ian!
> 
> I take it that "alarm" usually means something job-related (asking for
> more resources than available, for example) as opposed to something gone
> wrong in the queuing system per se.

No, it's a "problem" with the node. Please check the setting of:

$ qconf -sq all.q
...
load_thresholds   NONE

The default is np_load_avg=1.75, which is more or less useless nowadays. The problem is that processes in state "D" (uninterruptible sleep, which usually points to "waiting for disk") are counted in the load as well. So a load higher than the number of installed cores times 1.75 can still be fine. [Originally the load was the length of the process chain (run queue), i.e. the number of processes which are eligible to get some CPU cycles. As long as this number is lower than the number of installed cores, all processes are running at full speed (despite any set nice values), as there is no one to be nice to. Only with more processes than cores is there something to share.]
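To make the arithmetic concrete: np_load_avg is the load average divided by the number of processors. A small sketch, assuming compute-4-6 has 2 cores (the qstat listing only shows 2 slots, so the core count is a guess); np_load_avg and over_threshold are hypothetical helper names, not SGE commands:

```shell
#!/bin/sh
# Sketch only: np_load_avg and over_threshold are illustrative helpers,
# not part of SGE.

# Normalized load = load average / number of cores.
np_load_avg() {
    LC_ALL=C awk -v avg="$1" -v ncpu="$2" 'BEGIN { printf "%.3f\n", avg / ncpu }'
}

# Succeeds (exit 0) if the normalized load exceeds the threshold.
over_threshold() {
    LC_ALL=C awk -v avg="$1" -v ncpu="$2" -v thr="$3" \
        'BEGIN { exit !(avg / ncpu > thr) }'
}

# load_avg 4.03 is taken from the qstat output; 2 cores is an assumption.
np_load_avg 4.03 2
if over_threshold 4.03 2 1.75; then
    echo "above np_load_avg=1.75: queue instance goes into alarm"
fi
```

With these numbers the normalized load is 2.015, i.e. above 1.75, which would explain the "a" state even though both jobs on the node may be running just fine.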

Especially if you have defined slots = cores per machine, you don't need any load_thresholds.
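For reference, the threshold can be switched off for a queue in one call with qconf -mattr instead of editing the queue in an editor via qconf -mq. A minimal sketch: set_load_thresholds is a hypothetical wrapper which only prints the command unless APPLY=1 is set, so it can be tried on a machine without SGE:

```shell
#!/bin/sh
# Hypothetical wrapper: prints the qconf call by default,
# executes it only when APPLY=1 is set in the environment.
set_load_thresholds() {
    queue=$1
    value=$2
    if [ "${APPLY:-0}" = "1" ]; then
        qconf -mattr queue load_thresholds "$value" "$queue"
    else
        echo "qconf -mattr queue load_thresholds $value $queue"
    fi
}

# Disable the load threshold for all.q (dry run unless APPLY=1).
set_load_thresholds all.q NONE
```

The change can be checked afterwards with qconf -sq all.q as shown above.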

I think it was invented at a time when you had big SMP machines with 256 cores (which is only one node to SGE) and oversubscribed the node on purpose (being aware of the fact that not all parallel applications scale linearly, so they leave some CPU cycles idle). So maybe it was fine to define 512 slots on such a machine. Only when the load per core passed 1.75 did you get an alarm state.

The alarm state is nothing to have a heart attack over. It only means that a load threshold was passed and therefore no more jobs will be scheduled to this machine until the reason for the alarm vanishes again.

I use it in combination with a load sensor from the Howto page to check whether the local scratch space has filled up on a node, as this could result in a black hole (job starts, crashes, next job starts, crashes, ...).
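Such a load sensor is only a small script. A rough sketch of the idea, not the script from the Howto page: the complex name scratch_free is an assumption (it would have to be defined with qconf -mc before it can be used in load_thresholds), and SCRATCH_DIR falls back to /tmp only so that the sketch runs anywhere. The protocol is simple: execd writes a line to the sensor's stdin for each load report, the sensor answers with a begin/end block of host:name:value lines, and it terminates when it reads "quit".

```shell
#!/bin/sh
# Sketch of an SGE load sensor. "scratch_free" is a hypothetical complex name
# and /tmp is a placeholder for the local scratch directory.
SCRATCH_DIR=${SCRATCH_DIR:-/tmp}

# Free space in KB: the "Available" column of POSIX df output.
scratch_free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

scratch_sensor() {
    host=$(uname -n)
    while read -r input; do
        # execd sends "quit" when the sensor should terminate.
        if [ "$input" = "quit" ]; then
            return 0
        fi
        # Any other line (normally empty) requests a load report.
        echo "begin"
        echo "$host:scratch_free:$(scratch_free_kb "$SCRATCH_DIR")"
        echo "end"
    done
    # stdin was closed without a "quit".
    return 1
}

# Start the loop only when asked to, e.g. RUN_SENSOR=1 in the sensor script.
if [ "${RUN_SENSOR:-0}" = "1" ]; then
    scratch_sensor
fi
```

How the reported value is compared in load_thresholds depends on the relation operator chosen in the complex definition, so that part is left out here.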

> Anyway, I'll try "-explain" - thanks!!

Looks like it can even be used without a JOBID.

-- Reuti


> Marty
> 
> 
> 
> On 9/23/11 4:22 PM, Ian Kaufman wrote:
>> On Fri, Sep 23, 2011 at 1:55 PM, Marty Dippel <mdippel at iit.edu> wrote:
>>> SGE Newbie question-
>>> 
>>> When I "qstat -f" a few of the nodes return an "a" state, which I
>>> believe means the node is in alarm.
>>> 
>>> 
>>> queuename                      qtype used/tot. load_avg arch          states
>>> ----------------------------------------------------------------------------
>>> all.q@compute-4-6.local        BIP   2/2       4.03     lx26-amd64    a
>>> 35329 0.50894 finer3a    abaezgua     r     09/23/2011 11:08:04     2
>>> 
>>> ----------------------------------------------------------------------------
>>> 
>>> 
>>> 1. What's the best way for me to discover the cause of the alarm state?
>> 
>> qstat -explain a JOBID
>> 
>>> 
>>> 2. Once a node is in alarm, will it reset by itself when the condition
>>> is corrected or will it require human intervention to clear this state?
>> 
>> Depends on if the node can clear out the job or not without human
>> intervention. Usually, it's best to intervene.
>> 
>> Ian
>> 
> _______________________________________________
> users mailing list
> users at gridengine.org
> https://gridengine.org/mailman/listinfo/users




