[gridengine users] nodes in alarm

Marty Dippel mdippel at iit.edu
Fri Sep 23 21:33:08 UTC 2011


Thanks, Ian!

I take it that "alarm" usually means something job-related (asking for
more resources than available, for example) as opposed to something gone
wrong in the queuing system per se.

Anyway, I'll try "-explain" - thanks!!

Marty



On 9/23/11 4:22 PM, Ian Kaufman wrote:
> On Fri, Sep 23, 2011 at 1:55 PM, Marty Dippel <mdippel at iit.edu> wrote:
>> SGE Newbie question-
>>
>> When I run "qstat -f", a few of the nodes return an "a" state, which I
>> believe means the node is in alarm.
>>
>>
>> queuename                      qtype used/tot. load_avg arch          states
>> ----------------------------------------------------------------------------
>> all.q@compute-4-6.local        BIP   2/2       4.03     lx26-amd64    a
>>  35329 0.50894 finer3a    abaezgua     r     09/23/2011 11:08:04     2
>>
>> ----------------------------------------------------------------------------
>>
>>
>> 1. What's the best way for me to discover the cause of the alarm state?
> 
> qstat -explain a JOBID
> 
>>
>> 2. Once a node is in alarm, will it reset by itself when the condition
>> is corrected or will it require human intervention to clear this state?
> 
> Depends on whether the node can clear out the job without human
> intervention. Usually, it's best to intervene.
> 
> Ian
> 
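
For anyone scripting this check, here is a minimal Python sketch that picks
the queue instances flagged "a" out of saved "qstat -f" output and prints a
follow-up command for each. It is not part of the original exchange. The
sample text is the listing quoted above, with the queue instance written
with "@" as qstat prints it. The function name queues_in_state is made up
for illustration, and the suggested "qstat -f -explain a -q ..." form is an
assumption based on the qstat man page, so check it against your installed
version before relying on it.

```python
# Sample copied from the "qstat -f" listing quoted in this thread.
SAMPLE = """\
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@compute-4-6.local        BIP   2/2       4.03     lx26-amd64    a
 35329 0.50894 finer3a    abaezgua     r     09/23/2011 11:08:04     2
----------------------------------------------------------------------------
"""


def queues_in_state(qstat_f_output, flag="a"):
    """Return (queue_instance, load_avg, states) for rows whose states
    column contains `flag`. Hypothetical helper, not part of SGE."""
    hits = []
    for line in qstat_f_output.splitlines():
        fields = line.split()
        # Queue-instance rows have six columns when a state is set:
        # name, qtype, used/tot, load_avg, arch, states.
        # Rows with an empty states column have five and are skipped,
        # as are job rows and the header line (no "@" in the first field).
        if len(fields) == 6 and "@" in fields[0] and "/" in fields[2]:
            name, _qtype, _slots, load_avg, _arch, states = fields
            # Case-sensitive on purpose: "a" and "A" are different states.
            if flag in states:
                hits.append((name, load_avg, states))
    return hits


for name, load_avg, states in queues_in_state(SAMPLE):
    # Assumed command form; verify against "man qstat" on your cluster.
    print(f"{name} (load_avg {load_avg}, states {states}): "
          f"try 'qstat -f -explain a -q {name}'")
```

To use it on a live system, save the output of "qstat -f" to a file and
pass the file's contents to queues_in_state in place of SAMPLE.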

