[gridengine users] nodes in alarm
Marty Dippel
mdippel at iit.edu
Fri Sep 23 21:33:08 UTC 2011
Thanks, Ian!
I take it that "alarm" usually means something job-related (asking for
more resources than available, for example) as opposed to something gone
wrong in the queuing system per se.
Anyway, I'll try "-explain" - thanks!!
Marty
On 9/23/11 4:22 PM, Ian Kaufman wrote:
> On Fri, Sep 23, 2011 at 1:55 PM, Marty Dippel <mdippel at iit.edu> wrote:
>> SGE Newbie question-
>>
>> When I "qstat -f" a few of the nodes return an "a" state, which I
>> believe means the node is in alarm.
>>
>>
>> queuename                    qtype used/tot. load_avg arch        states
>> ----------------------------------------------------------------------------
>> all.q at compute-4-6.local   BIP   2/2       4.03     lx26-amd64  a
>>   35329 0.50894 finer3a    abaezgua     r     09/23/2011 11:08:04     2
>>
>> ----------------------------------------------------------------------------
>>
>>
>> 1. What's the best way for me to discover the cause of the alarm state?
>
> qstat -explain a JOBID
>
>>
>> 2. Once a node is in alarm, will it reset by itself when the condition
>> is corrected or will it require human intervention to clear this state?
>
> Depends on whether the node can clear out the job without human
> intervention. Usually, it's best to intervene.
>
> Ian
>
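
For anyone following the thread: a minimal sketch of how the alarmed queue
instances could be picked out of "qstat -f" output before asking for an
explanation. The function name alarm_queues is hypothetical, and the awk
filter assumes the six-column queue line layout shown in the output quoted
above (queue name first, state letters last, and no sixth column at all when
a queue has no state set). It reads from stdin, so it can be tried on saved
output without a live cluster.

```shell
# Hypothetical helper: print the names of queue instances whose state
# column contains a lowercase "a" (alarm).
# Assumes the `qstat -f` layout quoted in this thread:
#   queuename qtype used/tot. load_avg arch states
alarm_queues() {
    # A queue line with a state has exactly 6 fields, and its third field
    # is a slot count such as 2/2 (or 0/2/2 where a resv column is shown).
    # Job lines have a different field count and a job name in field 3,
    # so they are skipped.
    awk 'NF == 6 && $3 ~ /^[0-9]+(\/[0-9]+)+$/ && $6 ~ /a/ { print $1 }'
}
```

Typical use would be "qstat -f | alarm_queues" to get the list, followed by
"qstat -f -explain a", which per the qstat man page prints the reason for the
alarm state under each affected queue instance.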