[gridengine users] Grid queue goes into an error state due to one job

Simon Andrews simon.andrews at babraham.ac.uk
Mon May 18 20:22:12 UTC 2015


If we're making a list, then we've also had a problem with very long
command lines: qsub accepts the command, but the shell then rejects it
as too big ("Argument list too long"), and the node goes down.
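
A rough way to catch this before submission is to compare the size of the
command line plus environment against the system limit. The sketch below is
only an illustration in Python, not part of Grid Engine: the helper names
command_line_bytes and fits_in_arg_max and the 4096-byte headroom are
assumptions, and the limit on the submit host may differ from the one on the
execution node, so treat a pass as a hint rather than a guarantee.

```python
import os


def command_line_bytes(argv, env=None):
    """Approximate the bytes exec must copy for argv plus the environment."""
    env = os.environ if env is None else env
    # Each argument is stored with a trailing NUL byte.
    arg_bytes = sum(len(os.fsencode(a)) + 1 for a in argv)
    # Each environment entry is stored as "KEY=VALUE" plus a trailing NUL.
    env_bytes = sum(
        len(os.fsencode(k)) + len(os.fsencode(v)) + 2 for k, v in env.items()
    )
    return arg_bytes + env_bytes


def fits_in_arg_max(argv, env=None, limit=None, headroom=4096):
    """Return True if argv and env look small enough to exec on this host.

    limit defaults to this host's SC_ARG_MAX; headroom is an arbitrary
    safety margin (an assumption, not a value taken from Grid Engine).
    """
    if limit is None:
        limit = os.sysconf("SC_ARG_MAX")
    if limit <= 0:
        # The limit could not be determined, so no check is possible.
        return True
    return command_line_bytes(argv, env) + headroom <= limit


if __name__ == "__main__":
    # Hypothetical job command with a very large number of file arguments.
    argv = ["my_job.sh"] + ["/data/sample_%06d.fastq" % i for i in range(200000)]
    print("bytes needed:", command_line_bytes(argv))
    print("fits:", fits_in_arg_max(argv))
```

If the check fails, one option is to write the arguments to a file and pass
only the file name to the job.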

Simon.

On 18/05/2015 15:48, "Tina Friedrich" <Tina.Friedrich at diamond.ac.uk> wrote:

>I can add 'network file system gone AWOL on a node' to the list of
>common causes, I think...
>
>Tina
>
>On 18/05/15 15:03, Skylar Thompson wrote:
>> That's been our experience too, with the second highest cause a
>> segfault in the user's code.
>>
>> You can figure out for sure by looking at the exec daemon's messages
>> file.
>>
>> On Mon, May 18, 2015 at 02:52:15PM +0200, Nicolás Serrano Martínez-Santos
>> wrote:
>>> It can be caused by multiple issues. The most common cause in my
>>> department is that the HDD of the execution host is full, so Grid
>>> Engine puts the host in an error state to prevent more errors.
>>>
>>> NiCo
>>>
>>> Excerpts from sudha.penmetsa's message of 2015-05-18 14:45:48 +0200:
>>>> Hi Gavin,
>>>>
>>>> I clear the error state using qmod -c "*".
>>>>
>>>> I wanted to know the root cause and how to fix the issue
>>>> permanently.
>>>>
>>>> Regards,
>>>> Sudha
>>>>
>>>> -----Original Message-----
>>>> From: Gavin W. Burris [mailto:bug at wharton.upenn.edu]
>>>> Sent: Monday, May 18, 2015 6:08 PM
>>>> To: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
>>>> Cc: users at gridengine.org
>>>> Subject: Re: [gridengine users] Grid queue goes into an error state due to one job
>>>>
>>>> Hello, Sudha.
>>>>
>>>> Give this a try:  qmod -c "*"
>>>>
>>>> Cheers.
>>>>
>>>>
>>>> On 10:51AM Mon 05/18/15 +0000, sudha.penmetsa at wipro.com wrote:
>>>>> Hi,
>>>>>
>>>>> We have a few hosts added to a queue. Because of a single job
>>>>> submitted to the queue, the whole queue goes into an error state.
>>>>>
>>>>> As a result, no new jobs can be submitted to the queue unless we
>>>>> clear the error state.
>>>>>
>>>>> Can anyone please let me know what could be the reason for this and
>>>>> how to fix it permanently?
>>>>>
>>>>> Example:
>>>>>
>>>>> test.q@host1              BIP   7/40      10.86    lx24-amd64    E
>>>>>          queue test.q marked QERROR as result of job 8169748's failure at host host1
>>>>> ---------------------------------------------------------------------------
>>>>> test.q@host2              BIP   7/40      10.74    lx24-amd64    E
>>>>>          queue test.q marked QERROR as result of job 8169748's failure at host host2
>>>>> ---------------------------------------------------------------------------
>>>>> test.q@host3              BIP   10/40     10.73    lx24-amd64    E
>>>>>          queue test.q marked QERROR as result of job 8169748's failure at host host3
>>>>> ---------------------------------------------------------------------------
>>>>> test.q@host4              BIP   8/40      11.28    lx24-amd64    E
>>>>>          queue test.q marked QERROR as result of job 8169748's failure at host host4
>>>>> ---------------------------------------------------------------------------
>>>>> test.q@host5              BIP   7/40      11.52    lx24-amd64    E
>>>>>          queue test.q marked QERROR as result of job 8169748's failure at host host5
>>>>> ---------------------------------------------------------------------------
>>>>> test.q@host6              BIP   8/40      10.41    lx24-amd64    E
>>>>>          queue test.q marked QERROR as result of job 8169748's failure at host host6
>>>>>
>>>>> Regards,
>>>>> Sudha
>>>>> The information contained in this electronic message and any
>>>>> attachments to this message are intended for the exclusive use of the
>>>>> addressee(s) and may contain proprietary, confidential or privileged
>>>>> information. If you are not the intended recipient, you should not
>>>>> disseminate, distribute or copy this e-mail. Please notify the sender
>>>>> immediately and destroy all copies of this message and any
>>>>> attachments. WARNING: Computer viruses can be transmitted via email.
>>>>> The recipient should check this email and any attachments for the
>>>>> presence of viruses. The company accepts no liability for any damage
>>>>> caused by any virus transmitted by this email. www.wipro.com
>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users at gridengine.org
>>>>> https://gridengine.org/mailman/listinfo/users
>>>>
>>>>
>>>> --
>>>> Gavin W. Burris
>>>> Senior Project Leader for Research Computing The Wharton School
>>>>University of Pennsylvania
>>>>
>>
>
>
>--
>Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
>Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
>
>--
>This e-mail and any attachments may contain confidential, copyright and
>or privileged material, and are for the use of the intended addressee
>only. If you are not the intended addressee or an authorised recipient of
>the addressee please notify us of receipt by returning the e-mail and do
>not use, copy, retain, distribute or disclose the information in or
>attached to the e-mail.
>Any opinions expressed within this e-mail are those of the individual and
>not necessarily of Diamond Light Source Ltd.
>Diamond Light Source Ltd. cannot guarantee that this e-mail or any
>attachments are free from viruses and we cannot accept liability for any
>damage which you may sustain as a result of software viruses which may be
>transmitted in or with the message.
>Diamond Light Source Limited (company no. 4375679). Registered in England
>and Wales with its registered office at Diamond House, Harwell Science
>and Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom

The Babraham Institute, Babraham Research Campus, Cambridge CB22 3AT Registered Charity No. 1053902.
The information transmitted in this email is directed only to the addressee. If you received this in error, please contact the sender and delete this email from your system. The contents of this e-mail are the views of the sender and do not necessarily represent the views of the Babraham Institute. Full conditions at: http://www.babraham.ac.uk/terms


