[gridengine users] commlib errors?

Skylar Thompson skylar2 at u.washington.edu
Thu Jul 12 15:05:18 UTC 2012


We tried doing that, and it fixed most of our issues, but not the SGE
ones. We're still seeing very high CPU load on the sge_shepherd and
sge_execd processes. We were also seeing high load on our sge_qmaster
process until we restarted it.
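
For anyone else who wants to try the clock reset quoted below on a whole
cluster, here is a minimal sketch of how it could be scripted. It assumes
passwordless root ssh to every node, GNU date on the nodes, and a hosts
file with one hostname per line; the function names and the hosts file
are made up for illustration and are not part of SGE.

```shell
#!/bin/sh
# Sketch only: push the "re-set the clock to itself" workaround to a list
# of hosts. Function names and the hosts file format are assumptions.

# Print the command to run remotely. Single quotes keep $(date) from being
# expanded locally, so each host reads its own clock.
reset_clock_cmd() {
    printf '%s\n' 'date -s "$(date)"'
}

# run_on_hosts HOSTS_FILE [RUNNER]
# Runs the reset command on every host listed in HOSTS_FILE (one per line,
# blank lines and lines starting with '#' are skipped). RUNNER defaults to
# ssh; a different command can be passed in for a dry run.
run_on_hosts() {
    hosts_file=$1
    runner=${2:-ssh}
    while IFS= read -r host; do
        case $host in
            ''|'#'*) continue ;;
        esac
        # </dev/null stops ssh from consuming the rest of the hosts file.
        $runner "$host" "$(reset_clock_cmd)" </dev/null \
            || echo "failed: $host" >&2
    done < "$hosts_file"
}
```

Calling `run_on_hosts hosts.txt` as a user who can ssh in as root would
apply it everywhere; passing a second argument such as `echo` only prints
what would be run on each host.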

-- Skylar Thompson (skylar2 at u.washington.edu)
-- Genome Sciences Department, System Administrator
-- Foege Building S046, (206)-685-7354
-- University of Washington School of Medicine

On 07/12/12 07:59 AM, Adam Tygart wrote:
> In my experience, a workaround to the leapsecond bug is to run
>
> date -s "`date`"
>
> on all of the hosts in the cluster. This caused all of the services
> experiencing issues to calm down and return to normal without
> restarting any services.
>
> --
> Adam Tygart
> Beocat Sysadmin
>
> On Thu, Jul 12, 2012 at 9:55 AM, Skylar Thompson
> <skylar2 at u.washington.edu>  wrote:
>> We've been seeing this issue too. It seemed to be correlated with the leap
>> second that got added at the end of June 30 (UTC). RHEL6 has a bug that causes issues for
>> threaded applications. We're working on restarting all our sge_execd's, but
>> not all of them want to softstop properly.
>>
>> See this knowledgebase article for more info:
>>
>> https://access.redhat.com/knowledge/articles/15145
>>
>> We're not totally certain this is the issue, but it's highly correlated in
>> time with the leap second.
>>
>> -- Skylar Thompson (skylar2 at u.washington.edu)
>> -- Genome Sciences Department, System Administrator
>> -- Foege Building S046, (206)-685-7354
>> -- University of Washington School of Medicine
>>
>>
>> On 07/12/12 07:50 AM, Michael Coffman wrote:
>>>
>>> I am intermittently seeing the following on the command line when
>>> attempting to run qrsh without any options:
>>>
>>> error: error running IJS server: "can't create tty_to_commlib thread:
>>> timeout while waiting for thread start"
>>>
>>> In addition, I have started to see the following in the
>>> spool/qmaster/messages file (unrelated?):
>>>
>>> 07/11/2012 13:06:02|listen|serverA|E|commlib error: got read error
>>> (closing "hostA/qstat/29971")
>>>
>>> These appear to be 2 separate problems as one is qrsh and the other
>>> appears to be qstat.
>>>
>>> I am running sge6.2u5
>>> qmaster is running on rhel5
>>> clients are rhel5 and rhel6
>>>
>>> The qrsh issue seems to happen much more frequently on the rhel6 system.
>>>
>>> Thanks for any help in how to troubleshoot this.
>>> --
>>> -MichaelC
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> users at gridengine.org
>>> https://gridengine.org/mailman/listinfo/users

