[gridengine users] commlib errors?
skylar2 at u.washington.edu
Thu Jul 12 15:05:18 UTC 2012
We tried doing that, and it fixed most of our issues, but not the SGE
ones. We're seeing very high CPU load on the sge_shepherd and sge_execd
processes. We were seeing high load on our sge_qmaster process until we
-- Skylar Thompson (skylar2 at u.washington.edu)
-- Genome Sciences Department, System Administrator
-- Foege Building S046, (206)-685-7354
-- University of Washington School of Medicine
On 07/12/12 07:59 AM, Adam Tygart wrote:
> In my experience, a workaround to the leapsecond bug is to run
> date -s "`date`"
> on all of the hosts in the cluster. This caused all of the services
> experiencing issues to calm down and return to normal without
> restarting any services.
> Adam Tygart
> Beocat Sysadmin
> On Thu, Jul 12, 2012 at 9:55 AM, Skylar Thompson
> <skylar2 at u.washington.edu> wrote:
>> We've been seeing this issue too. It seemed to be correlated with the leap
>> second that was added at the end of June 30. RHEL6 has a bug that causes issues for
>> threaded applications. We're working on restarting all our sge_execd's, but
>> not all of them want to softstop properly.
>> See this knowledgebase article for more info:
>> We're not totally certain this is the issue, but it's highly correlated in
>> time with the leap second.
>> -- Skylar Thompson (skylar2 at u.washington.edu)
>> -- Genome Sciences Department, System Administrator
>> -- Foege Building S046, (206)-685-7354
>> -- University of Washington School of Medicine
>> On 07/12/12 07:50 AM, Michael Coffman wrote:
>>> I am intermittently seeing the following on the command line when
>>> attempting to run qrsh without any options:
>>> error: error running IJS server: "can't create tty_to_commlib thread:
>>> timeout while waiting for thread start"
>>> In addition, I have started to see the following in the
>>> spool/qmaster/messages file (unrelated?):
>>> 07/11/2012 13:06:02|listen|serverA|E|commlib error: got read error
>>> (closing "hostA/qstat/29971")
>>> These appear to be 2 separate problems as one is qrsh and the other
>>> appears to be qstat.
>>> I am running sge6.2u5
>>> qmaster is running on rhel5
>>> clients are rhel5 and rhel6
>>> The qrsh issue seems to happen much more frequently on the rhel6 system.
>>> Thanks for any help with troubleshooting this.
>>> users mailing list
>>> users at gridengine.org
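For anyone wanting to apply the clock-reset workaround quoted above across a whole cluster, here is a minimal sketch. It is not from the thread: the host list file, the root ssh access, and the DRY_RUN switch are all assumptions for illustration, and the function names are hypothetical. It only wraps the `date -s` command Adam describes, written with `$(date)` instead of backticks.

```shell
#!/bin/sh
# Sketch only: re-set each host's clock to its own current time.
# Assumptions (not from the thread): a file with one hostname per line,
# passwordless ssh as root, and GNU date on the remote hosts.

# Single quotes keep $(date) unexpanded here, so it is evaluated by the
# remote shell and each host re-sets its clock to its own current time.
RESET_CMD='date -s "$(date)"'

reset_clock() {
    # $1 = hostname. DRY_RUN defaults to 1, which only prints the command.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf 'ssh root@%s %s\n' "$1" "$RESET_CMD"
    else
        # -n stops ssh from swallowing the host list on stdin.
        ssh -n "root@$1" "$RESET_CMD"
    fi
}

reset_all() {
    # $1 = path to the host list (hypothetical); skip blanks and comments.
    while IFS= read -r host; do
        case "$host" in ''|'#'*) continue ;; esac
        reset_clock "$host"
    done < "$1"
}
```

Sourcing the file and calling `reset_all hosts.txt` prints the commands that would be run; setting `DRY_RUN=0` first runs them for real. The dry run is the default because setting the clock needs root and is worth checking on one host before touching every node.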