[gridengine users] Intermittent commlib errors with MPI jobs

Brendan Moloney moloney at ohsu.edu
Tue Nov 13 23:56:00 UTC 2012


OK, I will test that out once I can schedule some downtime.  I might even be able to get my hands on another switch by then.

I appreciate all the help.
________________________________________
From: Reuti [reuti at Staff.Uni-Marburg.DE]
Sent: Tuesday, November 13, 2012 3:33 AM
To: Brendan Moloney
Cc: users at gridengine.org
Subject: Re: [gridengine users] Intermittent commlib errors with MPI jobs

On 12.11.2012 at 22:03, Brendan Moloney wrote:

> I suppose it could be the switch.  Is the only way to test this to swap it out for a different switch?

Are all ports on the switch in use? If not, try moving the cables to different ports.

-- Reuti


> Thanks again,
> Brendan
> ________________________________________
> From: Reuti [reuti at staff.uni-marburg.de]
> Sent: Monday, November 12, 2012 4:17 AM
> To: Brendan Moloney
> Cc: users at gridengine.org
> Subject: Re: [gridengine users] Intermittent commlib errors with MPI jobs
>
> On 10.11.2012 at 00:31, Brendan Moloney wrote:
>
>> I spent some time researching this issue in the context of OpenSSH and found some mentions of similar problems due to the initial handshake packet being too large (http://serverfault.com/questions/265244/ssh-client-problem-connection-reset-by-peer).  I was dubious that this was my problem, but after manually specifying the cipher to use ('-c aes256-ctr') I haven't seen the problem again. With the number of submissions I have done now I would expect to have seen the issue several times, so I am fairly sure it is fixed.  Will keep an eye on it of course.
>>
>>>>>> Sometimes I get "Connection reset by peer"
>>>
>>> After a long time or instantly? There are some settings in ssh to avoid a timeout, in ssh_config or ~/.ssh/config respectively:
>>>
>>> Host *
>>>  Compression yes
>>>  ServerAliveInterval 900
>>
>> Seems to happen fast enough that it is not a timeout issue.
>>
>>>> I am indeed using SSH with a wrapper script for adding the group ID:
>>>>
>>>> qlogin_command               /usr/global/bin/qlogin-wrapper
>>>> qlogin_daemon                /usr/global/bin/rshd-wrapper
>>>> rlogin_command               /usr/bin/ssh
>>>> rlogin_daemon                /usr/global/bin/rshd-wrapper
>>>> rsh_command                  /usr/bin/ssh
>>>> rsh_daemon                   /usr/global/bin/rshd-wrapper
>>
>>> It's also possible to set different methods for each of the three pairs. So, rsh_command/rsh_daemon could be set to builtin and the others left as they are. Would this be appropriate for your intended setup of X11 forwarding?
>>
>> So using the builtin option would still allow enforcement of memory/time limits on parallel jobs?
>
> The ones set by SGE - yes.
>
> To the original problem: can it be a problem in the switch?
>
> -- Reuti
>
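
For reference, the cipher workaround quoted above ('-c aes256-ctr') could probably also be made persistent in the ssh client configuration instead of being passed on the command line. This is only a sketch: the file location and the "Host *" pattern are assumptions (the pattern could be narrowed to the execution hosts), and it presumes every node accepts aes256-ctr.

```
# ~/.ssh/config or the system-wide ssh_config (location is an assumption)
Host *
# Same effect as 'ssh -c aes256-ctr': offer only this cipher
    Ciphers aes256-ctr
```

Running "ssh -v" against a node and checking which cipher is negotiated should show whether the setting is picked up.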
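
Reuti's suggestion of switching only the rsh pair to the builtin method, while leaving qlogin and rlogin on SSH, would look roughly like this in the cluster configuration (editable with "qconf -mconf"). It is a sketch only: the first four entries are copied unchanged from the configuration quoted above, and whether builtin suits the group ID wrapper setup was not confirmed in this thread.

```
qlogin_command               /usr/global/bin/qlogin-wrapper
qlogin_daemon                /usr/global/bin/rshd-wrapper
rlogin_command               /usr/bin/ssh
rlogin_daemon                /usr/global/bin/rshd-wrapper
rsh_command                  builtin
rsh_daemon                   builtin
```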




