[gridengine users] Intermittent commlib errors with MPI jobs

Brendan Moloney moloney at ohsu.edu
Sat Nov 10 05:00:55 UTC 2012


And of course the error comes up again after sending the previous email...

However, I can report that this issue is not SSH related. I tried the 'builtin' option for the rsh and rlogin commands and I still see the same error.

Any other ideas?

Thanks,
Brendan

________________________________________
From: Brendan Moloney
Sent: Friday, November 09, 2012 3:31 PM
To: Reuti
Cc: users at gridengine.org
Subject: RE: [gridengine users] Intermittent commlib errors with MPI jobs

I spent some time researching this issue in the context of OpenSSH and found some mentions of similar problems caused by the initial handshake packet being too large (http://serverfault.com/questions/265244/ssh-client-problem-connection-reset-by-peer). I was dubious that this was my problem, but after manually specifying the cipher to use ('-c aes256-ctr') I haven't seen the problem again. Given the number of submissions I have done since, I would have expected to see the issue several times by now, so I am fairly sure it is fixed. I will keep an eye on it, of course.
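
For reference, the same cipher restriction could probably be made persistent in the client configuration instead of being passed on the command line. This is only a sketch: it assumes the `Ciphers` keyword in ssh_config or ~/.ssh/config and a catch-all `Host *` scope, neither of which is taken from my actual setup.

```
# Assumed equivalent of passing '-c aes256-ctr' to ssh
Host *
    Ciphers aes256-ctr
```

Narrowing `Host *` to the execution hosts would limit the change to the connections that Grid Engine makes.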

>>>> Sometimes I get "Connection reset by peer"
>
>After a long time or instantly? There are some settings in ssh to avoid a timeout, in ssh_config or ~/.ssh/config:
>
>Host *
>    Compression yes
>    ServerAliveInterval 900

Seems to happen fast enough that it is not a timeout issue.

>> I am indeed using SSH with a wrapper script for adding the group ID:
>>
>> qlogin_command               /usr/global/bin/qlogin-wrapper
>> qlogin_daemon                /usr/global/bin/rshd-wrapper
>> rlogin_command               /usr/bin/ssh
>> rlogin_daemon                /usr/global/bin/rshd-wrapper
>> rsh_command                  /usr/bin/ssh
>> rsh_daemon                   /usr/global/bin/rshd-wrapper

> It's also possible to set different methods for each of the three pairs. So, rsh_command/rsh_daemon could be set to builtin and the others left as they are. Would this be appropriate for your intended setup of X11 forwarding?
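
The mixed setup suggested here might look like the fragment below. It is a sketch only: it assumes the qlogin and rlogin entries keep the values quoted above and that only the rsh pair is switched to builtin. The configuration itself would be edited with `qconf -mconf`.

```
qlogin_command               /usr/global/bin/qlogin-wrapper
qlogin_daemon                /usr/global/bin/rshd-wrapper
rlogin_command               /usr/bin/ssh
rlogin_daemon                /usr/global/bin/rshd-wrapper
rsh_command                  builtin
rsh_daemon                   builtin
```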

So using the builtin option would still allow enforcement of memory/time limits on parallel jobs?

Thanks,
Brendan



