[gridengine users] Intermittent commlib errors with MPI jobs
moloney at ohsu.edu
Thu Nov 8 09:32:25 UTC 2012
>> I have MPICH2 tightly
>Which version? It should work out-of-the-box with SGE.
The version is 1.4, and yes, it does have built-in integration.
>> integrated with OGS 2011.11. Everything is working great in general. I have noticed that when I submit a moderate number of small MPI jobs (e.g. 100 jobs, each using two cores), I get intermittent commlib errors like:
>> commlib error: got select error (Broken pipe)
>> executing task of job 138060 failed: failed sending task to execd at node1.ohsu.edu: can't find connection
>This sounds like a network problem unrelated to SGE. Do you use a private network inside the cluster or can you outline the network configuration - do you have a dedicated switch for the cluster?
Dedicated switch. One node is elsewhere on the LAN, but I see this error come up between two nodes on the dedicated switch. None of the nodes show packet errors.
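For anyone wanting to repeat the packet-error check on their own nodes, here is a minimal sketch, assuming Linux hosts that expose per-interface counters under /sys/class/net. The function name `nic_error_counts` is made up for illustration, and this is not necessarily how the check above was done.

```shell
# Print "<interface> rx_errors=<n> tx_errors=<n>" for every interface found.
# Optional argument: base directory to scan (defaults to /sys/class/net).
nic_error_counts() {
    base="${1:-/sys/class/net}"
    for dir in "$base"/*; do
        # Skip entries that do not expose readable error counters.
        [ -r "$dir/statistics/rx_errors" ] || continue
        [ -r "$dir/statistics/tx_errors" ] || continue
        printf '%s rx_errors=%s tx_errors=%s\n' \
            "$(basename "$dir")" \
            "$(cat "$dir/statistics/rx_errors")" \
            "$(cat "$dir/statistics/tx_errors")"
    done
}

# Usage on a node:
#   nic_error_counts
```

Non-zero counters on the interfaces facing the cluster switch would point back toward the network. All zeros, as reported here, make a plain link problem less likely.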
>> Sometimes I get "Connection reset by peer"
>Which startup of slave tasks do you use, i.e.:
>$ qconf -sconf
>It sounds like an SSH problem given the output you mentioned above, and your settings could be different.
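The entries that matter for slave task startup can be pulled out of the `qconf -sconf` output with a small filter. This is a rough sketch: the parameter names (`rsh_command`, `rsh_daemon`, `rlogin_*`, `qlogin_*`) are the standard remote-startup entries of the SGE configuration, while the function name `show_remote_startup` is only illustrative.

```shell
# Keep only the remote-startup entries from `qconf -sconf` style text on stdin.
show_remote_startup() {
    grep -E '^(rsh|rlogin|qlogin)_(command|daemon)[[:space:]]'
}

# Usage on a host with qconf in PATH:
#   qconf -sconf | show_remote_startup
```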
I am indeed using SSH with a wrapper script for adding the group ID:
>> instead of "Broken pipe". I have the allocation rule set to round robin, so the second process is always spawned on a remote host.
>For small jobs I would configure it to run on only one machine - unless they create large scratch files.
Yes, but I would like to keep a single MPI parallel environment, and in general round robin is the best option for my setup.
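For reference, the rule in effect for a parallel environment can be read from `qconf -sp`. The sketch below assumes a PE named `mpich2`, which is a placeholder and not necessarily the name used on this cluster, and the function name `pe_allocation_rule` is likewise made up. A value of `$round_robin` spreads slots across hosts one at a time, whereas `$pe_slots` would keep all slots of a job on a single host.

```shell
# Extract the allocation_rule value from `qconf -sp <pe_name>` style text on stdin.
pe_allocation_rule() {
    awk '$1 == "allocation_rule" { print $2 }'
}

# Usage (the PE name "mpich2" is hypothetical):
#   qconf -sp mpich2 | pe_allocation_rule
```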