[gridengine users] Intermittent commlib errors with MPI jobs
moloney at ohsu.edu
Thu Nov 8 04:11:40 UTC 2012
I have MPICH2 tightly integrated with OGS 2011.11, and in general everything works well. However, when I submit a moderate number of small MPI jobs (e.g. 100 jobs, each using two cores), I get intermittent commlib errors like:
commlib error: got select error (Broken pipe)
executing task of job 138060 failed: failed sending task to execd at node1.ohsu.edu: can't find connection
Sometimes I get "Connection reset by peer" instead of "Broken pipe". I have the allocation rule set to round robin, so the second process is always spawned on a remote host. The cluster is small: just four servers (72 cores) on Gigabit Ethernet. The master spool is on NFS, while the local spool is on a local drive.
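One possible first step in narrowing this down is to tally the failures by error reason and by target host, to see whether one node or one error type dominates. The following is a minimal sketch in Python, not a Grid Engine tool: the function name tally_commlib_errors is hypothetical, and the regular expressions are inferred only from the two messages quoted above, so other message variants would need their own patterns.

```python
import re
from collections import Counter

# Patterns inferred from the two sample messages in this post only;
# they are an assumption, not taken from the Grid Engine source.
SELECT_ERR = re.compile(r"commlib error: got select error \((?P<reason>[^)]+)\)")
TASK_ERR = re.compile(
    r"executing task of job (?P<job>\d+) failed: "
    r"failed sending task to execd at (?P<host>\S+?):"
)


def tally_commlib_errors(lines):
    """Count select-error reasons and failing execd hosts in the given lines.

    Returns (reasons, hosts, jobs): two Counters and a set of job ids.
    """
    reasons = Counter()
    hosts = Counter()
    jobs = set()
    for line in lines:
        match = SELECT_ERR.search(line)
        if match:
            reasons[match.group("reason")] += 1
        match = TASK_ERR.search(line)
        if match:
            hosts[match.group("host")] += 1
            jobs.add(int(match.group("job")))
    return reasons, hosts, jobs
```

The function takes any iterable of text lines, so it could be fed the contents of each job's error output file in turn. If the host counts turn out to be roughly even across all four servers, that would point away from a single bad node and toward something shared, such as load on the master or the network.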
Any advice on how to debug this would be greatly appreciated.