[gridengine users] Intermittent commlib errors with MPI jobs

Brendan Moloney moloney at ohsu.edu
Thu Nov 8 04:11:40 UTC 2012


I have MPICH2 tightly integrated with OGS 2011.11, and in general everything works well. However, when I submit a moderate number of small MPI jobs (e.g. 100 jobs, each using two cores), I get intermittent commlib errors like:

commlib error: got select error (Broken pipe)
executing task of job 138060 failed: failed sending task to execd at node1.ohsu.edu: can't find connection

Sometimes I get "Connection reset by peer" instead of "Broken pipe". I have the allocation rule set to round robin, so the second process of each job is always spawned on a remote host. The cluster is small: just four servers (72 cores total) on gigabit Ethernet. The master spool is on NFS, while the local spool is on a local drive.
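
To check whether the failures cluster on one execution host, a tally along the following lines might help. This is only a minimal sketch: the regular expression is based solely on the error text quoted above (the wording may differ in other versions), and the function name tally_failures is just an illustrative choice, not part of OGS or MPICH2.

```python
import re
from collections import Counter

# Pattern derived from the error line quoted above; assumed, not taken
# from any OGS documentation, so adjust it if your messages differ.
TASK_FAILURE = re.compile(
    r"executing task of job (?P<job>\d+) failed: "
    r"failed sending task to execd at (?P<host>[^:\s]+):"
)


def tally_failures(lines):
    """Count failed task launches per execd host and collect the job IDs."""
    per_host = Counter()
    jobs = set()
    for line in lines:
        match = TASK_FAILURE.search(line)
        if match:
            per_host[match.group("host")] += 1
            jobs.add(int(match.group("job")))
    return per_host, jobs


sample = [
    "commlib error: got select error (Broken pipe)",
    "executing task of job 138060 failed: failed sending task to execd "
    "at node1.ohsu.edu: can't find connection",
]
per_host, jobs = tally_failures(sample)
for host, count in per_host.most_common():
    print(f"{host}\t{count}")
```

Feeding it the concatenated error output of all the jobs (for instance by iterating over the lines of each job's stderr file) would show whether one node accounts for most of the failed connections or whether they are spread evenly.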

Any advice on how to debug this would be greatly appreciated.

