[gridengine users] Parallel jobs failure after OS upgrade

Joshua Baker-LePain jlb at salilab.org
Wed Apr 4 20:37:56 UTC 2012

On Wed, 4 Apr 2012 at 6:33pm, Tru Huynh wrote:

> On Tue, Apr 03, 2012 at 03:19:51PM -0700, Joshua Baker-LePain wrote:
>> Yes.  We have the SGE commlib errors, and the Open MPI
>> "routed:binomial" errors.  I'm mainly focusing on the SGE problem
>> right now, as I think (hope) that fixing that will also fix the MPI
>> issue.
> could it be related to NFS (locking?) between your CentOS-6 clients
> and NFS shared SGE directory?
> or readdir failure such as:
> http://bugs.centos.org/view.php?id=5496

Aside: Wow, NFS in 6.2 seems rather wonky.  We've also hit this 

That being said, our SGE directory isn't NFS shared.  We use local spool 
directories and local SGE installations on all the nodes.  The only thing 
that's NFS mounted is $SGE_ROOT/$SGE_CELL/common so that we can have a 
shadow master.
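
For anyone wanting to confirm a similar layout on their own nodes, here is a minimal sketch (not something from this thread) that reports the filesystem type backing a given path. It assumes a Linux node where the mount table is readable at /proc/mounts. The helper name fs_type is hypothetical, and it does not handle mount points containing escaped spaces.

```python
import os

def fs_type(path, mounts_text):
    """Return the filesystem type of the longest mount point containing path.

    mounts_text is expected to be in /proc/mounts format:
    "<device> <mount point> <fstype> <options> ..." one entry per line.
    Returns None if no mount point matches.
    """
    path = os.path.abspath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fstype = fields[1], fields[2]
        # Match only on whole path components, so /opt/sge does not
        # accidentally match /opt/sge2.
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            if len(mount_point) > len(best_mount):
                best_mount, best_type = mount_point, fstype
    return best_type
```

To use it, read /proc/mounts into a string and call fs_type on os.path.realpath of $SGE_ROOT/$SGE_CELL/common and of whatever spool directory the execution daemons are configured to use. With the layout described above, only the common directory should come back as an NFS type (for example nfs or nfs4).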

Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
