[gridengine users] Parallel jobs failure after OS upgrade

Joshua Baker-LePain jlb at salilab.org
Wed Apr 4 20:37:56 UTC 2012


On Wed, 4 Apr 2012 at 6:33pm, Tru Huynh wrote

> On Tue, Apr 03, 2012 at 03:19:51PM -0700, Joshua Baker-LePain wrote:
>>
>> Yes.  We have the SGE commlib errors, and the Open MPI
>> "routed:binomial" errors.  I'm mainly focusing on the SGE problem
>> right now, as I think (hope) that fixing that will also fix the MPI
>> issue.
>
> could it be related to NFS (locking?) between your CentOS-6 clients
> and NFS shared SGE directory?
>
> or readdir failure such as:
> http://bugs.centos.org/view.php?id=5496

Aside: Wow, NFS in CentOS 6.2 seems rather wonky.  We've also hit this bug:
<https://bugzilla.redhat.com/show_bug.cgi?id=770250>.

That being said, our SGE directory isn't NFS shared.  We use local spool 
directories and local SGE installations on all the nodes.  The only thing 
that's NFS mounted is $SGE_ROOT/$SGE_CELL/common so that we can have a 
shadow master.
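
For anyone who wants to confirm the same layout on their own nodes, here is
a minimal sketch (not something we run in production) that takes the
contents of /proc/mounts and reports whether a given path sits on an NFS
mount.  The function names and the "common"/"spool" subdirectory choices are
my own illustration, and it assumes a Linux-style mounts table with no
escaped spaces in the mount points.

```python
import os


def fs_type_for(path, mounts_text):
    """Return (mount_point, fs_type) for the deepest mount containing path.

    mounts_text is the content of a Linux mounts table such as /proc/mounts.
    Returns None if no entry covers the path.
    """
    path = os.path.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point = os.path.normpath(fields[1])
        fs_type = fields[2]
        covers = (
            mount_point == "/"
            or path == mount_point
            or path.startswith(mount_point + "/")
        )
        # ">=" lets a later entry for the same mount point win, which
        # matches how stacked mounts shadow earlier ones.
        if covers and (best is None or len(mount_point) >= len(best[0])):
            best = (mount_point, fs_type)
    return best


def is_nfs(path, mounts_text):
    """True if path lives on a mount whose type starts with 'nfs' (nfs, nfs4)."""
    found = fs_type_for(path, mounts_text)
    return found is not None and found[1].startswith("nfs")


def check_sge_dirs(mounts_text, sge_root, sge_cell, subdirs=("common", "spool")):
    """Map each $SGE_ROOT/$SGE_CELL/<subdir> path to 'nfs' or 'local'.

    The subdirs default is illustrative; adjust it to your own layout.
    """
    result = {}
    for sub in subdirs:
        p = os.path.join(sge_root, sge_cell, sub)
        result[p] = "nfs" if is_nfs(p, mounts_text) else "local"
    return result


# Example use on a node (reads the live mounts table):
#   with open("/proc/mounts") as fh:
#       print(check_sge_dirs(fh.read(),
#                            os.environ["SGE_ROOT"], os.environ["SGE_CELL"]))
```

On a node set up like ours, only the common directory should come back as
"nfs"; everything else under the cell should be "local".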

-- 
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF


