[gridengine users] Parallel jobs failure after OS upgrade

Joshua Baker-LePain jlb at salilab.org
Wed Apr 11 19:11:23 UTC 2012


On Wed, 11 Apr 2012 at 10:46am, orlando.richards at ed.ac.uk wrote

> We ran into a problem with infiniband based MPI jobs caused by a change in 
> the default max locked memory ulimit which init-spawned processes start with, 
> between RHEL5 and RHEL6.
>
> If you run a job through the old and new environments which just does "ulimit 
> -a", do you see a difference? Particularly - do you see a difference in the 
> max locked memory (ulimit -l)?

I don't have any C5 hosts left in my real cluster, but I was able to 
dredge up my old VirtualBox test cluster and run ulimit on both.  The 
differences are (sorry for any bad table formatting):

resource          CentOS-5 value    CentOS-6 value
pending signals             6143             30507
max locked memory             32                64
max user processes          6143             30507

So every limit that differs *increased* going from C5 to C6.
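
For anyone who wants to repeat the comparison on their own hosts, a test job along these lines should do it.  This is only a minimal sketch: the job name "ulimit_check" and the function name "report_limits" are made up for illustration, and it assumes bash is available on the exec hosts.

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -N ulimit_check
#$ -cwd
#$ -j y
# Hypothetical test job: print the limits a job actually inherits from
# sge_execd, so the output from old and new hosts can be compared.

report_limits() {
    echo "host: $(uname -n)"
    # Full list of limits as seen inside the job.
    ulimit -a
    # The one limit that matters for the infiniband case, on its own line.
    echo "max locked memory (ulimit -l): $(ulimit -l)"
}

report_limits
```

Submit it with qsub to a queue on each OS version and diff the two output files.  Running it from a login shell is not equivalent, since the limits there come from the login session rather than from the execd.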

> Our fix for this was to put a "ulimit -l unlimited" in our sgeexecd init 
> script, immediately before the sge_execd startup command. In our case, 
> "unlimited" is the required value as per the QLogic infiniband setup process.

I tried this anyway and still saw the same failures.  Thanks for having a 
look, though.
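
For the archives, the change I tested amounts to roughly the following.  This is a sketch, not the actual sgeexecd init script: the function name "start_execd" and the EXECD path are placeholders I made up here, so substitute whatever your installation uses.

```shell
#!/bin/bash
# Sketch of the relevant fragment of an sgeexecd init script.
# EXECD is a hypothetical path; point it at your site's sge_execd binary.
EXECD="${EXECD:-/opt/sge/bin/lx-amd64/sge_execd}"

start_execd() {
    # Raise the locked-memory limit in this shell so that sge_execd, and
    # every job it spawns, inherits it.  This needs root or a hard limit
    # that is already high enough.
    if ! ulimit -l unlimited 2>/dev/null; then
        echo "warning: could not raise max locked memory" >&2
    fi
    echo "max locked memory before start: $(ulimit -l)"

    # Start the daemon only if the binary is really there.
    if [ -x "$EXECD" ]; then
        "$EXECD"
    else
        echo "sge_execd not found at $EXECD; skipping start" >&2
    fi
}
```

The point of doing it in the init script is that limits are inherited from the parent process, so the ulimit call has to run in the same shell that starts the daemon.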

-- 
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF


