[gridengine users] Parallel jobs failure after OS upgrade
jlb at salilab.org
Wed Apr 11 19:11:23 UTC 2012
On Wed, 11 Apr 2012 at 10:46am, orlando.richards at ed.ac.uk wrote:
> We ran into a problem with infiniband based MPI jobs caused by a change in
> the default max locked memory ulimit which init-spawned processes start with,
> between RHEL5 and RHEL6.
> If you run a job through the old and new environments which just does "ulimit
> -a", do you see a difference? Particularly - do you see a difference in the
> max locked memory (ulimit -l)?
I don't have any C5 hosts left in my real cluster, but I was able to
dredge up my old VirtualBox test cluster and run ulimit on both. The
differences are (sorry for any bad table formatting):
resource              CentOS-5 value   CentOS-6 value
pending signals       6143             30507
max locked memory     32               64
max user processes    6143             30507
So all the resource limits *increased* going from C5 to C6.
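
For anyone who wants to repeat the comparison, a job script along these lines prints only the three limits that differed. It is a sketch rather than the exact script used here: it assumes bash is the job shell (the -i, -l and -u options belong to bash's ulimit builtin), and report_limits is a made-up helper name.

```shell
#!/bin/bash
# Print the soft limits that differed between the CentOS 5 and CentOS 6 hosts.
# report_limits is a hypothetical helper name, not part of Grid Engine.
report_limits() {
    # bash normally sets HOSTNAME itself; fall back to a marker if it is unset.
    printf 'host: %s\n' "${HOSTNAME:-unknown}"
    printf 'pending signals (-i): %s\n' "$(ulimit -i)"
    printf 'max locked memory (-l, kbytes): %s\n' "$(ulimit -l)"
    printf 'max user processes (-u): %s\n' "$(ulimit -u)"
}

report_limits
```

Submitting this with qsub to a host on each OS and diffing the two output files shows the limits that jobs started by sge_execd actually receive, which may not match those of an interactive login shell.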
> Our fix for this was to put a "ulimit -l unlimited" in our sgeexecd init
> script, immediately before the sge_execd startup command. In our case,
> "unlimited" is the required value as per the QLogic infiniband setup process.
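
A minimal sketch of that kind of change, assuming a bash init script. The helper name start_with_memlock and the sge_execd path are placeholders for illustration, not lines taken from the real sgeexecd script. The point is only the ordering: the limit is set first, and the daemon (and every job it spawns) inherits it.

```shell
#!/bin/bash
# Hypothetical helper: set the locked-memory limit, then start a command.
# Child processes inherit resource limits, so ulimit must run before exec.
start_with_memlock() {
    local limit="$1"
    shift
    (
        # Without -S or -H, bash sets both the soft and the hard limit.
        ulimit -l "$limit" || exit 1
        exec "$@"
    )
}

# In an init script the call might look like this (placeholder path;
# raising the limit to "unlimited" generally needs root):
# start_with_memlock unlimited /path/to/sge_execd
```

Running the helper in a subshell keeps the changed limit from leaking into the rest of the init script.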
I tried this anyway and still saw the same failures. Thanks for having a
QB3 Shared Cluster Sysadmin