[gridengine users] Parallel jobs failure after OS upgrade

orlando.richards at ed.ac.uk
Wed Apr 11 09:46:08 UTC 2012


> I run a moderately sized cluster (~600 nodes with ~4000 cores, of wildly
> mixed vintages) on SGE 6.1u3 (which, yes, is rather ancient). Until
> recently, both the master and all the nodes were running CentOS 5 (5.7,
> to be precise). I upgraded the nodes to CentOS 6.2, but didn't touch the
> master. Our job load is mainly large numbers of single slot jobs, but we
> do have some users running parallel code.
>
> Since the upgrade, parallel jobs have been failing at a fairly high
> rate. Using Open MPI as the parallel library, the SGE error files of the
> jobs report varying numbers of this error:
>
> error: commlib error: can't connect to service (Connection timed out)
>
> Sometimes a job will report that error and seem to still run, and other
> times it won't report the error but will fail. Still, it seems like
> something new that shouldn't be happening. Also, AFAICT, there are no
> corresponding messages in $SGE_ROOT/spool/qmaster/messages.
>
> Does anyone have any ideas as to why I would be seeing this error (and 
> why it would be so much more frequent after the exec node OS upgrade)? 
> Any ideas on how to track it down? I'm admittedly at a bit of a loss
> here.
>

Hi Joshua,

We ran into a problem with InfiniBand-based MPI jobs, caused by a change
between RHEL5 and RHEL6 in the default max locked memory ulimit that
init-spawned processes start with.

If you run a job that just does "ulimit -a" through both the old and the
new environments, do you see a difference? In particular, do you see a
difference in the max locked memory (ulimit -l)?
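Something along these lines would do as the test job. This is only a rough sketch: the function name report_limits is made up for illustration, and the "#$" lines assume you submit with qsub and have bash available on the exec nodes.

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -j y

# report_limits is a hypothetical helper name, not part of SGE.
# It prints the resource limits this job inherited on the exec node.
report_limits() {
    echo "host: $(uname -n)"
    echo "max locked memory (ulimit -l): $(ulimit -l)"
    ulimit -a
}

report_limits
```

Submit it once to a node still on the old OS (if you have one left) and once to an upgraded node, then compare the two output files.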

Our fix for this was to put "ulimit -l unlimited" in our sgeexecd init
script, immediately before the sge_execd startup command, so that the
daemon and the jobs it starts inherit the raised limit. In our case,
"unlimited" is the value required by the QLogic InfiniBand setup process.
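As a rough sketch, the addition looks something like the following. The warning branch is my own embellishment rather than what is literally in our script, and the exact startup line differs between installs, so it is left as a comment here.

```shell
# Raise the locked-memory limit for this shell and everything it starts.
# The warning on failure is an illustrative addition, not part of the
# stock sgeexecd script.
if ! ulimit -l unlimited 2>/dev/null; then
    echo "warning: could not raise max locked memory (now: $(ulimit -l))" >&2
fi

# The script's existing sge_execd startup command follows here, unchanged.
```

The limit only takes effect for a freshly started sge_execd, so the daemon needs restarting on each node afterwards.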


--
Orlando


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



