[gridengine users] Parallel jobs failure after OS upgrade
Hung-Sheng Tsao (LaoTsao) Ph.D
laotsao at gmail.com
Wed Apr 4 01:36:31 UTC 2012
Is SELinux on or off?
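
In case it helps to check this across many nodes, here is a rough Python sketch that reports the SELinux mode by reading the kernel's "enforce" flag. The two paths are my assumption about where selinuxfs is mounted (/selinux on older releases such as CentOS 6, /sys/fs/selinux on newer ones), and treating "no readable flag file" as disabled is a simplification. The getenforce command remains the authoritative check.

```python
# Sketch only: candidate locations of the SELinux "enforce" flag.
# These mount points are assumptions; adjust them for your distribution.
ENFORCE_FILES = ("/sys/fs/selinux/enforce", "/selinux/enforce")


def selinux_mode(enforce_files=ENFORCE_FILES):
    """Return 'enforcing', 'permissive', or 'disabled'.

    'disabled' here just means no enforce file could be read, which is
    what one would expect when selinuxfs is not mounted.
    """
    for path in enforce_files:
        try:
            with open(path) as fh:
                flag = fh.read().strip()
        except OSError:
            # Missing or unreadable file: try the next candidate path.
            continue
        return "enforcing" if flag == "1" else "permissive"
    return "disabled"


if __name__ == "__main__":
    print(selinux_mode())
```

Running it on a few exec nodes that fail and a few that don't should show quickly whether the mode differs between them.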
On Apr 3, 2012, at 19:43, Rayson Ho <rayson at scalablelogic.com> wrote:
> Is it possible that some nodes have a firewall running while others don't?
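
One way to test the firewall theory is a plain TCP probe from one node to another. The sketch below is only an illustration: the host name and port are taken from the command line, and which ports matter on a given cluster is an assumption to verify locally (6444 and 6445 are the IANA-registered sge_qmaster and sge_execd ports, but a site may set others, and parallel job startup may use other ports as well).

```python
import socket
import sys


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False


if __name__ == "__main__":
    # Hypothetical usage: python probe.py <host> <port>
    host, port = sys.argv[1], int(sys.argv[2])
    status = "reachable" if can_connect(host, port) else "NOT reachable"
    print("%s:%d is %s" % (host, port, status))
```

A probe that times out from some nodes but succeeds from others would point toward packet filtering on those nodes rather than toward SGE itself.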
> On Tue, Apr 3, 2012 at 3:49 PM, Joshua Baker-LePain <jlb at salilab.org> wrote:
>> I run a moderately sized cluster (~600 nodes with ~4000 cores, of wildly
>> mixed vintages) on SGE 6.1u3 (which, yes, is rather ancient). Until
>> recently, both the master and all the nodes were running CentOS 5 (5.7, to
>> be precise). I upgraded the nodes to CentOS 6.2, but didn't touch the
>> master. Our job load is mainly large numbers of single slot jobs, but we do
>> have some users running parallel code.
>> Since the upgrade, parallel jobs have been failing at a fairly high rate.
>> Using Open MPI as the parallel library, the SGE error files of the jobs
>> report varying numbers of this error:
>> error: commlib error: can't connect to service (Connection timed out)
>> Sometimes a job will report that error and seem to still run, and other
>> times it won't report the error but will fail. Still, it seems like
>> something new that shouldn't be happening. Also, AFAICT, there are no
>> corresponding messages in $SGE_ROOT/spool/qmaster/messages.
>> Does anyone have any ideas as to why I would be seeing this error (and why
>> it would be so much more frequent after the exec node OS upgrade)? Any
>> ideas on how to track it down? I'm admittedly at a bit of a loss here.
>> Joshua Baker-LePain
>> QB3 Shared Cluster Sysadmin
>> users mailing list
>> users at gridengine.org