[gridengine users] Parallel jobs failure after OS upgrade

Reuti reuti at staff.uni-marburg.de
Tue Apr 3 22:12:26 UTC 2012

On 03.04.2012 at 23:11, Joshua Baker-LePain wrote:

> On Tue, 3 Apr 2012 at 10:19pm, Reuti wrote
>> On 03.04.2012 at 21:49, Joshua Baker-LePain wrote:
>>> error: commlib error: can't connect to service (Connection timed out)
>> ethtool shows the correct speed for the network interface?
> Yes indeed -- 1000Mb/s across the board.

I asked because there are some Realtek chips where you have to compile an r8168.ko on your own, as the default driver doesn't work at full speed.

>>> Sometimes a job will report that error and seem to still run, and other times it won't report the error but will fail.
>> The error from the job is different from a timeout - what in detail?
> These jobs are submitted with "-sync y".  For jobs that fail, qsub reports "Unable to run job $JOBID".  The SGE error logs of those jobs usually (but not always) contain commlib errors, but they always contain the following Open MPI errors:
> [opt53:20930] [[6569,0],114] routed:binomial: Connection to lifeline [[6569,0],0] lost

Are you running your jobs across more than one queue? IIRC there was a report on the Open MPI mailing list recently with similar output, where the hostfile contained more than one queue per machine.
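
To check for this from inside a job, something like the following could be used. It is only a minimal sketch: it assumes the usual $PE_HOSTFILE layout of one "host slots queue@host processor-range" entry per line, and the function names and the sample host/queue names are made up for illustration.

```python
import os
from collections import defaultdict


def queues_per_host(pe_hostfile_text):
    """Map each host to the set of cluster queues it is listed under."""
    queues = defaultdict(set)
    for line in pe_hostfile_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        host, _slots, queue_instance = fields[:3]
        # Queue instances look like "queue@host"; keep only the queue name.
        queues[host].add(queue_instance.split("@", 1)[0])
    return dict(queues)


def hosts_with_multiple_queues(pe_hostfile_text):
    """Return {host: [queues]} for hosts granted slots from more than one queue."""
    return {
        host: sorted(names)
        for host, names in queues_per_host(pe_hostfile_text).items()
        if len(names) > 1
    }


# Hypothetical sample in the assumed format, not taken from a real cluster.
SAMPLE = """\
node01 4 short.q@node01 UNDEFINED
node01 2 long.q@node01 UNDEFINED
node02 4 short.q@node02 UNDEFINED
"""

if __name__ == "__main__":
    # Inside a parallel job SGE sets PE_HOSTFILE; fall back to the sample otherwise.
    path = os.environ.get("PE_HOSTFILE")
    if path:
        with open(path) as fh:
            text = fh.read()
    else:
        text = SAMPLE
    print(hosts_with_multiple_queues(text))
```

If it prints an empty dict from the job script, every host got its slots from a single queue and this particular issue can be ruled out.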

> Looking at the qmaster and relevant execd messages, the jobs that fail are in fact killed b/c they hit their hard wallclock limits.  But they hit that limit without ever using *any* CPU time.  In other words they appear to hang on startup due to the errors, and then SGE kills them when they hit the runtime limit.  Jobs that succeed (same exact binaries and input parameters) complete well within the runtime limit.
>> Do you still use the mpiexec the application was compiled with, or start an old binary with a new mpiexec?
> Everything (MPI and the application) is freshly compiled.

So we have two issues: for SGE the communication problem is between a slave and the master machine, but for your job it's between the slaves, right?

-- Reuti
