[gridengine users] Parallel jobs failure after OS upgrade
jlb at salilab.org
Tue Apr 3 23:10:12 UTC 2012
On Wed, 4 Apr 2012 at 12:30am, Reuti wrote
> On 04.04.2012 at 00:19, Joshua Baker-LePain wrote:
>> On Wed, 4 Apr 2012 at 12:12am, Reuti wrote
>>> Are you running your jobs across more than one queue? There was an
>>> issue reported recently on the Open MPI mailing list, with similar
>>> output IIRC, when the hostfile contains more than one queue per machine.
>> Heh. That was me, and I'm running version 1.5.5 of Open MPI, which
>> includes the fix for the multiple queue issue. And this issue is
>> completely separate from that one anyway -- that issue caused the MPI
>> spawned processes to segfault, which isn't happening here.
> Not for my tests regarding this issue. The jobs ran, but only a
> part of the granted slots were used; and at the end I got this message
> "Connection to lifeline...".
>>> So we have two issues: for SGE it's between a slave and the master
>>> machines. But for your job it's between the slaves - right?
>> Yes. We have the SGE commlib errors, and the Open MPI
>> "routed:binomial" errors. I'm mainly focusing on the SGE problem right
>> now, as I think (hope) that fixing that will also fix the MPI issue.
> Does it also happen with an mpihello job?
Actually, yes. I see commlib errors in jobs which successfully complete,
and in those I do *not* see "Connection to lifeline" errors. Those latter
errors pop up when a hung job hits h_rt and gets killed by SGE. So I
think those are more a symptom than a cause.
So the main questions remain a) why am I seeing these commlib errors and
b) why do some jobs run anyway while others fail? I'm assuming that the
latter is due to SGE retrying the qrsh call a limited number of times.
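To make the guess in (b) concrete, here is a minimal sketch of how a bounded retry would produce this mixed outcome. It is only an illustration of the hypothesis, not SGE's actual logic: the names run_with_retries and simulate_jobs, the failure probability, and the retry limit are all made up for the example.

```python
import random


def run_with_retries(attempt, max_attempts):
    """Call attempt() up to max_attempts times; return True on the first success."""
    for _ in range(max_attempts):
        if attempt():
            return True
    return False


def simulate_jobs(n_jobs, fail_prob, max_attempts, seed=0):
    """Simulate n_jobs, each needing one flaky connection to succeed.

    fail_prob and max_attempts are hypothetical parameters, not SGE settings.
    Returns a (succeeded, failed) tuple of job counts.
    """
    rng = random.Random(seed)

    def flaky_connect():
        # Stand-in for a connection attempt that sometimes hits an error.
        return rng.random() >= fail_prob

    results = [run_with_retries(flaky_connect, max_attempts) for _ in range(n_jobs)]
    succeeded = sum(results)
    return succeeded, n_jobs - succeeded


if __name__ == "__main__":
    ok, failed = simulate_jobs(n_jobs=100, fail_prob=0.5, max_attempts=3)
    print(f"{ok} jobs ran, {failed} jobs failed")
```

If each attempt fails independently with probability p and only k attempts are made, a job fails outright with probability p**k. That would leave most jobs completing with errors in their output and a minority failing, which is roughly the pattern I'm seeing.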
QB3 Shared Cluster Sysadmin