[gridengine users] Parallel jobs failure after OS upgrade

Joshua Baker-LePain jlb at salilab.org
Tue Apr 3 23:10:12 UTC 2012


On Wed, 4 Apr 2012 at 12:30am, Reuti wrote

> Am 04.04.2012 um 00:19 schrieb Joshua Baker-LePain:
>
>> On Wed, 4 Apr 2012 at 12:12am, Reuti wrote
>>
>>> Are you running your jobs across more than one queue? There was an 
>>> issue reported recently on the Open MPI mailing list, with similar 
>>> output IIRC, when the hostfile contains more than one queue per machine.
>>
>> Heh.  That was me, and I'm running version 1.5.5 of Open MPI, which 
>> includes the fix for the multiple queue issue.  And this issue is 
>> completely separate from that one anyway -- that issue caused the MPI 
>> spawned processes to segfault, which isn't happening here.
>
> Not for my tests regarding this issue. The jobs ran, but only a part 
> of the granted slots were used; and at the end I got this message 
> "Connection to lifeline...".
>
>
>>> So we have two issues: for SGE it's between a slave and the master 
>>> machines. But for your job it's between the slaves - right?
>>
>> Yes.  We have the SGE commlib errors, and the Open MPI 
>> "routed:binomial" errors.  I'm mainly focusing on the SGE problem right 
>> now, as I think (hope) that fixing that will also fix the MPI issue.
>
> Does it also happen with an mpihello job?

Actually, yes.  I see commlib errors in jobs which successfully complete, 
and in those I do *not* see "Connection to lifeline" errors.  Those latter 
errors pop up when a hung job hits h_rt and gets killed by SGE.  So I 
think those are more a symptom than a cause.
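
To check that pattern across more jobs, something like the rough sketch below could tally which job outputs contain which message. The two search patterns are assumptions taken from the wording in this thread, not the exact strings SGE or Open MPI print, and classify() and tally() are just illustrative names.

```python
import re
from collections import Counter

# Assumed patterns, taken from the wording used in this thread; the exact
# text in real job output may differ and would need adjusting.
COMMLIB = re.compile(r"commlib", re.IGNORECASE)
LIFELINE = re.compile(r"connection to lifeline", re.IGNORECASE)


def classify(log_text):
    """Report which of the two messages appear in one job's output."""
    return {
        "commlib": bool(COMMLIB.search(log_text)),
        "lifeline": bool(LIFELINE.search(log_text)),
    }


def tally(logs):
    """Count jobs per (commlib seen, lifeline seen) combination.

    `logs` maps a job id to the text of that job's output.
    """
    counts = Counter()
    for text in logs.values():
        found = classify(text)
        counts[(found["commlib"], found["lifeline"])] += 1
    return counts
```

If the lifeline message really is only a symptom of the job being killed, the (False, True) bucket should stay empty.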

So the main questions remain a) why am I seeing these commlib errors and 
b) why do some jobs run anyway while others fail?  I'm assuming that the 
latter is due to SGE retrying the qrsh call a limited number of times.
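
As a back-of-the-envelope check on that assumption, here is a toy model, not anything taken from SGE itself. It assumes each qrsh attempt fails independently with probability p_fail, that each call is tried at most max_attempts times, and that a job needs n_calls such calls to all succeed. All three parameter names are hypothetical.

```python
def job_success_probability(p_fail, max_attempts, n_calls):
    """Probability that every qrsh call of a job eventually succeeds.

    Toy model: attempts fail independently with probability p_fail, and a
    call only fails for good if all max_attempts tries fail.
    """
    per_call_ok = 1.0 - p_fail ** max_attempts
    return per_call_ok ** n_calls


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from the cluster.
    for n_calls in (2, 8, 32):
        print(n_calls, job_success_probability(0.3, 3, n_calls))
```

Under this model the chance of a clean start drops as the number of slave hosts grows, which would at least be consistent with some jobs running while others hang.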

-- 
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF


