[gridengine users] Lost qrsh jobs

Reuti reuti at staff.uni-marburg.de
Thu Nov 22 15:22:10 UTC 2012


On 21.11.2012 at 19:44, François-Michel L'Heureux wrote:

> On Wed, Nov 21, 2012 at 12:14 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
> On 21.11.2012 at 18:06, François-Michel L'Heureux wrote:
> 
> > On Wed, Nov 21, 2012 at 11:59 AM, Reuti <reuti at staff.uni-marburg.de> wrote:
> > On 21.11.2012 at 17:28, François-Michel L'Heureux wrote:
> >
> > > Hi!
> > >
> > > Thanks for the reply.
> > >
> > > No, the job did not run. My launch command sets the verbose flag and -now no. The first thing I get is
> > > waiting for interactive job to be scheduled ...
> >
> > Yep, with -now n it will wait until resources are available. The default behavior would be to fail more or less instantly.
> >
> >
> > > Which is good. Then nothing happens. Later, when I kill the jobs, I see a mix of the following messages popping up in my logs:
> > > Your "qrsh" request could not be scheduled, try again later.
> > > and
> > > error: commlib error: got select error (No route to host)
> > > and
> >
> > Is there a route to the host?
> > Yes
> >
> >
> > > error: commlib error: got select error (Connection timed out)
> > >
> > > It's strange that this is only received after the kill.
> > >
> > > From my terminal experience, qrsh can behave in a weird manner. When I get an error message, the qrsh job is queued (and shown in qstat), but I lose my handle on it.
> > >
> > > Regarding the dynamic cluster, my IPs are static for the duration of a node's life. Nodes can be added and removed. Their IPs won't change in the middle of a run. But say node3 is added with an IP, then removed, then added back: the IP will not be the same. Might that be the cause?
> >
> > For SGE it would then be a different node with a different name. What's the reason for adding and removing nodes?
> > We are working over Amazon with spot instances. We add/remove nodes based on the queue size and other factors.
> >
> > -- Reuti
> >
> > I'm onto something. When a job fails and its status is set to "Eqw", does it stay in the qstat output forever, or does it get removed at some point? If such jobs go away, that would explain the issue.
> 
> It will stay in Eqw until you either delete the job or clear the flag with `qmod -cj <jobid>`. 
> 
> >
> > Also, in case it gives you any hint, when I run
> > qacct -j | grep failed
> >
> > I can see the following failures
> > 100 : assumedly after job
> 
> This means the job was put into the error state.

If the job script exits with exactly 100, the job is put into the error state; with 99, the job is rescheduled.
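As a sketch of what this looks like from the job script's side (the payload command is a placeholder, not something from this thread):

```shell
#!/bin/sh
# Hypothetical SGE job script. SGE inspects the script's exit status:
#   exit 100 -> job is put into the error state ("100 : assumedly after job")
#   exit 99  -> job is rescheduled and run again (hence extra qacct entries)
#   other    -> recorded as the plain exit_status in the accounting file

payload=/bin/true   # stand-in for the real work

if ! $payload; then
    # signal SGE to keep the job visible in the error state
    exit 100
fi

exit 0
```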


> Is it intended that the job script exits with this error code? You will get more than one entry in the accounting file when the job is rerun.
> I don't understand what you mean there. Do I have control over this? My tests show that if I call "kill -9" on the process, that's what happens, but in qacct -j it appears more often than the number of jobs I killed. What else can cause it?

Any epilog exiting with 100?
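For illustration, a minimal epilog sketch (the scratch path and cleanup step are purely hypothetical) that would produce this failure code for every job it follows whenever its cleanup fails:

```shell
#!/bin/sh
# Hypothetical queue epilog: SGE treats its exit status like the job's,
# so an epilog exiting 100 marks the job as failed with
# "100 : assumedly after job" even if the job itself succeeded.
# $JOB_ID is set by SGE in the epilog environment.

SCRATCH="/tmp/scratch.$JOB_ID"   # assumed per-job scratch directory

rm -rf "$SCRATCH" || exit 100    # cleanup failure -> error state

exit 0
```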

-- Reuti


> > 37  : qmaster enforced h_rt limit
> 
> Well, if h_rt is exceeded it's no wonder that the job is killed; as a result qrsh lost contact, since it's the process on the node that is killed, not the `qrsh` on the login machine.
> Ok, this one comes from when an execution node goes away: the job is deleted with qdel and this becomes the failure code.
> 
> -- Reuti
> 
> I'm trying to reproduce the issue any way I can think of. My best lead was that Eqw jobs might disappear after a while, but if they don't, I have to look somewhere else.
> 
> 
> 
> > > Thanks
> > > Mich
> > >
> > >
> > > On Wed, Nov 21, 2012 at 10:55 AM, Reuti <reuti at staff.uni-marburg.de> wrote:
> > > Hi,
> > >
> > > On 21.11.2012 at 16:10, François-Michel L'Heureux wrote:
> > >
> > > > I have an issue where some jobs I call with the qrsh command never appear in the queue. If I run the command "ps -ef | grep qrsh" I can see them. My setup
> > >
> > > Ok, but did it ever start on any node?
> > >
> > >
> > > > is as follows:
> > > >
> > > >       • I just have one process calling the grid engine via qrsh. This process resides on the master node.
> > > >       • I don't use nfs, I use sshfs instead.
> > > >       • I run over a dynamic cluster, which means that at any time nodes can be added or removed.
> > > > Does anyone have an idea of what can cause the issue? I can counter it by looking at the process list when the queue is empty and killing/rescheduling those running a qrsh command, but I would rather prevent it.
> > >
> > > What do you mean by "dynamic cluster"? SGE needs fixed addresses per node.
> > >
> > > -- Reuti
> > >
> > >
> > > > Thanks
> > > > Mich
> > > > _______________________________________________
> > > > users mailing list
> > > > users at gridengine.org
> > > > https://gridengine.org/mailman/listinfo/users
> > >
> > >
> >
> >
> 
> 



