[gridengine users] Lost qrsh jobs

François-Michel L'Heureux fmlheureux at datacratic.com
Thu Nov 22 18:04:25 UTC 2012


On Thu, Nov 22, 2012 at 12:57 PM, Reuti <reuti at staff.uni-marburg.de> wrote:

> Am 22.11.2012 um 16:31 schrieb François-Michel L'Heureux:
>
> > On Thu, Nov 22, 2012 at 10:22 AM, Reuti <reuti at staff.uni-marburg.de>
> wrote:
> > Am 21.11.2012 um 19:44 schrieb François-Michel L'Heureux:
> >
> > > On Wed, Nov 21, 2012 at 12:14 PM, Reuti <reuti at staff.uni-marburg.de>
> wrote:
> > > Am 21.11.2012 um 18:06 schrieb François-Michel L'Heureux:
> > >
> > > > On Wed, Nov 21, 2012 at 11:59 AM, Reuti <reuti at staff.uni-marburg.de>
> wrote:
> > > > Am 21.11.2012 um 17:28 schrieb François-Michel L'Heureux:
> > > >
> > > > > Hi!
> > > > >
> > > > > Thanks for the reply.
> > > > >
> > > > > No, the job did not run. My launch command sets the verbose flag
>  and -now no. The first thing I get is
> > > > > waiting for interactive job to be scheduled ...
> > > >
> > > > Yep, with -now n it will wait until resources are available. The
> default behavior would be to fail more or less instantly.
> > > >
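> > > > For reference, a minimal sketch of such an invocation (the actual
> > > > command to run is just a placeholder):
> > > >
> > > >     qrsh -verbose -now no ./worker.sh
> > > >
> > > > With -now no it blocks and prints "waiting for interactive job to be
> > > > scheduled ..." until a slot becomes free.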
> > > >
> > > > > Which is good. Then nothing happens. Later, when I kill the jobs,
> I see a mix of some
> > > > > Your "qrsh" request could not be scheduled, try again later.
> popping up in my logs.
> > > > > and
> > > > > error: commlib error: got select error (No route to host)
> > > > > and
> > > >
> > > > Is there a route to the host?
> > > > Yes
> > > >
> > > >
> > > > > error: commlib error: got select error (Connection timed out)
> > > > >
> > > > > It's strange that this is only received after the kill.
> > > > >
> > > > > From my terminal experience, qrsh can behave in a weird manner.
> When I get an error message, the qrsh job is queued (and shown in qstat),
> but I lose my handle on it.
> > > > >
> > > > > Regarding the dynamic cluster, my IPs are static for the duration
> of a node's life. Nodes can be added and removed. Their IPs won't change in
> the middle of a run. But say node3 is added with an IP, then removed,
> then added back: the IP will not be the same. Might that be the cause?
> > > >
> > > > For SGE it would then be a different node with a different name.
> What's the reason for adding and removing nodes?
> > > > We are working on Amazon with spot instances. We add/remove nodes
> based on the queue size and other factors.
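> > > > For what it's worth, the add/remove itself is roughly the usual qconf
> > > > dance (host and file names here are just placeholders):
> > > >
> > > >     qconf -Ae node3.conf                              # register node3 as an execution host
> > > >     qconf -aattr hostgroup hostlist node3 @allhosts   # attach it to the host group the queue uses
> > > >     qconf -de node3                                   # drop it again when the spot instance disappears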
> > > >
> > > > -- Reuti
> > > >
> > > > I'm onto something. When a job fails and the status is set to "Eqw",
> does it stay forever in the qstat output or does it get removed at some
> point? If such jobs go away, that would explain the issue.
> > >
> > > It will stay in Eqw until you either delete the job or clear the flag
> with `qmod -cj <jobid>`.
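> > > For example (the job id is just a placeholder):
> > >
> > >     qmod -cj 4711    # clear the error state; the job goes back to qw and gets scheduled again
> > >     qdel 4711        # or remove it entirely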
> > >
> > > >
> > > > Also, in case it gives you any hint, when I run
> > > > qacct -j | grep failed
> > > >
> > > > I can see the following failures
> > > > 100 : assumedly after job
> > >
> > > This means the job is put into the error state.
> >
> > If it exits exactly with 100 => error state, and 99 => reschedule the
> job.
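> > As a sketch, a job script can use this deliberately (the application call
> > is a placeholder):
> >
> >     #!/bin/sh
> >     ./my_app
> >     if [ $? -ne 0 ]; then
> >         exit 100    # 100 puts the job into the error state; 99 would request a reschedule instead
> >     fi
> >     exit 0          # anything else is treated as a normal job exit status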
> >
> >
> > > Is it intended that the job script exits with this error code? You will
> get more than one entry in the accounting file when the job is rerun.
> > > I don't understand what you mean there. Do I have control over this? My
> tests show that if I call "kill -9" on the process, that's what happens,
> but in qacct -j it appears more often than the number of jobs I killed.
> What else can cause it?
> >
> > Any epilog exiting with 100?
> >
> > I'm not sure I get the concept of an epilog. Is it a job that runs at the end?
>
> No, it's defined at the queue or global level and runs after the
> job (script) has finished (`man queue_conf` resp. `man sge_conf`). Certain
> clean-up procedures or the like can be defined therein.
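> You can quickly check whether anything is configured there, e.g. (the queue
> name is just an example):
>
>     qconf -sq all.q | egrep 'prolog|epilog'    # per-queue setting
>     qconf -sconf    | egrep 'prolog|epilog'    # global setting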
>

Ah, OK, got it. No, there is neither an epilog nor a prolog script.


>
> > Then no. After killing the stuck jobs, the app automatically reschedules
> them and all went well from there.
>
> Are they getting a new job number?
>

Yes. The job is killed and my app makes a new qrsh call.


> -- Reuti
>
>
> > I made an alteration to my sshfs mount. I read about sync issues,
> so I added the sync_write flag. Last night the error didn't occur. It
> doesn't always occur, so it's still too early to say I fixed it.
> >
> > Thanks for the follow up!
> >
> >
> > -- Reuti
> >
> >
> > > > 37  : qmaster enforced h_rt limit
> > >
> > > Well, if h_rt is exceeded it's no wonder that the job is killed; as a
> result qrsh lost contact, since the process on the node is killed, not the
> `qrsh` on the login machine.
> > > OK, this one comes from when an execution node goes away: the job is
> deleted with qdel and this becomes the failed code.
> > >
> > > -- Reuti
> > >
> > > I'm trying to reproduce the issue any way I can think of. My best lead
> was that Eqw jobs disappear after a while, but if they don't, I have to look
> somewhere else.
> > >
> > >
> > >
> > > > > Thanks
> > > > > Mich
> > > > >
> > > > >
> > > > > On Wed, Nov 21, 2012 at 10:55 AM, Reuti <
> reuti at staff.uni-marburg.de> wrote:
> > > > > Hi,
> > > > >
> > > > > Am 21.11.2012 um 16:10 schrieb François-Michel L'Heureux:
> > > > >
> > > > > > I have an issue where some jobs I call with the qrsh command
> never appear in the queue. If I run the command "ps -ef | grep qrsh" I
> can see them. My setup
> > > > >
> > > > > Ok, but did it ever start on any node?
> > > > >
> > > > >
> > > > > > is as follows:
> > > > > >
> > > > > >       • I just have one process calling the grid engine via
> qrsh. This process resides on the master node.
> > > > > >       • I don't use NFS; I use sshfs instead.
> > > > > >       • I run over a dynamic cluster, which means that at any time
> nodes can be added or removed.
> > > > > > Does anyone have an idea what could cause the issue? I can
> work around it by looking at the process list when the queue is empty and
> killing/rescheduling the processes running a qrsh command, but I would rather
> prevent it.
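> > > > > > Roughly, the workaround looks like this (the pid is a placeholder):
> > > > > >
> > > > > >     ps -ef | grep '[q]rsh'    # list the stray qrsh client processes once qstat is empty
> > > > > >     kill <pid>                # kill them; my app then resubmits the job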
> > > > >
> > > > > What do you mean by "dynamic cluster"? SGE needs fixed addresses
> per node.
> > > > >
> > > > > -- Reuti
> > > > >
> > > > >
> > > > > > Thanks
> > > > > > Mich
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> >
> >
>
>