[gridengine users] Lost qrsh jobs

Dave Love d.love at liverpool.ac.uk
Wed Nov 21 17:34:34 UTC 2012


François-Michel L'Heureux <fmlheureux at datacratic.com> writes:

> Hi!
>
> Thanks for the reply.
>
> No, the job did not run. My launch command sets the verbose flag and -now
> no. The first thing I get is
> waiting for interactive job to be scheduled ...
>
> Which is good. Then nothing happens. Later, when I kill the jobs, I see a
> mix of messages in my logs, such as
> Your "qrsh" request could not be scheduled, try again later.

If the qrsh startup fails, it's likely to put the queue into an error
state.  That sounds at least partly consistent with the startup dying on
the execution host.  Which OS, GE version, and remote startup method are
you using?  There appear to be system-dependent issues with
multi-threading in qrsh startup, though I'm not sure whether it's
getting that far in this case.
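
One way to check whether a queue instance has gone into the error state is
to look for "E" in the states column of `qstat -f`.  The following is only a
rough sketch: the function names are made up for illustration, and it assumes
the usual `qstat -f` layout in which a queue instance line starts with
`queue@host` and carries the states as its sixth column, which may differ
between GE versions.

```python
import subprocess


def queues_in_error(qstat_f_output):
    """Return the queue instances whose states column contains 'E'.

    Assumes each queue instance line looks like
    'queue@host qtype resv/used/tot. load_avg arch states',
    with the states column missing when the instance has no flags set.
    """
    errored = []
    for line in qstat_f_output.splitlines():
        fields = line.split()
        # Header, separator, and job lines do not start with "queue@host".
        if len(fields) >= 6 and "@" in fields[0] and "E" in fields[5]:
            errored.append(fields[0])
    return errored


def run_qstat_f():
    """Run 'qstat -f' and return its standard output (needs a GE client)."""
    result = subprocess.run(
        ["qstat", "-f"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

On a submit host, `queues_in_error(run_qstat_f())` would list the affected
instances.  `qstat -f -explain E` should then show the reason recorded for
each one.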

> and
> error: commlib error: got select error (No route to host)
> and
> error: commlib error: got select error (Connection timed out)

Those are potentially different things.

> It's strange that this is only received after the kill.
>
> From my terminal experience, qrsh can behave in a weird manner. When I get
> an error message, the qrsh job is queued (and shown in qstat), but I lose
> my handle over it.

I'm confused by the symptoms.  What does losing the handle over it mean
exactly, if it's queued?

-- 
Community Grid Engine:  http://arc.liv.ac.uk/SGE/



