[gridengine users] Debugging a commlib error following reboot of exec host
mun.johl at kazan-networks.com
Tue Jul 3 20:24:31 UTC 2018
Thank you for your reply.
Please see my comments below.
On Tue, Jul 03, 2018 at 11:30 AM PDT, Joshua Baker-LePain wrote:
> On Tue, 26 Jun 2018 at 9:12am, Mun Johl wrote
> > We're using SGE 8.1.9 on CentOS 6.9
> > "All of the sudden" we've noticed that when we reboot an execution host,
> > any jobs sent to it within the first 10-15 min following boot-up will
> > get stuck in the 't' state until deleted (sometimes that has to be done
> > forcibly). However, after 10-ish minutes, the execution host will start
> > accepting jobs.
> > In the qmaster's messages file, I see the following entries:
> > 06/25/2018 10:28:15|listen|sim1|E|commlib error: endpoint is not unique error (endpoint "sim4.work.com/execd/1" is already connected)
> > 06/25/2018 10:38:36| timer|sim1|W|failed to deliver job 54312.1 to queue "short.q@sim4.work.com"
> > 06/25/2018 10:38:36| timer|sim1|E|got max. unheard timeout for target "execd" on host "sim4.work.com", can't deliver job "54312"
> One possibility occurs to me. SoGE 8.1.9 has a bug where "qconf -s"
> commands fail on non-admin hosts (see
> <https://arc.liv.ac.uk/trac/SGE/ticket/1576>). One side-effect of this is
> that the init script fails to properly shut down the execd. I'm wondering
> if that's leading to your problem. I don't see this, but I'm running on
> CentOS-7, which may lead to some different behavior.
Thanks for the suggestion, but I don't believe that issue is the root
cause of my problem. I don't see the same error, and the host that
experienced the errors I posted is also an administrative host.
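
For anyone who wants to line up the timing of these entries, here is a
minimal Python sketch that parses qmaster messages lines and reports the
interval between the first and last entry mentioning a given host. It
assumes the pipe-separated layout seen in the entries quoted above
(timestamp|thread|host|level|message). The names Entry, parse_line and
entries_mentioning are made up for illustration and are not part of SGE.

```python
from datetime import datetime
from typing import Iterable, Iterator, NamedTuple, Optional


class Entry(NamedTuple):
    when: datetime
    thread: str
    host: str
    level: str
    message: str


def parse_line(line: str) -> Optional[Entry]:
    """Parse one messages line; return None if it does not match the
    assumed 'timestamp|thread|host|level|message' layout."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    stamp, thread, host, level, message = parts
    try:
        when = datetime.strptime(stamp.strip(), "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return Entry(when, thread.strip(), host.strip(), level.strip(), message)


def entries_mentioning(lines: Iterable[str], needle: str,
                       levels=("E", "W")) -> Iterator[Entry]:
    """Yield error/warning entries whose message contains `needle`."""
    for line in lines:
        entry = parse_line(line)
        if entry and entry.level in levels and needle in entry.message:
            yield entry


# The three entries quoted in this thread (queue instance written as queue@host).
SAMPLE = [
    '06/25/2018 10:28:15|listen|sim1|E|commlib error: endpoint is not unique '
    'error (endpoint "sim4.work.com/execd/1" is already connected)',
    '06/25/2018 10:38:36| timer|sim1|W|failed to deliver job 54312.1 to queue '
    '"short.q@sim4.work.com"',
    '06/25/2018 10:38:36| timer|sim1|E|got max. unheard timeout for target '
    '"execd" on host "sim4.work.com", can\'t deliver job "54312"',
]

if __name__ == "__main__":
    hits = list(entries_mentioning(SAMPLE, "sim4.work.com"))
    for e in hits:
        print(e.when, e.level, e.thread, e.message[:60])
    # Interval between the first and last matching entry.
    print("span:", hits[-1].when - hits[0].when)
```

To run it against a live system, replace SAMPLE with the open messages
file, for example entries_mentioning(open(path), "sim4.work.com"), where
path is wherever your qmaster spool keeps its messages file.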