[gridengine users] Debugging a commlib error following reboot of exec host
jlb at salilab.org
Tue Jul 3 18:30:20 UTC 2018
On Tue, 26 Jun 2018 at 9:12am, Mun Johl wrote:
> We're using SGE 8.1.9 on CentOS 6.9
> "All of the sudden" we've noticed that when we reboot an execution host,
> any jobs sent to it within the first 10-15 min following boot-up will
> get stuck in the 't' state until deleted (sometimes that has to be done
> forcibly). However, after 10-ish minutes, the execution host will start
> accepting jobs.
> In the qmaster's messages file, I see the following entries:
> 06/25/2018 10:28:15|listen|sim1|E|commlib error: endpoint is not unique error (endpoint "sim4.work.com/execd/1" is already connected)
> 06/25/2018 10:38:36| timer|sim1|W|failed to deliver job 54312.1 to queue "short.q@sim4.work.com"
> 06/25/2018 10:38:36| timer|sim1|E|got max. unheard timeout for target "execd" on host "sim4.work.com", can't deliver job "54312"
One possibility occurs to me. SoGE 8.1.9 has a bug where "qconf -s"
commands fail on non-admin hosts (see
<https://arc.liv.ac.uk/trac/SGE/ticket/1576>). One side effect of this is
that the init script fails to shut down the execd properly. If the execd
never gets a clean shutdown, the qmaster may still consider the old
connection live, which would fit the "already connected" error in your
log. I'm wondering if that's leading to your problem. I don't see this
myself, but I'm running on CentOS 7, which may behave differently.
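
If it helps to confirm the pattern across several reboots, here is a rough sketch (not part of SGE) that scans qmaster messages lines and reports how many minutes pass between an "endpoint is not unique" error and the next "max. unheard timeout" error. It assumes the pipe-delimited layout seen in the excerpt quoted above (timestamp|thread|host|level|message). The function names and the SAMPLE string are made up for illustration.

```python
from datetime import datetime

# Two lines in the format of the log excerpt quoted above.
SAMPLE = """\
06/25/2018 10:28:15|listen|sim1|E|commlib error: endpoint is not unique error (endpoint "sim4.work.com/execd/1" is already connected)
06/25/2018 10:38:36| timer|sim1|E|got max. unheard timeout for target "execd" on host "sim4.work.com", can't deliver job "54312"
"""


def parse_line(line):
    """Split one messages line into (time, thread, host, level, message).

    Returns None for lines that do not match the assumed layout.
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    stamp, thread, host, level, message = parts
    try:
        when = datetime.strptime(stamp.strip(), "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return when, thread.strip(), host.strip(), level.strip(), message


def minutes_between_errors(lines):
    """Return the gap in minutes from each duplicate-endpoint error
    to the next unheard-timeout error that follows it."""
    first_dup = None
    gaps = []
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue
        when, _thread, _host, _level, message = record
        if "endpoint is not unique" in message:
            if first_dup is None:
                first_dup = when
        elif "max. unheard timeout" in message and first_dup is not None:
            gaps.append((when - first_dup).total_seconds() / 60)
            first_dup = None
    return gaps


if __name__ == "__main__":
    print(minutes_between_errors(SAMPLE.splitlines()))
```

For the two sample lines the gap is a little over ten minutes, which lines up with the 10-15 minute window you describe. If the same gap shows up after every reboot, that would point at a fixed timeout on the qmaster side and not at anything load-related.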
QB3 Shared Cluster Sysadmin