[gridengine users] Debugging a commlib error following reboot of exec host

Joshua Baker-LePain jlb at salilab.org
Tue Jul 3 18:30:20 UTC 2018


On Tue, 26 Jun 2018 at 9:12am, Mun Johl wrote

> We're using SGE 8.1.9 on CentOS 6.9
>
> "All of a sudden" we've noticed that when we reboot an execution host,
> any jobs sent to it within the first 10-15 min following boot-up will
> get stuck in the 't' state until deleted (sometimes that has to be done
> forcibly).  However, after 10-ish minutes, the execution host will start
> accepting jobs.
>
> In the qmaster's messages file, I see the following entries:
>
> 06/25/2018 10:28:15|listen|sim1|E|commlib error: endpoint is not unique error (endpoint "sim4.work.com/execd/1" is already connected)
> 06/25/2018 10:38:36| timer|sim1|W|failed to deliver job 54312.1 to queue "short.q@sim4.work.com"
> 06/25/2018 10:38:36| timer|sim1|E|got max. unheard timeout for target "execd" on host "sim4.work.com", can't deliver job "54312"

One possibility occurs to me.  SoGE 8.1.9 has a bug where "qconf -s"
commands fail on non-admin hosts (see
<https://arc.liv.ac.uk/trac/SGE/ticket/1576>).  One side effect of this is
that the init script fails to shut down the execd properly.  I'm wondering
if that's leading to your problem: if the old execd never disconnects
cleanly, the qmaster may still consider it connected when the rebooted
host's execd tries to register, which would fit the "endpoint is not
unique" error.  I don't see this myself, but I'm running on CentOS 7,
which may behave differently.
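
To check whether the stuck jobs line up with the commlib errors for a
given host, something like the following could pull the relevant entries
out of the qmaster messages file.  This is only a minimal sketch: it
assumes the pipe-delimited layout seen in the quoted log lines
(timestamp|thread|host|level|message), and the function names
(parse_line, execd_errors_by_host) are hypothetical helpers of mine, not
part of SGE.

```python
import re
from collections import defaultdict

# Patterns taken from the two error messages quoted above.
ENDPOINT_RE = re.compile(r'endpoint "([^"/]+)/execd/\d+" is already connected')
TIMEOUT_RE = re.compile(
    r'max\. unheard timeout for target "execd" on host "([^"]+)"'
)


def parse_line(line):
    """Split one messages line into its five fields, or return None.

    Assumed layout: timestamp|thread|host|level|message
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return tuple(p.strip() for p in parts)


def execd_errors_by_host(lines):
    """Group "not unique" and "unheard timeout" entries by exec host."""
    events = defaultdict(list)
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip anything that does not match the assumed layout
        timestamp, _thread, _host, _level, message = parsed
        for kind, pattern in (
            ("not_unique", ENDPOINT_RE),
            ("unheard_timeout", TIMEOUT_RE),
        ):
            match = pattern.search(message)
            if match:
                events[match.group(1)].append((timestamp, kind))
    return dict(events)
```

Calling execd_errors_by_host() on the open messages file returns a dict
keyed by exec host name.  If every "unheard_timeout" entry for a host is
preceded by a "not_unique" entry shortly after that host's reboot, that
would be consistent with a stale execd connection, though it would not
prove it.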

-- 
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF


