[gridengine users] Debugging a commlib error following reboot of exec host

Mun Johl mun.johl at kazan-networks.com
Mon Jun 25 18:27:24 UTC 2018


We're using SGE 8.1.9 on CentOS 6.9.

"All of a sudden" we've noticed that when we reboot an execution host,
any jobs sent to it within the first 10-15 minutes after boot-up get
stuck in the 't' state until they are deleted (sometimes that has to be
done forcibly).  After roughly 10 minutes, however, the execution host
starts accepting jobs.

In the qmaster's messages file, I see the following entries:

06/25/2018 10:28:15|listen|sim1|E|commlib error: endpoint is not unique error (endpoint "sim4.work.com/execd/1" is already connected)
06/25/2018 10:38:36| timer|sim1|W|failed to deliver job 54312.1 to queue "short.q@sim4.work.com"
06/25/2018 10:38:36| timer|sim1|E|got max. unheard timeout for target "execd" on host "sim4.work.com", can't deliver job "54312"
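
For anyone wanting to measure the gap between these entries, here is a
minimal Python sketch.  It is not part of SGE; it only assumes the
pipe-delimited layout visible above.  The field names (thread, host,
level) are illustrative labels for the columns, not official SGE terms,
and the sample holds just the two error-level lines copied from the log.

```python
from datetime import datetime

# The two error-level entries copied from the qmaster messages file above.
SAMPLE = """\
06/25/2018 10:28:15|listen|sim1|E|commlib error: endpoint is not unique error (endpoint "sim4.work.com/execd/1" is already connected)
06/25/2018 10:38:36| timer|sim1|E|got max. unheard timeout for target "execd" on host "sim4.work.com", can't deliver job "54312"
"""


def parse_line(line):
    """Split one log line into labeled fields (labels are assumptions)."""
    # Split into at most 5 fields so any '|' inside the message is preserved.
    stamp, thread, host, level, message = line.split("|", 4)
    return {
        "time": datetime.strptime(stamp.strip(), "%m/%d/%Y %H:%M:%S"),
        "thread": thread.strip(),
        "host": host.strip(),
        "level": level.strip(),
        "message": message.strip(),
    }


entries = [parse_line(line) for line in SAMPLE.splitlines() if line.strip()]

# Locate the commlib error and the later delivery timeout by message text.
first_error = next(e for e in entries if "endpoint is not unique" in e["message"])
timeout = next(e for e in entries if "max. unheard timeout" in e["message"])

gap_seconds = (timeout["time"] - first_error["time"]).total_seconds()
print(gap_seconds)  # 621.0, i.e. a little over 10 minutes
```

The gap of about 10 minutes lines up with how long the host takes to
start accepting jobs after a reboot.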

Our IT person says he can connect to the SGE ports on both the qmaster
and exec hosts without issue.

I need some help figuring out exactly why the SGE qmaster is not happy
so that we can deploy a fix.  I am _assuming_ some kind of DNS/network
issue on our end.  This phenomenon is repeatable on all of our execution
hosts (although our server count is small at this point).  IT tells me
that nothing has changed in DNS between the time when SGE execution
hosts worked "correctly" following a reboot and now.


