[gridengine users] Odd commlib HOST_NOT_RESOLVABLE error
skylar2 at uw.edu
Fri Aug 24 14:33:32 UTC 2018
Can you run strace on the command while it's failing? Something like "strace
-e trace=open,connect qstat > /dev/null" should at least give a pointer to
where the failure is occurring. My first thought is that nscd is caching
a negative response for a few minutes, rather than retrying.
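Since the error text says it is the server that fails to resolve the client, it may also be worth running the same forward and reverse lookups on the qmaster host while the error is happening. Below is a minimal sketch of such a check in Python. The function name check_host and its report layout are made up for illustration, and it only mirrors what gethostbyname/gethostbyaddr would tell you through the system resolver (so it goes through nscd if nscd is in use); it is not how commlib itself resolves hosts.

```python
import socket


def check_host(name, forward=socket.gethostbyname_ex, reverse=socket.gethostbyaddr):
    """Forward-resolve `name`, then reverse-resolve each address found.

    `forward` and `reverse` default to the system resolver calls; they are
    parameters only so the check can be exercised with stand-in functions.
    Returns a dict with the addresses found and a list of problems.
    """
    report = {"name": name, "addresses": [], "problems": []}

    # Forward lookup: name -> list of IPv4 addresses.
    try:
        _, _, addrs = forward(name)
    except OSError as exc:  # socket.gaierror and socket.herror derive from OSError
        report["problems"].append(f"forward lookup failed: {exc}")
        return report
    report["addresses"] = list(addrs)

    # Reverse lookup: each address should map back to the same name.
    for addr in addrs:
        try:
            rname, aliases, _ = reverse(addr)
        except OSError as exc:
            report["problems"].append(f"reverse lookup of {addr} failed: {exc}")
            continue
        # Exact, case-insensitive comparison: a short name that reverses to
        # a fully qualified name is reported as a mismatch on purpose.
        known = {n.lower() for n in [rname, *aliases]}
        if name.lower() not in known:
            report["problems"].append(f"{addr} reverses to {rname}, not {name}")

    return report
```

Calling check_host() with a client's hostname on the master, once during a failure window and once after it clears, and comparing the "problems" lists should show whether the reverse lookup is what comes and goes.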
On Fri, Aug 24, 2018 at 10:25:42AM -0400, Valerio Luccio wrote:
> Hello all,
> we have a rather old installation of SGE that has been running for years
> without any problems. In the last 2-3 weeks I've been experiencing an
> odd problem: when issuing any command (qsub, qstat, qping, etc) I get
> the following error:
> error: commlib error: access denied (server host resolves destination host "<server address>" as "(HOST_NOT_RESOLVABLE)")
> error: unable to contact qmaster using port 6444 on host "<server address>"
> There are several odd things about this:
> * Nothing has changed on the server or the clients in the months
> before the error started appearing.
> * This happens from most of the clients, but not all.
> * The error persists for 5-10 minutes, and then everything works fine.
> * Both gethostbyname and gethostbyaddr return the correct values from
> the client while the error occurs (I haven't had a chance to try
> them from the master during these episodes).
> I get a feeling that this has something to do with DNS and reverse
> lookup, but I don't know where to start debugging it.
> Anyone have any clue what I should look at ?
> Valerio Luccio (212) 998-8736
> Center for Brain Imaging 4 Washington Place, Room 157
> New York University New York, NY 10003
> "In an open world, who needs windows or gates ?"
-- Skylar Thompson (skylar2 at u.washington.edu)
-- Genome Sciences Department, System Administrator
-- Foege Building S046, (206)-685-7354
-- University of Washington School of Medicine