[gridengine users] issue with qrsh "waiting on socket fd 4" in SGE 6.2u5

Manfred Selz Manfred.Selz at diasemi.com
Tue Nov 15 14:14:03 UTC 2016


Hi,

Similar issues were reported a long time ago, but I haven't seen a recent solution.

In one of our company's SGE 6.2u5 clusters, qrsh/qlogin jobs fail on some hosts with messages like this:

$  qrsh -l rhel=6,login=1,hostname=casrvodc-17 -verbose
...
Your job 1756874 ("QRLOGIN") has been submitted
waiting for interactive job to be scheduled ...timeout (3 s) expired while waiting on socket fd 4

Your interactive job 1756874 has been successfully scheduled.
timeout (5 s) expired while waiting on socket fd 4

This goes on for some time, and the jobs can even be seen briefly via qstat. However, they never actually start; they switch to the "dr" state and disappear after a minute or so.
The exec host's messages file has lines like this:

11/15/2016 05:59:50|  main|casrvodc-17|I|SIGNAL jid: 1756876 jatask: 1 signal: KILL

The main messages file has this:

11/15/2016 05:59:50|worker|casrvodc-01|I|mselz has registered the job 1756876 for deletion
11/15/2016 05:59:51|worker|casrvodc-01|I|removing trigger to terminate job 1756876.1
11/15/2016 05:59:51|worker|casrvodc-01|W|job 1756876.1 failed on host casrvodc-17.diasemi.com assumedly after job because: job 1756876.1 died through signal KILL (9)
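For anyone correlating the two messages files by job, here is a minimal sketch that splits such lines on "|" and extracts the job id. The field names (timestamp, thread, host, level, message) are my own labels for the five columns seen in the excerpts above, not official SGE terminology, and the job-id pattern only covers the "jid: N" and "job N" forms that appear in these excerpts.

```python
import re
from typing import NamedTuple, Optional


class SgeLogLine(NamedTuple):
    # Column names are assumed labels, not taken from SGE documentation.
    timestamp: str
    thread: str
    host: str
    level: str
    message: str


def parse_line(line: str) -> Optional[SgeLogLine]:
    """Split one messages-file line into its five '|'-separated fields."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None  # not in the expected five-column layout
    return SgeLogLine(*(p.strip() for p in parts))


# Matches "jid: 1756876" and "job 1756876" / "job 1756876.1".
JOB_ID = re.compile(r"\b(?:jid: |job )(\d+)")


def job_id(entry: SgeLogLine) -> Optional[int]:
    """Return the job id mentioned in the message text, if any."""
    match = JOB_ID.search(entry.message)
    return int(match.group(1)) if match else None
```

Filtering both files with job_id(entry) == 1756876 and sorting by timestamp would put the exec host and master entries for one job side by side.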

Until a few days ago, qrsh worked on all hosts in the cluster. It then suddenly stopped working on most (but not all!) of them, without any deliberate change to the SGE or host configuration (for instance, "uptime" confirms that the hosts have not been rebooted recently). Otherwise, the hosts in the cluster are all of the same hardware type, kernel version, etc., with no significant difference that I have been able to identify so far.

On the same hosts, "qsub -now y" also fails.

I have verified proper SGE execd operation and host identification with "qping", "gethostbyaddr", and "gethostbyname", and it all looks fine.
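A forward/reverse lookup round trip of this kind can be sketched roughly as follows. This uses Python's standard socket module rather than the SGE utilities named above, so it only reflects the system resolver and does not account for any SGE-specific host alias handling. The function name check_host and the returned dictionary layout are illustrative assumptions.

```python
import socket


def check_host(name, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Resolve name -> address -> name and report whether the round trip agrees.

    The resolver functions are parameters so the check can be exercised
    without touching the network.
    """
    addr = forward(name)
    canonical, aliases, _addresses = reverse(addr)
    names = {canonical.lower(), *(a.lower() for a in aliases)}
    short_names = {n.split(".")[0] for n in names}
    wanted = name.lower()
    # Accept a match on either the full name or the short (unqualified) name.
    consistent = wanted in names or wanted.split(".")[0] in short_names
    return {
        "name": name,
        "address": addr,
        "reverse": canonical,
        "consistent": consistent,
    }
```

Running check_host("casrvodc-17") on both the master and the affected exec host, and comparing the results, would show whether the two sides agree on the host's name and address.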

Currently I am quite puzzled. I'd appreciate any input on how to debug this further or resolve it.

Best regards,
Manfred


