[gridengine users] issue with qrsh "waiting on socket fd 4" in SGE 6.2u5

Reuti reuti at staff.uni-marburg.de
Tue Nov 15 14:30:37 UTC 2016


Hi,

> On 15.11.2016 at 15:14, Manfred Selz <Manfred.Selz at diasemi.com> wrote:
> 
> Hi,
>  
> Similar issues were reported a long time ago, but I haven’t seen a recent solution.
>  
> In one of our company’s SGE 6.2u5 clusters, qrsh/qlogin jobs fail on selected hosts with messages like this:
>  
> $  qrsh -l rhel=6,login=1,hostname=casrvodc-17 -verbose
> ...
> Your job 1756874 ("QRLOGIN") has been submitted                                      
> waiting for interactive job to be scheduled ...timeout (3 s) expired while waiting on socket fd 4
>  
> Your interactive job 1756874 has been successfully scheduled.
> timeout (5 s) expired while waiting on socket fd 4  

Did you enable any firewall in the cluster to block certain ports on the nodes?
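
A plain TCP connect test between the hosts is one quick way to check this. The snippet below is only a rough sketch and not something SGE ships. As far as I understand, for interactive jobs the exec host has to open a connection back to a port that qrsh listens on at the submit host, so ports blocked in that direction would fit the timeouts. The helper name `can_connect` and the host and port in the usage comment are placeholders.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, unreachable hosts,
        # and name-resolution failures.
        return False


# Hypothetical usage, run on the exec host while a test listener is
# bound to the chosen port on the submit host:
#   print(can_connect("submit-host.example.com", 40000))
```

Running it in both directions (submit host to exec host and back) on a working and a failing node should show whether the two differ.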

-- Reuti


> This goes on for some time, and the jobs can even be seen briefly via qstat. However, the jobs never really start; they switch to the “dr” state and are finally gone (after a minute or so).
> The exec host’s messages file has lines like this:
>  
> 11/15/2016 05:59:50|  main|casrvodc-17|I|SIGNAL jid: 1756876 jatask: 1 signal: KILL
>  
> The main messages file has this:
>  
> 11/15/2016 05:59:50|worker|casrvodc-01|I|mselz has registered the job 1756876 for deletion
> 11/15/2016 05:59:51|worker|casrvodc-01|I|removing trigger to terminate job 1756876.1
> 11/15/2016 05:59:51|worker|casrvodc-01|W|job 1756876.1 failed on host casrvodc-17.diasemi.com assumedly after job because: job 1756876.1 died through signal KILL (9)
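
To line up the execd and qmaster entries for one job, a small filter over the two messages files can help. This is a sketch that assumes the pipe-separated layout visible in the lines quoted above (timestamp, thread, host, level, message). `LogEntry`, `parse_line`, and `entries_for_job` are names made up here, not part of SGE.

```python
from datetime import datetime
from typing import Iterable, Iterator, NamedTuple, Optional


class LogEntry(NamedTuple):
    when: datetime
    thread: str
    host: str
    level: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Parse one messages-file line; return None if it does not match."""
    # Split on the first four pipes only, so pipes inside the message survive.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    stamp, thread, host, level, message = parts
    try:
        # Assumed timestamp layout: month/day/year hour:minute:second.
        when = datetime.strptime(stamp.strip(), "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return LogEntry(when, thread.strip(), host.strip(), level.strip(), message)


def entries_for_job(lines: Iterable[str], job_id: str) -> Iterator[LogEntry]:
    """Yield parsed entries whose message text mentions the given job id."""
    for line in lines:
        entry = parse_line(line)
        if entry is not None and job_id in entry.message:
            yield entry
```

Feeding it the lines of both files and sorting the result by `when` gives a single timeline for the job.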
>  
> Until a few days ago, qrsh worked on all hosts in the cluster. It suddenly stopped working for most (but not all!) of them, without any deliberate change to the SGE or host configuration (for instance, “uptime” confirms that the hosts have not been rebooted recently). Otherwise, the hosts in the cluster are all of the same type (hardware), kernel version, etc., with no significant difference that I have been able to identify so far.
>  
> On the same hosts, “qsub -now y” also fails.
>  
> I have verified proper SGE execd operation and host identification with “qping”, “gethostbyaddr”, and “gethostbyname”, and it all looks fine.
>  
> Currently I am quite puzzled. I’d appreciate any input on how to debug or resolve this further.
>  
> Best regards,
> Manfred