[gridengine users] commlib

Coleman, Marcus [JRDUS Non-J&J] mcolem19 at its.jnj.com
Sun Nov 27 02:23:51 UTC 2016


Hi Reuti

I am not sure what I am looking for, but here are the contents of /tmp on the rebooting node.
Does anything stand out to you?

[root at padme tmp]# ls -l
total 20
prw-rw-r--  1 mcolem19 mcolem19    0 Nov 23 22:09 jmonitor.mcolem19.37995
prw-rw-r--  1 mcolem19 mcolem19    0 Nov 23 22:35 jmonitor.mcolem19.38497
prw-rw-r--  1 mcolem19 mcolem19    0 Nov 23 22:45 jmonitor.mcolem19.38615
prw-rw-r--  1 mcolem19 mcolem19    0 Nov 23 22:45 jmonitor.mcolem19.38624
prw-rw-r--  1 schrogpu schrogpu    0 Sep  5 00:27 jmonitor.schrogpu.28331
prw-rw-r--  1 schrogpu schrogpu    0 Sep  5 00:27 jmonitor.schrogpu.28377
prw-rw-r--  1 schrogpu schrogpu    0 Sep  5 00:40 jmonitor.schrogpu.31781
prw-rw-r--  1 schrogpu schrogpu    0 Sep  5 00:41 jmonitor.schrogpu.31829
prw-rw-r--  1 schrogpu schrogpu    0 Sep  9 12:17 jmonitor.schrogpu.5042
prw-rw-r--  1 schrogpu schrogpu    0 Sep  9 12:17 jmonitor.schrogpu.5043
prw-rw-r--  1 schrogpu schrogpu    0 Sep  5 00:08 jmonitor.schrogpu.8041
prw-rw-r--  1 schrogpu schrogpu    0 Sep  5 00:39 jmonitor.schrogpu.8220
prw-rw-r--  1 schrogpu schrogpu    0 Sep  5 00:26 jmonitor.schrogpu.8346
prw-rw-r--  1 schrogpu schrogpu    0 Sep  5 00:39 jmonitor.schrogpu.8557
prw-rw-r--  1 schrogpu schrogpu    0 Sep  5 00:27 jmonitor.schrogpu.8740
drwx------  2 root     root     4096 Nov  4 16:09 keyring-6CWKlB
drwxrwxrwx  2 mcolem19 mcolem19 4096 Nov 23 11:03 mmjob.lock
prw-------  1 schrogpu schrogpu    0 Sep  5 00:27 mmjob.schrogpu.28352
prw-------  1 schrogpu schrogpu    0 Sep  5 00:27 mmjob.schrogpu.28400
prw-------  1 schrogpu schrogpu    0 Sep  5 00:27 mmjob.schrogpu.28480
prw-------  1 schrogpu schrogpu    0 Sep  5 00:27 mmjob.schrogpu.28487
prw-------  1 schrogpu schrogpu    0 Sep  5 00:39 mmjob.schrogpu.31802
prw-------  1 schrogpu schrogpu    0 Sep  5 00:39 mmjob.schrogpu.31850
prw-------  1 schrogpu schrogpu    0 Sep  5 00:40 mmjob.schrogpu.31876
prw-------  1 schrogpu schrogpu    0 Sep  5 00:41 mmjob.schrogpu.31891
prw-------  1 schrogpu schrogpu    0 Sep  5 00:08 mmjob.schrogpu.8087
prw-------  1 schrogpu schrogpu    0 Sep  5 00:39 mmjob.schrogpu.8266
prw-------  1 schrogpu schrogpu    0 Sep  5 00:26 mmjob.schrogpu.8392
prw-------  1 schrogpu schrogpu    0 Sep  5 00:39 mmjob.schrogpu.8603
prw-------  1 schrogpu schrogpu    0 Sep  5 00:27 mmjob.schrogpu.8787
drwx------  2 gdm      gdm      4096 Nov 25 07:42 orbit-gdm
drwx------. 2 gdm      gdm      4096 Nov 25 07:42 pulse-5mlDwNemaGym
drwx------  2 root     root     4096 Nov  4 16:09 pulse-GAI9xhuCTgeg
[root at padme tmp]#
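
For anyone sorting through a listing like this, here is a minimal Python sketch that groups the entries of a directory by file type and picks out names that look like Grid Engine startup logs. The `execd_messages.*` pattern is an assumption about how execd names a fallback log in /tmp when it cannot write to its spool directory, and `summarize_tmp` is only an illustrative name, not part of any SGE tool.

```python
import fnmatch
import os
import stat
from collections import Counter


def summarize_tmp(path="/tmp", pattern="execd_messages.*"):
    """Count entries in `path` by file type and list names matching `pattern`.

    The default pattern is an assumption about the execd fallback log name.
    """
    kinds = Counter()
    matches = []
    for name in sorted(os.listdir(path)):
        try:
            # lstat so that symlinks are not followed
            mode = os.lstat(os.path.join(path, name)).st_mode
        except FileNotFoundError:
            continue  # entry vanished between listdir and lstat
        if stat.S_ISFIFO(mode):
            kinds["fifo"] += 1
        elif stat.S_ISDIR(mode):
            kinds["dir"] += 1
        elif stat.S_ISREG(mode):
            kinds["file"] += 1
        else:
            kinds["other"] += 1
        if fnmatch.fnmatch(name, pattern):
            matches.append(name)
    return kinds, matches


if __name__ == "__main__":
    kinds, matches = summarize_tmp()
    print(dict(kinds))
    print(matches or "no files matching the pattern")
```

The listing above contains only named pipes (the jmonitor.* and mmjob.* entries) and directories, and no name matching that pattern.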


-----Original Message-----
From: Reuti [mailto:reuti at staff.uni-marburg.de] 
Sent: Saturday, November 26, 2016 6:31 AM
To: Coleman, Marcus [JRDUS Non-J&J]
Cc: users at gridengine.org
Subject: [EXTERNAL] Re: [gridengine users] commlib

Hi,

On 26.11.2016 at 06:10, Coleman, Marcus [JRDUS Non-J&J] wrote:

> I am having an issue with a node rebooting. I am running Desmond fep 
> jobs...
>  
> Thanks for any help in advance!
>  
> /etc/resolv.conf is the same on all nodes.
> /etc/hosts is the same on all nodes.
> All nodes are connected to the same switch in a server rack.
>  
>  
> Qping from master to node
> [root at rndusljpp2 lx-amd64]# qping padme 6445 execd 1
> 11/25/2016 20:57:26 endpoint padme/execd/1 at port 6445 is up for 
> 16733 seconds
> 11/25/2016 20:57:27 endpoint padme/execd/1 at port 6445 is up for 
> 16734 seconds
> 11/25/2016 20:57:28 endpoint padme/execd/1 at port 6445 is up for 
> 16735 seconds
> 11/25/2016 20:57:29 endpoint padme/execd/1 at port 6445 is up for 
> 16736 seconds
> 11/25/2016 20:57:30 endpoint padme/execd/1 at port 6445 is up for 
> 16737 seconds
> 11/25/2016 20:57:31 endpoint padme/execd/1 at port 6445 is up for 
> 16738 seconds
>  
> Qping from node to master
> [root at padme ~]# qping s1 6444 qmaster 1
> 11/25/2016 20:59:10 endpoint rndusljpp2.na.jnj.com/qmaster/1 at port 
> 6444 is up for 2440537 seconds
> 11/25/2016 20:59:11 endpoint rndusljpp2.na.jnj.com/qmaster/1 at port 
> 6444 is up for 2440538 seconds
> 11/25/2016 20:59:12 endpoint rndusljpp2.na.jnj.com/qmaster/1 at port 
> 6444 is up for 2440539 seconds
> 11/25/2016 20:59:13 endpoint rndusljpp2.na.jnj.com/qmaster/1 at port 
> 6444 is up for 2440540 seconds
> 11/25/2016 20:59:14 endpoint rndusljpp2.na.jnj.com/qmaster/1 at port 
> 6444 is up for 2440541 seconds
>  
> ################### from NODE
> [root at padme lx-amd64]# ./gethostbyaddr -name 192.168.1.8
> rndusljpp2.na.jnj.com
> [root at padme lx-amd64]# ./gethostbyname -name s1
> rndusljpp2.na.jnj.com
> ################### from QMASTER
> [root at rndusljpp2 lx-amd64]# ./gethostbyaddr -name 192.168.1.159
> padme
> [root at rndusljpp2 lx-amd64]# ./gethostbyname -name padme
> padme
>  
>  
> ############# NODE SGE logs
>  
> 11/25/2016 07:38:56|  main|padme|I|restarting load sensor 
> /opt/schrodinger/2016-3/utilities/flexlm_sensor.pl
> 11/25/2016 07:38:56|  main|padme|W|[load_sensor 6137] fflush failed 
> [Broken pipe]
> 11/25/2016 07:38:57|  main|padme|W|load sensor exited with exit status 
> = 1
> 11/25/2016 07:39:36|  main|padme|I|restarting load sensor 
> /opt/schrodinger/2016-3/utilities/flexlm_sensor.pl
> 11/25/2016 07:39:36|  main|padme|W|[load_sensor 6139] fflush failed 
> [Broken pipe]
> 11/25/2016 07:39:37|  main|padme|W|load sensor exited with exit status 
> = 1
> 11/25/2016 07:41:58|  main|padme|I|starting load sensor 
> /opt/schrodinger/2016-3/utilities/flexlm_sensor.pl
> 11/25/2016 07:41:58|  main|padme|I|registered at qmasterhost "rndusljpp2.na.jnj.com"
> 11/25/2016 07:41:58|  main|padme|I|starting up SGE 8.1.8(lx-amd64)
> 11/25/2016 07:41:58|  main|padme|I|memory accounting inaccurate with 
> USE_SMAPS=false
> 11/25/2016 07:41:58|  main|padme|I|successfully started PDC and PTF
> 11/25/2016 07:41:58|  main|padme|I|checking for old jobs
> 11/25/2016 07:41:58|  main|padme|I|no old jobs at startup
> 11/25/2016 07:41:59|  main|padme|W|load sensor exited with exit status 
> = 1
> 11/25/2016 07:42:38|  main|padme|I|restarting load sensor 
> /opt/schrodinger/2016-3/utilities/flexlm_sensor.pl
> 11/25/2016 07:42:38|  main|padme|W|[load_sensor 5111] fflush failed 
> [Broken pipe]
>  
> ############# QMASTER log
> 11/25/2016 07:41:27|listen|rndusljpp2|E|commlib error: endpoint is not 
> unique error (endpoint "padme/execd/1" is already connected)
> 11/25/2016 07:41:27|listen|rndusljpp2|E|commlib error: got select 
> error (Connection reset by peer)
> 11/25/2016 07:41:29|worker|rndusljpp2|I|execd on padme registered

Are there any files in /tmp on the node pointing to a problem starting execd?

-- Reuti
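
The execd and qmaster excerpts quoted above use a pipe-separated layout, which makes it easy to tally warnings and errors instead of reading the log by eye. Below is a minimal Python sketch of that idea. The field names (timestamp, thread, host, level, text) are inferred from the excerpts in this thread rather than taken from any SGE documentation, `parse_line` and `count_by_level` are hypothetical helper names, and the sample lines are copied from the quoted logs with the mail wrapping rejoined.

```python
from collections import Counter


def parse_line(line):
    """Split one messages-file line into (timestamp, thread, host, level, text).

    Returns None for lines that do not have five pipe-separated fields.
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return tuple(p.strip() for p in parts)


def count_by_level(lines):
    """Count parsed entries per severity letter (I, W and E appear in the thread)."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is not None:
            counts[entry[3]] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "11/25/2016 07:41:27|listen|rndusljpp2|E|commlib error: got select error (Connection reset by peer)",
        "11/25/2016 07:41:29|worker|rndusljpp2|I|execd on padme registered",
        "11/25/2016 07:41:59|  main|padme|W|load sensor exited with exit status = 1",
    ]
    print(dict(count_by_level(sample)))
```

In practice the lines would come from the node's or the qmaster's messages file; the path to that file depends on the local installation and is not shown in the thread.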



