[gridengine users] commlib

Reuti reuti at staff.uni-marburg.de
Tue Nov 29 14:01:33 UTC 2016


> On 29.11.2016 at 00:17, Coleman, Marcus [JRDUS Non-J&J] <mcolem19 at its.jnj.com> wrote:
> 
> Reuti
> 
> So it rebooted again without any jobs running... and I don't understand "sgeadmin at rndusljpp2.na.jnj.com removed "mcolem19" from user list", but as you can see I got added back?

Yes, there is an auto-delete time for users who were added automatically because of a job submission.

$ qconf -suser mcolem19

will show when the next deletion will take place (unless you set it to 0).

$ qconf -suserl

shows all currently known users.
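
To read the value without scanning the output by eye, something like the following may help. It is only a sketch: it assumes the user object prints a line of the form "delete_time <seconds since 1970>", and that 0 means a permanent user and 1 means the user currently has active jobs, as I recall from user(5). Please check the man page of your installation. The function names are made up for this example.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Grid Engine.

# Read `qconf -suser <name>` output on stdin and print the delete_time value.
# Assumes a line of the form "delete_time <number>".
user_delete_time() {
    awk '$1 == "delete_time" { print $2; exit }'
}

# Turn the raw value into a short description.
# Assumed meanings (verify against user(5)): 0 = permanent, 1 = active jobs.
describe_delete_time() {
    case "$1" in
        0) echo "permanent user, never auto-deleted" ;;
        1) echo "user currently has active jobs" ;;
        ''|*[!0-9]*) echo "no delete_time found" ;;
        *) echo "scheduled for deletion at epoch $1" ;;
    esac
}

# Usage on a host with qconf in PATH:
#   t=$(qconf -suser mcolem19 | user_delete_time)
#   describe_delete_time "$t"
#   date -d "@$t"    # GNU date only: epoch seconds to a readable date
```

The grace period itself should come from auto_user_delete_time in the global configuration (qconf -sconf), if your version uses that parameter name.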

-- Reuti

> 
> 11/27/2016 01:30:04| timer|rndusljpp2|I|sgeadmin at rndusljpp2.na.jnj.com removed "mcolem19" from user list
> 11/27/2016 01:30:04| timer|rndusljpp2|I|sgeadmin at rndusljpp2.na.jnj.com removed "mcolem19" from user list
> 11/27/2016 20:35:12|listen|rndusljpp2|E|commlib error: endpoint is not unique error (endpoint "padme/execd/1" is already connected)
> 11/27/2016 20:35:12|listen|rndusljpp2|E|commlib error: got select error (Connection reset by peer)
> 11/27/2016 20:35:13|worker|rndusljpp2|I|execd on padme registered
> 11/28/2016 06:26:20|listen|rndusljpp2|E|commlib error: endpoint is not unique error (endpoint "padme/execd/1" is already connected)
> 11/28/2016 06:26:20|listen|rndusljpp2|E|commlib error: got select error (Connection reset by peer)
> 11/28/2016 06:26:20|worker|rndusljpp2|I|execd on padme registered
> 11/28/2016 08:49:52|listen|rndusljpp2|E|commlib error: endpoint is not unique error (endpoint "padme/execd/1" is already connected)
> 11/28/2016 08:49:52|listen|rndusljpp2|E|commlib error: got select error (Connection reset by peer)
> 11/28/2016 08:49:52|worker|rndusljpp2|I|execd on padme registered
> 11/28/2016 13:25:54|worker|rndusljpp2|I|sgeadmin at rndusljpp2.na.jnj.com added "mcolem19" to user list
> 
> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de] 
> Sent: Monday, November 28, 2016 11:55 AM
> To: Coleman, Marcus [JRDUS Non-J&J]
> Cc: users at gridengine.org
> Subject: [EXTERNAL] Re: [gridengine users] commlib
> 
> 
> On 28.11.2016 at 20:36, Coleman, Marcus [JRDUS Non-J&J] wrote:
> 
>> Thanks Reuti! 
>> 
>> I was hoping it was something there....Any ideas on where to go from here?
> 
> What do:
> 
> $ ./gethostbyname -all padme
> $ ./gethostbyaddr -all 192.168.1.159
> 
> show on the node and headnode?
> 
> -- Reuti
> 
> 
>> -----Original Message-----
>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: Sunday, November 27, 2016 4:37 AM
>> To: Coleman, Marcus [JRDUS Non-J&J]
>> Cc: users at gridengine.org
>> Subject: [EXTERNAL] Re: [gridengine users] commlib
>> 
>> 
>> On 27.11.2016 at 03:23, Coleman, Marcus [JRDUS Non-J&J] wrote:
>> 
>>> Hi Reuti
>>> 
>>> I am not sure what I am looking for... but here are the contents of
>>> /tmp on the rebooting node. Anything that stands out?
>>> 
>>> [root at padme tmp]# ls -l
>>> total 20
>>> prw-rw-r--  1 mcolem19 mcolem19    0 Nov 23 22:09 jmonitor.mcolem19.37995
>>> prw-rw-r--  1 mcolem19 mcolem19    0 Nov 23 22:35 jmonitor.mcolem19.38497
>>> prw-rw-r--  1 mcolem19 mcolem19    0 Nov 23 22:45 jmonitor.mcolem19.38615
>>> prw-rw-r--  1 mcolem19 mcolem19    0 Nov 23 22:45 jmonitor.mcolem19.38624
>>> prw-rw-r--  1 schrogpu schrogpu    0 Sep  5 00:27 jmonitor.schrogpu.28331
>>> prw-rw-r--  1 schrogpu schrogpu    0 Sep  5 00:27 jmonitor.schrogpu.28377
>>> prw-rw-r--  1 schrogpu schrogpu    0 Sep  5 00:40 jmonitor.schrogpu.31781
>>> prw-rw-r--  1 schrogpu schrogpu    0 Sep  5 00:41 jmonitor.schrogpu.31829
>>> prw-rw-r--  1 schrogpu schrogpu    0 Sep  9 12:17 jmonitor.schrogpu.5042
>>> prw-rw-r--  1 schrogpu schrogpu    0 Sep  9 12:17 jmonitor.schrogpu.5043
>>> prw-rw-r--  1 schrogpu schrogpu    0 Sep  5 00:08 jmonitor.schrogpu.8041
>>> prw-rw-r--  1 schrogpu schrogpu    0 Sep  5 00:39 jmonitor.schrogpu.8220
>>> prw-rw-r--  1 schrogpu schrogpu    0 Sep  5 00:26 jmonitor.schrogpu.8346
>>> prw-rw-r--  1 schrogpu schrogpu    0 Sep  5 00:39 jmonitor.schrogpu.8557
>>> prw-rw-r--  1 schrogpu schrogpu    0 Sep  5 00:27 jmonitor.schrogpu.8740
>>> drwx------  2 root     root     4096 Nov  4 16:09 keyring-6CWKlB
>>> drwxrwxrwx  2 mcolem19 mcolem19 4096 Nov 23 11:03 mmjob.lock
>>> prw-------  1 schrogpu schrogpu    0 Sep  5 00:27 mmjob.schrogpu.28352
>>> prw-------  1 schrogpu schrogpu    0 Sep  5 00:27 mmjob.schrogpu.28400
>>> prw-------  1 schrogpu schrogpu    0 Sep  5 00:27 mmjob.schrogpu.28480
>>> prw-------  1 schrogpu schrogpu    0 Sep  5 00:27 mmjob.schrogpu.28487
>>> prw-------  1 schrogpu schrogpu    0 Sep  5 00:39 mmjob.schrogpu.31802
>>> prw-------  1 schrogpu schrogpu    0 Sep  5 00:39 mmjob.schrogpu.31850
>>> prw-------  1 schrogpu schrogpu    0 Sep  5 00:40 mmjob.schrogpu.31876
>>> prw-------  1 schrogpu schrogpu    0 Sep  5 00:41 mmjob.schrogpu.31891
>>> prw-------  1 schrogpu schrogpu    0 Sep  5 00:08 mmjob.schrogpu.8087
>>> prw-------  1 schrogpu schrogpu    0 Sep  5 00:39 mmjob.schrogpu.8266
>>> prw-------  1 schrogpu schrogpu    0 Sep  5 00:26 mmjob.schrogpu.8392
>>> prw-------  1 schrogpu schrogpu    0 Sep  5 00:39 mmjob.schrogpu.8603
>>> prw-------  1 schrogpu schrogpu    0 Sep  5 00:27 mmjob.schrogpu.8787
>>> drwx------  2 gdm      gdm      4096 Nov 25 07:42 orbit-gdm
>>> drwx------. 2 gdm      gdm      4096 Nov 25 07:42 pulse-5mlDwNemaGym
>>> drwx------  2 root     root     4096 Nov  4 16:09 pulse-GAI9xhuCTgeg
>> 
>> Thanks. I was looking for a file that the execd creates when it runs into problems during startup. Such files are written to /tmp as a last resort when the regular log files cannot be used. There are none, so the startup itself was successful.
>> 
>> 
>>> [root at padme tmp]#
>>> 
>>> 
>>> -----Original Message-----
>>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>>> Sent: Saturday, November 26, 2016 6:31 AM
>>> To: Coleman, Marcus [JRDUS Non-J&J]
>>> Cc: users at gridengine.org
>>> Subject: [EXTERNAL] Re: [gridengine users] commlib
>>> 
>>> Hi,
>>> 
>>> On 26.11.2016 at 06:10, Coleman, Marcus [JRDUS Non-J&J] wrote:
>>> 
>>>> I am having an issue with a node rebooting. I am running Desmond fep 
>>>> jobs...
>>>> 
>>>> Thanks for any help in advance!
>>>> 
>>>> /etc/resolv.conf is the same on all nodes
>>>> /etc/hosts is the same on all nodes
>>>> All nodes are connected to the same switch in a server rack.
>>>> ################### from NODE
>>>> [root at padme lx-amd64]# ./gethostbyaddr -name 192.168.1.8
>>>> rndusljpp2.na.jnj.com
>>>> [root at padme lx-amd64]# ./gethostbyname -name s1
>>>> rndusljpp2.na.jnj.com
>>>> ################### from QMASTER
>>>> [root at rndusljpp2 lx-amd64]# ./gethostbyaddr -name 192.168.1.159
>>>> padme
>>>> [root at rndusljpp2 lx-amd64]# ./gethostbyname -name padme
>>>> padme
>> 
>> What do:
>> 
>> $ ./gethostbyname -all padme
>> $ ./gethostbyaddr -all 192.168.1.159
>> 
>> show?
>> 
>> -- Reuti
>> 
> 
> 




