[gridengine users] Solved: AW: New Execution Host: load_avg = -NA-

RATH Jochen (AREVA) jochen.rath at areva.com
Wed Nov 14 09:33:50 UTC 2012


It was a network setting error.
In my /etc/hosts file, I had the entries
127.0.0.1       localhost.localdomain   localhost.localdomain   localhost4      localhost4.localdomain4 localhost calcuserver03
::1     localhost.localdomain   localhost.localdomain   localhost6      localhost6.localdomain6 localhost calcuserver03

With these entries, I think it tried to send its information under the name localhost.localdomain and got an error. After I changed the /etc/hosts file to
127.0.0.1       localhost.localdomain   localhost.localdomain   localhost4      localhost4.localdomain4 localhost 
::1     localhost.localdomain   localhost.localdomain   localhost6      localhost6.localdomain6 localhost 
11.53.103.149 calcuserver03

The qmaster now gets the needed information from the calculation server:
[root@ master ge2011.11]# qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
calcuserver03           linux-x64      32  0.00  126.0G    1.6G   50.0G   46.9M
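
For anyone who wants to check a hosts file for the same pattern, here is a minimal sketch in Python. It is not part of SGE or of the setup described here; the function name loopback_aliases and the two sample strings are made up for illustration, and the rule "every name on a loopback line should start with localhost" is an assumption that fits this case. It reports any other name that is mapped to 127.0.0.1 or ::1:

```python
import ipaddress


def loopback_aliases(hosts_text, allowed_prefix="localhost"):
    """Return (address, name) pairs where a name that does not start
    with allowed_prefix is mapped to a loopback address."""
    findings = []
    for raw in hosts_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            # Not a parsable IP address; skip the line.
            continue
        if not addr.is_loopback:
            continue
        for name in fields[1:]:
            if not name.startswith(allowed_prefix):
                findings.append((str(addr), name))
    return findings


# Sample data modelled on the entries quoted in this mail (hypothetical strings).
BROKEN = """\
127.0.0.1 localhost.localdomain localhost4 localhost4.localdomain4 localhost calcuserver03
::1 localhost.localdomain localhost6 localhost6.localdomain6 localhost calcuserver03
"""

FIXED = """\
127.0.0.1 localhost.localdomain localhost4 localhost4.localdomain4 localhost
::1 localhost.localdomain localhost6 localhost6.localdomain6 localhost
11.53.103.149 calcuserver03
"""

if __name__ == "__main__":
    for label, text in (("before", BROKEN), ("after", FIXED)):
        print(label, loopback_aliases(text))
```

To run it against a real machine, pass the contents of /etc/hosts as the string. An empty result only means this particular pattern is absent, not that name resolution is correct in general.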

Maybe this will help others with similar problems.

Regards
   Jochen

-----Original Message-----
From: users-bounces at gridengine.org [mailto:users-bounces at gridengine.org] On Behalf Of RATH Jochen (AREVA Wind GmbH)
Sent: Tuesday, 13 November 2012 15:32
To: Reuti
Cc: users at gridengine.org
Subject: Re: [gridengine users] New Execution Host: load_avg = -NA-

Hello

No, no older daemon. The error occurred because I tried to stop the sge_execd daemon with the wrong command.

Regards
  Jochen

-----Original Message-----
From: Reuti [mailto:reuti at Staff.Uni-Marburg.DE]
Sent: Tuesday, 13 November 2012 15:31
To: RATH Jochen (AREVA Wind GmbH)
Cc: users at gridengine.org
Subject: Re: AW: [gridengine users] New Execution Host: load_avg = -NA-

On 13.11.2012 at 14:25, RATH Jochen (AREVA) wrote:

> Thanks for your reply.
> When I look for SGE with ps, it is still running:
> [jrath@ calcuserver03 tmp]$ ps aux | grep sge
> rsmadmin  4203  0.0  0.0 161944  1976 ?        Sl   13:04   0:01 /data_storage/HPC/ge2011.11/bin/linux-x64/sge_execd
> 
> In /tmp I find only two execd message files, which are from my first attempt, when I tried to uninstall SGE and reinstall it:
> [jrath@ calcuserver03 tmp]$ cat execd_messages.4055
> 11/13/2012 12:39:10|  main| calcuserver03|W|daemonize error: child exited before sending daemonize state

Is there an older daemon still running?

-- Reuti


> Regards
>  Jochen
> 
> -----Original Message-----
> From: Reuti [mailto:reuti at Staff.Uni-Marburg.DE]
> Sent: Tuesday, 13 November 2012 14:10
> To: RATH Jochen (AREVA Wind GmbH)
> Cc: users at gridengine.org
> Subject: Re: [gridengine users] New Execution Host: load_avg = -NA-
> 
> Hi,
> 
> On 13.11.2012 at 13:26, RATH Jochen (AREVA) wrote:
> 
>> I have added a new execution host to my existing OGE pool. Unfortunately I can't start jobs on it, because its load average is not being reported to the qmaster host:
>> [root@ master ge2011.11]# qstat -F la
>> queuename                      qtype resv/used/tot. load_avg arch          states
>> ---------------------------------------------------------------------------------
>> all.q@calcuserver03.edom.ad.corp BIP   0/0/32         -NA-     -NA-          a
>> ---------------------------------------------------------------------------------
>> all.q@calcuserver02.edom.ad.corp BIP   0/2/12         10.15    linux-x64
>>       hl:load_avg=10.150000
>> ---------------------------------------------------------------------------------
>> all.q@calcuserver01.edom.ad.corp BIP   0/0/12         0.00     linux-x64
>>       hl:load_avg=0.000000
>> 
>> My grid consists of one master and now three execution nodes. Everything is installed in an NFS directory, /data_storage, which is stored on the master. The messages file of calcuserver03 contains:
>> [root@ master calcuserver03]# cat messages
>> 11/13/2012 13:04:21|  main| calcuserver03|W|local configuration localhost.localdomain not defined - using global configuration
>> 11/13/2012 13:04:21|  main| calcuserver03|I|starting up OGS/GE 2011.11 (linux-x64)
> 
> This message is harmless. It looks like the exechost can contact the qmaster (to request the configuration), which is fine. But is the execd still running? Maybe it crashed during startup - is there any file "execd..." in /tmp? I suppose the `qhost` output shows similar information.
> 
> 
>> The master and calcuserver01 run RHEL 5.8; calcuserver02 and calcuserver03 run RHEL 6.3. iptables is stopped on every server, and all of them are listed in /etc/hosts.allow.
> 
> This is only necessary for applications using the tcp-wrapper and if certain/all services are denied in /etc/hosts.deny by default.
> 
> -- Reuti
> 
>> Why can't the qmaster get the load_avg of the new server? What further information do you need?
>> 
>> Regards
>>     Jochen
>> 
>> _______________________________________________
>> users mailing list
>> users at gridengine.org
>> https://gridengine.org/mailman/listinfo/users
> 
> 




