[gridengine users] New Execution Host: load_avg = -NA-

Reuti reuti at Staff.Uni-Marburg.DE
Tue Nov 13 14:30:32 UTC 2012


On 13.11.2012 at 14:25, RATH Jochen (AREVA) wrote:

> Thanks for your reply.
> When I look for SGE with ps, it is still running:
> [jrath@ calcuserver03 tmp]$ ps aux | grep sge
> rsmadmin  4203  0.0  0.0 161944  1976 ?        Sl   13:04   0:01 /data_storage/HPC/ge2011.11/bin/linux-x64/sge_execd
> 
> In /tmp I find only two execd message files, which are from my first attempt, when I tried to uninstall and reinstall SGE:
> [jrath@ calcuserver03 tmp]$ cat execd_messages.4055
> 11/13/2012 12:39:10|  main| calcuserver03|W|daemonize error: child exited before sending daemonize state

Is there an older daemon still running?

-- Reuti
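Reuti's question can be checked without the pitfall of `ps aux | grep sge`, which may also list the grep process itself. A minimal sketch, assuming a POSIX shell, `awk`, and a `ps` that accepts `-eo pid=,args=` (procps, as on RHEL); `find_execd` and `count_execd` are hypothetical helper names, not Grid Engine commands.

```shell
#!/bin/sh
# Sketch (hypothetical helpers): detect more than one sge_execd on a host.

# Print each "pid command..." line whose command basename is exactly
# sge_execd. Input is expected in the form produced by `ps -eo pid=,args=`.
find_execd() {
    awk '{ n = split($2, part, "/"); if (part[n] == "sge_execd") print }'
}

# Print how many sge_execd processes appear in the input.
count_execd() {
    find_execd | wc -l | tr -d ' '
}

# Usage on the execution host:
#   ps -eo pid=,args= | find_execd     # list every running sge_execd
#   ps -eo pid=,args= | count_execd    # a value above 1 suggests an old daemon
```

Reading from stdin rather than calling `ps` inside the helpers keeps them usable on saved `ps` output as well.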


> Regards
>  Jochen
> 
> -----Original Message-----
> From: Reuti [mailto:reuti at Staff.Uni-Marburg.DE] 
> Sent: Tuesday, 13 November 2012 14:10
> To: RATH Jochen (AREVA Wind GmbH)
> Cc: users at gridengine.org
> Subject: Re: [gridengine users] New Execution Host: load_avg = -NA-
> 
> Hi,
> 
> On 13.11.2012 at 13:26, RATH Jochen (AREVA) wrote:
> 
>> I have added a new execution host to my existing OGE pool. Unfortunately I can't start jobs on it, because its load average is not reported to the qmaster host:
>> [root@ master ge2011.11]# qstat -F la
>> queuename                      qtype resv/used/tot. load_avg arch          states
>> ---------------------------------------------------------------------------------
>> all.q@calcuserver03.edom.ad.corp BIP   0/0/32         -NA-     -NA-          a
>> ---------------------------------------------------------------------------------
>> all.q@calcuserver02.edom.ad.corp BIP   0/2/12         10.15    linux-x64
>>       hl:load_avg=10.150000
>> ---------------------------------------------------------------------------------
>> all.q@calcuserver01.edom.ad.corp BIP   0/0/12         0.00     linux-x64
>>       hl:load_avg=0.000000
>> 
>> My grid consists of one master and now three execution nodes. Everything is installed in an NFS directory, /data_storage, which is located on the master. The messages file of calcuserver03 contains:
>> [root@ master calcuserver03]# cat messages
>> 11/13/2012 13:04:21|  main| calcuserver03|W|local configuration localhost.localdomain not defined - using global configuration
>> 11/13/2012 13:04:21|  main| calcuserver03|I|starting up OGS/GE 2011.11 (linux-x64)
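To pick out every queue instance whose load is not being reported, a small filter over `qstat -f` output can help on a larger pool. A sketch assuming the column layout shown in the `qstat` output above (queue instance, qtype, resv/used/tot, load_avg, arch, states), where real `qstat` output writes the queue instance as queue@host; `na_queues` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch (hypothetical helper): list queue instances reporting no load.
# Reads `qstat -f` style output on stdin; assumes the queue instance name
# (queue@host) is column 1 and load_avg is column 4.
na_queues() {
    awk '$1 ~ /@/ && $4 == "-NA-" { print $1 }'
}

# Usage on the qmaster host:
#   qstat -f | na_queues
```

Header lines and resource lines such as `hl:load_avg=...` contain no `@` in their first column, so they are skipped.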
> 
> This message is harmless. It looks like the exechost can contact the qmaster (to request the configuration), which is fine. But is the execd still running? Maybe it crashed during startup: is there any file "execd..." in /tmp? I suppose the `qhost` output shows similar information.
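The /tmp check suggested here can be scripted so that each leftover file shows its last entry. A minimal sketch, assuming (as the `execd_messages.4055` file in this thread suggests) that execd leaves startup errors in files named `execd_messages.<pid>`; `show_execd_tmp_messages` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch (hypothetical helper): show the last line of every execd_messages.*
# file in a directory (default /tmp), where execd can leave startup errors
# such as the "daemonize error" quoted in this thread.
show_execd_tmp_messages() {
    msg_dir=${1:-/tmp}
    for f in "$msg_dir"/execd_messages.*; do
        [ -f "$f" ] || continue    # the glob matched nothing
        printf '%s: ' "$f"
        tail -n 1 "$f"
    done
}

# Usage on the execution host:
#   show_execd_tmp_messages
```

The file name carries the PID of the daemon that wrote it, which can be compared with the PIDs of currently running `sge_execd` processes.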
> 
> 
>> The master and calcuserver01 run RHEL 5.8; calcuserver02 and calcuserver03 run RHEL 6.3. iptables is stopped on every server, and all hosts are listed in /etc/hosts.allow.
> 
> This is only necessary for applications that use tcp-wrappers, and only if some or all services are denied by default in /etc/hosts.deny.
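To see whether tcp-wrapper rules matter here at all, one can check whether the daemon binary is dynamically linked against libwrap and which rules /etc/hosts.deny actually contains. A sketch using `ldd` and `grep`; `active_rules` and `uses_libwrap` are hypothetical helper names, and a statically linked binary would not show libwrap in `ldd` output, so treat a negative result as a hint rather than proof.

```shell
#!/bin/sh
# Hypothetical helper: print the effective (non-comment, non-blank) rules
# of a tcp-wrapper file such as /etc/hosts.deny.
active_rules() {
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1"
}

# Hypothetical helper: succeed if the given binary is dynamically linked
# against libwrap, i.e. whether tcp-wrapper rules could affect it at all.
uses_libwrap() {
    ldd "$1" 2>/dev/null | grep -q 'libwrap'
}

# Usage on an execution host (path taken from the ps output in this thread):
#   active_rules /etc/hosts.deny
#   uses_libwrap /data_storage/HPC/ge2011.11/bin/linux-x64/sge_execd && echo "linked"
```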
> 
> -- Reuti
> 
>> Why can't the qmaster get the load_avg information of the new server? What further information do you need?
>> 
>> Regards
>>     Jochen
>> 
>> _______________________________________________
>> users mailing list
>> users at gridengine.org
>> https://gridengine.org/mailman/listinfo/users
> 
> 



