[gridengine users] New Execution Host: load_avg = -NA-

RATH Jochen (AREVA) jochen.rath at areva.com
Tue Nov 13 13:25:08 UTC 2012

Hi Reuti

Thanks for your reply.
When I look for SGE with ps, it is still running:
[jrath@ calcuserver03 tmp]$ ps aux | grep sge
rsmadmin  4203  0.0  0.0 161944  1976 ?        Sl   13:04   0:01 /data_storage/HPC/ge2011.11/bin/linux-x64/sge_execd

In /tmp I find only two execd messages, which are from my first attempt, when I tried to uninstall and reinstall SGE:
[jrath@ calcuserver03 tmp]$ cat execd_messages.4055
11/13/2012 12:39:10|  main| calcuserver03|W|daemonize error: child exited before sending daemonize state


-----Original Message-----
From: Reuti [mailto:reuti at Staff.Uni-Marburg.DE]
Sent: Tuesday, 13 November 2012 14:10
To: RATH Jochen (AREVA Wind GmbH)
Cc: users at gridengine.org
Subject: Re: [gridengine users] New Execution Host: load_avg = -NA-


On 13.11.2012 at 13:26, RATH Jochen (AREVA) wrote:

> I have added a new execution host to my existing OGE pool. Unfortunately I can't start jobs, because the load average is not reported to the qmaster host:
> [root@ master ge2011.11]# qstat -F la
> queuename                      qtype resv/used/tot. load_avg arch          states
> ---------------------------------------------------------------------------------
> all.q@calcuserver03.edom.ad.corp BIP   0/0/32         -NA-     -NA-          a
> ---------------------------------------------------------------------------------
> all.q@calcuserver02.edom.ad.corp BIP   0/2/12         10.15    linux-x64
>        hl:load_avg=10.150000
> ---------------------------------------------------------------------------------
> all.q@calcuserver01.edom.ad.corp BIP   0/0/12         0.00     linux-x64
>        hl:load_avg=0.000000
> My grid consists of one master and now three execution nodes. Everything is installed in an NFS directory, /data_storage, which is hosted on the master. The messages file for calcuserver03 contains:
> [root@ master calcuserver03]# cat messages
> 11/13/2012 13:04:21|  main| calcuserver03|W|local configuration localhost.localdomain not defined - using global configuration
> 11/13/2012 13:04:21|  main| calcuserver03|I|starting up OGS/GE 2011.11 (linux-x64)

This message is harmless. It looks like the exechost can contact the qmaster (to request the configuration), which is fine. But is the execd still running? Maybe it crashed during startup; is there any file named "execd..." in /tmp? I suppose the `qhost` output shows similar information.
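
The checks suggested above can be sketched as a small script to run on the execution host. This is only an illustration under a few assumptions: the function names `execd_running` and `startup_logs` are made up here, the file pattern `execd_messages.*` is taken from the file name that shows up in /tmp elsewhere in this thread, and `qhost` is only called if it happens to be on the PATH.

```shell
#!/bin/sh
# Sketch of the checks suggested above; intended to run on the execution host.
# Function names are hypothetical, chosen for this example only.

# Succeeds if an sge_execd process is alive.
# The [s] bracket keeps grep from matching its own command line.
execd_running() {
    ps aux | grep '[s]ge_execd' >/dev/null
}

# Prints every execd_messages.* file found in the given directory
# (default: /tmp), one path per line. Prints nothing if there is none.
startup_logs() {
    dir=${1:-/tmp}
    for f in "$dir"/execd_messages.*; do
        # An unmatched glob stays literal, so test that the file exists
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

if execd_running; then
    echo "sge_execd is running"
else
    echo "sge_execd is NOT running"
fi

# Show the content of any startup logs left behind in /tmp
for f in $(startup_logs /tmp); do
    echo "== $f"
    cat "$f" || echo "cannot read $f"
done

# qhost ships with Grid Engine; skip it when it is not on the PATH
if command -v qhost >/dev/null 2>&1; then
    qhost || echo "qhost failed"
fi
```

If the process is alive but `qhost` still shows no load values for the host, the startup logs in /tmp are the next place to look.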

> RHEL 5.8 runs on the master and calcuserver01, and RHEL 6.3 runs on calcuserver02 and calcuserver03. iptables is stopped on every server, and all hosts are listed in /etc/hosts.allow.

This is only necessary for applications that use the TCP wrapper, and only if certain or all services are denied by default in /etc/hosts.deny.

-- Reuti

> Why can't the qmaster get the load_avg of the new server? What further information do you need?
> Regards
>      Jochen
