[gridengine users] OGE on Mac OS X - head node with 'E' status but qmaster and rest of cluster work fine

Reuti reuti at staff.uni-marburg.de
Mon Nov 12 20:42:29 UTC 2012


On 12.11.2012 at 21:31, Drew Kitchen wrote:

>>> Dear List,
>>> 
>>> I've installed OGE on a mini-cluster of iMacs running OS X 10.6.8, and it seems to be
>>> working but with one semi-major glitch. (Why iMacs, you ask...well, they are what I
>>> inherited from a guy that moved his lab...5 iMacs and various other boxes.)
>>> 
>>> I compiled the OGE source locally, and that went great after I tweaked it to find
>>> darwin-x64 and whatnot. Installation went great, following the wonderful install vids
>>> that have been posted for GE on Mac OS X. I have qmaster running on dhcp80fff96b, with
>>> three execution hosts (dhcp80fff96b, dhcp80fff9b6, and dhcp80fff9d0), and an NFS share
>>> between them (where GE resides). Passwordless ssh is enabled for the GE owner, so the
>>> boxes should be able to communicate.
>> This shouldn't be necessary for the operation of OGE. It *might* be needed for the installation only, and even that can be done without it by installing locally on each host.
> 
> Thanks. I was thinking of MPI jobs and communicating between nodes.
> 
>>> So, this is where the problems arise: in all.q, the execution host on the master node
>>> running qmaster throws an E status.
>>> 
>>> <cut>
>>> dhcp80fff96b:~ akitchen$ qstat -f
>>> queuename                      qtype resv/used/tot. load_avg arch          states
>>> ---------------------------------------------------------------------------------
>>> all.q@dhcp80fff96b.state.edu   BIP   0/0/2 0.02     darwin-x64    E

NB: Does the error reappear after you reset it with `qmod -cq all.q@dhcp80fff96b`?

-- Reuti


>>> ---------------------------------------------------------------------------------
>>> all.q@dhcp80fff9b6.state.edu   BIP   0/0/2 0.00     darwin-x64
>>> ---------------------------------------------------------------------------------
>>> all.q@dhcp80fff9d0.state.edu   BIP   0/0/2 0.00     darwin-x64
>>> <cut>
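
To go with the `qmod -cq` question above, here is a minimal sketch for picking out the queue instances that `qstat -f` reports in the E state. The function name `queues_in_error` is made up for illustration, and it assumes the six-column layout of the output above (queuename, qtype, resv/used/tot., load_avg, arch, states).

```shell
# Hypothetical helper: list queue instances whose "states" column contains E.
# Reads `qstat -f` text on stdin and prints one queue instance name per line.
queues_in_error() {
    awk '
        /^-+$/            { next }   # skip separator lines
        $1 == "queuename" { next }   # skip the header line
        # A states column is only present when there are at least 6 fields;
        # queue instance names contain "@", which excludes job lines.
        NF >= 6 && $1 ~ /@/ && $NF ~ /E/ { print $1 }
    '
}
```

If the list looks right, each entry could then be passed to `qmod -cq`, e.g. `qstat -f | queues_in_error | while read -r q; do qmod -cq "$q"; done`. `qstat -f -explain E` should also print the reason the queue went into the error state, which is worth reading before clearing it.
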
>>> 
>>> I can submit jobs and they will be successfully farmed out to the external execution
>>> hosts, so it would seem that everything is fine and dandy. Meanwhile, the execution
>>> daemon is working on the master node.
>>> 
>>> <cut>
>>> dhcp80fff96b:~ akitchen$ qping dhcp80fff96b.state.edu 6445 execd 1
>>> 11/09/2012 17:08:25 endpoint dhcp80fff96b.state.edu/execd/1 at port 6445 is up since 89828 seconds
>>> <cut>
>>> 
>>> I've tried just about everything (even rebooting the master node), and nothing seems to
>>> solve this. I've looked in the spool messages to troubleshoot, and I get a cryptic
>>> "commlib error".
>>> 
>>> <cut>
>>> 11/07/2012 15:27:47|  main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
>>> 11/08/2012 10:43:00|  main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
>>> 11/08/2012 10:43:02|  main|dhcp80fff96b|E|commlib error: got read error (closing "dhcp80fff96b.state.edu/qmaster/1")
>>> 11/08/2012 10:43:03|  main|dhcp80fff96b|W|can't register at qmaster "dhcp80fff96b.state.edu": abort qmaster registration due to communication errors
>>> 11/08/2012 10:43:03|  main|dhcp80fff96b|E|commlib error: can't connect to service (Connection refused)
>> Are ports 6444 and 6445 open in the firewalls?
>> 
>> Do all machines always get the same address?
>> 
>> -- Reuti
> 
> Yes, all machines have stable IPs and they return the same address when queried. All local firewalls are disabled (the machines sit behind the uni's firewall), so that shouldn't be a problem. All machines also have 6444/6445 reserved for qmaster/execd, respectively.
> 
> Thanks for the help!
> 
> Cheers,
> Drew
> 
>>> 11/08/2012 10:43:35|  main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
>>> 11/08/2012 10:52:45|  main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
>>> 11/08/2012 12:31:14|  main|dhcp80fff96b|I|controlled shutdown 2011.11p1
>>> 11/08/2012 12:31:14|  main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
>>> <cut>
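
Since the messages file mixes informational start-up lines with the actual failures, a small filter can help when reading it. This is only a sketch: `messages_problems` is a hypothetical helper name, and it assumes the `|`-separated layout seen in the excerpt above, with the severity letter (I, W, E) in the fourth field.

```shell
# Hypothetical helper: keep only warning (W) and error (E) entries
# from a Grid Engine messages file read on stdin.
# Assumed layout: timestamp|thread|host|severity|message
messages_problems() {
    awk -F'|' '$4 == "E" || $4 == "W"'
}
```

It reads from stdin, so it can be fed the execd messages file from wherever the spool directory lives on your installation.
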
>>> 
>>> Otherwise, everything seems to be running fine. I've scrounged around and found a couple
>>> Mac Minis that I'd like to add to the mini-cluster, but I'd rather figure this out
>>> before adding them (and maybe shifting qmaster to one of them).
>>> 
>>> Any help would be greatly appreciated!
>>> 
>>> Cheers and best,
>>> Drew
>>> 
>>> P.S. Here is some more info for anyone curious....
>>> 
>>> 
>>> dhcp80fff96b:~ akitchen$ hostname
>>> dhcp80fff96b.state.edu
>>> 
>>> dhcp80fff96b:~ akitchen$ /GridEngine/utilbin/darwin-x64/./gethostname
>>> Hostname: dhcp80fff96b.state.edu
>>> Aliases:  ANTH-M014 dhcp80fff96b
>>> Host Address(es): XXX.XXX.XXX.107
>>> 
>>> dhcp80fff96b:~ akitchen$ /GridEngine/utilbin/darwin-x64/./gethostbyaddr XXX.XXX.XXX.107
>>> Hostname: dhcp80fff96b.state.edu
>>> Aliases:  ANTH-M014 dhcp80fff96b
>>> Host Address(es): XXX.XXX.XXX.107
>>> 
>>> dhcp80fff96b:~ akitchen$ /GridEngine/utilbin/darwin-x64/./gethostbyname dhcp80fff96b.state.edu
>>> Hostname: dhcp80fff96b.state.edu
>>> Aliases:  ANTH-M014 dhcp80fff96b
>>> Host Address(es): XXX.XXX.XXX.107
>>> 
>>> dhcp80fff96b:~ akitchen$ cat /etc/hosts
>>> ##
>>> # Host Database
>>> #
>>> # localhost is used to configure the loopback interface
>>> # when the system is booting.  Do not change this entry.
>>> ##
>>> 127.0.0.1    localhost
>>> 255.255.255.255    broadcasthost
>>> ::1             localhost
>>> fe80::1%lo0    localhost
>>> XXX.XXX.XXX.107 dhcp80fff96b.state.edu ANTH-M014 dhcp80fff96b
>>> XXX.XXX.XXX.182 dhcp80fff9b6.state.edu ANTH-M036 dhcp80fff9b6
>>> XXX.XXX.XXX.208 dhcp80fff9d0.state.edu ANTH-M013 dhcp80fff9d0
>>> 
>>> dhcp80fff96b:~ akitchen$ qconf -shgrp @allhosts
>>> group_name @allhosts
>>> hostlist dhcp80fff96b.state.edu dhcp80fff9d0.state.edu \
>>>         dhcp80fff9b6.state.edu
>>> 
>>> dhcp80fff96b:~ akitchen$ qconf -sel
>>> dhcp80fff96b.state.edu
>>> dhcp80fff9b6.state.edu
>>> dhcp80fff9d0.state.edu
>>> 
>>> dhcp80fff96b:~ akitchen$ qconf -ss
>>> dhcp80fff96b.state.edu
>>> 
>>> dhcp80fff96b:~ akitchen$ qconf -sh
>>> dhcp80fff96b.state.edu
>>> dhcp80fff9b6.state.edu
>>> dhcp80fff9d0.state.edu