[gridengine users] OGE on Mac OS X - head node with 'E' status but qmaster and rest of cluster work fine

Drew Kitchen drupiter at gmail.com
Mon Nov 12 13:48:26 UTC 2012


Dear List,

I've installed OGE on a mini-cluster of iMacs running OS X 10.6.8, and it
seems to be working, but with one semi-major glitch. (Why iMacs, you ask?
Well, they are what I inherited from a guy who moved his lab: 5 iMacs and
various other boxes.)

I compiled the OGE source locally, and that went great after I tweaked it
to find darwin-x64 and whatnot. Installation also went smoothly, following
the wonderful install vids that have been posted for GE on Mac OS X. I have
qmaster running on dhcp80fff96b, with three execution hosts (dhcp80fff96b,
dhcp80fff9b6, and dhcp80fff9d0) and an NFS share between them (where GE
resides). Passwordless ssh is enabled for the GE owner, so the boxes should
be able to communicate.

So, this is where the problems arise: in all.q, the execution host on the
master node running qmaster throws an E status.

<cut>
dhcp80fff96b:~ akitchen$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@dhcp80fff96b.state.edu   BIP   0/0/2          0.02     darwin-x64    E
---------------------------------------------------------------------------------
all.q@dhcp80fff9b6.state.edu   BIP   0/0/2          0.00     darwin-x64
---------------------------------------------------------------------------------
all.q@dhcp80fff9d0.state.edu   BIP   0/0/2          0.00     darwin-x64
<cut>

I can submit jobs and they are successfully farmed out to the other two
execution hosts, so apart from the E state everything seems fine and dandy.
Meanwhile, the execution daemon on the master node is up and answers qping:

<cut>
dhcp80fff96b:~ akitchen$ qping dhcp80fff96b.state.edu 6445 execd 1
11/09/2012 17:08:25 endpoint dhcp80fff96b.state.edu/execd/1 at port 6445 is up since 89828 seconds
<cut>

I've tried just about everything (even rebooting the master node), and
nothing seems to solve this. I've looked in the spool messages to
troubleshoot, and I get a cryptic "commlib error".

<cut>
11/07/2012 15:27:47|  main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
11/08/2012 10:43:00|  main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
11/08/2012 10:43:02|  main|dhcp80fff96b|E|commlib error: got read error (closing "dhcp80fff96b.state.edu/qmaster/1")
11/08/2012 10:43:03|  main|dhcp80fff96b|W|can't register at qmaster "dhcp80fff96b.state.edu": abort qmaster registration due to communication errors
11/08/2012 10:43:03|  main|dhcp80fff96b|E|commlib error: can't connect to service (Connection refused)
11/08/2012 10:43:35|  main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
11/08/2012 10:52:45|  main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
11/08/2012 12:31:14|  main|dhcp80fff96b|I|controlled shutdown 2011.11p1
11/08/2012 12:31:14|  main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
<cut>
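
In case anyone wants to look at only the warnings and errors in a messages
file like the one above, here is a minimal sketch. It assumes each entry is a
single line with pipe-delimited fields (timestamp, thread, host, level, text),
as in the excerpt. The function name filter_messages is made up for
illustration, and the path to the messages file depends on the install.

```shell
# Print only warning (W) and error (E) entries from a Grid Engine messages file.
# Assumption: fields are separated by '|' and the 4th field is the level
# (I, W or E), matching the excerpt above.
filter_messages() {
    awk -F'|' '$4 == "E" || $4 == "W"' "$1"
}
```

Calling filter_messages with the path to the execd messages file as its only
argument prints the matching lines unchanged, so the timestamps can be
compared against daemon restarts.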

Otherwise, everything seems to be running fine. I've scrounged around and
found a couple of Mac Minis that I'd like to add to the mini-cluster, but
I'd rather figure this out before adding them (and maybe shifting qmaster
to one of them).

Any help would be greatly appreciated!

Cheers and best,
Drew

P.S. Here is some more info for anyone curious....


dhcp80fff96b:~ akitchen$ hostname
dhcp80fff96b.state.edu

dhcp80fff96b:~ akitchen$ /GridEngine/utilbin/darwin-x64/./gethostname
Hostname: dhcp80fff96b.state.edu
Aliases:  ANTH-M014 dhcp80fff96b
Host Address(es): XXX.XXX.XXX.107

dhcp80fff96b:~ akitchen$ /GridEngine/utilbin/darwin-x64/./gethostbyaddr XXX.XXX.XXX.107
Hostname: dhcp80fff96b.state.edu
Aliases:  ANTH-M014 dhcp80fff96b
Host Address(es): XXX.XXX.XXX.107

dhcp80fff96b:~ akitchen$ /GridEngine/utilbin/darwin-x64/./gethostbyname dhcp80fff96b.state.edu
Hostname: dhcp80fff96b.state.edu
Aliases:  ANTH-M014 dhcp80fff96b
Host Address(es): XXX.XXX.XXX.107

dhcp80fff96b:~ akitchen$ cat /etc/hosts
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting.  Do not change this entry.
##
127.0.0.1    localhost
255.255.255.255    broadcasthost
::1             localhost
fe80::1%lo0    localhost
XXX.XXX.XXX.107 dhcp80fff96b.state.edu ANTH-M014 dhcp80fff96b
XXX.XXX.XXX.182 dhcp80fff9b6.state.edu ANTH-M036 dhcp80fff9b6
XXX.XXX.XXX.208 dhcp80fff9d0.state.edu ANTH-M013 dhcp80fff9d0

dhcp80fff96b:~ akitchen$ qconf -shgrp @allhosts
group_name @allhosts
hostlist dhcp80fff96b.state.edu dhcp80fff9d0.state.edu \
          dhcp80fff9b6.state.edu

dhcp80fff96b:~ akitchen$ qconf -sel
dhcp80fff96b.state.edu
dhcp80fff9b6.state.edu
dhcp80fff9d0.state.edu

dhcp80fff96b:~ akitchen$ qconf -ss
dhcp80fff96b.state.edu

dhcp80fff96b:~ akitchen$ qconf -sh
dhcp80fff96b.state.edu
dhcp80fff9b6.state.edu
dhcp80fff9d0.state.edu

