[gridengine users] Installing OGE on Rocks Login Node
Joseph Farran
jfarran at uci.edu
Thu May 10 19:47:40 UTC 2012
Thanks Mike & Rayson.
I will investigate this.
Joseph
On 05/09/2012 10:19 PM, Rayson Ho wrote:
> Thanks Mike for the answer. And BTW, we have a HOWTO related to this:
>
> http://gridscheduler.sourceforge.net/howto/multi_intrfcs.html
>
> Grid Engine is usually quite picky about name resolution. In the past, we
> have received a few reports related to multi-NIC servers, 127.0.0.1 &
> "localhost" resolution issues, and of course we started the work on
> IPv6 a while ago, so there are a few things in the commlib (aka
> Communication Library) that Grid Engine needs to enhance.
>
> The last time we changed something major in the commlib was in 2005, when Ron
> & I added poll(2) support for Linux & Solaris to support more than
> 1024 nodes (before that you could not have more than ~1000 nodes on a
> Linux qmaster, and the workaround was to change the system include
> file to raise a hard-coded system limit (FD_SETSIZE) when you compiled
> SGE, which most people did not want to do even if they knew the hack).
> poll(2) support was reviewed & enhanced by Christian Reissmann at Sun (now
> on the Oracle Grid Engine team; interestingly, I sent Christian an email a
> few days ago, and Andy & Christian are still at Oracle). In 2009,
> Ionel emailed the dev list wanting to add IPv6 support, and we (i.e.
> Ionel, Christian, and I) exchanged a few emails about the IPv6
> support. Basically we know the structure of the commlib, and we will
> get back to it, but for now, just use the method documented by Mike.
> When we are done with the higher priority things, we will fix
> non-critical issues that have known and clean workarounds.
>
> To us, if something works in other mission-critical systems like LSF
> but doesn't in Grid Engine, then it is a bug. Those are on the list of
> things that we will add to Open Grid Scheduler/Grid Engine eventually.
>
> Rayson
>
>
>
>
> On Wed, May 9, 2012 at 6:06 PM, Mike Hanby <mhanby at uab.edu> wrote:
>> I have no idea if this is the solution, but we had an issue with Rocks and the head node where the daemon wouldn't start properly because the private interface was on eth0. It would spit out a message similar to what you posted.
>>
>> The solution was to create the host_aliases file under default/common:
>>
>> echo "$(/bin/hostname -s).local $(/bin/hostname --fqdn) $(/bin/hostname -s)" > $SGE_ROOT/default/common/host_aliases
>>
>> Perhaps something similar needs to be done for the login node since it's multihomed.
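For the multihomed login node, a minimal sketch of the same idea, assuming that all names on one host_aliases line refer to the same host and that the first name is the one Grid Engine uses (here the private Rocks name first, mirroring the head-node line above). The function add_host_alias is hypothetical, not a Grid Engine tool. It appends with ">>" instead of overwriting with ">", so an existing head-node entry survives, and it skips the write if the line is already present. Check sge_host_aliases(5) for whether running daemons must be restarted to pick up the change.

```shell
#!/bin/sh
# add_host_alias: hypothetical helper, not part of Grid Engine.
# Appends "<private> <public> <short>" to the cell's host_aliases file.
# Assumption: the first field is the name Grid Engine should use; this
# mirrors the head-node line by putting the private Rocks name first.
# The body runs in a subshell, so "set -eu" and the variables stay local.
add_host_alias() (
    set -eu
    : "${SGE_ROOT:?SGE_ROOT must be set}"
    aliases_file="$SGE_ROOT/${SGE_CELL:-default}/common/host_aliases"

    private_name="$1"                  # e.g. login-1-1.local
    public_name="$2"                   # e.g. login-node.xxx.uci.edu
    short_name="${private_name%%.*}"   # strip everything after the first dot
    line="$private_name $public_name $short_name"

    # Append (>>) rather than overwrite (>), so existing entries are kept;
    # write nothing if an identical line is already in the file.
    touch "$aliases_file"
    if ! grep -qxF "$line" "$aliases_file"; then
        printf '%s\n' "$line" >> "$aliases_file"
    fi
)

# Usage (names taken from the thread; "xxx" left as in the original):
# add_host_alias login-1-1.local login-node.xxx.uci.edu
```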
>>
>> -----Original Message-----
>> From: users-bounces at gridengine.org [mailto:users-bounces at gridengine.org] On Behalf Of Joseph Farran
>> Sent: Wednesday, May 09, 2012 4:10 PM
>> To: users at gridengine.org Users
>> Subject: [gridengine users] Installing OGE on Rocks Login Node
>>
>> Hello.
>>
>> I have a cluster running Rocks 5.4.3 that I originally set up with Torque/Maui. I am testing Open Grid Scheduler using the ge2011.11.tar distribution.
>>
>> I set up OGE on the master head node and was also able to set up 6 compute nodes using "start_gui_installer" on the head node. All 6 compute nodes were set up without any issues.
>>
>> Everything works except the login node, which I cannot set up. The login node has both private & public network interfaces. I want to set up our login node "login-node.xxx.uci.edu" as an execution and submit host.
>>
>> When I try to set up our Rocks login node using the private name of login-1-1, it complains with:
>>
>> The error message was:
>> error: commlib error: access denied (client IP resolved to host name "login-1-1.local". This is not identical to clients host name "login-node.xxx.uci.edu")
>> ERROR: unable to contact qmaster using port 6444 on host "headnode.local"
>>
>> So then I try installing OGE using the public name of "login-node.xxx.uci.edu" and it also complains. As soon as I enter "login-node.xxx.uci.edu" the state column turns red with "Resolvable" and the "Install" GUI button is greyed out so I cannot continue.
>>
>> Looks like OGE is confused about the actual fully qualified name of our login node. The FQN is "login-node.xxx.uci.edu" but neither name seems to work.
>>
>> What is the correct way to get around this?
>>
>> Joseph
>> _______________________________________________
>> users mailing list
>> users at gridengine.org
>> https://gridengine.org/mailman/listinfo/users
>>
>
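As the commlib error in Joseph's message suggests, the qmaster resolves the client's IP address to a name and compares it with the name the client reports. A minimal diagnostic sketch for seeing what the head node's resolver returns, assuming a Linux host with getent(1). The function check_reverse_name is hypothetical, the login node's private IP is not given in the thread, and the check does not apply host_aliases, which Grid Engine does.

```shell
#!/bin/sh
# check_reverse_name: hypothetical diagnostic, not part of Grid Engine.
# Prints the name an address reverse-resolves to via NSS and compares it
# with an expected name. Returns 0 on a match, 1 on a mismatch, and 2 if
# no name was found (or getent is unavailable).
# Note: $SGE_ROOT/<cell>/common/host_aliases is NOT consulted here.
check_reverse_name() {
    addr="$1"       # e.g. the login node's private IP (not given in the thread)
    expected="$2"   # e.g. login-node.xxx.uci.edu
    # getent prints "ADDRESS CANONICAL_NAME [ALIASES...]"; keep the canonical name.
    resolved=$(getent hosts "$addr" 2>/dev/null | awk 'NR == 1 { print $2 }')
    if [ -z "$resolved" ]; then
        echo "$addr: no name found"
        return 2
    fi
    if [ "$resolved" = "$expected" ]; then
        echo "$addr resolves to $resolved (matches)"
        return 0
    fi
    echo "$addr resolves to $resolved, not $expected"
    return 1
}

# Usage on the head node (the address is a placeholder):
# check_reverse_name <login-node-private-ip> login-node.xxx.uci.edu
```

A mismatch here, such as the private name coming back where the public one was expected, is the situation a host_aliases entry is meant to cover.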