[gridengine users] SGE does not support multi cluster function?

Reuti reuti at staff.uni-marburg.de
Mon Jan 5 12:09:38 UTC 2015


Hi,

> On 05.01.2015, at 11:21, William Hay <w.hay at ucl.ac.uk> wrote:
> 
> On Fri, 26 Dec 2014 02:15:42 +0000
> Sangmin Park <dorimosiada at gmail.com> wrote:
> 
>> Hi,
>> 
>> I manage several HPC machines at my site.
>> Each machine consists of a master node and compute nodes.
>> Users can access the master node of each machine via the login node, but cannot access the compute nodes directly from the login node.
>> 
>> SGE is installed on each machine. On each machine all SGE commands work correctly, whereas on the login node they do not. When I type the 'qstat' command on the login node, it hangs forever without producing any output.

Somehow I never got this email.

Communication blocked by a firewall?


>> Of course, SGE is installed on the login node, too.

So from each login node you can reach all master nodes (i.e. to log in there)? But you want to issue the SGE commands directly on the login node(s) instead?

Was any setup done so that the login nodes know about the clusters?


>> Is there a method by which I can submit a job from the login node to an HPC machine?
>> Is it possible?

SGE doesn't support multi clustering out of the box. It can be achieved by several means, though it needs some configuration steps.

a) Do you have a central home across all clusters, or do you need some file staging to route the necessary input and output files to each particular cluster?

b) The SGE installed on the "login" node needs to know which cluster to address. This can be done by 1) mounting each $SGE_ROOT (or only $SGE_ROOT/default/common*) from each cluster and setting the local value of $SGE_ROOT to this mounted directory before each command; or 2) copying $SGE_ROOT/default/common* once to the login node and setting $SGE_ROOT in the same way.

c) Before you submit a job, you have to set $SGE_ROOT targeting the cluster you want to address (or use for `qstat`). A `qstat` wrapper could reset $SGE_ROOT several times and display the overall status of the clusters.
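
As a rough illustration of c), a `qstat` wrapper could look like the following Python sketch. The cluster names and paths in CLUSTERS are made up, the cell is assumed to be "default", and `qstat` is assumed to be on the PATH of the login node. A real setup would more likely source each cluster's $SGE_ROOT/default/common/settings.sh, which may set further variables (e.g. the qmaster port) that this sketch ignores.

```python
import os
import subprocess

# Hypothetical mapping of cluster name -> directory where that cluster's
# $SGE_ROOT (or at least its default/common part) is mounted or copied.
CLUSTERS = {
    "cluster_a": "/opt/sge/cluster_a",
    "cluster_b": "/opt/sge/cluster_b",
}


def env_for_cluster(sge_root, cell="default", base_env=None):
    """Return a copy of the environment that points at one cluster."""
    env = dict(os.environ if base_env is None else base_env)
    env["SGE_ROOT"] = sge_root
    env["SGE_CELL"] = cell  # replace "default" with the cell's name
    return env


def run_qstat(env, args=()):
    """Run qstat in the given environment and return what it printed."""
    # The timeout keeps the wrapper from waiting forever on an
    # unreachable qmaster; subprocess.run raises TimeoutExpired then.
    result = subprocess.run(
        ["qstat", *args], env=env, capture_output=True, text=True, timeout=30
    )
    return result.stdout


def overall_status(clusters=CLUSTERS, runner=run_qstat):
    """Query every cluster in turn and collect the output per cluster."""
    return {name: runner(env_for_cluster(root)) for name, root in clusters.items()}
```

Calling overall_status() and printing each entry under its cluster name gives the combined view. The same env_for_cluster() result could be passed to a `qsub` call to submit to one particular cluster.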

==

A more sophisticated setup would involve:

- a local SGE instance on the login node, i.e. you submit on the machine itself
- a load sensor, which will switch $SGE_ROOT between the clusters and report the load or free slots of each cluster in a complex of its own
- a starter method, which will forward the local scheduled jobs to one of the remote clusters
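
The load sensor part could be sketched like this in Python. It follows the usual load sensor protocol: read a line from stdin, stop on "quit", otherwise answer with a begin / host:complex:value / end block. The parsing assumes `qstat -g c` output with a header line that contains an AVAIL column. The complex names returned by `collect` are hypothetical and would have to be defined in the complex configuration (`qconf -mc`) first. How `collect` obtains the remote output (e.g. by running `qstat -g c` with $SGE_ROOT set for each cluster) is left open here.

```python
import sys


def free_slots(qstat_g_c_output):
    """Sum the AVAIL column of `qstat -g c` style output.

    Assumes a header line containing the column name AVAIL, followed by
    a dashed separator line and one line per cluster queue.
    """
    lines = [ln for ln in qstat_g_c_output.splitlines() if ln.strip()]
    if not lines or "AVAIL" not in lines[0].split():
        return 0
    header = lines[0].split()
    # Index from the right, because the "CLUSTER QUEUE" heading is two words.
    col = header.index("AVAIL") - len(header)
    total = 0
    for line in lines[1:]:
        fields = line.split()
        if line.startswith("-") or len(fields) < -col:
            continue  # separator or malformed line
        try:
            total += int(fields[col])
        except ValueError:
            continue  # non-numeric entry, skip it
    return total


def format_report(values, host="global"):
    """Build one load report: begin, host:complex:value lines, end."""
    body = [f"{host}:{name}:{value}" for name, value in values.items()]
    return "\n".join(["begin", *body, "end"]) + "\n"


def serve(collect, stdin=sys.stdin, stdout=sys.stdout):
    """Answer each request line from the execd until it sends "quit"."""
    for line in stdin:
        if line.strip() == "quit":
            break
        stdout.write(format_report(collect()))
        stdout.flush()
```

Here `collect` is any callable returning a dict such as {"slots_cluster_a": 14}. Reporting under the host name "global" is one possible choice, so that the value is not tied to a single exec host.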

There is an older Howto by Charu Chaubal, but it needs some adjustments to work with SGE 6 or later. I set it up once (including file staging to the remote cluster). But as it worked only with the particular applications we use, I never wrote a newer Howto for it (I used the job context to specify the type of computation and which files to forward or copy back).

http://arc.liv.ac.uk/SGE/howto/TransferQueues/transferqueues.html

-- Reuti

*) Replace "default" with each cell's name here.


> The qstat command should either work or let you know the node isn't an admin/submit node.
> 
> A simpler test might be to try the qping command to check basic connectivity 
> to the qmaster.  
> 
> It might be a routing issue.  Around here our login nodes are dual homed.  
> Possibly the login nodes are trying to access the qmaster via the external 
> interface for some reason.
> 
>> 
>> - Sangmin
>> 
> 
> 
> -- 
> William Hay <w.hay at ucl.ac.uk>
> _______________________________________________
> users mailing list
> users at gridengine.org
> https://gridengine.org/mailman/listinfo/users




