[gridengine users] Transitioning from Torque/Maui to Open Grid Scheduler

Reuti reuti at staff.uni-marburg.de
Sun Apr 22 19:43:45 UTC 2012


On Apr 21, 2012, at 20:53, Joseph A. Farran wrote:

> Hi Rayson & Ron.
> 
> Thank you both for responding.
> 
> We do a lot of parallel runs with our cluster.   Here is more info on what we currently have, and I will keep this example down to 3 queues and 6 nodes for simplicity.
> 
> With our current Torque setup, I have 6 64-core nodes.   3 nodes belong to the math group, 3 nodes to the bio group.   We set up our Queues as: 1 Queue being a Preemptee, 2 being Preemptors.
> 
> When I create an account, the account is set up to belong to the 'math' group or to the 'bio' group.
> 
> Our current nodes and Queues are as follows:
> 
>    3 nodes have the properties "math", "free" and "64" cores.
>    3 nodes have the properties "bio", "free" and  "64" cores.
> 
>    The "math" Queue looks for nodes with "math" properties and run jobs only on "math" nodes.   Math Q is Preemptor.
>    The "bio" Queue looks for nodes with "bio" properties and runs jobs only on the "bio" nodes.   Bio Q is Preemptor.
>    The "free" Queue looks for nodes with "free" properties and runs jobs on any node BUT only as a Preemptee job.

By default you specify resource requests and SGE will select an appropriate queue for your job, as Rayson laid out.
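
For example, a job submitted without naming any queue at all (a sketch; the "smp" parallel environment is an assumption and would need to be configured, mem_free is a stock load value complex):

$ qsub -l mem_free=2G -pe smp 16 job.sh

The scheduler then picks whichever queue instance offers 16 slots in the "smp" PE and 2 GB of free memory.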

For your setup I suggest (a qconf sketch follows the list):

- define one ACL "math" with its members
- define one ACL "bio" with its members
- define one hostgroup "@math" for the math machines
- define one hostgroup "@bio" for the bio machines
- then you can limit access to certain nodes for a group either:
--> on a queue-instance level
--> with an RQS (resource quota set)
(--> on a host level too, but that won't work in your setup because of the preemptee queue; listed just to be complete)
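
The corresponding qconf calls could look like the following sketch (user and node names are placeholders; `qconf -au` creates an ACL on the fly if it does not exist yet, and `qconf -ahgrp` opens an editor for the new hostgroup):

$ qconf -au alice,bob math     # ACL "math" with its members
$ qconf -au carol,dave bio     # ACL "bio" with its members
$ qconf -ahgrp @math           # in the editor: hostlist node01 node02 node03
$ qconf -ahgrp @bio            # in the editor: hostlist node04 node05 node06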

Let's go with the queue-instance level:

$ qconf -sq normal.q
hostlist  @math,@bio
...
user_lists NONE,[@math=math],[@bio=bio]

where

@math = hostgroup math
math = ACL with math users
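
You can verify the definitions with the matching show options:

$ qconf -shgrpl        # list all hostgroups
$ qconf -shgrp @math   # hosts in @math
$ qconf -sul           # list all ACLs (usersets)
$ qconf -su math       # members of ACL "math"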


For the second queue you can trigger the preemption either on a slotwise level or, as soon as one slot on the node in question is used by the owning group, for the whole node. Note that the subordination is defined in the superior queue, i.e. normal.q lists free.q:

$ qconf -sq normal.q
...
subordinate_list free.q=1

$ qconf -sq free.q
...
user_lists NONE,[@math=bio],[@bio=math]

(The user_lists line assumes no one wants to submit to his own machines in the preempt queue; otherwise leave it out.)
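
The slotwise variant is configured in normal.q as well; a sketch, assuming Grid Engine 6.2u5 or later (the threshold 64 matches your cores per node, "sr" suspends the shortest-running free.q job first):

$ qconf -sq normal.q
...
subordinate_list slots=64(free.q:0:sr)

With this, a free.q job is suspended only when normal.q and free.q together would use more than 64 slots on a host, so free jobs are preempted one by one instead of the whole node at once.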

Although you could submit jobs to either queue by specifying "-q normal.q" or "-q free.q", I suggest creating a boolean complex with the FORCED requestable attribute and attaching it only to free.q.
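
A sketch of the complex definition, added as one line via `qconf -mc` (column layout as in complex(5)):

#name  shortcut  type  relop  requestable  consumable  default  urgency
free   free      BOOL  ==     FORCED       NO          0        0

Then attach it to the queue: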

$ qconf -sq free.q
...
complex_values free=TRUE

The advantage is that normal jobs can be submitted with a plain `qsub job.sh` and will never end up in free.q. Jobs which are to run on the borrowed nodes then need to request this complex: `qsub -l free job.sh`.
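
The interactive counterpart of Torque's `qsub -I` is qrsh, which accepts the same resource requests (the "smp" PE is again an assumption):

$ qrsh -pe smp 32              # interactive job on the group's own nodes
$ qrsh -l free -pe smp 32      # interactive job in free.q on borrowed nodes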

-- Reuti

NB: Suspended jobs will still occupy memory and other requested resources on the node.


> The idea here is that the free Q allows everyone to use the "free" nodes as long as the owners (math or bio) are not using them.  The free Q is set up as a Preemptee Q, and the math & bio Q's are set up as Preemptor Q's.
> 
> When the math users submit a job to the math Q, any free job running on the math nodes gets suspended.
> 
> When the bio users submit a job to the bio Q, any free job on the bio nodes also gets suspended.
> 
> Suspended jobs automatically resume when the node owners are done using their nodes (no owner jobs left on the node).
> 
> With Torque, math users can request from 1 to 3 math nodes and from 1 to 64 cores on each node.   For example, a math user can request 2 math nodes at 32 cores each in interactive mode with:
> 
>    qsub -I -q math -l nodes=2:ppn=32
> 
> If the user does not belong to the 'math' group, they are prevented from running on the math Q.   Same for the bio users.
> 
> I will stop here as I have more requirements, but this is the main set of functions I am looking for in OGE.
> 
> Thank you again for your generous efforts in helping.
> 
> Joseph
> 
> 
> On 4/20/2012 9:01 PM, Rayson Ho wrote:
>> Hi Joseph,
>> 
>> "Queues" in Grid Engine (and Open Grid Scheduler/Grid Engine) and the
>> ones Torque/Maui have slightly different meaning.
>> 
>> In Grid Engine, jobs are not submitted to "queues"; rather, jobs
>> are submitted to a global waiting area. The scheduler then picks
>> "queue instances" (queue instances roughly = hosts, yet each host can
>> have more than 1 queue instance) that satisfy the resource
>> requirements of each job, and at that point the jobs are bound to the
>> queues.
>> 
>> We also have global queues called "cluster queues", but they are
>> abstractions of the queue instances.
>> 
>> So what does that all mean??
>> 
>> In LSF or Torque, some clusters have debug queues, short queues, long
>> queues, etc. Those can be migrated to Grid Engine cluster queues with
>> some work (i.e. relatively easily).
>> 
>> If you want queue-level user-based fairshare or queue-based fairshare
>> as in LSF (e.g. users in each queue get a different priority) - I have
>> not looked at Maui for a while, so I am not sure if it has this
>> feature - then it can be harder to implement or model in Grid Engine.
>> 
>> If you let us know a bit more about your setup, then we can provide
>> further help.
>> 
>> Rayson
>> 
>> 
>> 
>> On Fri, Apr 20, 2012 at 11:42 PM, Joseph A. Farran<jfarran at uci.edu>  wrote:
>>> Hi All.
>>> 
>>> I am a long-time Torque/Maui admin running an HPC cluster looking to
>>> transition to Open Grid Engine.   I am a newbie with OGE, however.
>>> 
>>> Are there any links and/or helpful tips on moving to OGE from an admin point
>>> of view?   How do I convert Torque qmgr queues, nodes, and resource limits
>>> to the equivalent in OGE?
>>> 
>>> Thanks,
>>> Joseph




