[gridengine users] Transitioning from Torque/Maui to Open Grid Scheduler
reuti at staff.uni-marburg.de
Sun Apr 22 19:43:45 UTC 2012
Am 21.04.2012 um 20:53 schrieb Joseph A. Farran:
> Hi Rayson & Ron.
> Thank you both for responding.
> We do a lot of parallel runs with our cluster. Here is more info on what we currently have and I will keep this example down to 3 queues and 6 nodes for simplicity.
> With our current Torque setup, I have 6 64-core nodes. 3 nodes belong to the math group, 3 nodes to the bio group. We setup our Queues as: 1 Queue being Preemptee, 2 being Preemptors.
> When I create an account, the account is setup to belong to the 'math' group, or to the 'bio' group.
> Our current nodes and Queues are as follows:
> 3 nodes have the properties "math", "free" and "64" cores.
> 3 nodes have the properties "bio", "free" and "64" cores.
> The "math" Queue looks for nodes with "math" properties and run jobs only on "math" nodes. Math Q is Preemptor.
> The "bio" Queue looks for nodes with "bio" properties and runs jobs only on the "bio" nodes. Bio Q is Preemptor.
> The "free" Queue looks for nodes with "free" properties and runs jobs on any node BUT only as a Preemptee job.
By default you specify resource requests and SGE will select an appropriate queue for your job, as Rayson laid out.
For your setup I suggest:
- define one ACL for "math" with their members
- define one ACL for "bio" with their members
- define one hostgroup for "@math" machines
- define one hostgroup for "@bio" machines
- then you can limit the access to certain nodes for a group either:
--> on a queue-instance level
--> with an RQS
(--> on a host level, but not in your setup due to the preempt queue, just to be complete)
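As a sketch of the first four steps (user names and host names below are made up; substitute your own), the ACLs and hostgroups could be created like this:

```shell
# Create the two ACLs (usersets); "qconf -au" adds users to an access
# list and creates the list if it does not exist yet:
qconf -au alice,bob math
qconf -au carol,dave bio

# Create the two hostgroups non-interactively from files
# (the interactive variant would be "qconf -ahgrp @math"):
cat > /tmp/math.hgrp <<EOF
group_name @math
hostlist node01 node02 node03
EOF
qconf -Ahgrp /tmp/math.hgrp

cat > /tmp/bio.hgrp <<EOF
group_name @bio
hostlist node04 node05 node06
EOF
qconf -Ahgrp /tmp/bio.hgrp
```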
Let's go with the queue-instance:
$ qconf -sq normal.q
hostlist @math,@bio
@math = hostgroup math
math = ACL with math users
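Sketched in full, the relevant lines of normal.q could then look like this (hostgroup and ACL names taken from above; the bracket syntax overrides a setting per queue instance, so each group is limited to its own nodes):

```
qname      normal.q
hostlist   @math,@bio
user_lists NONE,[@math=math],[@bio=bio]
```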
For the second queue you can set up preemption either on a slotwise level, or so that all preemptee jobs on the node in question are suspended as soon as one slot is used by the owning group:
$ qconf -sq free.q
user_lists NONE,[@math=bio],[@bio=math] (assuming no one wants to submit to his own machines in the preempt queue; otherwise leave it out)
Although you could submit jobs to either queue by specifying "-q normal.q" or "-q free.q", I suggest creating a boolean complex with the FORCED attribute and attaching it only to free.q:
$ qconf -sq free.q
The advantage is that normal jobs can be submitted with a plain `qsub job.sh` and won't end up in free.q. Jobs that should run on the voluntary nodes then have to request this complex: `qsub -l free job.sh`.
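A sketch of what that could look like: one line added to the complex configuration (via `qconf -mc`), plus the attachment in free.q (the name "free" for the complex is just an example):

```
# Complex configuration (qconf -mc) - one added line:
#name  shortcut  type  relop  requestable  consumable  default  urgency
free   free      BOOL  ==     FORCED       NO          FALSE    0

# In free.q (qconf -mq free.q):
complex_values free=TRUE
```

With "requestable FORCED", only jobs that explicitly request `-l free` are eligible for queue instances carrying this complex.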
NB: Suspended jobs will still use memory or other requested resources.
> The idea here is that the free Q allows everyone to use the "free" nodes as long as the owners (math or bio) are not using them. The free Q is setup as a Preemptee Q, the math & bio Q's are setup as Preemptor Q's.
> When the math users submit a job to the math Q, any free job running on the math nodes get suspended.
> When the bio users submit a job to the bio Q, any free job on the bio nodes also get suspended.
> Suspended jobs automatically resume when the node owners are done using their nodes (no jobs on node).
> With Torque, math users can request from 1 to 3 math nodes and from 1-64 cores on each node. For example, a math user can request 2 math nodes at 32 cores each in interactive mode with:
> qsub -I -q math nodes=2:ppn=32
> If the user does not belong to the 'math' group, they are prevented from running on the math Q. Same for the bio users.
> I will stop here as I have more requirements, but this is the main set of functions I am looking for in OGE.
> Thank you again for your generous efforts in helping.
> On 4/20/2012 9:01 PM, Rayson Ho wrote:
>> Hi Joseph,
>> "Queues" in Grid Engine (and Open Grid Scheduler/Grid Engine) and the
>> ones in Torque/Maui have slightly different meanings.
>> In Grid Engine, jobs are not submitted to "queues"; rather, jobs are
>> submitted to a global waiting area. The scheduler then picks "queue
>> instances" (queue instances roughly = hosts, yet each host can have
>> more than one queue instance) that satisfy the resource requirements
>> of each job, and at that point the jobs are bound to those queue
>> instances. We also have global queues called "cluster queues", but
>> they are an abstraction of the queue instances.
>> So what does that all mean??
>> In LSF or Torque, some clusters have debug queues, short queues, long
>> queues, etc. Those can be migrated to Grid Engine cluster queues with
>> some work (ie. relatively easy).
>> If you want queue-level user-based fairshare or queue-based fairshare
>> as in LSF (e.g. users in each queue get a different priority) - I have
>> not looked at Maui for a while, so I am not sure whether it has this
>> feature - then that can be harder to implement or model in Grid Engine.
>> If you let us know a bit more about your setup, then we can provide
>> further help.
>> On Fri, Apr 20, 2012 at 11:42 PM, Joseph A. Farran<jfarran at uci.edu> wrote:
>>> Hi All.
>>> I am a long-time Torque/Maui admin running an HPC cluster looking to
>>> transition to Open Grid Engine. I am a newbie with OGE however.
>>> Are there any links and or helpful tips on moving to OGE from an admin point
>>> of view? How to convert Torque qmgr queues, nodes, resource limits to the
>>> equivalent in OGE?
>>> users mailing list
>>> users at gridengine.org