[gridengine users] question about managing queues

Carl G. Riches cgr at u.washington.edu
Tue Jul 28 18:03:07 UTC 2015


We have a Rocks cluster (Rocks release 6.1) with the SGE roll (rocks-sge
6.1.2 [GE2011]).  Usage levels have grown faster than the cluster's 
capacity.  We have a single consumable resource (number of CPUs) that we 
are trying to manage in a way that is acceptable to our users.  Before 
diving in on a solution, I would like to find out if others have dealt 
with our particular problem.

Here is a statement of the problem:
- There is a fixed amount of a resource called "number of CPUs" available.
- There are many possible users of the resource "number of CPUs".
- There is a variable number of the resource in use at any given time.
- When the resource is exhausted, requests to use the resource queue up
   until some amount of the resource becomes available again.
- In the event that resource use requests have queued up, we must manage
   the resource in some way.

The way we would like to manage the resource is this:
1. In the event that no requests for the resource are queued up, do
    nothing.
2. In the event that a single user is consuming all of the resource and
    all queued requests for the resource belong to the same user that is
    using all of the resource, do nothing.
3. In the event that a single user is consuming all of the resource and
    not all queued requests for the resource belong to the same user that
    is using all of the resource, "manage the resource".
4. In the event that there are queued requests for the resource and the
    resource is completely used by more than one user, "manage the
    resource".
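
To make cases 1-4 concrete, here is a minimal sketch of the decision in
Python.  It is only an illustration and does not use any Grid Engine
interface.  The function names, the (user, cpus) job tuples, and the
total_cpus parameter are all hypothetical.  The list above does not say
what to do when requests are queued while some CPUs are still free; the
sketch assumes "do nothing" in that situation.

```python
from collections import Counter


def usage_by_user(running):
    """Sum the CPUs in use per user; `running` is a list of (user, cpus) pairs."""
    usage = Counter()
    for user, cpus in running:
        usage[user] += cpus
    return usage


def needs_management(running, queued, total_cpus):
    """Return True when cases 3 or 4 apply, False for cases 1 and 2."""
    if not queued:
        return False  # case 1: nothing is waiting
    usage = usage_by_user(running)
    if sum(usage.values()) < total_cpus:
        # Assumption: the resource is not exhausted, which the list of
        # cases does not cover, so nothing is done.
        return False
    queued_users = {user for user, _ in queued}
    if len(usage) == 1:
        (owner,) = usage
        # case 2: only the owner is waiting; case 3: someone else is waiting
        return queued_users != {owner}
    return True  # case 4: the resource is fully used by several users
```

For example, with 8 CPUs all held by one user, a queue that contains
only that user's jobs gives False (case 2), and a queue that also holds
another user's job gives True (case 3).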

By "manage the resource" we mean:
a. If a user is consuming more than some arbitrary limit of the resource
    (call it L), suspend one of that user's jobs.
b. Determine how much of the resource (CPUs) is made available by the
    prior step.
c. Find a job in the list of queued requests that uses less than or equal
    to the resources made available in the last step _and_ does not belong
    to a user currently using some arbitrary limit L (or more) of the
    resource, then dispatch the job.
d. Repeat the prior step until the available resource is less than the
    resource required by any eligible job remaining in the list of queued
    requests.
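
Steps a-d can be sketched the same way.  Again, this is a plain Python
simulation under stated assumptions, not Grid Engine code: the function
name manage, the (user, cpus) job tuples, and the choice of which job
to suspend (the first running job of the first user found over the
limit) are all hypothetical.  The steps above do not say which job to
pick.

```python
def manage(running, queued, limit):
    """Apply steps a-d once.

    `running` and `queued` are lists of (user, cpus) pairs and are not
    modified.  Returns (suspended_job, dispatched_jobs); suspended_job
    is None when no user is over `limit`.
    """
    usage = {}
    for user, cpus in running:
        usage[user] = usage.get(user, 0) + cpus

    # Step a: pick one job that belongs to a user consuming more than L.
    # Assumption: the first such job in list order is the one suspended.
    victim = next((job for job in running if usage[job[0]] > limit), None)
    if victim is None:
        return None, []
    usage[victim[0]] -= victim[1]

    # Step b: the CPUs made available by the suspension.
    free = victim[1]

    # Steps c and d: dispatch queued jobs that fit and whose owner is
    # below L.  One pass is enough, because `free` only shrinks and
    # per-user usage only grows, so a skipped job never becomes eligible.
    dispatched = []
    for user, cpus in queued:
        if cpus <= free and usage.get(user, 0) < limit:
            dispatched.append((user, cpus))
            usage[user] = usage.get(user, 0) + cpus
            free -= cpus
    return victim, dispatched
```

With L = 4, running jobs [("a", 4), ("a", 4)] and queued jobs
[("a", 2), ("b", 3), ("c", 2), ("d", 1)], the sketch suspends one of
a's 4-CPU jobs and dispatches the jobs of b and d.  The job from a is
skipped because a still holds L CPUs, and the job from c is skipped
because only 1 CPU remains after b's job starts.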

Steps 1-4 above would be repeated at regular intervals to ensure that the
resource is shared.


Has anyone on the list tried to do this sort of queue management?  If so,
how did you go about the task?  Is this possible with Grid Engine?

Thanks in advance,
Carl G. Riches
IT Director
Department of Biostatistics
Box 357232                      voice:     206-616-2725
University of Washington        fax:       206-543-3286
Seattle, WA  98195-7232         internet:  cgr at u.washington.edu



