[gridengine users] Job runs on nodes that are not part of queue!

Reuti reuti at staff.uni-marburg.de
Mon Jan 23 22:04:55 UTC 2012


On 23.01.2012 at 21:55, Andrew Pearson wrote:

> Thanks Reuti

You're welcome.


> OK - I made duplicates of all of my parallel environments, so that the slow queue has a different PE list than the fast queue.  The submitted job now runs on the correct queue.
> 
> However, in some sense I'm back to square one.  The reason I created two queues and made them non-requestable is that I wanted to assign resources to users, rather than have them choose them.  Now, the user can effectively choose which queue to be in by choosing the correct parallel environment.  I can't see a way to make the parallel environments non-requestable.

You can even leave the queue requestable. This is the way SGE usually works: a user requests resources and SGE chooses an appropriate queue to satisfy these requests.

Nevertheless: in case you want to enforce a policy, you can use a JSV (job submission verifier) to correct or remove the user's resource requests, or to attach some of your own. In your case:

- a queue is requested: remove the request
- a specific PE is requested: replace it with the same name with an asterisk appended

Instead of correcting the request, you could also just reject the job with a message explaining why.

====
#!/bin/sh

PATH=/bin:/usr/bin

jsv_on_start()
{
   return
}

jsv_on_verify()
{
   do_correct="false"
   do_wait="false"

   # Turn a specific PE request into a wildcard by appending an asterisk,
   # unless the name already ends in one.
   pe_name=$(jsv_get_param pe_name)
   if [ "$pe_name" ]; then
      case "$pe_name" in
         *'*') ;;
         *)
            jsv_set_param pe_name "$pe_name*"
            do_correct="true"
            ;;
      esac
   fi

   # Report the result back to SGE.
   if [ "$do_wait" = "true" ]; then
      jsv_reject_wait "Job is rejected. It might be submitted later."
   elif [ "$do_correct" = "true" ]; then
      jsv_correct "Job was modified before it was accepted"
   else
      jsv_accept "Job is accepted"
   fi
   return
}

. "${SGE_ROOT}/util/resources/jsv/jsv_include.sh"

jsv_main
====

which you can compare to the example in $SGE_ROOT/util/resources/jsv/jsv.sh. If there is no asterisk at the end, one is appended (BTW: the asterisk(s) could be anywhere in the string, and you could also always append one, it won't hurt). See `man jsv_script_interface` to implement similar corrections (i.e. removal):

   jsv_del_param q_hard
   jsv_del_param q_soft

in case they were set. The jsv_url in the cluster configuration also needs to point to this script:

$ qconf -sconf
...
jsv_url                      /home/reuti/jsv.sh

(Perl might be faster though).
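
As a rough sketch of the queue-request removal described above, combined with the alternative of rejecting the job outright, something like the following could work. It is untested against a live cluster. REJECT_QUEUE_REQUESTS is a made-up switch for illustration, not an SGE setting, and the guard around jsv_main is my own addition so that the file does nothing when the SGE helper library cannot be found.

```shell
#!/bin/sh
# Sketch only: drop (or reject) queue requests in a JSV.
# REJECT_QUEUE_REQUESTS is a hypothetical switch, not part of SGE.
REJECT_QUEUE_REQUESTS="false"

jsv_on_start()
{
   return
}

jsv_on_verify()
{
   do_correct="false"

   # q_hard / q_soft hold the hard and soft queue requests of the job.
   for param in q_hard q_soft; do
      if [ "$(jsv_is_param $param)" = "true" ]; then
         if [ "$REJECT_QUEUE_REQUESTS" = "true" ]; then
            # Decline the job and tell the user why.
            jsv_reject "Queues cannot be requested; please drop the -q option."
            return
         fi
         # Otherwise silently remove the request.
         jsv_del_param $param
         do_correct="true"
      fi
   done

   if [ "$do_correct" = "true" ]; then
      jsv_correct "Queue request was removed"
   else
      jsv_accept "Job is accepted"
   fi
   return
}

# Only speak the JSV protocol when the SGE helper library is readable.
jsv_lib="${SGE_ROOT:-}/util/resources/jsv/jsv_include.sh"
if [ -r "$jsv_lib" ]; then
   . "$jsv_lib"
   jsv_main
fi
```

With the switch left at "false" a job submitted with -q is accepted without its queue request; set to "true" the same job is declined with the message shown.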


> Even if this were possible however, if the user doesn't include a -pe line in their submission script, I don't see how they would specify the number of processors they need.

Is this a typo? If requesting a PE is possible, the users can use it to specify the necessary slot count.
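
To illustrate: with the JSV from earlier in this message in place, users would still submit with -pe and a slot count, and only the PE name gets rewritten. Below is a minimal stand-alone sketch of that rewrite. The function name pe_with_wildcard and the PE name mpi are made up for illustration; the slot count is not touched at all.

```shell
#!/bin/sh
# Hypothetical helper: mimics what the JSV does to the PE name of a
# request. The slot count given after the PE name is left unchanged.
pe_with_wildcard()
{
   case "$1" in
      *'*') printf '%s\n' "$1" ;;    # already ends in an asterisk
      *)    printf '%s*\n' "$1" ;;   # append one
   esac
}

pe_with_wildcard mpi       # prints mpi*
pe_with_wildcard 'mpi*'    # prints mpi*
```

So a submission like `qsub -pe mpi 120 job.sh` (job.sh being a placeholder) would end up requesting the PE pattern mpi* with 120 slots.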

-- Reuti


> Sorry for my basic questions.  I'd appreciate any comments you have.
> 
> 
> 
> On Mon, Jan 23, 2012 at 2:57 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
> On 23.01.2012 at 20:34, Andrew Pearson wrote:
> 
> > Hi.  I'm trying to move from load-based to sequence based scheduling, and I have a problem.  First, a little something about my setup:
> >
> > I have two sets of machines - 176 'fast' cores in 16-core nodes, and 90 'slow' cores in 2-core nodes.  I have two corresponding queues - slow.q and fast.q.  The queues are non-requestable.  fast.q looks at the @fast host group, which contains only the names of the fast nodes, and slow.q looks at the @slow host group, which contains only the names of the slow nodes.  In fast.q, I have slots = 16 and processors = 16, while in slow.q I have slots = 2 and processors = 2.  Finally, slow.q is seq_no 1 and fast.q is seq_no 2.
> >
> > Here's the problem:  If I submit a 120 processor job (so it's too large to fit on the slow cores), it still gets assigned to slow.q.  This in itself is bad - I want such a job to go directly to fast.q.  It gets worse though - because there aren't enough machines in slow.q, the remaining 30 threads end up on nodes in fast.q!  I don't understand how this second part is possible.  I've done qstat -f, and my 'fast' compute nodes definitely aren't listed as being members of slow.q.
> >
> > Any suggestions?  Thank you.
> 
> If the same PE is attached to more than one queue, it can collect slots from any of them:
> 
> http://gridengine.org/pipermail/users/2012-January/002526.html
> 
> -- Reuti
> 
> 



