[gridengine users] running job holds and restart

Reuti reuti at staff.uni-marburg.de
Mon Oct 28 13:05:32 UTC 2013


Am 28.10.2013 um 13:59 schrieb Sangmin Park:

> yes, suspending the job when all 12 slots are used on a particular host. This is what I want.
> So, I tried to submit a job using 12 slots, but it did not work.

Aha, it might be necessary to change the order of the rules in your RQS. The first matching rule will allow or deny the job to be started, i.e. if all slots are used, the (currently) first rule matches and the job is rejected.
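
A sketch of the reordered rule set (the rule text is copied from the RQS quoted below; only the order changes, so the more specific matlabhosts rule is evaluated first):

   limit        queues !matlab.q hosts {@matlabhosts} to slots=$num_proc
   limit        hosts {@parallelhosts} to slots=$num_proc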

-- Reuti


> Still not working...
> 
> --Sangmin
> 
> 
> On Mon, Oct 28, 2013 at 9:47 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
> Am 28.10.2013 um 13:45 schrieb Sangmin Park:
> 
> > This is the RQS
> >
> >    limit        hosts {@parallelhosts} to slots=$num_proc
> >    limit        queues !matlab.q hosts {@matlabhosts} to slots=$num_proc
> > parallelhosts includes matlabhosts.
> >
> > The slots value in matlab.q means the number of cores per node.
> >
> > All hosts are included in parallelhosts, node1 ~ node30.
> > matlabhosts includes node1 ~ node7.
> > short.q, normal.q and long.q can be used on node1 ~ node7.
> >
> > I want to set it up so that when jobs in short.q, normal.q and long.q are running and a matlab job is submitted,
> > a running job not using matlab.q on node1 ~ node7 is suspended and the matlab job runs.
> > This is what I want to set up.
> >
> > I don't understand why this cannot happen if I set the slots value to 12.
> 
> It will suspend the job when all 12 slots are used on a particular host. You may want to try with 1 instead. As a refinement, you could also look into slotwise subordination.
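> 
> A sketch of slotwise subordination for matlab.q (queue names are from this thread; the slots=12 threshold, the sequence numbers and the "sr" suspend action are illustrative placeholders, see queue_conf(5) for the exact semantics):
> 
>    subordinate_list    slots=12(short.q:1:sr,normal.q:2:sr,long.q:3:sr)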
> 
> -- Reuti
> 
> 
> > --Sangmin
> >
> >
> > On Mon, Oct 28, 2013 at 8:58 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
> > Am 28.10.2013 um 12:30 schrieb Sangmin Park:
> >
> > > I've edited the negative values in the priority section: short.q is 4, normal.q is 6 and long.q is 8, respectively.
> > > And I configured 72 cores for each queue.
> >
> > But you didn't answer the question: how do you limit the overall slot count? An RQS or a definition in the exechost?
> >
> > > Below is matlab.q instance details.
> > > qname                 matlab.q
> > > hostlist              @matlabhosts
> > > seq_no                0
> > > load_thresholds       np_load_avg=1.75
> > > suspend_thresholds    NONE
> > > nsuspend              1
> > > suspend_interval      00:05:00
> > > priority              2
> > > min_cpu_interval      00:05:00
> > > processors            UNDEFINED
> > > qtype                 BATCH INTERACTIVE
> > > ckpt_list             NONE
> > > pe_list               fill_up make matlab
> > > rerun                 FALSE
> > > slots                 12
> > > tmpdir                /tmp
> > > shell                 /bin/bash
> > > prolog                NONE
> > > epilog                NONE
> > > shell_start_mode      posix_compliant
> > > starter_method        NONE
> > > suspend_method        NONE
> > > resume_method         NONE
> > > terminate_method      NONE
> > > notify                00:00:60
> > > owner_list            NONE
> > > user_lists            octausers onsiteusers
> > > xuser_lists           NONE
> > > subordinate_list      short.q=72, normal.q=72, long.q=72
> >
> > This will suspend these three queues when 72 slots per queue instance in matlab.q are used. As you have only 12 slots defined above, this will never happen.
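> >
> > If the intention is to suspend the other queues as soon as matlab.q uses any slot on a host, a threshold of 1 per queue would express that (a sketch, untested):
> >
> >    subordinate_list      short.q=1, normal.q=1, long.q=1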
> >
> > What behavior would you like to set up?
> >
> > -- Reuti
> >
> >
> > > complex_values        NONE
> > > projects              NONE
> > > xprojects             NONE
> > > calendar              NONE
> > > initial_state         default
> > > s_rt                  INFINITY
> > > h_rt                  168:00:00
> > > s_cpu                 INFINITY
> > > h_cpu                 INFINITY
> > > s_fsize               INFINITY
> > > h_fsize               INFINITY
> > > s_data                INFINITY
> > > h_data                INFINITY
> > > s_stack               INFINITY
> > > h_stack               INFINITY
> > > s_core                INFINITY
> > > h_core                INFINITY
> > > s_rss                 INFINITY
> > > h_rss                 INFINITY
> > > s_vmem                INFINITY
> > > h_vmem                INFINITY
> > >
> > > thanks,
> > >
> > > --Sangmin
> > >
> > >
> > > On Mon, Oct 28, 2013 at 3:51 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
> > > Hi,
> > >
> > > Am 28.10.2013 um 06:40 schrieb Sangmin Park:
> > >
> > > > Thanks, adam
> > > >
> > > > I configured the SGE queue configuration following the second link you sent.
> > > > But it does not work.
> > > >
> > > > I made 4 queues: short.q, normal.q, long.q and matlab.q.
> > > > short.q, normal.q and long.q queue instances run on all computing nodes, node1 ~ node30.
> > > > The matlab.q instance is configured only for a few nodes, node1 ~ node7, called matlabhosts.
> > > >
> > > > The priorities of the queues are below.
> > > > [short.q]
> > > > priority              -5
> > >
> > > Don't use negative values here. This number is the "nice value" under which the Linux kernel will run the process (i.e. it affects the scheduler in the kernel; it doesn't influence SGE's scheduling). User processes should be in the range 0..19 [up to 20 on Solaris]. The negative ones are reserved for kernel processes.
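> > >
> > > As a generic (non-SGE) illustration of nice values from a shell:

```shell
# Start a command at low priority (nice 19). On an otherwise idle
# machine it runs just as fast as at nice 0 -- the nice value only
# matters when CPU cores are oversubscribed.
nice -n 19 sh -c 'echo "running at nice 19"'

# Print the nice value of the current shell (usually 0):
nice
```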
> > >
> > >
> > > > subordinate_list      NONE
> > > > [normal.q]
> > > > priority              0
> > > > subordinate_list      NONE
> > > > [long.q]
> > > > priority              5
> > > > subordinate_list      NONE
> > > >
> > > > and matlab.q is
> > > > priority              -10
> > > > subordinate_list      short.q normal.q long.q
> > >
> > > Same here. It's also worth noting that these values are relative, i.e. with the same number of user processes as cores, it doesn't matter which values are used as nice values, as each process gets its own core anyway. Only when there are more processes than cores will it have an effect. But as these are relative values, it's the same whether (cores+1) processes all have 0 or all have 19 as their nice value.
> > >
> > >
> > > > I submitted several jobs using normal.q to the matlabhosts,
> > > > and I submitted a job using matlab.q, which has the subordinate_list.
> > > > I expected one of the normal.q jobs to be suspended and the matlab.q job to run.
> > > > But the matlab.q job waits in the queue with status qw; it is not started.
> > > >
> > > > What's the matter with this?
> > > > Please help!
> > >
> > > http://gridengine.org/pipermail/users/2013-October/006820.html
> > >
> > > How do you limit the overall slot count?
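> > >
> > > To check where the slot limit comes from, the usual qconf queries are (standard SGE commands):
> > >
> > >    qconf -srqsl            # list the names of all resource quota sets
> > >    qconf -srqs             # show the resource quota sets
> > >    qconf -se <hostname>    # show the exec host configuration (complex_values)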
> > >
> > > -- Reuti
> > >
> > >
> > > > Sangmin
> > > >
> > > >
> > > >
> > > >
> > > > On Tue, Oct 15, 2013 at 3:50 PM, Adam Brenner <aebrenne at uci.edu> wrote:
> > > > Sangmin,
> > > >
> > > > I believe the phrase / term you are looking for is Subordinate
> > > > Queues[1][2]. This should handle what you are looking for.
> > > >
> > > > If not ... I am sure Reuti (or someone else) will correct me on this.
> > > >
> > > > Enjoy,
> > > > -Adam
> > > >
> > > > [1]: http://docs.oracle.com/cd/E19957-01/820-0698/i998889/index.html
> > > > [2]: http://grid-gurus.blogspot.com/2011/03/using-grid-engine-subordinate-queues.html
> > > >
> > > > --
> > > > Adam Brenner
> > > > Computer Science, Undergraduate Student
> > > > Donald Bren School of Information and Computer Sciences
> > > >
> > > > Research Computing Support
> > > > Office of Information Technology
> > > > http://www.oit.uci.edu/rcs/
> > > >
> > > > University of California, Irvine
> > > > www.ics.uci.edu/~aebrenne/
> > > > aebrenne at uci.edu
> > > >
> > > >
> > > > On Mon, Oct 14, 2013 at 11:18 PM, Sangmin Park <dorimosiada at gmail.com> wrote:
> > > > > Howdy,
> > > > >
> > > > > For a specific purpose in my organization,
> > > > > I want to configure something in the SGE scheduler.
> > > > >
> > > > > Imagine:
> > > > > a job is running, called A-job.
> > > > > If B-job is submitted while A-job is running,
> > > > > I want to hold A-job and run B-job first.
> > > > > And after B-job is finished, restart A-job.
> > > > >
> > > > > What do I do for this?
> > > > >
> > > > > Sangmin
> > > > >
> > > > > --
> > > > > ===========================
> > > > > Sangmin Park
> > > > > Supercomputing Center
> > > > > Ulsan National Institute of Science and Technology(UNIST)
> > > > > Ulsan, 689-798, Korea
> > > > >
> > > > > phone : +82-52-217-4201
> > > > > mobile : +82-10-5094-0405
> > > > > fax : +82-52-217-4209
> > > > > ===========================
> > > > >
> > > > > _______________________________________________
> > > > > users mailing list
> > > > > users at gridengine.org
> > > > > https://gridengine.org/mailman/listinfo/users
> > > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> > >
> >
> >
> >
> >
> 
> 
> 
> 




