[gridengine users] resource reservation problem

Chris Paciorek paciorek at stat.berkeley.edu
Mon May 20 23:50:23 UTC 2013


OK, so we're still having trouble with reservations. Per Reuti's suggestion,
I ran some tests in which h_rt is requested (with -hard) for all jobs, so SGE
should have enough information to manage the reservations. In particular,
here's the setup (a sketch of making such a request the site-wide default
follows the three cases below):

1) We have a 32-core job with h_rt = 10 days
qsub -pe smp 32 -hard -l h_rt=240:00:00 tmp.long.sh

2) We have 500 different 8-core jobs with h_rt = 1 day
for ((it = 1; it <= 500; it++)); do qsub -pe smp 8 -hard -l h_rt=24:00:00 tmp.long.sh; done

3) A different user then submits a 16-core job with h_rt = 30 minutes,
requesting a reservation; this job immediately jumps to the top of the
queue and shows that it is reserving slots on a specific node (in
particular, eml-sm11.berkeley.edu):
hicks.scf1% qsub -pe smp 16 -hard -l h_rt=00:30:00 -R y tmp.long.sh
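
For reference, a minimal sketch of how such a request could be made the
default for every submission, assuming the usual cluster-wide default-request
file under the cell's common directory (the 24-hour value is just
illustrative):

  # /var/lib/gridengine/default/common/sge_request
  # Options listed here are applied to every qsub unless the user
  # overrides them on the command line.
  -hard -l h_rt=24:00:00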

Here's the queue initially:

hicks.paciorek$ qstat -u "*" | head -n 40
job-ID  prior   name       user         state submit/start at
queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
    786 0.75000 tmp.long.s paciorek     r     05/20/2013 16:21:13
low.q at eml-sm02.Berkeley.EDU       32
    787 0.25005 tmp.long.s paciorek     r     05/20/2013 16:23:43
low.q at eml-sm11.Berkeley.EDU        8
    788 0.25005 tmp.long.s paciorek     r     05/20/2013 16:23:43
low.q at eml-sm11.Berkeley.EDU        8
    789 0.25005 tmp.long.s paciorek     r     05/20/2013 16:23:43
low.q at eml-sm11.Berkeley.EDU        8
    790 0.25005 tmp.long.s paciorek     r     05/20/2013 16:23:43
low.q at eml-sm11.Berkeley.EDU        8
    791 0.25005 tmp.long.s paciorek     r     05/20/2013 16:23:43
low.q at eml-sm01.Berkeley.EDU        8
    792 0.25005 tmp.long.s paciorek     r     05/20/2013 16:23:43
low.q at eml-sm01.Berkeley.EDU        8
    793 0.25005 tmp.long.s paciorek     r     05/20/2013 16:23:43
low.q at eml-sm01.Berkeley.EDU        8
    794 0.25005 tmp.long.s paciorek     r     05/20/2013 16:23:43
low.q at eml-sm01.Berkeley.EDU        8
    795 0.25005 tmp.long.s paciorek     r     05/20/2013 16:23:43
low.q at eml-sm00.Berkeley.EDU        8
    796 0.25005 tmp.long.s paciorek     r     05/20/2013 16:23:43
low.q at eml-sm00.Berkeley.EDU        8
    797 0.25005 tmp.long.s paciorek     r     05/20/2013 16:23:43
low.q at eml-sm00.Berkeley.EDU        8
    798 0.25005 tmp.long.s paciorek     r     05/20/2013 16:23:43
low.q at eml-sm00.Berkeley.EDU        8
    799 0.25005 tmp.long.s paciorek     r     05/20/2013 16:23:43
low.q at eml-sm12.Berkeley.EDU        8
    800 0.25005 tmp.long.s paciorek     r     05/20/2013 16:23:43
low.q at eml-sm12.Berkeley.EDU        8
    801 0.25005 tmp.long.s paciorek     r     05/20/2013 16:23:43
low.q at eml-sm12.Berkeley.EDU        8
    802 0.25005 tmp.long.s paciorek     r     05/20/2013 16:23:43
low.q at eml-sm12.Berkeley.EDU        8
    803 0.25005 tmp.long.s paciorek     r     05/20/2013 16:23:43
low.q at eml-sm03.Berkeley.EDU        8
    804 0.25005 tmp.long.s paciorek     r     05/20/2013 16:23:43
low.q at eml-sm03.Berkeley.EDU        8
    805 0.25005 tmp.long.s paciorek     r     05/20/2013 16:23:43
low.q at eml-sm03.Berkeley.EDU        8
    806 0.25005 tmp.long.s paciorek     r     05/20/2013 16:23:43
low.q at eml-sm03.Berkeley.EDU        8
    807 0.25005 tmp.long.s paciorek     r     05/20/2013 16:23:43
low.q at eml-sm10.Berkeley.EDU        8
    808 0.25005 tmp.long.s paciorek     r     05/20/2013 16:23:43
low.q at eml-sm10.Berkeley.EDU        8
    809 0.25005 tmp.long.s paciorek     r     05/20/2013 16:23:43
low.q at eml-sm10.Berkeley.EDU        8
    810 0.25005 tmp.long.s paciorek     r     05/20/2013 16:23:43
low.q at eml-sm10.Berkeley.EDU        8
   1288 0.41297 tmp.long.s scf1         qw    05/20/2013
16:33:33                                   16
    811 0.25005 tmp.long.s paciorek     qw    05/20/2013
16:23:37                                    8
    812 0.25005 tmp.long.s paciorek     qw    05/20/2013
16:23:37                                    8
    813 0.25005 tmp.long.s paciorek     qw    05/20/2013
16:23:37                                    8
    814 0.25005 tmp.long.s paciorek     qw    05/20/2013
16:23:37                                    8


I was expecting that once jobs finished on the node holding the reservation
(or, in this case, were explicitly killed by the administrator so I had more
control over the test), slots would be accumulated for the 16-core job.
Instead, a few things happened that I don't understand and that make the
reservation ineffective.

a) When I deleted a job on the node holding the reservation, in some cases
the reservation was moved to a different node and one of the lower-priority
jobs started on the node with the freed slots.

Here's the end of the schedule file initially:

808:1:RUNNING:1369092223:86460:P:smp:slots:8.000000
808:1:RUNNING:1369092223:86460:Q:low.q at eml-sm10.Berkeley.EDU:slots:8.000000
809:1:RUNNING:1369092223:86460:P:smp:slots:8.000000
809:1:RUNNING:1369092223:86460:Q:low.q at eml-sm10.Berkeley.EDU:slots:8.000000
810:1:RUNNING:1369092223:86460:P:smp:slots:8.000000
810:1:RUNNING:1369092223:86460:Q:low.q at eml-sm10.Berkeley.EDU:slots:8.000000
1288:1:RESERVING:1369178683:1860:P:smp:slots:16.000000
1288:1:RESERVING:1369178683:1860:Q:low.q at eml-sm11.Berkeley.EDU:slots:16.000000
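
(For readers unfamiliar with the schedule file that the scheduler writes when
MONITOR=1 is set: my reading of the record layout -- the field meanings are
assumed, not checked against the source -- is

  job_id:task_id:state:start_time:duration:level:object:resource:amount

where level P is the parallel environment and Q the queue instance. So the
last record above says job 1288 is reserving 16 slots in low.q on eml-sm11
for 1860 seconds, starting at epoch 1369178683; the 1860 s is presumably the
30-minute h_rt plus the scheduler's default 60 s duration_offset.)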

I then kill a job on eml-sm11.berkeley.edu and here's the schedule file
after that:

810:1:RUNNING:1369092223:86460:P:smp:slots:8.000000
810:1:RUNNING:1369092223:86460:Q:low.q at eml-sm10.Berkeley.EDU:slots:8.000000
811:1:RUNNING:1369092889:86460:P:smp:slots:8.000000
811:1:RUNNING:1369092889:86460:Q:low.q at eml-sm01.Berkeley.EDU:slots:8.000000
812:1:RUNNING:1369092934:86460:P:smp:slots:8.000000
812:1:RUNNING:1369092934:86460:Q:low.q at eml-sm11.Berkeley.EDU:slots:8.000000
1288:1:RESERVING:1369178683:1860:P:smp:slots:16.000000
1288:1:RESERVING:1369178683:1860:Q:low.q at eml-sm00.Berkeley.EDU:slots:16.000000

Notice that job #812 is now running on eml-sm11.berkeley.edu and that the
reservation for the 16-core job (#1288) has shifted to eml-sm00.berkeley.edu.

b) Later during the test, one of the 8-core jobs slipped in front of the
16-core job on the node where the reservation was being held. Here's the
schedule file output showing the reservation on eml-sm00.berkeley.edu, but
with job #813 starting there anyway:

hicks.scf1% tail -n 10 /var/lib/gridengine/default/common/schedule
810:1:RUNNING:1369092223:86460:P:smp:slots:8.000000
810:1:RUNNING:1369092223:86460:Q:low.q at eml-sm10.Berkeley.EDU:slots:8.000000
811:1:RUNNING:1369092889:86460:P:smp:slots:8.000000
811:1:RUNNING:1369092889:86460:Q:low.q at eml-sm01.Berkeley.EDU:slots:8.000000
812:1:RUNNING:1369092934:86460:P:smp:slots:8.000000
812:1:RUNNING:1369092934:86460:Q:low.q at eml-sm11.Berkeley.EDU:slots:8.000000
1288:1:RESERVING:1369178683:1860:P:smp:slots:16.000000
1288:1:RESERVING:1369178683:1860:Q:low.q at eml-sm00.Berkeley.EDU:slots:16.000000
813:1:STARTING:1369093039:86460:P:smp:slots:8.000000
813:1:STARTING:1369093039:86460:Q:low.q at eml-sm00.Berkeley.EDU:slots:8.000000
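
For anyone reproducing this, I freed slots and watched the scheduler's view
with commands along these lines (the job ID is just a placeholder):

  qdel 787                                             # kill one 8-core job on the reserved node
  tail -f /var/lib/gridengine/default/common/schedule  # watch where the reservation sits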

Can anyone shed any light on what is going on?

-Chris



On Mon, May 13, 2013 at 11:51 PM, Reuti <reuti at staff.uni-marburg.de> wrote:

> Am 14.05.2013 um 02:33 schrieb Chris Paciorek:
>
> > I tried submitting a job with h_rt requested for 30 minutes.
> > qsub -pe smp 16 -R y -l h_rt=30 -b y "R CMD BATCH --no-save tmp.R
> tmp.out"
>
> This would be 30 seconds. 30 minutes can be specified as ":30:".
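>
> For illustration, the TIME syntax is hours:minutes:seconds (a bare number
> counts as seconds), so for example:
>
>   -l h_rt=30         # 30 seconds
>   -l h_rt=:30:       # 30 minutes
>   -l h_rt=0:30:0     # also 30 minutes
>   -l h_rt=240:00:00  # 240 hours = 10 days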
>
>
> > Our default_duration is still set at 7200 hours.
>
> So all other jobs still don't have any h_rt/s_rt set? I meant that a
> proper time should be requested for all jobs, especially for those which
> run shorter than 300 days. Even if a queue has a time limit, it won't be
> taken into account when the reservation is made. So the reserved node
> might have a job running which will end first:
>
> >  33004 0.06039 tophat.sh  seqc         r     04/24/2013 07:14:20
> low.q at scf-sm02.Berkeley.EDU       32
>
> will end before the jobs on scf-sm01 and scf-sm03 if all of them are
> assumed to run for 300 days.
>
> -- Reuti
>
>
>
> > The submitted job is at the top of the queue (see below) but jobs
> requesting fewer cores are slipping ahead of the job with the reservation.
> I believe this is happening because the reservation was placed on node
> scf-sm02. Here are the relevant lines from the schedule file:
> > 34640:1:RESERVING:1369228520:90:P:smp:slots:16.000000
> > 34640:1:RESERVING:1369228520:90:Q:low.q at scf-sm02.Berkeley.EDU:slots:16.000000
> >
> > So it seems that SGE has decided to put the reservation on node
> > scf-sm02, which has the longest-running current job (#33004), perhaps
> > because, based on the default_duration of 7200 hours, it expects that
> > job to finish first amongst all running jobs. Then, when jobs on other
> > nodes finish, the reservation is not applied to those other nodes, and
> > so jobs slip ahead of the job that requested the reservation. Here's a
> > snapshot of the queue after job #34640 was submitted with a reservation
> > attached to it. Shortly after this snapshot, job #34333 started on node
> > scf-sm03, despite the reservation for job #34640.
> >
> > Any thoughts on whether this understanding is correct?
> >
> >
> > job-ID  prior   name       user         state submit/start at     queue
>                          slots ja-task-ID
> >
> -----------------------------------------------------------------------------------------------------------------
> >   33004 0.06039 tophat.sh  seqc         r     04/24/2013 07:14:20
> low.q at scf-sm02.Berkeley.EDU       32
> >   34321 0.00211 SubSampleF isoform      r     05/13/2013 08:52:27
> low.q at scf-sm01.Berkeley.EDU        8
> >   34322 0.00211 SubSampleF isoform      r     05/13/2013 09:05:42
> low.q at scf-sm03.Berkeley.EDU        8
> >   34323 0.00211 SubSampleF isoform      r     05/13/2013 09:28:42
> low.q at scf-sm00.Berkeley.EDU        8
> >   34324 0.00211 SubSampleF isoform      r     05/13/2013 09:41:42
> low.q at scf-sm03.Berkeley.EDU        8
> >   34325 0.00211 SubSampleF isoform      r     05/13/2013 09:57:12
> low.q at scf-sm00.Berkeley.EDU        8
> >   34326 0.00211 SubSampleF isoform      r     05/13/2013 10:15:12
> low.q at scf-sm00.Berkeley.EDU        8
> >   34327 0.00211 SubSampleF isoform      r     05/13/2013 10:56:27
> low.q at scf-sm01.Berkeley.EDU        8
> >   34328 0.00211 SubSampleF isoform      r     05/13/2013 11:00:12
> low.q at scf-sm03.Berkeley.EDU        8
> >   34329 0.00211 SubSampleF isoform      r     05/13/2013 11:01:57
> low.q at scf-sm01.Berkeley.EDU        8
> >   34330 0.00211 SubSampleF isoform      r     05/13/2013 12:09:27
> low.q at scf-sm03.Berkeley.EDU        8
> >   34331 0.00211 SubSampleF isoform      r     05/13/2013 12:35:57
> low.q at scf-sm00.Berkeley.EDU        8
> >   34332 0.00211 SubSampleF isoform      r     05/13/2013 13:18:27
> low.q at scf-sm01.Berkeley.EDU        8
> >   34397 0.68717 tnBoot.sh  haiyanh      r     05/09/2013 17:45:02
> high.q at scf-sm01.Berkeley.EDU       8
> >   34613 0.03245 run_japan. lwtai        r     05/11/2013 23:52:39
> high.q at scf-sm02.Berkeley.EDU       1
> >   34614 0.03245 run_japan. lwtai        r     05/11/2013 23:52:39
> high.q at scf-sm02.Berkeley.EDU       1
> >   34615 0.03245 run_japan. lwtai        r     05/11/2013 23:52:39
> high.q at scf-sm02.Berkeley.EDU       1
> >   34616 0.03245 run_japan. lwtai        r     05/11/2013 23:52:39
> high.q at scf-sm02.Berkeley.EDU       1
> >   34633 0.03245 run2       lwtai        r     05/13/2013 09:36:27
> high.q at scf-sm01.Berkeley.EDU       1
> >   34648 0.03245 run_data   lwtai        r     05/13/2013 16:24:25
> high.q at scf-sm02.Berkeley.EDU       5
> >   34649 0.03245 run3       lwtai        r     05/13/2013 16:48:10
> high.q at scf-sm00.Berkeley.EDU       2
> >   34640 1.00000 R CMD BATC paciorek     qw    05/13/2013 13:45:13
>                             16
> >   34333 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:56
>                              8
> >   34334 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:56
>                              8
> >   34335 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:56
>                              8
> >   34336 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:56
>                              8
> >   34337 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:56
>                              8
> >   34338 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:56
>                              8
> >   34339 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:56
>                              8
> >   34340 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:56
>                              8
> >   34341 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:56
>                              8
> >   34342 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:56
>                              8
> >   34343 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:57
>                              8
> >   34344 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:57
>                              8
> >   34345 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:57
>                              8
> >   34346 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:57
>                              8
> >   34347 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:57
>                              8
> >   34348 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:57
>                              8
> >   34349 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:57
>                              8
> >
> >
> >
> > On Fri, May 10, 2013 at 6:10 AM, Reuti <reuti at staff.uni-marburg.de>
> wrote:
> > Hi,
> >
> > Am 10.05.2013 um 00:35 schrieb Chris Paciorek:
> >
> > > For the (default) queue [called low.q] that these jobs are going to,
> we have the time limit set to 28 days (see below). Users are not explicitly
> requesting h_rt/s_rt. The jobs that are slipping ahead of the reserved job
> are not actually jobs that are short in time, and SGE shouldn't have any
> way of thinking that they are.
> >
> > https://arc.liv.ac.uk/trac/SGE/ticket/388
> >
> > http://gridengine.org/pipermail/users/2012-July/004104.html
> >
> > Without an explicit request the default runtime will be assumed for all
> jobs.
> >
> > The jobs 34195-34198 weren't started at once, but one after the other.
> > I would say the jobs running before them on nodes scf-sm01 and scf-sm03
> > were shorter than the estimated 7200 hrs. Can you please try submitting
> > a shorter job with an explicitly requested h_rt and check whether it
> > changes anything?
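> >
> > For example (just to illustrate the syntax, reusing the command from
> > your earlier mail):
> >
> >   qsub -pe smp 16 -R y -l h_rt=0:30:0 -b y "R CMD BATCH --no-save tmp.R tmp.out"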
> >
> > -- Reuti
> >
> >
> > > I'm starting to suspect that the issue is that the reservation is
> > > hard-wired to an individual node, and in our case it is being
> > > hard-wired to the node with the longest-running job, while jobs on
> > > other nodes are finishing more quickly. I suppose this makes sense: in
> > > order to collect sufficient cores for a reservation, the scheduler
> > > needs to do so on a single node, so at some point it has to decide
> > > which node that will be. Unfortunately, in this case it immediately
> > > chooses the node with the long-running job as soon as the reservation
> > > is requested, but that long-running job is likely to keep running for
> > > a while. Can anyone weigh in on whether this sounds right and, if so,
> > > suggest ideas to deal with it?
> > >
> > > beren:~$ qconf -sq low.q
> > > qname                 low.q
> > > hostlist              @sm0
> > > seq_no                0
> > > load_thresholds       np_load_avg=1.75
> > > suspend_thresholds    NONE
> > > nsuspend              1
> > > suspend_interval      00:05:00
> > > priority              19
> > > min_cpu_interval      00:05:00
> > > processors            UNDEFINED
> > > qtype                 BATCH
> > > ckpt_list             NONE
> > > pe_list               smp smpcontrol
> > > rerun                 FALSE
> > > slots                 32
> > > tmpdir                /tmp
> > > shell                 /bin/bash
> > > prolog                NONE
> > > epilog                NONE
> > > shell_start_mode      posix_compliant
> > > starter_method        NONE
> > > suspend_method        NONE
> > > resume_method         NONE
> > > terminate_method      NONE
> > > notify                00:00:60
> > > owner_list            NONE
> > > user_lists            sm0users
> > > xuser_lists           NONE
> > > subordinate_list      NONE
> > > complex_values        NONE
> > > projects              NONE
> > > xprojects             NONE
> > > calendar              NONE
> > > initial_state         default
> > > s_rt                  671:00:00
> > > h_rt                  672:00:00
> > > s_cpu                 INFINITY
> > > h_cpu                 INFINITY
> > > s_fsize               INFINITY
> > > h_fsize               INFINITY
> > > s_data                INFINITY
> > > h_data                INFINITY
> > > s_stack               INFINITY
> > > h_stack               INFINITY
> > > s_core                INFINITY
> > > h_core                INFINITY
> > > s_rss                 INFINITY
> > > h_rss                 INFINITY
> > > s_vmem                INFINITY
> > > h_vmem                INFINITY
> > >
> > >
> > >
> > > beren:~$ qconf -sc
> > > #name               shortcut   type        relop requestable
> consumable default  urgency
> > >
> #----------------------------------------------------------------------------------------
> > > arch                a          RESTRING    ==    YES         NO
>   NONE     0
> > > calendar            c          RESTRING    ==    YES         NO
>   NONE     0
> > > cpu                 cpu        DOUBLE      >=    YES         NO
>   0        0
> > > display_win_gui     dwg        BOOL        ==    YES         NO
>   0        0
> > > h_core              h_core     MEMORY      <=    YES         NO
>   0        0
> > > h_cpu               h_cpu      TIME        <=    YES         NO
>   0:0:0    0
> > > h_data              h_data     MEMORY      <=    YES         NO
>   0        0
> > > h_fsize             h_fsize    MEMORY      <=    YES         NO
>   0        0
> > > h_rss               h_rss      MEMORY      <=    YES         NO
>   0        0
> > > h_rt                h_rt       TIME        <=    YES         NO
>   0:0:0    0
> > > h_stack             h_stack    MEMORY      <=    YES         NO
>   0        0
> > > h_vmem              h_vmem     MEMORY      <=    YES         NO
>   0        0
> > > hostname            h          HOST        ==    YES         NO
>   NONE     0
> > > load_avg            la         DOUBLE      >=    NO          NO
>   0        0
> > > load_long           ll         DOUBLE      >=    NO          NO
>   0        0
> > > load_medium         lm         DOUBLE      >=    NO          NO
>   0        0
> > > load_short          ls         DOUBLE      >=    NO          NO
>   0        0
> > > m_core              core       INT         <=    YES         NO
>   0        0
> > > m_socket            socket     INT         <=    YES         NO
>   0        0
> > > m_topology          topo       RESTRING    ==    YES         NO
>   NONE     0
> > > m_topology_inuse    utopo      RESTRING    ==    YES         NO
>   NONE     0
> > > mem_free            mf         MEMORY      <=    YES         NO
>   0        0
> > > mem_total           mt         MEMORY      <=    YES         NO
>   0        0
> > > mem_used            mu         MEMORY      >=    YES         NO
>   0        0
> > > min_cpu_interval    mci        TIME        <=    NO          NO
>   0:0:0    0
> > > np_load_avg         nla        DOUBLE      >=    NO          NO
>   0        0
> > > np_load_long        nll        DOUBLE      >=    NO          NO
>   0        0
> > > np_load_medium      nlm        DOUBLE      >=    NO          NO
>   0        0
> > > np_load_short       nls        DOUBLE      >=    NO          NO
>   0        0
> > > num_proc            p          INT         ==    YES         NO
>   0        0
> > > qname               q          RESTRING    ==    YES         NO
>   NONE     0
> > > rerun               re         BOOL        ==    NO          NO
>   0        0
> > > s_core              s_core     MEMORY      <=    YES         NO
>   0        0
> > > s_cpu               s_cpu      TIME        <=    YES         NO
>   0:0:0    0
> > > s_data              s_data     MEMORY      <=    YES         NO
>   0        0
> > > s_fsize             s_fsize    MEMORY      <=    YES         NO
>   0        0
> > > s_rss               s_rss      MEMORY      <=    YES         NO
>   0        0
> > > s_rt                s_rt       TIME        <=    YES         NO
>   0:0:0    0
> > > s_stack             s_stack    MEMORY      <=    YES         NO
>   0        0
> > > s_vmem              s_vmem     MEMORY      <=    YES         NO
>   0        0
> > > seq_no              seq        INT         ==    NO          NO
>   0        0
> > > slots               s          INT         <=    YES         YES
>  1        1000
> > > swap_free           sf         MEMORY      <=    YES         NO
>   0        0
> > > swap_rate           sr         MEMORY      >=    YES         NO
>   0        0
> > > swap_rsvd           srsv       MEMORY      >=    YES         NO
>   0        0
> > > swap_total          st         MEMORY      <=    YES         NO
>   0        0
> > > swap_used           su         MEMORY      >=    YES         NO
>   0        0
> > > tmpdir              tmp        RESTRING    ==    NO          NO
>   NONE     0
> > > virtual_free        vf         MEMORY      <=    YES         YES
>  0        0
> > > virtual_total       vt         MEMORY      <=    YES         NO
>   0        0
> > > virtual_used        vu         MEMORY      >=    YES         NO
>   0        0
> > >
> > >
> > >
> > > On Thu, May 9, 2013 at 10:43 AM, Reuti <reuti at staff.uni-marburg.de>
> wrote:
> > > Am 09.05.2013 um 18:51 schrieb Chris Paciorek:
> > >
> > > > We're having a problem similar to that described in this thread:
> > > >
> http://www.mentby.com/Group/grid-engine/62u4-resource-reservation-not-working-for-some-jobs.html
> > > >
> > > > We're running Grid Engine 6.2u5 for a cluster of 4 Linux nodes (32
> cores each) running Ubuntu 12.04 (Precise).
> > > >
> > > > We're seeing that jobs that request a reservation and are at the top
> of the queue are not starting, with lower-priority jobs that are requesting
> fewer cores slipping ahead of the higher priority job. An example of this
> is at the bottom of this posting.
> > >
> > > Besides the defined "default_duration 7200:00:00": what h_rt/s_rt
> request was supplied to the short jobs?
> > >
> > > -- Reuti
> > >
> > >
> > > > Here's the results of "qconf -ssconf":
> > > > algorithm                         default
> > > > schedule_interval                 0:0:15
> > > > maxujobs                          0
> > > > queue_sort_method                 load
> > > > job_load_adjustments              np_load_avg=0.50
> > > > load_adjustment_decay_time        0:7:30
> > > > load_formula                      np_load_avg
> > > > schedd_job_info                   true
> > > > flush_submit_sec                  0
> > > > flush_finish_sec                  0
> > > > params                            MONITOR=1
> > > > reprioritize_interval             0:0:0
> > > > halftime                          720
> > > > usage_weight_list
> cpu=1.000000,mem=0.000000,io=0.000000
> > > > compensation_factor               5.000000
> > > > weight_user                       0.250000
> > > > weight_project                    0.250000
> > > > weight_department                 0.250000
> > > > weight_job                        0.250000
> > > > weight_tickets_functional         0
> > > > weight_tickets_share              100000
> > > > share_override_tickets            TRUE
> > > > share_functional_shares           TRUE
> > > > max_functional_jobs_to_schedule   200
> > > > report_pjob_tickets               TRUE
> > > > max_pending_tasks_per_job         50
> > > > halflife_decay_list               none
> > > > policy_hierarchy                  SOF
> > > > weight_ticket                     1.000000
> > > > weight_waiting_time               0.278000
> > > > weight_deadline                   3600000.000000
> > > > weight_urgency                    0.000000
> > > > weight_priority                   0.000000
> > > > max_reservation                   10
> > > > default_duration                  7200:00:00
> > > >
> > > > Here's the example:
> > > >
> > > > Job #34378 was submitted as:
> > > > qsub -pe smp 16 -R y -b y "R CMD BATCH --no-save tmp.R tmp.out"
> > > >
> > > >
> > > > Soon after submitting #34378, we see that the job #34378 is next in
> line:
> > > > job-ID  prior   name       user         state submit/start at
> queue                          slots ja-task-ID
> > > >
> -----------------------------------------------------------------------------------------------------------------
> > > >   33004 0.11762 tophat.sh  seqc         r     04/24/2013 07:14:20
> low.q at scf-sm02.Berkeley.EDU       32
> > > >   33718 0.12405 fooSU_long lwtai        r     05/06/2013 17:01:58
> high.q at scf-sm01.Berkeley.EDU       1
> > > >   33719 0.12405 fooSV_long lwtai        r     05/06/2013 17:01:58
> high.q at scf-sm01.Berkeley.EDU       1
> > > >   33720 0.12405 fooWV_long lwtai        r     05/06/2013 17:01:58
> high.q at scf-sm01.Berkeley.EDU       1
> > > >   33721 0.12405 fooWU_long lwtai        r     05/06/2013 17:01:58
> high.q at scf-sm01.Berkeley.EDU       1
> > > >   33745 0.06583 toy.sh     yjhuoh       r     05/07/2013 22:29:28
> low.q at scf-sm00.Berkeley.EDU        1
> > > >   33758 0.06583 toy.sh     yjhuoh       r     05/07/2013 22:30:28
> low.q at scf-sm00.Berkeley.EDU        1
> > > >   33763 0.06583 toy.sh     yjhuoh       r     05/07/2013 22:33:58
> low.q at scf-sm03.Berkeley.EDU        1
> > > >   33787 0.06583 toy.sh     yjhuoh       r     05/08/2013 00:15:58
> low.q at scf-sm00.Berkeley.EDU        1
> > > >   33794 0.06583 toy.sh     yjhuoh       r     05/08/2013 01:45:58
> low.q at scf-sm03.Berkeley.EDU        1
> > > >   34183 0.00570 SubSampleF isoform      r     05/09/2013 03:29:32
> low.q at scf-sm01.Berkeley.EDU        8
> > > >   34185 0.00570 SubSampleF isoform      r     05/09/2013 04:27:47
> low.q at scf-sm01.Berkeley.EDU        8
> > > >   34186 0.00570 SubSampleF isoform      r     05/09/2013 04:36:47
> low.q at scf-sm03.Berkeley.EDU        8
> > > >   34187 0.00570 SubSampleF isoform      r     05/09/2013 05:05:02
> low.q at scf-sm00.Berkeley.EDU        8
> > > >   34188 0.00570 SubSampleF isoform      r     05/09/2013 05:42:17
> low.q at scf-sm01.Berkeley.EDU        8
> > > >   34189 0.00570 SubSampleF isoform      r     05/09/2013 06:12:47
> low.q at scf-sm03.Berkeley.EDU        8
> > > >   34190 0.00570 SubSampleF isoform      r     05/09/2013 06:14:17
> low.q at scf-sm01.Berkeley.EDU        8
> > > >   34191 0.00570 SubSampleF isoform      r     05/09/2013 07:07:32
> low.q at scf-sm03.Berkeley.EDU        8
> > > >   34192 0.00570 SubSampleF isoform      r     05/09/2013 07:24:02
> low.q at scf-sm00.Berkeley.EDU        8
> > > >   34194 0.00570 SubSampleF isoform      r     05/09/2013 07:37:17
> low.q at scf-sm00.Berkeley.EDU        8
> > > >   34378 1.00000 R CMD BATC paciorek     qw    05/09/2013 08:14:31
>                                 16
> > > >   34195 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51
>                                  8
> > > >   34196 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51
>                                  8
> > > >   34197 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51
>                                  8
> > > >   34198 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51
>                                  8
> > > >   34199 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51
>                                  8
> > > >   34200 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51
>                                  8
> > > >   34201 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51
>                                  8
> > > >   34202 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51
>                                  8
> > > >   34203 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51
>                                  8
> > > >   34204 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51
>                                  8
> > > >   34205 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51
>                                  8
> > > >   34206 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51
>                                  8
> > > >   34207 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51
>                                  8
> > > >   34208 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51
>                                  8
> > > >   34209 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51
>                                  8
> > > >   34210 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51
>                                  8
> > > >
> > > > A little while later, we see that jobs 34195-34198 have slipped
> ahead of 34378:
> > > >
> > > > job-ID  prior   name       user         state submit/start at
> queue                          slots ja-task-ID
> > > >
> -----------------------------------------------------------------------------------------------------------------
> > > >   33004 0.11790 tophat.sh  seqc         r     04/24/2013 07:14:20
> low.q at scf-sm02.Berkeley.EDU       32
> > > >   33718 0.12398 fooSU_long lwtai        r     05/06/2013 17:01:58
> high.q at scf-sm01.Berkeley.EDU       1
> > > >   33719 0.12398 fooSV_long lwtai        r     05/06/2013 17:01:58
> high.q at scf-sm01.Berkeley.EDU       1
> > > >   33720 0.12398 fooWV_long lwtai        r     05/06/2013 17:01:58
> high.q at scf-sm01.Berkeley.EDU       1
> > > >   33721 0.12398 fooWU_long lwtai        r     05/06/2013 17:01:58
> high.q at scf-sm01.Berkeley.EDU       1
> > > >   33745 0.08234 toy.sh     yjhuoh       r     05/07/2013 22:29:28
> low.q at scf-sm00.Berkeley.EDU        1
> > > >   33758 0.08234 toy.sh     yjhuoh       r     05/07/2013 22:30:28
> low.q at scf-sm00.Berkeley.EDU        1
> > > >   33763 0.08234 toy.sh     yjhuoh       r     05/07/2013 22:33:58
> low.q at scf-sm03.Berkeley.EDU        1
> > > >   33787 0.08234 toy.sh     yjhuoh       r     05/08/2013 00:15:58
> low.q at scf-sm00.Berkeley.EDU        1
> > > >   34188 0.00568 SubSampleF isoform      r     05/09/2013 05:42:17
> low.q at scf-sm01.Berkeley.EDU        8
> > > >   34189 0.00568 SubSampleF isoform      r     05/09/2013 06:12:47
> low.q at scf-sm03.Berkeley.EDU        8
> > > >   34190 0.00568 SubSampleF isoform      r     05/09/2013 06:14:17
> low.q at scf-sm01.Berkeley.EDU        8
> > > >   34191 0.00568 SubSampleF isoform      r     05/09/2013 07:07:32
> low.q at scf-sm03.Berkeley.EDU        8
> > > >   34192 0.00568 SubSampleF isoform      r     05/09/2013 07:24:02
> low.q at scf-sm00.Berkeley.EDU        8
> > > >   34194 0.00568 SubSampleF isoform      r     05/09/2013 07:37:17
> low.q at scf-sm00.Berkeley.EDU        8
> > > >   34195 0.00568 SubSampleF isoform      r     05/09/2013 08:16:47
> low.q at scf-sm03.Berkeley.EDU        8
> > > >   34196 0.00568 SubSampleF isoform      r     05/09/2013 08:47:32
> low.q at scf-sm01.Berkeley.EDU        8
> > > >   34197 0.00568 SubSampleF isoform      r     05/09/2013 09:11:02
> low.q at scf-sm00.Berkeley.EDU        8
> > > >   34198 0.00568 SubSampleF isoform      r     05/09/2013 09:16:32
> low.q at scf-sm01.Berkeley.EDU        8
> > > >   34378 1.00000 R CMD BATC paciorek     qw    05/09/2013 08:14:31
>                                 16
> > > >   34199 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51
>                                  8
> > > >   34200 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51
>                                  8
> > > >   34201 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51
>                                  8
> > > >   34202 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51
>                                  8
> > > >   34203 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51
>                                  8
> > > >   34204 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51
>                                  8
> > > >   34205 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51
>                                  8
> > > >   34206 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51
>                                  8
> > > >   34207 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51
>                                  8
> > > >   34208 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51
>                                  8
> > > >   34209 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51
>                                  8
> > > >   34210 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:51
>                                  8
> > > >   34211 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:52
>                                  8
> > > >   34212 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:52
>                                  8
> > > >   34213 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:52
>                                  8
> > > >   34214 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:52
>                                  8
> > > >   34215 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:52
>                                  8
> > > >   34216 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:52
>                                  8
> > > >   34217 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:52
>                                  8
> > > >   34218 0.00000 SubSampleF isoform      qw    05/08/2013 19:30:52
>                                  8
> > > >
> > > > The schedule file shows that there are RESERVING statements for
> #34378:
> > > > 34378:1:RESERVING:1369228520:25920060:P:smp:slots:16.000000
> > > > 34378:1:RESERVING:1369228520:25920060:Q:low.q at scf-sm02.Berkeley.EDU:slots:16.000000
> > > >
> > > > Perhaps the issue is that the reservation seems specific to the
> cluster node "scf-sm02.Berkeley.EDU", and that specific node is occupied
> by a long-running job (#33004). If so, is there any way to have the
> reservation not tied to a node?
> > > >
> > > > -Chris
> > > >
> > > >
> ----------------------------------------------------------------------------------------------
> > > > Chris Paciorek
> > > >
> > > > Statistical Computing Consultant, Associate Research Statistician,
> Lecturer
> > > >
> > > > Office: 495 Evans Hall                      Email:
> paciorek at stat.berkeley.edu
> > > > Mailing Address:                            Voice: 510-842-6670
> > > > Department of Statistics                    Fax:   510-642-7892
> > > > 367 Evans Hall                              Skype: cjpaciorek
> > > > University of California, Berkeley          WWW:
> www.stat.berkeley.edu/~paciorek
> > > > Berkeley, CA 94720 USA                      Permanent forward:
> paciorek at alumni.cmu.edu
> > > >
> > > >
> > >
> > >
> >
> >
>
>