[gridengine users] Automatic job rescheduling. Only one rescheduling is happening

Ilya M 4ilya.m+grid at gmail.com
Mon Jun 11 17:35:26 UTC 2018


Re-reading the man page once more made me think that this is the intended
and logical behavior: if the job ID remains the same, the h_rt and s_rt
counters are not reset. The job starts only once, and execution *continues*
after rescheduling:

"RESOURCE LIMITS
       The first two resource limit parameters, s_rt and h_rt, are
       implemented by Grid Engine. They define the "real time" or also
       called "elapsed" or "wall clock" time *having passed since the
       start of the job*..."
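
One way to check this reading against the numbers quoted further down is to compute how long after each start the hard-limit message was logged. This is only a minimal sketch, assuming GNU date (the -d option) is available; the helper name to_epoch is made up for illustration, and the timestamps and limits are copied from the qacct output, the execd log, and the test script in the quoted message.

```shell
#!/bin/bash
# Sketch: compare timestamps quoted in this thread with the requested limits.
# Assumes GNU date; to_epoch is a hypothetical helper, not an SGE tool.
s_rt=5    # from "#$ -l s_rt=0:0:5"
h_rt=10   # from "h_rt=0:0:10"

# Convert a timestamp to seconds since the epoch (UTC only to keep it deterministic).
to_epoch() { date -u -d "$1" +%s; }

first_start=$(to_epoch "2018-06-08 21:20:09")    # qacct start_time, node140
second_start=$(to_epoch "2018-06-08 21:21:39")   # qacct start_time, node069
hard_kill=$(to_epoch "2018-06-08 21:21:45")      # execd: "exceeded hard wallclock time"

since_first=$(( hard_kill - first_start ))
since_second=$(( hard_kill - second_start ))

echo "requested limits: s_rt=${s_rt}s h_rt=${h_rt}s"
echo "hard limit logged ${since_first}s after the first start"
echo "hard limit logged ${since_second}s after the second start"
```

With the values from this thread, the hard-limit message was logged 96 s after the first start and 6 s after the second start: well past h_rt counted from the first start, and before h_rt counted from the second start alone.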

Ilya.

On Mon, Jun 11, 2018 at 9:57 AM, Reuti <reuti at staff.uni-marburg.de> wrote:

>
> > Am 11.06.2018 um 18:43 schrieb Ilya M <4ilya.m+grid at gmail.com>:
> >
> > Hello,
> >
> > Thank you for the suggestion, Reuti. I am not sure my users' pipelines can
> > deal with multiple job IDs, but perhaps they will be willing to modify
> > their code.
>
> Other SGE commands such as `qdel` also accept the job name, which helps
> in dealing with such a configuration.
>
>
> > On Mon, Jun 11, 2018 at 9:23 AM, Reuti <reuti at staff.uni-marburg.de>
> wrote:
> > Hi,
> >
> >
> > I wouldn't be surprised if the execd remembers that the job was already
> warned, hence it must be the hard limit now. Would your workflow allow:
> >
> > This is happening on different nodes, so each execd cannot know any
> history by itself, the master must be providing this information.
>
> Aha, you are correct.
>
> -- Reuti
>
>
> > Can't help wondering if this is a configurable option.
> >
> > Ilya.
> >
> >
> >
> > . /usr/sge/default/common/settings.sh
> > trap "qresub $JOB_ID; exit 4;" SIGUSR1
> >
> > Well, you get several job numbers this way. For the accounting with
> `qacct` you could use the job name instead of the job number to get all the
> runs listed though.
> >
> > -- Reuti
> >
> >
> > > This is my test script:
> > >
> > > #!/bin/bash
> > >
> > > #$ -S /bin/bash
> > > #$ -l s_rt=0:0:5,h_rt=0:0:10
> > > #$ -j y
> > >
> > > set -x
> > > set -e
> > > set -o pipefail
> > > set -u
> > >
> > > trap "exit 99" SIGUSR1
> > >
> > > trap "exit 2" SIGTERM
> > >
> > > echo "hello world"
> > >
> > > sleep 15
> > >
> > > > It should reschedule itself indefinitely whenever s_rt lapses. Instead,
> > > > rescheduling happens only once: on the second run the job receives
> > > > only SIGTERM and exits. Here is the script's output:
> > >
> > > node140
> > > + set -e
> > > + set -o pipefail
> > > + set -u
> > > + trap 'exit 99' SIGUSR1
> > > + trap 'exit 2' SIGTERM
> > > + echo 'hello world'
> > > hello world
> > > + sleep 15
> > > User defined signal 1
> > > ++ exit 99
> > > node069
> > > + set -e
> > > + set -o pipefail
> > > + set -u
> > > + trap 'exit 99' SIGUSR1
> > > + trap 'exit 2' SIGTERM
> > > + echo 'hello world'
> > > hello world
> > > + sleep 15
> > > Terminated
> > > ++ exit 2
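
The two exit codes in the output above can be reproduced locally, without Grid Engine, by sending each signal to a shell that installs the same two traps. This is a minimal sketch only; simulate_signal is a hypothetical helper name, and it says nothing about which signal the execd chooses to send.

```shell
#!/bin/bash
# Local sketch, no Grid Engine involved: run a child bash with the same two
# traps as the test script and deliver one signal to it.
# simulate_signal is a hypothetical helper name used only for this illustration.
simulate_signal() {
    bash -c '
        trap "exit 99" SIGUSR1   # same trap as the test script
        trap "exit 2" SIGTERM    # same trap as the test script
        kill -s "$1" $$          # deliver the signal to this child shell
        exit 0                   # reached only if no trap fired
    ' _ "$1"
    echo "$1 -> exit status $?"
}

simulate_signal USR1
simulate_signal TERM
```

The USR1 case ends with exit status 99, which the first accounting record below shows as "failed 25 : rescheduling"; the TERM case ends with exit status 2, matching the second record.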
> > >
> > > > The execd logs confirm that the second time the job was killed for
> > > > exceeding h_rt:
> > > >
> > > > 06/08/2018 21:20:15|  main|node140|W|job 8030395.1 exceeded soft wallclock time - initiate soft notify method
> > > > 06/08/2018 21:20:59|  main|node140|E|shepherd of job 8030395.1 exited with exit status = 25
> > > >
> > > > 06/08/2018 21:21:45|  main|node069|W|job 8030395.1 exceeded hard wallclock time - initiate terminate method
> > >
> > > And here is the accounting information:
> > >
> > > ==============================================================
> > > qname        short.q
> > > hostname     node140
> > > group        everyone
> > > owner        ilya
> > > project      project.p
> > > department   defaultdepartment
> > > jobname      reshed_test.sh
> > > jobnumber    8030395
> > > taskid       undefined
> > > account      sge
> > > priority     0
> > > qsub_time    Fri Jun  8 21:19:40 2018
> > > start_time   Fri Jun  8 21:20:09 2018
> > > end_time     Fri Jun  8 21:20:15 2018
> > > granted_pe   NONE
> > > slots        1
> > > failed       25  : rescheduling
> > > exit_status  99
> > > ru_wallclock 6
> > > ...
> > > ==============================================================
> > > qname        short.q
> > > hostname     node069
> > > group        everyone
> > > owner        ilya
> > > project      project.p
> > > department   defaultdepartment
> > > jobname      reshed_test.sh
> > > jobnumber    8030395
> > > taskid       undefined
> > > account      sge
> > > priority     0
> > > qsub_time    Fri Jun  8 21:19:40 2018
> > > start_time   Fri Jun  8 21:21:39 2018
> > > end_time     Fri Jun  8 21:21:50 2018
> > > granted_pe   NONE
> > > slots        1
> > > failed       0
> > > exit_status  2
> > > ru_wallclock 11
> > > ...
> > >
> > >
> > > > Is there anything in the configuration I could be missing? I am
> > > > running 6.2u5.
> > >
> > > Thank you,
> > > Ilya.
> > >
> > > _______________________________________________
> > > users mailing list
> > > users at gridengine.org
> > > https://gridengine.org/mailman/listinfo/users
> >
> >
>
>