[gridengine users] Automatic job rescheduling. Only one rescheduling is happening

Ilya M 4ilya.m+grid at gmail.com
Fri Jun 8 21:46:53 UTC 2018


Hello,

I found unexpected behavior when setting both hard and soft time limits and
doing automatic rescheduling via SIGUSR1.

This is my test script:

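#!/bin/bash

# Grid Engine directives: run under bash, soft limit 5 s, hard limit 10 s,
# merge stderr into stdout.
#$ -S /bin/bash
#$ -l s_rt=0:0:5,h_rt=0:0:10
#$ -j y

set -x
set -e
set -o pipefail
set -u

# SIGUSR1 arrives when s_rt is exceeded; exit status 99 requests rescheduling.
trap "exit 99" SIGUSR1

# Exit with a distinct status when the job is terminated.
trap "exit 2" SIGTERM

echo "hello world"

# Run longer than both limits so that each limit is reached.
sleep 15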

It should reschedule itself indefinitely, each time s_rt lapses. However,
rescheduling happens only once: on the second run the job receives only
SIGTERM and exits. Here is the script's output:

node140
+ set -e
+ set -o pipefail
+ set -u
+ trap 'exit 99' SIGUSR1
+ trap 'exit 2' SIGTERM
+ echo 'hello world'
hello world
+ sleep 15
User defined signal 1
++ exit 99
node069
+ set -e
+ set -o pipefail
+ set -u
+ trap 'exit 99' SIGUSR1
+ trap 'exit 2' SIGTERM
+ echo 'hello world'
hello world
+ sleep 15
Terminated
++ exit 2
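
The two exit statuses seen in this output (99 and 2) can be reproduced
outside Grid Engine, which helps rule out the trap handlers themselves as the
cause. The following is only a minimal sketch: the function name
signal_exit_status is made up for illustration, and unlike in the job above
(where the output shows sleep itself being killed by the signal), the signal
here is sent only to the shell, so the child waits on a background sleep to
let the trap run immediately.

```shell
#!/bin/bash

# Hypothetical helper: start a child shell with the same two traps as the
# test script, send it one signal, and print the child's exit status.
signal_exit_status() {
    local sig=$1 pid status=0
    # The child sleeps in the background and waits on it, so the trap fires
    # as soon as the signal arrives; the trap also cleans up the sleep.
    bash -c 'sleep 5 & s=$!
             trap "kill $s; exit 99" USR1
             trap "kill $s; exit 2" TERM
             wait $s' &
    pid=$!
    sleep 1                      # give the child time to install its traps
    kill -s "$sig" "$pid"
    wait "$pid" || status=$?     # collect the child's exit status
    echo "$status"
}

echo "USR1 -> exit status $(signal_exit_status USR1)"
echo "TERM -> exit status $(signal_exit_status TERM)"
```

If the handlers behave as intended, the first line reports 99 and the second
reports 2, matching the two runs above.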

The execd logs confirm that on the second run the job was killed for
exceeding h_rt:

06/08/2018 21:20:15|  main|node140|W|job 8030395.1 exceeded soft wallclock time - initiate soft notify method
06/08/2018 21:20:59|  main|node140|E|shepherd of job 8030395.1 exited with exit status = 25

06/08/2018 21:21:45|  main|node069|W|job 8030395.1 exceeded hard wallclock time - initiate terminate method

And here is the accounting information:

==============================================================
qname        short.q
hostname     node140
group        everyone
owner        ilya
project      project.p
department   defaultdepartment
jobname      reshed_test.sh
jobnumber    8030395
taskid       undefined
account      sge
priority     0
qsub_time    Fri Jun  8 21:19:40 2018
start_time   Fri Jun  8 21:20:09 2018
end_time     Fri Jun  8 21:20:15 2018
granted_pe   NONE
slots        1
failed       25  : rescheduling
exit_status  99
ru_wallclock 6
...
==============================================================
qname        short.q
hostname     node069
group        everyone
owner        ilya
project      project.p
department   defaultdepartment
jobname      reshed_test.sh
jobnumber    8030395
taskid       undefined
account      sge
priority     0
qsub_time    Fri Jun  8 21:19:40 2018
start_time   Fri Jun  8 21:21:39 2018
end_time     Fri Jun  8 21:21:50 2018
granted_pe   NONE
slots        1
failed       0
exit_status  2
ru_wallclock 11
...
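
As a cross-check, the intervals between the start times in the accounting
records and the limit messages in the execd logs can be computed directly.
This is a small sketch rather than a diagnostic tool: the helper names
to_seconds and elapsed are illustrative, and it assumes all timestamps fall
on the same day.

```shell
#!/bin/bash

# Convert HH:MM:SS to seconds since midnight.
to_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    # 10# forces base 10 so values such as "09" are not parsed as octal.
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Seconds elapsed between two same-day timestamps: elapsed START END
elapsed() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

# Run 1 (node140): start_time -> "exceeded soft wallclock time" message
echo "run 1, start to soft-limit message: $(elapsed 21:20:09 21:20:15)s"
# Run 2 (node069): start_time -> "exceeded hard wallclock time" message
echo "run 2, start to hard-limit message: $(elapsed 21:21:39 21:21:45)s"
# Run 2 (node069): start_time -> end_time
echo "run 2, start to end:                $(elapsed 21:21:39 21:21:50)s"
```

With the timestamps quoted above this gives 6, 6 and 11 seconds. The
hard-limit message on node069 is therefore logged about 6 seconds after the
second start, while ru_wallclock for that run is 11.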


Is there anything in the configuration I could be missing? I am running 6.2u5.

Thank you,
Ilya.