[gridengine users] Using multiple queues inherits s_rt & h_rt

Joseph Farran jfarran at uci.edu
Thu May 28 19:27:07 UTC 2015


Hi all.

I am not sure if this is a bug or the way Grid Engine works.

We have several queues that our users submit jobs to. One of them,
"free64", has a 3-day wall-clock limit:

$ qconf -sq free64 | grep "_rt"
s_rt                  72:00:00
h_rt                  72:05:00

Another queue, "bio", has no such limit:

$ qconf -sq bio | grep "_rt"
s_rt                  INFINITY
h_rt                  INFINITY
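
For reference, here is a minimal sketch of converting these limits to seconds, so they can be compared with elapsed run times that are reported in seconds. The helper name rt_to_seconds is hypothetical, and it assumes the limit is written either as HH:MM:SS or as INFINITY, as in the output above.

```shell
# Hypothetical helper: convert a limit written as HH:MM:SS to seconds.
# INFINITY is passed through unchanged, since it means "no limit".
rt_to_seconds() {
    case "$1" in
        INFINITY|infinity)
            echo "INFINITY"
            ;;
        *)
            # Split on ":" and sum hours, minutes and seconds.
            echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
            ;;
    esac
}

rt_to_seconds 72:00:00   # s_rt of free64: prints 259200
rt_to_seconds 72:05:00   # h_rt of free64: prints 259500
rt_to_seconds INFINITY   # bio: prints INFINITY
```

With these values, the soft limit on free64 is exactly 3 days and the hard limit follows 5 minutes (300 seconds) later.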

When a user submits a job to both queues with "-q free64,bio", jobs that
run longer than 3 days are killed whether they land on the "free64" or
the "bio" queue. Why are jobs that land on the "bio" queue being killed
after 3 days?

The jobs are also using GE checkpoint restart:

$ qconf -sckpt restart
ckpt_name          restart
interface          USERDEFINED
ckpt_command       NONE
migr_command       NONE
restart_command    NONE
clean_command      none
ckpt_dir           $SGE_O_WORKDIR
signal             usr1
when               xsr

Is checkpoint restart the cause of this? My guess is that a job that
first landed on the free64 queue picked up the 3-day wall-clock limit,
and when it was restarted on the bio queue it inherited that limit from
free64. If this is what is happening, is this a bug? Is there a
workaround?
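
One way this guess might be checked is to look at the accounting records of a killed job and see which queue each run used and how long it lasted. The sketch below assumes accounting is enabled and that a record is written for each run of a restarted job, which has not been verified here. The helper name summarize_runs and the JOB_ID variable are hypothetical placeholders.

```shell
# Hypothetical helper: read `qacct -j <job_id>` output on stdin and keep
# only the fields that show where and how long each run lasted.
summarize_runs() {
    awk '$1 == "qname" || $1 == "start_time" || $1 == "end_time" ||
         $1 == "ru_wallclock" || $1 == "failed" || $1 == "exit_status" { print }'
}

# Usage (JOB_ID is a placeholder for the id of a killed job):
#   qacct -j "$JOB_ID" | summarize_runs
```

If a job killed on bio shows an earlier run with qname free64, that would be consistent with the limit being carried over across the restart.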

Joseph

