[gridengine users] Queue limit s_rt / h_rt and CheckPoint
jfarran at uci.edu
Thu Oct 31 20:24:11 UTC 2013
Not sure if there is a better way, but the following seems to be working.
In the submit script for the checkpoint job, I catch the SIGUSR1 signal and
suspend the job with qmod. The handler, SIGUSR1_HANDLER, runs:

    qmod -sj $JOB_ID

and it is registered with:

    trap SIGUSR1_HANDLER SIGUSR1

So when "s_rt" is reached and the job receives the SIGUSR1 signal, the script
suspends the job via qmod.
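
For anyone who wants to try this, here is a rough sketch of how the pieces might fit together in a submit script. It is only an illustration under some assumptions: JOB_ID is set by Grid Engine in the job environment, qmod is on the PATH of the execution host, and the job owner is allowed to suspend their own job. The run_payload helper is a made-up name, not something from my actual script.

```shell
#!/bin/bash
# Sketch only. Assumptions: JOB_ID is set by Grid Engine in the job
# environment, qmod is on PATH on the execution host, and the job owner
# is allowed to suspend their own job.

# Runs when the job receives SIGUSR1 (sent when s_rt is reached).
SIGUSR1_HANDLER() {
    echo "SIGUSR1 received, suspending job ${JOB_ID}" >&2
    qmod -sj "${JOB_ID}"
}

# USR1 is the portable spelling; bash also accepts SIGUSR1.
trap SIGUSR1_HANDLER USR1

# Hypothetical helper: run the payload in the background and wait on it,
# because the shell only runs a trap once the current foreground command
# has returned.
run_payload() {
    "$@" &
    payload_pid=$!
    # wait returns early when a trapped signal arrives, so loop until
    # the payload has really exited.
    while kill -0 "$payload_pid" 2>/dev/null; do
        wait "$payload_pid"
    done
}
```

The last line of a real submit script would then be something like "run_payload ./my_app", where ./my_app is a placeholder for the actual program. Running the program in the background matters: if it runs in the foreground, the shell will not execute the trap until the program has finished.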
On 10/31/2013 11:48 AM, Joseph Farran wrote:
> We have a queue defined with a soft & hard wall-clock limit of:
>
>     qconf -sq free64 | egrep "_rt|notify"
>     notify    00:05:00
>     s_rt      48:00:00
>     h_rt      48:05:00
>
> And jobs get killed correctly after 2 days of wall-clock run time. We now have Grid
> Engine checkpointing set up and would like jobs not to be killed at that point, but
> instead to be sent the suspend signal so that the checkpoint takes over.
> After reading and doing some tests with the queue "suspend_method", I am not
> sure I am on the right track.
> So what is the proper / correct way to do this? To *not* have jobs killed but
> to have the checkpoint process take over when s_rt is reached?