[gridengine users] Queue limit s_rt / h_rt and CheckPoint

Joseph Farran jfarran at uci.edu
Thu Oct 31 20:24:11 UTC 2013


Not sure if there is a better way, but the following seems to be working.

In the checkpoint job's submit script, I catch the SIGUSR1 signal and then
suspend the job with qmod:

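# Suspend this job when the s_rt warning signal arrives.
function SIGUSR1_HANDLER()
{
     # JOB_ID is set by Grid Engine in the job's environment.
     qmod -sj "$JOB_ID"
}
trap SIGUSR1_HANDLER SIGUSR1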

So when "s_rt" is reached and the job receives the SIGUSR1 signal, the handler
suspends the job via qmod instead of letting it run on until "h_rt" kills it.
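
A slightly fuller sketch of the same idea, in case it helps. One thing to watch
for: bash does not run a trap handler while it is waiting on a foreground
command, only after that command returns. Starting the workload in the
background and calling "wait" lets the handler run as soon as SIGUSR1 arrives.
The run_job helper below is a hypothetical name used for illustration, not
something Grid Engine provides, and the sketch assumes qmod is on the PATH of
the execution host.

```shell
#!/bin/bash
# Sketch only: run_job is a hypothetical helper, not part of Grid Engine.

# Suspend this job when SIGUSR1 (the s_rt warning) arrives.
sigusr1_handler() {
    # JOB_ID is set by Grid Engine in the job's environment.
    qmod -sj "$JOB_ID"
}
trap sigusr1_handler USR1

# Run the given command in the background and wait for it, so that the
# trap above can fire while the command is still running.
run_job() {
    local pid status
    "$@" &
    pid=$!
    while :; do
        status=0
        # wait returns early (status > 128) when a trapped signal arrives.
        wait "$pid" || status=$?
        # Stop looping once the child is really gone.
        kill -0 "$pid" 2>/dev/null || break
    done
    return "$status"
}
```

In a submit script this would be called as "run_job ./my_app", where ./my_app
stands in for the actual checkpointable program.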

Joseph


On 10/31/2013 11:48 AM, Joseph Farran wrote:
> Greetings.
>
> We have a queue defined with a soft & hard wall-clock limit of:
>
> qconf -sq free64 | egrep "_rt|notify"
> notify                00:05:00
> s_rt                  48:00:00
> h_rt                  48:05:00
>
> Jobs are killed correctly after 2 days of wall-clock run time. We now have Grid
> Engine checkpointing set up and would like jobs that reach the limit to be sent
> the suspend signal, so that the checkpoint takes over, instead of being killed.
>
> After reading and doing some tests with the queue "suspend_method", I am not
> sure I am on the right track.
>
> So what is the proper way to do this, that is, to *not* have jobs killed but
> to have the checkpoint process take over when s_rt is reached?
>
> Joseph



