[gridengine users] Email warning for s_rt ?
sge at deej.net
Tue Oct 23 18:31:03 UTC 2018
Thank you for your response. I didn't describe our environment very
well, and I apologize. We only have one queue. We've had a few
instances of people forgetting about a job they started that apparently
has no stopping condition, and I am trying to come up with a way to
gently remind folks when they've left something running.
Current thoughts are to have the "sge_request" file contain:
-soft -l s_rt=720:0:0
We can tell them to use qalter to extend the time if they want, or they
can contact us to do it.
It would be nice if we could somehow parse the current s_rt on a job,
and 5 days before that time send out an email notification. If they
extend it to longer, we'd like it to again send out the notification 5
days before the new limit. In other words, something along the lines of
running a cron script every night that parses the running jobs, gets the
relevant info, and sends out an email notification if necessary.
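As a rough sketch of the date arithmetic such a nightly script would
need: the two functions below convert an s_rt value such as 720:0:0 to
seconds and decide whether a job is inside its warning window. The names
rt_to_seconds and warning_due are made up for illustration, and the
sketch assumes the job's start time (in epoch seconds) and its current
s_rt have already been extracted from the scheduler output by some
earlier step that is not shown here.

```shell
#!/bin/sh
# Hypothetical helpers; not part of Grid Engine.

# Convert an s_rt value of the form H:M:S (e.g. 720:0:0) to seconds.
# awk is used so that fields with leading zeros are read as decimal.
rt_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# warning_due START LIMIT NOW LEAD   (all values in seconds)
# Succeeds (exit status 0) when NOW is within LEAD seconds of the
# deadline START + LIMIT, and the deadline has not yet passed.
warning_due() {
    start=$1; limit=$2; now=$3; lead=$4
    deadline=$(( start + limit ))
    warn_at=$(( deadline - lead ))
    [ "$now" -ge "$warn_at" ] && [ "$now" -lt "$deadline" ]
}

# Example use inside a nightly loop over running jobs (assumes
# $start_epoch, $s_rt and $owner were filled in by an earlier step):
#
#   limit=$(rt_to_seconds "$s_rt")
#   lead=$(( 5 * 86400 ))
#   if warning_due "$start_epoch" "$limit" "$(date +%s)" "$lead"; then
#       echo "Your job reaches its run time limit soon." |
#           mail -s "Job run time warning" "$owner"
#   fi
```

Because the limit is recomputed from the job's current s_rt on every
run, a limit extended with qalter would move the warning window along
with it. As written, a nightly run would mail on each of the five nights
inside the window, so a real script would probably want to record which
jobs have already been notified for a given limit.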
In fact, we might not even need the s_rt limit set at all and an email
reminder at set intervals might be enough for our purposes, although
being able to have it auto terminate the job would save some manual effort.
What I'm asking for might not even be practical, but I thought it was
worth asking.
On 10/20/2018 05:02 AM, Reuti wrote:
> Am 19.10.2018 um 22:44 schrieb Dj Merrill:
>> Hi all,
>> Assuming a soft run time limit for a queue, is there a way to send an
>> email warning when the job is about to hit the limit?
>> For example, for a job with "-soft -l s_rt=720:0:0" giving a 30 day run
> You are aware that this is a soft-soft limit? It means: I prefer a queue with an s_rt of 720:0:0, but if I only get 360:0:0 that is also fine.
>> time, is there a way to send an email at the 25 day mark to let the
>> person know the job will be forced to end in 5 days?
> The purpose of s_rt is already to send a signal (SIGUSR1) before h_rt is reached. Please have a look at "RESOURCE LIMITS" in `man queue_conf`. So I wonder whether the combined use of s_rt and h_rt (both with the default -hard option) could already provide what you want to implement.
> Of course, the SIGUSR1 must be caught in the script and masked out in the called binary, so that the binary isn't killed by the default action of SIGUSR1. I use a subshell for this:
> trap "echo Foo" SIGUSR1
> (trap '' SIGUSR1; my_binary)
> because the SIGUSR1 is sent to the complete process tree of the job. The "echo Foo" could be replaced by `mail -s Warning …`.
>> I've thought about trying to draft a script to do this, but thought I'd
>> ask first if anyone else has come up with something.
> A completely different approach: use a checkpoint interface to send a warning email. The interval given to `qsub -c 600:0:0 -ckpt mailer_only …` represents the 25 days, and the checkpointing interface "mailer_only" does not do any real checkpointing, but has a script defined for "ckpt_command" which sends an email (i.e. "interface application-level" must be used).
> There is an introduction to using the checkpoint interface here: https://arc.liv.ac.uk/SGE/howto/checkpointing.html
> -- Reuti