[gridengine users] Email warning for s_rt ?
reuti at staff.uni-marburg.de
Sat Oct 20 09:02:04 UTC 2018
On 19.10.2018 at 22:44, Dj Merrill wrote:
> Hi all,
> Assuming a soft run time limit for a queue, is there a way to send an
> email warning when the job is about to hit the limit?
> For example, for a job with "-soft -l s_rt=720:0:0" giving a 30 day run
Are you aware that this is a soft request for a soft limit? It means: I prefer a queue with an s_rt of 720:0:0, but if I only get 360:0:0, that is fine too.
> time, is there a way to send an email at the 25 day mark to let the
> person know the job will be forced to end in 5 days?
The purpose of s_rt is already to send a signal (SIGUSR1) before h_rt is reached. Please have a look at "RESOURCE LIMITS" in `man queue_conf`. So I wonder whether combining s_rt and h_rt (both with the default -hard option) could already provide what you want to implement.
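As a rough illustration of that combination, the 25-day warning and the 30-day kill can be expressed as hours for s_rt and h_rt. This is only a sketch: the script name job.sh and the variable names are placeholders, and the command is printed rather than submitted.

```shell
#!/bin/sh
# Sketch: derive s_rt (warning) and h_rt (kill) from day counts.
# job.sh is a hypothetical job script name.
warn_days=25
kill_days=30

s_rt="$((warn_days * 24)):0:0"   # 600:0:0, SIGUSR1 is sent at this point
h_rt="$((kill_days * 24)):0:0"   # 720:0:0, the job is killed at this point

# Both limits are requested with the default -hard option.
submit_cmd="qsub -l s_rt=${s_rt},h_rt=${h_rt} job.sh"
echo "$submit_cmd"
```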
Of course, SIGUSR1 must be caught in the job script and masked out in the called binary, so that the binary is not killed by the default SIGUSR1 action. I use a subshell for this:
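# the job script itself reacts to the signal
trap "echo Foo" SIGUSR1
# an empty action means "ignore", and my_binary inherits it ("trap -" would restore the default and kill it)
(trap '' SIGUSR1; my_binary)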
as SIGUSR1 is sent to the complete process tree of the job. The "echo Foo" could be replaced by `mail -s Warning …`.
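A slightly fuller sketch of such a job script follows. One detail it works around: a shell only runs a trap after the foreground command it is waiting for has finished, so here the command is started in the background and the script waits for it, which lets the warning fire right away. The function names run_with_warning and on_soft_limit are made up for this example, `sleep 1` stands in for my_binary, and the echo would be replaced by a mailer call.

```shell
#!/bin/sh
# Sketch: the job script handles SIGUSR1 while the real work ignores it.

warned=0
on_soft_limit() {
    warned=1
    # Placeholder action; a real job could call a mailer here instead.
    echo "soft run time limit reached for job ${JOB_ID:-unknown}"
}

run_with_warning() {
    trap on_soft_limit USR1
    # The subshell ignores USR1, and an ignored signal stays ignored across
    # exec, so the command survives the signal sent to the whole process tree.
    ( trap '' USR1; exec "$@" ) &
    pid=$!
    status=0
    # wait returns early when a trapped signal arrives, so loop until the
    # child has really been reaped.
    while kill -0 "$pid" 2>/dev/null; do
        wait "$pid" && status=0 || status=$?
    done
    return "$status"
}

# Stand-in for my_binary
run_with_warning sleep 1
```

The exit status of the command is passed through, so the job still fails if the binary fails.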
> I've thought about trying to draft a script to do this, but thought I'd
> ask first if anyone else has come up with something.
A completely different approach: use a checkpoint interface to send a warning email. The interval given to `qsub -c 600:0:0 -ckpt mailer_only …` represents the 25 days, and the checkpointing interface "mailer_only" does not do any real checkpointing, but has a script defined for "ckpt_command" which sends an email (i.e. "interface application-level" must be used).
There is an introduction to using the checkpoint interface here: https://arc.liv.ac.uk/SGE/howto/checkpointing.html
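The script behind "ckpt_command" could look roughly like the sketch below. It assumes that the checkpoint interface is configured to pass the job id and the job owner as the first and second argument; how these values are actually handed over depends on the interface definition, so please check `man checkpoint` for the variables that are expanded there. The function names and the MAILER override are made up for this example.

```shell
#!/bin/sh
# Hypothetical ckpt_command script for a "mailer_only" checkpoint interface.
# Assumption: $1 is the job id and $2 is the job owner (the mail recipient).

MAILER="${MAILER:-mail}"   # can be overridden, e.g. for a dry run

build_subject() {
    echo "Warning: job $1 has reached its checkpoint interval"
}

send_warning() {
    job_id=$1
    owner=$2
    # No real checkpoint is written; only a mail goes out.
    echo "Job $job_id is approaching its run time limit." |
        "$MAILER" -s "$(build_subject "$job_id")" "$owner"
}

# Only act when both arguments were supplied.
if [ "$#" -ge 2 ]; then
    send_warning "$1" "$2"
fi
```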