[gridengine users] Email warning for s_rt ?
reuti at staff.uni-marburg.de
Tue Oct 23 20:11:14 UTC 2018
> Am 23.10.2018 um 20:31 schrieb Dj Merrill <sge at deej.net>:
> Hi Reuti,
> Thank you for your response. I didn't describe our environment very
> well, and I apologize. We only have one queue. We've had a few
> instances of people forgetting they ran a job that doesn't apparently
> have any stopping conditions, and am trying to come up with a way to
> gently remind folks when they've left something running.
> Current thoughts are to have the "sge_request" file contain:
> -soft -l s_rt=720:0:0
> We can tell them to use qalter to extend the time if they want, or they
> can contact us to do it.
This won't work in SGE: the limits are fixed when the job starts. The only way to extend a runtime limit afterwards is to soft-stop the execd on the particular node (with the side effect that no further jobs will be scheduled there until it is restarted), and to restart the execd once the job that was granted the longer runtime than estimated has finished.
> It would be nice if we could somehow parse the current s_rt on a job,
> and 5 days before that time send out an email notification. If they
> extend it to longer, we'd like it to again send out the notification 5
> days before the new limit. In other words, something along the lines of
> running a cron script every night that parses the running jobs, gets the
> relevant info, and sends out an email notification if necessary.
> In fact, we might not even need the s_rt limit set at all and an email
> reminder at set intervals might be enough for our purposes, although
> being able to have it auto terminate the job would save some manual effort.
I would suggest storing such arbitrary information in a job context, e.g. "qsub -ac ESTIMATED_RUNTIME=720". Reading your complete description of the setup, I get the impression that we are speaking of jobs running for days or weeks. Hence a cron job on the master node of the cluster could do all of it once per hour or every 10 minutes:
- read the job context and grep for the currently set maximum duration
- generate emails when a certain limit is passed, and store the information that the email was already sent in the job context too*
- kill a job that has passed its limit
*) This additional context variable "WARNED_FOR=…" could simply get the same value as the limit that was just passed. As long as "ESTIMATED_RUNTIME" equals "WARNED_FOR", no additional email is generated. But if the user changes "ESTIMATED_RUNTIME", we can detect this, and an email can be sent again once the adjusted "ESTIMATED_RUNTIME" is about to be reached. It might be easier to have a small wrapper convert hh:mm:ss to plain seconds, or even to advise users to specify the limit in minutes or hours only as a general requirement, so that no further conversion is necessary in the script.
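Such a conversion wrapper could be a small shell function like the following sketch (the name `hms_to_seconds` is made up for illustration):

```shell
#!/bin/sh
# Convert an SGE-style h:m:s duration (e.g. 720:0:0) to plain seconds,
# so the cron script can compare it against the job's elapsed time.
hms_to_seconds() {
    oldIFS=$IFS
    IFS=:
    set -- $1          # split "h:m:s" into three positional parameters
    IFS=$oldIFS
    echo $(( $1 * 3600 + $2 * 60 + $3 ))
}

hms_to_seconds 720:0:0    # 30 days -> prints 2592000
```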
I wonder how we can pull all the information in one `qstat` call. The context variables of the running jobs you get with `qstat -s r -j "*"`, but the actual start time of a job is output only in a plain `qstat -s r` or `qstat -s r -r`. To lower the impact on the qmaster we should avoid a loop that queries all currently running jobs one after the other.
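Parsing one captured listing with awk, instead of querying the qmaster per job, could look roughly like this sketch. The column layout in the here-document is an assumption modelled on a typical default `qstat` listing; check the field numbers against your own installation (in a real cron job you would pipe `qstat -s r` in instead of the sample text):

```shell
#!/bin/sh
# Extract job id and start date/time from a single `qstat -s r` listing.
qstat_output=$(cat <<'EOF'
job-ID  prior   name  user  state submit/start at     queue          slots
-------------------------------------------------------------------------
    101 0.55500 sim1  alice r     10/18/2018 08:00:00 all.q@node01   8
    102 0.50000 sim2  bob   r     10/20/2018 12:30:00 all.q@node02   4
EOF
)

# Skip the two header lines, then print job id, start date and start time.
printf '%s\n' "$qstat_output" | awk 'NR > 2 { print $1, $6, $7 }'
```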
> What I'm asking for might not even be practical, but I thought it worth
> a try to ask.
> On 10/20/2018 05:02 AM, Reuti wrote:
>> Am 19.10.2018 um 22:44 schrieb Dj Merrill:
>>> Hi all,
>>> Assuming a soft run time limit for a queue, is there a way to send an
>>> email warning when the job is about to hit the limit?
>>> For example, for a job with "-soft -l s_rt=720:0:0" giving a 30 day run
>> You are aware that this is a soft-soft limit? It means: I prefer a queue with an s_rt of 720:0:0, but if I get only 360:0:0 that's also fine.
>>> time, is there a way to send an email at the 25 day mark to let the
>>> person know the job will be forced to end in 5 days?
>> s_rt already has the purpose of sending a signal (SIGUSR1) before h_rt is reached. Please have a look at "RESOURCE LIMITS" in `man queue_conf`. So I wonder whether the combined use of s_rt and h_rt (both with the default -hard option) could already provide what you want to implement.
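For the 30-day example that combination could look like this (values illustrative, `job.sh` hypothetical):

```
# warn with SIGUSR1 after 25 days, hard kill after 30 days
qsub -l s_rt=600:0:0 -l h_rt=720:0:0 job.sh
```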
>> Sure, the SIGUSR1 must be caught in the script and ignored in the called binary, to avoid the binary being killed by SIGUSR1's default behavior (termination). I use a subshell for it:
>> trap "echo Foo" SIGUSR1
>> (trap '' SIGUSR1; my_binary)
>> as the SIGUSR1 is sent to the complete process tree of the job; the ignored disposition set by trap '' is inherited by the exec'ed binary. The "echo Foo" could be replaced by `mail -s Warning …`.
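Put together, a stand-alone sketch of that pattern (no SGE needed; `sleep` stands in for the real binary). Note the subshell uses `trap '' SIGUSR1`, i.e. ignore — which survives exec — rather than resetting to the default action, which would leave the binary killable:

```shell
#!/bin/sh
# The job script traps SIGUSR1 to print a warning; the subshell ignores
# the signal so the long-running binary is not terminated by it.

trap 'echo "warning: s_rt limit reached"' USR1

( trap '' USR1; sleep 2; echo "binary completed" ) &   # stand-in for my_binary
child=$!

sleep 1                 # let the subshell install its signal disposition
kill -USR1 $$           # SGE would deliver SIGUSR1 to the whole process tree:
kill -USR1 "$child"     # the script prints the warning, the subshell ignores it
wait "$child"
```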
>>> I've thought about trying to draft a script to do this, but thought I'd
>>> ask first if anyone else has come up with something.
>> A completely different approach: use a checkpointing interface to send a warning email. The interval given to `qsub -c 600:0:0 -ckpt mailer_only …` represents the 25 days, and the checkpointing interface "mailer_only" does not do any real checkpointing, but has a script defined for "ckpt_command" which sends an email (i.e. "interface application-level" must be used).
>> There is an introduction to use the checkpoint interface here: https://arc.liv.ac.uk/SGE/howto/checkpointing.html
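For illustration, such a checkpointing object (created with `qconf -ackpt mailer_only`) might look roughly like the fragment below; the script path is hypothetical and the exact field values, in particular `when`, should be checked against `man checkpoint` on your installation:

```
ckpt_name          mailer_only
interface          application-level
ckpt_command       /opt/sge/scripts/mail_warning.sh
migr_command       NONE
restart_command    NONE
clean_command      NONE
ckpt_dir           NONE
signal             NONE
when               sx
```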
>> -- Reuti
> users mailing list
> users at gridengine.org