[gridengine users] Message in stderr after exceeding resources
reuti at staff.uni-marburg.de
Wed Mar 2 18:59:10 UTC 2011
On 02.03.2011 at 19:37, Chris Jewell wrote:
> I was wondering if it was possible to get GE to output an error message to the stderr file in response to a job being killed due to it exceeding a resource request?
> Currently, we have an open-door policy on runtime (i.e. default h_rt=INFINITY), which is playing havoc by a) letting long jobs fill up the cluster and preclude short jobs from running (alleviated inefficiently by introducing a 'short' queue), and b) preventing efficient resource reservation for parallel SMP jobs. I'd therefore like to change the default time to 30 minutes, and have users explicitly request more time if they need it. However, I'm worried that the default behaviour of killing jobs with a SIGKILL will confuse users. PBS Pro prints a message to stderr to tell you why your job was killed (memory, time, I/O etc. exceeded the request): is there anything like this in GE I can use?
Yep, it's sometimes not easy to find out why a job was killed, as you have to check the messages file of the relevant nodes. Since you have only SMP jobs in the parallel case, there is only one machine to check, and the matching entries can be attached to the email which is sent to the user. Please find attached a mail wrapper which uses a local messages file; the path can be adjusted to match your setup. In case you hit a race condition where the email is sent before the entry appears in the messages file, a `sleep 5` or the like should help.
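The attachment was not preserved in the archive, so here is only a minimal Python sketch of how such a mail wrapper could work, not the original script. It assumes the wrapper is configured as the `mailer` in the cluster configuration and is called like `mail`, i.e. with `-s <subject> <recipient>` and the body on stdin, and that the subject contains the job number. The paths `MESSAGES_FILE` and `REAL_MAILER`, all function names, and the way log lines are matched (any line mentioning `job <id>`) are assumptions for illustration.

```python
import re
import subprocess
import time

# Assumptions, not taken from the original post: adjust both for your site.
MESSAGES_FILE = "/var/spool/sge/messages"  # hypothetical local execd messages file
REAL_MAILER = "/bin/mail"                  # mailer the wrapper hands the mail to


def extract_job_id(subject):
    """Return the first job number following the word 'Job' in the subject."""
    match = re.search(r"\bjob\D*?(\d+)", subject, re.IGNORECASE)
    return match.group(1) if match else None


def matching_lines(messages_text, job_id):
    """Return the lines of the messages file that mention 'job <job_id>'."""
    pattern = re.compile(r"\bjob\s+" + re.escape(job_id) + r"\b", re.IGNORECASE)
    return [line for line in messages_text.splitlines() if pattern.search(line)]


def build_body(body, extra_lines):
    """Append the matching messages-file entries to the original mail body."""
    if not extra_lines:
        return body
    header = "\n\nEntries from the node's messages file:\n"
    return body.rstrip("\n") + header + "\n".join(extra_lines) + "\n"


def run_wrapper(argv, body, delay=5):
    """Look up the job in the messages file, then forward the mail unchanged
    in its arguments but with the extended body."""
    # Assumed calling convention: the subject follows '-s'.
    subject = argv[argv.index("-s") + 1] if "-s" in argv[:-1] else ""
    job_id = extract_job_id(subject)
    extra = []
    if job_id is not None:
        time.sleep(delay)  # give execd time to write its entry (the 'sleep 5')
        try:
            with open(MESSAGES_FILE, encoding="utf-8", errors="replace") as fh:
                extra = matching_lines(fh.read(), job_id)
        except OSError:
            pass  # no readable messages file: send the mail as it is
    result = subprocess.run([REAL_MAILER] + list(argv),
                            input=build_body(body, extra), text=True)
    return result.returncode
```

To use it as the actual wrapper script, one would add a last line such as `sys.exit(run_wrapper(sys.argv[1:], sys.stdin.read()))` (with `import sys`) and point the `mailer` setting at the script. Matching on the bare job number is deliberately loose; a site would likely narrow it to the limit-related entries it cares about.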
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 1004 bytes
Desc: not available