[gridengine users] Forcing Grid Engine jobs to error state with exit status other than 0, 99 or 100.
wlee at hess.com
Wed Sep 14 20:52:12 UTC 2016
Thanks for the prompt reply. Apologies for not including more detail in my query about getting Grid Engine to force every job with an exit status other than 0, 99 or 100 into error state (i.e. to treat it as if it had exited with code 100).
As I stated in my earlier post our jobs execute an epilog script which is named "gp_epilog" at the conclusion of the job running on a given execution host. The "gp_epilog" essentially does the following:
1. Obtains the "exit_status" value from a file named "usage" in the execution host's job spool directory. As an example, the directory listing below is from a test job on an execution host named "g00801", where the execution host's spool directory is /tmp/ge/. The listing includes the "usage" file, and the contents of the "usage" file are shown below the listing. The "exit_status" in this example is 137.
Directory listing of /tmp/ge/g00801/active_jobs/1012.1
drwxr-xr-x 2 sgeadmin adm 4096 Sep 13 13:12 .
drwxr-xr-x 3 sgeadmin adm 4096 Sep 13 13:12 ..
-rw-r--r-- 1 sgeadmin adm 6 Sep 13 13:12 addgrpid
-rw-r--r-- 1 sgeadmin adm 2236 Sep 13 13:12 config
-rw-r--r-- 1 sgeadmin adm 1546 Sep 13 13:12 environment
-rw-r--r-- 1 tdhf781 hougeo 0 Sep 13 13:12 error
-rw-r--r-- 1 tdhf781 hougeo 0 Sep 13 13:12 exit_status
prw-r--r-- 1 sgeadmin adm 0 Sep 13 13:12 fifo_execd_to_shepherd
-rw-r--r-- 1 sgeadmin adm 6 Sep 13 13:12 job_pid
-rw-r--r-- 1 sgeadmin adm 54 Sep 13 13:12 pe_hostfile
-rw-r--r-- 1 sgeadmin adm 6 Sep 13 13:12 pid
-rw-r--r-- 1 tdhf781 hougeo 9095 Sep 13 13:12 trace
-rw-r--r-- 1 sgeadmin adm 324 Sep 13 13:12 usage
Contents of Usage file output !!!
2. Once the value of "exit_status" is parsed from the "usage" file, the "gp_epilog" script checks whether it equals 0, 99 or 100. If it equals none of these, the "gp_epilog" script executes an "exit 100". I'm assuming the "exit_status" value in the "usage" file comes from the application run by the job or job task on the execution host (g00801 in the example above). My thinking was that if I issue an "exit 100" from within the "gp_epilog" script, the job or job task would go into error state, and I would see it in the "qstat" output with a state of "Eqw" or something similar.
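The two steps above can be sketched roughly as follows. This is not the actual "gp_epilog" script. It assumes the "usage" file holds key=value lines including one of the form exit_status=<number>, that the path to the "usage" file is passed as the first argument, and that the epilog exits 0 when the job's status is 0, 99 or 100. The function names read_exit_status and epilog_exit_code are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sketch of the gp_epilog check described above.
# Assumptions (not confirmed by the post): the "usage" file contains a
# line of the form exit_status=<number>, and its path is passed as $1.

# Print the exit_status value found in the given usage file.
read_exit_status() {
    sed -n 's/^exit_status=\([0-9][0-9]*\)$/\1/p' "$1" | head -n 1
}

# Print the exit code the epilog should use for a given job exit status.
# 0, 99 and 100 are left alone (epilog exits 0); anything else, including
# an empty value when no exit_status line was found, maps to 100.
epilog_exit_code() {
    case "$1" in
        0|99|100) echo 0 ;;
        *)        echo 100 ;;
    esac
}

# Only act when a usage file path was supplied.
if [ "$#" -gt 0 ]; then
    status=$(read_exit_status "$1")
    exit "$(epilog_exit_code "$status")"
fi
```

With the example job above, this sketch would be called with /tmp/ge/g00801/active_jobs/1012.1/usage as its argument and, given an exit_status of 137, would exit with 100.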
I've performed some tests by submitting a basic shell script which dumps the environment (i.e. env) and then performs an "exit 0", "exit 99", "exit 100", "exit 137" or some other exit status. If I set my script to "exit 0", the job exits normally. If I set it to "exit 99", the job gets requeued for execution, and if I set it to "exit 100", the job goes into error state. All of these outcomes are what I expect based on the man page for "queue_conf". However, I am unable to take any other "exit ##", trap it and force the job into error state by the method I describe.
I'm not sure whether what I'm trying to do makes sense or whether I should consider a different approach. I can look at the "starter_method" to see if this is a viable way.
Thanks in advance.
From: William Hay [mailto:w.hay at ucl.ac.uk]
Sent: Wednesday, September 14, 2016 2:38 AM
To: Lee, Wayne <wlee at hess.com>
Cc: users at gridengine.org Group <users at gridengine.org>
Subject: Re: [gridengine users] Forcing Grid Engine jobs to error state with exit status other than 0, 99 or 100.
On Tue, Sep 13, 2016 at 06:52:53PM +0000, Lee, Wayne wrote:
> In the epilog script that I've setup for our jobs, I've attempted to
> capture the value of the "exit_status" of a job or job task and if it
> isn't 0, 99 or 100, exit the epilog script with an "exit 100". However
> this doesn't appear to work.
In general when describing an issue or problem it is more helpful to describe what does happen than what doesn't. The number of things that didn't happen when you made the epilog script exit 100 is almost infinite.
> Another way of stating what I'm trying to convey is if the exit status of a
> job or job task is anything other than 0, 99 or 100 put the job in error
> state. If this can be done, then we would know that a job didn't
> complete correctly and if it is in Eqw state we have the option of
> clearing error state (i.e. qmod -cj) and re-executing the job again.
One possibility would be to write a starter_method that wraps the real job and does an exit 100 when the job terminates with an exit status other than 0 or 99.
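A minimal sketch of such a wrapper, under stated assumptions: Grid Engine passes the job script and its arguments to the starter_method as the wrapper's own arguments, and the spooled job script can be executed directly (if not, it would have to be run through the shell named in SGE_STARTER_SHELL_PATH). The function name run_wrapped_job is made up for illustration.

```shell
#!/bin/sh
# Hypothetical starter_method wrapper: run the real job, then turn any
# exit status other than 0 or 99 into 100 so the job lands in error state.
# Assumption: the job script and its arguments arrive as "$@".

run_wrapped_job() {
    "$@"            # run the real job and wait for it to finish
    rc=$?
    case "$rc" in
        0|99) return "$rc" ;;   # success and requeue pass through unchanged
        *)    return 100 ;;     # everything else becomes "error state"
    esac
}

# Only act when a job command was supplied.
if [ "$#" -gt 0 ]; then
    run_wrapped_job "$@"
    exit $?
fi
```

The wrapper would be set as the queue's starter_method (see the queue_conf man page). A job that exits with 137 would then be reported to Grid Engine as exiting with 100, while 0 and 99 keep their usual meaning.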