[gridengine users] Different Error Codes for Job Failure

Reuti reuti at staff.uni-marburg.de
Fri Mar 1 11:15:13 UTC 2013


On 01.03.2013 at 08:16, S Barve wrote:

> We are facing the same issue. Apparently, the signal sent by 'qdel' is SIGKILL regardless of whether jobs are terminated by users or by administrators. 
> 
> We tried a couple of things to distinguish between a job killed by a user with 'qdel' and a job killed by the administrator with 'qdel':
> 
> 1) Change 'terminate_method' for the user's queue to "SIGTERM" and have the user submit a job with the '-notify' flag. However, the same signal is recorded in the job output file. 
> 
> 2) Change 'terminate_method' for the user's queue to point to a custom script for killing jobs. We tried to capture the user ID of the calling process (qdel) in the script. However, that user ID is reported as '0' whether the qdel command is invoked by the user or by the administrator.
> 
> Is there a way to know which user has invoked the qdel command? That might help us figure out who killed the job. 

The user who initiated the `qdel` is recorded at info level in the qmaster's messages file (this requires "loglevel log_info" to be set in SGE's configuration):

03/01/2013 11:38:53|worker|pc15370|I|reuti has registered the job 5658 for deletion
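
For billing purposes, such entries could be collected with a small script. The following is only a minimal sketch: the regular expression is modelled on the single log line above, so other SGE versions (or array-task deletions) may word the entry differently, and the names `DELETION_RE`, `Deletion` and `find_deletions` are made up for illustration.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Pattern modelled on the one log line quoted above (assumption: the
# fields are date time|thread|host|level|message, separated by '|').
DELETION_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2})\|"
    r"(?P<thread>[^|]*)\|(?P<host>[^|]*)\|I\|"
    r"(?P<user>\S+) has registered the job (?P<job_id>\d+) for deletion"
)


class Deletion(NamedTuple):
    timestamp: str  # kept as text, exactly as logged
    host: str
    user: str       # the account that ran qdel
    job_id: int


def find_deletions(lines: Iterable[str]) -> Iterator[Deletion]:
    """Yield one Deletion per matching qmaster log line; skip the rest."""
    for line in lines:
        m = DELETION_RE.match(line.strip())
        if m:
            yield Deletion(
                f"{m['date']} {m['time']}", m["host"], m["user"], int(m["job_id"])
            )


if __name__ == "__main__":
    sample = [
        "03/01/2013 11:38:53|worker|pc15370|I|reuti has registered the job 5658 for deletion",
    ]
    for d in find_deletions(sample):
        print(d.job_id, d.user, d.timestamp)
```

In practice the lines would come from the opened messages file instead of `sample`. Comparing `user` with the job owner from `qacct -j <jobid>` would then indicate whether the owner or somebody else (e.g. an administrator) requested the deletion.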

-- Reuti


> Thanks and regards,
> Saurabh Barve
> Pune,Maharashtra
> India
> Mailto: s.barve at tcs.com
> Website: http://www.tcs.com
> 
> 
> From:	Kshitiz B <kshitiz.b at tcs.com>
> To:	users at gridengine.org
> Date:	03/01/2013 11:22 AM
> Subject:	[gridengine users] Different Error Codes for Job Failure
> Sent by:	users-bounces at gridengine.org
> 
> 
> 
> 
> How can we distinguish between the following scenarios, each of which leads to job deletion?
> 
> 1. Slave Node Failure
> 2. Master/Shepherd Node Failure
> 3. Job deleted by User
> 4. Job deleted by Admin
> 
> Only after determining which of the above scenarios led to the job deletion can we bill the customer correctly.
> 
> qacct -j <jobid> gives:
> 1. failed: but it does not cover error codes for the above scenarios
> 2. exit_status: how should it be used in this context?
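
On the exit_status question: according to the accounting(5) man page, the value follows the usual shell convention, so a job that dies from a signal is reported as 128 plus the signal number (SIGKILL gives 128 + 9 = 137). A minimal sketch of decoding it follows; the function name `describe_exit_status` is hypothetical, and the value says nothing about who sent the signal.

```python
import signal


def describe_exit_status(exit_status: int) -> str:
    """Interpret a qacct exit_status using the 128 + signal convention."""
    if exit_status > 128:
        signum = exit_status - 128
        try:
            # Map the number to a name such as SIGKILL where the platform knows it.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {exit_status}"


if __name__ == "__main__":
    for status in (0, 1, 137, 143):
        print(status, "->", describe_exit_status(status))
```

Note that a job script can itself exit with a value above 128, so this reading is a heuristic and not proof of a kill.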
> _______________________________________________
> users mailing list
> users at gridengine.org
> https://gridengine.org/mailman/listinfo/users
> 
> 




