[gridengine users] Caught SIGTERM on termination, but handling fails after some time

Reuti reuti at staff.uni-marburg.de
Mon Dec 2 15:18:58 UTC 2013


Am 02.12.2013 um 01:45 schrieb David Dotson:

> 
> On 12/01/2013 04:03 PM, David Dotson wrote:
>> 
>> On 12/01/2013 03:13 PM, Reuti wrote:
>>> Hi,
>>> 
>>> Am 01.12.2013 um 22:20 schrieb David Dotson:
>>> 
>>>> Greetings,
>>>> 
>>>> We have the terminate_method for our queue set to SIGTERM, so that when a job running the following submission script is terminated, it copies all generated files back to the original directory. The signal is indeed caught and the copy-back starts, but it often dies without completing after a short time.
>>>> 
>>>> # BEGIN SCRIPT
>>>> #============
>>>> 
>>>> # standard gridengine script with automatic copying back of data
>>>> #$ -S /bin/bash
>>>> #$ -N grid_job
>>>> #$ -pe singlenode 16
>>>> #$ -cwd
>>>> #$ -j y
>>>> #$ -R y -r n
>>>> 
>>>> 
>>>> # set up scratch directory
>>>> WORK=/scratch/${USER}/WORK/${JOB_ID}
>>>> ORIG=$PWD
>>> The scratch directory supplied by SGE, i.e. $TMPDIR, is not sufficient? It's the one set up in the queue definition "tmpdir".
>> I was not aware of this option. It shouldn't make a difference, but could it? We may very well change our standard submission scripts to reference $TMPDIR instead if that's the case.
> I realized why we chose not to use $TMPDIR: this directory is automatically deleted on job exit. We prefer being able to salvage data in the case of a copy failure, power loss, etc.

Yep. This is correct. BTW: is all data therein valuable, or does it include unnecessary scratch files too? If so, the two could be separated: the scratch files into $TMPDIR, and the important ones into your persistent directory.
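
A minimal sketch of that split, assuming the application can be told where to put its temporary files (the SCRATCHDIR variable below is purely illustrative, not something SGE or your code necessarily reads):

    # persistent area as in your script - survives job exit, copy failure, power loss
    WORK=/scratch/${USER}/WORK/${JOB_ID}
    mkdir -p "$WORK"

    # throwaway files go to the SGE-managed scratch, removed automatically at job end
    export SCRATCHDIR=$TMPDIR

    # the copy-back then only ever has to handle the valuable files under $WORK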

-- Reuti


>>> 
>>> 
>>>> function setup_workdir () {
>>>> 
>>>>     echo "-- [$(date)] setting up $WORK"
>>>> 
>>>>     mkdir -p $WORK
>>>>     test -d $WORK || { echo "EE ERROR: Failed to make tmpdir"; exit 1; }
>>>> 
>>>>     cp $TPR $DEFFNM.cpt $DEFFNM.xtc $DEFFNM.trr $DEFFNM.edr $DEFFNM.log $WORK
>>>>     copy_success="True"
>>>> }
>>>> 
>>>> 
>>>> function cleanup_exit () {
>>>> 
>>>>     # ensure that we don't overwrite complete files with partial ones if job killed mid-copy
>>>> 
>>>>     echo "-- [$(date)] cleaning up: $WORK --> $ORIG"
>>>> 
>>>>     cp $WORK/* $ORIG || { echo "EE ERROR: Did not copy $WORK --- check manually!"; exit 1; }
>>>> 
>>>>     cd $ORIG
>>>> 
>>>>     rm -r $WORK
>>>> 
>>>>     exit 0
>>>> }
>>>> 
>>>> 
>>>> # make sure that killing the job copies back everything; won't copy back if job
>>>> # killed while copying to workstation (a good thing!)
>>>> # (GE must be configured to use SIGTERM for killing jobs!)
>>>> trap cleanup_exit TERM
>>>> 
>>>> setup_workdir
>>>> 
>>>> cd $WORK || { echo "EE ERROR: failed to cd $WORK"; exit 2; }
>>>> 
>>>> # MAIN COMPUTATION RUNS HERE
>>>> 
>>>> cleanup_exit
>>>> 
>>>> 
>>>> #============
>>>> # END SCRIPT
>>>> 
>>>> What is happening here? Is a second SIGTERM sent by gridengine after some time? If so, what is the best way to ensure this copy-back completes on qdel?
>>> Yes, this might happen - how long does the copy process take to complete? It should be recorded in the messages file of the node though (do you see a 90 sec interval?).
>> The time it takes for the job to die during cleanup appears to vary. I have had instances where it takes minutes, and others in which it takes seconds. This has made it very frustrating to figure out, and it's why I'm reaching out for some help.
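
One way to harden the handler against a second SIGTERM arriving during the copy-back - a sketch only, based on the script above and untested on your setup - is to set TERM to "ignore" as the first step of cleanup_exit. The ignore disposition is inherited by the cp child, so a later TERM sent to the process group no longer kills it mid-copy; the final SIGKILL can of course not be caught, so the grace period still has to be long enough for the copy itself:

    function cleanup_exit () {
        # ignore further SIGTERMs; children started from here on inherit the ignore,
        # so a second TERM does not interrupt the cp below (SIGKILL still would)
        trap '' TERM

        echo "-- [$(date)] cleaning up: $WORK --> $ORIG"
        cp $WORK/* $ORIG || { echo "EE ERROR: Did not copy $WORK --- check manually!"; exit 1; }
        cd $ORIG
        rm -r $WORK
        exit 0
    }
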
>>> 
>>> 
>>>> As a note, I have tried sending SIGTERM as a notification instead, and setting the `notify` queue configuration key to 24:00:00
>>> And changed the signal in SGE's configuration ("NOTIFY_KILL=sigterm") and submitted with "-notify"? This would be better than changing the "terminate_method" to a signal which must be handled inside the script, which then has to kill itself.
>> Correct. I added this key to "execd_params" in the SGE configuration, and submitted the job with the "-notify" flag. As I said, it seems to be working in some quick tests I did today, but I do recall this failing in the past when actually copying back large files (meaning the copy was killed mid-copy).
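
For reference, the pieces involved are roughly the following (the queue name, script name and the five-minute window are placeholders, not your actual values):

    qconf -mconf            # set: execd_params  NOTIFY_KILL=sigterm  (notify signal becomes SIGTERM)
    qconf -mq all.q         # set: notify        00:05:00             (grace period before the real termination)
    qsub -notify myjob.sh   # request the notification at submission time
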
>>> 
>>> 
>>>> (basically, REALLY LONG). This seems to work in some of my tests, but it has failed in actual use when copying back large data files.
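
If those failures leave half-copied files behind, one option (a sketch, with ".partial" as an arbitrary suffix) is to copy each file to a temporary name and only rename it into place once the copy has finished; the rename happens inside $ORIG, so an interrupted transfer never appears under the final file name:

    for f in "$WORK"/*; do
        base=$(basename "$f")
        cp "$f" "$ORIG/.$base.partial" && mv "$ORIG/.$base.partial" "$ORIG/$base"
    done
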
>>> What do you mean by failed - it was killed anyway?
>>> 
>>> -- Reuti
>>> 
>>> 
>>>> David
>>>> -- 
>>>> David L. Dotson
>>>> Center for Biological Physics
>>>> Arizona State University
>>>> 
>>>> Email:
>>>> dldotson at asu.edu
>>>> _______________________________________________
>>>> users mailing list
>>>> users at gridengine.org
>>>> https://gridengine.org/mailman/listinfo/users
>> 
> 
> -- 
> David L. Dotson
> Center for Biological Physics
> Arizona State University
> 
> Email: dldotson at asu.edu
> 




