[gridengine users] h_rt and job exit code

Reuti reuti at staff.uni-marburg.de
Mon Oct 29 17:06:48 UTC 2012


On 29.10.2012 at 17:12, Julien Nicoulaud wrote:

> For those interested, I worked around the issue by switching from qsub to qrsh, everything seems to work fine so far.

Thx for sharing this info.

-- Reuti


> 2012/10/18 Reuti <reuti at staff.uni-marburg.de>
> On 21.09.2012 at 18:06, Julien Nicoulaud wrote:
> 
> > Yes, still the same question, I'm trying to get a proper exit code for "qsub -sync y" :)
> > When I talk about graceful shutdown, I only talk about the slaves. It really seems to me that whatever happens, if the slave tasks are not cleanly shut down, qsub will always show this "Unable to run job" message and return 0.
> 
> The only thing I can think of is a wrapper that checks for a file which is written only at the regular end of the job, and sets its exit value accordingly.
> 
> -- Reuti
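
The wrapper idea above could be sketched roughly as follows. This is only an illustration under assumptions not stated in the thread: the function name `run_and_check` and the marker path are hypothetical, and the job script itself is assumed to create the marker file as its very last step, on a filesystem that the submit host can also see.

```python
import os
import subprocess


def run_and_check(submit_cmd, marker_path):
    """Run submit_cmd, then derive an exit code from a completion marker.

    submit_cmd  -- argument list, e.g. the "qsub -sync y ..." call
    marker_path -- file the job is assumed to create only on regular completion
    """
    # Remove a stale marker from an earlier run so it cannot mask a failure.
    if os.path.exists(marker_path):
        os.remove(marker_path)

    result = subprocess.run(submit_cmd)

    # A non-zero status from the submit command is passed through unchanged.
    if result.returncode != 0:
        return result.returncode

    # A status of 0 is trusted only if the job wrote its marker file.
    return 0 if os.path.exists(marker_path) else 1
```

A call might look like `sys.exit(run_and_check(["qsub", "-sync", "y", "job.sh"], "/shared/job.done"))`, where `job.sh` and `/shared/job.done` are placeholders. The point of the design is that a job killed at h_rt never reaches the line that writes the marker, so the wrapper reports 1 even when qsub itself returns 0.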
> 
> 
> > 2012/9/21 Reuti <reuti at staff.uni-marburg.de>
> > On 21.09.2012 at 16:13, Julien Nicoulaud wrote:
> >
> > > I tried to implement the -notify + trap USR2 solution, but could not get it to work. I can trap the USR2 signal in the qmaster PE script, but as soon as it is sent, the slave tasks get killed, leaving my application no time to cleanly shut them down. The qmaster log displays:
> >
> > Is this a new question? Originally you wanted to get a proper exit code for -sync y, now to gracefully shut down.
> >
> > -- Reuti
> >
> >
> > > tightly integrated parallel task 61969.1 task 1.computeXX failed - killing job
> > >
> > > The queue is configured with "notify 00:00:60", so that should leave at least one minute. I also tried to trap USR2 in the PE script and not forward it at all to child processes, but slave tasks still get killed. Is there something else specific to do to avoid this?
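
For reference, the master-side half of the "-notify + trap USR2" approach could look roughly like the sketch below. It assumes the job's main process is a Python script submitted with -notify, so that it receives SIGUSR2 ahead of the final kill. The names `state`, `on_usr2` and `run_steps` are hypothetical. It does not address the problem reported above, where the slave tasks of the tightly integrated PE are killed anyway.

```python
import signal

# Mutable flag set by the signal handler and polled by the main loop.
state = {"shutdown": False}


def on_usr2(signum, frame):
    # Only record the request; the actual cleanup happens in the main loop.
    state["shutdown"] = True


signal.signal(signal.SIGUSR2, on_usr2)


def run_steps(steps):
    """Run callables in order, stopping early once SIGUSR2 has been seen.

    Returns the number of steps that were actually executed.
    """
    done = 0
    for step in steps:
        if state["shutdown"]:
            break
        step()
        done += 1
    return done
```

Keeping the handler down to setting a flag means the cleanup code runs in the normal control flow rather than inside the signal handler.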
> > >
> > > 2012/9/19 Julien Nicoulaud <julien.nicoulaud at gmail.com>
> > > Yes, that's what I meant. For me, if control_slaves is FALSE, qsub returns with a non-zero exit code after h_rt has elapsed.
> > >
> > >
> > > 2012/9/19 Reuti <reuti at staff.uni-marburg.de>
> > > Hi,
> > >
> > > On 19.09.2012 at 14:36, Julien Nicoulaud wrote:
> > >
> > > > On SGE 6.2u5, I submit jobs with -sync y and h_rt. When a job gets killed after the time has elapsed, qsub prints an "Unable to run job" message but exits with code 0. I tried to trap the KILL signal
> > > > inside the job script, but it does not seem to affect the qsub return code. Is it possible to make it return 1?
> > > >
> > > > Note: it only behaves this way for jobs running in a tightly integrated parallel environment. In a loosely integrated PE, qsub returns 1 in this case...
> > >
> > > You mean the setting of "control_slaves"? For me it's always 0 if I request a PE.
> > >
> > > -- Reuti
> > >
> > >
> >
> >
> 
> 




