[gridengine users] reschedule_unknown and state "t"

Reuti reuti at staff.uni-marburg.de
Thu Nov 8 13:50:54 UTC 2012


On 08.11.2012 at 14:41, William Hay wrote:

> Checkpoint every 6 minutes. (Found this while testing the checkpoint environment; we'll probably increase the minimum for production.)

This is the setting "min_cpu_interval" in the queue definition. Does "when" in the checkpointing environment include "r"?
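
To make the question concrete, here is a minimal Python sketch that decodes a "when" string as shown by `qconf -sckpt <name>`. The letter meanings are paraphrased from my reading of the sge_checkpoint(5) man page, so please check them against your version; the function names are hypothetical and not part of Grid Engine.

```python
# Meanings of the "when" letters in a checkpointing environment,
# paraphrased from sge_checkpoint(5). Treat them as an assumption
# and verify against the man page of your Grid Engine version.
WHEN_FLAGS = {
    "s": "checkpoint and migrate when the execd on the job's host is shut down",
    "m": "checkpoint periodically, at the queue's min_cpu_interval",
    "x": "checkpoint and migrate when the job is suspended",
    "r": "reschedule when the host stays unknown longer than reschedule_unknown",
}


def describe_when(when):
    """Return one description per letter of a 'when' string such as 'xmr'."""
    unknown = sorted(set(when) - set(WHEN_FLAGS))
    if unknown:
        raise ValueError("unexpected 'when' letters: %s" % unknown)
    return ["%s: %s" % (flag, WHEN_FLAGS[flag]) for flag in when]


def reschedules_on_unknown_host(when):
    """True if the checkpointing environment asks for rescheduling ('r')."""
    return "r" in when


if __name__ == "__main__":
    # Hypothetical value; take the real one from `qconf -sckpt <name>`.
    when = "xm"
    for line in describe_when(when):
        print(line)
    print("reschedule on unknown host:", reschedules_on_unknown_host(when))
```

If "r" is missing from "when", that alone might explain why a job with a checkpointing environment is not rescheduled after the host goes unknown.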

-- Reuti


> 
> On 8 November 2012 13:18, Reuti <reuti at staff.uni-marburg.de> wrote:
> On 08.11.2012 at 13:59, William Hay wrote:
> 
> > There was a checkpointing environment.  Also the same thing seems to happen in the first few minutes the job is running but not afterwards.
> 
> Aha, this is different according to the documentation. What was defined in the checkpointing environment for the "when" condition?
> 
> -- Reuti
> 
> 
> >
> > On 8 November 2012 12:29, Reuti <reuti at staff.uni-marburg.de> wrote:
> > On 02.11.2012 at 15:56, William Hay wrote:
> >
> > > I submitted an array job with -r y.  One of the tasks was transferring to a node (state t) when that node went down, but despite max_unheard+reschedule_unknown being exceeded, neither that task nor another task on the same node was rescheduled.  A manual qmod -rq seems to work, but having it just work would be better.
> >
> > But if the node crashes while all jobs are in state "r", it is working for you - and there was no checkpointing environment in the way?
> >
> > The array task was still shown in state "t" all the time?
> >
> >
> > > Is this a known problem?
> >
> > It's hard to provoke.
> >
> > - Reuti
> >
> >
> 
> 
> 
> 




