[gridengine users] integrate BLCR and SGE

Reuti reuti at staff.uni-marburg.de
Mon Jul 9 12:29:14 UTC 2012


On 09.07.2012 at 11:53, mahbube rustaee wrote:

> Yes, I mean that the state stays "Rr" forever. From SGE's point of view the job never completes!

Aha, I misunderstood, sorry.

How does the job complete: with a normal exit from the script? What's its state in `qacct`?

-- Reuti
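
A sketch of how the `qacct` check could look: once SGE has written the accounting record, `qacct -j <job_id>` prints it, and the `failed` and `exit_status` fields show how the job ended. The helper name `qacct_summary` and the job id 4711 are made up for illustration; the helper only filters text in `qacct -j` format read from stdin.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE): keep only the fields of a
# `qacct -j` record that tell how the job ended.
qacct_summary() {
    awk '$1 == "jobnumber" || $1 == "failed" || $1 == "exit_status" { print }'
}

# Typical use, assuming 4711 is the job id in question:
#   qacct -j 4711 | qacct_summary
```

If `qacct -j` finds no record at all for the job, SGE has not yet accounted it as finished, which would match a job that still shows "Rr" in `qstat`.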


> On Mon, Jul 9, 2012 at 1:48 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
> On 09.07.2012 at 06:28, mahbube rustaee wrote:
> 
> > When the job is restarted and rescheduled, it remains in state "Rr" even after it has completed!
> > The top command shows that some processes are still running, too.
> > I checked ckpt.log for the clean_method script, and it shows that clean_method does not run.
> > With qdel, clean_method does run.
> 
> Correct. This is the intended behavior. The purpose of the clean_method is to remove any checkpoint files after the job has finished. You don't want to do that when the job is merely being checkpointed.
> 
> There are nice state diagrams in:
> 
> http://arc.liv.ac.uk/SGE/howto/APSTC-TB-2004-005.bookpdf
> 
> -- Reuti
> 
> 
> >
> > On Mon, Jul 9, 2012 at 3:58 AM, Reuti <reuti at staff.uni-marburg.de> wrote:
> > On 08.07.2012 at 06:01, mahbube rustaee wrote:
> >
> > >
> > >
> > > On Sat, Jul 7, 2012 at 6:02 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
> > > On 07.07.2012 at 15:18, mahbube rustaee wrote:
> > >
> > > > On Wed, Jul 4, 2012 at 2:23 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
> > > > On 04.07.2012 at 05:59, mahbube rustaee wrote:
> > > >
> > > > > A parallel application. I am testing multiprocess and multithreaded ones for now.
> > > > > (Open MPI 1.4.3)
> > > > > The lines that take the checkpoint in blcr_checkpoint.sh (the checkpoint script defined for BLCR checkpointing) are:
> > > >
> > > > Is it working fine outside of SGE?
> > > > No, but I was wondering about the scripts and the thread about them. Some changes need to be made.
> > >
> > > First it must work outside of SGE. SGE doesn't provide any built-in checkpointing at all; it can only trigger an available checkpointing mechanism that already works outside of SGE.
> > >
> > >
> > > > I want suspending a job (qmod -sj) to take a checkpoint and free all resources, and resuming it (qmod -usj) to continue from the last checkpoint.
> > >
> > > This is not done by unsuspending the job. The checkpointing environment needs to be set up to checkpoint on suspend. Then the job is rescheduled and waits for execution again.
> > >
> > >
> > > > I used the suspend_method and resume_method queue parameters.
> > > > For that, the --stop option of cr_checkpoint just stops execution but does not free the resources.
> > > > The --kill/--term options of cr_checkpoint take a checkpoint and kill the processes; the job state is "s" for a while, but
> > >
> > > Yep.
> > >
> > > The job should be in the list of waiting jobs again (state Rq).
> > >
> > > Yes, I configured the BLCR checkpoint environment. When a job is suspended (qmod -sj), its state becomes "s" and it is then queued again (Rq) automatically.
> >
> > This is the normal behavior.
> >
> > > How can I do that manually? I mean the job should stay in state "s" until I unsuspend it manually, and only then be queued again.
> >
> > In the migration script you could put a hold on the job, which you then have to release later on.
> >
> >
> >
> > -- Reuti
> >
> >
> > > -- Reuti
> > >
> > > > the job is then removed (qstat shows nothing). In this case qmod -usj <..> fails with "invalid queue or job <..>".
> > > >
> > > > What did I miss?
> > > >
> > > >
> > > > -- Reuti
> > > >
> > > >
> > > > > cpid=`pstree -p $pid | head -1 | perl -pe '$p="g\?time"; $p=cr_restart  if(/cr_restart\(\d+\)/);s/.*-$p\(\d+\)[-\+]+[^(]+\((\d+)\)/$1/g;'`
> > > > > cr_checkpoint -f $ckptfile --run $cpid
> > > > >
> > > > > For a parallel application, $cpid contains multiple pids, and a command such as
> > > > > "cr_checkpoint -f context-ckpt --run 17147 17148" is issued.
> > > > > It fails for multiple pids!
> > > > >
> > > > > All scripts used for the BLCR checkpoint are attached.
> > > > >
> > > > >
> > > > > On Tue, Jul 3, 2012 at 2:22 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
> > > > > On 03.07.2012 at 11:29, mahbube rustaee wrote:
> > > > >
> > > > > > I use SGE 6.2u5 and BLCR 0.8.4.
> > > > > > I followed the integration guide at this URL:
> > > > > >
> > > > > > https://hpcrdm.lbl.gov/pipermail/checkpoint/2010-November/000122.html
> > > > > > but these scripts only work for serial jobs (one pid).
> > > > > > The cr_checkpoint command does not act on multiple pids simultaneously, while this script uses:
> > > > > > cr_checkpoint -f $ckptfile --run $cpid
> > > > > > A parallel program consists of multiple pids, so the checkpoint fails.
> > > > >
> > > > > For which type of application? Open MPI supports it directly.
> > > > >
> > > > > -- Reuti
> > > > >
> > > > > <blcr_sge_scripts.tgz>
> > > >
> > > >
> > >
> > >
> >
> >
> 
> 
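
The hold suggested in the thread for the migration script could be sketched as follows. `qhold` and `qrls` are the standard SGE commands for setting and releasing a user hold; the function names, the `QHOLD`/`QRLS` override variables and the way the job id reaches the script are assumptions made for illustration. The sketch also assumes that the execution host is allowed to run `qhold` (it may need to be a submit or admin host).

```shell
#!/bin/sh
# Sketch only: hold a job from the migration script so that, after the
# checkpoint, it stays pending until someone releases it by hand.
# QHOLD/QRLS can be overridden (e.g. for a dry run); they default to the
# real SGE commands.
QHOLD=${QHOLD:-qhold}
QRLS=${QRLS:-qrls}

# Called from the migration script; $1 is the job id. Passing it in as
# an argument (e.g. via $job_id in migr_command) is an assumption.
hold_job() {
    [ -n "$1" ] || { echo "usage: hold_job <job_id>" >&2; return 2; }
    "$QHOLD" "$1"
}

# Run later by the user to let the job be scheduled again.
release_job() {
    [ -n "$1" ] || { echo "usage: release_job <job_id>" >&2; return 2; }
    "$QRLS" "$1"
}
```

With this, the migration script would call `hold_job` before the job is handed back to the scheduler, and the user would run `release_job <job_id>` (or plain `qrls <job_id>`) once the job should be eligible to run again.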




