[gridengine users] integrate BLCR and SGE
mahbube rustaee
rustaee at gmail.com
Sun Jul 8 04:01:22 UTC 2012
On Sat, Jul 7, 2012 at 6:02 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
> Am 07.07.2012 um 15:18 schrieb mahbube rustaee:
>
> > On Wed, Jul 4, 2012 at 2:23 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
> > Am 04.07.2012 um 05:59 schrieb mahbube rustaee:
> >
> > > A parallel application. I am testing it with multiprocess and multithreaded programs now.
> > > (Open MPI) 1.4.3
> > > The lines that create the checkpoint in blcr_checkpoint.sh (the checkpoint script defined for the BLCR checkpoint environment):
> >
> > Is it working fine outside of SGE?
> > No, but I was wondering about the scripts and the thread about them. Some changes need to be made.
>
> First it must work outside of SGE. SGE doesn't provide any built-in
> checkpointing at all; it can only trigger an available checkpointing
> mechanism that already works outside of SGE.
>
>
> > I want suspending the job (qmod -sj) to make a checkpoint and free all
> > resources, and resuming the job (qmod -usj) to continue from the last checkpoint.
>
> This is not done by unsuspending the job. The checkpointing environment
> needs to be set up to checkpoint on suspend. Then the job is rescheduled
> and waiting again for execution.
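As a rough illustration of "set up to checkpoint on suspend": in a checkpoint environment definition (see checkpoint(5)), the "when" field lists the trigger events, and the letter x stands for job suspension. The sketch below only writes such a definition to a file. The environment name "blcr", the directory /path/to, and the script names other than blcr_checkpoint.sh are placeholders, not taken from the attached scripts, and the exact commands to use depend on the local BLCR setup.

```shell
#!/bin/sh
# Sketch only: write an SGE checkpoint environment definition to a file.
# "blcr", /path/to, blcr_migrate.sh and blcr_clean.sh are placeholders.
write_ckpt_env() {
    # Quoted 'EOF' keeps $job_id and $job_pid literal; SGE expands them.
    cat > "$1" <<'EOF'
ckpt_name          blcr
interface          application-level
ckpt_command       /path/to/blcr_checkpoint.sh $job_id $job_pid
migr_command       /path/to/blcr_migrate.sh $job_id $job_pid
restart_command    none
clean_command      /path/to/blcr_clean.sh $job_id $job_pid
ckpt_dir           /tmp
signal             none
when               xsm
EOF
}

# Possible usage, as an SGE admin (not run here):
#   write_ckpt_env ./blcr.ckpt
#   qconf -Ackpt ./blcr.ckpt     # add the environment from the file
#   qsub -ckpt blcr job.sh       # submit a job that uses it
```

With x in "when", a suspend (qmod -sj) should lead to a checkpoint followed by the job being rescheduled, which matches the Rq state mentioned in this thread. It is not resumed via qmod -usj.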
>
>
> > I used the suspend_method and resume_method queue parameters.
> > With them, the --stop option of cr_checkpoint just stops execution but does not
> > free resources.
> > The --kill/--term options of cr_checkpoint make a checkpoint and kill the
> > processes; the job state is "s" for a while, but
>
> Yep.
>
> The job should be in the list of waiting jobs again (state Rq).
>
Yes, I configured the BLCR checkpoint environment. When a job is suspended (qmod -sj), its state becomes "s" and it is then requeued (Rq) automatically.
How can I do that manually? I mean that the job should stay in the "s" state until I unsuspend it manually, and only then go back into the queue.

> -- Reuti
>
> > the job is removed (qstat shows nothing). In this case qmod -usj <..>
> > fails with "invalid queue or job <..>".
> >
> > What did I miss?
> >
> >
> > -- Reuti
> >
> >
> > > cpid=`pstree -p $pid | head -1 | perl -pe '$p="g\?time"; $p=cr_restart if(/cr_restart\(\d+\)/);s/.*-$p\(\d+\)[-\+]+[^(]+\((\d+)\)/$1/g;'`
> > > cr_checkpoint -f $ckptfile --run $cpid
> > >
> > > For a parallel application, $cpid contains multiple PIDs, so a command such as
> > > "cr_checkpoint -f context-ckpt --run 17147 17148" is issued.
> > > It fails for multiple PIDs!
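One possible workaround for the multiple-PID failure described above is to call cr_checkpoint once per PID, with a separate context file for each. This is a minimal sketch, not part of the attached scripts: the function name checkpoint_each and the CR_CHECKPOINT variable are hypothetical, and the per-PID file naming is an assumption. It does not coordinate the processes with each other, so for MPI jobs the checkpoint support built into Open MPI is probably the safer route.

```shell
#!/bin/sh
# Sketch only: checkpoint each PID with its own cr_checkpoint call.
# CR_CHECKPOINT and checkpoint_each are hypothetical names.
CR_CHECKPOINT=${CR_CHECKPOINT:-cr_checkpoint}

checkpoint_each() {
    ckptfile=$1   # base name of the context file, e.g. context-ckpt
    shift         # the remaining arguments are the PIDs
    rc=0
    for pid in "$@"; do
        # One context file per PID so the calls do not overwrite each other.
        $CR_CHECKPOINT -f "$ckptfile.$pid" --run "$pid" || rc=1
    done
    return $rc
}

# Possible usage inside the checkpoint script (not run here);
# $cpid is left unquoted on purpose so that it splits into separate PIDs:
#   checkpoint_each "$ckptfile" $cpid
```

If all the processes descend from a single parent, passing only that parent PID together with the --tree option of cr_checkpoint may be an alternative worth testing outside of SGE first.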
> > >
> > > All scripts used for the BLCR checkpoint are attached.
> > >
> > >
> > > On Tue, Jul 3, 2012 at 2:22 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
> > > Am 03.07.2012 um 11:29 schrieb mahbube rustaee:
> > >
> > > > I use SGE 6.2u5 and BLCR 0.8.4.
> > > > I followed this integration guide:
> > > >
> > > > https://hpcrdm.lbl.gov/pipermail/checkpoint/2010-November/000122.html
> > > >
> > > > but these scripts only work for serial jobs (one PID).
> > > > The cr_checkpoint command does not act on multiple PIDs simultaneously,
> > > > while this script uses:
> > > > cr_checkpoint -f $ckptfile --run $cpid
> > > > A parallel program consists of multiple PIDs, so the checkpoint fails.
> > >
> > > For which type of application? Open MPI supports it directly.
> > >
> > > -- Reuti
> > >
> > > <blcr_sge_scripts.tgz>
> >
> >
>
>