[gridengine users] How to use condor checkpointing with SGE
dowobeha at gmail.com
Wed Mar 16 20:34:16 UTC 2011
On Wed, Mar 16, 2011 at 4:24 PM, Lane Schwartz <dowobeha at gmail.com> wrote:
> On Wed, Mar 16, 2011 at 2:47 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
>> Am 16.03.2011 um 19:35 schrieb Lane Schwartz:
>> > <snip>
>> > The job gets queued up and assigned to run, and the stderr and stdout
>> files are created. When a checkpointable job starts, condor and DMTCP each
>> print a small log message. That log message shows up in the logs. But no
>> output from my program appears. SGE lists my job's status as "r" but when I
>> ssh in to the machine where the job is running and run ps aux, ps lists my
>> job's status as suspended.
>> > When I launch my checkpointable jobs locally (not using qsub) they run
>> and produce immediate output. When I run those same jobs using qsub, they go
>> into "r" status, but never produce output and appear to not be actually
>> > On a related topic, using 6.2u5p1 I've had mixed results following the
>> checkpointing interface tutorial at
>> http://gridscheduler.sourceforge.net/howto/checkpointing.html. The
>> initial examples describe setting up a transparent interface and running it
>> with some simple shell scripts; I've been able to get these to work as
>> described. I've also followed the examples for setting up an application-level
>> interface with shell scripts; that works, but only the migr_command and
>> clean_command appear to run. When I run example 6, which uses condor in
>> conjunction with transparent checkpointing, no condor checkpoint files are
>> You set usr2 as the signal to be used and waited at least
>> min_cpu_interval? Still no checkpoint file is created in /home/checkpoint or
>> alike? Can you try sending usr by hand to the complete process group on the
> I can confirm. I ran the following, and no checkpoint file was created:
> $ qsub -ckpt transparent -b y -cwd -V ./condor_transparent6.sh
> $ qstat
> ... lists the above job in state "r", with job-ID 114 ...
> $ ps aux | grep lane; ps -eo "%U %p %r %a %c" | grep lane
> ... lists the processes associated with the job.
> ... The parent process has PID 10240, and is "-csh c
> ./condor_transparent6.sh" with ps state "Ss"
> ... The second process has PID 10322, and is "/bin/bash
> ./condor_transparent6.sh" with ps state "S"
> ... The third process has PID, and is running the actual condor-linked
> binary with ps state "S"
> ... These three processes share the process group ID (PGID) 10240.
> $ kill -s USR2 -- -10240
> $ qstat
> ... My job is no longer listed ...
> $ ls /tmp/114
> ... No files are listed. The directory exists, though, which makes sense
> since "Checkpoint Directory" is set to /tmp in the checkpointing
> My checkpoint interface definition is below:
> Name: transparent
> Interface: TRANSPARENT
> Checkpoint command: NONE
> Migrate command: NONE
> Clean command: NONE
> Checkpoint directory: /tmp
> Checkpoint When: xsr
> Checkpoint Signal: NONE
> This is all on a sandbox grid setup using version 6.2u5p1. The script is a
> slightly modified version of the condor_transparent6.sh script in the howto
> (I added some echo statements to print variable values). The binary is a toy
> C++ program that increments integer values then prints them out in a big
Just to make sure it wasn't my toy binary, I just re-ran with your ever.c
program. Using that, a checkpoint file was created. My toy binary used the
sleep command. OK, this is good. :)