[gridengine users] Trying to get checkpointing to work

Wouter Verhelst Wouter.Verhelst at huawei.com
Tue Aug 5 11:09:06 UTC 2014

Hi folks,

I'm currently replacing an ageing N1 gridengine 6.0u4 with Son of Gridengine 8.1.7. Since the machine that's running the N1 qmaster is due to be replaced in the coming months, I'm taking this opportunity to rethink the way our setup is done.

One thing I'm trying to do is to get checkpointing (with BLCR) to work so that jobs can be migrated to different machines if necessary. Our cluster does not consist of machines with identical specs; some machines have more memory than others. When we're running a lot of small jobs that don't each require many resources, it doesn't matter which machine they run on, and we want to spread the load over as many machines as possible so that the jobs finish as quickly as possible. However, if such jobs are running when a user submits a job that does require a lot of resources, I want gridengine to checkpoint the jobs on the high-memory machine to make room for the new job, so that it doesn't have to wait for a large number of small jobs to finish (which may take a long time).

I've done the following so far:

qconf -sq all.q|grep starter
starter_method     /usr/local/bin/blcr_start_job

This script checks whether the $RESTARTED environment variable is set and whether a checkpoint file exists. If so, it execs cr_restart; otherwise, it execs cr_run.
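
For readers who want to see the shape of such a starter script, here is a minimal sketch of the decision it makes. It is not the actual /usr/local/bin/blcr_start_job. The function names, the context.<job id> file naming and the fallback directory are assumptions made for illustration, and it assumes gridengine sets RESTARTED to 1 for a restarted job and exports SGE_CKPT_DIR for checkpointing jobs.

```shell
#!/bin/sh
# Sketch of a starter_method wrapper; function and file names are hypothetical.

# Print the BLCR launcher to use: cr_restart when the job was restarted
# and a checkpoint file exists, cr_run otherwise.
choose_launcher() {
    restarted=$1    # value of $RESTARTED (assumed to be 1 after a restart)
    context=$2      # path to the checkpoint (context) file
    if [ "$restarted" = "1" ] && [ -f "$context" ]; then
        echo cr_restart
    else
        echo cr_run
    fi
}

# Assumed layout: one context file per job under the checkpoint directory.
context_file() {
    echo "${SGE_CKPT_DIR:-/opt/sge/default/common/ckpoint}/context.$1"
}

# A real starter script would end with something like:
#   ctx=$(context_file "$JOB_ID")
#   case $(choose_launcher "$RESTARTED" "$ctx") in
#       cr_restart) exec cr_restart "$ctx" ;;
#       *)          exec cr_run "$@" ;;
#   esac
```

Keeping the decision in a function that only prints its choice makes it possible to test the logic on a machine without BLCR installed.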

qconf -sckpt BLCR
ckpt_name          BLCR
interface          APPLICATION-LEVEL
ckpt_command       /usr/local/bin/blcr_checkpoint
migr_command       /usr/local/bin/blcr_migrate
restart_command    NONE
clean_command      /usr/local/bin/blcr_clean
ckpt_dir           /opt/sge/default/common/ckpoint
signal             NONE
when               xsr

These scripts are based on the BLCR HOWTO by Peng and Ng from 2004, modified to account for the fact that BLCR now supports checkpointing an entire process tree.
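
As an illustration of the process-tree point, the core of a checkpoint script could look roughly like the sketch below. It is not the actual /usr/local/bin/blcr_checkpoint. The function name and the context.<job id> file naming are hypothetical, and how the script obtains the job's PID, the checkpoint directory and the job id is left open (the configuration above passes no arguments to ckpt_command).

```shell
#!/bin/sh
# Sketch of the command a ckpt_command wrapper might build; names are hypothetical.

# Print a cr_checkpoint command line that checkpoints the whole process
# tree rooted at PID into a per-job context file under DIR.
checkpoint_cmd() {
    pid=$1    # PID of the job's top-level process
    dir=$2    # checkpoint directory (ckpt_dir)
    job=$3    # gridengine job id
    echo "cr_checkpoint --tree --file $dir/context.$job $pid"
}

checkpoint_cmd 1234 /opt/sge/default/common/ckpoint 42
# prints: cr_checkpoint --tree --file /opt/sge/default/common/ckpoint/context.42 1234
```

The function only prints the command so the sketch can be run without BLCR; a real script would execute it instead.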

Finally, there's also this bit:

qconf -sq hiprio|grep -E 'subordinate|starter'
starter_method        /usr/local/bin/blcr_start_job
subordinate_list      slots=12(all.q:0:sr)

When I submit a job in the hiprio queue, it does suspend jobs in all.q, but I don't see it checkpointing the jobs.

Any hints as to what I'm missing?
