[gridengine users] Trying to get checkpointing to work
Wouter.Verhelst at huawei.com
Tue Aug 5 11:09:06 UTC 2014
I'm currently replacing an ageing N1 gridengine 6.0u4 with Son of Gridengine 8.1.7. Since the machine that's running the N1 qmaster is due to be replaced in the coming months, I'm taking this opportunity to rethink the way our setup is done.
One thing I'm trying to do is get checkpointing (with BLCR) to work so that jobs can be migrated to different machines if necessary. Our cluster does not consist of machines with identical specs; some machines have more memory than others. When we're running a lot of small jobs that don't need many resources each, it doesn't matter which machine they run on, and we want to spread the load over as many machines as possible so that the jobs finish as quickly as possible. However, if such jobs are running when a user submits a job that does need a lot of resources, I want gridengine to checkpoint the jobs on the high-memory machine to make room for the new job, so that it doesn't have to wait for a large number of small jobs to finish (which may take a long time).
I've done the following so far:
qconf -sq all.q|grep starter
This script checks whether the $RESTARTED environment variable is set and whether a checkpoint file exists. If so, it execs cr_restart; otherwise, it execs cr_run.
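For readers who want to picture that logic, here is a minimal sketch of such a starter method. It is not the actual script from my setup: the checkpoint directory, the context.<job id> file name, and the choose_launcher helper are assumptions made for illustration. RESTARTED and JOB_ID are the variables gridengine exports to the job environment, and the sketch treats RESTARTED=1 as "restarted".

```shell
#!/bin/sh
# Hypothetical starter_method sketch (illustration only).
# Assumption: checkpoints live in $CKPT_DIR as context.<job id>.
CKPT_DIR="${CKPT_DIR:-/tmp}"

# Print which BLCR launcher to use:
#   cr_restart if the job was restarted AND a checkpoint file exists,
#   cr_run     in every other case.
choose_launcher() {
    restarted="$1"
    ckpt_file="$2"
    if [ "$restarted" = "1" ] && [ -f "$ckpt_file" ]; then
        echo cr_restart
    else
        echo cr_run
    fi
}

# gridengine calls the starter method with the job command as arguments,
# so only launch anything when arguments were actually passed.
if [ "$#" -gt 0 ]; then
    ckpt_file="$CKPT_DIR/context.$JOB_ID"
    case "$(choose_launcher "${RESTARTED:-0}" "$ckpt_file")" in
        cr_restart) exec cr_restart "$ckpt_file" ;;  # resume from the checkpoint
        *)          exec cr_run "$@" ;;              # fresh start under BLCR
    esac
fi
```

Keeping the decision in a small function makes it easy to check by hand (for example, choose_launcher 1 /path/to/context.42) without BLCR installed.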
qconf -sckpt BLCR
These scripts are based on the BLCR HOWTO by Peng and Ng from 2004, modified to account for the fact that BLCR can checkpoint an entire process tree these days.
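As a rough idea of what the checkpoint side can look like with process-tree support, here is a minimal sketch. It is not one of the actual scripts: the argument order, the context.<job id> file name, and the context_file helper are assumptions, and it presumes the checkpoint environment passes the job PID, checkpoint directory, and job id (for example via the $job_pid, $ckpt_dir and $job_id pseudo-variables in ckpt_command).

```shell
#!/bin/sh
# Hypothetical ckpt_command sketch (illustration only).
# Assumed invocation: <this script> <job_pid> <ckpt_dir> <job_id>

# Build the path of the checkpoint file for a given directory and job id.
context_file() {
    echo "$1/context.$2"
}

# Only act when called with exactly the three expected arguments.
if [ "$#" -eq 3 ]; then
    job_pid="$1"
    ckpt_dir="$2"
    job_id="$3"
    # --tree checkpoints the whole process tree rooted at job_pid,
    # -f selects the file the checkpoint is written to.
    exec cr_checkpoint --tree -f "$(context_file "$ckpt_dir" "$job_id")" "$job_pid"
fi
```

The file name has to match whatever the starter method looks for on restart, which is why it is isolated in one helper here.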
Finally, there's also this bit:
qconf -sq hiprio|grep -E 'subordinate|starter'
When I submit a job in the hiprio queue, it does suspend jobs in all.q, but I don't see it checkpointing the jobs.
Any hints as to what I'm missing?