[gridengine users] Trying to get checkpointing to work
Wouter.Verhelst at huawei.com
Tue Aug 12 09:51:21 UTC 2014
> However, I've since noticed that there was actually an error in my
> script, which caused it to fail. This would probably explain why it
> wasn't doing any checkpoints...
I've since debugged my script, and my initial tests show that checkpointing a job with BLCR (due to subordinate queueing) and restarting it on the same host later works perfectly. So far so good.
I'm now also trying to migrate a job from one host to another, but I'm bumping into an issue that I don't immediately see a solution for:
When restarting a job, cr_restart tries to restore it in exactly the same context it had before. This includes the files it was writing to, reading from, etc. Unfortunately, that also includes the job script that is actually being run, which is <execd_spooldir>/<hostname>/<jobid> or some such. When a job is migrated to another host, cr_restart therefore tries to open a file in the old host's spooldir, which no longer exists (the files have been moved to the new host's spooldir).
cr_restart has an option '--relocate' which would allow me to fix this issue, but to use it I would need to know the hostname of the host where the checkpoint was created. As far as I can see, that isn't information that SGE stores, but I might be missing something...?
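One possible workaround, sketched here only as an illustration: have the checkpoint step write the local hostname into a small file next to the checkpoint, and have the restart step read it back to build the '--relocate' argument. Everything below is an assumption rather than something SGE or BLCR provides: the file name "checkpoint_host", the helper names, the spooldir layout <spool_base>/<hostname> (taken from the "or some such" description above), and the OLDPATH=NEWPATH form of the '--relocate' argument, which should be checked against the cr_restart man page for the installed BLCR version.

```python
import os
import socket
import tempfile

# Hypothetical file name; SGE itself does not create this file.
HOST_FILE = "checkpoint_host"


def record_checkpoint_host(ckpt_dir, hostname=None):
    """Store the name of the host that created the checkpoint.

    Meant to be called from the checkpoint script, right after the
    checkpoint has been written to ckpt_dir.
    """
    hostname = hostname or socket.gethostname()
    with open(os.path.join(ckpt_dir, HOST_FILE), "w") as fh:
        fh.write(hostname + "\n")
    return hostname


def build_restart_command(ckpt_dir, context_file, spool_base, new_host=None):
    """Build a cr_restart argument list for the current host.

    Assumes the job script lives under <spool_base>/<hostname>/..., so
    the old spooldir prefix is mapped onto the new host's prefix.
    """
    new_host = new_host or socket.gethostname()
    with open(os.path.join(ckpt_dir, HOST_FILE)) as fh:
        old_host = fh.read().strip()

    cmd = ["cr_restart"]
    if old_host != new_host:
        old_prefix = os.path.join(spool_base, old_host)
        new_prefix = os.path.join(spool_base, new_host)
        # Assumed argument form: --relocate OLDPATH=NEWPATH
        cmd += ["--relocate", f"{old_prefix}={new_prefix}"]
    cmd.append(context_file)
    return cmd


if __name__ == "__main__":
    # Dry run with made-up host names and paths; nothing is executed.
    with tempfile.TemporaryDirectory() as ckpt_dir:
        record_checkpoint_host(ckpt_dir, "node01")
        print(build_restart_command(ckpt_dir, "context.1234",
                                    "/var/spool/sge", "node02"))
```

The sketch only builds the command line; a restart script would still have to run it (for example via subprocess) and would need the checkpoint directory to be on storage shared between the hosts.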