[gridengine users] pausing or holding a job and computing the next job
bharanitn at yahoo.com
Tue Mar 8 05:09:39 UTC 2011
Thank you so much for that elaborate explanation as always :)
This method works brilliantly!!!!
Thanks a ton,
--- On Mon, 7/3/11, Reuti <reuti at staff.uni-marburg.de> wrote:
From: Reuti <reuti at staff.uni-marburg.de>
Subject: Re: [gridengine users] pausing or holding a job and computing the next job
To: "Bharanidharan Narayanaswamy" <bharanitn at yahoo.com>
Cc: "users at gridengine.org Users" <users at gridengine.org>
Date: Monday, 7 March, 2011, 9:01 PM
On 07.03.2011 at 15:15, Bharanidharan Narayanaswamy wrote:
> There is a single queue available to the users. One user has submitted a job which is going to take a long time to compute. Another user has a job waiting in the queue that is much simpler and will complete in a few minutes.
> What would be the best / most effective method to run the second job in place of the first job?
> The trouble here is that there is no application level checkpointing.
> I'm using drmaa to submit batch jobs.
There are different approaches possible. What they all have in common is that, for SGE, a started job will use the requested resources until it ends; it won't release them unless it gets rescheduled or deleted.
- The long job could be started in a queue with a nice value of 19 (set via "priority" in the queue definition). The shorter job, running in a different queue with nice 0, will then get more CPU time for the short period it runs. As nice values are relative, multiple jobs with nice 19 in the long queue share the CPU among themselves the same way as multiple jobs with nice 0 would.
- The long-running jobs could be suspended by setting "subordinate_list" in the short queue's definition. This way the long-running job will be stopped during the execution of the short job and continue afterwards. This can be extended to slotwise subordination, which stops only one of the long-running jobs on a node rather than all jobs in that queue; in 6.2u5, however, slotwise subordination won't restart the suspended jobs under certain conditions.
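As a rough illustration of the two settings above, the relevant queue attributes might look like the sketch below when edited with "qconf -mq <queue>". The queue names long.q and short.q are hypothetical, the two options are alternatives rather than a combined setup, and the lines starting with # are annotations for the reader, not part of the configuration. The slotwise form at the end is written as I understand the queue_conf(5) syntax and should be checked against the man page of the installed version.

```
# Option 1: nice values (long.q and short.q are hypothetical names)
# in long.q:
priority              19
# in short.q:
priority              0

# Option 2: queue-wise subordination, set in short.q
# suspend long.q on a host as soon as 1 slot of short.q is in use there
subordinate_list      long.q=1

# Slotwise variant (assumed syntax, see queue_conf(5)), set in short.q
# suspend the shortest-running job ("sr") in long.q once more than
# 2 slots are in use on the host across both queues
subordinate_list      slots=2(long.q:0:sr)
```

With either option, a DRMAA client would then only need to direct each job to the intended queue, for example by passing "-q short.q" in the job template's native specification.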