[gridengine users] Some simple SGE questions concerning checkpointing and "sleeping" of jobs for queue equity.
jake.carroll at uq.edu.au
Tue May 15 20:59:19 UTC 2012
A few quick questions this morning about ROCKS/SGE scheduler semantics.
1. I've got some new users who want to drive our cluster at the maximum possible efficiency, i.e. a user should be able to use as much of the cluster as is available when they submit a job. With over 1000 cores but many users, one of the things we did was limit a user's ability to take up more than about 300 or 400 slots, so that they could only ever utilise maybe 20 to 30% of the cluster at any given time. My new users don't like this, and they want to be able to use 100% of the system if it's free and no other jobs are running. Now, my understanding is that we could definitely remove that limit of 300 or 400 slots/jobs, but it would have a couple of detrimental impacts:
Primarily, it would prevent any other user from starting jobs while the first user's jobs are running, as there would be no free slots.
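To make the trade-off concrete, here is a rough toy model in Python. It is not SGE code or SGE's scheduling logic; the function name `grant_slots`, the user names, and the round figure of 1000 slots are all made up for illustration. It only shows how a per-user cap changes what a later user can start.

```python
def grant_slots(total_slots, requests, per_user_cap=None):
    """Grant slots to users in submission order (toy model, not SGE logic).

    requests: list of (user, slots_requested) tuples in arrival order.
    per_user_cap: maximum slots any one user may hold, or None for no cap.
    Returns a dict mapping user -> slots granted.
    """
    free = total_slots
    granted = {}
    for user, wanted in requests:
        held = granted.get(user, 0)
        if per_user_cap is None:
            allowed = wanted
        else:
            allowed = min(wanted, per_user_cap - held)
        # Never grant more than is free, and never a negative amount.
        got = max(0, min(allowed, free))
        granted[user] = held + got
        free -= got
    return granted


# Hypothetical workload: one user asks for everything, a second arrives later.
requests = [("alice", 1000), ("bob", 200)]
capped = grant_slots(1000, requests, per_user_cap=400)
uncapped = grant_slots(1000, requests)
print(capped)    # {'alice': 400, 'bob': 200}
print(uncapped)  # {'alice': 1000, 'bob': 0}
```

With the cap, the second user still gets slots; without it, the first user fills the cluster and the second gets nothing until jobs finish.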
2. My users told me, "No, no, you can simply put our jobs 'to sleep' when others in the queue log in to run their jobs."
Now, my understanding is that yes, this is possible (though I don't know how it's implemented: a fairshare policy, or queue weights perhaps?), BUT it has the big drawback that while a user's job is "asleep" it still holds on to its memory allocation on the node. So if another big-memory job comes along and the node becomes memory over-subscribed, crashes will ensue! Can somebody confirm that functionality, and that concern, for me?
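The memory worry can be put as simple arithmetic. The sketch below is only an illustration under one stated assumption: that a suspended job is stopped in place (in the style of SIGSTOP) and keeps its full memory allocation. The function name and the sizes in GB are hypothetical, not measurements from our cluster.

```python
def node_is_oversubscribed(node_mem_gb, suspended_jobs_gb, running_jobs_gb):
    """Return True if total memory demand exceeds the node's physical memory.

    Assumption: suspended jobs keep their full allocation, as with a
    SIGSTOP-style suspend that stops a process without freeing its memory.
    """
    demand = sum(suspended_jobs_gb) + sum(running_jobs_gb)
    return demand > node_mem_gb


# Hypothetical 64 GB node: a 48 GB job is put to sleep, then a 32 GB job starts.
with_sleeping_job = node_is_oversubscribed(64, suspended_jobs_gb=[48], running_jobs_gb=[32])
# Same node if the first job had finished instead of being suspended.
without_sleeping_job = node_is_oversubscribed(64, suspended_jobs_gb=[], running_jobs_gb=[32])
print(with_sleeping_job)     # True  (48 + 32 = 80 GB > 64 GB)
print(without_sleeping_job)  # False (32 GB <= 64 GB)
```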
3. My users want jobs to "persist" across a cluster head node crash. Would I be right in saying that it's only possible to persist across crashes if the users are using CHECKPOINTING in their jobs? I've heard of it before, but I've never implemented it and don't know where to start.
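For reference, what I understand by checkpointing at the application level is roughly the following: the job periodically writes its state to disk and, when restarted, resumes from the last saved state. This is a minimal sketch in Python under that assumption. The file layout, the function names, and the `stop_after` parameter (which simulates a crash) are all hypothetical, and how this would hook into the scheduler is not shown.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # atomic rename within the same filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh starting state if none exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"next_step": 0, "total": 0}


def run(path, steps, stop_after=None):
    """Sum the integers 0..steps-1, checkpointing after every step.

    stop_after simulates a crash by returning early after that many steps.
    """
    state = load_checkpoint(path)
    done_this_run = 0
    while state["next_step"] < steps:
        state["total"] += state["next_step"]
        state["next_step"] += 1
        save_checkpoint(path, state)
        done_this_run += 1
        if stop_after is not None and done_this_run >= stop_after:
            break
    return state


ckpt = os.path.join(tempfile.mkdtemp(), "job.ckpt")
partial = run(ckpt, steps=10, stop_after=4)  # "crashes" after 4 steps
final = run(ckpt, steps=10)                  # resumes at step 4, not step 0
print(partial["next_step"])  # 4
print(final["total"])        # 45
```

The second call picks up where the first stopped, which is the behaviour my users are asking for.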
Thank you for your time, all.