[gridengine users] cgroups Integration in OGS/GE 2011.11 update 1
rayrayson at gmail.com
Fri Jun 8 14:25:27 UTC 2012
On Fri, Jun 1, 2012 at 6:19 AM, Mark Dixon <m.c.dixon at leeds.ac.uk> wrote:
> My underlying concern is that sometimes it is appropriate to set an address
> space limit and sometimes it isn't, for the reasons we both put forward
> previously in this thread. Users should therefore have some control over it.
> I hope we agree on this much?
Yes. It's good that we have your feedback (and William Hay's) *before*
we release this new feature!
Also, as I brought up a while ago, Grid Engine itself kills
the job if its memory usage exceeds the limit. So the end result is
likely very similar, i.e. the user finds that the job gets killed.
> Ah-ha, that's the missing piece that's been confusing me!
> The options here seem to be:
> 1) Ask the kernel people nicely to give us a per-cgroup address space limit.
> I don't think they will see much point of this.
I am working with other software that uses cgroups, and someone
pointed me to an email yesterday:
I don't think the per-cgroup AS limit is that hard - in the end the
"memory.memsw.usage_in_bytes" file already shows the recorded max
memory+Swap usage. I believe the kernel does its own internal
accounting when memory is allocated, so it may be a matter of
enforcing the limit in a different way.
Of course, the difficult part is that the kernel overcommits memory,
so it's likely that actual memory is not allocated & accounted by
"memory.memsw.usage_in_bytes" until page faults occur - which was what
we've discussed previously...
I was thinking about this issue again last night:
First of all, in the end a job's "h_vmem" (reported by Grid Engine's
PDC) should always be greater than or equal to
"memory.memsw.usage_in_bytes" reported by the kernel.
Maybe I should also add a test case to test for it. But can you think
of a case where this does not hold, i.e. where:
( h_vmem < memory.memsw.usage_in_bytes )
It should work something like this:
- initially, when a job allocates memory, the h_vmem should go up
immediately but not the accounted virtual size / address space usage
in "memory.memsw.usage_in_bytes" - so in theory if the pages are not
faulted at that time, "memory.memsw.usage_in_bytes" should be very
close to zero.
- as more and more pages are used (and thus causing page faults), the
memory usage reported by "memory.memsw.usage_in_bytes" should get
closer & closer to h_vmem.
- then in the extreme case, when every single page is used,
h_vmem == memory.memsw.usage_in_bytes .
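The three stages above can be reproduced inside a single process. What follows is only a minimal sketch, not OGS/GE code. It assumes Linux, uses "VmSize" from /proc/<pid>/status as a stand-in for the virtual size the PDC reports, and uses "VmRSS" as a rough per-process proxy for what the kernel accounts once pages are faulted in. The helper names (parse_status_kb, snapshot, touch_pages) are made up for illustration.

```python
import mmap

PAGE = mmap.PAGESIZE


def parse_status_kb(text):
    """Extract VmSize and VmRSS (in kB) from the text of /proc/<pid>/status."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            fields[key] = int(rest.split()[0])
    return fields


def snapshot(pid="self"):
    """Read the current VmSize/VmRSS of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        return parse_status_kb(fh.read())


def touch_pages(region, fraction):
    """Write one byte into each of the first `fraction` of the pages,
    which forces the kernel to actually back those pages."""
    n_pages = len(region) // PAGE
    for i in range(int(n_pages * fraction)):
        region[i * PAGE] = 1
```

Mapping an anonymous region with mmap.mmap(-1, 256 * 1024 * 1024) and calling snapshot() before any write, after touch_pages(region, 0.5), and after touch_pages(region, 1.0) should show VmSize jumping at once while VmRSS climbs only as pages are touched.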
If my logic is correct, then there shouldn't be any issue related to jobs
getting killed due to this change (which is more important than
anything - killing innocent jobs is like killing innocent people.
While Jack Bauer kills a few innocent bad guys in every season of 24,
his first priority is always about saving innocent people... and we
should do the same as well!).
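The test case mentioned earlier (PDC-reported vmem >= "memory.memsw.usage_in_bytes") could look roughly like the sketch below. It is an illustration under assumptions: a cgroup v1 memory controller with swap accounting enabled, so that the memsw file exists, and a per-job cgroup directory whose path is supplied by the caller. The function names are hypothetical and not part of OGS/GE.

```python
import os


def read_memsw_usage(cgroup_dir):
    """Return the memory+swap usage in bytes as accounted by the kernel."""
    path = os.path.join(cgroup_dir, "memory.memsw.usage_in_bytes")
    with open(path) as fh:
        return int(fh.read().strip())


def check_invariant(job_vmem_bytes, cgroup_dir):
    """True if the PDC-reported vmem is >= the kernel's memsw usage."""
    return job_vmem_bytes >= read_memsw_usage(cgroup_dir)
```

A test harness would call check_invariant() repeatedly while a job allocates and touches memory, and flag any sample where it returns False.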
And the 2nd part is related to accounting - i.e. when the job's "real"
h_vmem is greater than the reported usage in the
"memory.memsw.usage_in_bytes" file. Would we get different behavior
than with the procfs-based PDC?
IMO, if we still poll the /proc filesystem for the h_vmem (i.e. the sum of
the h_vmem of all processes in a job) periodically, but less frequently,
then it should not be a real issue. If a single process exceeds the h_vmem
limit, then it also exceeds the limit imposed by
setrlimit(2), which is still set even when OGS/GE is using cgroups. So
with the procfs PDC or the cgroups PDC, the process would get the same
treatment by the kernel... But if the sum of the h_vmem of all processes
of a job exceeds the h_vmem limit, then the periodic procfs poll would still
catch this case, and the action taken would be the same for both.
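The periodic procfs poll described here amounts to summing the virtual size of every process in the job and comparing it with the limit. The sketch below shows that idea only. It is not the actual PDC code; the function names and the injectable read_status parameter are assumptions made so the logic can be exercised without real processes.

```python
def vmsize_bytes(status_text):
    """Return VmSize in bytes from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmSize:"):
            return int(line.split()[1]) * 1024
    return 0  # e.g. kernel threads and zombies have no VmSize line


def read_proc_status(pid):
    """Read /proc/<pid>/status, tolerating processes that just exited."""
    try:
        with open(f"/proc/{pid}/status") as fh:
            return fh.read()
    except OSError:
        return ""


def job_vmem_bytes(pids, read_status=read_proc_status):
    """Sum the virtual size of all processes belonging to one job."""
    return sum(vmsize_bytes(read_status(pid)) for pid in pids)


def job_exceeds_limit(pids, h_vmem_limit_bytes, read_status=read_proc_status):
    """True if the job's summed virtual size is above its h_vmem limit."""
    return job_vmem_bytes(pids, read_status) > h_vmem_limit_bytes
```

With cgroups in use, the list of pids could presumably come from the job's cgroup rather than from walking all of /proc, which is what would let the poll run less often and more cheaply.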
I am less concerned about the 2nd part... and we should be more
lenient. It does not hurt the system if it is just the
virtual size usage exceeding the h_vmem temporarily. In the end, h_vmem
is nothing but the max. bound of the valid address space of the job, NOT
physical memory pages and NOT even any space in the swap (the Linux VM
overcommits memory by default - and in the non-overcommit case, the
logic in the first part should handle it nicely).
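That an address space limit bounds mappings rather than touched pages can be seen with RLIMIT_AS directly. The sketch below assumes Linux and a Python interpreter that starts comfortably under the chosen limit. The helper names are hypothetical, and this is not how OGS/GE applies the limit internally.

```python
import resource
import subprocess
import sys


def limit_address_space(max_bytes):
    """Lower the soft RLIMIT_AS of the calling process, keeping the hard limit."""
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    if hard != resource.RLIM_INFINITY:
        max_bytes = min(max_bytes, hard)
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, hard))


def run_limited(code, max_bytes):
    """Run a Python snippet in a child whose address space is capped."""
    return subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=lambda: limit_address_space(max_bytes),
        capture_output=True,
        text=True,
    )
```

Running run_limited("import mmap; mmap.mmap(-1, 2 << 30)", 1 << 30) should fail inside the child at the mmap call itself, before a single page has been touched, which matches the point that the limit is on address space and not on physical memory or swap.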
As long as innocent jobs don't get killed, and system performance
does not suffer due to cgroups integration, then everyone is happy...
Not sure if I have already covered all cases... or am I still missing
> 2) As well as using setrlimit, enforce a per-cgroup address space limit by
> the PDC periodically polling just the processes in that cgroup. Does s_rss,
> s_stack, etc. do anything in gridengine these days - do you already have
> such a poll loop to deliver that functionality?
> 3) Bring the definitions of h_vmem / s_vmem into line with the likes of
> h_stack, h_rss, etc. - interpret them in terms of setrlimit only and make no
> attempt to enforce per-job limits.
> Even if successful, I agree that (1) sounds like a major headache. (2) gives
> the greatest backwards compatibility. If you don't already have a poll loop
> and want to avoid putting one in, (3) should be sufficient to avoid loss of
>> (Maybe I should have clarified the above point in my previous email -
>> but I was really busy these days, working on the GE2011.11u1 release,
>> handling outside of the mailing list user support, and talking to
>> hardware vendors, etc...)
> Thanks for continuing this conversation, I appreciate (and apologise for)
> the time you're putting into it. I've obviously not done a very good job at
> being clear and concise.
> All the best,
> Mark Dixon Email : m.c.dixon at leeds.ac.uk
> HPC/Grid Systems Support Tel (int): 35429
> Information Systems Services Tel (ext): +44(0)113 343 5429
> University of Leeds, LS2 9JT, UK