[gridengine users] PE Job Suspend / Resume
rayrayson at gmail.com
Wed Jun 13 05:20:20 UTC 2012
On Wed, Jun 13, 2012 at 12:44 AM, Joseph A. Farran <jfarran at uci.edu> wrote:
> Hi Rayson.
> So for us newbies with OGE, is there a (hopefully easy) way of automatically
> adding cgroups to OGE parallel environment so that it's all nice and
> transparently integrated into OGE?
OGS/GE 2011.11 U1 will be the first release with cgroups. We mainly
use it for process grouping - and this part is transparent to the
user. Later releases, like GE 2011.11 update 2, will include features
that the user can use to tune the cgroups integration behavior.
In the current implementation (U1), when the freezer controller is
available, then it is used for safe signaling. If the cpuacct
controller is available, then it is used for CPU cycle accounting. And
if the memory controller is available, then memory limits & memory
usage are handled by this controller as well. As detection is done
without user intervention, it *should* be transparent enough! :-)
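To illustrate the kind of detection described above, here is a minimal
sketch in Python. It is not the Grid Engine code. It assumes a cgroup v1
kernel and reads /proc/cgroups, where each line lists a controller name
and an "enabled" column. The function names and the feature labels are
made up for this example, and "enabled" in the kernel does not
guarantee that the controller is mounted.

```python
def parse_proc_cgroups(text):
    """Return the set of controller names marked enabled in /proc/cgroups text."""
    enabled = set()
    for line in text.splitlines():
        # Skip blank lines and the "#subsys_name hierarchy num_cgroups enabled" header.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 4 and fields[3] == "1":
            enabled.add(fields[0])
    return enabled


def plan_features(enabled):
    """Map available controllers to the features named in the post.

    The dictionary keys are hypothetical labels, not Grid Engine settings.
    """
    return {
        "safe_signaling": "freezer" in enabled,
        "cpu_accounting": "cpuacct" in enabled,
        "memory_limits": "memory" in enabled,
    }


if __name__ == "__main__":
    try:
        with open("/proc/cgroups") as f:
            print(plan_features(parse_proc_cgroups(f.read())))
    except OSError:
        print("/proc/cgroups not readable; no cgroup features assumed")
```

Each feature is switched on independently, so a host with only the
freezer controller would still get safe signaling without the
accounting or memory handling.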
The most important part is grouping processes to jobs, which is the
main function of the PDC in Grid Engine. Ron Chen & I implemented
almost half of the platform-specific PDC code: AIX, HP-UX, OSX, and
the other BSD-like systems (FreeBSD, NetBSD, OpenBSD), which use PDCs
that are mostly based on the OSX implementation. We even wrote a PDC
implementation for Linux that does not require running the execd as
root - this one was contributed to the original dev list, but Sun was
not too interested in it, and thus we only deployed it on a few
systems... long story. So we know the ugly bits in the PDC!
We believe the original PDC is something that really needs an update,
esp. now that the Linux kernel has cgroups, which were developed for
this purpose... So we can finally remove hacks used in Grid Engine,
such as adding a GID to a job or needing to know the
"ENABLE_ADDGRP_KILL" flag for proper job cleanup. Note that the
"ENABLE_ADDGRP_KILL" parameter was added by Sun a long time ago, as
(again!) Andy told us that it is not always safe to kill all processes
that have the supplementary GID added by Grid Engine.
(Note that we have worked with Andy for a long time, and further there
was no reason that Sun wanted to screw up its own products... but the
result is that in some cases processes are left behind and not
properly cleaned up by Grid Engine.)
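To make the contrast concrete, here is a rough Python sketch of the two
ways of finding a job's processes. It is an illustration only, not the
PDC code. The first function checks the "Groups:" line of a
/proc/<pid>/status file for a given supplementary GID. The second
reads the PID list that a cgroup v1 directory exposes in its
cgroup.procs file. The function names and the sample GID are
assumptions made for this example.

```python
def has_supplementary_gid(status_text, gid):
    """True if gid appears on the 'Groups:' line of /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Groups:"):
            # Everything after the label is a whitespace-separated GID list.
            return str(gid) in line.split()[1:]
    return False


def pids_in_cgroup(procs_text):
    """Parse the contents of a cgroup.procs file into a list of PIDs."""
    return [int(tok) for tok in procs_text.split()]


if __name__ == "__main__":
    # Sample inputs; 20001 is a made-up job GID.
    status = "Name:\tsleep\nGroups:\t100 20001\n"
    print(has_supplementary_gid(status, 20001))
    print(pids_in_cgroup("101\n102\n"))
```

With the GID approach, every process on the host has to be scanned and
membership is only inferred from the GID. With cgroups, the kernel
keeps the list and child processes stay in the job's group.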
Lastly, in case you didn't know, we have a blog entry for the Grid
Engine cgroups integration:
> On 6/12/2012 5:19 PM, Rayson Ho wrote:
> On Tue, Jun 12, 2012 at 8:10 PM, Joseph Farran <jfarran at uci.edu> wrote:
> If you guys are that paranoid about PE suspension, how about adding an
> on/off flag for this since the code is already there and let the admin pick?
> Hi Joseph,
> I just want to understand the background a bit more, that's all...
> Esp. now we have cgroups that can handle suspension much safer than
> the old code (SIGSTOP).
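On the suspension point quoted above: with the cgroup v1 freezer
controller, a whole group of processes is suspended by writing FROZEN
to the group's freezer.state file and resumed by writing THAWED, rather
than sending SIGSTOP to each process. A minimal Python sketch follows.
It is not how Grid Engine does it internally, and the helper names and
the cgroup directory are hypothetical.

```python
import os


def set_freezer_state(cgroup_dir, state):
    """Write FROZEN or THAWED to freezer.state in a cgroup v1 freezer directory."""
    if state not in ("FROZEN", "THAWED"):
        raise ValueError("state must be FROZEN or THAWED")
    with open(os.path.join(cgroup_dir, "freezer.state"), "w") as f:
        f.write(state)


def read_freezer_state(cgroup_dir):
    """Return the current state string.

    On a real kernel this can also be FREEZING while the freeze is
    still in progress, so callers may need to poll.
    """
    with open(os.path.join(cgroup_dir, "freezer.state")) as f:
        return f.read().strip()
```

For example, set_freezer_state("/sys/fs/cgroup/freezer/sge/job_42",
"FROZEN") would suspend every process in that (made-up) job group in
one operation, which needs root or suitable permissions on the cgroup
directory.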