[gridengine users] PE Job Suspend / Resume
Joseph A. Farran
jfarran at uci.edu
Tue Jun 12 04:58:25 UTC 2012
Yes, it makes sense not to introduce new options.
I am not familiar with cgroups, so I need to read up on it.
On the subject of OpenMPI and OGE: does OGE correctly suspend and resume programs compiled with OpenMPI when using the OpenMPI s/r implementation?
On 6/11/2012 9:21 PM, Ron Chen wrote:
> We have not implemented a flag for it, though it would not be hard to add one. The catch with adding a new option is that we then need to support it even if it turns out not to be needed, and we are careful not to add too much extra code. That is why I will do more research first and then decide whether it is really needed.
> I searched Google for TCP suspend issues and found that some developers say it is safe as long as the processes are suspended at a quiescent point.
> So if in-flight messages are processed before suspending, which should be the case for the freezer cgroup subsystem, then it should be safe to handle it without adding a new flag.
> See: http://www.kernel.org/doc/Documentation/cgroups/freezer-subsystem.txt
> (Rayson added cgroups support in GE 2011.11 U1. cgroups is Linux-only, but most clusters run Linux, at least those doing small to medium-scale HPC.)
> IBM also planned to use Containers/Cgroups in IBM BlueWaters (before IBM cancelled the project in 2011) to perform checkpointing and restart.
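For anyone else reading up on the freezer subsystem: going by the kernel document linked above, freezing comes down to writing FROZEN or THAWED to the freezer.state file of a cgroup and reading the file back until it reports FROZEN. Below is a minimal Python sketch of that interface only. The function names and the example path are my own assumptions for illustration, not how Grid Engine actually drives cgroups.

```python
import os
import time

def set_freezer_state(cgroup_dir, state):
    """Write FROZEN or THAWED to the cgroup's freezer.state file."""
    if state not in ("FROZEN", "THAWED"):
        raise ValueError("state must be FROZEN or THAWED")
    with open(os.path.join(cgroup_dir, "freezer.state"), "w") as f:
        f.write(state)

def get_freezer_state(cgroup_dir):
    """Return the current state: THAWED, FREEZING or FROZEN."""
    with open(os.path.join(cgroup_dir, "freezer.state")) as f:
        return f.read().strip()

def freeze(cgroup_dir, timeout=5.0, poll=0.1):
    """Request a freeze, then poll until the kernel reports FROZEN.

    The state can read FREEZING for a while, so a single write is
    not proof that every task in the cgroup has stopped.
    Returns True if FROZEN was observed before the timeout.
    """
    set_freezer_state(cgroup_dir, "FROZEN")
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_freezer_state(cgroup_dir) == "FROZEN":
            return True
        time.sleep(poll)
    return get_freezer_state(cgroup_dir) == "FROZEN"

def thaw(cgroup_dir):
    """Resume all tasks in the cgroup."""
    set_freezer_state(cgroup_dir, "THAWED")
```

Usage would be something like freeze("/sys/fs/cgroup/freezer/sge/job_1234") on suspend and thaw() with the same path on resume. That path is hypothetical; the real one depends on where the freezer hierarchy is mounted and how the job's cgroup is named. Writing to freezer.state normally requires root.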