[gridengine users] control_slaves on PE
reuti at staff.uni-marburg.de
Wed Jan 14 10:05:02 UTC 2015
On 14.01.2015 at 10:09, Roberto Nunnari wrote:
> man sge_pe states:
> This parameter can be set to TRUE or FALSE (the default). It indicates whether Oracle Grid Engine is the creator of the slave tasks of a parallel application via sge_execd(8) and sge_shepherd(8) and thus has full control over all processes in a parallel application, which enables capabilities such as resource limitation and correct accounting. However, to gain control over the slave tasks of a parallel application, a sophisticated PE interface is required, which works closely together with Oracle Grid Engine facilities. Such PE interfaces are available through your local Oracle Grid Engine support office.
> Does that mean that you need to buy some software from Oracle in order to take advantage of 'control_slaves TRUE' ?
It mainly refers to the fact that, depending on the parallel application, some preparation may be necessary: supplying scripts for start_proc_args/stop_proc_args, and setting up or tuning the started application so that it doesn't do nasty things like jumping out of the process tree.
Technically, control_slaves must be set to TRUE so that a started job script is allowed to run `qrsh -inherit ...` to reach other nodes without any `rsh`/`ssh` at all (in my clusters `ssh` is available for admin staff only).
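As a rough illustration of what such a job script could do, here is a minimal sketch that walks the host list SGE grants to the job and prints one `qrsh -inherit` call per host. It assumes the usual $PE_HOSTFILE layout (hostname, slots, queue, processor range per line); the function name list_qrsh_commands and the remote command `hostname` are placeholders of mine, not part of SGE.

```shell
#!/bin/sh
# Sketch only: print the `qrsh -inherit` command for every host granted to the job.
# Assumed $PE_HOSTFILE line layout: "hostname slots queue processor_range".
# list_qrsh_commands is a hypothetical helper name, not an SGE tool.

list_qrsh_commands() {
    hostfile=$1      # path to the PE hostfile
    remote_cmd=$2    # command to start on each node (placeholder)
    while read -r host slots _rest; do
        # skip empty lines
        [ -n "$host" ] || continue
        # dry run: drop the "echo" to actually start the task via SGE
        echo "qrsh -inherit $host $remote_cmd"
    done < "$hostfile"
}

# Inside a running parallel job SGE sets PE_HOSTFILE; outside of one nothing happens.
if [ -n "${PE_HOSTFILE:-}" ]; then
    list_qrsh_commands "$PE_HOSTFILE" "hostname"
fi
```

With control_slaves FALSE the `qrsh -inherit` calls themselves would be refused, which is why the dry run only prints them.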
While these scripts were mandatory for many parallel applications in the past, current versions of MPICH and Open MPI (./configure --with-sge for the latter) support SGE out of the box.
For Open MPI you can check whether this is set up in your version by looking for the gridengine component:
$ ompi_info | grep grid
MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.5)
Care must be taken with Open MPI 1.8 and newer: by default they apply a core binding of their own, independent of SGE's, and always start at socket/core 0/0. That is, if more than one Open MPI job is running on a node, it's necessary either to switch off Open MPI's core binding (and/or use SGE's) or to reformat the core list granted by SGE so that Open MPI can use it.
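The check above could be wrapped into a job script roughly like this. It is only a sketch: has_sge_support is a helper name I made up, ./my_app is a placeholder, and the `--bind-to none` line is left as a comment because whether you want to disable Open MPI's binding depends on your setup.

```shell
#!/bin/sh
# Sketch only: detect SGE support in the installed Open MPI.
# has_sge_support is a hypothetical helper; it reads `ompi_info` output on stdin
# and succeeds if the gridengine ras component is listed.
has_sge_support() {
    grep -q 'ras: *gridengine'
}

# Only run the check where ompi_info is actually installed.
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info | has_sge_support; then
        echo "Open MPI was built with SGE support"
    else
        echo "no gridengine component found; consider rebuilding with --with-sge"
    fi
fi

# With Open MPI 1.8 and newer, its own core binding can be switched off, e.g.:
#   mpirun --bind-to none -np "$NSLOTS" ./my_app   # ./my_app is a placeholder
```

Switching the binding off is the simpler of the two options mentioned; converting the core list granted by SGE into something Open MPI accepts needs more work and is not shown here.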
> In my production environment, I have four PEs: two are set to 'control_slaves FALSE' and two to 'control_slaves TRUE', and as far as I know, all of them behave as expected. It has been like that for about 9 years, since I inherited the SGE cluster.
> Can anybody cast some light on it, please?
> my present environment:
> - OGE 6.2u7
> - on the execution nodes: openmpi 1.5.4
> - on the master node: openmpi 1.4
> Thank you and best regards.