[gridengine users] Round Robin x Fill Up

Reuti reuti at staff.uni-marburg.de
Sat Jul 27 13:13:45 UTC 2013


Hi,

On 26.07.2013 at 23:26, Sergio Mafra wrote:

> Hi Reuti,
> 
> Thanks for your prompt answer.
> Regarding your questions:
> 
> > How does your application read the list of granted machines?
> > Did you compile MPI on your own (which implementation in detail)?
> 
> I've got no control over this app and no documentation for it. It was designed by an Electrical Research Center for our purposes.
> 
> > PS: I assume that with $round_robin access was simply allowed to all (or at least many) nodes.
> 
> Yes, that's correct.
> 
> > As hosts are now filled first before access to another one is granted, you might see the effect of the former (possibly wrong) distribution of slave tasks to the nodes.
> 
> So I understand that the app should be recompiled to take advantage of the $fill_up option?

Not necessarily. The MPI version in use is obviously prepared to run under the control of SGE, as it uses `qrsh -inherit ...` to start slave tasks on other nodes. Unfortunately, it also does so on machines/slots which weren't granted to this job, which results in the error you mentioned first.

Do you start any `mpiexec` or `mpirun` in your job script, or is this already issued inside the application you started? The question is whether any additional "-hostlist", "-machinefile" or similar is given as an argument to this command, overriding the $PE_HOSTFILE generated by SGE.

The MPI library should detect the granted allocation automatically, as it already recognizes that it was started under SGE.
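
To see which hosts SGE actually granted, something along these lines could go into the job script. This is only a rough sketch: `list_granted_hosts` is a name I made up, and the `host:slots` output is an assumption about what a machinefile for your launcher might look like. The only part taken from SGE itself is that each line of $PE_HOSTFILE starts with the host name followed by the number of slots.

```shell
#!/bin/sh
# Sketch only: list_granted_hosts is a hypothetical helper, not part of SGE.
# Each line of $PE_HOSTFILE has the form:
#   <host> <slots> <queue instance> <processor range>

list_granted_hosts() {
    # Print "host:slots" for every host listed in the given file.
    # Falls back to $PE_HOSTFILE when no file name is passed.
    awk 'NF >= 2 { print $1 ":" $2 }' "${1:-${PE_HOSTFILE:-}}"
}

# Inside a parallel job SGE sets PE_HOSTFILE; outside of one this is skipped.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "${PE_HOSTFILE:-}" ]; then
    list_granted_hosts
fi
```

If a host shows up in the error message but not in this list, the application got its host list from somewhere other than SGE.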

You can also try the following:

- revert the PE definition to allocate by $round_robin
- submit a job
- SSH to the master node of the parallel job
- issue:

ps -e f --cols=500

(f without a leading -)

- somewhere in the output there should be the `mpiexec` or `mpirun` command. Can you please post this line? It should be a child of the started job script.
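
If the process tree is long, the last step can be narrowed down with a small filter. Again only a sketch: `find_mpi_launcher` is a hypothetical helper name, and it assumes the launcher really is called `mpiexec` or `mpirun` on your installation.

```shell
#!/bin/sh
# Sketch only: find_mpi_launcher is a hypothetical helper name.
# Reads a process listing on stdin and keeps the lines that mention
# mpiexec or mpirun. The bracket expressions keep the grep process
# itself from matching when the input comes from ps.
find_mpi_launcher() {
    grep -E 'mpi[e]xec|mpi[r]un'
}

# On the master node of the parallel job:
#   ps -e f --cols=500 | find_mpi_launcher
```

The full `ps` output is still more useful for posting, as it also shows which process is the parent of the launcher.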

-- Reuti


> All the best,
> 
> Sergio
> 
> 
> On Fri, Jul 26, 2013 at 10:06 AM, Reuti <reuti at staff.uni-marburg.de> wrote:
> Hi,
> 
> On 26.07.2013 at 14:22, Sergio Mafra wrote:
> 
> > I'm using MIT StarCluster with mpich2 and OGE. Everything's ok.
> > But when I tried to change the work distribution strategy from Round Robin (the default) to Fill Up, my problems had just begun.
> > OGE keeps telling me that some nodes cannot receive tasks...
> 
> On the one hand this is a good sign, as it confirms that your PE is defined to control slave tasks on the nodes.
> 
> 
> > "Error: executing task of job 9 failed: execution daemon on host "node002" didn't accept task"
> > It seems that my MPI app always tries to run on all nodes of the cluster, even if OGE doesn't allow it to.
> > Does anybody know of a workaround?
> 
> This indicates that your application tries to use a node in the cluster which wasn't granted to this job by SGE.
> 
> How does your application read the list of granted machines?
> 
> Did you compile MPI on your own (which implementation in detail)?
> 
> -- Reuti
> 
> PS: I assume that with $round_robin access was simply allowed to all (or at least many) nodes. As hosts are now filled first before access to another one is granted, you might see the effect of the former (possibly wrong) distribution of slave tasks to the nodes.
> 




