[gridengine users] Round Robin x Fill Up

Reuti reuti at staff.uni-marburg.de
Sat Jul 27 18:58:04 UTC 2013


Am 27.07.2013 um 16:25 schrieb Sergio Mafra:

> Reuti,
> 
> Aggregating all data...
> 
> My cluster has 2 servers (master and node001), with 16 slots each one.
> 
> My mpi app is newave170502_L
> 
> I ran 3 tests:
> 
> 1. $round_robin using 32 slots: (ran ok)
> 
>  2382 ?        Sl     0:00 /opt/sge6/bin/linux-x64/sge_execd
>  2817 ?        S      0:00  \_ sge_shepherd-1 -bg
>  2819 ?        Ss     0:00      \_ mpiexec newave170502_L
>  2820 ?        S      0:00          \_ /usr/bin/hydra_pmi_proxy --control-port master:40945 --demux poll --pgid 0 --retries 10 --proxy-id 0
>  2822 ?        R      0:30          |   \_ newave170502_L
>  2821 ?        Sl     0:00          \_ /opt/sge6/bin/linux-x64/qrsh -inherit -V node001 "/usr/bin/hydra_pmi_proxy" --control-port master:40945 --demux poll --pgid 0 --ret

As both nodes are used, this will succeed. I wonder, though, why there is only one `newave170502_L` process: there should be 16 on each machine as children of the respective `hydra_pmi_proxy`.

What is the output of:

mpiexec --version

Maybe the application is using threads in addition. Does:

ps -eLf

list more instances of the application?
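
A quick check (just a sketch, the grep pattern taken from your process listing above) would be to count plain processes vs. all threads of the application on a node:

ps -e f --cols=500 | grep -c "[n]ewave170502_L"   # processes only
ps -eLf | grep -c "[n]ewave170502_L"              # all threads (LWPs)

If the second number is noticeably larger, the application is multi-threaded and a single process per `hydra_pmi_proxy` wouldn't be surprising.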


> 2. $fill_up with 16 slots: (aborted with error error: executing task of job 2 failed: execution daemon on host "node001" didn't accept task)
> 
>  2842 ?        S      0:00  \_ sge_shepherd-2 -bg
>  2844 ?        Ss     0:00      \_ mpiexec newave170502_L
>  2845 ?        S      0:00          \_ /usr/bin/hydra_pmi_proxy --control-port master:45562 --demux poll --pgid 0 --retries 10 --proxy-id 0
>  2847 ?        S      0:00          |   \_ newave170502_L
>  2846 ?        Z      0:00          \_ [qrsh] <defunct>

SGE allocated all slots to "master" and none to "node001": as the submitted job can get the required number of slots from one machine alone, there is no need to spread any task onto "node001". The question is: why is your application (or even the `mpiexec`) trying to do so anyway? There were cases where SGE was misled by contradictory entries in:

/etc/hosts

having two or more different names for each machine.
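
For illustration (the addresses here are made up), a clean setup would map exactly one canonical name to each address on both machines:

10.0.0.1    master
10.0.0.2    node001

Entries that list a second, different name for the same machine (or put the machine's own name on the 127.0.0.1 line) can mislead SGE resp. the MPI startup about which hostname the slots were actually granted for.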

- What is the content of this file in your machines?

- Is 

> 3. $fill_up with 18 slots (ran ok):
> 
>  2382 ?        Sl     0:01 /opt/sge6/bin/linux-x64/sge_execd
>  2861 ?        Sl     0:00  \_ sge_shepherd-3 -bg
>  2862 ?        Ss     0:00      \_ /opt/sge6/utilbin/linux-x64/qrsh_starter /opt/sge6/default/spool/exec_spool_local/master/active_jobs/3.1/1.master
>  2869 ?        S      0:00          \_ /usr/bin/hydra_pmi_proxy --control-port node001:36673 --demux poll --pgid 0 --retries 10 --proxy-id 0
>  2870 ?        R      0:24              \_ newave170502_L

While in former times (with the old MPICH(1)) each slave task needed its own `qrsh -inherit ...`, nowadays only one per slave node is used and all additional processes on the master or any slave node are forks.

I guess even 17 would work, as it would need at least one slot from the other machine.
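
You can check the allocation SGE actually made by looking at the hostfile it generates, from inside the job environment:

cat $PE_HOSTFILE

For the 18 slot $fill_up case it should show something like (queue name and exact slot split are assumed here):

master 16 all.q@master UNDEFINED
node001 2 all.q@node001 UNDEFINED

Only the hosts listed there, with the listed slot counts, may be used by `mpiexec`.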

- Is there any comment in the output of your application, how many processes were started for a computation?

- Is the `mpiexec` a plain binary, or some kind of wrapper script?

file `which mpiexec`

If it's a symbolic link, it should point to mpiexec.hydra, and the check can be repeated on the link's target.
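
To follow the link directly in one go, something like this should work as well (assuming GNU coreutils' readlink):

file $(readlink -f $(which mpiexec))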

-- Reuti


> ---------- Forwarded message ----------
> From: Sergio Mafra <sergiohmafra at gmail.com>
> Date: Sat, Jul 27, 2013 at 11:07 AM
> Subject: Fwd: [gridengine users] Round Robin x Fill Up
> To: Reuti <reuti at staff.uni-marburg.de>, "users at gridengine.org" <users at gridengine.org>
> 
> 
> Appending to previous message.
> 
> If I change to $fill_up and submit the same job using only 16 of the 32 available slots, here comes the output:
> 
>  2842 ?        S      0:00  \_ sge_shepherd-2 -bg
>  2844 ?        Ss     0:00      \_ mpiexec newave170502_L
>  2845 ?        S      0:00          \_ /usr/bin/hydra_pmi_proxy --control-port master:45562 --demux poll --pgid 0 --retries 10 --proxy-id 0
>  2847 ?        S      0:00          |   \_ newave170502_L
>  2846 ?        Z      0:00          \_ [qrsh] <defunct>
> ---------- Forwarded message ----------
> From: Sergio Mafra <sergiohmafra at gmail.com>
> Date: Sat, Jul 27, 2013 at 10:58 AM
> Subject: Re: [gridengine users] Round Robin x Fill Up
> To: Reuti <reuti at staff.uni-marburg.de>
> Cc: "users at gridengine.org" <users at gridengine.org>
> 
> 
> Hi Reuti,
> 
> > Do you start any `mpiexec` resp. `mpirun` in your job script, or is this already issued inside the application you started? The question is whether there is any additional "-hostlist", "-machinefile" or the like given as an argument to this command, invalidating the $PE_HOSTFILE generated by SGE.
> 
> The job is started using mpiexec, in this way:
> $ qsub -N $nameofthecase -b y -pe orte $1 -cwd mpiexec newave170502_L
> where newave170502_L is the name of the MPI app.
> 
> >You can also try the following:
> >
> >- revert the PE definition to allocate by $round_robin
> >- submit a job
> >- SSH to the master node of the parallel job
> >- issue:
> >
> >ps -e f --cols=500
> >
> >(f w/o -)
> 
> > - Somewhere there should be the `mpiexec` resp. `mpirun` command. Can you please post this line? It should be a child of the started job script.
> 
> Here comes the output:
> 
> 2382 ?        Sl     0:00 /opt/sge6/bin/linux-x64/sge_execd
>  2817 ?        S      0:00  \_ sge_shepherd-1 -bg
>  2819 ?        Ss     0:00      \_ mpiexec newave170502_L
>  2820 ?        S      0:00          \_ /usr/bin/hydra_pmi_proxy --control-port master:40945 --demux poll --pgid 0 --retries 10 --proxy-id 0
>  2822 ?        R      0:30          |   \_ newave170502_L
>  2821 ?        Sl     0:00          \_ /opt/sge6/bin/linux-x64/qrsh -inherit -V node001 "/usr/bin/hydra_pmi_proxy" --control-port master:40945 --demux poll --pgid 0 --retries 10 --proxy-id 1
> 
> All best,
> 
> Sergio
> 
> 
> On Sat, Jul 27, 2013 at 10:13 AM, Reuti <reuti at staff.uni-marburg.de> wrote:
> Hi,
> 
> Am 26.07.2013 um 23:26 schrieb Sergio Mafra:
> 
> > Hi Reuti,
> >
> > Thanks for your prompt answer.
> > Regarding your questions:
> >
> > > How does your application read the list of granted machines?
> > > Did you compile MPI on your own (which implementation in detail)?
> >
> > I've got no control over, nor documentation for, this app. It was designed by an Electrical Research Center for our purposes.
> >
> > > PS: I assume that with $round_robin simply all (or at least many) nodes were allowed access.
> >
> > Yes. It's correct.
> >
> > > As now hosts are first filled before access to another one is granted, you might see the effect of the former (possibly wrong) distribution of slave tasks to the nodes
> >
> > So I understand that the app should be recompiled to take advantage of the $fill_up option?
> 
> Not necessarily: the used version of MPI is obviously prepared to run under the control of SGE, as it uses `qrsh -inherit ...` to start slave tasks on other nodes. Unfortunately it also does so on machines/slots which weren't granted to this job, which results in the error you mentioned first.
> 
> Do you start any `mpiexec` resp. `mpirun` in your job script, or is this already issued inside the application you started? The question is whether there is any additional "-hostlist", "-machinefile" or the like given as an argument to this command, invalidating the $PE_HOSTFILE generated by SGE.
> 
> The MPI library should detect the granted allocation automatically, as it already honors the fact that it's started under SGE.
> 
> You can also try the following:
> 
> - revert the PE definition to allocate by $round_robin
> - submit a job
> - SSH to the master node of the parallel job
> - issue:
> 
> ps -e f --cols=500
> 
> (f w/o -)
> 
> - Somewhere there should be the `mpiexec` resp. `mpirun` command. Can you please post this line? It should be a child of the started job script.
> 
> -- Reuti
> 
> 
> > All the best,
> >
> > Sergio
> >
> >
> > On Fri, Jul 26, 2013 at 10:06 AM, Reuti <reuti at staff.uni-marburg.de> wrote:
> > Hi,
> >
> > Am 26.07.2013 um 14:22 schrieb Sergio Mafra:
> >
> > > I'm using MIT StarCluster with mpich2 and OGE. Everything's ok.
> > > But when I tried to change the strategy of distributing work from Round Robin (default) to Fill Up... my problems began.
> > > OGE keeps telling me that some nodes cannot receive tasks...
> >
> > On the one hand this is a good sign, as it confirms that your PE is defined to control slave tasks on the nodes.
> >
> >
> > > "Error: executing task of job 9 failed: execution daemon on host "node002" didn't accept task"It seems that my mpi app always tries to run in all nodes of the cluster, no matter if OGE doesn't allow it to do it.
> > > Does anybody knows of a workaround ?
> >
> > This indicates that your application tries to use a node in the cluster which wasn't granted to this job by SGE.
> >
> > How does your application read the list of granted machines?
> >
> > Did you compile MPI on your own (which implementation in detail)?
> >
> > -- Reuti
> >
> > PS: I assume that with $round_robin simply all (or at least many) nodes were allowed access. As now hosts are first filled before access to another one is granted, you might see the effect of the former (possibly wrong) distribution of slave tasks to the nodes.
> >
> 
> 
> 
> 



