[gridengine users] problem in run mpi jobs

Reuti reuti at staff.uni-marburg.de
Tue Nov 22 14:33:43 UTC 2011


On 22.11.2011 at 13:25, mahbube rustaee wrote:

> On Mon, Nov 21, 2011 at 1:44 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
> On 21.11.2011 at 05:30, mahbube rustaee wrote:
> 
> > On Mon, Nov 21, 2011 at 3:27 AM, Reuti <reuti at staff.uni-marburg.de> wrote:
> > Hi,
> >
> > On 20.11.2011 at 12:37, mahbube rustaee wrote:
> >
> > > 1) I run Intel MPI jobs. When $NSLOTS <= 50, qsub works fine, but for more than 50 slots
> > > the output is either empty or the job's output is:
> > >
> > > mpirun has exited due to process rank 4 with PID 23866 on
> > > node amd-7-5.local exiting without calling "finalize". This may
> > > have caused other processes in the application to be
> > > terminated by signals sent by mpirun (as reported here).
> > > --------------------------------------------------------------------------
> > > [amd-7-5.local:23861] 199 more processes have sent help message help-mtl-psm.txt / unable to open endpoint
> > > [amd-7-5.local:23861] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> > > [amd-7-5.local:23861] 99 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure
> > >
> > > What configuration is missing?
> >
> > The errors are from Open MPI, but above you state Intel MPI. Hence the $PATH on the exechost might point to the wrong `mpiexec`.
> >
> > You can investigate this by `which mpiexec` in your jobscript.
> > I checked that; the path of mpirun is correct. My script is:
> >  #!/bin/sh
> > #$ -S /bin/bash
> > #$ -N Det2-200core
> > #$ -cwd
> > #$ -l h_vmem=500M,mem_free=10M
> > #$ -j y
> > #$ -pe mpi16 64
> > . $HOME/.intelbash
> > .  /var/mpi-selector/data/openmpi_intel_qlc-1.4.2.sh
> > which mpirun
> 
> And what's the output?
> 
> 
> > mpirun -n $NSLOTS   mpi.intel.comp
> >
> > .intelbash and openmpi_intel_qlc-1.4.2.sh set $PATH and the library path.
> 
> As you can't set up two MPI libraries at the same time, I would assume that you missed an argument to the script.
> 
> Which library did you use to compile the application? The same one must be used for execution too.
> 
> Those libraries don't conflict.

To be blunt: they do. You can't run Intel MPI jobs with mpiexec from Open MPI, and vice versa.
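A quick way to see which stack actually answers to `mpirun`/`mpiexec` is a probe like the following (a sketch; the tool names are standard, but the version banner and paths vary by MPI distribution and site):

```shell
#!/bin/sh
# Sketch: report which MPI launcher is first on PATH. Open MPI's mpirun
# identifies itself as "mpirun (Open MPI) x.y.z" with --version, while
# Intel MPI prints its own banner, so the first line tells the stacks apart.
for tool in mpirun mpeexec; do :; done  # (loop body below is the real check)
for tool in mpirun mpiexec; do
  if command -v "$tool" >/dev/null 2>&1; then
    printf '%s -> %s\n' "$tool" "$(command -v "$tool")"
  else
    printf '%s not found on PATH\n' "$tool"
  fi
done
# Additionally, "ldd ./mpi.intel.comp | grep -i mpi" would show which
# libmpi the application binary was actually linked against.
```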


> I modified the qsub script, ran mpi-selector-menu to configure the environment variables, and
> added the -V option to the script. The errors are:
> 
>  amd-10-9.local
> [amd-10-9.local:28409] [[44203,0],0] ORTE_ERROR_LOG: The system limit on number of network connections a process can open was reached in file oob_tcp.c at line 447

Again: this error is from Open MPI, so you are mixing them up. Please check the actual PATH, LD_LIBRARY_PATH, mpiexec, and so on...
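For example, a few diagnostic lines near the top of the jobscript would capture what the batch environment really provides, for comparison with an interactive shell on the same node (a sketch; the variable names are standard, the values are site-specific):

```shell
#!/bin/sh
# Sketch: print the environment the batch job actually sees.
echo "PATH=$PATH"
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
# Which launcher wins on this PATH (falls back to a message if none found):
command -v mpirun || echo "mpirun: not on PATH"
```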


> --------------------------------------------------------------------------
> Error: system limit exceeded on number of network connections that can be open
> 
> This can be resolved by setting the mca parameter opal_set_max_sys_limits to 1,
> increasing your limit descriptor setting (using limit or ulimit commands),
> or asking the system administrator to increase the system limit.
> --------------------------------------------------------------------------
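One way to check whether the batch side really has a lower descriptor limit than the interactive shell is to run the same probe in both places (a sketch; the job name and `limit.out` file are made up for illustration, and the `qsub` flags assume a standard SGE installation):

```shell
#!/bin/sh
# Sketch: compare the open-files limit interactively vs inside a batch job.
ulimit -n    # limit in the current (interactive) shell
if command -v qsub >/dev/null 2>&1; then
  # Submit the same probe as a job; 'limit.out' is a hypothetical output file.
  echo 'ulimit -n' | qsub -S /bin/sh -j y -o limit.out -N limitprobe
  # After the job finishes: cat limit.out and compare the two numbers.
else
  echo "qsub not on PATH here"
fi
```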
> 
> But I can run the job via the CLI:
> I ssh'd to the master node where SGE ran the job, copied the .po file to a machinefile,
> and ran mpirun with that machinefile (the same SGE host file), and the run was successful.

If it's run from the command line, you have a) a different environment and b) no tight integration of the job into SGE.

-- Reuti


> Why can I run an MPI job directly (via the CLI) while SGE cannot?
> 
> 
> -- Reuti
> 
> 
> > -- Reuti
> >
> >
> > > 2) When I run a job directly via the CLI, regardless of the number of slots and the program, the output is correct!
> > > I think some configuration on the OS and SGE is missing!
> > >
> > > Thx
> > >
> > > _______________________________________________
> > > users mailing list
> > > users at gridengine.org
> > > https://gridengine.org/mailman/listinfo/users
> >
> >
> 
> 



