[gridengine users] problem in run mpi jobs

mahbube rustaee rustaee at gmail.com
Tue Nov 22 12:25:51 UTC 2011


On Mon, Nov 21, 2011 at 1:44 PM, Reuti <reuti at staff.uni-marburg.de> wrote:

> Am 21.11.2011 um 05:30 schrieb mahbube rustaee:
>
> > On Mon, Nov 21, 2011 at 3:27 AM, Reuti <reuti at staff.uni-marburg.de> wrote:
> > Hi,
> >
> > Am 20.11.2011 um 12:37 schrieb mahbube rustaee:
> >
> > > 1) I run Intel MPI jobs. When $NSLOTS <= 50, qsub works, but for more than 50 slots either the output is empty
> > > or the output of the job is:
> > >
> > > mpirun has exited due to process rank 4 with PID 23866 on
> > > node amd-7-5.local exiting without calling "finalize". This may
> > > have caused other processes in the application to be
> > > terminated by signals sent by mpirun (as reported here).
> > > --------------------------------------------------------------------------
> > > [amd-7-5.local:23861] 199 more processes have sent help message help-mtl-psm.txt / unable to open endpoint
> > > [amd-7-5.local:23861] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> > > [amd-7-5.local:23861] 99 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure
> > >
> > > Which configuration is missing?
> >
> > The errors are from Open MPI, but above you state Intel MPI. Hence the $PATH on the exechost might point to the wrong `mpiexec`.
> >
> > You can investigate this by `which mpiexec` in your jobscript.
> > I checked that; the path of mpirun is correct. My script is:
> >  #!/bin/sh
> > #$ -S /bin/bash
> > #$ -N Det2-200core
> > #$ -cwd
> > #$ -l h_vmem=500M,mem_free=10M
> > #$ -j y
> > #$ -pe mpi16 64
> > . $HOME/.intelbash
> > .  /var/mpi-selector/data/openmpi_intel_qlc-1.4.2.sh
> > which mpirun
>
> And what's the output?
>
>
> > mpirun -n $NSLOTS   mpi.intel.comp
> >
> > .intelbash and openmpi_intel_qlc-1.4.2.sh set $PATH and the library path.
>
> As you can't set up two MPI libraries at the same time, I would assume that
> you missed an argument to the script.
>
> Which library did you use to compile the application? This one must be used
> for execution too.
>

Those libraries don't conflict. I modified the qsub script, ran mpi-selector-menu
to configure the environment variables, and added the -V option to the script.
The errors are:

 amd-10-9.local
[amd-10-9.local:28409] [[44203,0],0] ORTE_ERROR_LOG: The system limit on number of network connections a process can open was reached in file oob_tcp.c at line 447
--------------------------------------------------------------------------
Error: system limit exceeded on number of network connections that can be open

This can be resolved by setting the mca parameter opal_set_max_sys_limits to 1,
increasing your limit descriptor setting (using limit or ulimit commands),
or asking the system administrator to increase the system limit.
--------------------------------------------------------------------------
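
For reference, the first two remedies named in the message might be tried from inside the job script roughly as follows. This is only a sketch under assumptions: it relies on Open MPI's convention of reading MCA parameters from OMPI_MCA_<name> environment variables, and it assumes the hard descriptor limit of the job shell is higher than the soft one, which may not hold for shells started by sge_execd. The commented mpirun line is the one from the job script quoted above.

```shell
#!/bin/bash
# Sketch only: ask Open MPI to raise its own limits
# (the parameter name is taken from the error text).
export OMPI_MCA_opal_set_max_sys_limits=1

# Try to raise the soft descriptor limit to the hard limit for this shell.
# If the shell refuses, only a warning is printed and the script continues.
hard_nofile=$(ulimit -H -n)
ulimit -S -n "$hard_nofile" 2>/dev/null || echo "could not raise soft limit" >&2
echo "soft_nofile=$(ulimit -S -n) hard_nofile=$hard_nofile"

# The job's mpirun line would follow here, e.g.:
# mpirun -n $NSLOTS mpi.intel.comp
```

If the hard limit itself is too low, neither setting is likely to help and the third remedy (raising the system limit) would be the one left.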

But I can run the job from the command line:
I logged in via ssh to the master node on which SGE ran the job, copied the
.po file to a machinefile, and ran mpirun with that machinefile (the same
host list SGE used). That run was successful.

Why can I run an MPI job directly (via the CLI) when SGE cannot?
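
One way to narrow this down might be to print the descriptor limits both in an interactive ssh shell and from within a job submitted through qsub, then compare the two. A minimal sketch, assuming bash; the function name report_limits is made up for illustration:

```shell
#!/bin/bash
# Print the host name and the soft/hard open-descriptor limits
# of the shell this script runs in.
report_limits() {
    echo "host=$(uname -n)"
    echo "soft_nofile=$(ulimit -S -n)"
    echo "hard_nofile=$(ulimit -H -n)"
}

report_limits
```

If the values printed under SGE are lower than those from the ssh session, the limits that sge_execd inherited when it was started would be a plausible place to look.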


> -- Reuti
>
>
> > -- Reuti
> >
> >
> > > 2) When I run a job directly from the command line, the output is correct, depending on the number of slots and also on the program.
> > > I think some configuration in the OS or SGE is missing!
> > >
> > > Thx
> > >
> > > _______________________________________________
> > > users mailing list
> > > users at gridengine.org
> > > https://gridengine.org/mailman/listinfo/users
> >
> >
>
>