[gridengine users] Fwd: error in parallel run openmpi for gridengine

Yong Wu wuy069 at gmail.com
Mon Apr 10 05:55:01 UTC 2017


Reuti,
  It was my mistake: I had switched to another input file, and I was
unaware of the difference in ECP basis sets between ORCA 3.0.3 and ORCA
4.0, so a different error message was encountered.
  Adding an entry plm_rsh_agent=foo to the openmpi-mca-params.conf file
works.
  Thanks very much!
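For anyone hitting the same issue, the fix can be sketched as a small
shell snippet. The CONF default below is only a stand-in for
illustration; on a real cluster, point it at the
openmpi-mca-params.conf under your own Open MPI installation prefix
(in this thread, /share/apps/mpi/openmpi2.0.2-ifort/etc/). The
duplicate-entry check is my addition, not something from the thread:

```shell
# Point CONF at the openmpi-mca-params.conf of your Open MPI install,
# e.g. /share/apps/mpi/openmpi2.0.2-ifort/etc/openmpi-mca-params.conf.
# The default below is only a stand-in for illustration.
CONF="${CONF:-openmpi-mca-params.conf}"

# Append the MCA parameter only if no plm_rsh_agent entry exists yet,
# so rerunning the snippet does not duplicate the line.
grep -q '^plm_rsh_agent=' "$CONF" 2>/dev/null || \
    echo 'plm_rsh_agent=foo' >> "$CONF"
```

With the entry in place, every user picks up plm_rsh_agent=foo
automatically, without having to pass "-mca plm_rsh_agent foo" on each
mpirun invocation.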

Best regards,
Yong Wu

2017-04-10 0:04 GMT+08:00 Reuti <reuti at staff.uni-marburg.de>:

>
> Am 09.04.2017 um 15:47 schrieb Yong Wu:
>
> > Reuti,
> > Thanks for your reply again!
> >
> > > I can assure you that for me and others it's working.
> > But it's not working for me.
> >
> > > Aha, I only set $OMP_ROOT/etc/openmpi-mca-params.conf to have an
> entry plm_rsh_agent=foo, so it is set for all users automatically.
> > > I didn't play with a source modification, though.
> > >Nevertheless:
> > >Can you try with the original Open MPI 2.0.2 and call ORCA with:
> > >https://orcaforum.cec.mpg.de/viewtopic.php?f=9&t=2656
> > I added an entry plm_rsh_agent=foo to the openmpi-mca-params.conf file
> (/share/apps/mpi/openmpi2.0.2-ifort/etc/openmpi-mca-params.conf) and
> resubmitted the job, but got the error: "[file orca_main/mainchk.cpp, line
> 130]: Error (ORCA_MAIN): ... aborting the run."
> >
> > I entered the line "time /share/apps/orca4.0.0/orca test.inp "-mca
> plm_rsh_agent foo --bind-to none" > ${SGE_O_WORKDIR}/test.log" instead of
> "time /share/apps/orca4.0.0/orca test.inp > ${SGE_O_WORKDIR}/test.log", and
> got the same error: "[file orca_main/mainchk.cpp, line 130]: Error
> (ORCA_MAIN): ... aborting the run."
>
> But this is now different from the original error message: one machine in
> the hostfile is not in the allocation. Is that error gone?
>
> -- Reuti
>
>
> > > I'm not sure whether Open MPI resolves the hostnames to their
> TCP/IP addresses or just does a literal comparison - which fails.
> > When using the mpich PE, I modified the startmpi.sh on all compute nodes:
> > I changed the line in your PeHostfile2MachineFile() subroutine from
> "host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`" to "host=`echo $line|cut
> -f1 -d" "`"
> > and resubmitted the job, but got the error: "[file orca_main/mainchk.cpp,
> line 130]: Error (ORCA_MAIN): ... aborting the run."
> >
> > Best regards,
> > Yong Wu
> >
> > 2017-04-09 18:27 GMT+08:00 Reuti <reuti at staff.uni-marburg.de>:
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> >
> > Hi,
> >
> > Am 09.04.2017 um 11:14 schrieb Yong Wu:
> >
> > > Dear Reuti,
> > > Thank you very much!
> > > The jobname.nodes file is not necessary for parallel ORCA, and my
> "mpivars.sh" is also not a problem.
> > > The ORCA 3.0.3 program is compiled with openmpi-1.6.5 and can run
> normally on multiple nodes in gridengine, while the ORCA 4.0.0 program is
> compiled with openmpi-2.0.2 and cannot run on multiple nodes in gridengine.
> > > Maybe it is a bug in openmpi-2.0.x affecting ORCA runs on multiple
> nodes in gridengine.
> >
> > I can assure you that for me and others it's working.
> >
> >
> > > I downloaded the latest stable version of Open MPI, but the error also
> appears in openmpi-2.1.0. The bug may not be fixed in the latest stable
> version.
> > >
> > > >The Open MPI bug you checked already: https://www.mail-archive.com/
> users at lists.open-mpi.org/msg30824.html
> > > Thanks for your information. I read it, but it did not solve this
> problem. I modified the code file "orte/mca/plm/rsh/plm_rsh_component.c"
> following this commit: https://github.com/open-mpi/ompi/commit/
> dee2d8646d2e2055e2c86db9c207403366a2453d#diff-
> f556f53efc98e71d3bd13ee9945949fe
> > > and recompiled Open MPI, but it had no effect.
> >
> > Aha, I only set $OMP_ROOT/etc/openmpi-mca-params.conf to have an
> entry plm_rsh_agent=foo, so it is set for all users automatically.
> >
> > I didn't play with a source modification, though.
> >
> > Nevertheless:
> >
> > Can you try with the original Open MPI 2.0.2 and call ORCA with:
> >
> > https://orcaforum.cec.mpg.de/viewtopic.php?f=9&t=2656
> >
> >
> > > >Please change the line in your PeHostfile2MachineFile() subroutine:
> > > >host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`
> > > >to:
> > > >host=`echo $line|cut -f1 -d" "`
> > > >This should leave the ".local" domain,
> > > This is also not a problem, because of my "/etc/hosts":
> > >  10.1.1.1        cluster.local   cluster
> > >  10.1.255.254    compute-0-0.local       compute-0-0
> > >  10.1.255.253    compute-0-1.local       compute-0-1
> > >  10.1.255.244    compute-0-10.local      compute-0-10
> > >  10.1.255.243    compute-0-11.local      compute-0-11
> >
> > I'm not sure whether Open MPI resolves the hostnames to their TCP/IP
> addresses or just does a literal comparison - which fails.
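[Editorial note: the difference between the two cut pipelines discussed
in this thread can be seen directly on a sample PE hostfile line. The
host name below is hypothetical and only illustrates the ".local"
truncation:]

```shell
# A sample line as it might appear in an SGE PE hostfile (hypothetical).
line="compute-0-0.local 4 all.q@compute-0-0.local UNDEFINED"

# Original startmpi.sh: the second cut also strips the ".local" domain.
short=$(echo "$line" | cut -f1 -d" " | cut -f1 -d".")

# Suggested change: keep the fully qualified name from the hostfile.
full=$(echo "$line" | cut -f1 -d" ")

echo "$short"   # compute-0-0
echo "$full"    # compute-0-0.local
```

If Open MPI compares these names literally against the allocation
rather than resolving them, the truncated form will not match
"compute-0-0.local".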
> >
> > - -- Reuti
> > -----BEGIN PGP SIGNATURE-----
> > Comment: GPGTools - https://gpgtools.org
> >
> > iEYEARECAAYFAljqDHoACgkQo/GbGkBRnRo77QCgjcs9bKAKg0TPt2AUUOF3g/cb
> > /sIAn23dn3HaYNGZ7+dqULfMtXyOOlD1
> > =3uu2
> > -----END PGP SIGNATURE-----
> >
>
>