[gridengine users] Fwd: error in parallel run openmpi for gridengine

Reuti reuti at staff.uni-marburg.de
Sun Apr 9 10:27:06 UTC 2017


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi,

On 09.04.2017 at 11:14, Yong Wu wrote:

> Dear Reuti,
> Thank you very much!
> The jobname.nodes file is not necessary for parallel ORCA, and my "mpivars.sh" is not the problem either.
> ORCA 3.0.3 is compiled with openmpi-1.6.5 and runs normally on multiple nodes in gridengine.
> ORCA 4.0.0, however, is compiled with openmpi-2.0.2 and cannot run on multiple nodes in gridengine.
> Maybe it is a bug in openmpi-2.0.x that affects ORCA running on multiple nodes in gridengine.

I can assure you that it works for me and for others.


> I downloaded the latest stable version of openmpi, but the error also appears with openmpi-2.1.0. The bug may not be fixed in the latest stable version.
> 
> >The Open MPI bug you checked already: https://www.mail-archive.com/users@lists.open-mpi.org/msg30824.html
> Thanks for your information. I read it, but I could not solve the problem. I modified the source file "orte/mca/plm/rsh/plm_rsh_component.c" following this commit: https://github.com/open-mpi/ompi/commit/dee2d8646d2e2055e2c86db9c207403366a2453d#diff-f556f53efc98e71d3bd13ee9945949fe
> and recompiled openmpi, but it had no effect.

Aha, I only added an entry plm_rsh_agent=foo to $OMP_ROOT/etc/openmpi-mca-params.conf, so that it is set for all users automatically.

I didn't try a source modification, though.
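
As a minimal sketch of the configuration-file approach mentioned above: the helper name set_mca_default and the installation prefix in the usage line are hypothetical, and "foo" is only the placeholder value used in the entry above. The helper appends the entry once and leaves an existing setting for the same key untouched.

```sh
# set_mca_default FILE KEY VALUE
# Hypothetical helper (not part of Open MPI): add "KEY=VALUE" to an MCA
# parameter file unless the file already contains a setting for KEY.
set_mca_default() {
    conf=$1
    key=$2
    value=$3
    # Create the directory and the file if they do not exist yet.
    mkdir -p "$(dirname "$conf")"
    touch "$conf"
    # Keep an existing setting for this key as it is.
    if grep -q "^[[:space:]]*$key[[:space:]]*=" "$conf"; then
        return 0
    fi
    printf '%s=%s\n' "$key" "$value" >> "$conf"
}

# Usage (the prefix is an assumption, adjust it to the local installation):
#   set_mca_default /opt/openmpi-2.0.2/etc/openmpi-mca-params.conf plm_rsh_agent foo
```

Writing to the system-wide file usually needs write access to the Open MPI installation directory.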

Nevertheless:

Can you try with the original Open MPI 2.0.2 and call ORCA as described here:

https://orcaforum.cec.mpg.de/viewtopic.php?f=9&t=2656


> >Please change the line in your PeHostfile2MachineFile() subroutine:
> >host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`
> >to:
> >host=`echo $line|cut -f1 -d" "`
> >This should leave the ".local" domain,
> This is also not a problem, because of my "/etc/hosts":
>  10.1.1.1        cluster.local   cluster
>  10.1.255.254    compute-0-0.local       compute-0-0
>  10.1.255.253    compute-0-1.local       compute-0-1
>  10.1.255.244    compute-0-10.local      compute-0-10
>  10.1.255.243    compute-0-11.local      compute-0-11

I'm not sure whether Open MPI resolves the hostnames to their TCP/IP addresses or just does a literal comparison, which would fail.
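
To illustrate the difference between the two variants of the host= line, here is a small sketch. The sample line is an assumption about what a typical $PE_HOSTFILE entry looks like (hostname, slots, queue, processor range); the hostname is taken from the /etc/hosts excerpt above.

```sh
# Sample pe_hostfile line (illustrative, not taken from a real job).
line="compute-0-0.local 4 all.q@compute-0-0.local UNDEFINED"

# Current variant: first field, then everything before the first dot.
short=$(echo "$line" | cut -f1 -d" " | cut -f1 -d".")

# Suggested variant: first field only, the ".local" domain stays.
full=$(echo "$line" | cut -f1 -d" ")

echo "short: $short"
echo "full:  $full"

# A purely literal comparison treats the two names as different hosts,
# although /etc/hosts maps both to the same address.
if [ "$short" = "$full" ]; then
    echo "literal match"
else
    echo "literal mismatch"
fi

# On a Linux node, the resolved addresses can be compared with:
#   getent hosts compute-0-0 compute-0-0.local
```

With the sample line this prints "short: compute-0-0", "full:  compute-0-0.local" and "literal mismatch". So if the names are compared literally, the name in the machine file has to be written exactly as Grid Engine reports it.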

- -- Reuti
-----BEGIN PGP SIGNATURE-----
Comment: GPGTools - https://gpgtools.org

iEYEARECAAYFAljqDHoACgkQo/GbGkBRnRo77QCgjcs9bKAKg0TPt2AUUOF3g/cb
/sIAn23dn3HaYNGZ7+dqULfMtXyOOlD1
=3uu2
-----END PGP SIGNATURE-----



