[gridengine users] open mpi problem
Reuti
reuti at staff.uni-marburg.de
Thu Mar 10 21:58:17 UTC 2011
On 10.03.2011 at 22:43, Gavin W. Burris wrote:
> I had already built ompi 1.4.1 with those options. Originally, I
> suspected two compute nodes that I had to kickstart / re-image because
> of bad drives, thinking they were slightly different from the other
> nodes, maybe a library mismatch.
Ok.
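
One way to double-check that a particular build really contains the SGE support is to look for the "gridengine" component in the output of ompi_info. The following is only a minimal sketch: the function name has_sge_support is made up for illustration, and it reads the ompi_info text from stdin so it can be tried without a cluster.

```shell
# has_sge_support: succeed if the ompi_info output on stdin lists the
# "gridengine" ras component (present when Open MPI was built --with-sge).
# The function name is hypothetical; only ompi_info itself is a real tool.
has_sge_support() {
    grep -q 'ras: gridengine'
}

# Intended use on a node (not run here):
#   ompi_info | has_sge_support && echo "SGE support compiled in"
```

It is probably worth running this with the ompi_info that belongs to the same installation as the mpirun the job actually picks up, and on more than one node if a mismatch between nodes is suspected.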
> I have now installed the latest Grid Engine and Open MPI 1.4.2. I was
> still getting the same error, though. After returning to it a few hours
> later, things are looking OK. Weird...
You are just using a plain "mpirun ~/MPI/test"? Then we have to check the settings used for starting slave tasks, where e.g. ROCKS fills in something stupid by default. Can you please post the output of:
$ qconf -sconf
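
The entries of interest in that output are the ones that control how remote tasks are started. As a rough sketch, they can be filtered out as shown here; the function name remote_startup_settings is invented for this example, and it assumes the usual parameter names from sge_conf(5) (qlogin_*, rlogin_*, rsh_*).

```shell
# remote_startup_settings: keep only the qlogin/rlogin/rsh command and
# daemon lines from "qconf -sconf" output supplied on stdin.
# The function name is hypothetical; the parameter names are from sge_conf(5).
remote_startup_settings() {
    grep -E '^(qlogin|rlogin|rsh)_(command|daemon)[[:space:]]'
}

# Intended use on a submit host (not run here):
#   qconf -sconf | remote_startup_settings
```

A value of "builtin" means Grid Engine's own startup mechanism is used; anything else is the path of an external program.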
-- Reuti
> Thanks again!
>
> On 03/10/2011 01:51 PM, Reuti wrote:
>> On 10.03.2011 at 19:38, Gavin W. Burris wrote:
>>
>>> Has anyone had a similar problem to this? Note that each node has 16
>>> slots, so 17 processes must use the interconnect. A simple Open MPI
>>> hello world works as expected:
>>> $ mpirun --machinefile /etc/machines.list -np 17 ~/MPI/test
>>
>> Yep, Open MPI needs to be configured with "--with-sge --with-openib=<dir>"; see http://icl.cs.utk.edu/open-mpi/faq/?category=building#build-p2p
>>
>> Then a plain "mpirun ~/MPI/test" will route the job to the slots granted by SGE for the job automatically.
>>
>> To be sure that IB is used you can disable the tcp interface: "mpirun --mca btl ^tcp ~/MPI/test".
>>
>> -- Reuti
>>
>>
>>> Hello World from Node 16
>>> Hello World from Node 9
>>> Hello World from Node 3
>>> Hello World from Node 2
>>> Hello World from Node 5
>>> Hello World from Node 12
>>> Hello World from Node 11
>>> Hello World from Node 8
>>> Hello World from Node 15
>>> Hello World from Node 7
>>> Hello World from Node 1
>>> Hello World from Node 4
>>> Hello World from Node 0
>>> Hello World from Node 10
>>> Hello World from Node 13
>>> Hello World from Node 14
>>> Hello World from Node 6
>>>
>>>
>>> But with grid engine I get these errors:
>>> $ qrsh -verbose -V -q all.q -pe ompi 17 mpirun -np 17 ~/MPI/test
>>> Your job 23 ("mpirun") has been submitted
>>> waiting for interactive job to be scheduled ...
>>> Your interactive job 23 has been successfully scheduled.
>>> Establishing builtin session to host node17 ...
>>> node17:16.0.ErrPkt: Received packet for context 31 on context 16.
>>> Receive Header Queue offset: 0x0. Exiting.
>>>
>>>
>>> test:29432 terminated with signal 6 at PC=3a67e30265 SP=7fff250173a8.
>>> Backtrace:
>>> /lib64/libc.so.6(gsignal+0x35)[0x3a67e30265]
>>> /lib64/libc.so.6(abort+0x110)[0x3a67e31d10]
>>> /usr/lib64/libpsm_infinipath.so.1[0x2b15b8b35940]
>>> /usr/lib64/libpsm_infinipath.so.1(psmi_handle_error+0x237)[0x2b15b8b35b87]
>>> /usr/lib64/libpsm_infinipath.so.1[0x2b15b8b4ba3d]
>>> /usr/lib64/libpsm_infinipath.so.1(ips_ptl_poll+0x9b)[0x2b15b8b49c5b]
>>> /usr/lib64/libpsm_infinipath.so.1(psmi_poll_internal+0x50)[0x2b15b8b49b30]
>>> /usr/lib64/libpsm_infinipath.so.1[0x2b15b8b2fc51]
>>> /usr/lib64/libpsm_infinipath.so.1[0x2b15b8b30594]
>>> /usr/lib64/libpsm_infinipath.so.1(__psm_ep_connect+0x32e)[0x2b15b8b34ece]
>>> /usr/lib64/openmpi/mca_mtl_psm.so[0x2b15b890ec45]
>>> /usr/lib64/openmpi/mca_pml_cm.so[0x2b15b80d96d4]
>>> /usr/lib64/libmpi.so.0[0x3c2ca36179]
>>> /usr/lib64/libmpi.so.0(MPI_Init+0xf0)[0x3c2ca531c0]
>>> /data0/home/bug/MPI/test(main+0x1c)[0x400844]
>>> /lib64/libc.so.6(__libc_start_main+0xf4)[0x3a67e1d994]
>>> /data0/home/bug/MPI/test[0x400779]
>>> --------------------------------------------------------------------------
>>> mpirun has exited due to process rank 12 with PID 29432 on
>>> node node17 exiting without calling "finalize". This may
>>> have caused other processes in the application to be
>>> terminated by signals sent by mpirun (as reported here).
>>> --------------------------------------------------------------------------
>>>
>>> Simple hostname commands work either way. The combination of Grid
>>> Engine and Open MPI seems to be failing. Any pointers are much appreciated.
>>>
>>> Cheers,
>>> --
>>> Gavin W. Burris
>>> Senior Systems Programmer
>>> Information Security and Unix Systems
>>> School of Arts and Sciences
>>> University of Pennsylvania
>>
>>
>