[gridengine users] trouble running MPI jobs through SGE
m.hankel at uq.edu.au
Fri Apr 10 02:51:42 UTC 2015
I have a ROCKS 6.1.1 install and I have also installed the SGE roll. So
the base config was done via the ROCKS install. The only changes I have
made are setting the h_vmem complex to consumable and setting up a
scratch complex. I have also set the h_vmem for all hosts.
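
For context, making h_vmem consumable means changing its entry in the complex configuration (edited with qconf -mc) roughly as shown below. This is an illustrative sketch, not a copy of this cluster's configuration; the default and urgency columns are assumed to be left at their stock values, and the scratch complex is not shown.

```
#name    shortcut  type    relop  requestable  consumable  default  urgency
h_vmem   h_vmem    MEMORY  <=     YES          YES         0        0
```

The per-host amount is then set in the complex_values of each execution host, for example h_vmem=64G, where 64G is only a placeholder.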
I can run single CPU jobs fine and can execute simple things like
mpirun -np 40 hostname
but I cannot run proper MPI programs. I get the following error.
mpirun noticed that process rank 0 with PID 27465 on node phi-0-3 exited
on signal 11 (Segmentation fault).
The queue error logs on the head node and the execution nodes
(/opt/gridengine/default/spool/../messages) show nothing, and the .e,
.o, .pe and .po files show nothing either. The error above is in the
standard output file of the program. I am trying VASP but have also tried a home
grown MPI code. Both of these have been running out of the box via SGE
for years on our old cluster (which was not ROCKS). I have tried the
supplied orte PE (the programs are built with Open MPI 1.8.4, itself
compiled with the Intel compilers and configured with --with-sge and
--with-verbs) and have also tried a PE where I specify catch_rsh and the
startmpi and stopmpi scripts, but it made no difference. It seems as if
the program does not even start. I am not even trying to run over
several nodes yet.
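
A minimal sketch of the kind of job script this describes, assuming bash as the job shell. The job name, the h_vmem value, and the MPI_PROGRAM variable are hypothetical placeholders, not the actual script used here. The fallbacks for NSLOTS and MPI_PROGRAM only exist so the script can also be run by hand outside SGE.

```shell
#!/bin/bash
# Job name (hypothetical).
#$ -N mpi_test
# Run from the submission directory and merge stderr into stdout.
#$ -cwd
#$ -j y
# Use the supplied orte PE with 40 slots, as in the hostname test.
#$ -pe orte 40
# Per-slot memory request; 2G is a placeholder, not a recommendation.
#$ -l h_vmem=2G

# NSLOTS is set by SGE inside a parallel job; fall back to 1 when run by hand.
NSLOTS="${NSLOTS:-1}"
# MPI_PROGRAM is a hypothetical variable naming the binary to launch.
MPI_PROGRAM="${MPI_PROGRAM:-hostname}"

# VASP needs an unlimited stack; report it if the limit cannot be raised.
ulimit -s unlimited 2>/dev/null || echo "could not raise the stack limit" >&2

# An Open MPI built with --with-sge takes the allocation from SGE,
# so no hostfile is passed here.
if command -v mpirun >/dev/null 2>&1; then
    mpirun -np "$NSLOTS" "$MPI_PROGRAM" || echo "mpirun exited with status $?" >&2
else
    echo "mpirun not found in PATH" >&2
fi
```

Submitted with qsub, this would request the PE and launch the program on the granted slots.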
In addition, I can run the program (VASP) perfectly fine by logging in
to a node via ssh and running it from the command line, and also across
several nodes via a hostfile. So VASP itself is working fine.
I had a look at the environment (env) and made sure the ulimits are set
correctly (VASP needs ulimit -s unlimited to work), and everything looks OK.
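
A minimal sketch of such a check, assuming a POSIX-compatible job shell: the script prints the limits and environment a process actually sees, so the output from an ssh session can be compared with the output from inside an SGE job. The function name dump_runtime_env is hypothetical; NSLOTS and PE_HOSTFILE are the variables SGE sets inside parallel jobs.

```shell
#!/bin/bash
# Hypothetical helper: print the limits and environment visible to this process.
dump_runtime_env() {
    echo "== host: $(uname -n)"
    echo "== stack limit (ulimit -s): $(ulimit -s)"
    echo "== virtual memory limit (ulimit -v): $(ulimit -v)"
    # These two are only set when running inside an SGE parallel job.
    echo "== NSLOTS: ${NSLOTS:-unset}"
    echo "== PE_HOSTFILE: ${PE_HOSTFILE:-unset}"
    echo "== environment"
    env | sort
}

dump_runtime_env
```

Running it once by hand on a node and once via qsub, then comparing the two outputs with diff, would show any limit or variable that differs under SGE.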
Has anyone seen this problem before? Or do you have any suggestions on
how to get some information about where it actually goes wrong?
Thanks in advance
Dr. Marlies Hankel
Research Fellow, Theory and Computation Group
Australian Institute for Bioengineering and Nanotechnology (Bldg 75)
eResearch Analyst, Research Computing Centre and Queensland Cyber Infrastructure Foundation
The University of Queensland
Qld 4072, Brisbane, Australia
Tel: +61 7 334 63996 | Fax: +61 7 334 63992 | mobile:0404262445
Email: m.hankel at uq.edu.au | www.theory-computation.uq.edu.au