[gridengine users] trouble running MPI jobs through SGE
mazouzi at gmail.com
Fri Apr 10 11:16:28 UTC 2015
It seems that your program needs more memory than requested. We have
trouble with VASP when it runs large problems.
Try something like: -l h_vmem=6G
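
One way to check this, and the $PATH question raised below, is to print what a job actually sees and compare it with an interactive login. The following is only a sketch, not something tested on your cluster: the script name check_env.sh and the function name report_env are made up for illustration; NSLOTS is the slot count SGE exports inside a job.

```shell
#!/bin/bash
# check_env.sh (hypothetical name): print the limits and MPI paths a process sees.
# Run it once interactively and once through qsub, then compare the two outputs.

report_env() {
    echo "host: $(uname -n)"
    # A requested h_vmem should show up here as a finite virtual memory limit.
    echo "vmem limit (ulimit -v): $(ulimit -v)"
    # VASP needs this to be "unlimited".
    echo "stack limit (ulimit -s): $(ulimit -s)"
    # Which mpirun would be picked up, if any.
    echo "mpirun: $(command -v mpirun || echo 'not found in PATH')"
    # Set by SGE inside a job; unset in an interactive login.
    echo "NSLOTS: ${NSLOTS:-unset}"
    echo "PATH: $PATH"
}

report_env
```

Run it once from a login shell and once as a job, e.g. `qsub -pe orte 40 -l h_vmem=6G check_env.sh`, then diff the two outputs. With h_vmem defined as a consumable I would expect the request to be counted per slot, so please check that against your complex configuration.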
On Fri, Apr 10, 2015 at 12:12 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
> > On 10.04.2015 at 04:51, Marlies Hankel <m.hankel at uq.edu.au> wrote:
> > Dear all,
> > I have a ROCKS 6.1.1 install and I have also installed the SGE roll. So
> > the base config was done via the ROCKS install. The only changes I have
> > made are setting the h_vmem complex to consumable and setting up a scratch
> > complex. I have also set the h_vmem for all hosts.
> And the VASP job does work without h_vmem? We are using VASP too and have
> no problems with any set h_vmem.
> > I can run single CPU jobs fine and can execute simple things like
> > mpirun -np 40 hostname
> > but I cannot run proper MPI programs. I get the following error.
> > mpirun noticed that process rank 0 with PID 27465 on node phi-0-3 exited
> > on signal 11 (Segmentation fault).
> Are you using the correct `mpiexec` also during execution of a job, i.e.
> between the nodes - maybe the interactive login has a different $PATH set
> than inside a job script?
> And if it's from Open MPI: was the application compiled with the same
> version of Open MPI whose `mpiexec` is used later on all nodes?
> > Basically the queue's error logs on the head node and the execution nodes
> > show nothing (/opt/gridengine/default/spool/../messages), and the .e, .o,
> > .pe and .po files also show nothing. The above error is in the standard output
> > file of the program. I am trying VASP but have also tried a home grown MPI
> > code. Both of these have been running out of the box via SGE for years on
> > our old cluster (which was not ROCKS). I have tried the supplied orte PE
> > (programs are compiled with openmpi 1.8.4
> The easiest would be to stay with Open MPI 1.6.5 as long as possible. In
> the 1.8 series they changed some things which might hinder a proper use:
> - The core binding is enabled by default in Open MPI 1.8. Having two MPI
> jobs on a node they may use the same cores and leave others idle. One can
> use "--bind-to none" and leave the binding of SGE in effect (if any). The
> behavior is different in that way, as SGE will give a job a set of cores,
> and the Linux scheduler is free to move the processes around inside this
> set. The native binding in Open MPI is per process (something SGE can't do,
> of course, as Open MPI opens additional forks after the initial startup of
> `orted`). (Sure, the set of cores given by SGE could be rearranged to pass
> this list to Open MPI.)
> - Open MPI may scan the network before the actual jobs start to get all
> possible routes between the nodes. Depending on the network setup this may
> take 1-2 minutes.
> -- Reuti
> > compiled with intel and with --with-sge and --with-verbs) and have also
> > tried one where I specify catch rsh and startmpi and stopmpi scripts but it
> > made no difference. It seems as if the program does not even start. I am
> > not even trying to run over several nodes yet.
> > Adding to that is that I can run the program (VASP) perfectly fine by
> > ssh to a node and just running from the command line. And also over several
> > nodes via a hostfile. So VASP itself is working fine.
> > I had a look at env and made sure ulimits are set OK (need ulimit -s
> > unlimited for VASP to work) but all looks OK.
> > Has anyone seen this problem before? Or do you have any suggestion on
> > what to do to get some info on where it actually goes wrong?
> > Thanks in advance
> > Marlies
> > --
> > ------------------
> > Dr. Marlies Hankel
> > Research Fellow, Theory and Computation Group
> > Australian Institute for Bioengineering and Nanotechnology (Bldg 75)
> > eResearch Analyst, Research Computing Centre and Queensland Cyber
> > Infrastructure Foundation
> > The University of Queensland
> > Qld 4072, Brisbane, Australia
> > Tel: +61 7 334 63996 | Fax: +61 7 334 63992 | mobile:0404262445
> > Email:
> > m.hankel at uq.edu.au | www.theory-computation.uq.edu.au
> > Notice: If you receive this e-mail by mistake, please notify me,
> > and do not make any use of its contents. I do not waive any
> > privilege, confidentiality or copyright associated with it. Unless
> > stated otherwise, this e-mail represents only the views of the
> > Sender and not the views of The University of Queensland.
> > _______________________________________________
> > users mailing list
> > users at gridengine.org
> > https://gridengine.org/mailman/listinfo/users