[gridengine users] controlling where an MPICH2 job is started from?

John Young John.E.Young at NASA.Gov
Wed Jun 1 13:04:25 UTC 2011


We have a heterogeneous grid with two types of execution hosts --
some newer nodes with 32 cores and 2 GB of memory per core and
some older nodes with two cores and 2 to 4 GB of memory per core.

We have an engineer who submits parallel jobs using MPICH2 to the
grid.  While we have not yet figured out exactly what is happening,
the following is empirically observable:

If the 'master' node for the MPICH2 job is assigned to one of the
newer nodes, everything works fine and the job runs, even if some
of the older nodes are used as part of the computation.

If the 'master' node for the MPICH2 job is assigned to one of the
older nodes, the job dies with an error that says that 'too many
files are open'.  I am guessing that this is a resource issue,
possibly due to the lower total amount of memory available on the
older nodes.
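
An error of the 'too many open files' kind usually comes from the
per-process limit on open file descriptors (EMFILE) rather than from
memory, so it may be worth comparing that limit on an old node and a
new node.  A minimal sketch, assuming Python 3 is available on the
execution hosts; the function name open_file_limits is made up for
illustration:

```python
import resource


def open_file_limits():
    """Return the (soft, hard) per-process limit on open file descriptors."""
    # RLIMIT_NOFILE is the limit whose exhaustion produces EMFILE
    # ("Too many open files") on POSIX systems.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft, hard


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print(f"open-file limit: soft={soft} hard={hard}")
```

Running this inside a small test job on each type of node (not just from
a login shell) should show whether the limits seen by the job differ.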

So the question is, how can I force the master node to always be
one of the newer nodes?  It is fine if the job uses a mix of old
and new nodes -- in fact, we *want* it to, but we want the 'master'
node to always be one of the newer ones.

I have set up a couple of hostgroups corresponding to the old and new
nodes, but if I specify the new hostgroup as a requirement, the job
only runs on the new nodes and not on all of them.  What sort of resource
requirement can I set up that will only apply to the master node and not
to all of the execution hosts?
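
One option that may fit here is the -masterq switch described in
qsub(1), which restricts only the queue instance that hosts the master
task of a parallel job and leaves the remaining slots unconstrained.  A
rough sketch of building such a submission follows; the queue name
all.q, the hostgroup @newnodes, the parallel environment name mpich2,
the slot count and the script name are all placeholders, not values
from this cluster:

```python
import shlex


def build_qsub_command(script, pe_name, slots, master_queue):
    """Build a qsub argument list that constrains only the master queue."""
    return [
        "qsub",
        "-pe", pe_name, str(slots),   # parallel environment and slot count
        "-masterq", master_queue,     # applies to the master task only
        script,
    ]


# Hypothetical values: queue "all.q" restricted to hostgroup "@newnodes".
cmd = build_qsub_command("run_mpich2.sh", "mpich2", 64, "all.q@@newnodes")
print(shlex.join(cmd))
```

The list form can be handed to subprocess.run() as-is on a submit host;
whether the scheduler then spreads the remaining slots over both old and
new nodes depends on the allocation rule of the parallel environment.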

JY


