[gridengine users] MPI jobs spanning several nodes and h_vmem limits

Reuti reuti at staff.uni-marburg.de
Fri Mar 1 12:01:15 UTC 2013


On 01.03.2013 at 12:13, Dave Love wrote:

> Reuti <reuti at staff.uni-marburg.de> writes:
> 
>> On 27.02.2013 at 20:56, Mikael Brandström Durling wrote:
>>> <snip>
>>>> 
>>>> In case you look deeper into the issue, it's also worth noting that there is no option to specify the target queue for `qrsh -inherit` in case you get slots from different queues on the slave system:
>>>> 
>>>> https://arc.liv.ac.uk/trac/SGE/ticket/813
>>> 
>>> Ok. This could lead to incompatible changes to the -inherit behaviour, if the caller to `qrsh -inherit` has to specify the queue requested. On the other hand, I have seen cases where an OMPI job has been allotted slots from two different queues on an exec host, which has resulted in ompi launching two `qrsh -inherit` to the same host.
> 
> In my limited experience, you really don't want to split parallel jobs
> across queues (and you only add queues if there's something you have to
> hang off them).
> 
> I don't really understand what the complaint is here otherwise.  OMPI
> with h_vmem enforced works reasonably well for us (with a single queue).

The h_vmem isn't multiplied on the slave nodes even if you are getting slots from one queue only, despite the fact that the correct value of $NSLOTS on the slave node is known:

$ qsub -pe mpich 4 -l h_vmem=256M test.sh
$ cat test.sh.o5664
pc15370 2 all.q@pc15370 UNDEFINED
pc15381 2 all.q@pc15381 UNDEFINED
Script pc15370: /tmp/5664.1.all.q 4
...
virtual memory          (kbytes, -v) 524288
...
Call pc15370: /tmp/5664.1.all.q 4
...
virtual memory          (kbytes, -v) 262144
...
Call pc15381: /tmp/5664.1.all.q 2
...
virtual memory          (kbytes, -v) 262144
...
Call pc15381: /tmp/5664.1.all.q 2
...
virtual memory          (kbytes, -v) 262144
...

It should also be 524288 (2 slots x 256M) on pc15381, at least for the first call.
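
The expected per-host values can be sketched as follows. This is only an illustration of the arithmetic, assuming the limit should be h_vmem times the slots granted on each host (summed over all queues on that host). The helper name expected_vmem_kb is hypothetical and not part of SGE.

```shell
#!/bin/sh
# Sketch: sum the slots per host from a PE_HOSTFILE-style file and print
# "host slots expected_limit_in_KB", assuming limit = slots * per-slot h_vmem.
# expected_vmem_kb is a hypothetical helper, not an SGE command.
expected_vmem_kb() {
    # $1 = path to the hostfile, $2 = per-slot h_vmem in KB
    awk -v per_slot_kb="$2" '
        { slots[$1] += $2 }   # column 1 = host, column 2 = slots
        END { for (h in slots) print h, slots[h], slots[h] * per_slot_kb }
    ' "$1" | sort
}

# Example with the hostfile of the job above (256M = 262144 KB per slot).
example=/tmp/pe_hostfile.example.$$
printf '%s\n' \
    'pc15370 2 all.q@pc15370 UNDEFINED' \
    'pc15381 2 all.q@pc15381 UNDEFINED' > "$example"
expected_vmem_kb "$example" 262144
rm -f "$example"
```

Inside a job, the same helper could be called as `expected_vmem_kb "$PE_HOSTFILE" 262144` and compared with what `ulimit -aH` reports on each host.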

-- Reuti

Used script:

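#!/bin/sh
# Show the granted hosts, slots and queues.
cat "$PE_HOSTFILE"
. /usr/sge/default/common/settings.sh
# Limits as seen by the job script on the master node.
echo "Script $(hostname): $TMPDIR $NSLOTS"
ulimit -aH
# One qrsh -inherit per remaining slot listed in the machines file.
# Single quotes: the variables are expanded by the shell on the remote side.
for HOST in $(tail -n +2 "$TMPDIR/machines"); do
    qrsh -inherit $HOST 'echo "Call $HOST: $TMPDIR $NSLOTS"; ulimit -aH; sleep 60' &
done
wait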


>> This was a bug; it has been fixed as of Open MPI 1.5.5.
>> 
>> https://svn.open-mpi.org/trac/ompi/changeset/26163
>> 
>> It now always adds up all slots for a machine, even if they come from different queues.
> 
> You'll still get potential confusion from different TMPDIRs, though.  I
> never established whether there was any problem replacing the queue name
> with the cell name in TMPDIR construction, but I have a patch lying
> around to do it.
> 
>>> I'll think of this and add it as a comment to the ticket. Is that
>>> trac instance at arc.liv.ac.uk the best place, even though we are
>>> running OGS? I suppose so?
> 
> I'd be happy to have reports that might improve SGE (if I or someone
> else understands the issue), but I'm afraid I've been flamed for trying
> to help OGS users.
> 
> -- 
> Community Grid Engine:  http://arc.liv.ac.uk/SGE/
> 




