[gridengine users] MPI jobs spanning several nodes and h_vmem limits

Reuti reuti at staff.uni-marburg.de
Tue Mar 5 22:31:19 UTC 2013


Am 05.03.2013 um 18:45 schrieb Dave Love:

> Reuti <reuti at staff.uni-marburg.de> writes:
> 
>> The h_vmem isn't multiplied on the slave nodes, even when all slots
>> are granted from a single queue, despite the correct value of
>> $NSLOTS being known on the slave node:
>> 
>> $ qsub -pe mpich 4 -l h_vmem=256M test.sh
>> $ cat test.sh.o5664
>> pc15370 2 all.q@pc15370 UNDEFINED
>> pc15381 2 all.q@pc15381 UNDEFINED
>> Script pc15370: /tmp/5664.1.all.q 4
>> ...
>> virtual memory          (kbytes, -v) 524288
>> ...
>> Call pc15370: /tmp/5664.1.all.q 4
>> ...
>> virtual memory          (kbytes, -v) 262144
>> ...
>> Call pc15381: /tmp/5664.1.all.q 2
>> ...
>> virtual memory          (kbytes, -v) 262144
>> ...
>> Call pc15381: /tmp/5664.1.all.q 2
>> ...
>> virtual memory          (kbytes, -v) 262144
>> ...
>> 
>> It should also be 524288 on pc15381, at least for the first call.
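
(To spell out the numbers: h_vmem=256M per slot is 262144 kB, and with 2
slots granted per host a per-host multiplication gives 2 * 262144 kB =
524288 kB. The job script on the master pc15370 gets this limit, but the
qrsh -inherit calls only get the single-slot 262144 kB.)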
> 
> I can't reproduce that (with openmpi tight integration).  Doing this
> (which gets three four-core nodes):
> 
>  qsub -pe openmpi 12 -l h_vmem=256M
>  echo "Script $(hostname): $TMPDIR $NSLOTS"
>  ulimit -v
>  for HOST in $(tail -n +2 $PE_HOSTFILE|cut -f1 -d' '); do
>      qrsh -inherit $HOST 'echo "Call $(hostname): $TMPDIR $NSLOTS"; ulimit -v;
>      sleep 60' &
>  done
>  wait

Great, then you already fixed it in the current version.
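
For anyone still on an older version, a minimal check along the lines of
Dave's script (an untested sketch; the PE name "mpich" in the suggested
submit line is just an example) could be:

  #!/bin/sh
  # Show the h_vmem-derived address-space limit on the master and on
  # each slave host of the parallel job.
  # Submit e.g. with: qsub -pe mpich 4 -l h_vmem=256M check.sh
  echo "Master $(hostname): NSLOTS=$NSLOTS"
  ulimit -v
  # $PE_HOSTFILE has one host per line, master first; skip the master
  # and run a command on each slave via the tight integration.
  for HOST in $(tail -n +2 $PE_HOSTFILE | cut -f1 -d' '); do
      qrsh -inherit "$HOST" 'echo "Slave $(hostname):"; ulimit -v' &
  done
  wait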

-- Reuti


> I see:
> 
>  Script node193: /tmp/179483.1.parallel 12
>  1048576
>  Call node228: /tmp/179483.1.parallel 4
>  1048576
>  Call node214: /tmp/179483.1.parallel 4
>  1048576
> 
> -- 
> Community Grid Engine:  http://arc.liv.ac.uk/SGE/
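
(Here 4 slots per host times 262144 kB gives the 1048576 kB shown on
every host, i.e. the multiplication now happens on the slaves as well.)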
