[gridengine users] h_vmem and parallel jobs, or "why exclusive=true is important"

Mark Dixon m.c.dixon at leeds.ac.uk
Thu Sep 22 10:41:09 UTC 2011

I've recently been dealing with a trouble ticket for one of our users. It 
led me down an interesting rabbit hole: what was happening wasn't 
surprising, but the scale of it was.

So I thought I would bore you with it :)

Here goes. Background first...

* We have had reports where OpenMPI jobs above a certain size are 
occasionally killed by our GE (ge6.2u5 plus the odd patch).

* Our compute cluster supports both serial and parallel computing: so each 
queue slot corresponds to a CPU core (instead of say, 1 slot per node).

* We make our users specify the virtual memory their jobs require (via 
h_vmem), to stop nodes from running out of memory. h_vmem isn't a perfect 
match for this (we should have a discussion on the technical merits of the 
alternative options sometime - I'm looking at you, William).

* We use tight integration to keep control of the parallel jobs; however, 
the principles below are mostly applicable to non-tightly integrated jobs.

It turned out that GE was killing the jobs because they had run out of 
vmem. We suggested they used "-l exclusive=true" with a few hand-wavy 
arguments to back it up and it started working again.

So this week, I finally got round to looking at exactly *why* 
exclusive=true fixed things...

How OpenMPI (and similar) interacts with Grid Engine: 

* Grid Engine constrains parallel jobs such that the virtual memory used 
on a compute node cannot exceed h_vmem * slots assigned to it on that 
compute node.

* The first compute node in a job runs the user's batch script, including 
the mpirun command.

* The mpirun command starts the copies of the MPI process for the first 
compute node, but also one "qrsh" command for each of the other nodes in 
the job. Each "qrsh" command runs for the lifetime of the job.

What this means:

The virtual memory overhead on the first node for a job is:

   overhead_vmem = bash_vmem + mpirun_vmem + (nodes -1)*qrsh_vmem

   (nodes is the number of nodes assigned to the job)

And so the extra h_vmem the job needs to ask for is:

   h_vmem = overhead_vmem / node_slots

   (node_slots is the number of copies of the MPI program assigned to the
   first node)

An example job:

Looking at a real 256 core job that failed in MPI_Init, it happened to be 
allocated bits of 96 hosts and only one MPI process was assigned to the 
first node. The virtual memory overhead for the first node was therefore 
(in M):

   overhead_vmem = 66 + 59 + (96 -1)*18 = 1835M

So, as there was only one slot assigned to the first node, the extra 
per-slot h_vmem the job needed to ask for, for the topology it was 
assigned, is:

   h_vmem = 1835 / 1 = 1835M

Yikes. We cannot afford to request the best part of 2Gb (or more) per slot 
to a job _plus_ what the actual MPI program needs, in case we get an 
unfavourable distribution of slots.

How does exclusive=true help?

If the 256 core job was submitted with exclusive=true, it would have been 
allocated on our machine to 32 hosts, 8 processes per node. Running the 
numbers again for the first node:

   overhead_vmem = 66 + 59 + (32 -1)*18 = 683M

And the extra per-slot h_vmem required by the job to accommodate this 
overhead is:

   h_vmem = 683 / 8 = 85M

That's more like it!

Completing the story, the overheads on the non-first compute nodes are 
around the 65M per slot mark, so we have an even <100M/slot vmem overhead 
across the job.

So why don't you use exclusive=true by default?

There's collateral damage to other jobs:


I don't know if this has been fixed in any of the forks.

What else can be done?

We could reconfigure to use a JSV that re-writes the requested PE to 
select one that enforced the number of cores per node, adjusting for the 
amount of RAM. We didn't originally do this, because the exclusive=true 
feature seemed more simpler. Also, it's not that desirable for us, because 
we're already doing something very similar to encode interconnect 

Other avenues of attack to aid scalability, with varying levels of kludge:

* Replace qrsh with a 32-bit version (vmem gives a factor of 2 improvement 
in overhead (vmem comes down from 18M to 9M).

* Enhance GE to sort the hostlist such that the host with the greatest 
number of slots assigned to the job is first in the list, reducing the 
frequency that the problem is hit.

* Enhance GE by making qrsh more light-weight.

If you made it down to the bottom of this post, my thanks :)

Mark Dixon                       Email    : m.c.dixon at leeds.ac.uk
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK

More information about the users mailing list