[gridengine users] h_vmem and parallel jobs, or "why exclusive=true is important"
m.c.dixon at leeds.ac.uk
Thu Sep 22 10:41:09 UTC 2011
I've recently been dealing with a trouble ticket for one of our users. It
led me down an interesting rabbit hole: what was happening wasn't
surprising, but the scale of it was.
So I thought I would bore you with it :)
Here goes. Background first...
* We have had reports where OpenMPI jobs above a certain size are
occasionally killed by our GE (ge6.2u5 plus the odd patch).
* Our compute cluster supports both serial and parallel jobs, so each
queue slot corresponds to a CPU core (rather than, say, 1 slot per node).
* We make our users specify the virtual memory their jobs require (via
h_vmem), to stop nodes from running out of memory. h_vmem isn't a perfect
match for this (we should have a discussion on the technical merits of the
alternative options sometime - I'm looking at you, William).
* We use tight integration to keep control of the parallel jobs; however,
the principles below are mostly applicable to non-tightly integrated jobs.
It turned out that GE was killing the jobs because they had run out of
vmem. We suggested they used "-l exclusive=true" with a few hand-wavy
arguments to back it up and it started working again.
So this week, I finally got round to looking at exactly *why*
exclusive=true fixed things...
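For reference, the suggested submission looked along these lines (the
script name, PE name and h_vmem value here are illustrative placeholders,
not the user's actual job):

```shell
# Request 256 slots from an OpenMPI PE, a per-slot vmem limit, and
# whole-node allocation so the slots are not scattered across many hosts.
# "openmpi" and "myjob.sh" are placeholders for this cluster's setup.
qsub -pe openmpi 256 -l h_vmem=2G -l exclusive=true myjob.sh
```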
How OpenMPI (and similar) interacts with Grid Engine:
* Grid Engine constrains parallel jobs such that the virtual memory used
on a compute node cannot exceed h_vmem * the number of slots assigned to
the job on that node.
* The first compute node in a job runs the user's batch script, including
the mpirun command.
* The mpirun command starts the copies of the MPI process for the first
compute node, but also one "qrsh" command for each of the other nodes in
the job. Each "qrsh" command runs for the lifetime of the job.
What this means:
The virtual memory overhead on the first node for a job is:
overhead_vmem = bash_vmem + mpirun_vmem + (nodes - 1)*qrsh_vmem
(nodes is the number of nodes assigned to the job)
And so the extra h_vmem the job needs to ask for is:
h_vmem = overhead_vmem / node_slots
(node_slots is the number of copies of the MPI program assigned to the
first node)
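In code, the two formulas above can be sketched like this (the per-process
vmem figures are parameters; the defaults are the measured values, in MB,
quoted in the worked example below):

```python
def first_node_overhead(nodes, bash_vmem=66, mpirun_vmem=59, qrsh_vmem=18):
    """Virtual memory overhead (MB) on a job's first node: the batch
    script, mpirun, and one qrsh per *other* node in the job."""
    return bash_vmem + mpirun_vmem + (nodes - 1) * qrsh_vmem

def extra_h_vmem(nodes, node_slots, **vmem_sizes):
    """Extra per-slot h_vmem (MB) the job must request to cover the
    first-node overhead, given node_slots MPI processes on that node."""
    return first_node_overhead(nodes, **vmem_sizes) / node_slots
```

For example, first_node_overhead(96) gives the 1835M figure below.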
An example job:
Looking at a real 256 core job that failed in MPI_Init, it happened to be
allocated bits of 96 hosts and only one MPI process was assigned to the
first node. The virtual memory overhead for the first node was therefore
overhead_vmem = 66 + 59 + (96 - 1)*18 = 1835M
So, as there was only one slot assigned to the first node, the extra
per-slot h_vmem the job needed to ask for, for the topology it was
allocated, was:
h_vmem = 1835 / 1 = 1835M
Yikes. We cannot afford to request the best part of 2GB (or more) per slot
to a job _plus_ what the actual MPI program needs, in case we get an
unfavourable distribution of slots.
How does exclusive=true help?
If the 256 core job was submitted with exclusive=true, it would have been
allocated on our machine to 32 hosts, 8 processes per node. Running the
numbers again for the first node:
overhead_vmem = 66 + 59 + (32 - 1)*18 = 683M
And the extra per-slot h_vmem required by the job to accommodate this was:
h_vmem = 683 / 8 ≈ 85M
That's more like it!
Completing the story, the overheads on the non-first compute nodes are
around the 65M-per-slot mark, so the vmem overhead comes out fairly even,
under 100M/slot, across the whole job.
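Running the two scenarios side by side makes the contrast stark (a
standalone sketch using the measured per-process sizes above: 66M for
bash, 59M for mpirun, 18M per qrsh):

```python
BASH, MPIRUN, QRSH = 66, 59, 18  # measured vmem per process, in MB

def per_slot_overhead(nodes, slots_on_first_node):
    # First-node overhead spread across that node's slots.
    return (BASH + MPIRUN + (nodes - 1) * QRSH) / slots_on_first_node

scattered = per_slot_overhead(96, 1)  # 256 cores over 96 hosts, 1 slot here
packed = per_slot_overhead(32, 8)     # exclusive=true: 32 hosts, 8 slots each
print(f"scattered: {scattered:.0f}M/slot, packed: {packed:.0f}M/slot")
# scattered: 1835M/slot, packed: 85M/slot
```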
So why don't you use exclusive=true by default?
There's collateral damage to other jobs when whole nodes are reserved;
I don't know if this has been fixed in any of the forks.
What else can be done?
We could reconfigure to use a JSV that re-writes the requested PE to
select one that enforces the number of cores per node, adjusting for the
amount of RAM. We didn't originally do this because the exclusive=true
feature seemed simpler. Also, it's not that desirable for us, because
we're already doing something very similar to encode interconnect
topology.
Other avenues of attack to aid scalability, with varying levels of kludge:
* Replace qrsh with a 32-bit version, giving a factor of 2 improvement
in overhead (vmem comes down from 18M to 9M).
* Enhance GE to sort the hostlist such that the host with the greatest
number of slots assigned to the job is first in the list, reducing the
frequency that the problem is hit.
* Enhance GE by making qrsh more light-weight.
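The host-ordering idea in the second bullet is simple to sketch (a
hypothetical post-processing of the granted allocation, not existing GE
code; hostnames are made up):

```python
def reorder_hostlist(allocation):
    """Put the host with the most granted slots first, so the node that
    pays the mpirun + per-node qrsh overhead is the one with the most
    h_vmem headroom. 'allocation' maps hostname -> slots on that host."""
    return sorted(allocation.items(), key=lambda hs: hs[1], reverse=True)

alloc = {"node017": 1, "node042": 8, "node003": 4}
print(reorder_hostlist(alloc)[0][0])  # node042 leads: it has 8 slots
```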
If you made it down to the bottom of this post, my thanks :)
Mark Dixon Email : m.c.dixon at leeds.ac.uk
HPC/Grid Systems Support Tel (int): 35429
Information Systems Services Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK