[gridengine users] Memory errors even after setting h_vmem
simon.andrews at babraham.ac.uk
Tue Feb 24 14:47:52 UTC 2015
We've recently implemented a memory management system on our cluster that
requires users to set h_vmem on their jobs. It also tracks RAM consumption
on each compute node by making h_vmem a consumable resource, so that we
don't overcommit any nodes.
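To confirm what limits a job actually runs under, something like the following could be run from inside a job script. It is only a sketch: it assumes the h_vmem request reaches the job as ordinary per-process resource limits (address space, and possibly data and stack), which should be checked on the cluster in question. The helper names fmt and report_limits are made up for this example.

```python
import resource

def fmt(value):
    """Render a limit in GiB, or 'unlimited' for RLIM_INFINITY."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value / 2**30:.2f} GiB"

def report_limits():
    """Return the (soft, hard) limits most relevant to memory allocation."""
    names = ["RLIMIT_AS", "RLIMIT_DATA", "RLIMIT_STACK"]
    report = {}
    for name in names:
        soft, hard = resource.getrlimit(getattr(resource, name))
        report[name] = (fmt(soft), fmt(hard))
    return report

if __name__ == "__main__":
    # Run inside a job and compare the output with the h_vmem request.
    for name, (soft, hard) in report_limits().items():
        print(f"{name}: soft={soft} hard={hard}")
```

If RLIMIT_AS matches the requested h_vmem, the limit applies to virtual address space rather than to resident RAM.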
Despite this, we're seeing jobs die because they cannot allocate memory.
The nodes on which these failures happen still have plenty of free memory,
and the jobs die from internal malloc errors rather than being killed for
exceeding the limit imposed by Grid Engine.
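For comparison, the same symptom (an allocation failing inside the program while the node has free RAM) can be reproduced outside Grid Engine by capping a process's address space. This is a minimal sketch for a Linux node, not a diagnosis of our jobs: the helper try_allocation and the sizes used are made up for the example, and it assumes nothing about how Grid Engine applies h_vmem.

```python
import subprocess
import sys

# Child script: cap its own address space, then try one large allocation.
CHILD = r"""
import resource, sys
limit = int(sys.argv[1])
request = int(sys.argv[2])
resource.setrlimit(resource.RLIMIT_AS, (limit, limit))
try:
    block = bytearray(request)
    print("allocated")
except MemoryError:
    print("MemoryError")
"""

def try_allocation(limit_bytes, request_bytes):
    """Run the child with an address-space cap and report what happened."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, str(limit_bytes), str(request_bytes)],
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    gib = 2**30
    # Request larger than the cap: fails regardless of free RAM on the node.
    print(try_allocation(1 * gib, 4 * gib))
    # Small request under a generous cap: succeeds.
    print(try_allocation(4 * gib, 16 * 2**20))
```

The failure in the first case comes from the address-space cap alone, so free physical memory on the node does not prevent it.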
I suspect that what is happening is that we're getting memory
fragmentation, so that even though there is plenty of memory available,
the programs aren't able to allocate a large enough contiguous block of
memory and are therefore dying.
Does this seem like a likely explanation? If so, is there anything which
can be done in the configuration of either the queues or the nodes to try
to minimise the chances of these kinds of errors occurring?
The Babraham Institute, Babraham Research Campus, Cambridge CB22 3AT Registered Charity No. 1053902.