[gridengine users] Monitoring slot usage

Simon Andrews simon.andrews at babraham.ac.uk
Thu Jul 30 15:55:00 UTC 2015


Thanks, core binding looks like it does what we need.  Do I understand correctly that if a process spawns more threads than it has slots, those threads are simply restricted to the cores it’s been allocated, so they only slow down their own job and the job doesn’t actually get killed?

I’ll be very careful in testing this :-)
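
One way to check this from inside a test job is to print the number of cores the process is actually allowed to run on. The sketch below assumes a Linux execution host and that the binding is applied through the CPU affinity mask (an assumption, not something confirmed in this thread); `usable_cores` is a hypothetical helper name.

```python
import os

def usable_cores():
    """Return how many cores this process is allowed to run on."""
    # os.sched_getaffinity is only available on some platforms (e.g. Linux).
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    # Fallback: total core count, which ignores any binding.
    return os.cpu_count() or 1

if __name__ == "__main__":
    print("cores available to this job:", usable_cores())
```

Submitting this with different '-binding' values and comparing the printed number against the slots requested should show whether the restriction is in effect.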

Simon.

From: "MacMullan, Hugh" <hughmac at wharton.upenn.edu>
Date: Thursday, 30 July 2015 16:20
To: Simon Andrews <simon.andrews at babraham.ac.uk>, users at gridengine.org
Subject: RE: Monitoring slot usage

Hi Simon:

We use 'Core Binding' to restrict users to the same number of cores as slots requested.

http://www.gridengine.eu/grid-engine-internals/87-exploiting-the-grid-engine-core-binding-feature

We use a JSV (job submission verifier) to assign the binding value and so force compliance, based on the other job inputs: single-slot and MPI jobs are bound to 1 core (for each slot requested), while OpenMP jobs are bound to the number of slots requested in the -pe option.
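
As a rough illustration of that policy (not the actual JSV used at Wharton), the decision could be sketched as a small function. The name `choose_binding` and the PE names "smp" and "openmp" are hypothetical; a real JSV would read the job parameters through the JSV interface and set the binding there.

```python
def choose_binding(pe_name, slots, openmp_pes=("smp", "openmp")):
    """Return a '-binding' value for a job, following the policy described above.

    pe_name    -- parallel environment name, or None for a serial job
    slots      -- number of slots requested
    openmp_pes -- PE names treated as OpenMP/threaded (assumed names)
    """
    if slots < 1:
        raise ValueError("slots must be at least 1")
    # Threaded (OpenMP-style) jobs: bind to as many cores as slots requested.
    if pe_name is not None and pe_name in openmp_pes:
        return f"linear:{slots}"
    # Single-slot and MPI jobs: one core for each slot.
    return "linear:1"
```

For example, `choose_binding("smp", 8)` gives `linear:8`, while a serial job or an MPI job gives `linear:1`.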

Or you might be able to just put '-binding linear:1' in $SGE_ROOT/default/common/sge_request, and then have users specify '-binding linear:#' if they're doing an SMP job.

Test carefully! :)

-Hugh

From: users-bounces at gridengine.org On Behalf Of Simon Andrews
Sent: Thursday, July 30, 2015 11:01 AM
To: users at gridengine.org
Subject: [gridengine users] Monitoring slot usage

What is the recommended way of identifying jobs which are consuming more CPU than they’ve requested?  I have an environment set up where people mostly submit SMP jobs through a parallel environment and we can use this information to schedule them appropriately.  We’ve had several cases though where the jobs have used significantly more cores on the machine they’re assigned to than they requested, so the nodes become overloaded and go into an alarm state.

What options do I have for monitoring the number of cores simultaneously used by a job and comparing this to the number which were requested so I can find cases where the actual usage is way above the request and kill them?
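
A simple version of that comparison can be sketched as follows, assuming the accumulated CPU seconds, wallclock seconds and requested slots are already available for each running job (for example from the scheduler's usage reporting). The record layout and the name `find_overusers` are hypothetical. Note that CPU time divided by wallclock time gives the average number of busy cores, not the instantaneous count, so short bursts will not show up.

```python
def find_overusers(jobs, tolerance=1.5, min_wallclock=60.0):
    """Return (job_id, average_cores, slots) for jobs using more CPU than requested.

    jobs          -- iterable of dicts with keys job_id, slots, cpu_s, wallclock_s
    tolerance     -- flag a job only if it exceeds slots * tolerance
    min_wallclock -- ignore jobs younger than this many seconds
    """
    flagged = []
    for job in jobs:
        wall = job["wallclock_s"]
        if wall < min_wallclock:
            continue  # too little data for a meaningful average
        avg_cores = job["cpu_s"] / wall
        if avg_cores > job["slots"] * tolerance:
            flagged.append((job["job_id"], round(avg_cores, 2), job["slots"]))
    return flagged
```

The tolerance leaves some headroom for helper threads and I/O so that only jobs well above their request are reported.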

Thanks

Simon.
The Babraham Institute, Babraham Research Campus, Cambridge CB22 3AT Registered Charity No. 1053902.
The information transmitted in this email is directed only to the addressee. If you received this in error, please contact the sender and delete this email from your system. The contents of this e-mail are the views of the sender and do not necessarily represent the views of the Babraham Institute. Full conditions at: http://www.babraham.ac.uk/terms