[gridengine users] cluster utilization
beckerje at mail.nih.gov
Thu Feb 25 14:36:22 UTC 2016
Also check out xdmod:
On Thu, Feb 25, 2016 at 09:51:51AM +0100, RDlab wrote:
>I would suggest that you take a look at S-GAE. It gathers data from qacct and displays information using eye-candy graphics for users, queues and the whole cluster. It shows process memory usage, averages, queue wait time, etc.
>By the way, it is free software under GNU license and we are really happy with it :)
>RDlab (Campus Nord - UPC) -- http://rdlab.cs.upc.edu
>C/ Jordi Girona 1-3. Edifici Omega, Despatx 005
>Telf: +34 93 413 78 20
>> On 24 Feb 2016, at 21:22, bergman at merctech.com wrote:
>> Is anyone monitoring cluster utilization with a higher-level view than
>> simply job (qacct) statistics and CPU-seconds used/available?
>> I'm running SoGE 8.1.6 on a cluster with ~70 nodes, ~1400 cores and
>> 200~350K jobs/month and I'm seeking ways to understand the utilization &
>> resource constraints in our cluster overall.
>> The 'jobstats' script is fine for giving feedback to users, looking at
>> things like avg/high/low job runtime, wait time, etc., but it doesn't
>> give good information about overall cluster utilization.
>> I'd like to see these kinds of metrics on cluster use:
>> histogram of CPU utilization, ie:
>>     Utilization   Time
>>     100%          5%
>>     90%           20%
>> histogram of overall memory use, ie:
>>     Utilization   Time
>>     100%          0%
>>     90%           60%
>> correlation between jobs waiting (CPUs idle) and available memory, as in:
>>     Jan 1 14:00 - 20:00
>>         avg 4GB free/node
>>         avg 50% CPU-slots used
>>         avg 12GB RAM request for jobs in 'qw'
>>         memory is the constraint; cluster is fully
>>         utilized but CPUs are idle
>>     Jan 8 08:00 - 14:00
>>         avg 32GB free/node
>>         avg 98% CPU-slots used
>>         avg 2GB RAM request for jobs in 'qw'
>>         CPU is the constraint; cluster is fully
>>         utilized but memory is unused
>> number of jobs queued/waiting (excluding 'hold' jobs)
>> number of CPUs requested vs [CPU time/wallclock time]
>> (useful for detecting if users are requesting multiple
>> cores in the 'threaded' PE but running single-threaded)
>> amount of memory used per job as a function of request, ie:
>>     requested   used avg
>>     =========   ========
>>     4GB         2.1GB
>>     12GB        9GB
>>     20GB        17GB
>> average duration job spends in 'qw' state
>> duration of queue time as a function of number of CPUs requested, ie:
>>     1CPU    1hr avg in 'qw'
>>     2CPU    2hr avg in 'qw'
>>     4CPU    12hr avg in 'qw'
>> duration of queue time as a function of amount of RAM requested, ie:
>>     4GB     1hr avg in 'qw'
>>     12GB    2hr avg in 'qw'
>>     20GB    12hr avg in 'qw'
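Several of the per-job metrics listed above (CPU time vs. slots requested, queue time by slot count) can be derived from accounting records after the fact. A minimal sketch in Python, assuming each finished job has already been parsed into a dict with the hypothetical keys `slots`, `cpu`, `wallclock`, `submit` and `start`, all in seconds except `slots`. qacct reports comparable fields, but the parsing step and these key names are assumptions, not part of any Grid Engine tool:

```python
from collections import defaultdict


def cpu_efficiency(job):
    """CPU time divided by (wallclock * slots); about 1.0 means every slot was busy."""
    denom = job["wallclock"] * job["slots"]
    return job["cpu"] / denom if denom > 0 else 0.0


def wait_seconds(job):
    """Time between submission and start (includes any time spent on hold)."""
    return job["start"] - job["submit"]


def mean_by_slots(jobs, metric):
    """Average metric(job) for each distinct slot count, ordered by slot count."""
    groups = defaultdict(list)
    for job in jobs:
        groups[job["slots"]].append(metric(job))
    return {slots: sum(vals) / len(vals) for slots, vals in sorted(groups.items())}


# Made-up records, for illustration only.
jobs = [
    {"slots": 1, "cpu": 3500.0, "wallclock": 3600.0, "submit": 0, "start": 600},
    {"slots": 4, "cpu": 3600.0, "wallclock": 3600.0, "submit": 0, "start": 7200},
    {"slots": 4, "cpu": 14000.0, "wallclock": 3600.0, "submit": 100, "start": 3700},
]

efficiency = mean_by_slots(jobs, cpu_efficiency)
wait = mean_by_slots(jobs, wait_seconds)
for slots in efficiency:
    print(f"{slots} slots: efficiency {efficiency[slots]:.2f}, "
          f"avg wait {wait[slots] / 3600:.1f}h")
```

A multi-slot job whose efficiency sits near 1/slots (0.25 for a 4-slot job) is probably running single-threaded. Grouping by requested memory instead of slots would work the same way with a different grouping key.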
>> I think that the only way to get this information would be to run 'qstat'
>> periodically, capture & process that data....any better suggestions or
>> scripts that anyone can share?
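If periodic sampling is the route taken, turning the samples into the "Utilization / Time" histograms described above is the easy part. A minimal sketch in Python, assuming samples are taken at a fixed interval and each one has already been reduced to a `(used, total)` pair of integers with `total` greater than zero (slots, or MB of memory). How those numbers are collected is left out, and the function name is hypothetical:

```python
from collections import Counter


def utilization_histogram(samples, step=10):
    """Map each utilization bucket (percent, rounded down to `step`) to the
    percentage of samples that fell into it, highest bucket first.

    With a fixed sampling interval, percent of samples equals percent of time.
    """
    counts = Counter()
    for used, total in samples:
        pct = 100 * used // total  # integer percent, rounded down
        counts[pct // step * step] += 1
    n = len(samples)
    return {bucket: 100.0 * c / n
            for bucket, c in sorted(counts.items(), reverse=True)}


# Made-up slot samples for a 1400-core cluster, for illustration only.
samples = [(1400, 1400), (1330, 1400), (1260, 1400), (1260, 1400), (700, 1400)]
hist = utilization_histogram(samples)

print("Utilization   Time")
for bucket, share in hist.items():
    print(f"{bucket:>4}%        {share:5.1f}%")
```

Integer arithmetic is used for the bucketing so that a sample sitting exactly on a bucket boundary is not pushed into the wrong bucket by floating-point rounding.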
>> users mailing list
>> users at gridengine.org
Jesse Becker (Contractor)