[gridengine users] cluster utilization
rdlab at cs.upc.edu
Thu Feb 25 08:51:51 UTC 2016
I would suggest that you take a look at S-GAE. It gathers data from qacct and displays the information as eye-candy graphics per user, per queue and for the whole cluster. It shows process memory usage, averages, queue wait time...
By the way, it is free software under GNU license and we are really happy with it :)
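If you would rather pull a couple of those numbers out of the accounting data yourself, here is a minimal sketch in Python. It is not taken from S-GAE. It assumes the accounting records have already been parsed into dicts with times in epoch seconds, using the field names from accounting(5) (slots, cpu, ru_wallclock, submission_time, start_time). The helper names and the sample records are made up for illustration.

```python
from collections import defaultdict

def cpu_efficiency(job):
    """CPU seconds used divided by (slots * wallclock seconds).

    Values near 1.0 mean the requested slots were kept busy; a value near
    1/slots hints at a single-threaded job in a multi-slot PE.
    Returns None when the job has no measurable wallclock time.
    """
    denom = job["slots"] * job["ru_wallclock"]
    return job["cpu"] / denom if denom > 0 else None

def avg_wait_by_slots(jobs):
    """Average seconds between submission and start, grouped by slot count."""
    waits = defaultdict(list)
    for job in jobs:
        waits[job["slots"]].append(job["start_time"] - job["submission_time"])
    return {slots: sum(w) / len(w) for slots, w in waits.items()}

# Hypothetical records, for illustration only (not real cluster data).
jobs = [
    {"slots": 4, "cpu": 3600.0, "ru_wallclock": 3600,
     "submission_time": 0, "start_time": 7200},
    {"slots": 1, "cpu": 1800.0, "ru_wallclock": 2000,
     "submission_time": 100, "start_time": 700},
    {"slots": 4, "cpu": 14000.0, "ru_wallclock": 3600,
     "submission_time": 0, "start_time": 3600},
]

for job in jobs:
    print(job["slots"], cpu_efficiency(job))
print(avg_wait_by_slots(jobs))
```

The same grouping works for requested memory instead of slots, provided you extract the memory request from each record first.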
RDlab (Campus Nord - UPC) -- http://rdlab.cs.upc.edu
C/ Jordi Girona 1-3. Edifici Omega, Despatx 005
Telf: +34 93 413 78 20
> On 24 Feb 2016, at 21:22, bergman at merctech.com wrote:
> Is anyone monitoring cluster utilization with a higher-level view than
> simply job (qacct) statistics and CPU-seconds used/available?
> I'm running SoGE 8.1.6 on a cluster with ~70 nodes, ~1400 cores and
> 200~350K jobs/month and I'm seeking ways to understand the utilization &
> resource constraints in our cluster overall.
> The 'jobstats' script is fine for giving feedback to users, looking
> at things like avg/high/low job runtime, wait time, etc., but it doesn't
> give good information about overall cluster utilization.
> I'd like to see these kinds of metrics on cluster use:
> histogram of CPU utilization, ie:
>     Utilization    Time
>     100%            5%
>     90%            20%
> histogram of overall memory use, ie:
>     Utilization    Time
>     100%            0%
>     90%            60%
> correlation between jobs waiting (CPUs idle) and available memory, such as:
>     Jan 1 14:00 - 20:00
>         avg 4GB free/node
>         avg 50% CPU-slots used
>         avg 12GB RAM request for jobs in 'qw'
>         memory is the constraint, cluster is fully
>         utilized but CPUs are idle
>     Jan 8 08:00 - 14:00
>         avg 32GB free/node
>         avg 98% CPU-slots used
>         avg 2GB RAM request for jobs in 'qw'
>         CPU is the constraint, cluster is fully
>         utilized but memory is unused
> number of jobs queued/waiting (excluding 'hold' jobs)
> number of CPUs requested vs [CPU time/wallclock time]
> (useful for detecting if users are requesting multiple
> cores in the 'threaded' PE but running single-threaded)
> amount of memory used per job as a function of request, ie:
>     requested    used (avg)
>     =========    ==========
>     4GB          2.1GB
>     12GB         9GB
>     20GB         17GB
> average duration job spends in 'qw' state
> duration of queue time as a function of number of CPUs requested, ie
>     1CPU     1hr avg in 'qw'
>     2CPU     2hr avg in 'qw'
>     4CPU    12hr avg in 'qw'
> duration of queue time as a function of amount of RAM requested
>     4GB      1hr avg in 'qw'
>     12GB     2hr avg in 'qw'
>     20GB    12hr avg in 'qw'
> I think that the only way to get this information would be to run 'qstat'
> periodically, then capture & process that data. Any better suggestions or
> scripts that anyone can share?
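A rough sketch of the processing half of that periodic-sampling idea, in Python. It assumes each sample has already been reduced to a (used_slots, total_slots) pair with total_slots > 0, and that samples are taken at equal intervals so the share of samples approximates the share of time. How the pairs are extracted from the scheduler output is left out, and the function name and the numbers below are made up.

```python
from collections import Counter

def utilization_histogram(samples, bucket=10):
    """Return {utilization bucket in percent: share of samples in percent}.

    Each sample is a (used, total) pair. Utilization is rounded down to the
    nearest multiple of `bucket`, so 92.9% falls in the 90% bucket.
    """
    counts = Counter()
    for used, total in samples:
        pct = 100 * used / total
        counts[min(100, int(pct // bucket) * bucket)] += 1
    n = len(samples)
    return {b: 100 * c / n for b, c in sorted(counts.items(), reverse=True)}

# Hypothetical samples for a 1400-slot cluster, for illustration only.
samples = [(1400, 1400), (1300, 1400), (1290, 1400), (700, 1400)]

for level, share in utilization_histogram(samples).items():
    print(f"{level:3d}%  {share:5.1f}%")
```

Feeding it (used_memory, total_memory) pairs instead gives the memory histogram in the same shape.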