[gridengine users] cluster utilization

RDlab rdlab at cs.upc.edu
Thu Feb 25 08:51:51 UTC 2016


Hello,

I would suggest that you take a look at S-GAE. It gathers data from qacct and displays it as eye-candy graphics per user, per queue, and for the whole cluster. It shows process memory usage, averages, queue wait time, and more.

By the way, it is free software under a GNU license and we are really happy with it :)

http://rdlab.cs.upc.edu/s-gae
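
If you want to prototype one of the metrics from your list by hand first, here is a minimal sketch in Python. It is not how S-GAE is implemented, only an illustration. It assumes the usual `qacct -j` text layout (records separated by a line of '=' characters, one "key value" pair per line) and the field names slots, ru_wallclock and cpu. The function names and the sample records are hypothetical. It computes CPU time / (wallclock time x slots) per job and averages it by number of slots requested, which should help spot single-threaded jobs that request several cores.

```python
import re
from statistics import mean

# Hypothetical sample in the assumed `qacct -j` layout (not real accounting data).
SAMPLE = """\
==============================================================
owner        alice
jobnumber    101
slots        4
ru_wallclock 100s
cpu          98.0s
==============================================================
owner        bob
jobnumber    102
slots        1
ru_wallclock 200
cpu          190.0
"""


def parse_qacct(text):
    """Split qacct-style output into one dict per job record."""
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("====="):      # record separator
            if current:
                records.append(current)
            current = {}
            continue
        parts = line.split(None, 1)       # "key   value"
        if len(parts) == 2:
            current[parts[0]] = parts[1].strip()
    if current:
        records.append(current)
    return records


def leading_number(value):
    """Return the leading number of a field such as '120', '120s' or '95.5s'."""
    match = re.match(r"[-+]?\d+(?:\.\d+)?", value)
    return float(match.group()) if match else 0.0


def cpu_efficiency(record):
    """CPU time / (wallclock time * slots), or None if it cannot be computed."""
    wall = leading_number(record.get("ru_wallclock", "0"))
    slots = leading_number(record.get("slots", "1"))
    if wall <= 0 or slots <= 0:
        return None
    return leading_number(record.get("cpu", "0")) / (wall * slots)


def efficiency_by_slots(records):
    """Average CPU efficiency grouped by the number of slots requested."""
    groups = {}
    for rec in records:
        eff = cpu_efficiency(rec)
        if eff is not None:
            slots = int(leading_number(rec.get("slots", "1")))
            groups.setdefault(slots, []).append(eff)
    return {slots: mean(values) for slots, values in sorted(groups.items())}


if __name__ == "__main__":
    print(efficiency_by_slots(parse_qacct(SAMPLE)))
```

To try it on real data you could pass the captured output of `qacct -j` instead of SAMPLE. Some versions seem to append a unit suffix such as "s" to the numeric fields, which is why the sketch only reads the leading number of each value. A low average for multi-slot jobs would suggest that the extra cores are mostly idle.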


Best regards,

Gabriel

-- 
RDlab (Campus Nord - UPC)  --  http://rdlab.cs.upc.edu
C/ Jordi Girona 1-3. Edifici Omega, Despatx 005
08034 Barcelona

Telf:	+34 93 413 78 20

> On 24 Feb 2016, at 21:22, bergman at merctech.com wrote:
> 
> Is anyone monitoring cluster utilization with a higher-level view than
> simply job (qacct) statistics and CPU-seconds used/available?
> 
> I'm running SoGE 8.1.6  on a cluster with ~70 nodes, ~1400 cores and
> 200~350K jobs/month and I'm seeking ways to understand the utilization &
> resource constraints in our cluster overall.
> 
> The 'jobstats' script is fine for giving feedback to users, looking at
> things like avg/high/low job runtime, wait time, etc., but it doesn't
> give good information about overall cluster utilization.
> 
> 
> I'd like to see these kinds of metrics on cluster use:
> 
> 	histogram of CPU utilization, ie:
> 		Utilization	Time
> 		100%		5%
> 		 90%		20%
> 
> 	histogram of overall memory use, ie:
> 		Utilization	Time
> 		100%		0%
> 		 90%		60%
> 
> 	correlation between jobs waiting (CPUs idle) and available memory, as
> 	in:
> 		Jan 1	14:00 - 20:00
> 			avg 4GB free/node
> 			avg 50% CPU-slots used
> 			avg 12GB RAM request for jobs in 'qw'
> 				memory is constraint, cluster is fully
> 				utilized but CPUs are idle
> 
> 		Jan 8	08:00 - 14:00
> 			avg 32GB free/node
> 			avg 98% CPU-slots used
> 			avg 2GB RAM request for jobs in 'qw'
> 				CPU is constraint, cluster is fully
> 				utilized but memory is unused
> 
> 
> 	number of jobs queued/waiting (excluding 'hold' jobs)
> 
> 	number of CPUs requested vs [CPU time/wallclock time]
> 		(useful for detecting if users are requesting multiple
> 		cores in the 'threaded' PE but running single-threaded
> 		jobs)
> 
> 	amount of memory used per job as a function of request, ie:
> 		requested 	used avg
> 		=========	========
> 		4GB		2.1GB
> 		12GB		9GB
> 		20GB		17GB
> 
> 	average duration job spends in 'qw' state
> 
> 	duration of queue time as a function of number of CPUs requested, ie
> 		1CPU	1hr avg in 'qw'
> 		2CPU	2hr avg in 'qw'
> 		4CPU	12hr avg in 'qw'
> 
> 	duration of queue time as a function of amount of RAM requested
> 		4GB	1hr avg in 'qw'
> 		12GB	2hr avg in 'qw'
> 		20GB	12hr avg in 'qw'
> 
> I think that the only way to get this information would be to run 'qstat'
> periodically, then capture & process that data. Are there any better
> suggestions or scripts that anyone can share?
> 
> Thanks,
> 
> Mark




