[gridengine users] cluster utilization

bergman at merctech.com bergman at merctech.com
Wed Feb 24 20:22:41 UTC 2016

Is anyone monitoring cluster utilization with a higher-level view than
simply job (qacct) statistics and CPU-seconds used/available?

I'm running SoGE 8.1.6 on a cluster with ~70 nodes, ~1400 cores and
200~350K jobs/month and I'm seeking ways to understand the utilization &
resource constraints in our cluster overall.

The 'jobstats' script is fine for giving feedback to users, looking at
things like avg/high/low job runtime, wait time, etc., but it doesn't
give good information about overall cluster utilization.

I'd like to see these kinds of metrics on cluster use:

	histogram of CPU utilization, ie:
		Utilization	Time
		100%		5%
		 90%		20%

	histogram of overall memory use, ie:
		Utilization	Time
		100%		0%
		 90%		60%

	correlation between jobs waiting (CPUs idle) and available memory, as
		Jan 1	14:00 - 20:00
			avg 4GB free/node
			avg 50% CPU-slots used
			avg 12GB RAM request for jobs in 'qw'
				memory is the constraint: the cluster is
				effectively full, but CPUs are idle

		Jan 8	08:00 - 14:00
			avg 32GB free/node
			avg 98% CPU-slots used
			avg 2GB RAM request for jobs in 'qw'
				CPU is the constraint: the cluster is
				effectively full, but memory is unused

	number of jobs queued/waiting (excluding 'hold' jobs)

	number of CPUs requested vs [CPU time/wallclock time]
		(useful for detecting if users are requesting multiple
		cores in the 'threaded' PE but running single-threaded)

	amount of memory used per job as a function of request, ie:
		requested 	used avg
		=========	========
		4GB		2.1GB
		12GB		9GB
		20GB		17GB

	average duration a job spends in the 'qw' state

	duration of queue time as a function of number of CPUs requested, ie
		1CPU	1hr avg in 'qw'
		2CPU	2hr avg in 'qw'
		4CPU	12hr avg in 'qw'

	duration of queue time as a function of amount of RAM requested
		4GB	1hr avg in 'qw'
		12GB	2hr avg in 'qw'
		20GB	12hr avg in 'qw'
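
To make a couple of the per-job metrics above concrete, here is a minimal
sketch in Python. It assumes the accounting records have already been parsed
into dicts with numeric values; the key names (job_number, slots,
submission_time, start_time, ru_wallclock, cpu) are modeled on qacct field
names but are my assumption about how the data would be loaded, and the 0.5
threshold is arbitrary. The function names are hypothetical.

```python
from collections import defaultdict


def wait_by_slots(jobs):
    """Average seconds between submission and start, grouped by slots."""
    waits = defaultdict(list)
    for job in jobs:
        waits[job["slots"]].append(job["start_time"] - job["submission_time"])
    return {slots: sum(w) / len(w) for slots, w in sorted(waits.items())}


def cpu_efficiency(job):
    """CPU seconds used divided by (slots * wallclock seconds)."""
    available = job["slots"] * job["ru_wallclock"]
    return job["cpu"] / available if available else 0.0


def underusing_slots(jobs, threshold=0.5):
    """Job numbers of multi-slot jobs whose CPU efficiency is below threshold."""
    return [
        job["job_number"]
        for job in jobs
        if job["slots"] > 1 and cpu_efficiency(job) < threshold
    ]
```

A 4-slot job that accumulates about one wallclock's worth of CPU time has an
efficiency near 0.25, which is the signature of a single-threaded job in the
'threaded' PE. The same grouping approach would work for wait time vs. RAM
requested, given a parsed memory request per job.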

I think that the only way to get this information would be to run 'qstat'
periodically, then capture & process that data. Any better suggestions or
scripts that anyone can share?
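
For the processing side of that periodic capture, the utilization histograms
could be built roughly as below. This is only a sketch: it assumes each sample
has already been reduced to a (used, total) pair (slots or memory) and that
samples are taken at a fixed interval, so the share of samples in a bucket
approximates the share of time. The function name and the 10% bucket width are
my own choices.

```python
from collections import Counter


def utilization_histogram(samples, bucket=10):
    """Map each bucket floor (in percent) to the percent of samples in it.

    samples: iterable of (used, total) pairs taken at a fixed interval.
    """
    counts = Counter()
    n = 0
    for used, total in samples:
        if total <= 0:
            continue  # skip samples with no capacity reported
        pct = 100.0 * used / total
        floor = min(int(pct // bucket) * bucket, 100)
        counts[floor] += 1
        n += 1
    if n == 0:
        return {}
    return {b: 100.0 * c / n for b, c in sorted(counts.items(), reverse=True)}
```

Feeding it slot samples gives the CPU histogram; feeding it per-node memory
samples summed across the cluster gives the memory one.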


