[gridengine users] cluster utilization

Jesse Becker beckerje at mail.nih.gov
Thu Feb 25 14:36:22 UTC 2016


Also check out xdmod:
    http://xdmod.sourceforge.net/

On Thu, Feb 25, 2016 at 09:51:51AM +0100, RDlab wrote:
>Hello,
>
>I would suggest that you take a look at S-GAE. It gathers data from qacct and displays information using eye-candy graphics for users, queues and the whole cluster. It shows process memory usage, averages, queue wait time...
>
>By the way, it is free software under GNU license and we are really happy with it :)
>
>http://rdlab.cs.upc.edu/s-gae
>
>
>Best regards,
>
>Gabriel
>
>-- 
>RDlab (Campus Nord - UPC)  --  http://rdlab.cs.upc.edu
>C/ Jordi Girona 1-3. Edifici Omega, Despatx 005
>08034 Barcelona
>
>Telf:	+34 93 413 78 20
>
>> On 24 Feb 2016, at 21:22, bergman at merctech.com wrote:
>>
>> Is anyone monitoring cluster utilization with a higher-level view than
>> simply job (qacct) statistics and CPU-seconds used/available?
>>
>> I'm running SoGE 8.1.6 on a cluster with ~70 nodes, ~1400 cores and
>> 200~350K jobs/month and I'm seeking ways to understand the utilization &
>> resource constraints in our cluster overall.
>>
>> The 'jobstats' script is fine for giving feedback to users, looking at
>> things like avg/high/low job runtime, wait time, etc., but it doesn't
>> give good information about overall cluster utilization.
>>
>>
>> I'd like to see these kinds of metrics on cluster use:
>>
>> 	histogram of CPU utilization, ie:
>> 		Utilization	Time
>> 		100%		5%
>> 		 90%		20%
>>
>> 	histogram of overall memory use, ie:
>> 		Utilization	Time
>> 		100%		0%
>> 		 90%		60%
>>
>> 	correlation between jobs waiting (CPUs idle) and available memory, as
>> 	in:
>> 		Jan 1	14:00 - 20:00
>> 			avg 4GB free/node
>> 			avg 50% CPU-slots used
>> 			avg 12GB RAM request for jobs in 'qw'
>> 				memory is the constraint; cluster is fully
>> 				utilized but CPUs are idle
>>
>> 		Jan 8	08:00 - 14:00
>> 			avg 32GB free/node
>> 			avg 98% CPU-slots used
>> 			avg 2GB RAM request for jobs in 'qw'
>> 				CPU is the constraint; cluster is fully
>> 				utilized but memory is unused
>>
>>
>> 	number of jobs queued/waiting (excluding 'hold' jobs)
>>
>> 	number of CPUs requested vs [CPU time/wallclock time]
>> 		(useful for detecting if users are requesting multiple
>> 		cores in the 'threaded' PE but running single-threaded
>> 		jobs)
>>
>> 	amount of memory used per job as a function of request, ie:
>> 		requested 	used avg
>> 		=========	========
>> 		4GB		2.1GB
>> 		12GB		9GB
>> 		20GB		17GB
>>
>> 	average duration job spends in 'qw' state
>>
>> 	duration of queue time as a function of number of CPUs requested, ie
>> 		1CPU	1hr avg in 'qw'
>> 		2CPU	2hr avg in 'qw'
>> 		4CPU	12hr avg in 'qw'
>>
>> 	duration of queue time as a function of amount of RAM requested
>> 		4GB	1hr avg in 'qw'
>> 		12GB	2hr avg in 'qw'
>> 		20GB	12hr avg in 'qw'
>>
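
For the "CPUs requested vs [CPU time/wallclock time]" item above, here is a
rough sketch in Python. It assumes you feed it the text that 'qacct -j'
prints (one "key value" line per field, records separated by a line of '='
characters) and uses the qacct fields jobnumber, slots, ru_wallclock and cpu.
The helper names, the 0.5 threshold and the sample records are made up for
illustration. In case your version appends unit suffixes such as 's' to the
numbers, number() keeps only the leading numeric part.

```python
import re


def parse_qacct(text):
    """Split `qacct -j` style output into one dict per job record."""
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("====="):      # a line of '=' separates records
            if current:
                records.append(current)
            current = {}
            continue
        parts = line.split(None, 1)       # "key   value"
        if len(parts) == 2:
            current[parts[0]] = parts[1].strip()
    if current:
        records.append(current)
    return records


def number(value):
    """Leading numeric part of a field, e.g. '12.5s' -> 12.5."""
    match = re.match(r"\d+(?:\.\d+)?", value)
    return float(match.group()) if match else 0.0


def cpu_efficiency(rec):
    """CPU time / (slots * wallclock); close to 1.0 means every slot was busy."""
    slots = number(rec["slots"])
    wall = number(rec["ru_wallclock"])
    if slots <= 0 or wall <= 0:
        return None
    return number(rec["cpu"]) / (slots * wall)


def underusing_jobs(records, threshold=0.5):
    """Multi-slot jobs whose efficiency falls below an arbitrary threshold."""
    flagged = []
    for rec in records:
        eff = cpu_efficiency(rec)
        if eff is not None and number(rec["slots"]) > 1 and eff < threshold:
            flagged.append((rec["jobnumber"], round(eff, 2)))
    return flagged


# Hypothetical records, trimmed to the fields used here.
SAMPLE = """\
==============================================================
jobnumber    101
slots        4
ru_wallclock 1000
cpu          1000.0
==============================================================
jobnumber    102
slots        4
ru_wallclock 1000
cpu          3900.0
"""

if __name__ == "__main__":
    for job, eff in underusing_jobs(parse_qacct(SAMPLE)):
        print(f"job {job}: efficiency {eff}")
```

Job 101 asked for 4 slots but only accumulated one core's worth of CPU time,
so it is the one that gets flagged. Something like
"qacct -j > all_jobs.txt" should give a suitable input file.
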
>> I think that the only way to get this information would be to run 'qstat'
>> periodically, then capture & process that data. Any better suggestions or
>> scripts that anyone can share?
>>
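
If you do go the periodic-sampling route, turning the samples into the
"Utilization / Time" histogram is the easy part. A minimal sketch in Python,
assuming you already have (used_slots, total_slots) pairs taken at a fixed
interval (for example scraped from the qstat cluster queue summary). The
function name, the 10% bucket size and the sample numbers are invented for
illustration.

```python
from collections import Counter


def utilization_histogram(samples, bucket=10):
    """Return {bucket floor in percent: percent of samples in that bucket}.

    `samples` is a list of (used, total) pairs taken at a fixed interval,
    so the share of samples approximates the share of time.
    """
    counts = Counter()
    for used, total in samples:
        if total <= 0:                    # skip unusable samples
            continue
        pct = 100.0 * used / total
        floor = min(int(pct // bucket) * bucket, 100)
        counts[floor] += 1
    n = sum(counts.values())
    if n == 0:
        return {}
    return {b: 100.0 * c / n for b, c in sorted(counts.items(), reverse=True)}


if __name__ == "__main__":
    # Hypothetical samples for a 1400-slot cluster.
    samples = [(1400, 1400), (1330, 1400), (1300, 1400), (700, 1400)]
    print("Utilization\tTime")
    for floor, share in utilization_histogram(samples).items():
        print(f"{floor:3d}%\t\t{share:.0f}%")
```

The same function works for memory if you sample (used, total) memory
instead of slots.
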
>> Thanks,
>>
>> Mark
>> _______________________________________________
>> users mailing list
>> users at gridengine.org
>> https://gridengine.org/mailman/listinfo/users
>
>

-- 
Jesse Becker (Contractor)


