[gridengine users] suspend_threshold depending on job I/O

Txema Heredia Genestar txema.heredia at upf.edu
Tue Nov 13 14:24:30 UTC 2012


Hi all,

we have a 300-core cluster with a ~150 TB shared directory (GPFS). Our 
users run genomic analyses that use huge files, which usually do not 
fit on the 500 GB internal HDD of the nodes. As you can imagine, sometimes 
things get pretty intense and all the Nagios disk alarms start going off 
(the disk "works", but we get 10+ second timeouts).

Knowing that I cannot trust our users to request any "disk_intensive" 
parameter/flag, I was thinking of setting a suspend_threshold in the 
queues: watch the shared disk status (e.g. by timing an ls on the shared 
disk) and start suspending jobs when the disk shows, say, a 3-second delay. 
This would be a nice fix for our issue, but it has a problem: when 
there are both "IO-intensive" and "normal" jobs and the 
suspend_threshold kicks in, SGE will start suspending jobs without any 
particular criterion (I am not sure about this part), and lots of innocent 
"normal" jobs will be suspended across all the nodes before the disk 
load stabilizes.
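
To make the "timing an ls" idea concrete, here is a rough sketch of a load 
sensor that reports the listing delay of the shared directory. It follows the 
usual SGE load sensor convention (wait for a line on stdin, exit on "quit", 
otherwise print a begin / host:name:value / end block). The complex name 
gpfs_delay, the function names, and passing the probed path as the first 
argument are all assumptions made for illustration, not anything we have 
deployed.

```python
import os
import socket
import sys
import time


def measure_delay(path):
    """Return the seconds taken to list `path` (the equivalent of timing an ls)."""
    start = time.monotonic()
    os.listdir(path)
    return time.monotonic() - start


def format_report(host, name, value):
    """Build one load report block in the begin / host:name:value / end layout."""
    return "begin\n{}:{}:{:.3f}\nend\n".format(host, name, value)


def run_sensor(stdin, stdout, path, host, name="gpfs_delay"):
    """Answer each line read from stdin with one report; stop on 'quit'.

    `name` is a hypothetical complex name; it would have to match a
    complex defined in the cluster configuration.
    """
    for line in stdin:
        if line.strip() == "quit":
            break
        stdout.write(format_report(host, name, measure_delay(path)))
        stdout.flush()


# Only start the sensor loop when a path to probe is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    run_sensor(sys.stdin, sys.stdout, sys.argv[1], socket.gethostname())
```

Assuming a matching gpfs_delay complex is defined, the queue could then use 
something like "suspend_thresholds gpfs_delay=3". Note that os.listdir has no 
timeout here, so a truly stalled filesystem would block the sensor itself, and 
the host name printed may need adjusting to the name SGE knows the node by.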

Does anyone have any idea/workaround to solve this? Or should I 
ignore/relax all the disk alarms?

Thanks in advance,

Txema