[gridengine users] Debugging *really* long scheduling runs

Joshua Baker-LePain jlb at salilab.org
Fri Nov 1 17:44:06 UTC 2013

I'm currently running Grid Engine 2011.11p1 on CentOS-6.  I'm using 
classic spooling to a local disk, local $SGE_ROOT (except for 
$SGE_ROOT/$SGE_CELL/common), and local spooling directories on the nodes 
(of which there are more than 600).  I'm occasionally seeing *really* long 
scheduling runs (the last two were 4005 and 4847 seconds).  This leads to 
extra fun like:

11/01/2013 08:35:39|event_|sortinghat|W|acknowledge timeout after 600 seconds for event client (schedd:0) on host "$SGE_MASTER"
11/01/2013 08:35:39|event_|sortinghat|E|removing event client (schedd:0) on host "$SGE_MASTER" after acknowledge timeout from event client list
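
To correlate these timeouts with the long scheduling runs, it may help to pull them out of the log programmatically. Below is a minimal Python sketch, assuming the pipe-delimited layout seen in the two lines above (timestamp, component, host, level, message). The field names, the MessageLine type, and the helper functions are hypothetical labels chosen for illustration, not anything defined by Grid Engine.

```python
from datetime import datetime
from typing import Iterable, Iterator, NamedTuple, Optional


class MessageLine(NamedTuple):
    """One parsed log entry; field names are illustrative assumptions."""
    timestamp: datetime
    component: str
    host: str
    level: str
    text: str


def parse_message_line(line: str) -> Optional[MessageLine]:
    """Split a 'date time|component|host|level|text' line, or return None."""
    # Split at most four times so any '|' inside the message text survives.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    stamp, component, host, level, text = parts
    try:
        # Format inferred from the sample lines: MM/DD/YYYY HH:MM:SS.
        timestamp = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return MessageLine(timestamp, component, host, level, text)


def ack_timeouts(lines: Iterable[str]) -> Iterator[MessageLine]:
    """Yield entries whose message mentions an acknowledge timeout."""
    for line in lines:
        entry = parse_message_line(line)
        if entry is not None and "acknowledge timeout" in entry.text:
            yield entry


# The two lines quoted above, used here as sample input.
SAMPLE = [
    '11/01/2013 08:35:39|event_|sortinghat|W|acknowledge timeout after 600 '
    'seconds for event client (schedd:0) on host "$SGE_MASTER"',
    '11/01/2013 08:35:39|event_|sortinghat|E|removing event client (schedd:0) '
    'on host "$SGE_MASTER" after acknowledge timeout from event client list',
]

if __name__ == "__main__":
    for entry in ack_timeouts(SAMPLE):
        print(entry.timestamp.isoformat(), entry.level, entry.text)
```

In practice the same functions could be fed an open file object instead of SAMPLE, and the resulting timestamps compared against the times at which the long scheduling runs were reported.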

I have "PROFILE=1" set, and unsurprisingly most of the time is reported 
under "job dispatching".  Beyond that, I'm not sure how to track down the 
cause.  Where should I be looking?  Are there any other options I can set 
to get more info?


Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
