[gridengine users] Scheduler Performance
rayrayson at gmail.com
Mon Mar 14 18:28:55 UTC 2011
By design, the scheduler (the scheduler thread in 6.2) is CPU bound,
while the qmaster (excluding the scheduler thread) is mostly I/O
(disk, network, etc.) bound. If there is a thread at 100% CPU for 4
minutes, and only very few random I/O operations, then it is very
likely that it is the scheduler thread.
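
To check which thread is the busy one, here is a rough Python sketch that
reads per-thread CPU time from /proc. It assumes a Linux qmaster host, and
the helper names (parse_task_stat, thread_cpu_ticks) are made up for this
example, not part of Grid Engine. The counters are cumulative clock ticks,
so take two samples a few seconds apart and compare them.

```python
import os


def parse_task_stat(line):
    """Parse one /proc/<pid>/task/<tid>/stat line into (tid, name, cpu_ticks)."""
    # The thread name is wrapped in parentheses and may contain spaces,
    # so split around the first "(" and the last ")".
    head, _, rest = line.partition("(")
    name, _, tail = rest.rpartition(")")
    fields = tail.split()
    # tail begins at stat field 3 (state); utime and stime are fields 14
    # and 15, i.e. indexes 11 and 12 here.
    utime, stime = int(fields[11]), int(fields[12])
    return int(head), name, utime + stime


def thread_cpu_ticks(pid):
    """Return (tid, name, cpu_ticks) for each thread of pid, busiest first."""
    task_dir = f"/proc/{pid}/task"
    result = []
    for tid in os.listdir(task_dir):
        with open(os.path.join(task_dir, tid, "stat")) as fh:
            result.append(parse_task_stat(fh.read()))
    return sorted(result, key=lambda t: t[2], reverse=True)
```

Calling thread_cpu_ticks() with the PID of sge_qmaster twice and
subtracting the tick counts shows which thread consumed the CPU in between.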
Can you turn off qmaster profiling, and turn on scheduler profiling?
You can enable it by setting "PROFILE=TRUE" or "PROFILE=1" in the params
of the scheduler config. You will then get the time each stage spends,
something like:
PROF: job-order calculation took 0.020 s
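
If there are many of these lines, a small script can add them up per stage.
The following is only a sketch: it assumes every line of interest has the
same shape as the sample above ("PROF: <stage> took <seconds> s") and
ignores anything else. The names PROF_RE and sum_profile are made up for
this example.

```python
import re
from collections import defaultdict

# Matches lines shaped like: PROF: job-order calculation took 0.020 s
PROF_RE = re.compile(r"PROF: (?P<stage>.+?) took (?P<secs>\d+(?:\.\d+)?) s\b")


def sum_profile(lines):
    """Sum the reported seconds per stage over an iterable of log lines."""
    totals = defaultdict(float)
    for line in lines:
        match = PROF_RE.search(line)
        if match:
            totals[match.group("stage")] += float(match.group("secs"))
    return dict(totals)
```

Pass it an open file handle for whichever log the profiling output ends up
in, then sort the result by value to see which stage dominates.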
You can get more info from doc/devel/rfe/profiling.txt if you have the
source, or online at the Grid Scheduler homepage:
On Mon, Mar 14, 2011 at 1:25 PM, Esztermann, Ansgar
<Ansgar.Esztermann at mpi-bpc.mpg.de> wrote:
>> I/O on the $SGE_ROOT directory can certainly cause the problems you
>> report. I would take a look at what your disks are doing with "iostat -x"
>> if I were you. You might see a large number of small I/O requests: we
>> certainly did.
> There are many small requests, but they seem to be on /var, not
> $SGE_ROOT. Of course, this might be caused by some process apart from
> SGE. Our cluster management software uses MySQL, and that's using /var
> as well.
>> * If $SGE_ROOT is not local to the qmaster, MONITOR=1 can itself generate
>> a large number of small I/Os and be a significant contributor to the
>> problem. Replacing common/schedule with a symlink to a disk local to the
>> qmaster resolved many "slow running" problems for us.
>> * Do your compute nodes spool to local disk, or to an NFS share?
>> ("qconf -sconf | grep execd_spool_dir")
>> * Is $SGE_ROOT local to the qmaster?
> I was about to write "yes", but that's not entirely true. It's on drbd.
>> * Are you using classic or BDB spooling?
> Ansgar Esztermann
> Max-Planck-Institut für biophysikalische Chemie, Abteilung 105