[gridengine users] Scheduler Performance

Mark Dixon m.c.dixon at leeds.ac.uk
Mon Mar 14 15:51:07 UTC 2011


On Mon, 14 Mar 2011, Esztermann, Ansgar wrote:

> Hi List,
>
> can anyone give me a hint as to what scheduler performance to expect, 
> and what would typically be the bottleneck? We have 6.2u5 running here, 
> and one scheduler run takes about 5 minutes (with 600 jobs and 800 
> nodes).
>
> From what I've seen with params monitor=1 and strace, the scheduler[1] 
> has a list of running jobs almost instantaneously, then spends about 
> four minutes at 100% CPU writing nothing to common/schedule (and 
> actually not doing any system calls but futex() and write() to stdout). 
> During that time, it spews a lot of diagnostic messages about resource 
> utilization to stdout (see below[2]). Finally, reservations are made 
> (they take about four seconds each, which is not exactly fast, but quite 
> manageable), and jobs are started (very quickly).
>
> Is such a long delay between the :RUNNING: and :RESERVING: lines normal? 
> I've thought our disk may be at fault here -- /var is often maxed out in 
> terms of bandwidth. But then again, the thread with 100% CPU doesn't do 
> any read() calls.
...

You're running at a bigger scale than we are (~420 hosts) but...

I/O on the $SGE_ROOT directory can certainly cause the problems you 
report. I would take a look at what your disks are doing with "iostat -x" 
if I were you. You might see a large number of small I/O requests: we 
certainly did.
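
For what it's worth, a minimal sketch of how one might sample the disks on 
the qmaster. The function name sample_disk_io and the default interval and 
count (5 seconds, 3 reports) are my own illustrative choices, not anything 
Grid Engine provides; it assumes the sysstat iostat found on Linux.

```shell
# Hypothetical helper: print extended per-device statistics.
# $1 = seconds between reports (default 5), $2 = number of reports (default 3).
# With sysstat, the first report covers the time since boot, so the later
# reports are the ones that reflect current load.
sample_disk_io() {
    if command -v iostat >/dev/null 2>&1; then
        iostat -x "${1:-5}" "${2:-3}"
    else
        echo "iostat not found (on Linux it ships with the sysstat package)" >&2
        return 1
    fi
}
```

Run it as "sample_disk_io" while a scheduler run is in progress. The thing 
to look for is a high request rate (r/s, w/s) combined with a small average 
request size. The size column is named differently between sysstat 
versions (avgrq-sz in older releases, rareq-sz and wareq-sz in newer ones), 
so check the header your version prints.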

* If $SGE_ROOT is not local to the qmaster, MONITOR=1 can itself generate 
a large number of small I/Os and be a significant contributor to the 
problem. Replacing common/schedule with a symlink to a file on a disk 
local to the qmaster resolved many "slow running" problems for us.

* Do your compute nodes spool to local disk, or to an NFS share?
("qconf -sconf | grep execd_spool_dir")

* Is $SGE_ROOT local to the qmaster?

* Are you using classic or BDB spooling?
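
A rough sketch of how the points above could be checked from the qmaster. 
All four function names are hypothetical helpers of my own, not Grid Engine 
commands. It assumes the usual layout in which 
$SGE_ROOT/$SGE_CELL/common/bootstrap carries a "spooling_method" line, and 
that an NFS mount shows up in "df -P" as host:/export.

```shell
# Print the device or export backing a path (first column of "df -P").
fs_source() {
    df -P "$1" | awk 'NR == 2 { print $1 }'
}

# Succeed if a df source looks like a network mount
# ("host:/export" for NFS, "//host/share" for CIFS).
is_remote_source() {
    case "$1" in
        *:/*|//*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Print the spooling method (classic or berkeleydb) from a bootstrap file.
spooling_method() {
    awk '$1 == "spooling_method" { print $2 }' "$1"
}

# Move an existing schedule file into a local directory and leave a
# symlink at the old location. $1 = current file, $2 = local directory.
relocate_schedule() {
    src=$1
    local_dir=$2
    mkdir -p "$local_dir" &&
    mv "$src" "$local_dir/schedule" &&
    ln -s "$local_dir/schedule" "$src"
}
```

For example: is_remote_source "$(fs_source "$SGE_ROOT")" tells you whether 
$SGE_ROOT is on a network mount, and spooling_method 
"$SGE_ROOT/$SGE_CELL/common/bootstrap" answers the classic-or-BDB question. 
For relocate_schedule the local directory is whatever suits your site. It 
is probably safest to run it with sge_qmaster stopped, but that is my 
assumption rather than something I have verified.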

Mark
-- 
-----------------------------------------------------------------
Mark Dixon                       Email    : m.c.dixon at leeds.ac.uk
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------

