[gridengine users] Debugging *really* long scheduling runs
jlb at salilab.org
Sat Nov 2 21:38:55 UTC 2013
On Sat, 2 Nov 2013 at 10:29am, Ed Lauzier wrote
> How many cores do you have on your gridengine master?
The master is a relatively recent and more-than-a-bit overspecced
machine -- 2x Xeon X5687 for a total of 8 physical cores (16 w/ HT) @
3.6GHz and 48GB of RAM. Its one and only function is being the master.
> Do you have any per-host quotas set? ( You need at least 2 cores for
> scheduling decisions to be made involving per-host/per-user quotas in a
> timely manner....)
No per-host quotas, but we do have several project-based quotas.
> How large is your accounting file?
Rather -- 19GB. And it does live on a NAS.
> Any other programs, jobs, people, accessing your accounting file heavily?
> Are you using the gridengine master for anything else besides scheduling
> like copying the common area out in cron with huge accounting files?
As above, the common area lives on our NetApp.
> If you are not already, try and run your scheduler on a 4 core 16 GB
> virtual machine with basic underlying hardware set up for at least
> one 10 GBit/s uplink.
I don't have 10Gb/s on this machine, but it does have 2 bonded 1Gb/s.
> Is your shared nfs area slow or getting hammered by users?
Occasionally, but it doesn't correlate with the scheduling issues, which
are intermittent. For much of yesterday, e.g., the runs were taking <60s.
And then they started creeping up to the point where the last one took
22176s (!) (and sge_qmaster is currently using ~9GB of RAM). And yet I
can't find any job submitted around that time that looks like it would
start to utterly confuse the scheduler.
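A rough way to do that correlation, as a sketch only: take the scheduling
run times, find where the current stretch of slow runs began, and list the
jobs submitted near that point. The function names, the 60s threshold, the
30 minute window, and all of the sample data below are hypothetical (only
the 22176s figure comes from this cluster); pulling the real run times and
submission times out of the logs is left out.

```python
from datetime import datetime, timedelta


def find_creep_onset(runs, threshold_s=60.0):
    """Return the start time of the first run in the trailing slow streak.

    runs: list of (start_time, duration_seconds), sorted by start_time.
    Returns None if the most recent run is at or under the threshold.
    """
    onset = None
    for start, duration in runs:
        if duration > threshold_s:
            if onset is None:
                onset = start  # first slow run of a new streak
        else:
            onset = None  # a fast run resets the streak
    return onset


def jobs_near(jobs, onset, window=timedelta(minutes=30)):
    """Return IDs of jobs submitted within +/- window of onset.

    jobs: list of (job_id, submission_time).
    """
    return [job_id for job_id, submitted in jobs
            if abs(submitted - onset) <= window]


# Hypothetical sample data; timestamps and job IDs are made up.
t0 = datetime(2013, 11, 1, 8, 0, 0)
runs = [
    (t0, 45.0),
    (t0 + timedelta(hours=1), 50.0),
    (t0 + timedelta(hours=2), 300.0),
    (t0 + timedelta(hours=3), 22176.0),
]
jobs = [
    (101, t0 + timedelta(minutes=30)),
    (102, t0 + timedelta(hours=1, minutes=50)),
    (103, t0 + timedelta(hours=2, minutes=20)),
    (104, t0 + timedelta(hours=5)),
]

onset = find_creep_onset(runs)
suspects = jobs_near(jobs, onset)
print(onset, suspects)
```

Widening the window is cheap, so if nothing suspicious turns up it may be
worth rerunning with a larger one before ruling out a single bad job.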
> Just some points to help you in identifying the issues....
Thanks -- it's appreciated.
QB3 Shared Cluster Sysadmin