[gridengine users] sge_schedd exhausts all memory
jlb at salilab.org
Tue Oct 25 17:30:36 UTC 2011
On Tue, 25 Oct 2011 at 6:12pm, SLIM H.A. wrote:
> After using GridEngine 6.1u6 for more than a year, a problem has suddenly
> cropped up with the scheduler. The scheduler rapidly uses all the
> available memory in the system and can ultimately crash the server.
> After stopping qmaster, waiting until top shows normal memory usage, and
> restarting it, sge_schedd immediately claims all memory again. I have
> tried setting params profile=1 with qconf -msconf to
> monitor the scheduler messages file; the output after restarting qmaster
> is below. I cannot see anything relevant, but maybe someone else has
> better insight.
> Does anyone know another way to investigate this "memory leak"?
I recently dealt with a similar problem on 6.1u3. I tracked it down to a
single job -- a 50,000 task array job with a very poorly written job
script which clocked in at over 32MB. Putting a hold on that job settled
SGE back into sane amounts of memory usage. I then gently encouraged the
user to rewrite the job script.
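
If an oversized job script is the suspect, as it was in the case above, one quick check is to list the largest files in the directory where the qmaster keeps spooled job scripts. The sketch below is only an illustration: the helper name largest_files is made up, and the directory to scan is an assumption you must supply yourself (many installations keep job scripts in a job_scripts directory under the qmaster spool, but check your own setup).

```python
import os


def largest_files(directory, top=10):
    """Return up to `top` (size_in_bytes, filename) pairs, largest first.

    `directory` is whatever path holds the spooled job scripts on your
    system; this helper does not know or assume the SGE layout.
    """
    sizes = []
    for entry in os.scandir(directory):
        if entry.is_file():
            sizes.append((entry.stat().st_size, entry.name))
    # Sort by size (then name), biggest first.
    sizes.sort(reverse=True)
    return sizes[:top]


if __name__ == "__main__":
    import sys

    # Usage: python largest_files.py /path/to/job_scripts
    if len(sys.argv) > 1:
        for size, name in largest_files(sys.argv[1]):
            print(f"{size / (1024 * 1024):8.1f} MB  {name}")
```

A script weighing in at tens of megabytes should stand out immediately in that listing.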
One way to track down which job(s) is/are causing the issue is to put a
hold on all queued jobs. Take the hold off in batches and track down the
offending job(s) that way.
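
The hold-and-release approach above can be sketched roughly as follows. This is a minimal illustration, not a tested admin tool: the function names are made up, the parsing assumes the default plain-text qstat layout with the job ID in the first column, and the script only prints the qhold and qrls commands unless dry_run is switched off. In real use you would feed it the output of something like `qstat -u '*' -s p` and watch sge_schedd's memory in top between batches.

```python
import subprocess


def pending_job_ids(qstat_output):
    """Extract unique job IDs (first column) from plain qstat output."""
    ids = []
    for line in qstat_output.splitlines():
        fields = line.split()
        # Header and separator lines do not start with a number.
        if fields and fields[0].isdigit() and fields[0] not in ids:
            ids.append(fields[0])
    return ids


def batches(items, size):
    """Yield consecutive slices of `items`, each with at most `size` entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def hold_commands(job_ids):
    """One qhold invocation per job."""
    return [["qhold", job_id] for job_id in job_ids]


def release_commands(job_ids):
    """One qrls invocation per job."""
    return [["qrls", job_id] for job_id in job_ids]


def run(commands, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Made-up sample input; replace with real qstat output.
    sample = """job-ID  prior   name  user   state submit/start at     slots ja-task-ID
-----------------------------------------------------------------------
    101 0.50000 a.sh  alice  qw    10/25/2011 17:00:00     1 1-50000:1
    102 0.50000 b.sh  bob    qw    10/25/2011 17:01:00     1
"""
    ids = pending_job_ids(sample)
    run(hold_commands(ids))           # hold everything first
    for batch in batches(ids, 1):     # then release one batch at a time
        run(release_commands(batch))  # check memory usage before continuing
```

If memory usage jumps right after a batch is released, re-hold that batch and split it further until a single job remains.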
QB3 Shared Cluster Sysadmin