[gridengine users] sge_schedd exhausts all memory

Joshua Baker-LePain jlb at salilab.org
Tue Oct 25 17:30:36 UTC 2011


On Tue, 25 Oct 2011 at 6:12pm, SLIM H.A. wrote:

> After using GridEngine 6.1u6 for more than a year, a problem has suddenly
> cropped up with the scheduler. The scheduler rapidly uses all the
> available memory in the system and can ultimately crash the server.
> If I stop qmaster, wait until top shows normal memory usage, and
> restart it, sge_schedd immediately claims all memory again. I have
> tried setting params to profile=1 with qconf -msconf to
> monitor the scheduler messages file; the output after restarting qmaster
> is below. I cannot see anything relevant, but maybe someone else has
> better insight.
>
> Does anyone know another way to investigate this "memory leak"?

I recently dealt with a similar problem on 6.1u3.  I tracked it down to a 
single job -- a 50,000 task array job with a very poorly written job 
script which clocked in at over 32MB.  Putting a hold on that job settled 
SGE back into sane amounts of memory usage.  I then gently encouraged the 
user to rewrite the job script.
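
If an oversized job script is the suspect, one rough way to look for it is to
scan the directory where qmaster keeps spooled job scripts and list the
largest files. The following is only a sketch: it assumes the scripts are
stored as plain files in a single directory (often a job_scripts directory
under the qmaster spool, but check your own installation), and large_files is
a hypothetical helper, not part of SGE.

```python
import os


def large_files(directory, min_bytes=1024 * 1024):
    """Return (size, path) pairs for regular files of at least min_bytes.

    Results are sorted largest first. The directory to scan is supplied by
    the caller; where SGE spools job scripts depends on the installation.
    """
    found = []
    for entry in os.scandir(directory):
        # Only look at regular files; skip subdirectories and symlinks.
        if entry.is_file(follow_symlinks=False):
            size = entry.stat(follow_symlinks=False).st_size
            if size >= min_bytes:
                found.append((size, entry.path))
    return sorted(found, reverse=True)


# Example (the path is an assumption, adjust it for your cluster):
# for size, path in large_files("/path/to/qmaster/spool/job_scripts"):
#     print(size, path)
```

The file names in that directory should help map a large script back to its
job, which can then be held while the owner is contacted.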

One way to track down which job(s) is/are causing the issue is to put a 
hold on all queued jobs.  Take the hold off in batches and track down the 
errant job(s).
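
The hold-and-release approach above could be scripted roughly as follows.
This is a minimal sketch, not a tested procedure: it assumes you already have
a list of pending job IDs (for example, collected from qstat output), that
your account is allowed to hold and release other users' jobs, and that the
default hold type applied by qhold and removed by qrls is the one you want.
The function names are made up for illustration.

```python
import subprocess


def batches(job_ids, size):
    """Split job_ids into consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [job_ids[i:i + size] for i in range(0, len(job_ids), size)]


def hold_command(job_ids):
    """Build the argument list for placing a hold on the given jobs."""
    return ["qhold"] + [str(j) for j in job_ids]


def release_command(job_ids):
    """Build the argument list for releasing the hold on the given jobs."""
    return ["qrls"] + [str(j) for j in job_ids]


def release_in_batches(job_ids, size, run=subprocess.check_call, wait=input):
    """Release held jobs one batch at a time, pausing after each batch.

    `run` executes a command and `wait` pauses for the operator; both are
    parameters so the function can be exercised without a live cluster.
    """
    for batch in batches(job_ids, size):
        run(release_command(batch))
        wait("Released %s; check sge_schedd in top, then press Enter: " % batch)


# Example with hypothetical job IDs:
# pending = [101, 102, 103, 104]
# subprocess.check_call(hold_command(pending))
# release_in_batches(pending, size=2)
```

If memory climbs after a particular batch, put the hold back on that batch
and repeat with a smaller batch size to narrow it down to individual jobs.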

Good luck.

-- 
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF


