[gridengine users] Long delay starting jobs, even when compute nodes are empty
Reuti
reuti at staff.uni-marburg.de
Thu Mar 10 21:03:52 UTC 2011
Hi,
On 10.03.2011 at 20:04, Lane Schwartz wrote:
> Lately I've noticed that many of my jobs take much longer than
> expected (sometimes up to half an hour) to go from pending to
> running, even when there are numerous nodes with sufficient resources
> available. Right now, for example, I've got a couple dozen jobs in
> pending, and 38 nodes where no jobs are running.
>
> I was wondering if anyone might be able to shed some light on why this
> might be. As I said, there are plenty of nodes with sufficient
> resources available to run the pending jobs, but they sometimes take a
> long time to go from pending to running.
>
> For reference, mem_free is set to consumable, and my jobs use the
> default value of 4GB for their requested mem_free. There are some
> other users' jobs which request more memory than that.
>
> The only clue I've been able to find is from examining the qmaster
> messages log file. It has lots of lines that look like the errors
> below:
>
> 03/10/2011 13:56:00|worker|t3n2|E|host load value "mem_free" exceeded:
> capacity is 66765959168.262146, job 495795 requests additional
> 68719476736.000000
> 03/10/2011 13:56:00|worker|t3n2|E|cannot start job 495795.1, as
> resources have changed during a scheduling run
> 03/10/2011 13:56:00|worker|t3n2|W|Skipping 108 remaining orders
> 03/10/2011 13:56:00|worker|t3n2|E|cannot start job 495795.1, as
> resources have changed during a scheduling run
- Are these serial or parallel jobs?
- Do you use resource reservation for the mem_free request? Without it, jobs with a smaller request may keep slipping in ahead of the larger ones.
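
To put the numbers in the quoted log into readable units, here is a rough Python sketch that parses the "mem_free exceeded" line and converts both values to GiB. The helper name parse_mem_free_error and the regular expression are only illustrative and are not part of Grid Engine. The last line assumes every slot requests the 4G default mentioned above, which is a guess.

```python
import re

# Log line copied from the quoted qmaster messages file (re-joined onto one line).
LOG_LINE = (
    '03/10/2011 13:56:00|worker|t3n2|E|host load value "mem_free" exceeded: '
    "capacity is 66765959168.262146, job 495795 requests additional "
    "68719476736.000000"
)

GIB = 2 ** 30  # bytes per GiB


def parse_mem_free_error(line):
    """Return (capacity_bytes, requested_bytes) from a 'mem_free exceeded' line.

    Hypothetical helper for illustration only.
    """
    match = re.search(
        r"capacity is ([0-9.]+), job \d+ requests additional ([0-9.]+)", line
    )
    if match is None:
        raise ValueError("not a 'load value exceeded' line")
    return float(match.group(1)), float(match.group(2))


capacity, requested = parse_mem_free_error(LOG_LINE)
print(f"capacity : {capacity / GIB:.2f} GiB")
print(f"requested: {requested / GIB:.2f} GiB")
# Assumption: each slot asks for the 4G default, so this estimates the slot count.
print(f"slots at the 4G default: {requested / (4 * GIB):.0f}")
```

By this arithmetic the host offers roughly 62.18 GiB while job 495795 asks for exactly 64 GiB, which would correspond to 16 slots at 4G each if the request is counted per slot. That is only a possible reading of the log, hence the question about serial versus parallel jobs.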
-- Reuti
> Any tips or pointers would be appreciated.
>
> Thanks,
> Lane
> _______________________________________________
> users mailing list
> users at gridengine.org
> https://gridengine.org/mailman/listinfo/users