[gridengine users] Long delay starting jobs, even when compute nodes are empty
Reuti
reuti at Staff.Uni-Marburg.DE
Fri Mar 11 13:58:18 UTC 2011
On 11.03.2011 at 14:49, Lane Schwartz wrote:
> Rayson,
>
> Thanks for the pointer. In the qmon scheduler configuration, I have
> "Job Scheduling Information" set to true. I assume that's the same
> setting you're referring to?
>
> With this setting enabled, I still don't get very much info. When I
> run qstat -j on my jobs, the only thing it tells me is that a queue
> instance for a particular node is dropped because that node is
> disabled.
Dropped because it is "disabled"? Did someone use `qmon` to disable the node, or set up a calendar?
-- Reuti
> Thanks,
> Lane
>
> On Thu, Mar 10, 2011 at 4:28 PM, Rayson Ho <rayrayson at gmail.com> wrote:
>> Turn on "schedd_job_info", and run qstat -j to see why the scheduler
>> is not assigning jobs.
>>
>> http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html
>> http://gridscheduler.sourceforge.net/howto/troubleshooting.html
>>
>> Rayson
>>
>>
>>
>> On Thu, Mar 10, 2011 at 2:04 PM, Lane Schwartz <dowobeha at gmail.com> wrote:
>>> Hi,
>>>
>>> Lately I've noticed that many of my jobs take much longer than
>>> expected (sometimes up to half an hour) to go from pending to
>>> running, even when there are numerous nodes with sufficient resources
>>> available. Right now, for example, I've got a couple dozen jobs in
>>> pending, and 38 nodes where no jobs are running.
>>>
>>> I was wondering if anyone might be able to shed some light on why this
>>> might be. As I said, there are plenty of nodes with sufficient
>>> resources available to run the pending jobs, but they sometimes take a
>>> long time to go from pending to running.
>>>
>>> For reference, mem_free is set to consumable, and my jobs use the
>>> default value of 4GB for their requested mem_free. There are some
>>> other users' jobs which request more memory than that.
>>>
>>> The only clue I've been able to find is from examining the qmaster
>>> messages log file. It has lots of lines that look like the errors
>>> below:
>>>
>>> 03/10/2011 13:56:00|worker|t3n2|E|host load value "mem_free" exceeded:
>>> capacity is 66765959168.262146, job 495795 requests additional
>>> 68719476736.000000
>>> 03/10/2011 13:56:00|worker|t3n2|E|cannot start job 495795.1, as
>>> resources have changed during a scheduling run
>>> 03/10/2011 13:56:00|worker|t3n2|W|Skipping 108 remaining orders
>>> 03/10/2011 13:56:00|worker|t3n2|E|cannot start job 495795.1, as
>>> resources have changed during a scheduling run
>>>
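
The numbers in the first log line above are easier to compare once converted to a common unit. The following is a minimal sketch, assuming both values are byte counts (the log does not state the unit); the regular expression and the helper name `parse_exceeded` are illustrative and not part of Grid Engine. The log line itself is copied from the excerpt above and rejoined onto one line.

```python
import re

# Log line copied from the qmaster messages excerpt, rejoined onto one line.
LOG_LINE = (
    '03/10/2011 13:56:00|worker|t3n2|E|host load value "mem_free" exceeded: '
    'capacity is 66765959168.262146, job 495795 requests additional '
    '68719476736.000000'
)

# Hypothetical pattern written for this one message format.
PATTERN = re.compile(
    r'host load value "(?P<resource>[^"]+)" exceeded: '
    r'capacity is (?P<capacity>[\d.]+), '
    r'job (?P<job>\d+) requests additional (?P<request>[\d.]+)'
)

GIB = 1024 ** 3  # assumption: the logged values are bytes


def parse_exceeded(line):
    """Return (resource, job_id, capacity, request), or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return (
        m.group("resource"),
        int(m.group("job")),
        float(m.group("capacity")),
        float(m.group("request")),
    )


resource, job_id, capacity, request = parse_exceeded(LOG_LINE)
print(
    f"job {job_id}: requests {request / GIB:.2f} GiB of {resource}, "
    f"capacity is {capacity / GIB:.2f} GiB"
)
```

Under that assumption, job 495795 asks for 64 GiB of mem_free while the reported capacity is about 62.18 GiB, so this particular request could not be satisfied by that capacity.
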
>>> Any tips or pointers would be appreciated.
>>>
>>> Thanks,
>>> Lane
>>> _______________________________________________
>>> users mailing list
>>> users at gridengine.org
>>> https://gridengine.org/mailman/listinfo/users
>>>
>>
>
>
>
> --
> When a place gets crowded enough to require ID's, social collapse is not
> far away. It is time to go elsewhere. The best thing about space travel
> is that it made it possible to go elsewhere.
> -- R.A. Heinlein, "Time Enough For Love"