[gridengine users] Queue dropped because it is full, except it is not

John_Tai John_Tai at smics.com
Thu Apr 20 06:28:51 UTC 2017


>> The queue is also defined as being "qtype INTERACTIVE"?

Yes, it is both interactive and batch.

>> And only a load of 7.75?

That was the current load.

>> Are there any consumable resource requests? I.e. is the memory perhaps fully used up by the already running jobs (be it h_vmem, virtual_free or any other consumable)?

Jobs are not submitted with any consumable requests, though I have set virtual_free as a complex.

>> Did you upgrade all nodes?

I did upgrade all exec hosts.

Here are error messages from the master:

04/20/2017 14:28:07|schedu|ibm068|E|cannot start job 5066074.1, as resources have changed during a scheduling run
04/20/2017 14:28:08|worker|ibm068|E|host load value "virtual_free" exceeded: capacity is 20690952192.524288, job 5066074 requests additional 21474836480.000000
04/20/2017 14:28:08|worker|ibm068|E|cannot start job 5066074.1, as resources have changed during a scheduling run
04/20/2017 14:28:08|worker|ibm068|W|Skipping remaining 32 orders
04/20/2017 14:28:08|schedu|ibm068|E|cannot start job 5066074.1, as resources have changed during a scheduling run
04/20/2017 14:28:09|worker|ibm068|E|host load value "virtual_free" exceeded: capacity is 20690952192.524288, job 5066074 requests additional 21474836480.000000
04/20/2017 14:28:09|worker|ibm068|E|cannot start job 5066074.1, as resources have changed during a scheduling run
04/20/2017 14:28:09|worker|ibm068|W|Skipping remaining 32 orders
04/20/2017 14:28:09|schedu|ibm068|E|cannot start job 5066074.1, as resources have changed during a scheduling run
04/20/2017 14:28:10|worker|ibm068|E|host load value "virtual_free" exceeded: capacity is 20690952192.524288, job 5066074 requests additional 21474836480.000000
04/20/2017 14:28:10|worker|ibm068|E|cannot start job 5066074.1, as resources have changed during a scheduling run
04/20/2017 14:28:10|worker|ibm068|W|Skipping remaining 32 orders
04/20/2017 14:28:10|schedu|ibm068|E|cannot start job 5066074.1, as resources have changed during a scheduling run
04/20/2017 14:28:11|worker|ibm068|E|host load value "virtual_free" exceeded: capacity is 20690952192.524288, job 5066074 requests additional 21474836480.000000
04/20/2017 14:28:11|worker|ibm068|E|cannot start job 5066074.1, as resources have changed during a scheduling run
04/20/2017 14:28:11|worker|ibm068|W|Skipping remaining 33 orders
04/20/2017 14:28:11|schedu|ibm068|E|cannot start job 5066074.1, as resources have changed during a scheduling run
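
The "host load value ... exceeded" lines carry the numbers that matter here. The following is a minimal sketch, not part of Grid Engine, that pulls the capacity and the requested amount out of one such line and converts both to GiB. The regular expression is an assumption based only on the message format shown in this log, and the function name is hypothetical. It does not say where the 20 GiB request comes from, only that the master accounts for one.

```python
import re

# One of the "exceeded" lines from the log above, copied verbatim.
LOG_LINE = (
    '04/20/2017 14:28:08|worker|ibm068|E|host load value "virtual_free" '
    "exceeded: capacity is 20690952192.524288, "
    "job 5066074 requests additional 21474836480.000000"
)

# Pattern inferred from the message text in this thread (an assumption,
# not a documented format).
PATTERN = re.compile(
    r'host load value "(?P<resource>[^"]+)" exceeded: '
    r"capacity is (?P<capacity>[0-9.]+), "
    r"job (?P<job>[0-9]+) requests additional (?P<requested>[0-9.]+)"
)

GIB = 2 ** 30  # bytes per GiB


def parse_exceeded(line):
    """Return resource, job id and sizes in GiB, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    capacity = float(match.group("capacity"))
    requested = float(match.group("requested"))
    return {
        "resource": match.group("resource"),
        "job": int(match.group("job")),
        "capacity_gib": capacity / GIB,
        "requested_gib": requested / GIB,
        # Positive value: the request does not fit into what is left.
        "shortfall_gib": (requested - capacity) / GIB,
    }


info = parse_exceeded(LOG_LINE)
print(
    "job {job}: {resource} requested {requested_gib:.2f} GiB, "
    "capacity {capacity_gib:.2f} GiB, short by {shortfall_gib:.2f} GiB".format(**info)
)
```

For the line above, the requested amount is exactly 20 GiB and the reported capacity is slightly below that, so the job cannot be dispatched to that host.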



-----Original Message-----
From: Reuti [mailto:reuti at staff.uni-marburg.de]
Sent: Wednesday, April 19, 2017 7:26
To: John_Tai
Cc: users at gridengine.org
Subject: Re: [gridengine users] Queue dropped because it is full, except it is not

Hi,

> On 19.04.2017 at 09:00, John_Tai <John_Tai at smics.com> wrote:
>
> I am trying to submit a job to a specific host in the queue:
>
> # qrsh -verbose -q gui.q@ibm056
> Your job 5049542 ("QRLOGIN") has been submitted
> waiting for interactive job to be scheduled ...
>
>
> However it is in waiting state:
>
> # qstat -u johnt
> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
> -----------------------------------------------------------------------------------------------------------------
> 5049542 0.55500 QRLOGIN    johnt        qw    04/19/2017 14:51:19                                    1

The queue is also defined as being "qtype INTERACTIVE"?


> # qstat -j 5049542 |grep gui.q
> hard_queue_list:            gui.q@ibm056
>                             queue instance "gui.q@dsbm05" dropped because it is full
>
> Here is the current status of the queue:
>
> # qstat -f |grep gui.q
> gui.q@dsbm04                   BIP   0/5/45         8.87     lx24-amd64
> gui.q@dsbm05                   BIP   0/55/55        7.75     lx24-amd64

And only a load of 7.75?


> gui.q@ibm056                   BIP   0/11/30        3.15     lx24-amd64

Are there any consumable resource requests? I.e. is the memory perhaps fully used up by the already running jobs (be it h_vmem, virtual_free or any other consumable)?


> gui.q@ibm057                   BIP   0/11/30        1.34     lx24-amd64
> gui.q@ibm058                   BIP   0/11/45        3.47     lx24-amd64
>
>
> The same goes for ibm057 and ibm058. It seems that dsbm05 being full blocks all following servers in the queue list. In fact I can submit to dsbm04, which precedes dsbm05.
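
To check the slot picture described above without reading the listing by eye, a small sketch along the following lines could be used. It assumes the third column of `qstat -f` is resv/used/tot., as in the output quoted in this thread, and it uses the `queue@host` spelling of the instance names (the list archive renders `@` as " at "). The function names are hypothetical, and free slots are approximated as total minus used.

```python
# Queue instance lines as quoted in this thread (qstat -f | grep gui.q).
SAMPLE = """\
gui.q@dsbm04                   BIP   0/5/45         8.87     lx24-amd64
gui.q@dsbm05                   BIP   0/55/55        7.75     lx24-amd64
gui.q@ibm056                   BIP   0/11/30        3.15     lx24-amd64
gui.q@ibm057                   BIP   0/11/30        1.34     lx24-amd64
gui.q@ibm058                   BIP   0/11/45        3.47     lx24-amd64
"""


def parse_queue_line(line):
    """Split one queue instance line into its columns."""
    name, qtype, slots, load, arch = line.split()[:5]
    # Assumed column layout: reserved/used/total slots.
    reserved, used, total = (int(part) for part in slots.split("/"))
    return {
        "name": name,
        "qtype": qtype,
        "reserved": reserved,
        "used": used,
        "total": total,
        "free": total - used,  # approximation, ignores reservations
        "load": float(load),
        "arch": arch,
    }


def parse_listing(text):
    """Parse every non-empty line of the listing."""
    return [parse_queue_line(line) for line in text.splitlines() if line.strip()]


def full_instances(text):
    """Names of queue instances that have no free slot left."""
    return [q["name"] for q in parse_listing(text) if q["free"] <= 0]


queues = {q["name"]: q for q in parse_listing(SAMPLE)}
print("full:", full_instances(SAMPLE))
print("free slots on gui.q@ibm056:", queues["gui.q@ibm056"]["free"])
```

On the listing quoted here, only gui.q@dsbm05 has no free slots, while gui.q@ibm056 still has slots available, which matches the observation that "full" cannot be the real reason the job stays pending on ibm056.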
>
> I recently upgraded from sge6.1 to sge6.2u6, though I can’t be sure that’s the only thing that’s changed. How do I even begin to debug this?

Did you upgrade all nodes?

-- Reuti