[gridengine users] Cannot request resource if it is a load value of memory type: SGE reports it as unknown resource

Ilya M 4ilya.m+grid at gmail.com
Fri Jan 23 20:36:53 UTC 2015


Naturally, it does:

 > qconf -sc | grep mem_free
mem_free            mf                MEMORY      <= YES         NO         0        0
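
For anyone comparing this against their own cluster, the row can be read column by column. This is a minimal sketch, assuming the usual `qconf -sc` column order (name, shortcut, type, relop, requestable, consumable, default, urgency); `describe_complex` is a hypothetical helper, not part of SGE:

```shell
#!/bin/sh
# Hypothetical helper (not an SGE tool): label the fields of one `qconf -sc` row.
# Assumed column order, as in the header that `qconf -sc` prints:
#   name shortcut type relop requestable consumable default urgency
describe_complex() {
    # $1 = one complex row passed as a string, so the helper also works offline
    printf '%s\n' "$1" | awk '{
        printf "name=%s type=%s relop=%s requestable=%s consumable=%s\n", $1, $3, $4, $5, $6
    }'
}

# The row from this message, with the wrapped line rejoined.
describe_complex "mem_free            mf                MEMORY      <= YES         NO         0        0"
# prints: name=mem_free type=MEMORY relop=<= requestable=YES consumable=NO

# On a live cluster one could feed it the real row instead (not run here):
#   describe_complex "$(qconf -sc | grep '^mem_free ')"
```

With requestable=YES and consumable=NO, the definition itself gives no obvious reason for the "unknown resource" message.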

And it is reported on all nodes:

 > qhost -F mem_free -h gpu001
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
gpu001                  lx24-amd64     16  3.32  126.1G   37.2G    4.0G     0.0
     Host Resource(s):      hl:mem_free=88.885G

And everything was working until a week ago.

Ilya.

-------- Original Message --------
Subject: Re: [gridengine users] Cannot request resource if it is a load 
value of memory type: SGE reports it as unknown resource
From: Ian Kaufman <ikaufman at eng.ucsd.edu>
To: Ilya M <4ilya.m+grid at gmail.com>
Date: 1/23/15, 11:38 AM
> Is mem_free defined in the host complex_values? What does
>
> qconf -sc | grep mem_free
>
> show? Is there a default value defined?
>
> Ian
>
> On Fri, Jan 23, 2015 at 11:30 AM, Ilya M <4ilya.m+grid at gmail.com> wrote:
>> Because I am testing with qsub -w v, the job is not accepted for
>> scheduling, no job ID is generated, and qstat -j will not work. The output
>> of qsub is as I showed in the original email:
>>
>> Job 2210897 (mem_free=100G) cannot run in queue "gpu.q@gpu001" because job requests unknown resource (mem_free)
>> Job 2210897 (mem_free=100G) cannot run in queue "gpu.q@gpu002" because job requests unknown resource (mem_free)
>> Job 2210897 (mem_free=100G) cannot run in queue "gpu.q@gpu003" because job requests unknown resource (mem_free)
>> Job 2210897 (mem_free=100G) cannot run in queue "gpu.q@gpu004" because job requests unknown resource (mem_free)
>> Job 2210897 (mem_free=100G) cannot run in queue "gpu.q@gpu005" because job requests unknown resource (mem_free)
>> Job 2210897 (mem_free=100G) cannot run in queue "gpu.q@gpu006" because job requests unknown resource (mem_free)
>> ...
>>
>> Ilya.
>>
>>
>> -------- Original Message --------
>> Subject: Re: [gridengine users] Cannot request resource if it is a load
>> value of memory type: SGE reports it as unknown resource
>> From: Feng Zhang <prod.feng at gmail.com>
>> To: Ilya M <4ilya.m+grid at gmail.com>
>> Date: 1/23/15, 9:27 AM
>>> Ilya,
>>>
>>> Can you please run:
>>>
>>> qstat -j <jobid>
>>>
>>> and paste the output here? It may be useful for diagnosing the problem.
>>>
>>> On Fri, Jan 23, 2015 at 12:08 PM, Ilya M <4ilya.m+grid at gmail.com> wrote:
>>>> Removed the quota limits. To no avail: same problems.
>>>>
>>>>
>>>> -------- Original Message --------
>>>> Subject: Re: [gridengine users] Cannot request resource if it is a load
>>>> value of memory type: SGE reports it as unknown resource
>>>> From: Reuti <reuti at staff.uni-marburg.de>
>>>> To: Ilya M <4ilya.m+grid at gmail.com>
>>>> Date: 1/23/15, 2:33 AM
>>>>> Can you remove them temporarily? I have seen cases where the "unknown
>>>>> resource" error suddenly popped up and then suddenly vanished again. My
>>>>> conclusion was that it was somehow connected to RQS.
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>
>>>>>> On 23.01.2015 at 00:16, Ilya M <4ilya.m+grid at gmail.com> wrote:
>>>>>>
>>>>>> There are two RQS; one is disabled:
>>>>>>
>>>>>> {
>>>>>>      name         limit_for_interns
>>>>>>      description  "limit to max 5 GPU jobs per intern."
>>>>>>      enabled      TRUE
>>>>>>      limit        users {int1,int2} hosts @gpu to slots=5
>>>>>> }
>>>>>> {
>>>>>>      name         limit_slots
>>>>>>      description  NONE
>>>>>>      enabled      FALSE
>>>>>>      limit        hosts {@gpu} to slots=2
>>>>>> }
>>>>>>
>>>>>>
>>>>>> -------- Original Message --------
>>>>>> Subject: Re: [gridengine users] Cannot request resource if it is a load
>>>>>> value of memory type: SGE reports it as unknown resource
>>>>>> From: Reuti <reuti at staff.uni-marburg.de>
>>>>>> To: Ilya <4ilya.m+grid at gmail.com>
>>>>>> Date: 1/21/15, 16:12
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 22.01.2015 at 00:52, Ilya wrote:
>>>>>>>
>>>>>>>> Something happened to our SGE (6.2u5) installation, which had been
>>>>>>>> running fine for many months: users can no longer submit resource
>>>>>>>> requests for load values if they are of memory type, e.g.
>>>>>>>>
>>>>>>>> qsub -l mem_free=5G -w v .... produces the following output:
>>>>>>>>
>>>>>>>> cannot run in queue "gpu.q@gpu038" because job requests unknown resource (mem_free)
>>>>>>>>
>>>>>>>> The resource is available, though, when querying for it:
>>>>>>>> qhost -F mem_free -h gpu038
>>>>>>>> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
>>>>>>>> -------------------------------------------------------------------------------
>>>>>>>> global                  -               -     -       -       -       -       -
>>>>>>>> gpu038                  lx24-amd64     16  2.11  126.1G   15.7G    4.0G     0.0
>>>>>>>>       Host Resource(s):      hl:mem_free=110.416G
>>>>>>>>
>>>>>>>>
>>>>>>>> This was first reported by a user when he tried to request a custom
>>>>>>>> "hl" resource. However, it now appears that all "hl" resources of
>>>>>>>> type "memory" show this behavior. Integer "hl" resources are OK.
>>>>>>> Do you have any RQS in place?
>>>>>>>
>>>>>>> -- Reuti
>>>>>>>
>>>>>>>
>>>>>>>> I bounced qmaster between master and shadow-master a couple of times,
>>>>>>>> but it did not resolve the problem.
>>>>>>>>
>>>>>>>> Additionally, when I added MONITOR=1 to the scheduler's configuration,
>>>>>>>> the file $SGE_ROOT/$SGE_CELL/common/schedule contained only colons:
>>>>>>>> ::::::::
>>>>>>>> ::::::::
>>>>>>>> ::::::::
>>>>>>>>
>>>>>>>> Any ideas?
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> users mailing list
>>>>>>>> users at gridengine.org
>>>>>>>> https://gridengine.org/mailman/listinfo/users
>>>>
>>>
>>>
>
>



