[gridengine users] Cannot request resource if it is a load value of memory type: SGE reports it as unknown resource

Feng Zhang prod.feng at gmail.com
Fri Jan 23 21:43:57 UTC 2015


It looks normal to me.

Can you run:

qhost -F mem_free -l mem_free=80g

to see if it can list the nodes properly?
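
For anyone reading the `qconf -sc` line quoted below, here is a minimal sketch that labels its columns. It assumes the stock column order (name, shortcut, type, relop, requestable, consumable, default, urgency). `describe_complex` is a hypothetical helper name, not an SGE command, and the sample input is the mem_free line from this thread with the wrapped "0 0" rejoined.

```shell
#!/bin/sh
# describe_complex: hypothetical helper (not part of SGE).
# Labels the whitespace-separated columns of ONE `qconf -sc` line, assuming
# the stock order: name shortcut type relop requestable consumable default urgency.
describe_complex() {
    printf '%s\n' "$1" | awk '{
        printf "name=%s type=%s relop=%s requestable=%s consumable=%s default=%s urgency=%s\n",
               $1, $3, $4, $5, $6, $7, $8
    }'
}

# Sample input: the mem_free line quoted in this thread (wrapped fields rejoined).
describe_complex 'mem_free            mf                MEMORY      <= YES         NO 0        0'
```

On a live system the same helper could presumably be fed from `qconf -sc | grep mem_free`, but that pipeline is untested here.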

On Fri, Jan 23, 2015 at 4:35 PM, Ian Kaufman <ikaufman at eng.ucsd.edu> wrote:
> So it is requestable, but not consumable, and there is no default set
> in the complex. Well, the default is set to zero, but I don't think
> that is treated as a default.
>
> Is that what was intended - requestable but not consumable?
>
> Ian
>
> On Fri, Jan 23, 2015 at 12:36 PM, Ilya M <4ilya.m+grid at gmail.com> wrote:
>> Naturally, it does:
>>
>>> qconf -sc | grep mem_free
>> mem_free            mf                MEMORY      <=    YES         NO         0        0
>>
>> And it is reported on all nodes:
>>
>>> qhost -F mem_free -h gpu001
>> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
>> -------------------------------------------------------------------------------
>> global                  -               -     -       -       -       -       -
>> gpu001                  lx24-amd64     16  3.32  126.1G   37.2G    4.0G     0.0
>>     Host Resource(s):      hl:mem_free=88.885G
>>
>> And everything was working until a week ago.
>>
>> Ilya.
>>
>> -------- Original Message --------
>> Subject: Re: [gridengine users] Cannot request resource if it is a load
>> value of memory type: SGE reports it as unknown resource
>> From: Ian Kaufman <ikaufman at eng.ucsd.edu>
>> To: Ilya M <4ilya.m+grid at gmail.com>
>> Date: 1/23/15, 11:38 AM
>>>
>>> Is mem_free defined in the host complex_values? What does
>>>
>>> qconf -sc | grep mem_free
>>>
>>> show? Is there a default value defined?
>>>
>>> Ian
>>>
>>> On Fri, Jan 23, 2015 at 11:30 AM, Ilya M <4ilya.m+grid at gmail.com> wrote:
>>>>
>>>> Because I am testing with qsub -w v, the job is not accepted for
>>>> scheduling, no job id is generated, and qstat -j will not work. The
>>>> output of qsub is as I showed in the original email:
>>>>
>>>> Job 2210897 (mem_free=100G) cannot run in queue "gpu.q@gpu001" because job requests unknown resource (mem_free)
>>>> Job 2210897 (mem_free=100G) cannot run in queue "gpu.q@gpu002" because job requests unknown resource (mem_free)
>>>> Job 2210897 (mem_free=100G) cannot run in queue "gpu.q@gpu003" because job requests unknown resource (mem_free)
>>>> Job 2210897 (mem_free=100G) cannot run in queue "gpu.q@gpu004" because job requests unknown resource (mem_free)
>>>> Job 2210897 (mem_free=100G) cannot run in queue "gpu.q@gpu005" because job requests unknown resource (mem_free)
>>>> Job 2210897 (mem_free=100G) cannot run in queue "gpu.q@gpu006" because job requests unknown resource (mem_free)
>>>> ...
>>>>
>>>> Ilya.
>>>>
>>>>
>>>> -------- Original Message --------
>>>> Subject: Re: [gridengine users] Cannot request resource if it is a load
>>>> value of memory type: SGE reports it as unknown resource
>>>> From: Feng Zhang <prod.feng at gmail.com>
>>>> To: Ilya M <4ilya.m+grid at gmail.com>
>>>> Date: 1/23/15, 9:27 AM
>>>>>
>>>>> Ilya,
>>>>>
>>>>> Can you please run:
>>>>>
>>>>> qstat -j <jobid>
>>>>>
>>>>> and paste the output here? It may be useful for checking the problem.
>>>>>
>>>>> On Fri, Jan 23, 2015 at 12:08 PM, Ilya M <4ilya.m+grid at gmail.com> wrote:
>>>>>>
>>>>>> Removed the quota limits. To no avail: same problems.
>>>>>>
>>>>>>
>>>>>> -------- Original Message --------
>>>>>> Subject: Re: [gridengine users] Cannot request resource if it is a load
>>>>>> value of memory type: SGE reports it as unknown resource
>>>>>> From: Reuti <reuti at staff.uni-marburg.de>
>>>>>> To: Ilya M <4ilya.m+grid at gmail.com>
>>>>>> Date: 1/23/15, 2:33 AM
>>>>>>>
>>>>>>> Can you remove them temporarily? I have seen cases where the
>>>>>>> "unknown resource" error suddenly popped up and just as suddenly
>>>>>>> vanished again; my conclusion was that it was somehow connected
>>>>>>> to RQS.
>>>>>>>
>>>>>>> -- Reuti
>>>>>>>
>>>>>>>
>>>>>>>> On 23.01.2015 at 00:16, Ilya M <4ilya.m+grid at gmail.com> wrote:
>>>>>>>>
>>>>>>>> There are two RQS, one is disabled:
>>>>>>>>
>>>>>>>> {
>>>>>>>>      name         limit_for_interns
>>>>>>>>      description  "limit to max 5 GPU jobs per intern."
>>>>>>>>      enabled      TRUE
>>>>>>>>      limit        users {int1,int2} hosts @gpu to slots=5
>>>>>>>> }
>>>>>>>> {
>>>>>>>>      name         limit_slots
>>>>>>>>      description  NONE
>>>>>>>>      enabled      FALSE
>>>>>>>>      limit        hosts {@gpu} to slots=2
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> -------- Original Message --------
>>>>>>>> Subject: Re: [gridengine users] Cannot request resource if it is a
>>>>>>>> load
>>>>>>>> value of memory type: SGE reports it as unknown resource
>>>>>>>> From: Reuti <reuti at staff.uni-marburg.de>
>>>>>>>> To: Ilya <4ilya.m+grid at gmail.com>
>>>>>>>> Date: 1/21/15, 16:12
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> On 22.01.2015 at 00:52, Ilya wrote:
>>>>>>>>>
>>>>>>>>>> Something happened to our SGE (6.2u5) installation, which had
>>>>>>>>>> been running fine for many months: users can no longer submit
>>>>>>>>>> resource requests for load values of memory type, e.g.
>>>>>>>>>>
>>>>>>>>>> qsub -l mem_free=5G -w v .... produces the following output:
>>>>>>>>>>
>>>>>>>>>> cannot run in queue "gpu.q@gpu038" because job requests unknown resource (mem_free)
>>>>>>>>>>
>>>>>>>>>> The resource is available, though, when querying for it:
>>>>>>>>>> qhost -F mem_free -h gpu038
>>>>>>>>>> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
>>>>>>>>>> -------------------------------------------------------------------------------
>>>>>>>>>> global                  -               -     -       -       -       -       -
>>>>>>>>>> gpu038                  lx24-amd64     16  2.11  126.1G   15.7G    4.0G     0.0
>>>>>>>>>>       Host Resource(s):      hl:mem_free=110.416G
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This was first reported by a user who tried to request a custom
>>>>>>>>>> "hl" resource. However, it now appears that all "hl" resources
>>>>>>>>>> of type "memory" show this behavior. Integer "hl" resources are OK.
>>>>>>>>>
>>>>>>>>> Do you have any RQS in place?
>>>>>>>>>
>>>>>>>>> -- Reuti
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> I bounced qmaster between master and shadow-master a couple of
>>>>>>>>>> times, but it did not resolve the problem.
>>>>>>>>>>
>>>>>>>>>> Additionally, when I added MONITOR=1 to the scheduler's
>>>>>>>>>> configuration, the file $SGE_ROOT/$SGE_CELL/common/schedule
>>>>>>>>>> contains only colons:
>>>>>>>>>> ::::::::
>>>>>>>>>> ::::::::
>>>>>>>>>> ::::::::
>>>>>>>>>>
>>>>>>>>>> Any ideas?
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> users mailing list
>>>>>>>>>> users at gridengine.org
>>>>>>>>>> https://gridengine.org/mailman/listinfo/users
>
>
>
> --
> Ian Kaufman
> Research Systems Administrator
> UC San Diego, Jacobs School of Engineering ikaufman AT ucsd DOT edu



-- 
Best,

Feng


