[gridengine users] Cannot request resource if it is a load value of memory type: SGE reports it as unknown resource

Ilya M 4ilya.m+grid at gmail.com
Tue Jan 27 00:04:54 UTC 2015


Yes, it does list the nodes OK:

 >qhost -F mem_free  -l mem_free=80g
gpu001        lx24-amd64     16  3.24  126.1G   24.9G    4.0G 0.0
     Host Resource(s):      hl:mem_free=101.250G
gpu002        lx24-amd64     16  2.21  126.1G   27.4G    4.0G 0.0
     Host Resource(s):      hl:mem_free=98.770G
gpu003        lx24-amd64     16  3.37  126.1G   42.4G    4.0G 0.0
     Host Resource(s):      hl:mem_free=83.691G
gpu005        lx24-amd64     16  3.41  126.1G   43.7G    4.0G 4.0K
     Host Resource(s):      hl:mem_free=82.429G
gpu007        lx24-amd64     16  2.51  126.1G   21.1G    4.0G 0.0
     Host Resource(s):      hl:mem_free=105.019G
gpu008        lx24-amd64     16  4.03  126.1G   19.9G    4.0G 0.0
     Host Resource(s):      hl:mem_free=106.240G

....
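
For cross-checking a listing like the one above against a threshold independently of the scheduler's own "-l" matching, here is a minimal sketch. The helper name hosts_with_mem_free is hypothetical (it is not part of SGE), and it assumes that values are reported with a G suffix and that resource lines are indented with spaces, as in the output shown here:

```shell
#!/bin/sh
# Hypothetical helper, not an SGE tool: reads `qhost -F mem_free` output on
# stdin and prints every host whose hl:mem_free is at least $1 gigabytes.
# Assumption: values end in G; lines with other units (M, K) are skipped.
# Usage: qhost -F mem_free | hosts_with_mem_free 80
hosts_with_mem_free() {
    awk -v min="$1" '
        # Host lines start in column one; remember the hostname.
        /^[^ ]/ { host = $1; next }
        # Indented resource lines carry the load value for that host.
        /hl:mem_free=/ {
            val = $0
            sub(/.*hl:mem_free=/, "", val)
            if (val ~ /G$/) {
                sub(/G$/, "", val)
                if (val + 0 >= min + 0) print host
            }
        }
    '
}
```

If this prints hosts that qsub -w v still rejects with "unknown resource", the load values themselves are being reported and the problem is more likely in how the request is validated.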

Ilya.


-------- Original Message --------
Subject: Re: [gridengine users] Cannot request resource if it is a load 
value of memory type: SGE reports it as unknown resource
From: Feng Zhang <prod.feng at gmail.com>
To: Ian Kaufman <ikaufman at eng.ucsd.edu>
Date: 1/23/15, 1:43 PM
> It looks normal to me.
>
> Can you run:
>
> qhost -F mem_free  -l mem_free=80g
>
> to see if it can list the nodes properly?
>
> On Fri, Jan 23, 2015 at 4:35 PM, Ian Kaufman <ikaufman at eng.ucsd.edu> wrote:
>> So it is requestable, but not consumable, and there is no default set
>> in the complex. Well, the default is set to zero, but I don't think
>> that is treated as a default.
>>
>> Is that what was intended - requestable but not consumable?
>>
>> Ian
>>
>> On Fri, Jan 23, 2015 at 12:36 PM, Ilya M <4ilya.m+grid at gmail.com> wrote:
>>> Naturally, it does:
>>>
>>>> qconf -sc | grep mem_free
>>> mem_free            mf         MEMORY      <=    YES         NO         0        0
>>>
>>> And it is reported on all nodes:
>>>
>>>> qhost -F mem_free -h gpu001
>>> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
>>> -------------------------------------------------------------------------------
>>> global                  -               -     -       -       -       -       -
>>> gpu001                  lx24-amd64     16  3.32  126.1G   37.2G    4.0G     0.0
>>>      Host Resource(s):      hl:mem_free=88.885G
>>>
>>> And everything was working until a week ago.
>>>
>>> Ilya.
>>>
>>> -------- Original Message --------
>>> Subject: Re: [gridengine users] Cannot request resource if it is a load
>>> value of memory type: SGE reports it as unknown resource
>>> From: Ian Kaufman <ikaufman at eng.ucsd.edu>
>>> To: Ilya M <4ilya.m+grid at gmail.com>
>>> Date: 1/23/15, 11:38 AM
>>>> Is mem_free defined in the host complex_values? What does
>>>>
>>>> qconf -sc | grep mem_free
>>>>
>>>> show? Is there a default value defined?
>>>>
>>>> Ian
>>>>
>>>> On Fri, Jan 23, 2015 at 11:30 AM, Ilya M <4ilya.m+grid at gmail.com> wrote:
>>>>> Because I am testing with qsub -w v, the job is not accepted for
>>>>> scheduling, no job id is generated, and qstat -j will not work. The
>>>>> output of qsub is as I showed in the original email:
>>>>>
>>>>> Job 2210897 (mem_free=100G) cannot run in queue "gpu.q@gpu001" because job requests unknown resource (mem_free)
>>>>> Job 2210897 (mem_free=100G) cannot run in queue "gpu.q@gpu002" because job requests unknown resource (mem_free)
>>>>> Job 2210897 (mem_free=100G) cannot run in queue "gpu.q@gpu003" because job requests unknown resource (mem_free)
>>>>> Job 2210897 (mem_free=100G) cannot run in queue "gpu.q@gpu004" because job requests unknown resource (mem_free)
>>>>> Job 2210897 (mem_free=100G) cannot run in queue "gpu.q@gpu005" because job requests unknown resource (mem_free)
>>>>> Job 2210897 (mem_free=100G) cannot run in queue "gpu.q@gpu006" because job requests unknown resource (mem_free)
>>>>> ...
>>>>>
>>>>> Ilya.
>>>>>
>>>>>
>>>>> -------- Original Message --------
>>>>> Subject: Re: [gridengine users] Cannot request resource if it is a load
>>>>> value of memory type: SGE reports it as unknown resource
>>>>> From: Feng Zhang <prod.feng at gmail.com>
>>>>> To: Ilya M <4ilya.m+grid at gmail.com>
>>>>> Date: 1/23/15, 9:27 AM
>>>>>> Ilya,
>>>>>>
>>>>>> Can you please run:
>>>>>>
>>>>>> qstat -j <jobid>
>>>>>>
>>>>>> and paste the output here? It may be useful for diagnosing the problem.
>>>>>>
>>>>>> On Fri, Jan 23, 2015 at 12:08 PM, Ilya M <4ilya.m+grid at gmail.com> wrote:
>>>>>>> Removed the quota limits. To no avail: same problems.
>>>>>>>
>>>>>>>
>>>>>>> -------- Original Message --------
>>>>>>> Subject: Re: [gridengine users] Cannot request resource if it is a load
>>>>>>> value of memory type: SGE reports it as unknown resource
>>>>>>> From: Reuti <reuti at staff.uni-marburg.de>
>>>>>>> To: Ilya M <4ilya.m+grid at gmail.com>
>>>>>>> Date: 1/23/15, 2:33 AM
>>>>>>>> Can you remove them temporarily? I have seen cases where the "unknown
>>>>>>>> resource" error suddenly popped up and just as suddenly vanished
>>>>>>>> again; my conclusion was that it was somehow connected to RQS.
>>>>>>>>
>>>>>>>> -- Reuti
>>>>>>>>
>>>>>>>>
>>>>>>>>> Am 23.01.2015 um 00:16 schrieb Ilya M <4ilya.m+grid at gmail.com>:
>>>>>>>>>
>>>>>>>>> There are two RQS, one is disabled:
>>>>>>>>>
>>>>>>>>> {
>>>>>>>>>       name         limit_for_interns
>>>>>>>>>       description  "limit to max 5 GPU jobs per intern."
>>>>>>>>>       enabled      TRUE
>>>>>>>>>       limit        users {int1,int2} hosts @gpu to slots=5
>>>>>>>>> }
>>>>>>>>> {
>>>>>>>>>       name         limit_slots
>>>>>>>>>       description  NONE
>>>>>>>>>       enabled      FALSE
>>>>>>>>>       limit        hosts {@gpu} to slots=2
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -------- Original Message --------
>>>>>>>>> Subject: Re: [gridengine users] Cannot request resource if it is a
>>>>>>>>> load
>>>>>>>>> value of memory type: SGE reports it as unknown resource
>>>>>>>>> From: Reuti <reuti at staff.uni-marburg.de>
>>>>>>>>> To: Ilya <4ilya.m+grid at gmail.com>
>>>>>>>>> Date: 1/21/15, 16:12
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Am 22.01.2015 um 00:52 schrieb Ilya:
>>>>>>>>>>
>>>>>>>>>>> Something happened to the SGE (6.2u5) installation, which had been
>>>>>>>>>>> running fine for many months: users can no longer request load
>>>>>>>>>>> values if they are of memory type, e.g.
>>>>>>>>>>>
>>>>>>>>>>> qsub -l mem_free=5G -w v .... produces the following output:
>>>>>>>>>>>
>>>>>>>>>>> cannot run in queue "gpu.q@gpu038" because job requests unknown resource (mem_free)
>>>>>>>>>>>
>>>>>>>>>>> The resource is available, though, when querying for it:
>>>>>>>>>>> qhost -F mem_free -h gpu038
>>>>>>>>>>> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
>>>>>>>>>>> -------------------------------------------------------------------------------
>>>>>>>>>>> global                  -               -     -       -       -       -       -
>>>>>>>>>>> gpu038                  lx24-amd64     16  2.11  126.1G   15.7G    4.0G     0.0
>>>>>>>>>>>      Host Resource(s):      hl:mem_free=110.416G
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> This was first reported by a user who tried to request a custom "hl"
>>>>>>>>>>> resource. However, it now appears that all "hl" resources of type
>>>>>>>>>>> "memory" show this behavior. Integer "hl" resources are OK.
>>>>>>>>>> Do you have any RQS in place?
>>>>>>>>>>
>>>>>>>>>> -- Reuti
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> I bounced qmaster between master and shadow master a couple of times,
>>>>>>>>>>> but that did not resolve the problem.
>>>>>>>>>>>
>>>>>>>>>>> Additionally, after I added MONITOR=1 to the scheduler's configuration,
>>>>>>>>>>> the file $SGE_ROOT/$SGE_CELL/common/schedule contains only colons:
>>>>>>>>>>> ::::::::
>>>>>>>>>>> ::::::::
>>>>>>>>>>> ::::::::
>>>>>>>>>>>
>>>>>>>>>>> Any ideas?
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> users mailing list
>>>>>>>>>>> users at gridengine.org
>>>>>>>>>>> https://gridengine.org/mailman/listinfo/users
>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>
>>
>> --
>> Ian Kaufman
>> Research Systems Administrator
>> UC San Diego, Jacobs School of Engineering ikaufman AT ucsd DOT edu
>
>



