[gridengine users] Cannot request resource if it is a load value of memory type: SGE reports it as unknown resource

Ian Kaufman ikaufman at eng.ucsd.edu
Fri Jan 23 21:35:51 UTC 2015


So mem_free is requestable but not consumable, and there is no real
default set in the complex. Well, the default column is set to zero,
but I don't think that is treated as a default request.

Is that what was intended - requestable but not consumable?
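
For reference, the eight columns that qconf -sc prints are name,
shortcut, type, relop, requestable, consumable, default and urgency
(see complex(5)). Below is a minimal sketch that labels the fields of
the line you posted. The function name describe_complex is made up for
illustration, and it assumes the line has been rejoined after the mail
client wrapped it:

```shell
# describe_complex is a hypothetical helper, not part of SGE.
# It labels the eight whitespace-separated columns of one `qconf -sc` line.
describe_complex() {
    printf '%s\n' "$1" | awk '{
        printf "name=%s shortcut=%s type=%s relop=%s requestable=%s consumable=%s default=%s urgency=%s\n",
               $1, $2, $3, $4, $5, $6, $7, $8
    }'
}

# The mem_free line from this thread, rejoined onto one line:
describe_complex 'mem_free            mf                MEMORY      <= YES         NO 0        0'
# prints: name=mem_free shortcut=mf type=MEMORY relop=<= requestable=YES consumable=NO default=0 urgency=0
```

This only relabels what qconf already reported; it does not talk to the
qmaster, so it says nothing about why the scheduler rejects the request.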

Ian

On Fri, Jan 23, 2015 at 12:36 PM, Ilya M <4ilya.m+grid at gmail.com> wrote:
> Naturally, it does:
>
>> qconf -sc | grep mem_free
> mem_free            mf                MEMORY      <= YES         NO 0        0
>
> And it is reported on all nodes:
>
>> qhost -F mem_free -h gpu001
> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> -------------------------------------------------------------------------------
> global                  -               -     -       -       -       -       -
> gpu001                  lx24-amd64     16  3.32  126.1G   37.2G    4.0G     0.0
>     Host Resource(s):      hl:mem_free=88.885G
>
> And everything was working until a week ago.
>
> Ilya.
>
> -------- Original Message --------
> Subject: Re: [gridengine users] Cannot request resource if it is a load
> value of memory type: SGE reports it as unknown resource
> From: Ian Kaufman <ikaufman at eng.ucsd.edu>
> To: Ilya M <4ilya.m+grid at gmail.com>
> Date: 1/23/15, 11:38 AM
>>
>> Is mem_free defined in the host complex_values? What does
>>
>> qconf -sc | grep mem_free
>>
>> show? Is there a default value defined?
>>
>> Ian
>>
>> On Fri, Jan 23, 2015 at 11:30 AM, Ilya M <4ilya.m+grid at gmail.com> wrote:
>>>
>>> Because I am testing with qsub -w v, the job is not accepted for
>>> scheduling, no job id is generated, and qstat -j will not work. The
>>> output of qsub is as I showed in the original email:
>>>
>>> Job 2210897 (mem_free=100G) cannot run in queue "gpu.q@gpu001" because job requests unknown resource (mem_free)
>>> Job 2210897 (mem_free=100G) cannot run in queue "gpu.q@gpu002" because job requests unknown resource (mem_free)
>>> Job 2210897 (mem_free=100G) cannot run in queue "gpu.q@gpu003" because job requests unknown resource (mem_free)
>>> Job 2210897 (mem_free=100G) cannot run in queue "gpu.q@gpu004" because job requests unknown resource (mem_free)
>>> Job 2210897 (mem_free=100G) cannot run in queue "gpu.q@gpu005" because job requests unknown resource (mem_free)
>>> Job 2210897 (mem_free=100G) cannot run in queue "gpu.q@gpu006" because job requests unknown resource (mem_free)
>>> ...
>>>
>>> Ilya.
>>>
>>>
>>> -------- Original Message --------
>>> Subject: Re: [gridengine users] Cannot request resource if it is a load
>>> value of memory type: SGE reports it as unknown resource
>>> From: Feng Zhang <prod.feng at gmail.com>
>>> To: Ilya M <4ilya.m+grid at gmail.com>
>>> Date: 1/23/15, 9:27 AM
>>>>
>>>> Ilya,
>>>>
>>>> Can you please run:
>>>>
>>>> qstat -j <jobid>
>>>>
>>>> and paste the output here? It may be useful for diagnosing the problem.
>>>>
>>>> On Fri, Jan 23, 2015 at 12:08 PM, Ilya M <4ilya.m+grid at gmail.com> wrote:
>>>>>
>>>>> Removed the quota limits. To no avail: same problems.
>>>>>
>>>>>
>>>>> -------- Original Message --------
>>>>> Subject: Re: [gridengine users] Cannot request resource if it is a load
>>>>> value of memory type: SGE reports it as unknown resource
>>>>> From: Reuti <reuti at staff.uni-marburg.de>
>>>>> To: Ilya M <4ilya.m+grid at gmail.com>
>>>>> Date: 1/23/15, 2:33 AM
>>>>>>
>>>>>> Can you remove them temporarily? I have seen cases where the
>>>>>> "unknown resource" message suddenly popped up and just as suddenly
>>>>>> vanished again; my conclusion was that it was somehow connected
>>>>>> to RQS.
>>>>>>
>>>>>> -- Reuti
>>>>>>
>>>>>>
>>>>>>> On 23.01.2015 at 00:16, Ilya M <4ilya.m+grid at gmail.com> wrote:
>>>>>>>
>>>>>>> There are two RQS, one is disabled:
>>>>>>>
>>>>>>> {
>>>>>>>      name         limit_for_interns
>>>>>>>      description  "limit to max 5 GPU jobs per intern."
>>>>>>>      enabled      TRUE
>>>>>>>      limit        users {int1,int2} hosts @gpu to slots=5
>>>>>>> }
>>>>>>> {
>>>>>>>      name         limit_slots
>>>>>>>      description  NONE
>>>>>>>      enabled      FALSE
>>>>>>>      limit        hosts {@gpu} to slots=2
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> -------- Original Message --------
>>>>>>> Subject: Re: [gridengine users] Cannot request resource if it is a
>>>>>>> load
>>>>>>> value of memory type: SGE reports it as unknown resource
>>>>>>> From: Reuti <reuti at staff.uni-marburg.de>
>>>>>>> To: Ilya <4ilya.m+grid at gmail.com>
>>>>>>> Date: 1/21/15, 16:12
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 22.01.2015 at 00:52, Ilya wrote:
>>>>>>>>
>>>>>>>>> Something happened to our SGE (6.2u5) installation, which had
>>>>>>>>> been running fine for many months: users can no longer submit
>>>>>>>>> resource requests for load values if they are of memory type,
>>>>>>>>> e.g.
>>>>>>>>>
>>>>>>>>> qsub -l mem_free=5G -w v .... produces the following output:
>>>>>>>>>
>>>>>>>>> cannot run in queue "gpu.q@gpu038" because job requests unknown
>>>>>>>>> resource (mem_free)
>>>>>>>>>
>>>>>>>>> The resource is available, though, when querying for it:
>>>>>>>>> qhost -F mem_free -h gpu038
>>>>>>>>> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
>>>>>>>>> -------------------------------------------------------------------------------
>>>>>>>>> global                  -               -     -       -       -       -       -
>>>>>>>>> gpu038                  lx24-amd64     16  2.11  126.1G   15.7G    4.0G     0.0
>>>>>>>>>     Host Resource(s):      hl:mem_free=110.416G
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This was first reported by a user when he tried to request custom
>>>>>>>>> "hl"
>>>>>>>>> resource. However, it now appears that all "hl" resources of type
>>>>>>>>> "memory"
>>>>>>>>> show this behavior. Integer "hl" resources are OK.
>>>>>>>>
>>>>>>>> Do you have any RQS in place?
>>>>>>>>
>>>>>>>> -- Reuti
>>>>>>>>
>>>>>>>>
>>>>>>>>> I bounced qmaster between master and shadow-master a couple of
>>>>>>>>> times,
>>>>>>>>> but it did not resolve the problem.
>>>>>>>>>
>>>>>>>>> Additionally, when I added MONITOR=1 to scheduler's configuration,
>>>>>>>>> the
>>>>>>>>> file $SGE_ROOT/$SGE_CELL/common/schedule contains only colons:
>>>>>>>>> ::::::::
>>>>>>>>> ::::::::
>>>>>>>>> ::::::::
>>>>>>>>>
>>>>>>>>> Any ideas?
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> users mailing list
>>>>>>>>> users at gridengine.org
>>>>>>>>> https://gridengine.org/mailman/listinfo/users



-- 
Ian Kaufman
Research Systems Administrator
UC San Diego, Jacobs School of Engineering ikaufman AT ucsd DOT edu


