[gridengine users] Cannot request resource if it is a load value of memory type: SGE reports it as unknown resource

Ian Kaufman ikaufman at eng.ucsd.edu
Fri Jan 23 19:38:19 UTC 2015


Is mem_free defined in the host complex_values? What does

qconf -sc | grep mem_free

show? Is there a default value defined?
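
A minimal sketch of how that output could be read, assuming the usual
`qconf -sc` column order (name, shortcut, type, relop, requestable,
consumable, default, urgency). `describe_complex` is a hypothetical helper
name, not part of SGE:

```shell
# Sketch only: summarize one attribute from `qconf -sc` output read on stdin.
# Assumed columns: name shortcut type relop requestable consumable default urgency
describe_complex() {
    # $1 = attribute name to look up (e.g. mem_free)
    awk -v attr="$1" '
        $1 == attr {
            printf "%s: type=%s requestable=%s consumable=%s default=%s\n", $1, $3, $5, $6, $7
            found = 1
        }
        END {
            # Exit non-zero when the attribute never appeared in the list
            if (!found) {
                printf "%s: not defined in the complex list\n", attr
                exit 1
            }
        }
    '
}

# On a live cluster (assumes qconf is in PATH):
#   qconf -sc | describe_complex mem_free
#   qconf -se gpu038 | grep complex_values
```

The second commented command prints the complex_values line of one execution
host, which covers the first question above.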

Ian

On Fri, Jan 23, 2015 at 11:30 AM, Ilya M <4ilya.m+grid at gmail.com> wrote:
> Because I am testing with qsub -w v, the job is not accepted for
> scheduling, no job ID is generated, and qstat -j will not work. The output
> of qsub is as I showed in the original email:
>
> Job 2210897 (mem_free=100G) cannot run in queue "gpu.q@gpu001" because job
> requests unknown resource (mem_free)
> Job 2210897 (mem_free=100G) cannot run in queue "gpu.q@gpu002" because job
> requests unknown resource (mem_free)
> Job 2210897 (mem_free=100G) cannot run in queue "gpu.q@gpu003" because job
> requests unknown resource (mem_free)
> Job 2210897 (mem_free=100G) cannot run in queue "gpu.q@gpu004" because job
> requests unknown resource (mem_free)
> Job 2210897 (mem_free=100G) cannot run in queue "gpu.q@gpu005" because job
> requests unknown resource (mem_free)
> Job 2210897 (mem_free=100G) cannot run in queue "gpu.q@gpu006" because job
> requests unknown resource (mem_free)
> ...
>
> Ilya.
>
>
> -------- Original Message --------
> Subject: Re: [gridengine users] Cannot request resource if it is a load
> value of memory type: SGE reports it as unknown resource
> From: Feng Zhang <prod.feng at gmail.com>
> To: Ilya M <4ilya.m+grid at gmail.com>
> Date: 1/23/15, 9:27 AM
>>
>> Ilya,
>>
>> Can you please run:
>>
>> qstat -j <jobid>
>>
>> and paste the output here? It may be useful for diagnosing the problem.
>>
>> On Fri, Jan 23, 2015 at 12:08 PM, Ilya M <4ilya.m+grid at gmail.com> wrote:
>>>
>>> Removed the quota limits. To no avail: same problems.
>>>
>>>
>>> -------- Original Message --------
>>> Subject: Re: [gridengine users] Cannot request resource if it is a load
>>> value of memory type: SGE reports it as unknown resource
>>> From: Reuti <reuti at staff.uni-marburg.de>
>>> To: Ilya M <4ilya.m+grid at gmail.com>
>>> Date: 1/23/15, 2:33 AM
>>>>
>>>> Can you remove them temporarily? I have seen cases where the "unknown
>>>> resource" error suddenly popped up and just as suddenly vanished again;
>>>> my conclusion was that it was somehow connected to RQS.
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>> On 23.01.2015 at 00:16, Ilya M <4ilya.m+grid at gmail.com> wrote:
>>>>>
>>>>> There are two RQS, one is disabled:
>>>>>
>>>>> {
>>>>>     name         limit_for_interns
>>>>>     description  "limit to max 5 GPU jobs per intern."
>>>>>     enabled      TRUE
>>>>>     limit        users {int1,int2} hosts @gpu to slots=5
>>>>> }
>>>>> {
>>>>>     name         limit_slots
>>>>>     description  NONE
>>>>>     enabled      FALSE
>>>>>     limit        hosts {@gpu} to slots=2
>>>>> }
>>>>>
>>>>>
>>>>> -------- Original Message --------
>>>>> Subject: Re: [gridengine users] Cannot request resource if it is a load
>>>>> value of memory type: SGE reports it as unknown resource
>>>>> From: Reuti <reuti at staff.uni-marburg.de>
>>>>> To: Ilya <4ilya.m+grid at gmail.com>
>>>>> Date: 1/21/15, 16:12
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 22.01.2015 at 00:52, Ilya wrote:
>>>>>>
>>>>>>> Something happened to the SGE (6.2u5) installation that had been
>>>>>>> running fine for many months: users can no longer submit resource
>>>>>>> requests for load values of memory type, e.g.
>>>>>>>
>>>>>>> qsub -l mem_free=5G -w v .... produces the following output:
>>>>>>>
>>>>>>> cannot run in queue "gpu.q@gpu038" because job requests unknown
>>>>>>> resource (mem_free)
>>>>>>>
>>>>>>> The resource is available, though, when querying for it:
>>>>>>> qhost -F mem_free -h gpu038
>>>>>>> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
>>>>>>> -------------------------------------------------------------------------------
>>>>>>> global                  -               -     -       -       -       -       -
>>>>>>> gpu038                  lx24-amd64     16  2.11  126.1G   15.7G    4.0G     0.0
>>>>>>>     Host Resource(s):      hl:mem_free=110.416G
>>>>>>>
>>>>>>>
>>>>>>> This was first reported by a user when he tried to request a custom "hl"
>>>>>>> resource. However, it now appears that all "hl" resources of type "memory"
>>>>>>> show this behavior. Integer "hl" resources are OK.
>>>>>>
>>>>>> Do you have any RQS in place?
>>>>>>
>>>>>> -- Reuti
>>>>>>
>>>>>>
>>>>>>> I bounced qmaster between master and shadow-master a couple of times,
>>>>>>> but it did not resolve the problem.
>>>>>>>
>>>>>>> Additionally, after I added MONITOR=1 to the scheduler's configuration,
>>>>>>> the file $SGE_ROOT/$SGE_CELL/common/schedule contains only colons:
>>>>>>> ::::::::
>>>>>>> ::::::::
>>>>>>> ::::::::
>>>>>>>
>>>>>>> Any ideas?
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> users at gridengine.org
>>>>>>> https://gridengine.org/mailman/listinfo/users



-- 
Ian Kaufman
Research Systems Administrator
UC San Diego, Jacobs School of Engineering ikaufman AT ucsd DOT edu


