[gridengine users] Node with negative value of a consumable

Reuti reuti at staff.uni-marburg.de
Tue Jun 14 09:18:26 UTC 2011


On 14.06.2011 at 09:45, Javier Lopez Cacheiro wrote:

> Hi Reuti,
> 
> On 13/06/11 15:30, Reuti wrote:
>> Hi,
>> 
>> On 13.06.2011 at 15:12, Javier Lopez Cacheiro wrote:
>> 
>>> We have found a strange situation where GE 6.2u5 allocated more resources on a node than were available, leaving a consumable with a value below 0 (in this case the consumable is num_proc).
>>>
>>> This is somewhat similar to an issue found some time ago in SGE 6.2 (issue 2091), but that one was related to MPI jobs with the fillup allocation rule and was already fixed in 6.2u3.
>>>
>>> This case is different because it affects a non-MPI job rather than MPI jobs, and it occurs only under certain circumstances that are still not clear.
>>>
>>> At 06:13:57 the node already had 7 jobs running, consuming 24 units of num_proc. num_proc is configured as a consumable with a value of 24, so at that time the remaining value of num_proc was 0. But 4 seconds later, at 06:14:01, a new job requesting 24 num_proc was started on the node, leaving it with a value of -24 for num_proc.
>> num_proc is a (fixed) feature of a node and shouldn't be made consumable. Is there any reason why you don't use slots?
>> 
> num_proc is used for historical reasons; I'm not sure why slots was not chosen instead.
>
> In the other case where we found num_proc < 0, we also ran some tests using a new complex instead of num_proc, with the same results.
>
> In this case it is difficult to reproduce the problem using a new complex, because it has been an uncommon situation and it is not clear what circumstances led to it.
>
> For example, it is quite strange that all the jobs started on the node within a period shorter than 1 minute. The only warnings that appear in the node's log at that time are related to core binding:
> 
> 06/10/2011 06:12:35|  main|compute-5-11|W|Core binding: Couldn't determine core binding string for config file!
> 06/10/2011 06:13:01|  main|compute-5-11|W|Core binding: Couldn't determine core binding string for config file!
> 06/10/2011 06:13:30|  main|compute-5-11|W|Core binding: Couldn't determine core binding string for config file!
> 06/10/2011 06:13:41|  main|compute-5-11|W|Core binding: Couldn't determine core binding string for config file!
> 06/10/2011 06:13:54|  main|compute-5-11|W|Core binding: Couldn't determine core binding string for config file!
> 06/10/2011 06:13:57|  main|compute-5-11|W|Core binding: Couldn't determine core binding string for config file!
> 06/10/2011 06:14:01|  main|compute-5-11|W|Core binding: Couldn't determine core binding string for config file!

These can safely be ignored.


>> Nevertheless: do you request anything else with the -l option?
> Yes, several other complexes are also requested: h_fsize, s_vmem and s_rt

Then it looks like the issue I posted, although I referred more to limits.


> I cannot tell now whether the other consumable complexes (h_fsize

You made h_fsize consumable? It's a limit per process, and so the total amount can be bypassed by several processes of the same job anyway.
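
To illustrate the point, here is a minimal sketch in Python, assuming that h_fsize ends up as a per-process RLIMIT_FSIZE (as set with setrlimit()). The function name total_written and the file names are made up for this example. Each forked process gets its own copy of the limit and writes one file of exactly that size:

```python
import os
import resource
import tempfile


def total_written(limit_bytes, n_procs):
    """Fork n_procs children, each under the same RLIMIT_FSIZE, and
    return the total number of bytes they wrote (hypothetical helper)."""
    with tempfile.TemporaryDirectory() as workdir:
        pids = []
        for i in range(n_procs):
            pid = os.fork()
            if pid == 0:
                # Child: lower the soft limit, then write one file of exactly that size.
                status = 1
                try:
                    _, hard = resource.getrlimit(resource.RLIMIT_FSIZE)
                    resource.setrlimit(resource.RLIMIT_FSIZE, (limit_bytes, hard))
                    with open(os.path.join(workdir, "out.%d" % i), "wb") as fh:
                        fh.write(b"x" * limit_bytes)
                    status = 0
                finally:
                    # Leave the child without running the parent's cleanup code.
                    os._exit(status)
            pids.append(pid)
        for pid in pids:
            os.waitpid(pid, 0)
        # Sum the sizes of everything the children left on disk.
        return sum(os.path.getsize(os.path.join(workdir, name))
                   for name in os.listdir(workdir))
```

With a limit of 4096 bytes and three processes, no single process exceeds its limit, yet the job as a whole writes 3 * 4096 bytes. Booking h_fsize as a consumable therefore says little about the disk space a job can really take.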


> and s_vmem)

I don't think s_vmem needs to be consumable, as you made h_vmem consumable already. s_vmem tells SGE when to send the SIGXCPU warning.
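
As for whether the other consumables went negative too: actual disk and memory usage on the node won't tell you, since a consumable tracks what was requested, not what is used. The booked values can be read from `qhost -F` instead. A minimal sketch for scanning that output, assuming host lines start in the first column and consumables show up on indented lines as `hc:<name>=<value>` (or `gc:` for global ones); the function name is made up for this example:

```python
import re

# Assumed token format on the indented resource lines, e.g. "hc:num_proc=-24.000000".
# Only values that start with a minus sign are matched.
NEGATIVE_CONSUMABLE = re.compile(r"\b[gh]c:([\w.-]+)=(-\S+)")


def negative_consumables(qhost_output):
    """Return (host, complex, value) for every consumable booked below zero."""
    hits = []
    host = None
    for line in qhost_output.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Unindented line: the first field is the host name.
            host = line.split()[0]
            continue
        for name, value in NEGATIVE_CONSUMABLE.findall(line):
            hits.append((host, name, value))
    return hits
```

Feed it the text of `qhost -F` (optionally restricted with `-h <host>`), for example from a cron job, and it lists every host where a consumable was overbooked.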

-- Reuti


> also had negative values, but I guess not, because disk and memory consumption on the node was far below the available resources.
> 
> Cheers,
> Javier
>> -- Reuti
>> 
>> 
>>> I don't know if anyone else has come across this same problem with 6.2u5 and if there is a workaround for it.
>>> 
>>> [jlopez at svgd ~]$ qhost -q -j -h c5-11
>>> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
>>> -------------------------------------------------------------------------------
>>> global                  -               -     -       -       -       -       -
>>> compute-5-11            x86_64        -24 47.92   31.5G    9.0G    8.0G     0.0
>>>    GRID_large           BP    0/4/24
>>>       6667492  1.92242 STDIN      compchem015  r 06/10/2011 06:13:30 MASTER
>>>       6667493  1.92241 STDIN      compchem015  r 06/10/2011 06:13:41 MASTER
>>>       6667494  1.92241 STDIN      compchem015  r 06/10/2011 06:13:47 MASTER
>>>       6667495  1.92241 STDIN      compchem015  r 06/10/2011 06:13:57 MASTER
>>>    GRID_small           BP    0/0/24
>>>    small                BPC   0/10/24
>>>       6652641 11.27961 p1761-7    csebdmfa     r 06/10/2011 06:14:01 MASTER
>>>       6655259 10.43999 p577-16    csebdmfa     r 06/10/2011 06:12:26 MASTER
>>>       6667942  3.93900 AuLJ139    csmyslfs     r 06/10/2011 06:12:46 MASTER
>>>                                                                      SLAVE
>>>                                                                      SLAVE
>>>                                                                      SLAVE
>>>                                                                      SLAVE
>>>                                                                      SLAVE
>>>                                                                      SLAVE
>>>                                                                      SLAVE
>>>                                                                      SLAVE
>>>    g0-mem_small         BPC   0/0/24
>>>    offline              BP    0/0/24
>>> 
>>> 
>>> Thanks in advance,
>>> Javier
>>> 
>>> _______________________________________________
>>> users mailing list
>>> users at gridengine.org
>>> https://gridengine.org/mailman/listinfo/users