[gridengine users] Node with negative value of a consumable

Reuti reuti at staff.uni-marburg.de
Mon Jun 13 13:30:32 UTC 2011


Hi,

On 13.06.2011 at 15:12, Javier Lopez Cacheiro wrote:

> We have found a strange situation where GE 6.2u5 has allocated more resources in a node than available, leaving a consumable with a value lower than 0 (in this case the consumable is num_proc).
> 
> This is somehow similar to an issue that was found some time ago in SGE 6.2 (issue 2091) but in that case it was related to mpi jobs with fillup allocation rule, and it was already solved in 6.2u3.
> 
> Now this is somehow different because it is not affecting mpi jobs but a non-mpi job and it is occurring only in certain circumstances that are still not clear.
> 
> In this case the situation was that at 06:13:57 the node already had 7 jobs running, consuming 24 units of num_proc. num_proc is configured as a consumable with a value of 24, so at that time the remaining value of num_proc was 0. But 4 seconds later, at 06:14:01, a new job that requested 24 num_proc was started on the node, leaving the node with a value of -24 for num_proc.

num_proc is a (fixed) feature of a node and shouldn't be made consumable. Is there any reason why you don't use slots?

Nevertheless: do you request anything else with the -l option?
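
To spell out the bookkeeping behind the numbers quoted above, here is a minimal Python sketch. It is not Grid Engine code: the function names `remaining` and `can_start` are made up for illustration, and it only models a single host with a single consumable, using the totals from the report (capacity 24, 24 already consumed, a new request of 24).

```python
def remaining(capacity, granted):
    """Value left on a consumable after subtracting every granted request."""
    return capacity - sum(granted)


def can_start(capacity, granted, request):
    """A consumable should never be booked below zero (hypothetical check)."""
    return remaining(capacity, granted) - request >= 0


capacity = 24       # num_proc as configured on the node (from the report)
running = [24]      # total consumed by the jobs running at 06:13:57
new_request = 24    # num_proc requested by the job started at 06:14:01

print(remaining(capacity, running))                  # 0
print(can_start(capacity, running, new_request))     # False
print(remaining(capacity, running + [new_request]))  # -24
```

With these totals the check fails, so a new job should not have been dispatched; the -24 only appears if the request is booked anyway.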

-- Reuti


> I don't know if anyone else has come across this same problem with 6.2u5 and if there is a workaround for it.
> 
> [jlopez at svgd ~]$ qhost -q -j -h c5-11
> HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS
> -------------------------------------------------------------------------------
> global - - - - - - -
> compute-5-11 x86_64 -24 47.92 31.5G 9.0G 8.0G 0.0
> GRID_large BP 0/4/24
> 6667492 1.92242 STDIN compchem015 r 06/10/2011 06:13:30 MASTER
> 6667493 1.92241 STDIN compchem015 r 06/10/2011 06:13:41 MASTER
> 6667494 1.92241 STDIN compchem015 r 06/10/2011 06:13:47 MASTER
> 6667495 1.92241 STDIN compchem015 r 06/10/2011 06:13:57 MASTER
> GRID_small BP 0/0/24
> small BPC 0/10/24
> 6652641 11.27961 p1761-7 csebdmfa r 06/10/2011 06:14:01 MASTER
> 6655259 10.43999 p577-16 csebdmfa r 06/10/2011 06:12:26 MASTER
> 6667942 3.93900 AuLJ139 csmyslfs r 06/10/2011 06:12:46 MASTER
> SLAVE
> SLAVE
> SLAVE
> SLAVE
> SLAVE
> SLAVE
> SLAVE
> SLAVE
> g0-mem_small BPC 0/0/24
> offline BP 0/0/24
> 
> 
> Thanks in advance,
> Javier




