[gridengine users] Jobs "adding up" resource reports (cpu, mem, io) from other jobs in the same node

Reuti reuti at staff.uni-marburg.de
Thu Jun 28 10:55:33 UTC 2012


On 28.06.2012 at 12:05, Txema Heredia Genestar wrote:

> Hello all,
> 
> We are currently using GE6.2u5 on our new cluster, and we have started noticing one issue that we have also seen before in older GE versions in other clusters:
> Sometimes, when there are a lot of jobs from the same user with similar job names and execution commands (I don't know which is the cause and which is the effect), the job usage (cpu, mem, io) goes nuts. Inside one execution node, some jobs report 0, and another one reports the sum of them all.
> 
> For example:
> Right now we have 300 jobs running, 25 execution hosts, 12 jobs each.
> 3 different users running jobs there: user1=248 jobs, user2=45 jobs, user3=7 jobs
> 
> In execution host compute-0-2 we find this:
> (user - reported maxvmem - reported time)
> user1 - 36.1M - 19:48:14   <- normal
> user1 - 34.2M - 19:48:47   <- normal
> user1 - 360.074M - 8:06:46:27   <- "steals" from other jobs
> user1 - N/A - 00:00:00   <- "victim" should be like the rest
> user1 - N/A - 00:00:00
> user1 - N/A - 00:00:00
> user1 - N/A - 00:00:00
> user1 - N/A - 00:00:00
> user1 - N/A - 00:00:00
> user1 - N/A - 00:00:00
> user1 - N/A - 00:00:00
> user1 - N/A - 00:00:00

How do you generate these lists?

-- Reuti


> We find similar scenarios in other nodes:
> compute-0-10:
> 5x user1 - ~35M - 19:49:00
> 1x user1 - 241.7M - 5:18:35:44
> 6x user1 - N/A - 00:00:00
> 
> compute-1-1:
> 1x user1 - 524.2M - 9:21:37:51
> 11x user1 - N/A - 00:00:00
> 
> compute-1-3:
> 2x user1 - ~40M - 19:49:00
> 1x user1 - 380.6M - 08:06:05:13
> 9x user1 - N/A - 00:00:00
> 
> compute-1-7:
> 10x user1 - ~47M - 19:49:00
> 1x user1 - 73.8M - 01:15:38:07
> 1x user1 - N/A - 00:00:00
> 
> compute-1-12:
> 6x user1 - ~40M - 19:49:00
> 1x user1 - 211.4M - 04:22:52:58
> 5x user1 - N/A - 00:00:00
> 
> 
> But now it gets weirder, as, in some cases, user1's jobs "steal" the usage from user3's jobs:
> 
> compute-0-1
> 6x user1 - ~40M - 19:49:00
> 2x user2 - 3.6Gb - 03:30:00
> 1x user1 - 519M - 21:30:03
> 3x user3 - N/A - 00:00:00  <- should be ~150Mb and ~00:30:00
> 
> 
> As user3's jobs are much newer, it seems that they only "misbehave" when one of user1's "victim" jobs finishes and they are scheduled to that same node. In fact, I have just seen it happen.
> 
> 
> This is pretty annoying, not just for reporting and planning, but because, when it happens, it screws up any kind of memory or cpu limit we try to set for the job. We have already killed jobs several times for excessive memory usage (h_vmem), only to find out later that they were just adding up other jobs' resources.
> 
> Is there any fix for that?
> 
> Thanks in advance,
> 
> Txema
> _______________________________________________
> users mailing list
> users at gridengine.org
> https://gridengine.org/mailman/listinfo/users
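
The "sum of them all" observation in the quoted message can be checked against the numbers it gives. The following is only a rough sketch: it assumes the reported times use a [days:]hh:mm:ss layout, that every user1 job had accumulated roughly 19:49:00 at the time of the listing (the approximate figure quoted for the normal jobs), and that the inflated job holds its own usage plus that of each N/A job on the same host. The names parse_duration, TYPICAL, HOSTS and sum_ratio are made up for this illustration and are not part of Grid Engine.

```python
def parse_duration(text):
    """Convert '[days:]hh:mm:ss' into seconds (layout is an assumption)."""
    parts = [int(p) for p in text.split(":")]
    parts = [0] * (4 - len(parts)) + parts  # pad a missing days field with 0
    days, hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

# Approximate usage of one "normal" user1 job, taken from the quoted lists.
TYPICAL = parse_duration("19:49:00")

# host -> (time reported by the inflated job, number of N/A jobs on that host),
# copied from the quoted message; only the hosts running user1 jobs alone.
HOSTS = {
    "compute-0-2": ("8:06:46:27", 9),
    "compute-0-10": ("5:18:35:44", 6),
    "compute-1-1": ("9:21:37:51", 11),
    "compute-1-3": ("08:06:05:13", 9),
    "compute-1-7": ("01:15:38:07", 1),
    "compute-1-12": ("04:22:52:58", 5),
}

def sum_ratio(reported, victims, typical=TYPICAL):
    """Reported time divided by the time expected if usage were summed."""
    expected = typical * (victims + 1)  # the job's own share plus the victims'
    return parse_duration(reported) / expected

ratios = {host: sum_ratio(rep, vic) for host, (rep, vic) in HOSTS.items()}
for host, ratio in ratios.items():
    print(f"{host}: reported/expected = {ratio:.3f}")
```

A ratio close to 1 on a host means the inflated figure matches the combined usage of that job and the N/A jobs next to it, which is what the message describes. The check says nothing about why the accounting is merged.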



