[gridengine users] qmaster killing "not supposed to be there" jobs

Chi Chan chichan.gridengine at yahoo.com.tw
Mon Nov 7 06:28:06 UTC 2011


I have had the qmaster killed by Linux's OOM (Out Of Memory) killer on
an SGE 6.2u5 cluster that uses classic spooling, and I did not see the
problem you are describing.

Also, SGE 6.2u5 has been out for ~2 years, yet I have not heard of users
hitting this problem. I have not seen it reported on the other mailing
lists or on the original Sun Grid Engine list (I joined the Sun list
with my original gmail account).


Each year my qmaster machine goes down a few times, because the
users run jobs on the front-end (they always do that), and sometimes
the server gets unplugged by mistake. So I think it has gone down 3
or 4 times since SGE 6.2u5 was installed, but I don't think I have
lost any jobs because of this.


If others using 8.0 are seeing it too, maybe it is a regression
introduced by one of the 8.0 changes.
 

--Chi



----- Original Message ----

From: Paul Brunk <pbrunk at uga.edu>

15:10:57|worker|rcluster|E|execd at compute-13-25.local
reports running job (84285.1/master) in queue
"biof-30d at compute-13-25.local" that was not supposed to be
there - killing
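
For anyone who wants to count how many jobs were affected, here is a
minimal Python sketch for pulling these messages out of a qmaster
messages file. It is only an illustration: the function name and the
regular expression are assumptions built from the single message quoted
above, not anything shipped with Grid Engine. The pattern accepts both
"@" and the " at " form that the list archive substitutes for it.

```python
import re

# Pattern for the qmaster message quoted above. Built from that one sample,
# so treat it as an assumption about the message format, not a specification.
KILL_RE = re.compile(
    r'execd(?:@| at )(?P<host>\S+)\s+reports running job\s+'
    r'\((?P<job>\d+)\.(?P<task>\d+)/(?P<pe_task>[^)]*)\)\s+in queue\s+'
    r'"(?P<queue>[^"]+)"\s+that was not supposed to be\s+there - killing'
)


def find_killed_jobs(text):
    """Return (job_id, task_id, host, queue) for each matching message."""
    # Messages may be wrapped across lines (as in the mail), so collapse
    # all runs of whitespace to single spaces before matching.
    flat = " ".join(text.split())
    return [
        (int(m["job"]), int(m["task"]), m["host"], m["queue"])
        for m in KILL_RE.finditer(flat)
    ]
```

To use it, read the qmaster messages file into a string and pass it to
find_killed_jobs(). The location of that file depends on where the
qmaster spool directory was placed at install time.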

This happened with Univa GE 8.0.0 on RHEL 4, and with Son of Grid
Engine 8.0.0a on RHEL 5, both using classic spooling with SGE_ROOT on
a high-performance, though busy, NFSv3 mount.  It's definitely the
qmaster start, and not an exec host going down, which triggers the job
loss.  The job loss happened whether execd_spool_dir was on that
shared NFS filesystem or internal to each exec host.

I have a hunch that switching from classic spooling to berkeleydb
might prevent this from happening (because the job loss doesn't happen
on the RHEL 4 cluster when it runs SGE 6.2u5 with BDB spooling), but
that's just a hunch.

I'll add that the job loss happens in testing too, when we manually
kill the qmaster, so it's not that the qmaster deaths and the job
losses have a common cause.  (And so it's not quite Dave Love's SGE
ticket #1347.)

In all cases we have
qmaster_params               none
execd_params                 none
reschedule_unknown           00:00:00

And pretty much a default config, qconf-wise.
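
To compare these three settings across clusters, a small Python sketch
along the following lines could be used. It assumes the "name value"
layout shown above, which is what `qconf -sconf` prints for the global
configuration. The helper names (parse_conf, watched_settings,
read_global_conf) are made up for this example.

```python
import subprocess

# The three settings listed in the message above.
WATCHED = ("qmaster_params", "execd_params", "reschedule_unknown")


def parse_conf(text):
    """Parse 'name value' lines into a dict, skipping comment/header lines."""
    conf = {}
    # Join continuation lines (trailing backslash) before splitting.
    for line in text.replace("\\\n", " ").splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and not parts[0].startswith("#"):
            conf[parts[0]] = parts[1].strip()
    return conf


def watched_settings(text):
    """Return only the watched settings; a missing one maps to None."""
    conf = parse_conf(text)
    return {key: conf.get(key) for key in WATCHED}


def read_global_conf():
    """Run `qconf -sconf`; needs qconf on PATH and a reachable qmaster."""
    result = subprocess.run(
        ["qconf", "-sconf"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

On a submit or admin host, watched_settings(read_global_conf()) would
return the three values as a dict, which is easy to diff between the
RHEL 4 and RHEL 5 clusters.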

Does anyone have insight so far as to how to prevent this "job loss
upon qmaster restart"?

And is this still true, as someone posted in March?

"There are the following spooling options if you want to
setup sge_shadowd:

- classic spooling on nfs (or nfs4)
- Berkeley DB spooling on nfs4
- Berkeley DB RPC server (still available in Grid Engine
   6.2u5, but no longer supported with Univa Grid Engine
   8.0.0)"

I'd be glad to provide any further details.  Thanks!

-- Paul Brunk, system administrator
Georgia Advanced Computing Resource Center
(formerly "Research Computing Center")
Enterprise IT Svcs, University of Georgia

_______________________________________________
users mailing list
users at gridengine.org
https://gridengine.org/mailman/listinfo/users



