[gridengine users] Strange emails following crash

Simon Matthews simon.d.matthews at gmail.com
Thu Apr 16 04:30:41 UTC 2015


A couple of days ago, we had a power outage and our 6.2U5 SGE qmaster
would not start when the qmaster machine was rebooted. Running the
qmaster in the foreground, I got a core dump.

I suspected that the spooldb was corrupted (we use Berkeley DB), so I
re-created the spooldb/sge and spooldb/sge_job files using the
following procedure:
1. db_dump spooldb/sge to a file.
2. Create a new grid to get empty sge and sge_job dbs.
3. Copy the empty sge and sge_job files into my old spooldb.
4. db_load the new spooldb/sge from the earlier db_dump.
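
For reference, steps 1, 3 and 4 can be sketched roughly as below. This is
only an illustration: the paths and the helper name rebuild_commands are
made up, step 2 (creating the fresh grid) is done by hand, and the helper
just builds the command lines without running anything. It assumes the
standard Berkeley DB db_dump/db_load utilities with their -f option, and
that the qmaster is stopped while the files are replaced.

```python
import shlex
from pathlib import Path


def rebuild_commands(old_spooldb, fresh_spooldb, dump_file):
    """Build (but do not run) the shell commands for steps 1, 3 and 4.

    old_spooldb   -- spooldb directory of the crashed grid (hypothetical path)
    fresh_spooldb -- spooldb directory of the newly created, empty grid
    dump_file     -- text file that receives the db_dump output
    """
    old = Path(old_spooldb)
    fresh = Path(fresh_spooldb)
    q = shlex.quote
    return [
        # Step 1: dump the old configuration database to a text file.
        f"db_dump -f {q(str(dump_file))} {q(str(old / 'sge'))}",
        # Step 3: overwrite the old databases with the empty ones.
        f"cp {q(str(fresh / 'sge'))} {q(str(fresh / 'sge_job'))} {q(str(old))}/",
        # Step 4: load the dump into the now-empty sge database.
        # Note that sge_job is not reloaded and stays empty.
        f"db_load -f {q(str(dump_file))} {q(str(old / 'sge'))}",
    ]


if __name__ == "__main__":
    # Example paths only; adjust to the actual installation.
    for cmd in rebuild_commands("/old/spooldb", "/new/spooldb", "/tmp/sge.dump"):
        print(cmd)
```

Printing the commands first makes it easy to review them before running
each one by hand.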

We use Berkeley db spooling because we run a very large number of jobs
(mostly very small jobs).

With this process, the qmaster would start and my configuration was
retained from before the crash.

Now, I see occasional emails from the execd clients with the following:
Job 4433950 caused action: none
 User        = build
 Queue       = (null)@(null)
 Start Time  = <unknown>
 End Time    = <unknown>
failed before writing exit_status:shepherd exited with exit status 19:
before writing exit_status

As can be seen, the queue name is invalid.
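
To count how many of these emails arrive and which jobs they refer to, the
message body can be parsed with something like the sketch below. It
assumes the layout shown above ("Key = value" lines and a "Job <id> caused
action" line); the function names are hypothetical and not part of SGE.

```python
import re

# Sample body copied from the email quoted above.
SAMPLE = """Job 4433950 caused action: none
 User        = build
 Queue       = (null)@(null)
 Start Time  = <unknown>
 End Time    = <unknown>
failed before writing exit_status:shepherd exited with exit status 19:
before writing exit_status
"""

FIELDS = ("User", "Queue", "Start Time", "End Time")


def parse_job_report(text):
    """Extract the job id and the 'Key = value' fields from one email body."""
    report = {}
    match = re.search(r"Job (\d+) caused action", text)
    if match:
        report["job_id"] = int(match.group(1))
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() in FIELDS:
            report[key.strip()] = value.strip()
    return report


def has_invalid_queue(report):
    """True when the queue field is missing or contains '(null)'."""
    return "(null)" in report.get("Queue", "(null)")


if __name__ == "__main__":
    parsed = parse_job_report(SAMPLE)
    print(parsed["job_id"], has_invalid_queue(parsed))
```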

Any idea what might cause this, and how I can stop it?

Simon


