[gridengine users] Strange emails following crash
Simon Matthews
simon.d.matthews at gmail.com
Thu Apr 16 04:30:41 UTC 2015
A couple of days ago, we had a power outage and our 6.2U5 SGE qmaster
would not start when the qmaster machine was rebooted. Running the
qmaster in foreground, I got a core dump.
I suspected that the spooldb was corrupted (we use Berkeley DB), so I
re-created the spooldb/sge and spooldb/sge_job files using the
following procedure:
1. db_dump spooldb/sge to a file.
2. Create a new grid to get empty sge and sge_job dbs.
3. Copy the empty sge and sge_job files into my old spooldb.
4. db_load the new spooldb/sge from the earlier db_dump.
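For readers who want to see the four steps as concrete commands, here is a minimal dry-run sketch. It only builds the command lines and runs nothing. The directory and file paths are hypothetical placeholders, the function name rebuild_plan is made up for illustration, and step 2 (creating a fresh grid) is assumed to have been done already by a normal install. The -f options of db_dump and db_load are the standard Berkeley DB utility flags for the output and input file.

```python
from pathlib import PurePosixPath


def rebuild_plan(old_spool, fresh_spool, dump_file):
    """Return the commands for the steps above as argument lists.

    Nothing is executed; all paths are supplied by the caller.
    """
    old_spool = PurePosixPath(old_spool)
    fresh_spool = PurePosixPath(fresh_spool)
    return [
        # Step 1: dump the old sge database to a text file.
        ["db_dump", "-f", str(dump_file), str(old_spool / "sge")],
        # Steps 2-3: copy the empty databases from the fresh grid
        # over the old ones (assumes the fresh grid already exists).
        ["cp", str(fresh_spool / "sge"), str(old_spool / "sge")],
        ["cp", str(fresh_spool / "sge_job"), str(old_spool / "sge_job")],
        # Step 4: load the earlier dump into the new, empty sge database.
        ["db_load", "-f", str(dump_file), str(old_spool / "sge")],
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for cmd in rebuild_plan("/old/spooldb", "/fresh/spooldb", "/tmp/sge.dump"):
        print(" ".join(cmd))
```

Note that, as in the procedure described, only the sge database is dumped and reloaded; sge_job is replaced by the empty copy.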
We use Berkeley db spooling because we run a very large number of jobs
(mostly very small jobs).
With this process, the qmaster would start and my configuration was
retained from before the crash.
Now, I see occasional emails from the execd clients with the following:
Job 4433950 caused action: none
User = build
Queue = (null)@(null)
Start Time = <unknown>
End Time = <unknown>
failed before writing exit_status:shepherd exited with exit status 19:
before writing exit_status
As can be seen, the queue name is invalid.
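To count or filter such messages, a small parser can help. The sketch below assumes every report follows the layout quoted above (a "Job N caused action: ..." line followed by "Key = value" lines); the names parse_report and has_unknown_queue are hypothetical helpers, not part of SGE. The sample text is the message above with its wrapped last line joined.

```python
import re

REPORT = """\
Job 4433950 caused action: none
User = build
Queue = (null)@(null)
Start Time = <unknown>
End Time = <unknown>
failed before writing exit_status:shepherd exited with exit status 19: before writing exit_status
"""


def parse_report(text):
    """Extract the job id, action, and 'Key = value' fields from a report."""
    fields = {}
    job = re.search(r"^Job (\d+) caused action: (.*)$", text, re.MULTILINE)
    if job:
        fields["job_id"] = int(job.group(1))
        fields["action"] = job.group(2).strip()
    # Keys consist of letters and spaces only, e.g. "Start Time".
    pattern = r"^([A-Za-z ]+?)[ \t]*=[ \t]*(.*)$"
    for key, value in re.findall(pattern, text, re.MULTILINE):
        fields[key.strip()] = value.strip()
    return fields


def has_unknown_queue(fields):
    """True when the queue is missing or contains a (null) component."""
    return "(null)" in fields.get("Queue", "(null)")


if __name__ == "__main__":
    report = parse_report(REPORT)
    print(report["job_id"], report["Queue"], has_unknown_queue(report))
```

Collecting the job ids of reports flagged this way makes it easier to check whether they all belong to jobs that predate the spool rebuild.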
Any idea what might cause this, and how I can stop it?
Simon