[gridengine users] Strange emails following crash

Reuti reuti at staff.uni-marburg.de
Thu Apr 16 09:23:13 UTC 2015


Hi,

> On 16.04.2015 at 06:30, Simon Matthews <simon.d.matthews at gmail.com> wrote:
> 
> A couple of days ago, we had a power outage and our 6.2U5 SGE qmaster
> would not start when the qmaster machine was rebooted. Running the
> qmaster in foreground, I got a core dump.
> 
> I suspected that the spooldb was corrupted (we use Berkeley DB), so I
> re-created the spooldb/sge and spooldb/sge_job files using the
> following procedure:
> 1. db_dump spooldb/sge to a file.
> 2. Create a new grid to get empty sge and sge_job dbs.
> 3. Copy the empty sge and sge_job files into my old spooldb
> 4. db_load the new spooldb/sge from the earlier db_dump.
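
For reference, the four steps above could be scripted roughly as follows. This is only a sketch: the function name rebuild_spool_db, the DRY_RUN switch and all paths are made up for illustration, the "fresh" directory is assumed to be the spooldb of a newly created empty grid, and the qmaster should be stopped before touching the files.

```shell
#!/bin/sh
# Sketch only, not a tested recovery tool.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

rebuild_spool_db() {
    spooldb=$1    # existing (suspect) spooldb directory
    fresh=$2      # spooldb of a newly created, empty grid (assumption)
    dumpfile=$3   # text dump written by db_dump

    # 1. Dump the old database to a flat file.
    run db_dump -f "$dumpfile" "$spooldb/sge" || return 1
    # 2./3. Replace the old files with the empty ones from the new grid.
    run cp "$fresh/sge" "$fresh/sge_job" "$spooldb/" || return 1
    # 4. Load the dumped records into the now-empty database.
    run db_load -f "$dumpfile" "$spooldb/sge"
}
```

Running it once with DRY_RUN=1 shows exactly which commands would be issued, which is worth checking before letting it touch a production spool directory.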

Are there jobs in pending state you want to keep? You can try to save SGE's configuration, start from a fresh spooling DB, and restore the settings:

$SGE_ROOT/sge/util/upgrade_modules/save_sge_config.sh and load_sge_config.sh, respectively
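
A rough sketch of how the two scripts might be wrapped. I am assuming here that both take the backup directory as their only argument, so please check their usage output first. The function names, the DRY_RUN switch and the backup path are hypothetical, and re-initialising the spooling DB between the two calls is not shown.

```shell
#!/bin/sh
# Sketch only. Assumption: save_sge_config.sh and load_sge_config.sh
# each take the backup directory as their single argument.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

save_config() {
    moddir=$1   # directory containing the upgrade_modules scripts
    backup=$2   # where the saved configuration goes (hypothetical path)
    run "$moddir/save_sge_config.sh" "$backup"
}

load_config() {
    moddir=$1
    backup=$2
    run "$moddir/load_sge_config.sh" "$backup"
}
```

The idea would be to call save_config with the upgrade_modules directory and, say, /tmp/sge_backup while the old qmaster is still running, set up the fresh spooling DB, and then call load_config with the same two arguments.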

-- Reuti


> We use Berkeley db spooling because we run a very large number of jobs
> (mostly very small jobs).
> 
> With this process, the qmaster would start and my configuration was
> retained from before the crash.
> 
> Now, I see occasional emails from the execd clients with the following:
> Job 4433950 caused action: none
> User        = build
> Queue       = (null)@(null)
> Start Time  = <unknown>
> End Time    = <unknown>
> failed before writing exit_status:shepherd exited with exit status 19:
> before writing exit_status
> 
> As can be seen, the queue name is invalid.
> 
> Any idea what might cause this, and how I can stop it?
> 
> Simon
