[gridengine users] Recovering from disk corruption
w.hay at ucl.ac.uk
Tue Apr 17 16:29:11 UTC 2012
We recently had some planned downtime on our cluster to allow for
testing of our machine room's electrical supply. Unfortunately, when
power was restored, fsck found some corruption on the filesystem that
holds our grid engine configuration. It was able to correct problems
at the file system level but doesn't appear to have managed to get
everything correct.
The main effects appear to be:
1) qconf -sel only lists 63 of the 904 hosts in the cluster.
2) One of our admin hosts no longer appears to be present.
3) Although all our cluster queues and hostgroups appear to be correct,
   only a single queue instance is displayed by qstat -f.
Although I can correct problems 1 and 2 using qconf, they reoccur if I
softstop and then restart the queue master. In an attempt to fix 3
I've rerun the scripts that recreate the cluster queues with a
cosmetic change, but this had no effect.
While investigating (1) I found that while we appear to have a file in
$SGE_ROOT/default/spool/exec_hosts for all 904 hosts, not all of them
contained an exec host configuration. Some appeared to be random files
from the job spool instead. I've removed these and used qconf to
redefine the exec_hosts, but this did not make a difference.
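
The check described above can be roughed out as a script. This is only
a minimal sketch, and it rests on an assumption that should be verified
against a known-good file first: that a valid classic-spool exec host
file contains a line starting with "hostname" followed by the same name
as the file itself. The function names are made up for illustration and
are not part of grid engine.

```python
import os
import sys


def looks_like_exec_host(path, expected_name):
    """Return True if the file has a 'hostname <expected_name>' line.

    ASSUMPTION: a valid exec host spool file carries such a line; check
    this against a known-good file before trusting the result.
    """
    try:
        with open(path, "r", errors="replace") as fh:
            for line in fh:
                fields = line.split()
                if len(fields) >= 2 and fields[0] == "hostname":
                    # Host names are compared case-insensitively.
                    return fields[1].lower() == expected_name.lower()
    except OSError:
        return False
    return False


def find_suspect_files(spool_dir):
    """List entries in spool_dir that do not look like exec host configs."""
    suspects = []
    for name in sorted(os.listdir(spool_dir)):
        path = os.path.join(spool_dir, name)
        if not os.path.isfile(path) or not looks_like_exec_host(path, name):
            suspects.append(name)
    return suspects


if __name__ == "__main__":
    # Default to the directory named in this post; pass another as argv[1].
    default_dir = os.path.join(
        os.environ.get("SGE_ROOT", "."), "default", "spool", "exec_hosts"
    )
    spool_dir = sys.argv[1] if len(sys.argv) > 1 else default_dir
    if os.path.isdir(spool_dir):
        for suspect in find_suspect_files(spool_dir):
            print(suspect)
    else:
        print("not a directory: " + spool_dir, file=sys.stderr)
```

It only reads files and prints the names of those that fail the check,
so it should be safe to run against a copy of the spool directory. The
same idea could presumably be extended to the other spool directories,
given a similar marker line for each object type.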
I'm guessing that somewhere in the grid engine config there is another
file, or files, that aren't what they are supposed to be, and that this
is causing the problem. While I can restore from backup, or just do a
fresh install and run the various scripts which I used to create our
config, I was wondering if anyone had written a tool to validate the
on-disk config.
We use classic spooling on 6.2u3.