[gridengine users] C|!!!!!!!!!! got NULL element for EH_name !!!!!!!!!!

Daniel Povey dpovey at gmail.com
Sat Nov 10 22:03:13 UTC 2018


I was able to fix it, although I suspect that my fix
may have been disruptive to the jobs.

Firstly, I believe the problem was that gridengine does not handle a
deleted job that is on a host that has been deleted, and it dies when it
sees it. Presumably the bug is in allowing the host to be deleted in the
first place.

Anyway, my fix (after backing up the directory /var/spool/gridengine) was
to move the file /var/spool/gridengine/spooldb/sge_job to a temporary
location, restart the qmaster, add the host back with qconf -ah, stop the
qmaster, restore the old database /var/spool/gridengine/spooldb/sge_job,
and restart the qmaster.
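
The sequence above can be sketched as a small shell function. This is only
an illustration, not the exact commands that were run: the backup location,
the temporary file name, the host name argument and the `run` helper are
all placeholders. By default it only prints the commands.

```shell
# Sketch of the recovery sequence. Prints each command unless DRY_RUN=0.
# Assumed (not from the message): BACKUP and SAVED paths, the host name.
SPOOL=/var/spool/gridengine
JOBDB="$SPOOL/spooldb/sge_job"
BACKUP="${BACKUP:-/root/gridengine-spool.bak}"   # hypothetical backup path
SAVED="${SAVED:-/tmp/sge_job.saved}"             # hypothetical temp path

run() {
    # Hypothetical helper: echo the command in dry-run mode, else run it.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

recover_qmaster() {
    host="$1"
    run cp -a "$SPOOL" "$BACKUP"              # back up the spool directory
    run mv "$JOBDB" "$SAVED"                  # move the job database aside
    run systemctl start gridengine-qmaster    # qmaster comes up without it
    run qconf -ah "$host"                     # add the host back, as in the message
    run systemctl stop gridengine-qmaster
    run mv "$SAVED" "$JOBDB"                  # restore the old job database
    run systemctl start gridengine-qmaster
}

# Usage (dry run): recover_qmaster node01
```

Review the printed commands first, and set DRY_RUN=0 only once they match
the local setup.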

Before doing that whole procedure, to stop the hosts getting confused I
stopped all the gridengine-exec services.  That probably wasn't optimal
because clients like qsub and qstat would still have been able to access
the queue in the interim, and it definitely would have confused them and
killed some processes.  Unfortunately I had to do this on short notice and
wasn't sure how to use iptables to close off those ports from outside the
qmaster while I did the maintenance-- that would have been a better
solution.
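
For reference, a rough sketch of the iptables approach mentioned above. It
is untested on this cluster and rests on assumptions: 6444 is the
conventional default sge_qmaster port, but a site may use another one (see
SGE_QMASTER_PORT or /etc/services), and `block_rule` is a hypothetical
helper that only prints the rule so it can be reviewed before being run as
root.

```shell
# Sketch: build an iptables rule that rejects TCP connections to the
# qmaster port unless they arrive on the loopback interface.
# Assumption: 6444 is the qmaster port; override via SGE_QMASTER_PORT.
QMASTER_PORT="${SGE_QMASTER_PORT:-6444}"

block_rule() {
    # $1 is -I to insert the rule before maintenance, -D to delete it after.
    echo "iptables $1 INPUT -p tcp --dport $QMASTER_PORT ! -i lo -j REJECT"
}

# Usage: review the output, then run it as root.
#   block_rule -I    # before maintenance
#   block_rule -D    # afterwards
```

Note that a rule like this also cuts off the exec hosts, not just qsub and
qstat clients, for as long as it is in place.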

Also, I encountered a hiccup: `systemctl stop gridengine-qmaster` didn't
actually work the second time. The process was still running with the old
database, so I had to manually kill it and retry.
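
One way to catch that situation is to check that the daemon has really
exited before touching the database. A minimal sketch, assuming the daemon's
process name is sge_qmaster and that pgrep is available; `wait_for_exit` is
a hypothetical helper, not part of gridengine.

```shell
# Sketch: poll until no process with the given exact name remains.
still_running() {
    pgrep -x "$1" >/dev/null
}

wait_for_exit() {
    # $1 = process name, $2 = maximum number of checks (one second apart).
    # Returns 0 once the process is gone, 1 if it is still there at the end.
    name="$1"
    tries="$2"
    i=0
    while still_running "$name"; do
        i=$((i + 1))
        [ "$i" -ge "$tries" ] && return 1
        sleep 1
    done
    return 0
}

# Usage:
#   systemctl stop gridengine-qmaster
#   wait_for_exit sge_qmaster 10 || echo "sge_qmaster is still running" >&2
```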

Anyway this whole episode is making me think more seriously about moving to
Univa GridEngine.  I've known for a long time that the free version has a
lot of bugs, and I just don't have time to deal with this type of thing.


On Sat, Nov 10, 2018 at 4:49 PM Marshall2, John (SSC/SPC) <
john.marshall2 at canada.ca> wrote:

> Hi,
>
> I've never seen this but I would start with:
> 1) strace qmaster during restart to try to see at which point it is dying
> (e.g.,
> loading a config file)
> 2) look for any reference to the name of the host you deleted in the spool
> area and do some cleanup
> 3) clean out the jobs spool area
>
> HTH,
> John
>
> On Sat, 2018-11-10 at 16:23 -0500, Daniel Povey wrote:
>
> Has anyone found this error, and managed to fix it?
> I am in a very difficult situation.
> I deleted a host (qconf -de hostname) thinking that the machine no longer
> existed, but it did exist, and there was a job in 'dr' state there.
> After I attempted to force-delete that job (qdel -f job-id), the queue
> master died with out-of-memory, and now I can't restart qmaster.
>
> So now I don't know how to fix it.  Am I just completely lost now?
>
> Dan