[gridengine users] C|!!!!!!!!!! got NULL element for EH_name !!!!!!!!!!

Daniel Povey dpovey at gmail.com
Sat Nov 10 23:26:11 UTC 2018


Univa has different pricing depending on who you are.  I think I calculated
based on what they published that it would be something like $5k per year
for our cluster, but after talking to them they said it would be about a
tenth of that for an academic institution.
But thanks for the info-- if it's not that much better I may reconsider.

Dan

On Sat, Nov 10, 2018 at 6:19 PM Joseph Farran <jfarran at uci.edu> wrote:

> Glad you were able to fix it Dan.
>
> I looked at Univa Grid Engine a while ago and it was super expensive.
>
> I was able to ask lots of questions to a potential candidate for a position
> we had who was using Univa GE.   His sentiments were that it was better
> than the free version BUT not that much better and still plagued with
> "weird" issues.
>
> Since we cannot afford Univa, our department big-wig wants us to move to
> Slurm for our next cluster.    Not sure how much better Slurm is, but it does
> seem to have good support.
>
> Joseph
> On 11/10/2018 2:03 PM, Daniel Povey wrote:
>
> I was able to fix it, although I suspect that my fix
> may have been disruptive to the jobs.
>
> Firstly, I believe the problem was that gridengine does not handle a
> deleted job that is on a host that has been deleted, and it dies when it
> sees it.   Presumably the bug is in allowing it to be deleted in the first
> place.
>
> Anyway, my fix (after backing up the directory /var/spool/gridengine) was
> to move the file /var/spool/gridengine/spooldb/sge_job to a temporary
> location, restart the qmaster, add the host back with qconf -ah, stop the
> qmaster, restore the old database /var/spool/gridengine/spooldb/sge_job,
> and restart the qmaster.
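
For reference, the sequence described above could be sketched roughly as the script below. It is a dry-run outline only, not a tested recovery tool: the paths, the service name and `qconf -ah` come from the message, while the `run` helper, the `recover_qmaster` function, the backup location and the two arguments are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Dry-run sketch of the recovery sequence described in the message.
# SPOOL, JOB_DB and the service name are taken from the thread;
# the backup path and the function arguments are hypothetical.
SPOOL=/var/spool/gridengine
JOB_DB="$SPOOL/spooldb/sge_job"

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

recover_qmaster() {
    host=$1      # the host that was deleted by mistake
    tmp_dir=$2   # temporary location for the job database

    run cp -a "$SPOOL" "$SPOOL.bak"              # back up the spool directory first
    run mv "$JOB_DB" "$tmp_dir/sge_job"          # set the job database aside
    run systemctl restart gridengine-qmaster     # qmaster starts without the old jobs
    run qconf -ah "$host"                        # add the host back, as in the message
    run systemctl stop gridengine-qmaster
    run mv "$tmp_dir/sge_job" "$JOB_DB"          # restore the original job database
    run systemctl start gridengine-qmaster
}
```

Calling `recover_qmaster node01 /tmp/sge-fix` (both values are placeholders) only prints the seven commands; setting `DRY_RUN=0` would execute them, which should only be done as root and after checking each step against the actual installation.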
>
> Before doing that whole procedure, to stop the hosts getting confused I
> stopped all the gridengine-exec services.  That probably wasn't optimal
> because clients like qsub and qstat would still have been able to access
> the queue in the interim, and it definitely would have confused them and
> killed some processes.  Unfortunately I had to do this on short notice and
> wasn't sure how to use iptables to close off those ports from outside the
> qmaster while I did the maintenance-- that would have been a better
> solution.
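
The iptables approach mentioned above might look something like the sketch below. It only prints the rules rather than applying them. It assumes the qmaster listens on TCP port 6444, which is the usual sge_qmaster default but is not stated in this thread, and `qmaster_fw_rule` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: reject connections to the qmaster port that do not arrive on the
# loopback interface, so only processes on the qmaster host can reach it.
# ASSUMPTION: the qmaster port is 6444 (common default); verify it on the
# real cluster, e.g. via $SGE_QMASTER_PORT or /etc/services.
QMASTER_PORT=6444

# Print the iptables command for the given action:
#   -I inserts the rule at the top of INPUT, -D deletes it again.
qmaster_fw_rule() {
    action=$1
    echo "iptables $action INPUT -p tcp --dport $QMASTER_PORT ! -i lo -j REJECT"
}

qmaster_fw_rule -I   # rule to apply before maintenance
qmaster_fw_rule -D   # rule to remove afterwards
```

The printed commands would need to be run as root. Whether blocking this one port is enough depends on the setup; the execution daemons normally use a separate port (6445 by default), which is again an assumption here.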
>
> Also I encountered a hiccup: `systemctl stop gridengine-qmaster`
> didn't actually work the second time. The process was still running with
> the old database, so I had to manually kill it and retry.
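
One way to guard against the stop problem described above is to check that the process is really gone before touching the database. The sketch below assumes the daemon's process name is `sge_qmaster`, which is not stated in the message; `wait_for_exit` and `stop_qmaster` are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: stop the service, then confirm the daemon has actually exited.
# ASSUMPTION: the qmaster process is named sge_qmaster.

# Wait up to $2 seconds (default 10) for every process named exactly $1
# to exit. Returns 0 once none is left, 1 on timeout.
wait_for_exit() {
    name=$1
    tries=${2:-10}
    while pgrep -x "$name" >/dev/null 2>&1; do
        [ "$tries" -gt 0 ] || return 1
        tries=$((tries - 1))
        sleep 1
    done
    return 0
}

stop_qmaster() {
    systemctl stop gridengine-qmaster
    if ! wait_for_exit sge_qmaster 10; then
        echo "sge_qmaster still running after stop; sending SIGTERM" >&2
        pkill -x sge_qmaster
        wait_for_exit sge_qmaster 10
    fi
}
```

`stop_qmaster` returns non-zero if the process survives even the explicit signal, so a maintenance script can refuse to move or restore the database in that case.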
>
> Anyway this whole episode is making me think more seriously about moving
> to Univa GridEngine.  I've known for a long time that the free version has
> a lot of bugs, and I just don't have time to deal with this type of thing.
>
>
> On Sat, Nov 10, 2018 at 4:49 PM Marshall2, John (SSC/SPC) <
> john.marshall2 at canada.ca> wrote:
>
>> Hi,
>>
>> I've never seen this but I would start with:
>> 1) strace qmaster during restart to try to see at which point it is dying
>> (e.g., loading a config file)
>> 2) look for any reference to the name of the host you deleted in the spool
>> area and do some cleanup
>> 3) clean out the jobs spool area
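
For step 2, a search of the spool area could be sketched as below. The function name `find_host_refs` is hypothetical, and the spool path and host name in the usage note are placeholders to be replaced with the real values.

```shell
#!/bin/sh
# Sketch: list files under a spool directory that still mention a host name.
find_host_refs() {
    spool_dir=$1
    host=$2
    # -r: recurse into subdirectories
    # -l: print only the names of matching files
    # -F: treat the host name as a fixed string, not a regular expression
    # Binary files (such as a spool database) are also listed when they match.
    grep -rlF -- "$host" "$spool_dir"
}
```

For example, `find_host_refs /var/spool/gridengine oldnode.example.com` would print one path per matching file. This only locates references; what to clean up, and how, still has to be decided per file.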
>>
>> HTH,
>> John
>>
>> On Sat, 2018-11-10 at 16:23 -0500, Daniel Povey wrote:
>>
>> Has anyone found this error, and managed to fix it?
>> I am in a very difficult situation.
>> I deleted a host (qconf -de hostname) thinking that the machine no longer
>> existed, but it did exist, and there was a job in 'dr' state there.
>> After I attempted to force-delete that job (qdel -f job-id), the queue
>> master died with out-of-memory, and now I can't restart qmaster.
>>
>> So now I don't know how to fix it.  Am I just completely lost now?
>>
>> Dan
>>
>> _______________________________________________
>>
>> users mailing list
>>
>> users at gridengine.org
>>
>> https://gridengine.org/mailman/listinfo/users
>>
>>