[gridengine users] Mark an execution host as 'errored' in case of a NIS error

Dave Love d.love at liverpool.ac.uk
Sun Mar 24 21:43:49 UTC 2013

Campbell McLeay <campbell.mcleay at primefocusworld.com> writes:

> Thanks for all the suggestions. In the end the easiest was to just
> monitor the qmaster log and run a qmod -sq on the execution host's
> queue to suspend it, then a qmod -cj to clear the job's error state so
> that it gets rescheduled. This will do until sssd gets fixed.

Off topic, but why do you need sssd anyway?  It's been installed on our
new cluster, and is something I'll probably get rid of when I get

> Cheers,
> Campbell
> On 18/03/13 13:59, William Hay wrote:
>> On 15 March 2013 14:36, Campbell McLeay
>> <campbell.mcleay at primefocusworld.com> wrote:
>>> Hi,
>>> We're running Grid Engine 6.2u5, and we're having an issue with jobs not
>>> getting run because one node (out of several hundred) has a NIS error
>>> (so can't run the job). The whole job then sits in an error state, due
>>> presumably to the returned prolog errors. Is it possible to have the
>>> host set to 'Errored' in case of a NIS error, so it won't accept any
>>> more jobs? I haven't been able to find a way to do this so far.
>>> Cheers,
>> Rather than an error state, what about an alarm state?  Write a load
>> sensor that detects the problem and set an appropriate load_thresholds
>> entry on each queue to put the queue into alarm state when the problem
>> is detected.

Yes, or have Nagios (or whatever) disable/restrict the host.  There are
a couple of frameworks for executing tests listed in
<http://arc.liv.ac.uk/SGE/tools.html>, if you think the infrastructure
is worth it.
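
For the load-sensor route quoted above, a minimal sketch might look like
the following. It assumes a site-defined complex called nis_broken and a
probe account called nisprobe that exists only in NIS; both names are
hypothetical, and the complex would have to be added to the complex
configuration before a queue could refer to it. It only catches lookups
that fail cleanly, not ones that hang.

```python
#!/usr/bin/env python3
"""Sketch of a Grid Engine load sensor reporting whether a NIS lookup fails."""
import pwd
import socket
import sys

# Hypothetical names, not from this thread: an account served only by NIS,
# and a site-defined complex that the queues' load_thresholds would refer to.
PROBE_USER = "nisprobe"
COMPLEX_NAME = "nis_broken"


def lookup_ok(user, lookup=pwd.getpwnam):
    """Return True if the passwd lookup for `user` succeeds."""
    try:
        lookup(user)
        return True
    except KeyError:
        # getpwnam raises KeyError when no entry can be found.
        return False


def format_report(host, ok):
    """Build one load report: begin / host:name:value / end."""
    return "begin\n%s:%s:%d\nend\n" % (host, COMPLEX_NAME, 0 if ok else 1)


def main():
    # Assumption: this matches the host name the execd is registered under.
    host = socket.gethostname()
    while True:
        # The execd writes a line to stdin each load interval and
        # sends "quit" when the sensor should exit.
        line = sys.stdin.readline()
        if not line or line.strip() == "quit":
            break
        sys.stdout.write(format_report(host, lookup_ok(PROBE_USER)))
        sys.stdout.flush()


if __name__ == "__main__":
    main()
```

The script would be named in the load_sensor parameter of the host
configuration.  The type and relation operator of the complex, and the
threshold value used on the queues, are left for the site to choose.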

>> There is still
>> a risk of a race if the job starts just as the problem manifests but
>> it shouldn't be too bad.  As Reuti suggested you could have the prolog
>> run as a local user in order to work around the problem.

The prolog isn't really useful for sanity tests for distributed jobs
unless all the hosts are equally broken.  There ought to be a per-node

>> If you are
>> using 6.2u5 there are some security issues that arise from running
>> prolog/epilog as something other than the user (in particular running
>> as root).   It is possible to work around these issues but you need to
>> be careful. Alternatively an upgrade to one of the live forks should
>> solve the security issue.

[In a default configuration without CSP you have root on the exec nodes
anyhow <http://arc.liv.ac.uk/SGE/howto/sge-security.html>.]

Community Grid Engine:  http://arc.liv.ac.uk/SGE/
