[gridengine users] can't delete an exec host

Michael Stauffer mgstauff at gmail.com
Wed Sep 6 16:05:18 UTC 2017


That was it, thanks! The node had failed so I didn't think there'd be
anything running on there, but two jobs were stuck in the basic.q on that
node. I've killed them and now can remove host compute-2-4.
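
For anyone hitting the same error, the check and cleanup can be sketched roughly as below. This is a minimal sketch, not the exact commands run here. The helper name jobs_on_queue_instance is hypothetical, and it assumes the default qstat listing format (two header lines, then one row per job with the queue instance in the eighth column). The Grid Engine commands themselves appear only in comments because they act on a live cluster.

```shell
#!/bin/sh
# Hypothetical helper: read a default `qstat` listing on stdin and print the
# IDs of jobs shown on one queue instance, e.g. basic.q@compute-2-4.local.
# Assumption: two header lines, then one row per job, with the queue
# instance in column 8 (pending jobs have no queue column and never match).
jobs_on_queue_instance() {
    awk -v qi="$1" 'NR > 2 && $8 == qi { print $1 }' | sort -un
}

# Intended use on the cluster (not run here):
#   qstat -u '*' | jobs_on_queue_instance basic.q@compute-2-4.local
#   qdel -f <job_id>             # force-delete each job that is reported
#   qconf -de compute-2-4.local  # then retry removing the execution host
```

If the helper prints nothing for the queue instance, leftover jobs are probably not what is keeping the host referenced, and the queue configuration itself is the next thing to check.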

-M

On Wed, Sep 6, 2017 at 11:41 AM, Feng Zhang <prod.feng at gmail.com> wrote:

> Are there any jobs running in the basic.q queue instance on compute-2-4?
>
> On Wed, Sep 6, 2017 at 11:33 AM, Michael Stauffer <mgstauff at gmail.com>
> wrote:
> > On Wed, Sep 6, 2017 at 11:16 AM, Feng Zhang <prod.feng at gmail.com> wrote:
> >>
> >> It seems the SGE master did not get refreshed with the new host group.
> >> Maybe you can try:
> >>
> >> 1. restart SGE master
> >
> >
> > Is it safe to do this with jobs queued and running? I think it's not
> > reliable, i.e. jobs can get killed and de-queued?
> >
> >>
> >> or
> >>
> >> 2. change the "hostlist" of basic.q to any single node, like
> >> "compute-1-0.local", wait till it gets refreshed, then change it back
> >> to "@basichosts".
> >
> >
> > I've done this, but it's not refreshing (been about 10 minutes now). I'm
> > still getting the error when I try to delete exec host compute-2-4, and
> > qhost is still showing basic.q on the nodes in @basichosts.
> >
> > Interestingly, host compute-2-4 was removed from another queue
> > (qlogin.basic.q) that also uses @basichosts, so it's something about
> > basic.q that's stuck.
> >
> > Is there some way to refresh things other than restarting qmaster?
> >
> > -M
> >
> >
> >>
> >>
> >>
> >>
> >> On Wed, Sep 6, 2017 at 10:29 AM, Michael Stauffer <mgstauff at gmail.com>
> >> wrote:
> >> > SoGE 8.1.8
> >> >
> >> > Hi,
> >> >
> >> > I'm having trouble deleting an execution host. I've removed it from
> >> > the host group, but when I try to delete it with qconf, it says it's
> >> > still part of 'basic.q'. Here's the relevant output. Does anyone have
> >> > any suggestions?
> >> >
> >> > [root@chead ~]# qconf -de compute-2-4.local
> >> > Host object "compute-2-4.local" is still referenced in cluster queue "basic.q".
> >> >
> >> > [root@chead ~]# qconf -sq basic.q
> >> > qname                 basic.q
> >> > hostlist              @basichosts
> >> > seq_no                0
> >> > load_thresholds       np_load_avg=1.74
> >> > suspend_thresholds    NONE
> >> > nsuspend              1
> >> > suspend_interval      00:05:00
> >> > priority              0
> >> > min_cpu_interval      00:05:00
> >> > processors            UNDEFINED
> >> > qtype                 BATCH
> >> > ckpt_list             NONE
> >> > pe_list               make mpich mpi orte unihost serial
> >> > rerun                 FALSE
> >> > slots                 8,[compute-1-2.local=3],[compute-1-0.local=7], \
> >> >                       [compute-1-1.local=7],[compute-1-3.local=7], \
> >> >                       [compute-1-5.local=8],[compute-1-6.local=8], \
> >> >                       [compute-1-7.local=8],[compute-1-8.local=8], \
> >> >                       [compute-1-9.local=8],[compute-1-10.local=8], \
> >> >                       [compute-1-11.local=8],[compute-1-12.local=8], \
> >> >                       [compute-1-13.local=8],[compute-1-14.local=8], \
> >> >                       [compute-1-15.local=8]
> >> > tmpdir                /tmp
> >> > shell                 /bin/bash
> >> > prolog                NONE
> >> > epilog                NONE
> >> > shell_start_mode      posix_compliant
> >> > starter_method        NONE
> >> > suspend_method        NONE
> >> > resume_method         NONE
> >> > terminate_method      NONE
> >> > notify                00:00:60
> >> > owner_list            NONE
> >> > user_lists            NONE
> >> > xuser_lists           NONE
> >> > subordinate_list      NONE
> >> > complex_values        NONE
> >> > projects              NONE
> >> > xprojects             NONE
> >> > calendar              NONE
> >> > initial_state         default
> >> > s_rt                  INFINITY
> >> > h_rt                  INFINITY
> >> > s_cpu                 INFINITY
> >> > h_cpu                 INFINITY
> >> > s_fsize               INFINITY
> >> > h_fsize               INFINITY
> >> > s_data                INFINITY
> >> > h_data                INFINITY
> >> > s_stack               INFINITY
> >> > h_stack               INFINITY
> >> > s_core                INFINITY
> >> > h_core                INFINITY
> >> > s_rss                 INFINITY
> >> > h_rss                 INFINITY
> >> > s_vmem                19G
> >> > h_vmem                19G
> >> >
> >> > [root@chead ~]# qconf -shgrp @basichosts
> >> > group_name @basichosts
> >> > hostlist compute-1-0.local compute-1-2.local compute-1-3.local \
> >> >          compute-1-5.local compute-1-6.local compute-1-7.local \
> >> >          compute-1-8.local compute-1-9.local compute-1-10.local \
> >> >          compute-1-11.local compute-1-12.local compute-1-13.local \
> >> >          compute-1-14.local compute-1-15.local compute-2-0.local \
> >> >          compute-2-2.local compute-2-5.local compute-2-7.local \
> >> >          compute-2-8.local compute-2-9.local compute-2-11.local \
> >> >          compute-2-12.local compute-2-13.local compute-2-15.local \
> >> >          compute-2-6.local
> >> >
> >> > Thanks
> >> >
> >> > -M
> >> >
> >> > _______________________________________________
> >> > users mailing list
> >> > users at gridengine.org
> >> > https://gridengine.org/mailman/listinfo/users
> >> >
> >>
> >>
> >>
> >> --
> >> Best,
> >>
> >> Feng
> >
> >
>
>
>
> --
> Best,
>
> Feng
>