[gridengine users] can't delete an exec host

Michael Stauffer mgstauff at gmail.com
Wed Sep 6 17:22:04 UTC 2017


On Wed, Sep 6, 2017 at 12:42 PM, Reuti <reuti at staff.uni-marburg.de> wrote:

>
> > On 06.09.2017 at 17:33, Michael Stauffer <mgstauff at gmail.com> wrote:
> >
> > On Wed, Sep 6, 2017 at 11:16 AM, Feng Zhang <prod.feng at gmail.com> wrote:
> > It seems SGE master did not get refreshed with new hostgroup. Maybe you
> > can try:
> >
> > 1. restart SGE master
> >
> > Is it safe to do this with jobs queued and running? I think it's not
> > reliable, i.e. jobs can get killed and de-queued?
>
> Just to mention that it's safe to restart the qmaster, or even to reboot the
> machine the qmaster is running on. Nothing will happen to the running jobs
> on the exechosts.
>

OK good to know. I've done that before and seen them finish, although some
googling suggested people have seen jobs get killed. Does a qmaster
restart, however, empty the queue? I imagine a reboot would too, unless the
queue is stored in a file?

-M


>
> -- Reuti
>
>
> > or
> >
> > 2. change basic.q, "hostlist" to any node, like "compute-1-0.local",
> > wait till it gets refreshed; then change it back to "@basichosts".
> >
> > I've done this, but it's not refreshing (been about 10 minutes now). I'm
> > still getting the error when I try to delete exec host compute-2-4, and
> > qhost is still showing basic.q on the nodes in @basichosts.
> >
> > Interestingly, host compute-2-4 was removed from another queue
> > (qlogin.basic.q) that also uses @basichosts, so it's something about
> > basic.q that's stuck.
> >
> > Is there some way to refresh things other than restarting qmaster?
> >
> > -M
> >
> >
> >
> >
> >
> > On Wed, Sep 6, 2017 at 10:29 AM, Michael Stauffer <mgstauff at gmail.com> wrote:
> > > SoGE 8.1.8
> > >
> > > Hi,
> > >
> > > I'm having trouble deleting an execution host. I've removed it from the
> > > host group, but when I try to delete it with qconf, it says it's still part of
> > > 'basic.q'. Here's the relevant output. Anyone have any suggestions?
> > >
> > > [root at chead ~]# qconf -de compute-2-4.local
> > > Host object "compute-2-4.local" is still referenced in cluster queue
> > > "basic.q".
> > >
> > > [root at chead ~]# qconf -sq basic.q
> > > qname                 basic.q
> > > hostlist              @basichosts
> > > seq_no                0
> > > load_thresholds       np_load_avg=1.74
> > > suspend_thresholds    NONE
> > > nsuspend              1
> > > suspend_interval      00:05:00
> > > priority              0
> > > min_cpu_interval      00:05:00
> > > processors            UNDEFINED
> > > qtype                 BATCH
> > > ckpt_list             NONE
> > > pe_list               make mpich mpi orte unihost serial
> > > rerun                 FALSE
> > > slots                 8,[compute-1-2.local=3],[compute-1-0.local=7], \
> > >                       [compute-1-1.local=7],[compute-1-3.local=7], \
> > >                       [compute-1-5.local=8],[compute-1-6.local=8], \
> > >                       [compute-1-7.local=8],[compute-1-8.local=8], \
> > >                       [compute-1-9.local=8],[compute-1-10.local=8], \
> > >                       [compute-1-11.local=8],[compute-1-12.local=8], \
> > >                       [compute-1-13.local=8],[compute-1-14.local=8], \
> > >                       [compute-1-15.local=8]
> > > tmpdir                /tmp
> > > shell                 /bin/bash
> > > prolog                NONE
> > > epilog                NONE
> > > shell_start_mode      posix_compliant
> > > starter_method        NONE
> > > suspend_method        NONE
> > > resume_method         NONE
> > > terminate_method      NONE
> > > notify                00:00:60
> > > owner_list            NONE
> > > user_lists            NONE
> > > xuser_lists           NONE
> > > subordinate_list      NONE
> > > complex_values        NONE
> > > projects              NONE
> > > xprojects             NONE
> > > calendar              NONE
> > > initial_state         default
> > > s_rt                  INFINITY
> > > h_rt                  INFINITY
> > > s_cpu                 INFINITY
> > > h_cpu                 INFINITY
> > > s_fsize               INFINITY
> > > h_fsize               INFINITY
> > > s_data                INFINITY
> > > h_data                INFINITY
> > > s_stack               INFINITY
> > > h_stack               INFINITY
> > > s_core                INFINITY
> > > h_core                INFINITY
> > > s_rss                 INFINITY
> > > h_rss                 INFINITY
> > > s_vmem                19G
> > > h_vmem                19G
> > >
> > > [root at chead ~]# qconf -shgrp @basichosts
> > > group_name @basichosts
> > > hostlist compute-1-0.local compute-1-2.local compute-1-3.local \
> > >          compute-1-5.local compute-1-6.local compute-1-7.local \
> > >          compute-1-8.local compute-1-9.local compute-1-10.local \
> > >          compute-1-11.local compute-1-12.local compute-1-13.local \
> > >          compute-1-14.local compute-1-15.local compute-2-0.local \
> > >          compute-2-2.local compute-2-5.local compute-2-7.local \
> > >          compute-2-8.local compute-2-9.local compute-2-11.local \
> > >          compute-2-12.local compute-2-13.local compute-2-15.local \
> > >          compute-2-6.local
> > >
> > > Thanks
> > >
> > > -M
> > >
> > > _______________________________________________
> > > users mailing list
> > > users at gridengine.org
> > > https://gridengine.org/mailman/listinfo/users
> > >
> >
> >
> >
> > --
> > Best,
> >
> > Feng
> >
>
>
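
For anyone hitting the same "still referenced in cluster queue" message: one
thing that can be checked, besides the hostlist, is whether the host name
still appears anywhere else in the queue configuration, for example in a
per-host override such as the slots entries shown above. The sketch below is
only an illustration, not something proposed in the thread. The function name
host_refs is made up, and it does a plain substring match on the text that
"qconf -sq" prints, so it reads the configuration on stdin and can be tried
without a cluster.

```shell
#!/bin/sh
# Hypothetical helper (not part of Grid Engine): print the name of every
# attribute in a queue configuration whose value mentions the given host.
# Input is the text printed by `qconf -sq <queue>`, read from stdin.
host_refs() {
    awk -v h="$1" '
    {
        # A trailing backslash means the attribute continues on the next line.
        cont = sub(/\\[ \t]*$/, "")
        buf = buf $0
        if (cont) next
        # Plain substring match; pass the full host name (e.g. with .local).
        if (index(buf, h) > 0) {
            split(buf, f, /[ \t]+/)
            print f[1]
        }
        buf = ""
    }'
}

# Intended use on the head node (host and queue names taken from the thread):
#   qconf -sq basic.q | host_refs compute-2-4.local
```

An empty result only means the host name does not appear literally in that
queue's configuration text; it says nothing about what the qmaster has cached.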