[gridengine users] can't delete an exec host
Michael Stauffer
mgstauff at gmail.com
Wed Sep 6 17:22:04 UTC 2017
On Wed, Sep 6, 2017 at 12:42 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
> > Am 06.09.2017 um 17:33 schrieb Michael Stauffer <mgstauff at gmail.com>:
> > On Wed, Sep 6, 2017 at 11:16 AM, Feng Zhang <prod.feng at gmail.com> wrote:
> > It seems the SGE master did not get refreshed with the new hostgroup. Maybe you
> > can try:
> > 1. restart SGE master
> > Is it safe to do this with jobs queued and running? I think it's not
> > reliable, i.e. jobs can get killed and de-queued?
> Just to mention that it's safe to restart the qmaster, or even to reboot the
> machine the qmaster is running on. Nothing will happen to the running jobs
> on the exechosts.
OK, good to know. I've done that before and seen the jobs finish, although some
googling suggested people have seen jobs get killed. Does a qmaster
restart empty the queue, though? I imagine a reboot would too, unless the
queue is stored in a file?
-M
> -- Reuti
> > or
> >
> > 2. change basic.q, "hostlist" to any node, like "compute-1-0.local",
> > wait till it gets refreshed; then change it back to "@basichosts".
> >
> > I've done this, but it's not refreshing (it's been about 10 minutes now). I'm
> > still getting the error when I try to delete exec host compute-2-4, and
> > qhost is still showing basic.q on the nodes in @basichosts.
> > Interestingly, host compute-2-4 was removed from another queue
> > (qlogin.basic.q) that also uses @basichosts, so it's something about
> > basic.q that's stuck.
> > Is there some way to refresh things other than restarting qmaster?
> >
> > -M
> > On Wed, Sep 6, 2017 at 10:29 AM, Michael Stauffer <mgstauff at gmail.com> wrote:
> > > SoGE 8.1.8
> > >
> > > Hi,
> > >
> > > I'm having trouble deleting an execution host. I've removed it from the
> > > host group, but when I try to delete it with qconf, it says it's still
> > > part of 'basic.q'. Here's the relevant output. Does anyone have any
> > > suggestions?
> > >
> > > [root@chead ~]# qconf -de compute-2-4.local
> > > Host object "compute-2-4.local" is still referenced in cluster queue
> > > "basic.q".
> > >
> > > [root@chead ~]# qconf -sq basic.q
> > > qname basic.q
> > > hostlist @basichosts
> > > seq_no 0
> > > load_thresholds np_load_avg=1.74
> > > suspend_thresholds NONE
> > > nsuspend 1
> > > suspend_interval 00:05:00
> > > priority 0
> > > min_cpu_interval 00:05:00
> > > processors UNDEFINED
> > > qtype BATCH
> > > ckpt_list NONE
> > > pe_list make mpich mpi orte unihost serial
> > > rerun FALSE
> > > slots 8,[compute-1-2.local=3],[compute-1-0.local=7], \
> > > [compute-1-1.local=7],[compute-1-3.local=7], \
> > > [compute-1-5.local=8],[compute-1-6.local=8], \
> > > [compute-1-7.local=8],[compute-1-8.local=8], \
> > > [compute-1-9.local=8],[compute-1-10.local=8], \
> > > [compute-1-11.local=8],[compute-1-12.local=8], \
> > > [compute-1-13.local=8],[compute-1-14.local=8], \
> > > [compute-1-15.local=8]
> > > tmpdir /tmp
> > > shell /bin/bash
> > > prolog NONE
> > > epilog NONE
> > > shell_start_mode posix_compliant
> > > starter_method NONE
> > > suspend_method NONE
> > > resume_method NONE
> > > terminate_method NONE
> > > notify 00:00:60
> > > owner_list NONE
> > > user_lists NONE
> > > xuser_lists NONE
> > > subordinate_list NONE
> > > complex_values NONE
> > > projects NONE
> > > xprojects NONE
> > > calendar NONE
> > > initial_state default
> > > s_rt INFINITY
> > > h_rt INFINITY
> > > s_cpu INFINITY
> > > h_cpu INFINITY
> > > s_fsize INFINITY
> > > h_fsize INFINITY
> > > s_data INFINITY
> > > h_data INFINITY
> > > s_stack INFINITY
> > > h_stack INFINITY
> > > s_core INFINITY
> > > h_core INFINITY
> > > s_rss INFINITY
> > > h_rss INFINITY
> > > s_vmem 19G
> > > h_vmem 19G
> > > [root@chead ~]# qconf -shgrp @basichosts
> > > group_name @basichosts
> > > hostlist compute-1-0.local compute-1-2.local compute-1-3.local \
> > > compute-1-5.local compute-1-6.local compute-1-7.local \
> > > compute-1-8.local compute-1-9.local compute-1-10.local \
> > > compute-1-11.local compute-1-12.local compute-1-13.local \
> > > compute-1-14.local compute-1-15.local compute-2-0.local \
> > > compute-2-2.local compute-2-5.local compute-2-7.local \
> > > compute-2-8.local compute-2-9.local compute-2-11.local \
> > > compute-2-12.local compute-2-13.local compute-2-15.local \
> > > compute-2-6.local
> > > Thanks
> > >
> > > -M
> > --
> > Best,
> >
> > Feng
> >
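
For anyone comparing the `qconf -sq basic.q` and `qconf -shgrp @basichosts` output by eye, as in the original post above, the comparison can be sketched in Python. This is only a rough sketch: it assumes the output has been saved as plain text without the `> ` quote markers and that it follows the layout shown in this thread (attribute name, whitespace, value, with backslash line continuations). The function names are made up for illustration and are not part of any SGE tool. It lists hosts that a queue names in per-host overrides such as `[host=value]` but that are not members of the host group. It does not explain or fix the "still referenced" error.

```python
import re

def join_continuations(text):
    """Merge lines ending in a backslash with the line that follows."""
    return re.sub(r"\\\s*\n\s*", " ", text)

def parse_config(text):
    """Return {attribute: value} from 'qconf -sq' / 'qconf -shgrp' style text."""
    config = {}
    for line in join_continuations(text).splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            config[parts[0]] = parts[1].strip()
    return config

def override_hosts(value):
    """Host names used in per-host overrides such as '8,[node1=3],[node2=7]'."""
    return set(re.findall(r"\[([^=\]\s]+)=", value))

def referenced_hosts(queue_config):
    """Map each host named in an override to the attributes that name it."""
    refs = {}
    for attr, value in queue_config.items():
        for host in override_hosts(value):
            refs.setdefault(host, []).append(attr)
    return refs

# Shortened sample input in the same layout as the output quoted in this thread.
queue_text = """qname basic.q
hostlist @basichosts
slots 8,[compute-1-2.local=3],[compute-1-0.local=7], \\
[compute-1-1.local=7]
"""
group_text = """group_name @basichosts
hostlist compute-1-0.local compute-1-2.local \\
compute-2-6.local
"""

queue = parse_config(queue_text)
group = parse_config(group_text)
members = set(group["hostlist"].split())
refs = referenced_hosts(queue)

# Hosts the queue still names explicitly although the group no longer lists them.
stale = {host: attrs for host, attrs in refs.items() if host not in members}
print(stale)
```

With the shortened sample, the only host reported is compute-1-1.local, which the `slots` line names but the group's hostlist does not. To check a real cluster, replace the sample strings with the saved command output.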
