[gridengine users] can't delete an exec host

Reuti reuti at staff.uni-marburg.de
Wed Sep 6 18:12:25 UTC 2017


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


On 06.09.2017 at 19:22, Michael Stauffer wrote:

> On Wed, Sep 6, 2017 at 12:42 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
> 
> > On 06.09.2017 at 17:33, Michael Stauffer <mgstauff at gmail.com> wrote:
> >
> > On Wed, Sep 6, 2017 at 11:16 AM, Feng Zhang <prod.feng at gmail.com> wrote:
> > It seems SGE master did not get refreshed with new hostgroup. Maybe you can try:
> >
> > 1. restart SGE master
> >
> > Is it safe to do this with jobs queued and running? I think it's not reliable, i.e. jobs can get killed and de-queued?
> 
> Just to mention that it's safe to restart the qmaster, or even to reboot the machine the qmaster is running on. Nothing will happen to the running jobs on the exechosts.
> 
> OK good to know. I've done that before and seen them finish, although some googling suggested people have seen jobs get killed.

No, restarting the qmaster does not kill running jobs.

NB: They will get killed, though, if you shut down the "sgeexecd" on an exechost with the conventional "stop" argument. Supplying the argument "softstop" instead allows them to continue, although they are then no longer supervised by the "sgeexecd". This can be handy when a user specified an h_rt that is too short for the job and the job needs to be allowed to keep running.
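
As a rough sketch of the difference between the two arguments: the wrapper below only prints the command unless RUN=1 is set. It assumes the startup script sits in the usual place, $SGE_ROOT/$SGE_CELL/common/sgeexecd; the /opt/sge and "default" fallbacks are placeholders, and the function name execd_shutdown is made up for illustration. Adjust for your installation.

```shell
#!/bin/sh
# Dry-run wrapper around the execd startup script (sketch, not part of SGE).
# Assumption: the script lives at $SGE_ROOT/$SGE_CELL/common/sgeexecd.
execd_shutdown() {
    mode="$1"
    # Placeholder fallbacks; a real installation normally exports both.
    script="${SGE_ROOT:-/opt/sge}/${SGE_CELL:-default}/common/sgeexecd"

    case "$mode" in
        stop|softstop) ;;   # "stop" kills the jobs, "softstop" leaves them running
        *)
            echo "usage: execd_shutdown stop|softstop" >&2
            return 2
            ;;
    esac

    if [ "${RUN:-0}" = "1" ]; then
        "$script" "$mode"          # really shut down the execd on this host
    else
        echo "$script $mode"       # default: only show what would be run
    fi
}
```

Calling `execd_shutdown softstop` on the exechost prints the command first; rerun it with RUN=1 once it looks right.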


> Does a qmaster restart, however, empty the queue?

No.


>  I imagine a reboot would too, unless the queue is stored in a file?

All vital information is stored in flat files or BDB. The only thing that is lost is the list of completed jobs, aka zombies (which you can see with `qstat -s z`; how many of them are kept is set by the "finished_jobs" entry in `qconf -mconf`).
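
To check this on your own cluster, something along these lines might do. It is only a sketch: it assumes that `qconf -sconf` prints the global configuration as "name value" lines, and the helper name finished_jobs_value is made up for illustration.

```shell
#!/bin/sh
# Extract the finished_jobs value from configuration text on stdin.
# Assumption: input is in the "name value" layout printed by `qconf -sconf`.
finished_jobs_value() {
    awk '$1 == "finished_jobs" { print $2 }'
}

# Only touch the cluster if the SGE client commands are available.
if command -v qconf >/dev/null 2>&1; then
    qstat -s z                          # finished jobs the qmaster still remembers
    qconf -sconf | finished_jobs_value  # how many of them are kept
fi
```

The list shown by `qstat -s z` will be empty right after a qmaster restart, which is the only visible loss.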

- -- Reuti


> 
> -M
>  
> 
> -- Reuti
> 
> 
> > or
> >
> > 2. change basic.q, "hostlist" to any node, like "compute-1-0.local",
> > wait till it gets refreshed; then change it back to "@basichosts".
> >
> > I've done this, but it's not refreshing (been about 10 minutes now). I'm still getting the error when I try to delete exec host compute-2-4, and qhost is still showing basic.q on the nodes in @basichosts.
> >
> > Interestingly, host compute-2-4 was removed from another queue (qlogin.basic.q) that also uses @basichosts, so it's something about basic.q that's stuck.
> >
> > Is there some way to refresh things other than restarting qmaster?
> >
> > -M
> >
> >
> >
> >
> >
> > On Wed, Sep 6, 2017 at 10:29 AM, Michael Stauffer <mgstauff at gmail.com> wrote:
> > > SoGE 8.1.8
> > >
> > > Hi,
> > >
> > > I'm having trouble deleting an execution host. I've removed it from the
> > > host group, but when I try to delete it with qconf, it says it's still part of
> > > 'basic.q'. Here's the relevant output. Does anyone have any suggestions?
> > >
> > > [root at chead ~]# qconf -de compute-2-4.local
> > > Host object "compute-2-4.local" is still referenced in cluster queue
> > > "basic.q".
> > >
> > > [root at chead ~]# qconf -sq basic.q
> > > qname                 basic.q
> > > hostlist              @basichosts
> > > seq_no                0
> > > load_thresholds       np_load_avg=1.74
> > > suspend_thresholds    NONE
> > > nsuspend              1
> > > suspend_interval      00:05:00
> > > priority              0
> > > min_cpu_interval      00:05:00
> > > processors            UNDEFINED
> > > qtype                 BATCH
> > > ckpt_list             NONE
> > > pe_list               make mpich mpi orte unihost serial
> > > rerun                 FALSE
> > > slots                 8,[compute-1-2.local=3],[compute-1-0.local=7], \
> > >                       [compute-1-1.local=7],[compute-1-3.local=7], \
> > >                       [compute-1-5.local=8],[compute-1-6.local=8], \
> > >                       [compute-1-7.local=8],[compute-1-8.local=8], \
> > >                       [compute-1-9.local=8],[compute-1-10.local=8], \
> > >                       [compute-1-11.local=8],[compute-1-12.local=8], \
> > >                       [compute-1-13.local=8],[compute-1-14.local=8], \
> > >                       [compute-1-15.local=8]
> > > tmpdir                /tmp
> > > shell                 /bin/bash
> > > prolog                NONE
> > > epilog                NONE
> > > shell_start_mode      posix_compliant
> > > starter_method        NONE
> > > suspend_method        NONE
> > > resume_method         NONE
> > > terminate_method      NONE
> > > notify                00:00:60
> > > owner_list            NONE
> > > user_lists            NONE
> > > xuser_lists           NONE
> > > subordinate_list      NONE
> > > complex_values        NONE
> > > projects              NONE
> > > xprojects             NONE
> > > calendar              NONE
> > > initial_state         default
> > > s_rt                  INFINITY
> > > h_rt                  INFINITY
> > > s_cpu                 INFINITY
> > > h_cpu                 INFINITY
> > > s_fsize               INFINITY
> > > h_fsize               INFINITY
> > > s_data                INFINITY
> > > h_data                INFINITY
> > > s_stack               INFINITY
> > > h_stack               INFINITY
> > > s_core                INFINITY
> > > h_core                INFINITY
> > > s_rss                 INFINITY
> > > h_rss                 INFINITY
> > > s_vmem                19G
> > > h_vmem                19G
> > >
> > > [root at chead ~]# qconf -shgrp @basichosts
> > > group_name @basichosts
> > > hostlist compute-1-0.local compute-1-2.local compute-1-3.local \
> > >          compute-1-5.local compute-1-6.local compute-1-7.local \
> > >          compute-1-8.local compute-1-9.local compute-1-10.local \
> > >          compute-1-11.local compute-1-12.local compute-1-13.local \
> > >          compute-1-14.local compute-1-15.local compute-2-0.local \
> > >          compute-2-2.local compute-2-5.local compute-2-7.local \
> > >          compute-2-8.local compute-2-9.local compute-2-11.local \
> > >          compute-2-12.local compute-2-13.local compute-2-15.local \
> > >          compute-2-6.local
> > >
> > > Thanks
> > >
> > > -M
> > >
> > > _______________________________________________
> > > users mailing list
> > > users at gridengine.org
> > > https://gridengine.org/mailman/listinfo/users
> > >
> >
> >
> >
> > --
> > Best,
> >
> > Feng
> >
> 
> 

-----BEGIN PGP SIGNATURE-----
Comment: GPGTools - https://gpgtools.org

iEYEARECAAYFAlmwOooACgkQo/GbGkBRnRo0eACgjv4C/9Jm9aJedEkFPVtwXRuo
c7gAmgPcf27XTgd8SnjKMh2Hhz4gl5P2
=Tbbi
-----END PGP SIGNATURE-----



