[gridengine users] Fwd: Re: Rocks+SGE - execd up, no shepherds or queues

Samir Cury samir at hep.caltech.edu
Mon Jul 8 15:34:31 UTC 2013


Thanks a lot, William! Removing the hosts and re-inserting them recreated
the spool files, and with that the hosts came back to the queue (a rough
sketch of the commands is after the output below):
===================================================

compute-2-4             lx26-amd64      8     -   23.5G       -    4.0G       -
   all.q                BIP   0/0/8         au
compute-3-10            lx26-amd64      8  0.00   23.5G    1.2G    4.0G  196.0K
   all.q                BIP   0/0/8
compute-3-11            lx26-amd64      8  0.03   23.5G  661.5M    4.0G  196.0K
   all.q                BIP   0/0/8
compute-3-12            lx26-amd64      8  0.02   23.5G  748.8M    4.0G  196.0K
   all.q                BIP   0/0/8
compute-3-2             lx26-amd64      8  0.11   23.5G  657.9M    4.0G  196.0K
   all.q                BIP   0/0/8
compute-3-3             lx26-amd64      8  0.06   23.5G    1.3G    4.0G  196.0K
   all.q                BIP   0/0/8
compute-3-4             lx26-amd64      8  0.02   23.5G  599.6M    4.0G   24.6M
   all.q                BIP   0/0/8
compute-3-5             lx26-amd64      8  0.01   23.5G    1.4G    4.0G     0.0
   all.q                BIP   0/0/8
compute-3-6             lx26-amd64     16  0.03   23.5G  728.4M    4.0G  260.0K
   all.q                BIP   0/0/8
compute-3-7             lx26-amd64      8  0.00   23.5G  590.6M    4.0G   39.5M
   all.q                BIP   0/0/8
compute-3-8             lx26-amd64      8  0.04   23.5G  666.6M    4.0G   24.1M
   all.q                BIP   0/0/8
compute-3-9             lx26-amd64      8  0.01   23.5G  635.1M    4.0G  196.0K
   all.q                BIP   0/0/8
compute-30-1            lx26-amd64     80  0.01   62.9G    1.9G    4.0G   38.2M
   all.q                BIP   0/0/40

all.q@compute-3-10.local       BIP   0/0/8          0.00     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-3-11.local       BIP   0/0/8          0.00     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-3-12.local       BIP   0/0/8          0.00     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-3-2.local        BIP   0/0/8          0.04     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-3-3.local        BIP   0/0/8          0.04     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-3-4.local        BIP   0/0/8          0.06     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-3-5.local        BIP   0/0/8          0.01     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-3-6.local        BIP   0/0/8          0.00     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-3-7.local        BIP   0/0/8          0.02     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-3-8.local        BIP   0/0/8          0.03     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-3-9.local        BIP   0/0/8          0.02     lx26-amd64
---------------------------------------------------------------------------------
all.q@compute-30-1.local       BIP   0/0/40         0.02     lx26-amd64
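
For the archives, the remove/re-add cycle can also be done non-interactively
with qconf attribute edits. The exact commands I used weren't captured, so
treat the following as an illustrative sketch only, using compute-3-5.local
as the example host and assuming the hosts come from the @allhosts group
referenced by all.q's hostlist:

# drop the host from the @allhosts hostgroup, then add it straight back,
# so the qmaster rewrites its qinstance file
qconf -dattr hostgroup hostlist compute-3-5.local @allhosts
qconf -aattr hostgroup hostlist compute-3-5.local @allhosts

# check the spool file was recreated and the queue instance is back
ls -l /opt/gridengine/default/spool/qmaster/qinstances/all.q
qstat -f -q all.q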


On Mon, Jul 8, 2013 at 9:14 AM, William Hay <w.hay at ucl.ac.uk> wrote:

>
> Oops, forgot to send to the list...
>
> - -------- Original Message --------
> Subject: Re: [gridengine users] Rocks+SGE - execd up, no shepherds or
> queues
> Date: Mon, 08 Jul 2013 08:12:31 +0100
> From: William Hay <w.hay at ucl.ac.uk>
> To: Samir Cury <samir at hep.caltech.edu>
>
> On 05/07/13 20:07, Samir Cury wrote:
> > Hi William,
> >
> > Thanks for the directions. I tried changing the queue
> > configuration and the host group configuration, both with and
> > without restarting the master and exec nodes, but not much changed.
> >
> > Yes, we're using the spool. Looking closer at it:
> >
> > /opt/gridengine/default/spool/qmaster/qinstances/all.q
> > [root@t3-local all.q]# ll
> > total 68
> > -rw-r--r-- 1 sge sge  223 Jun 16  2012 compute-2-2.local
> > -rw-r--r-- 1 sge sge  223 Jun 16  2012 compute-2-4.local
> > -rw-r--r-- 1 sge sge  225 Oct 15  2012 compute-30-1.local
> > -rw-r--r-- 1 sge sge  224 Jun 16  2012 compute-3-10.local
> > -rw-r--r-- 1 sge sge  224 Jun 16  2012 compute-3-11.local
> > -rw-r--r-- 1 sge sge  224 Jun 16  2012 compute-3-12.local
> > -rw-r--r-- 1 sge sge  227 Sep 27  2012 compute-31-2.local
> > -rw-r--r-- 1 sge sge  223 Jun 16  2012 compute-3-2.local
> > -rw-r--r-- 1 sge sge  223 Nov 20  2012 compute-3-3.local
> > -rw-r--r-- 1 sge sge  223 Jun 16  2012 compute-3-4.local
> > -rw-r--r-- 1 sge sge  223 Jul  5 10:23 compute-3-5.local
> > -rw-r--r-- 1 sge sge  223 Jun 16  2012 compute-3-6.local
> > -rw-r--r-- 1 sge sge  223 Jun 16  2012 compute-3-7.local
> > -rw-r--r-- 1 sge sge  223 Jun 16  2012 compute-3-8.local
> > -rw-r--r-- 1 sge sge  223 Jun 16  2012 compute-3-9.local
> > -rw-r--r-- 1 sge sge 2000 Sep 24  2012 ss
> > -rw-r--r-- 1 sge sge  229 Jun 16  2012 t3-higgs.ext.domain
> >
> > It looks good. The most surprising thing is that the only diff
> > between compute-3-5 (not working) and compute-3-7 (working) is the
> > "version 7" vs "version 5" attribute. Not sure what that is (a file
> > serial number, maybe), but it doesn't look very meaningful, as other
> > hosts have different numbers (up to 12).
> >
> > I tried the obvious: moving the all.q directory to a backup name
> > and restarting the master to see if it would recreate it correctly.
> > Nope. That just left all my hosts missing. However, if I alter the
> > queue "in memory", it recreates an empty "all.q" directory.
> >
> > Something I realized while trying other procedures:
> >
> > [root@t3-local all.q]# qmod -e all.q
> > Queue instance "all.q@compute-3-2.local" is already in the specified state: enabled
> > Queue instance "all.q@compute-2-4.local" is already in the specified state: enabled
> > Queue instance "all.q@t3-higgs.ext.domain" is already in the specified state: enabled
> > Queue instance "all.q@compute-3-7.local" is already in the specified state: enabled
> > Queue instance "all.q@compute-3-8.local" is already in the specified state: enabled
> >
> > Meaning that although the hostgroup @allhosts looks like what we
> > want, qmod is only considering those nodes for some reason.
> >
> > Maybe the question now is: what makes a node get considered by
> > qstat and qmod, and how do we include (or force) the missing ones
> > into that list?
> >
> > To rule out a hostgroup problem, I copied the list from qconf
> > -mhgrp @allhosts directly into all.q's hostlist, but no luck
> > either.
> >
> > Any idea how to actually regenerate the all.q files in the spool?
> > That seems to be the way to go. Summarizing:
> >
> Possibly you could delete the missing hosts from the queue/hostgroup,
> save it, then re-add them?
>
> William
>
>
>
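
In case it helps anyone else chasing the same kind of mismatch between the
configured hostgroup and what qmod/qstat actually report, the two views can
be compared with the standard qconf/qstat commands (generic, nothing
specific to this cluster):

# hosts the configuration says should back the queue
qconf -shgrp @allhosts
qconf -sq all.q | grep hostlist

# execution hosts and queue instances the qmaster actually knows about
qconf -sel
qstat -f -q all.q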

