[gridengine users] Rocks+SGE - execd up, no shepherds or queues

William Hay w.hay at ucl.ac.uk
Fri Jul 5 15:38:26 UTC 2013


On 05/07/13 16:11, Samir Cury wrote:
> Hi William,
> 
> Thanks for the comments. They helped me find problems and clean up
> the system a bit. I realized that all the -31- nodes were deprecated
> long ago and were just hanging around as orphans. We also had jobs
> stuck on those entities, which I removed.
> 
> One funny behavior I noticed: even if I issue qconf -de for those
> nodes, it only takes effect everywhere once I restart the master.
> (compute-2-4 is down, and this is fine.)
> 
> Still, the main problem might be unrelated. I have hosts that
> appear in qhost -q [1], but although their daemons are running
> fine, they don't show up in qstat -f [2], nor do they seem to
> serve slots to any queue, even though they appear everywhere in
> the configuration. I will share a bit of it here:
> 
> ------------- qconf -sq all.q
> qname                 all.q
> hostlist              @allhosts
> slots                 1,[compute-2-4.local=8],[compute-3-2.local=8], \
>                       [compute-3-3.local=8],[compute-3-4.local=8], \
>                       [compute-3-6.local=8],[compute-3-5.local=8], \
>                       [compute-3-7.local=8],[compute-3-8.local=8], \
>                       [compute-3-9.local=8],[compute-3-10.local=8], \
>                       [compute-3-12.local=8],[compute-3-11.local=8], \
>                       [t3-higgs.ext.domain=4],[compute-30-1.local=40]
> 
> ------------- qconf -mhgrp @allhosts
> group_name @allhosts
> hostlist t3-higgs.ultralight.org compute-3-7.local compute-2-4.local \
>          compute-3-3.local compute-3-4.local compute-3-6.local \
>          compute-3-8.local compute-3-9.local compute-3-10.local \
>          compute-3-11.local compute-3-12.local compute-3-2.local \
>          compute-2-4.local compute-30-1.local compute-3-5.local
> 
> I think it just comes back to the FUTEX timeout; that seems to be
> the only difference I've seen between a working and a non-working
> node. Let me know if you have clues about what else to check.
> Network settings seem to be the same on working and non-working
> nodes.
> 
I suspect this is a red herring - an effect rather than a cause.
> Thanks, Samir
> 
Your qhost -q output suggests grid engine doesn't think there are
queue instances on most of those hosts.  For most purposes cluster
queues are just a way of creating queue instances en masse.  If the
two get out of sync, it is the queue instances that count.

If you are using classic spooling, have a look in
$SGE_ROOT/$SGE_CELL/spool/qinstances/all.q to see whether there are
files named after the nodes there.

If they're missing, try making a dummy change to the cluster queue and
the hostgroup to force re-creation of the queue instances.

If you're not using classic spooling this may be different, but there
is probably a similar database of queue instances somewhere in the
spool.
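To make that check concrete, here is a minimal sketch of the comparison. The host lists are inlined sample data taken from this thread; on a live master (classic spooling assumed) they would come from `qconf -shgrp_resolved @allhosts` and from listing the spool directory named above.

```shell
# Minimal sketch: find hosts configured in the hostgroup that have no
# spooled queue instance. Inlined sample data stands in for live output;
# on a real master the two lists would come from:
#   qconf -shgrp_resolved @allhosts                  (configured hosts)
#   ls $SGE_ROOT/$SGE_CELL/spool/qinstances/all.q    (spooled instances)
configured="compute-3-2.local
compute-3-7.local
compute-3-11.local"

spooled="compute-3-2.local
compute-3-7.local"

# Lines present only in the first list are the missing queue instances:
comm -23 <(sort <<<"$configured") <(sort <<<"$spooled")
```

With this sample data the only host reported is compute-3-11.local, i.e. a host the hostgroup knows about but the spool does not.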

William

> 
> [1] : [root@compute-3-5 ~]# qhost -q
> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> -------------------------------------------------------------------------------
> global                  -               -     -       -       -       -       -
> compute-2-2             lx26-amd64      8     -   23.5G       -    4.0G       -
> compute-2-4             lx26-amd64      8     -   23.5G       -    4.0G       -
>    all.q                BIP   0/0/8         au
> compute-3-10            lx26-amd64      8  0.03   23.5G  847.6M    4.0G  196.0K
> compute-3-11            lx26-amd64      8  0.04   23.5G  742.7M    4.0G  196.0K
> compute-3-12            lx26-amd64      8  0.00   23.5G    1.0G    4.0G  196.0K
> compute-3-2             lx26-amd64      8  0.06   23.5G  821.3M    4.0G  196.0K
>    all.q                BIP   0/0/8
> compute-3-3             lx26-amd64      8  0.00   23.5G  927.4M    4.0G  196.0K
> compute-3-4             lx26-amd64      8  0.00   23.5G  617.4M    4.0G   24.6M
> compute-3-5             lx26-amd64      8  0.10   23.5G    1.4G    4.0G     0.0
> compute-3-6             lx26-amd64     16  0.17   23.5G  869.3M    4.0G  260.0K
> compute-3-7             lx26-amd64      8  0.00   23.5G  741.6M    4.0G   39.5M
>    all.q                BIP   0/0/8
> compute-3-8             lx26-amd64      8  0.00   23.5G  668.8M    4.0G   24.1M
>    all.q                BIP   0/0/8
> compute-3-9             lx26-amd64      8  0.02   23.5G  670.4M    4.0G  196.0K
> compute-30-1            lx26-amd64     80  0.04   62.9G    1.7G    4.0G   38.2M
> t3-higgs                lx26-amd64      8  0.00   23.5G    1.3G    4.0G    4.5M
>    all.q                BIP   0/0/4
> 
> [2] : [root@compute-3-5 ~]# qstat -f
> queuename                      qtype resv/used/tot. load_avg arch          states
> ---------------------------------------------------------------------------------
> all.q@compute-2-4.local        BIP   0/0/8          -NA-     lx26-amd64    au
> ---------------------------------------------------------------------------------
> all.q@compute-3-2.local        BIP   0/0/8          0.05     lx26-amd64
> ---------------------------------------------------------------------------------
> all.q@compute-3-7.local        BIP   0/0/8          0.00     lx26-amd64
> ---------------------------------------------------------------------------------
> all.q@compute-3-8.local        BIP   0/0/8          0.00     lx26-amd64
> ---------------------------------------------------------------------------------
> all.q@t3-higgs.ext.domain      BIP   0/0/4          0.00     lx26-amd64
> 
> On Wed, Jul 3, 2013 at 9:50 AM, William Hay <w.hay at ucl.ac.uk>
> wrote:
> On Tue, 2013-07-02 at 13:41 +0000, Samir Cury wrote:
>> Dear all,
>> 
>> Our setup is the SGE that comes in a Rocks Roll: in principle an
>> already automated, out-of-the-box process that deploys it on the
>> headnode and compute nodes with their respective roles.
>> 
>> Since our headnode's motherboard was replaced (which in principle
>> only changes the MAC addresses of eth0 and eth1), we have been
>> facing some problems with our SGE setup. I'd like to share the
>> tests we have done so far and, if possible, get some advice on
>> what other tests could help find the problem.
> 
>> [root@t3-local ~]# qstat -f
>> queuename                      qtype resv/used/tot. load_avg arch          states
>> ---------------------------------------------------------------------------------
>> all.q@compute-2-4.local        BIP   0/0/8          -NA-     lx26-amd64    au
>> ---------------------------------------------------------------------------------
>> all.q@compute-3-2.local        BIP   0/8/8          0.05     lx26-amd64
>> ---------------------------------------------------------------------------------
>> all.q@compute-3-7.local        BIP   0/8/8          0.09     lx26-amd64
>> ---------------------------------------------------------------------------------
>> all.q@compute-3-8.local        BIP   0/8/8          0.05     lx26-amd64
>> ---------------------------------------------------------------------------------
>> all.q@compute-31-1.local       BIP   0/16/1         -NA-     lx26-amd64    auo
>> ---------------------------------------------------------------------------------
>> all.q@compute-31-3.local       BIP   0/16/1         -NA-     lx26-amd64    auo
>> ---------------------------------------------------------------------------------
>> all.q@compute-31-4.local       BIP   0/16/1         -NA-     lx26-amd64    auo
>> ---------------------------------------------------------------------------------
>> all.q@t3-higgs.ext.domain      BIP   0/0/4          0.09     lx26-amd64
>> 
>> 
> The queue instances with 'o' in their state field are not
> configured to exist as far as grid engine is concerned and are
> merely being retained until the last job running in them finishes.
> This is probably not what you want.
> 
> I've seen occasions in the past where the queue instances don't
> match up with what is configured in the cluster queue.
> 
> The problem may have manifested now because you've turned off the
> qmaster for the first time in (presumably) a long while, and the
> on-disk config doesn't quite match what was in memory before the
> outage.
> 
> If this is the case, you could possibly get them reconfigured by
> issuing qconf -mq all.q, making a trivial change (IIRC adding a
> space at the end of a line is sufficient), and saving.
> 
> It may not help, but it shouldn't hurt.
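> For the record, the "trivial change" can be scripted: qconf -mq
> opens the queue config in $EDITOR, so a non-interactive editor that
> touches one line is enough. A sketch, demonstrated on a throwaway
> sample file (the real qconf invocation, left commented out, is an
> assumption to verify on your master first):

```shell
# Sketch of a scripted no-op-looking edit. qconf -mq runs $EDITOR on a
# temporary copy of the queue config; appending a space to a line counts
# as a modification and makes the qmaster rewrite the queue. Shown here
# on a sample file instead of the live config:
printf 'qname all.q\nhostlist @allhosts\n' > /tmp/all.q.conf
sed -i -e '1s/$/ /' /tmp/all.q.conf   # append one space to line 1
head -1 /tmp/all.q.conf | cat -A      # trailing space is visible before '$'
# Live-master equivalent (assumption, untested here):
#   EDITOR="sed -i -e '1s/\$/ /'" qconf -mq all.q
```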
> 
> If the queues don't lose at least the 'o' state, then examine the
> output of qconf -sq all.q | grep '^hostlist' to see whether the
> cluster queue says they should be there.
> 
> Also check qconf -sq all.q | grep '^slots', as you appear to have
> more slots in use than you have configured.
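> A quick way to eyeball the per-host counts is to decode the
> bracketed slots attribute. A sketch, using a string abridged from
> the config quoted earlier in this thread (on the live master the
> input would be piped from qconf -sq all.q | grep '^slots'):

```shell
# Sketch: turn the cluster queue's "slots" attribute into a host/count
# table. The string is abridged sample data from this thread; live input
# would come from: qconf -sq all.q | grep '^slots'
slots='1,[compute-3-2.local=8],[compute-30-1.local=40],[t3-higgs.ext.domain=4]'
# Split on commas, then pull "host count" pairs out of the brackets
# (the bare leading "1" is the default and produces no output line):
echo "$slots" | tr ',' '\n' | sed -n 's/^\[\(.*\)=\([0-9]*\)\]$/\1 \2/p'
```

> That prints one "host count" line per bracketed entry, which is
> easy to diff against what qstat -f claims is in use.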
> 
> compute-2-4.local is something else though (maybe just sge_execd
> down).
> 
> 
> William
> 
> 
