[gridengine users] Commlib Problem with GDI Communication from and to qmaster

A. Podstawka adam.podstawka at dsmz.de
Thu Jul 18 08:51:15 UTC 2013


Hi,

I have a freshly installed cluster. Compiling the source and building the
Debian package (which I am using under Ubuntu Server) went well. The qmaster
is running, and the local sge_execd can apparently communicate with the
qmaster, but the execds on the other nodes cannot.

I ran the inst_sge -x command on all nodes, and everything went
fine except for one thing: starting sge_execd gives me an error message:
Starting execution daemon. Please wait ...
ERROR: failed receiving gdi request response for mid=1 (got syncron message receive timeout error).
    starting sge_execd


It seems that something is going wrong with this "GDI" communication
protocol, but I understand very little about it.

Here is some output from qmaster after raising the debug level:
   5229   5279  listener000     commlib error: got read error (closing "erwin2.bioinfodsmz.local/execd/1")
   5230   5279 event_master     EVENT UPDATE FUNCTION event_update_func() HAS BEEN TRIGGERED
   5231   5279 event_master     processing event master request: 4
   5232   5279 scheduler000     passed cancelation point
   5233   5279 event_master     processing event master request: 1
   5234   5279     timer000     te_scan_table_and_deliver: event (t:16 w:1374134561 m:2 s:security-event)
   5235   5279     timer000     te_scan_table_and_deliver: reccuring event (t:16 w:1374134570 m:2 s:security-event)
   5236   5279     timer000     te_add_event: (t:16 w:1374134570 m:2 s:security-event)
   5237   5279     timer000     te_wait_next: time:1374134560 next:1374134565 --> will wait
   5238   5279  listener001
   5239   5279  listener000     commlib error: got read error (closing "erwin2.bioinfodsmz.local/execd/1")
   5240   5279     timer000     te_scan_table_and_deliver: event (t:9 w:1374134565 m:1 s:(null))
   5241   5279     timer000     te_add_event: (t:9 w:1374134579 m:1 s:(null))
   5242   5279     timer000     te_wait_next: time:1374134564 next:1374134565 --> will wait
   5243   5279     timer000     te_scan_table_and_deliver: event (t:18 w:1374134565 m:1 s:(null))
   5244   5279     timer000     te_add_event: (t:18 w:1374134579 m:1 s:(null))
   5245   5279     timer000     te_wait_next: time:1374134564 next:1374134565 --> will wait
   5246   5279     timer000     te_scan_table_and_deliver: event (t:12 w:1374134565 m:2 s:load-value-cleanup)
   5247   5279     timer000     te_scan_table_and_deliver: reccuring event (t:12 w:1374134579 m:2 s:load-value-cleanup)
   5248   5279     timer000     te_add_event: (t:12 w:1374134579 m:2 s:load-value-cleanup)
   5249   5279     timer000     te_wait_next: time:1374134564 next:1374134565 --> will wait
   5250   5279     timer000     te_scan_table_and_deliver: event (t:19 w:1374134565 m:2 s:ar_id_changed)
   5251   5279     timer000     te_scan_table_and_deliver: reccuring event (t:19 w:1374134579 m:2 s:ar_id_changed)
   5252   5279     timer000     te_add_event: (t:19 w:1374134579 m:2 s:ar_id_changed)
   5253   5279     timer000     te_wait_next: time:1374134564 next:1374134566 --> will wait
   5254   5279     timer000     te_scan_table_and_deliver: event (t:17 w:1374134566 m:2 s:job_number_changed)
   5255   5279     timer000     te_scan_table_and_deliver: reccuring event (t:17 w:1374134580 m:2 s:job_number_changed)
   5256   5279     timer000     te_add_event: (t:17 w:1374134580 m:2 s:job_number_changed)
   5257   5279     timer000     te_wait_next: time:1374134565 next:1374134570 --> will wait
   5258   5279  listener001
   5259   5279  listener000     commlib error: got read error (closing "erwin2.bioinfodsmz.local/execd/1")
   5260   5279 scheduler000     pthread_cond_timedwait for events failed 110
   5260   5279     timer000     te_scan_table_and_deliver: event (t:16 w:1374134570 m:2 s:security-event)
   5262   5279     timer000     te_scan_table_and_deliver: reccuring event (t:16 w:1374134579 m:2 s:security-event)
   5263   5279     timer000     te_add_event: (t:16 w:1374134579 m:2 s:security-event)
   5264   5279     timer000     te_wait_next: time:1374134569 next:1374134574 --> will wait
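
Most of the trace above is routine timer activity, so the commlib read errors are easy to miss. The following is a minimal sketch for pulling them out of a saved copy of the trace. It assumes each entry has the layout "sequence number, process id, thread name, message", which is inferred from the output above and not taken from Grid Engine documentation. The function name commlib_read_errors is hypothetical.

```python
import re
from collections import Counter

# Assumed entry layout (inferred from the trace, not documented):
# sequence number, process id, thread name, message.
COMMLIB_READ_ERROR = re.compile(
    r'\b(\d+) (\d+) (\w+) commlib error: got read error \(closing "([^"]+)"\)'
)


def commlib_read_errors(trace):
    """Return (sequence, thread, endpoint) for each commlib read error."""
    # Collapse all whitespace so entries wrapped or fused by the mail
    # client are still matched.
    flat = " ".join(trace.split())
    return [
        (int(seq), thread, endpoint)
        for seq, _pid, thread, endpoint in COMMLIB_READ_ERROR.findall(flat)
    ]


if __name__ == "__main__":
    import sys

    # Read a saved trace from standard input and count errors per endpoint.
    errors = commlib_read_errors(sys.stdin.read())
    for endpoint, count in Counter(e[2] for e in errors).items():
        print(count, endpoint)
```

Run against a saved trace, it prints the number of read errors for each endpoint string. This only summarizes what the qmaster reported and does not explain why the connection was closed.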


I really need help with this, since the network connection itself is working well.

All machines can reach each other by IP address AND by hostname, including
the configured domain name.
The ports accept connections when I run telnet erwin2.bioinfodsmz.local 6445
and telnet erwin.bioinfodsmz.local 6444.

erwin is the qmaster; erwin2 through erwin8 are the execd nodes.
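
The manual telnet and hostname checks described above can be repeated on every node with a short script. This is a minimal sketch, not a diagnosis: it only tests that a TCP connection can be opened and that a hostname resolves to an address and back to a name. The host names and ports are the ones from this message. The function names port_open and resolution_roundtrip are hypothetical.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def resolution_roundtrip(host):
    """Resolve host to an IPv4 address, then resolve that address to a name."""
    addr = socket.gethostbyname(host)
    name, _aliases, _addrs = socket.gethostbyaddr(addr)
    return addr, name


if __name__ == "__main__":
    # Host/port pairs taken from the telnet tests in this message.
    checks = [
        ("erwin.bioinfodsmz.local", 6444),
        ("erwin2.bioinfodsmz.local", 6445),
    ]
    for host, port in checks:
        state = "open" if port_open(host, port) else "unreachable"
        print(host, port, state)
        try:
            print("  resolves as", resolution_roundtrip(host))
        except OSError as exc:
            print("  resolution failed:", exc)
```

Running it on the qmaster and on each execd node shows whether every machine sees the same address and name for the others, which a telnet test from a single machine does not show.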

Please help me.

Thanks
Adam

-- 
Adam Podstawka
Leibniz-Institut DSMZ-Deutsche Sammlung von Mikro-
organismen und Zellkulturen GmbH
http://www.dsmz.de

Director: Prof. Dr. Jörg Overmann
Local court: Braunschweig HRB 2570
Chairman of the supervisory board: MR Dr. Axel Kollatschny

DSMZ - A member of the Leibniz Association (WGL)



