[gridengine users] execd errors on SLES 11.2 / 11.3

Sve N goretoffel at hotmail.de
Mon Sep 22 10:11:13 UTC 2014

Dear gridengine users,
some months ago I wrote about this problem. I have experimented a bit more since then and now at least have a workaround:

When I submit jobs to my SLES 11.3 execd hosts, the first job runs fine. If some time (ranging from under one second to several seconds) has passed and another job is submitted (regardless of whether the first job is still running), the queue goes into the error state: the first job finishes correctly, but no new jobs are accepted. Just clearing the error does not help; the execd daemon has to be restarted, too.

I have some SLES 11.1 machines where this error does not occur. I use the same configuration for both systems, so I don't think the configuration is the cause. The error occurs on freshly updated machines (11.1->11.3) as well as on newly installed ones (11.3, and also 11.2, which I tried once). I normally use the precompiled packages, but I also tried a self-compiled version (compiled directly on an 11.3 machine) and the latest version of "son of gridengine", which shows the same error. As our sgemaster runs an older ubuntu version (which has worked fine for several years), I also tried a master on a new ubuntu machine, which didn't help either.

If someone thinks it helps, I can provide some strace output, but it is a bit long, so I won't post it here. The most important lines are probably:

09/15/2014 13:10:10|  main|host-05|E|shepherd of job 821.1 died through signal = 11
09/15/2014 13:10:10|  main|host-05|E|abnormal termination of shepherd for job 821.1: no "exit_status" file
09/15/2014 13:10:10|  main|host-05|E|can't open file active_jobs/821.1/error: Datei oder Verzeichnis nicht gefunden
09/15/2014 13:10:10|  main|host-05|E|can't open pid file "active_jobs/821.1/pid" for job 821.1
(sometimes the signal is 6; the German text in the third line means "No such file or directory")
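To see how often the shepherd dies and with which signal, the messages lines can be tallied. The following is only a sketch: it assumes the pipe-separated layout of the lines quoted above (severity in the fourth field, message text in the fifth), and the function name count_shepherd_signals is made up for this example. It reads the files given as arguments, or stdin if there are none.

```shell
# Summarise "shepherd ... died through signal" errors from execd messages lines.
# Reads the named files, or stdin when no file is given.
count_shepherd_signals() {
    awk -F'|' '
        # Field 4 is the severity, field 5 the message text.
        $4 == "E" && $5 ~ /^shepherd of job .* died through signal = [0-9]+ *$/ {
            n = split($5, words, " ")   # the last word is the signal number
            count[words[n]]++
        }
        END {
            for (sig in count) printf "signal %s: %d\n", sig, count[sig]
        }
    ' "$@" | sort
}

# Demo with two of the lines quoted above; only the first one matches.
count_shepherd_signals <<'EOF'
09/15/2014 13:10:10|  main|host-05|E|shepherd of job 821.1 died through signal = 11
09/15/2014 13:10:10|  main|host-05|E|abnormal termination of shepherd for job 821.1: no "exit_status" file
EOF
```

For the two demo lines this prints "signal 11: 1". Pointed at the real messages file of an execd host, it would show whether signal 6 and signal 11 occur about equally often or whether one dominates.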

Sep 15 13:10:09 host-05 kernel: [18105718.966831] sge_execd[5295]: segfault at 7ffea8000000 ip 00007ffea93f44f9 sp 00007fff2ac2e2d0 error 4 in libc-2.11.3.so[7ffea937c000+16f000]
(I tried copying the old libc-2.11.3.so, and linking it for sge, but it didn't work.)
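The kernel line gives enough to work out where inside libc the crash happened: the instruction pointer (ip) minus the start of the mapping (the first number in the square brackets) is the offset into the library. As far as I know, "error 4" on x86 stands for a read from user mode on a page that is not mapped. A small sketch of the subtraction follows; the function name fault_offset is made up, and the addr2line call in the comment is only a suggestion that assumes matching debug symbols are installed.

```shell
# Offset of the faulting instruction inside the mapped library.
# $1 = ip value, $2 = mapping base, both as hex digits printed by the kernel.
fault_offset() {
    printf '0x%x\n' "$(( 0x$1 - 0x$2 ))"
}

fault_offset 00007ffea93f44f9 7ffea937c000   # prints 0x784f9

# With matching debug symbols, the offset could then be looked up, e.g.:
#   addr2line -f -e /path/to/libc-2.11.3.so 0x784f9
```

The result, 0x784f9, lies inside the mapping size of 0x16f000 given in the log line, so the numbers are at least consistent with each other.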

A workaround I found while searching for the error is to run execd under valgrind.
(Just install valgrind and change

exec 1>/dev/null 2>&1
$bin_dir/sge_execd

to

exec 1>/dev/null 2>&1
valgrindpath/valgrind $bin_dir/sge_execd

at around line 347 in the startup-script.)
If you keep the output, valgrind reports a lot; I don't know whether all of it is connected to my error. The error itself no longer shows up, though: running under valgrind somehow seems to fix it. I don't know whether this is a bug in GE or in SLES; maybe I should post on a SLES board, too.
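Since the startup script sends stdout and stderr to /dev/null, valgrind's report is lost unless it is written to a file. One way to keep it, and to make the wrapper easy to switch off again, might look like the sketch below. VALGRIND_BIN, VALGRIND_LOG and execd_cmdline are names invented for this example and are not part of Grid Engine; --log-file is a regular valgrind option, and valgrind replaces %p in the file name with the process ID. The function only prints the command line; it does not start anything.

```shell
# Print the command line the startup script would run.
# VALGRIND_BIN and VALGRIND_LOG are made-up names for this sketch.
execd_cmdline() {
    bin_dir=$1
    if [ -n "${VALGRIND_BIN:-}" ]; then
        # %p in the log file name is expanded by valgrind to the process ID.
        printf '%s --log-file=%s %s\n' \
            "$VALGRIND_BIN" \
            "${VALGRIND_LOG:-/tmp/sge_execd.valgrind.%p}" \
            "$bin_dir/sge_execd"
    else
        # No wrapper configured: start execd directly, as the script did before.
        printf '%s\n' "$bin_dir/sge_execd"
    fi
}

# Example with a hypothetical bin_dir; valgrindpath is the placeholder from above.
VALGRIND_BIN=valgrindpath/valgrind execd_cmdline /opt/sge/bin
```

With VALGRIND_BIN unset, the function prints the plain sge_execd path, so the same script could be used on the SLES 11.1 hosts that do not need the workaround.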

I won't try to solve this any other way, but if someone would like to see the valgrind output or the like, tell me.
Best regards, Sven
