[gridengine users] Qmaster Failing

Rayson Ho rayrayson at gmail.com
Mon May 2 17:29:10 UTC 2011


Hi Brian,

A lot of fixes and enhancements have been made since 6.2u5 by Sun,
Oracle, and the 3 forks based on SGE 6.2u5. However, it is a bit hard to
pinpoint the location of the crash from the strace log -- can you
attach a debugger?

% gdb -q <location of qmaster>
(gdb) attach <pid of qmaster>
(gdb) cont

When qmaster crashes again, gdb will stop at the SIGSEGV; type "bt" at
the (gdb) prompt to print the stack trace.

You may need to run gdb as root.
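
If you don't want to leave an interactive session open until the next
crash, gdb can also be driven non-interactively. A rough sketch, not
tested against your setup: the function name trace_on_crash, the GDB
override variable and the log path in the usage comment are made up for
illustration, and I am assuming the daemon shows up as sge_qmaster in
the process table:

```shell
#!/bin/sh
# Sketch only: attach gdb to a running process and dump every thread's
# stack once the process stops on a signal such as SIGSEGV.

# Debugger binary; overridable (hypothetical knob) for dry runs.
GDB=${GDB:-gdb}

# trace_on_crash PID  (hypothetical helper name)
trace_on_crash() {
    # -p PID               attach to the running process
    # -batch               run the -ex commands in order, then exit
    # cont                 resume the process until a signal stops it
    # thread apply all bt  print a backtrace for every thread
    "$GDB" -q -p "$1" -batch -ex cont -ex "thread apply all bt"
}

# Example use (assumes the process name sge_qmaster; log path is arbitrary):
#   trace_on_crash "$(pgrep -o sge_qmaster)" > /tmp/qmaster-bt.log 2>&1
```

Since the strace output shows a couple dozen threads, "thread apply all
bt" is probably more useful here than a plain "bt", which only covers
the thread gdb happens to have selected.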

Rayson



On Mon, May 2, 2011 at 1:23 PM, Murphy, Brian (E IT F 45)
<brian.murphy at siemens.com> wrote:
> Running 6.2u5.
> qmaster running on RHEL 5.4.  Exec host machines running on 5.5/5.6.
> (Currently in upgrade process to 5.6)
> Qmaster keeps dying seemingly randomly (9 times since Friday afternoon.)
> Have not experienced this issue since installing a year ago.
> Problem started a month or so ago and has increased in frequency.
> Currently running a crontab every 2 minutes to check if qmaster is down
> and if so, do a restart.
> I can't find any indication anywhere, e.g., log files etc., as to why it is
> dying.
> So I did an strace on the qmaster PID.
> It shows a segmentation fault (last few lines below.)
> Any ideas?
>
> [pid 24778] futex(0x7375e0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
> [pid 24774] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
> [pid 24753] futex(0x7375e0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 24744] gettimeofday( <unfinished ...>
> [pid 24743] futex(0x2b662bd40c24, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647,
> 0x2b662bd40bc0, 7404026 <unfinished ...>
> [pid 24778] <... futex resumed> )       = -1 EAGAIN (Resource temporarily
> unavailable)
> [pid 24776] <... futex resumed> )       = 0
> [pid 24774] <... clock_gettime resumed> {1304038113, 8112000}) = 0
> [pid 24753] <... futex resumed> )       = 0
> [pid 24744] <... gettimeofday resumed> {1304038113, 8320}, NULL) = 0
> [pid 24743] <... futex resumed> )       = 2
> [pid 24778] futex(0x7375e0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 24776] futex(0x2b662bd40bc0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished
> ...>
> [pid 24774] futex(0x2aaaabc5aa0c, FUTEX_WAIT_PRIVATE, 2519512, {0,
> 998853000} <unfinished ...>
> [pid 24753] gettimeofday( <unfinished ...>
> [pid 24744] futex(0x2b662bd409e0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 24743] futex(0x2b662bd40bc0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 24778] <... futex resumed> )       = 0
> [pid 24777] <... futex resumed> )       = 0
> [pid 24776] <... futex resumed> )       = -1 EAGAIN (Resource temporarily
> unavailable)
> [pid 24753] <... gettimeofday resumed> {1304038113, 9573}, {0, 1304038113})
> = 0
> [pid 24744] <... futex resumed> )       = 0
> [pid 24743] <... futex resumed> )       = 1
> [pid 24778] gettimeofday( <unfinished ...>
> [pid 24777] futex(0x2b662bd40bc0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 24776] futex(0x2b662bd40bc0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
> [pid 24753] gettimeofday( <unfinished ...>
> [pid 24744] poll([{fd=38, events=POLLOUT}], 1, 5 <unfinished ...>
> [pid 24743] gettimeofday( <unfinished ...>
> [pid 24778] <... gettimeofday resumed> {1304038113, 10670}, {0, 1304038113})
> = 0
> [pid 24777] <... futex resumed> )       = 0
> [pid 24776] <... futex resumed> )       = 0
> [pid 24753] <... gettimeofday resumed> {1304038113, 11054}, NULL) = 0
> [pid 24744] <... poll resumed> )        = 1 ([{fd=38, revents=POLLOUT}])
> [pid 24743] <... gettimeofday resumed> {1304038113, 11228}, NULL) = 0
> [pid 24778] --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> Process 24778 detached
> [pid 24794] +++ killed by SIGSEGV +++
> [pid 24793] +++ killed by SIGSEGV +++
> [pid 24790] +++ killed by SIGSEGV +++
> [pid 24789] +++ killed by SIGSEGV +++
> [pid 24788] +++ killed by SIGSEGV +++
> [pid 24787] +++ killed by SIGSEGV +++
> [pid 24786] +++ killed by SIGSEGV +++
> [pid 24785] +++ killed by SIGSEGV +++
> [pid 24784] +++ killed by SIGSEGV +++
> [pid 24783] +++ killed by SIGSEGV +++
> [pid 24782] +++ killed by SIGSEGV +++
> [pid 24781] +++ killed by SIGSEGV +++
> [pid 24780] +++ killed by SIGSEGV +++
> [pid 24779] +++ killed by SIGSEGV +++
> [pid 24777] +++ killed by SIGSEGV +++
> [pid 24776] +++ killed by SIGSEGV +++
> [pid 24774] +++ killed by SIGSEGV +++
> [pid 24755] +++ killed by SIGSEGV +++
> [pid 24754] +++ killed by SIGSEGV +++
> [pid 24753] +++ killed by SIGSEGV +++
> [pid 24752] +++ killed by SIGSEGV +++
> [pid 24744] +++ killed by SIGSEGV +++
> [pid 24743] +++ killed by SIGSEGV +++
> [pid 24742] +++ killed by SIGSEGV +++
> [pid 24740] +++ killed by SIGSEGV +++
> +++ killed by SIGSEGV +++
>
>
> Best Regards,
> Brian Murphy
> ________________________________________
> Siemens Energy, Inc.
> Global Engineering Computing Operations
> Engineering Applications Administrator
> Compute Grid Administrator
> Orlando, Florida, USA
> 407.736.5215
>
> _______________________________________________
> users mailing list
> users at gridengine.org
> https://gridengine.org/mailman/listinfo/users
>
>



