[gridengine users] SoGE 8.1.2 segfault problem

Loong, Andreas Andreas.Loong at astrazeneca.com
Thu Nov 1 07:28:49 UTC 2012


> > I've recently installed SoGE 8.1.2 on a multihomed qmaster. 
> > It's a clean install, using the packages from SoGE project 
> > page. This is meant to replace the existing SGE 6.2u5 
> > version we have running today.
> 
> [It shouldn't have needed a new install -- you can do a live
> upgrade.]

Noted, we opted for this to have a side-by-side installation for testing
purposes. Now I'm kind of glad we did it like that.

> > We have an internal bind which xCAT manages.
> 
> Why dnsmasq and bind (not that it should be relevant to the crash)?

For its ability to cache results for long periods of time in the event
of a failure, mainly. This keeps the cluster in normal operation for the
most part, even after there's a problem with the DNS server.
 
> > Once the qmaster starts up, I see this in the messages-file:
> > 10/31/2012 09:48:11|  main|srvname|W|local configuration srvname
> > not defined - using global configuration
> 
> I'm confused.  I thought it crashed immediately, or is the above with
> the NSS change?  Otherwise, what are the last messages before the crash
> (preferably with log_Level info)?

My fault for not being clearer. It starts up just fine and everything
I'd expect works just as our old qmaster did. After approx 2 minutes it
segfaults without anything new or even odd added to the messages file.
If I change NSS to files, it stops segfaulting altogether and we can
almost make use of it.

I've included the log you requested at the end of the mail.

> > As soon as I change back from pure files to "files dns" it takes
> > 2-3 minutes and the qmaster segfaults again.
> 
> Do you mean qmaster runs for that long, or the init script waits that
> long for it?  What do you get with and without dns in NSS and flushing
> the nscd cache from
> 
>   utilbin/lx-amd64/gethostbyname -all srvname

I tried as many options as I could think of to get differing results, but
the output never changed.
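
For comparing resolver behaviour with and without dns in NSS, here is a
minimal Python sketch. It is not part of SGE and is not a replacement
for the gethostbyname utility; it only asks the system resolver (which
follows the hosts line in /etc/nsswitch.conf) for a name and records the
canonical name, aliases, addresses and reverse lookups, so two runs can
be diffed. The function name resolve_summary is hypothetical.

```python
import socket


def resolve_summary(name):
    """Resolve `name` forwards and backwards via the system resolver.

    Lookup failures are recorded instead of raised, so that two runs
    (e.g. "files" vs "files dns" in nsswitch.conf) can be compared.
    """
    result = {"name": name, "canonical": None, "aliases": [],
              "addresses": [], "reverse": {}, "error": None}
    try:
        canonical, aliases, addresses = socket.gethostbyname_ex(name)
    except OSError as exc:  # covers socket.gaierror and socket.herror
        result["error"] = str(exc)
        return result
    result.update(canonical=canonical, aliases=aliases, addresses=addresses)
    for addr in addresses:
        # Reverse lookup of each address; a mismatch with `canonical`
        # is worth noting on a multihomed host.
        try:
            result["reverse"][addr] = socket.gethostbyaddr(addr)[0]
        except OSError as exc:
            result["reverse"][addr] = "lookup failed: %s" % exc
    return result
```

Calling resolve_summary("srvname") once per NSS setting (and after
flushing the nscd cache) and comparing the two dictionaries would show
whether the resolver really returns the same data in both cases.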

> > It might be worth noting that this host is an SGE 6.2u5 qmaster
> > usually, with the original configuration of the resolver, it works
> > without problems.
> 
> Are the execds always the same version as the qmaster (although I'd
> expect something different if not)?

I tried it with differing versions, but the qmaster noted the mismatch
in the logs, and the segfaults then made me suspect an incompatibility,
so I've kept the execds at the same version as the qmaster since early
on in the process.

messages-output below (I have stripped our internal domain from this
log):
11/01/2012 07:26:47|  main|srvname|W|local configuration srvname not
defined - using global configuration
11/01/2012 07:26:47|  main|srvname|I|using "/var/spool/sge" for
execd_spool_dir
11/01/2012 07:26:47|  main|srvname|I|using "/bin/mail" for mailer
11/01/2012 07:26:47|  main|srvname|I|using "/usr/bin/xterm" for xterm
11/01/2012 07:26:47|  main|srvname|I|using "none" for load_sensor
11/01/2012 07:26:47|  main|srvname|I|using "none" for prolog
11/01/2012 07:26:47|  main|srvname|I|using "none" for epilog
11/01/2012 07:26:47|  main|srvname|I|using "unix_behavior" for
shell_start_mode
11/01/2012 07:26:47|  main|srvname|I|using "sh,bash,ksh,csh,tcsh" for
login_shells
11/01/2012 07:26:47|  main|srvname|I|using "0" for min_uid
11/01/2012 07:26:47|  main|srvname|I|using "0" for min_gid
11/01/2012 07:26:47|  main|srvname|I|using "20000-20100" for gid_range
11/01/2012 07:26:47|  main|srvname|I|using "00:00:40" for
load_report_time
11/01/2012 07:26:47|  main|srvname|I|using "false" for enforce_project
11/01/2012 07:26:47|  main|srvname|I|using "auto" for enforce_user
11/01/2012 07:26:47|  main|srvname|I|using "00:05:00" for max_unheard
11/01/2012 07:26:47|  main|srvname|I|using "log_info" for loglevel
11/01/2012 07:26:47|  main|srvname|I|using "none" for administrator_mail
11/01/2012 07:26:47|  main|srvname|I|using "none" for set_token_cmd
11/01/2012 07:26:47|  main|srvname|I|using "none" for pag_cmd
11/01/2012 07:26:47|  main|srvname|I|using "none" for token_extend_time
11/01/2012 07:26:47|  main|srvname|I|using "none" for shepherd_cmd
11/01/2012 07:26:47|  main|srvname|I|using "none" for qmaster_params
11/01/2012 07:26:47|  main|srvname|I|using "ENABLE_BINDING=true
H_MEMORYLOCKED=1099509530624 S_MEMORYLOCKED=1099509530624" for
execd_params
11/01/2012 07:26:47|  main|srvname|I|using "accounting=true
reporting=true flush_time=00:00:15 joblog=false sharelog=00:00:00" for
reporting_params
11/01/2012 07:26:47|  main|srvname|I|using "100" for finished_jobs
11/01/2012 07:26:47|  main|srvname|I|using "builtin" for qlogin_daemon
11/01/2012 07:26:47|  main|srvname|I|using "builtin" for qlogin_command
11/01/2012 07:26:47|  main|srvname|I|using "builtin" for rsh_daemon
11/01/2012 07:26:47|  main|srvname|I|using "builtin" for rsh_command
11/01/2012 07:26:47|  main|srvname|I|using "none" for jsv_url
11/01/2012 07:26:47|  main|srvname|I|using "ac,h,i,e,o,j,M,N,p,w" for
jsv_allowed_mod
11/01/2012 07:26:47|  main|srvname|I|using "builtin" for rlogin_daemon
11/01/2012 07:26:47|  main|srvname|I|using "builtin" for rlogin_command
11/01/2012 07:26:47|  main|srvname|I|using "00:00:00" for
reschedule_unknown
11/01/2012 07:26:47|  main|srvname|I|using "2000" for max_aj_instances
11/01/2012 07:26:47|  main|srvname|I|using "75000" for max_aj_tasks
11/01/2012 07:26:47|  main|srvname|I|using "0" for max_u_jobs
11/01/2012 07:26:47|  main|srvname|I|using "0" for max_jobs
11/01/2012 07:26:47|  main|srvname|I|using "0" for
max_advance_reservations
11/01/2012 07:26:47|  main|srvname|I|using "false" for reprioritize
11/01/2012 07:26:47|  main|srvname|I|using "0" for auto_user_oticket
11/01/2012 07:26:47|  main|srvname|I|using "0" for auto_user_fshare
11/01/2012 07:26:47|  main|srvname|I|using "none" for
auto_user_default_project
11/01/2012 07:26:47|  main|srvname|I|using "86400" for
auto_user_delete_time
11/01/2012 07:26:47|  main|srvname|I|using "false" for
delegated_file_staging
11/01/2012 07:26:47|  main|srvname|I|using "" for libjvm_path
11/01/2012 07:26:47|  main|srvname|I|using "" for additional_jvm_args
11/01/2012 07:26:47|  main|srvname|I|read job database with 0 entries in
0 seconds
11/01/2012 07:26:47|  main|srvname|W|nr of dynamic event clients exceeds
max file descriptor limit, setting MAX_DYN_EC=979
11/01/2012 07:26:47|  main|srvname|I|max dynamic event clients is set to
979
11/01/2012 07:26:47|  main|srvname|I|qmaster hard descriptor limit is
set to 1024
11/01/2012 07:26:47|  main|srvname|I|qmaster soft descriptor limit is
set to 1024
11/01/2012 07:26:47|  main|srvname|I|qmaster will use max. 1004 file
descriptors for communication
11/01/2012 07:26:47|  main|srvname|I|qmaster will accept max. 979
dynamic event clients
11/01/2012 07:26:47|  main|srvname|I|starting up SGE 8.1.2 (lx-amd64)
11/01/2012 07:26:48|  main|srvname|I|2 worker threads are enabled
11/01/2012 07:26:48|  main|srvname|I|2 listener threads are enabled
11/01/2012 07:26:48|schedu|srvname|I|"scheduler" registers as event
client with id 1 event delivery interval 10
11/01/2012 07:26:48|schedu|srvname|I|sge_clab2dev@srvname added
"scheduler" to event client list
11/01/2012 07:26:48|schedu|srvname|I|using "default" as algorithm
11/01/2012 07:26:48|schedu|srvname|I|using "0:0:30" for
schedule_interval
11/01/2012 07:26:48|schedu|srvname|I|using "0:0:0" for
load_adjustment_decay_time
11/01/2012 07:26:48|schedu|srvname|I|using "mem_total" for load_formula
11/01/2012 07:26:48|schedu|srvname|I|using "true" for schedd_job_info
11/01/2012 07:26:48|schedu|srvname|I|using param: "none"
11/01/2012 07:26:48|schedu|srvname|I|using "0:0:0" for
reprioritize_interval
11/01/2012 07:26:48|schedu|srvname|I|using "cpu=0.75,mem=0.25,io=0" for
usage_weight_list
11/01/2012 07:26:48|schedu|srvname|I|using "none" for
halflife_decay_list
11/01/2012 07:26:48|schedu|srvname|I|using "OFS" for policy_hierarchy
11/01/2012 07:26:48|schedu|srvname|I|using "NONE" for
job_load_adjustments
11/01/2012 07:26:48|schedu|srvname|I|using 0 for maxujobs
11/01/2012 07:26:48|schedu|srvname|I|using 0 for queue_sort_method
11/01/2012 07:26:48|schedu|srvname|I|using 1 for flush_submit_sec
11/01/2012 07:26:48|schedu|srvname|I|using 1 for flush_finish_sec
11/01/2012 07:26:48|schedu|srvname|I|using 144 for halftime
11/01/2012 07:26:48|schedu|srvname|I|using 5 for compensation_factor
11/01/2012 07:26:48|schedu|srvname|I|using 0.25 for weight_user
11/01/2012 07:26:48|schedu|srvname|I|using 0.25 for weight_project
11/01/2012 07:26:48|schedu|srvname|I|using 0.25 for weight_department
11/01/2012 07:26:48|schedu|srvname|I|using 0.25 for weight_job
11/01/2012 07:26:48|schedu|srvname|I|using 10000 for
weight_tickets_functional
11/01/2012 07:26:48|schedu|srvname|I|using 100000 for
weight_tickets_share
11/01/2012 07:26:48|schedu|srvname|I|using 1 for share_override_tickets
11/01/2012 07:26:48|schedu|srvname|I|using 1 for share_functional_shares
11/01/2012 07:26:48|schedu|srvname|I|using 200 for
max_functional_jobs_to_schedule
11/01/2012 07:26:48|schedu|srvname|I|using 1 for report_pjob_tickets
11/01/2012 07:26:48|schedu|srvname|I|using 50 for
max_pending_tasks_per_job
11/01/2012 07:26:48|schedu|srvname|I|using 0.5 for weight_ticket
11/01/2012 07:26:48|schedu|srvname|I|using 0.075 for weight_waiting_time
11/01/2012 07:26:48|schedu|srvname|I|using 3.6e+06 for weight_deadline
11/01/2012 07:26:48|schedu|srvname|I|using 0.5 for weight_urgency
11/01/2012 07:26:48|schedu|srvname|I|using 1 for weight_priority
11/01/2012 07:26:48|schedu|srvname|I|using 100 for max_reservation
11/01/2012 07:26:48|  main|srvname|I|scheduler has been started
11/01/2012 07:26:48|  main|srvname|I|start of jvm thread is disabled in
bootstrap file
11/01/2012 07:26:48|  main|srvname|I|qmaster startup took 1 seconds
11/01/2012 07:27:12|worker|srvname|I|execd on cla-014.cluster registered
11/01/2012 07:27:12|worker|srvname|I|execd on cla-005.cluster registered
11/01/2012 07:27:12|worker|srvname|I|execd on cla-001.cluster registered
11/01/2012 07:27:12|worker|srvname|I|execd on cla-012.cluster registered
11/01/2012 07:27:12|worker|srvname|I|execd on cla-009.cluster registered
11/01/2012 07:27:12|worker|srvname|I|execd on cla-002.cluster registered
11/01/2012 07:27:12|worker|srvname|I|execd on cla-006.cluster registered
11/01/2012 07:27:12|worker|srvname|I|execd on cla-007.cluster registered
11/01/2012 07:27:12|worker|srvname|I|execd on cla-010.cluster registered
11/01/2012 07:27:12|worker|srvname|I|execd on cla-008.cluster registered
11/01/2012 07:27:12|worker|srvname|I|execd on cla-011.cluster registered
11/01/2012 07:27:12|worker|srvname|I|execd on cla-013.cluster registered
11/01/2012 07:27:12|worker|srvname|I|execd on cla-003.cluster registered
11/01/2012 07:27:12|worker|srvname|I|execd on cla-004.cluster registered
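
The entries above follow the pattern date time|thread|host|level|text,
and several were wrapped by the mail client. As a rough sketch (the
function name parse_messages is hypothetical, and the field layout is
inferred only from the lines in this mail), they can be rejoined and
parsed like this:

```python
from datetime import datetime


def parse_messages(lines):
    """Parse 'messages' lines of the form
    MM/DD/YYYY HH:MM:SS|thread|host|level|text

    A line that does not start a new entry is treated as a wrapped
    continuation and appended to the previous entry's text.
    """
    entries = []
    for raw in lines:
        line = raw.rstrip("\n")
        parts = line.split("|", 4)
        stamp = None
        if len(parts) == 5:
            try:
                stamp = datetime.strptime(parts[0], "%m/%d/%Y %H:%M:%S")
            except ValueError:
                stamp = None  # first field is not a timestamp
        if stamp is not None:
            entries.append({"time": stamp, "thread": parts[1].strip(),
                            "host": parts[2], "level": parts[3],
                            "text": parts[4]})
        elif entries and line.strip():
            entries[-1]["text"] += " " + line.strip()
    return entries
```

With the log parsed this way, entries[-1]["time"] gives the timestamp of
the last message written before the crash.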

Then, after about one minute of silence, this appears in the syslog:

Nov  1 07:28:06 srvname kernel: sge_qmaster[26300]: segfault at
0000000000000068 rip 000000000049e49c rsp 0000000046d52f00 error 6
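
The kernel line can be decoded a little further. The sketch below
assumes the usual x86 page-fault error code bits as reported by Linux
(bit 0: page was present, bit 1: write access, bit 2: fault happened in
user mode); the function name decode_segfault is hypothetical.

```python
import re

SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-fA-F]+) "
    r"(?:rip|ip) (?P<ip>[0-9a-fA-F]+) (?:rsp|sp) (?P<sp>[0-9a-fA-F]+) "
    r"error (?P<err>[0-9a-fA-F]+)"
)


def decode_segfault(line):
    """Decode a kernel 'segfault at ...' syslog line into its fields."""
    # Collapse whitespace first so a mail-wrapped line still matches.
    m = SEGFAULT_RE.search(" ".join(line.split()))
    if m is None:
        raise ValueError("not a kernel segfault line")
    err = int(m.group("err"), 16)
    return {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "address": int(m.group("addr"), 16),  # faulting address
        "ip": int(m.group("ip"), 16),         # instruction pointer
        "sp": int(m.group("sp"), 16),         # stack pointer
        "page_present": bool(err & 1),  # bit 0: 0 = page not mapped
        "write": bool(err & 2),         # bit 1: 1 = write, 0 = read
        "user_mode": bool(err & 4),     # bit 2: 1 = fault in user mode
    }
```

For the line above, error 6 decodes to a user-mode write to an unmapped
page, and the faulting address is 0x68. That would be consistent with a
store through a NULL structure pointer to a member at offset 0x68, but
that is an interpretation, not something the log proves. If the
sge_qmaster binary has symbols, something like
"addr2line -f -e sge_qmaster 0x49e49c" may map the rip value to a
function name.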

Next, I'll try the GDB approach.

Wbr
Andreas
