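I spent the past week tracking down a DNS resolution failure in filebeat. filebeat is deployed as a DaemonSet in our k8s clusters and collects the logs of every pod on each host. On clusters whose host OS is CentOS 7.4 there were no problems at all. On clusters running CentOS 7.6, however, DNS resolution failed, so logs could not be sent to the Kafka cluster.
The filebeat error log showed: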
Failed to connect to broker sg.main2.kafka.metis.service:9092: dial tcp: lookup sg.main2.kafka.metis.service: Try again
So the debugging began. My first suspect was CoreDNS, so I exec'd into the pod and ran dig.
dig @10.247.3.10 sg.main2.kafka.metis.service
; <<>> DiG 9.12.4-P2 <<>> @10.247.3.10 sg.main2.kafka.metis.service
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44350
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;sg.main2.kafka.metis.service. IN A
;; ANSWER SECTION:
sg.main2.kafka.metis.service. 30 IN A 10.21.42.97
;; Query time: 1 msec
;; SERVER: 10.247.3.10#53(10.247.3.10)
;; WHEN: Sun Jan 05 14:13:26 UTC 2020
;; MSG SIZE rcvd: 101
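Resolution works fine inside the pod, so the problem had to be on the code side.
Time to bring out strace.
It showed filebeat sending its DNS queries to 127.0.0.1 port 53. Unsurprisingly, resolution failed.
The next step was to check this against the Go source.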
// Copyright 2009 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

// +build aix darwin dragonfly freebsd linux netbsd openbsd solaris

// Read system DNS config from /etc/resolv.conf

package net

import (
	"internal/bytealg"
	"os"
	"sync/atomic"
	"time"
)

var (
	defaultNS   = []string{"127.0.0.1:53", "[::1]:53"}
	getHostname = os.Hostname // variable for testing
)

type dnsConfig struct {
	servers       []string      // server addresses (in host:port form) to use
	search        []string      // rooted suffixes to append to local name
	ndots         int           // number of dots in name to trigger absolute lookup
	timeout       time.Duration // wait before giving up on a query, including retries
	attempts      int           // lost packets before giving up on server
	rotate        bool          // round robin among servers
	unknownOpt    bool          // anything unknown was encountered
	lookup        []string      // OpenBSD top-level database "lookup" order
	err           error         // any error that occurs during open of resolv.conf
	mtime         time.Time     // time of resolv.conf modification
	soffset       uint32        // used by serverOffset
	singleRequest bool          // use sequential A and AAAA queries instead of parallel queries
	useTCP        bool          // force usage of TCP for DNS resolutions
}

// See resolv.conf(5) on a Linux machine.
func dnsReadConfig(filename string) *dnsConfig {
	conf := &dnsConfig{
		ndots:    1,
		timeout:  5 * time.Second,
		attempts: 2,
	}
	file, err := open(filename)
	if err != nil {
		conf.servers = defaultNS
		conf.search = dnsDefaultSearch()
		conf.err = err
		return conf
	}
	defer file.close()
	if fi, err := file.file.Stat(); err == nil {
		conf.mtime = fi.ModTime()
	} else {
		conf.servers = defaultNS
		conf.search = dnsDefaultSearch()
		conf.err = err
		return conf
	}
	for line, ok := file.readLine(); ok; line, ok = file.readLine() {
		if len(line) > 0 && (line[0] == ';' || line[0] == '#') {
			// comment.
			continue
		}
		f := getFields(line)
		if len(f) < 1 {
			continue
		}
		switch f[0] {
		case "nameserver": // add one name server
			if len(f) > 1 && len(conf.servers) < 3 { // small, but the standard limit
				// One more check: make sure server name is
				// just an IP address. Otherwise we need DNS
				// to look it up.
				if parseIPv4(f[1]) != nil {
					conf.servers = append(conf.servers, JoinHostPort(f[1], "53"))
				} else if ip, _ := parseIPv6Zone(f[1]); ip != nil {
					conf.servers = append(conf.servers, JoinHostPort(f[1], "53"))
				}
			}

		case "domain": // set search path to just this domain
			if len(f) > 1 {
				conf.search = []string{ensureRooted(f[1])}
			}

		case "search": // set search path to given servers
			conf.search = make([]string, len(f)-1)
			for i := 0; i < len(conf.search); i++ {
				conf.search[i] = ensureRooted(f[i+1])
			}

		case "options": // magic options
			for _, s := range f[1:] {
				switch {
				case hasPrefix(s, "ndots:"):
					n, _, _ := dtoi(s[6:])
					if n < 0 {
						n = 0
					} else if n > 15 {
						n = 15
					}
					conf.ndots = n
				case hasPrefix(s, "timeout:"):
					n, _, _ := dtoi(s[8:])
					if n < 1 {
						n = 1
					}
					conf.timeout = time.Duration(n) * time.Second
				case hasPrefix(s, "attempts:"):
					n, _, _ := dtoi(s[9:])
					if n < 1 {
						n = 1
					}
					conf.attempts = n
				case s == "rotate":
					conf.rotate = true
				case s == "single-request" || s == "single-request-reopen":
					// Linux option:
					// http://man7.org/linux/man-pages/man5/resolv.conf.5.html
					// "By default, glibc performs IPv4 and IPv6 lookups in parallel [...]
					//  This option disables the behavior and makes glibc
					//  perform the IPv6 and IPv4 requests sequentially."
					conf.singleRequest = true
				case s == "use-vc" || s == "usevc" || s == "tcp":
					// Linux (use-vc), FreeBSD (usevc) and OpenBSD (tcp) option:
					// http://man7.org/linux/man-pages/man5/resolv.conf.5.html
					// "Sets RES_USEVC in _res.options.
					//  This option forces the use of TCP for DNS resolutions."
					// https://www.freebsd.org/cgi/man.cgi?query=resolv.conf&sektion=5&manpath=freebsd-release-ports
					// https://man.openbsd.org/resolv.conf.5
					conf.useTCP = true
				default:
					conf.unknownOpt = true
				}
			}

		case "lookup":
			// OpenBSD option:
			// https://www.openbsd.org/cgi-bin/man.cgi/OpenBSD-current/man5/resolv.conf.5
			// "the legal space-separated values are: bind, file, yp"
			conf.lookup = f[1:]

		default:
			conf.unknownOpt = true
		}
	}
	if len(conf.servers) == 0 {
		conf.servers = defaultNS
	}
	if len(conf.search) == 0 {
		conf.search = dnsDefaultSearch()
	}
	return conf
}

// serverOffset returns an offset that can be used to determine
// indices of servers in c.servers when making queries.
// When the rotate option is enabled, this offset increases.
// Otherwise it is always 0.
func (c *dnsConfig) serverOffset() uint32 {
	if c.rotate {
		return atomic.AddUint32(&c.soffset, 1) - 1 // return 0 to start
	}
	return 0
}

func dnsDefaultSearch() []string {
	hn, err := getHostname()
	if err != nil {
		// best effort
		return nil
	}
	if i := bytealg.IndexByteString(hn, '.'); i >= 0 && i < len(hn)-1 {
		return []string{ensureRooted(hn[i+1:])}
	}
	return nil
}

func hasPrefix(s, prefix string) bool {
	return len(s) >= len(prefix) && s[:len(prefix)] == prefix
}

func ensureRooted(s string) string {
	if len(s) > 0 && s[len(s)-1] == '.' {
		return s
	}
	return s + "."
}
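The part of the listing above that matters for the strace observation is the fallback: when resolv.conf cannot be opened, or yields no usable nameserver line, conf.servers becomes defaultNS, that is 127.0.0.1:53. Below is a simplified sketch of just that logic, not the net package's own code. The function name nameservers is hypothetical, and net.ParseIP stands in for the internal parseIPv4/parseIPv6Zone helpers, so zoned IPv6 addresses are not handled.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"strings"
)

// defaultNS mirrors the fallback list from the listing above.
var defaultNS = []string{"127.0.0.1:53", "[::1]:53"}

// nameservers is a hypothetical, simplified stand-in for the "nameserver"
// handling in dnsReadConfig: it keeps at most three IP-literal servers and
// falls back to defaultNS when none are found.
func nameservers(resolvConf string) []string {
	var servers []string
	sc := bufio.NewScanner(strings.NewReader(resolvConf))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || line[0] == '#' || line[0] == ';' {
			continue // blank line or comment
		}
		f := strings.Fields(line)
		if f[0] != "nameserver" || len(f) < 2 || len(servers) >= 3 {
			continue
		}
		// Only IP literals are accepted; a hostname would itself need DNS.
		if net.ParseIP(f[1]) != nil {
			servers = append(servers, net.JoinHostPort(f[1], "53"))
		}
	}
	if len(servers) == 0 {
		return defaultNS
	}
	return servers
}

func main() {
	// A resolv.conf pointing at the cluster DNS address used in the dig test.
	fmt.Println(nameservers("nameserver 10.247.3.10\noptions ndots:5\n"))
	// No usable nameserver line: the loopback fallback is returned.
	fmt.Println(nameservers("# empty\nsearch svc.cluster.local\n"))
}
```

If a process ends up on this fallback inside a pod with no local DNS server listening, queries to 127.0.0.1:53 would be expected to fail, which matches what strace showed.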
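Since the very same code runs without problems on the CentOS 7.4 clusters, I suspected some incompatibility between the alpine 3.8 base image and CentOS 7.6.
Go's DNS resolution supports two modes, cgo and pure Go. Perhaps some setting was causing Go to resolve through cgo. alpine uses the rather unusual musl libc, and that library might not be compatible with CentOS 7.6.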
var lookupOrderName = map[hostLookupOrder]string{
	hostLookupCgo:      "cgo",
	hostLookupFilesDNS: "files,dns",
	hostLookupDNSFiles: "dns,files",
	hostLookupFiles:    "files",
	hostLookupDNS:      "dns",
}
Here hostLookupCgo is a category of its own: it means resolving by calling libc's getaddrinfo directly.
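The name resolution functions are called indirectly by Dial, whereas LookupHost and LookupAddr call them directly. The implementation differs by operating system. On Unix systems there are two ways to resolve a name:
- A pure Go resolver, which reads the list of local DNS server addresses from /etc/resolv.conf, sends DNS requests (UDP packets), and collects the results
- A cgo resolver, which ends up calling getaddrinfo or getnameinfo from the C standard library (not recommended, as it is unfriendly to goroutines)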
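The choice between the two paths can also be expressed in code. The sketch below was not part of the original investigation: it uses the standard net.Resolver type with PreferGo set, plus a custom Dial that sends every query to one fixed server. The helper name newPureGoResolver is hypothetical; the server address and hostname are the ones from the dig test earlier.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// newPureGoResolver (hypothetical helper) returns a resolver that asks for
// Go's built-in DNS client and dials the given server (host:port) for every
// query instead of the address the resolver would otherwise have chosen.
func newPureGoResolver(server string) *net.Resolver {
	return &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			d := net.Dialer{Timeout: 2 * time.Second}
			return d.DialContext(ctx, network, server)
		},
	}
}

func main() {
	r := newPureGoResolver("10.247.3.10:53")
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	// Outside the cluster this lookup is expected to return an error.
	addrs, err := r.LookupHost(ctx, "sg.main2.kafka.metis.service")
	fmt.Println(addrs, err)
}
```

A resolver like this only affects lookups made through it explicitly or through a net.Dialer whose Resolver field is set, so it is more useful as a diagnostic than as a fix for a third-party binary such as filebeat.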
The GODEBUG environment variable selects Go's default DNS resolution mode, pure Go or cgo:
export GODEBUG=netdns=go # force pure Go resolver
export GODEBUG=netdns=cgo # force cgo resolver
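To confirm the hypothesis I traced Go's name resolution flow. With export GODEBUG=netdns=go+9 forced, the problem did not occur; with export GODEBUG=netdns=cgo+9, it did. With go1.11, the binary takes the cgo path.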
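The netdns value combines a resolver mode with an optional debug level joined by "+", so go+9 means pure Go resolver with debug level 9. The sketch below splits such a setting into its two parts to make the syntax concrete. It is an illustration, not the net package's own parser, and the function name parseNetDNS is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseNetDNS (hypothetical) extracts the resolver mode and debug level from
// a GODEBUG string such as "netdns=cgo+9" or "gctrace=1,netdns=go".
// A missing mode is returned as "" and a missing level as 0.
func parseNetDNS(godebug string) (mode string, level int) {
	for _, kv := range strings.Split(godebug, ",") {
		if !strings.HasPrefix(kv, "netdns=") {
			continue
		}
		for _, part := range strings.Split(kv[len("netdns="):], "+") {
			if n, err := strconv.Atoi(part); err == nil {
				level = n // numeric part: debug verbosity
			} else if part != "" {
				mode = part // non-numeric part: "go" or "cgo"
			}
		}
	}
	return mode, level
}

func main() {
	mode, level := parseNetDNS(os.Getenv("GODEBUG"))
	fmt.Printf("netdns mode=%q debug level=%d\n", mode, level)
}
```

Running it with GODEBUG=netdns=cgo+9 set in the environment reports the cgo mode and level 9, the combination that reproduced the failure.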
I then disabled cgo when building filebeat, as follows:
CGO_ENABLED=0 go build --ldflags -w -o filebeat
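That resolved the problem once and for all.
I also added print statements at the entry point where Go calls into C (getaddrinfo). The arguments were identical in the working and failing cases, but the behavior inside the libc differed from that on the older OS version, which points to a libc compatibility problem.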