From a Filebeat DNS resolution failure to Alpine compatibility and Go's internal DNS implementation

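I spent most of the past week chasing a DNS resolution failure in Filebeat. Filebeat is deployed in our k8s clusters as a DaemonSet and collects the logs of every pod on each host. On clusters whose host OS is CentOS 7.4, everything works. On clusters whose host OS is CentOS 7.6, however, DNS resolution failed, so logs could not be shipped to the Kafka cluster.

The Filebeat error log looked like this: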

Failed to connect to broker sg.main2.kafka.metis.service:9092: dial tcp: lookup sg.main2.kafka.metis.service: Try again

So the debugging began. My first suspect was CoreDNS, so I exec'd into the pod and ran dig.

dig @10.247.3.10 sg.main2.kafka.metis.service

; <<>> DiG 9.12.4-P2 <<>> @10.247.3.10 sg.main2.kafka.metis.service
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44350
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;sg.main2.kafka.metis.service. IN A

;; ANSWER SECTION:
sg.main2.kafka.metis.service. 30 IN A 10.21.42.97

;; Query time: 1 msec
;; SERVER: 10.247.3.10#53(10.247.3.10)
;; WHEN: Sun Jan 05 14:13:26 UTC 2020
;; MSG SIZE rcvd: 101

Resolution works fine from inside the pod, so CoreDNS is not at fault and the problem can be narrowed down to the code.

Time to bring out strace.


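strace showed Filebeat sending its DNS queries to 127.0.0.1 port 53. Nothing in the pod listens there, so unsurprisingly the resolution failed.

Next, I compared this against the Go source.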

// Copyright 2009 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

// +build aix darwin dragonfly freebsd linux netbsd openbsd solaris

// Read system DNS config from /etc/resolv.conf

package net

import (
    "internal/bytealg"
    "os"
    "sync/atomic"
    "time"
)

var (
    defaultNS   = []string{"127.0.0.1:53", "[::1]:53"}
    getHostname = os.Hostname // variable for testing
)

type dnsConfig struct {
    servers       []string      // server addresses (in host:port form) to use
    search        []string      // rooted suffixes to append to local name
    ndots         int           // number of dots in name to trigger absolute lookup
    timeout       time.Duration // wait before giving up on a query, including retries
    attempts      int           // lost packets before giving up on server
    rotate        bool          // round robin among servers
    unknownOpt    bool          // anything unknown was encountered
    lookup        []string      // OpenBSD top-level database "lookup" order
    err           error         // any error that occurs during open of resolv.conf
    mtime         time.Time     // time of resolv.conf modification
    soffset       uint32        // used by serverOffset
    singleRequest bool          // use sequential A and AAAA queries instead of parallel queries
    useTCP        bool          // force usage of TCP for DNS resolutions
}

// See resolv.conf(5) on a Linux machine.
func dnsReadConfig(filename string) *dnsConfig {
    conf := &dnsConfig{
        ndots:    1,
        timeout:  5 * time.Second,
        attempts: 2,
    }
    file, err := open(filename)
    if err != nil {
        conf.servers = defaultNS
        conf.search = dnsDefaultSearch()
        conf.err = err
        return conf
    }
    defer file.close()
    if fi, err := file.file.Stat(); err == nil {
        conf.mtime = fi.ModTime()
    } else {
        conf.servers = defaultNS
        conf.search = dnsDefaultSearch()
        conf.err = err
        return conf
    }
    for line, ok := file.readLine(); ok; line, ok = file.readLine() {
        if len(line) > 0 && (line[0] == ';' || line[0] == '#') {
            // comment.
            continue
        }
        f := getFields(line)
        if len(f) < 1 {
            continue
        }
        switch f[0] {
        case "nameserver": // add one name server
            if len(f) > 1 && len(conf.servers) < 3 { // small, but the standard limit
                // One more check: make sure server name is
                // just an IP address. Otherwise we need DNS
                // to look it up.
                if parseIPv4(f[1]) != nil {
                    conf.servers = append(conf.servers, JoinHostPort(f[1], "53"))
                } else if ip, _ := parseIPv6Zone(f[1]); ip != nil {
                    conf.servers = append(conf.servers, JoinHostPort(f[1], "53"))
                }
            }

        case "domain": // set search path to just this domain
            if len(f) > 1 {
                conf.search = []string{ensureRooted(f[1])}
            }

        case "search": // set search path to given servers
            conf.search = make([]string, len(f)-1)
            for i := 0; i < len(conf.search); i++ {
                conf.search[i] = ensureRooted(f[i+1])
            }

        case "options": // magic options
            for _, s := range f[1:] {
                switch {
                case hasPrefix(s, "ndots:"):
                    n, _, _ := dtoi(s[6:])
                    if n < 0 {
                        n = 0
                    } else if n > 15 {
                        n = 15
                    }
                    conf.ndots = n
                case hasPrefix(s, "timeout:"):
                    n, _, _ := dtoi(s[8:])
                    if n < 1 {
                        n = 1
                    }
                    conf.timeout = time.Duration(n) * time.Second
                case hasPrefix(s, "attempts:"):
                    n, _, _ := dtoi(s[9:])
                    if n < 1 {
                        n = 1
                    }
                    conf.attempts = n
                case s == "rotate":
                    conf.rotate = true
                case s == "single-request" || s == "single-request-reopen":
                    // Linux option:
                    // http://man7.org/linux/man-pages/man5/resolv.conf.5.html
                    // "By default, glibc performs IPv4 and IPv6 lookups in parallel [...]
                    //  This option disables the behavior and makes glibc
                    //  perform the IPv6 and IPv4 requests sequentially."
                    conf.singleRequest = true
                case s == "use-vc" || s == "usevc" || s == "tcp":
                    // Linux (use-vc), FreeBSD (usevc) and OpenBSD (tcp) option:
                    // http://man7.org/linux/man-pages/man5/resolv.conf.5.html
                    // "Sets RES_USEVC in _res.options.
                    //  This option forces the use of TCP for DNS resolutions."
                    // https://www.freebsd.org/cgi/man.cgi?query=resolv.conf&sektion=5&manpath=freebsd-release-ports
                    // https://man.openbsd.org/resolv.conf.5
                    conf.useTCP = true
                default:
                    conf.unknownOpt = true
                }
            }

        case "lookup":
            // OpenBSD option:
            // https://www.openbsd.org/cgi-bin/man.cgi/OpenBSD-current/man5/resolv.conf.5
            // "the legal space-separated values are: bind, file, yp"
            conf.lookup = f[1:]

        default:
            conf.unknownOpt = true
        }
    }
    if len(conf.servers) == 0 {
        conf.servers = defaultNS
    }
    if len(conf.search) == 0 {
        conf.search = dnsDefaultSearch()
    }
    return conf
}

// serverOffset returns an offset that can be used to determine
// indices of servers in c.servers when making queries.
// When the rotate option is enabled, this offset increases.
// Otherwise it is always 0.
func (c *dnsConfig) serverOffset() uint32 {
    if c.rotate {
        return atomic.AddUint32(&c.soffset, 1) - 1 // return 0 to start
    }
    return 0
}

func dnsDefaultSearch() []string {
    hn, err := getHostname()
    if err != nil {
        // best effort
        return nil
    }
    if i := bytealg.IndexByteString(hn, '.'); i >= 0 && i < len(hn)-1 {
        return []string{ensureRooted(hn[i+1:])}
    }
    return nil
}

func hasPrefix(s, prefix string) bool {
    return len(s) >= len(prefix) && s[:len(prefix)] == prefix
}

func ensureRooted(s string) string {
    if len(s) > 0 && s[len(s)-1] == '.' {
        return s
    }
    return s + "."
}
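The part of this listing that matches the strace observation is the fallback to defaultNS: whenever resolv.conf cannot be read or yields no usable nameserver line, the pure-Go resolver ends up querying 127.0.0.1:53. Below is a minimal sketch of just that rule. The function name nameServers is my own, it works on a string instead of a file, and unlike the original it does not handle zone-scoped IPv6 addresses.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"strings"
)

// defaultServers mirrors the defaultNS value in the listing above.
var defaultServers = []string{"127.0.0.1:53", "[::1]:53"}

// nameServers is a simplified, hypothetical stand-in for dnsReadConfig.
// It collects up to three "nameserver" entries from resolv.conf text and
// falls back to the loopback defaults when none are usable.
func nameServers(resolvConf string) []string {
	var servers []string
	sc := bufio.NewScanner(strings.NewReader(resolvConf))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "#") || strings.HasPrefix(line, ";") {
			continue // comment line
		}
		f := strings.Fields(line)
		if len(f) < 2 || f[0] != "nameserver" || len(servers) >= 3 {
			continue
		}
		// Only literal IP addresses are accepted; a host name here
		// would itself need DNS to resolve.
		if net.ParseIP(f[1]) != nil {
			servers = append(servers, net.JoinHostPort(f[1], "53"))
		}
	}
	if len(servers) == 0 {
		return defaultServers
	}
	return servers
}

func main() {
	// A resolv.conf with a cluster DNS entry, as a pod would normally have.
	fmt.Println(nameServers("nameserver 10.247.3.10\noptions ndots:5\n"))
	// A resolv.conf without any nameserver line triggers the fallback.
	fmt.Println(nameServers("# no nameserver lines\nsearch svc.cluster.local\n"))
}
```

The second call returns the loopback defaults, which is the kind of destination strace showed.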

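Because the same code runs without problems on the CentOS 7.4 clusters, I suspected some incompatibility between the alpine3.8 base image and CentOS 7.6.

Go's DNS resolution supports two modes, cgo and pure Go. Possibly some setting was causing Go to resolve through cgo, and Alpine uses musl, a rather unusual libc. Perhaps that library is incompatible with CentOS 7.6.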

var lookupOrderName = map[hostLookupOrder]string{
    hostLookupCgo:      "cgo",
    hostLookupFilesDNS: "files,dns",
    hostLookupDNSFiles: "dns,files",
    hostLookupFiles:    "files",
    hostLookupDNS:      "dns",
}

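Of these, hostLookupCgo is a category of its own: it means the lookup is done by calling libc's getaddrinfo directly.

The name resolution functions are called indirectly by Dial, whereas LookupHost and LookupAddr call them directly. The implementation differs between operating systems. On Unix systems there are two ways to resolve a name: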

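     - The pure Go resolver: it reads the list of local DNS server addresses from /etc/resolv.conf, sends the DNS request (a UDP packet) and reads the result.

     - The cgo resolver: it ends up calling the C standard library's getaddrinfo or getnameinfo (not recommended, because a blocking C call ties up an OS thread and is unfriendly to goroutines).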
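Besides strace, the servers the pure Go resolver would contact can also be observed from Go itself. The sketch below builds a net.Resolver with PreferGo set and a Dial hook that records each address and then refuses the connection, so nothing is actually sent. probeServers is a name I made up for illustration, and this assumes a Unix system where PreferGo takes effect.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"sync"
	"time"
)

// probeServers asks the pure-Go resolver for host and returns the DNS server
// addresses it tried to dial. The Dial hook refuses every connection, so no
// packet leaves the machine and the lookup itself is expected to fail.
func probeServers(host string) ([]string, error) {
	var (
		mu   sync.Mutex
		seen []string
	)
	r := &net.Resolver{
		PreferGo: true, // use Go's built-in resolver rather than getaddrinfo
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			mu.Lock()
			seen = append(seen, network+" "+address)
			mu.Unlock()
			return nil, errors.New("dial refused by probe")
		},
	}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_, err := r.LookupHost(ctx, host)

	mu.Lock()
	defer mu.Unlock()
	return append([]string(nil), seen...), err
}

func main() {
	servers, err := probeServers("sg.main2.kafka.metis.service")
	fmt.Println("servers dialed:", servers)
	fmt.Println("lookup error (expected):", err)
}
```

If the dialed list contains only loopback addresses, the resolver did not pick up a nameserver from /etc/resolv.conf.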

The default DNS resolution mode, pure Go or cgo, can be set through the GODEBUG environment variable:
export GODEBUG=netdns=go    # force pure Go resolver
export GODEBUG=netdns=cgo   # force cgo resolver

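To confirm the guess and trace Go's resolution path, I forced export GODEBUG=netdns=go+9 and the problem did not occur; with export GODEBUG=netdns=cgo+9 the problem occurred. So with go1.11 the lookup was going through the cgo path.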
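The values used above combine a resolver mode with a debug level, separated by "+". As a rough illustration of how such a value breaks down, here is a small sketch. parseNetDNS is a hypothetical helper of mine, not the standard library's parser, and it only reads a single leading digit as the level.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// parseNetDNS is a hypothetical helper, not the standard library's parser.
// It finds the netdns entry in a GODEBUG string and splits a value such as
// "cgo+9" into the resolver mode ("cgo") and the debug level (9).
func parseNetDNS(godebug string) (mode string, level int) {
	for _, kv := range strings.Split(godebug, ",") {
		if !strings.HasPrefix(kv, "netdns=") {
			continue
		}
		for _, part := range strings.Split(strings.TrimPrefix(kv, "netdns="), "+") {
			if part == "" {
				continue
			}
			if part[0] >= '0' && part[0] <= '9' {
				level = int(part[0] - '0') // this sketch reads one digit only
			} else {
				mode = part
			}
		}
	}
	return mode, level
}

func main() {
	mode, level := parseNetDNS(os.Getenv("GODEBUG"))
	fmt.Printf("netdns mode=%q debug level=%d\n", mode, level)
}
```

Running a Go program with a non-zero netdns debug level makes the net package report which resolver it chose, which is how the two runs above could be told apart.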

I then disabled cgo when building Filebeat:

CGO_ENABLED=0 go build --ldflags -w -o filebeat

That fixed it once and for all.

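I also added print statements at the point where Go calls into the C function (getaddrinfo). The arguments were identical in the working and failing cases, but the behavior inside the library differed from what happens on the older OS version, which points to a compatibility problem in the C library.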

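Conclusion

  • In an Alpine environment, it is best to build Go code with cgo disabled.
  • In a k8s cluster, it is best to choose a base image from the same distribution as the host OS.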