Analyzing and Resolving SIGSEGV Errors When Deploying Hyperledger Fabric on Alibaba Cloud [Repost]

Date: 2019-12-07


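Recently, several members of the Hyperledger community reported hitting SIGSEGV-related fatal errors while deploying the open source blockchain project Hyperledger Fabric on Alibaba Cloud. I ran into and resolved a similar problem earlier, so this post shares how the problem was analyzed and fixed, in the hope that it offers some insight and help.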

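Problem Description
While deploying Hyperledger Fabric, the peer and orderer services failed to start, and running cli-test.sh in the cli container also reported errors. In every case the error was signal SIGSEGV: segmentation violation. A sample error log follows: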

2017-11-01 02:44:04.247 UTC [peer] updateTrustedRoots -> DEBU 2a0 Updating trusted root authorities for channel mychannel
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x63 pc=0x7f9d15ded259]
runtime stack:
runtime.throw(0xdc37a7, 0x2a)
/opt/go/src/runtime/panic.go:566 +0x95
runtime.sigpanic()
/opt/go/src/runtime/sigpanic_unix.go:12 +0x2cc
goroutine 64 [syscall, locked to thread]:
runtime.cgocall(0xb08d50, 0xc4203bcdf8, 0xc400000000)
/opt/go/src/runtime/cgocall.go:131 +0x110 fp=0xc4203bcdb0 sp=0xc4203bcd70
net._C2func_getaddrinfo(0x7f9d000008c0, 0x0, 0xc420323110, 0xc4201a01e8, 0x0, 0x0, 0x0)
Analysis
After in-depth analysis and experimentation, and prompted by the Hyperledger Fabric bug https://jira.hyperledger.org/browse/FAB-5822, we found that the following workaround resolves the problem:

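In the docker compose yaml, add GODEBUG=netdns=go to the environment variables of the peer, orderer, and cli containers.
This setting makes Go use the pure Go resolver instead of the cgo resolver (the error log shows that the error is thrown from the cgo resolver).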
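
For programs whose source you control, the same preference can also be expressed in code rather than through the environment. The sketch below is not what was done for Fabric in this article (the workaround here is the environment variable); it only illustrates the standard library's net.Resolver with PreferGo set, which applies to lookups made through that one resolver. The helper name newPureGoResolver and the lookup of "localhost" are illustrative assumptions.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// newPureGoResolver (hypothetical helper) returns a resolver that prefers
// Go's built-in DNS client on platforms where it is available, so lookups
// made through it avoid the cgo getaddrinfo path where possible.
func newPureGoResolver() *net.Resolver {
	return &net.Resolver{PreferGo: true}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	// "localhost" is only an example name; it is normally answered from /etc/hosts.
	addrs, err := newPureGoResolver().LookupHost(ctx, "localhost")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("resolved:", addrs)
}
```

Unlike GODEBUG=netdns=go, this affects only code that explicitly uses the returned resolver, which is why the environment variable is the more practical choice for prebuilt Fabric images.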

We then looked further into the conditions under which Go switches between the cgo resolver and the pure Go resolver:

The official Go documentation explains this: https://golang.org/pkg/net/

Name Resolution
The method for resolving domain names, whether indirectly with functions like Dial or directly with functions like LookupHost and LookupAddr, varies by operating system.
On Unix systems, the resolver has two options for resolving names. It can use a pure Go resolver that sends DNS requests directly to the servers listed in /etc/resolv.conf, or it can use a cgo-based resolver that calls C library routines such as getaddrinfo and getnameinfo.
By default the pure Go resolver is used, because a blocked DNS request consumes only a goroutine, while a blocked C call consumes an operating system thread. When cgo is available, the cgo-based resolver is used instead under a variety of conditions: on systems that do not let programs make direct DNS requests (OS X), when the LOCALDOMAIN environment variable is present (even if empty), when the RES_OPTIONS or HOSTALIASES environment variable is non-empty, when the ASR_CONFIG environment variable is non-empty (OpenBSD only), when /etc/resolv.conf or /etc/nsswitch.conf specify the use of features that the Go resolver does not implement, and when the name being looked up ends in .local or is an mDNS name.
The resolver decision can be overridden by setting the netdns value of the GODEBUG environment variable (see package runtime) to go or cgo, as in:
export GODEBUG=netdns=go # force pure Go resolver
export GODEBUG=netdns=cgo # force cgo resolver

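Following this lead, we compared the low-level configuration files of an earlier environment where deployment succeeded with a recent environment where it failed, and eventually found the difference: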

In a container in the old environment (blockchain deployment succeeded):

cat /etc/resolv.conf

nameserver 127.0.0.11
options ndots:0
In a container in the new environment (blockchain deployment failed):

cat /etc/resolv.conf

nameserver 127.0.0.11
options timeout:2 attempts:3 rotate single-request-reopen ndots:0
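Because of this difference, the old, working environment used the pure Go resolver, while the new, failing environment was switched to the cgo resolver: its resolv.conf contains the option single-request-reopen, which the pure Go resolver does not support.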

Note: at the revision linked below, the pure Go resolver supports only the options ndots, timeout, attempts, and rotate:
https://github.com/golang/go/blob/964639cc338db650ccadeafb7424bc8ebb2c0f6c/src/net/dnsconfig_unix.go

case "options": // magic options
        for _, s := range f[1:] {
            switch {
            case hasPrefix(s, "ndots:"):
                n, _, _ := dtoi(s[6:])
                if n < 0 {
                    n = 0
                } else if n > 15 {
                    n = 15
                }
                conf.ndots = n
            case hasPrefix(s, "timeout:"):
                n, _, _ := dtoi(s[8:])
                if n < 1 {
                    n = 1
                }
                conf.timeout = time.Duration(n) * time.Second
            case hasPrefix(s, "attempts:"):
                n, _, _ := dtoi(s[9:])
                if n < 1 {
                    n = 1
                }
                conf.attempts = n
            case s == "rotate":
                conf.rotate = true
            default:
                conf.unknownOpt = true
            }
        }

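Going further, we tried to work out what caused resolv.conf inside the new and old containers to differ, and found that the configuration file on the host ECS instances had changed recently: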

Failing environment, a newly created ECS instance:

cat /etc/resolv.conf

# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
# DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN

nameserver 100.100.2.138
nameserver 100.100.2.136
options timeout:2 attempts:3 rotate single-request-reopen
Working environment, an existing ECS instance:

cat /etc/resolv.conf

# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
# DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN

nameserver 100.100.2.136
nameserver 100.100.2.138
Separately, we also tried to understand why switching to the cgo resolver produces a SIGSEGV. The following article explains that statically linking cgo can cause SIGSEGV errors:
https://tschottdorf.github.io/golang-static-linking-bug

And this Hyperledger Fabric bug points out that the Hyperledger Fabric build (in particular the getaddrinfo-related functions) is indeed statically linked:
https://jira.hyperledger.org/browse/FAB-6403

At this point we had found the root cause and could reconstruct the whole chain of events:

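The resolv.conf on recently created ECS hosts changed -> name resolution inside the Hyperledger Fabric containers switched from the pure Go resolver to the cgo resolver -> this triggered a known SIGSEGV caused by statically linked cgo -> Hyperledger Fabric deployment failed.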
Recommended Solution
Update the Hyperledger Fabric docker compose yaml templates to add the environment variable GODEBUG=netdns=go to all Hyperledger Fabric nodes (orderer, peer, ca, cli, and so on), forcing the use of the pure Go resolver.
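
One way to confirm that the variable actually reached a container is a small diagnostic that parses GODEBUG. The sketch below is an illustration and not part of Fabric: the function name forcesGoResolver is hypothetical, and taking the last netdns entry when the key is repeated is an assumption of this sketch. It accepts the documented values go and cgo, with an optional "+N" debug-level suffix such as go+1.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// forcesGoResolver (hypothetical helper) reports whether a GODEBUG value
// selects the pure Go resolver through its netdns setting.
func forcesGoResolver(godebug string) bool {
	forced := false
	for _, kv := range strings.Split(godebug, ",") {
		parts := strings.SplitN(strings.TrimSpace(kv), "=", 2)
		if len(parts) != 2 || parts[0] != "netdns" {
			continue
		}
		// Drop an optional "+N" debug-level suffix, e.g. "go+1".
		mode := strings.SplitN(parts[1], "+", 2)[0]
		// Assumption: if netdns appears more than once, the last entry is used.
		forced = mode == "go"
	}
	return forced
}

func main() {
	if !forcesGoResolver(os.Getenv("GODEBUG")) {
		fmt.Println("warning: GODEBUG does not contain netdns=go; the cgo resolver may be selected")
		return
	}
	fmt.Println("GODEBUG forces the pure Go resolver")
}
```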

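Alibaba Cloud Container Service Blockchain Solution
On Alibaba Cloud Container Service we provide developers with a basic solution for automated configuration and deployment of Hyperledger Fabric. It shields developers from the complex low-level operations so that they can focus on innovating in their blockchain business applications. If you are interested in learning more, see: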

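Introduction to the Alibaba Cloud Container Service blockchain solution
Product documentation for the Alibaba Cloud Container Service blockchain solution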

Reposted from https://yq.aliyun.com/articles/238940