Golang Data Race Detector

Date: 2019-11-12

For a better reading experience, follow the link to the original post.
Original post: http://maoqide.live/post/golang/golang-data-race-detector/

[Translation] https://golang.google.cn/doc/articles/race_detector.html

Several data race scenarios in Go, and the tool for detecting data races.

Introduction

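Data races are among the most common and hardest-to-debug types of bugs in concurrent systems. A data race occurs when two goroutines access the same variable concurrently and at least one of the accesses is a write. See The Go Memory Model for details.
Here is an example of a data race that can lead to crashes and memory corruption: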

package main

import "fmt"

func main() {
    c := make(chan bool)
    m := make(map[string]string)
    go func() {
        m["1"] = "a" // First conflicting access.
        c <- true
    }()
    m["2"] = "b" // Second conflicting access.
    <-c
    for k, v := range m {
        fmt.Println(k, v)
    }
}
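
The example above shows the race but not a repair. One possible fix, sketched here under the assumption that a sync.Mutex is acceptable for this program, is to take the same lock around both map writes. The helper name fillMap is introduced only for illustration and is not part of the original article.

```go
package main

import (
	"fmt"
	"sync"
)

// fillMap (hypothetical name) writes two keys from two goroutines.
// Every map write happens while holding mu, so the writes cannot overlap.
func fillMap() map[string]string {
	var mu sync.Mutex
	m := make(map[string]string)
	c := make(chan bool)
	go func() {
		mu.Lock()
		m["1"] = "a" // Write from the child goroutine, under the lock.
		mu.Unlock()
		c <- true
	}()
	mu.Lock()
	m["2"] = "b" // Write from the calling goroutine, under the same lock.
	mu.Unlock()
	<-c // The child's write is complete once this receive returns.
	return m
}

func main() {
	m := fillMap()
	for k, v := range m {
		fmt.Println(k, v)
	}
}
```

Running this version with go run -race should produce no report, because the two writes are ordered by the mutex and the final read is ordered by the channel receive.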

Usage

To help diagnose such bugs, Go includes a built-in data race detector. To use it, add the -race flag to the go command:

$ go test -race mypkg    // to test the package
$ go run -race mysrc.go  // to run the source file
$ go build -race mycmd   // to build the command
$ go install -race mypkg // to install the package

Report Format

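When the race detector finds a data race in the program, it prints a report. The report contains stack traces for the conflicting accesses, as well as the stacks where the involved goroutines were created. Here is an example: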

WARNING: DATA RACE
Read by goroutine 185:
  net.(*pollServer).AddFD()
      src/net/fd_unix.go:89 +0x398
  net.(*pollServer).WaitWrite()
      src/net/fd_unix.go:247 +0x45
  net.(*netFD).Write()
      src/net/fd_unix.go:540 +0x4d4
  net.(*conn).Write()
      src/net/net.go:129 +0x101
  net.func·060()
      src/net/timeout_test.go:603 +0xaf

Previous write by goroutine 184:
  net.setWriteDeadline()
      src/net/sockopt_posix.go:135 +0xdf
  net.setDeadline()
      src/net/sockopt_posix.go:144 +0x9c
  net.(*conn).SetDeadline()
      src/net/net.go:161 +0xe3
  net.func·061()
      src/net/timeout_test.go:616 +0x3ed

Goroutine 185 (running) created at:
  net.func·061()
      src/net/timeout_test.go:609 +0x288

Goroutine 184 (running) created at:
  net.TestProlongTimeout()
      src/net/timeout_test.go:618 +0x298
  testing.tRunner()
      src/testing/testing.go:301 +0xe8

Options

The GORACE environment variable sets race detector options. The format is:

GORACE="option1=val1 option2=val2"

The options are:

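log_path (default stderr): The race detector writes its report to a file named log_path.pid. The special names stdout and stderr cause reports to be written to standard output and standard error, respectively.
exitcode (default 66): The exit status to use when exiting after a detected race.
strip_path_prefix (default ""): Strip this prefix from all reported file paths, to make reports more concise.
history_size (default 1): The per-goroutine memory access history is 32K * 2**history_size elements. Increasing this value can avoid a "failed to restore the stack" error in reports, at the cost of increased memory usage.
halt_on_error (default 0): Controls whether the program exits after reporting the first data race.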
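
To make the history_size formula concrete, the following sketch evaluates 32K * 2**history_size for a few values. It assumes "32K" means 32*1024; the function name historyElements is hypothetical and is not part of the race detector.

```go
package main

import "fmt"

// historyElements (hypothetical name) evaluates 32K * 2**historySize,
// reading "32K" as 32*1024, which is an assumption about the notation.
func historyElements(historySize uint) int {
	base := 32 * 1024
	return base << historySize // Shifting left by n multiplies by 2**n.
}

func main() {
	for size := uint(0); size <= 3; size++ {
		fmt.Printf("history_size=%d -> %d elements\n", size, historyElements(size))
	}
}
```

Under that reading, the default history_size=1 corresponds to 65536 elements per goroutine, and each increment doubles the figure.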

Example:

$ GORACE="log_path=/tmp/race/report strip_path_prefix=/my/go/sources/" go test -race

Excluding Tests

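When you build with the -race flag, the go command defines an additional build tag, race. You can use this tag to exclude some code and tests when running the race detector. Some examples: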

// +build !race

package foo

// The test contains a data race. See issue 123.
func TestFoo(t *testing.T) {
    // ...
}

// The test fails under the race detector due to timeouts.
func TestBar(t *testing.T) {
    // ...
}

// The test takes too long under the race detector.
func TestBaz(t *testing.T) {
    // ...
}

How To Use

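To start, run your tests using the race detector (go test -race). The race detector only finds races that happen at runtime, so it cannot find races in code paths that are not executed. If your tests have incomplete coverage, you may find more races by running a binary built with -race under a realistic workload.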

Typical Data Races

Here are some typical data races. All of them can be detected with the race detector.

Race on loop counter

package main

import (
    "fmt"
    "sync"
)

func main() {
    var wg sync.WaitGroup
    wg.Add(5)
    for i := 0; i < 5; i++ {
        go func() {
            fmt.Println(i) // Not the 'i' you are looking for.
            wg.Done()
        }()
    }
    wg.Wait()
}

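The variable i in the function literal is the same variable used by the loop, so the read in the goroutine races with the loop increment. (This program typically prints 55555, not 01234.) Since Go 1.22, each loop iteration gets its own i in modules that declare go 1.22 or later, so this race applies to earlier language versions. The problem can be fixed by making a copy of the variable i: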

package main

import (
    "fmt"
    "sync"
)

func main() {
    var wg sync.WaitGroup
    wg.Add(5)
    for i := 0; i < 5; i++ {
        go func(j int) {
            fmt.Println(j) // Good. Read local copy of the loop counter.
            wg.Done()
        }(i)
    }
    wg.Wait()
}

Accidentally shared variable

// ParallelWrite writes data to file1 and file2, returns the errors.
func ParallelWrite(data []byte) chan error {
    res := make(chan error, 2)
    f1, err := os.Create("file1")
    if err != nil {
        res <- err
    } else {
        go func() {
            // This err is shared with the main goroutine,
            // so the write races with the write below.
            _, err = f1.Write(data)
            res <- err
            f1.Close()
        }()
    }
    f2, err := os.Create("file2") // The second conflicting write to err.
    if err != nil {
        res <- err
    } else {
        go func() {
            _, err = f2.Write(data)
            res <- err
            f2.Close()
        }()
    }
    return res
}

The fix is to introduce new variables in the goroutines (note the use of :=):

...
            _, err := f1.Write(data)
            ...
            _, err := f2.Write(data)
            ...

Unprotected global variable

If the following code is called from several goroutines, it leads to races on the service map. Concurrent reads and writes of the same map are not safe:

var service map[string]net.Addr

func RegisterService(name string, addr net.Addr) {
    service[name] = addr
}

func LookupService(name string) net.Addr {
    return service[name]
}

To make the code safe, protect the accesses with a mutex:

var (
    service   map[string]net.Addr
    serviceMu sync.Mutex
)

func RegisterService(name string, addr net.Addr) {
    serviceMu.Lock()
    defer serviceMu.Unlock()
    service[name] = addr
}

func LookupService(name string) net.Addr {
    serviceMu.Lock()
    defer serviceMu.Unlock()
    return service[name]
}

Primitive unprotected variable

Data races can happen on variables of primitive types as well (bool, int, int64, etc.), as in this example:

type Watchdog struct{ last int64 }

func (w *Watchdog) KeepAlive() {
    w.last = time.Now().UnixNano() // First conflicting access.
}

func (w *Watchdog) Start() {
    go func() {
        for {
            time.Sleep(time.Second)
            // Second conflicting access.
            if w.last < time.Now().Add(-10*time.Second).UnixNano() {
                fmt.Println("No keepalives for 10 seconds. Dying.")
                os.Exit(1)
            }
        }
    }()
}

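Even such "innocent" data races can lead to hard-to-debug problems caused by non-atomicity of the memory accesses, interference with compiler optimizations, or reordering issues when accessing processor memory.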

A typical fix for this race is to use a channel or a mutex. To preserve the lock-free behavior, one can also use the sync/atomic package.

type Watchdog struct{ last int64 }

func (w *Watchdog) KeepAlive() {
    atomic.StoreInt64(&w.last, time.Now().UnixNano())
}

func (w *Watchdog) Start() {
    go func() {
        for {
            time.Sleep(time.Second)
            if atomic.LoadInt64(&w.last) < time.Now().Add(-10*time.Second).UnixNano() {
                fmt.Println("No keepalives for 10 seconds. Dying.")
                os.Exit(1)
            }
        }
    }()
}