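Golang - Scheduling in Depth, Part III

Date: 2019-11-07

Original: link to the original article

This is the third part of the scheduling series and focuses on concurrency.
Recap:
Part I
Part II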

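Introduction

First, when I run into a problem, especially a new one, I don't start by considering a concurrent design. I first implement the sequential logic and make sure it works. Then, after reviewing it for readability and technical soundness, I start asking whether concurrency is practical and feasible. Sometimes concurrency is a good choice, and sometimes it is not.

In the first part of this series, I explained the mechanics and semantics of the OS scheduler, which I believe are important for getting the logic right if you plan to write multithreaded code. In the second part, I explained the semantics of the Go scheduler, which I believe help you understand how to write high-quality concurrent programs in Go. In this post, I bring the mechanics and semantics of the OS scheduler and the Go scheduler together to give a deeper understanding of what concurrency is and what its nature is.

What Is Concurrency

Concurrency means out-of-order execution. Take a set of instructions that would otherwise execute in sequence, and find a way to execute them out of order while still producing the same result. So, sequential or out of order? The key is that, for the problem at hand, concurrency must pay off. More precisely, the performance gain from concurrency must outweigh the complexity it adds. In some scenarios the logic itself forbids out-of-order execution, and concurrency then makes no sense.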

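Concurrency and Parallelism

It is also important to understand how concurrency differs from parallelism. Parallelism means executing two or more instructions at the same time; put simply, it only happens across multiple CPU cores. In Go, parallelism requires at least two operating system and hardware threads and at least two Goroutines, with each Goroutine executing instructions on its own OS thread.

[Figure: concurrency vs. parallelism]

We see two logical processors P, each attached to an OS thread M, and each M mapped to a CPU core (Core) on the machine.
Two Goroutines, G1 and G2, are executing in parallel, because they are executing instructions on their respective OS/hardware threads at the same time.
Within each logical processor, three Goroutines (G2, G3, G5 or G1, G4, G6) take turns sharing their OS thread. It looks as if the three Goroutines are running at the same time, executing their instructions in no particular order and sharing time on the OS thread.
This is where contention comes in: sometimes concurrency on a single physical core actually lowers throughput. Interestingly, concurrency with parallelism sometimes also fails to deliver as large a performance gain as you might expect.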

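Workloads

How do we tell when concurrency makes sense? We start by understanding the type of workload the logic is handling. Two types of workload matter when thinking about concurrency.

Two Types

CPU-Bound: a workload that never causes Goroutines to naturally context-switch into a waiting state. It computes continuously. For example, a Goroutine calculating π to the Nth digit is CPU-Bound.

IO-Bound: the opposite, a workload that causes Goroutines to naturally enter waiting states. It includes requesting access to a resource over the network, making system calls into the operating system, or waiting for an event to occur. For example, a Goroutine that needs to read a file is IO-Bound. I also put synchronization events (mutexes, atomics) that cause a Goroutine to wait in this category.

With CPU-Bound workloads, we need parallelism. A single OS thread handling multiple Goroutines is inefficient. Using more Goroutines than OS threads also slows execution down, because switching Goroutines on an OS thread has a time cost. For your workload, each context switch amounts to a "Stop The World" (STW) event: none of its instructions are executed during the switch.

With IO-Bound workloads, parallelism is not required. A single OS thread can handle multiple Goroutines efficiently, because the Goroutines naturally move in and out of waiting states as they run. Using more Goroutines than OS threads can speed execution up, because switching Goroutines on the OS thread does not create an STW event in this case. While one Goroutine is blocked on IO, the CPU would otherwise sit idle, so different Goroutines can reuse the same thread efficiently instead of leaving it idle.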

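How many Goroutines per OS thread is best? Too few, and the hardware is underused; too many, and context-switch latency grows. This is worth thinking about, but we will not go into it here.

For now, what matters more is to study code carefully so that we can identify when concurrency is needed, when it should not be used, and whether parallelism is required.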

Adding Numbers

We don't need complex code to show and understand these semantics. Look at the following function named add:

1 func add(numbers []int) int {
2     var v int
3     for _, n := range numbers {
4         v += n
5     }
6     return v
7 }

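Line 1 declares a function named add that takes a slice of integers and returns the sum of all its elements. Line 2 declares the variable v to hold the sum. Line 3 walks the slice linearly, adding each number to v. Finally, line 6 returns the total to the caller.

Question: is add suited to concurrent execution? Broadly, yes. The input slice can be split into smaller slices that are processed at the same time. Adding up the result of each smaller slice then yields the same final result as the sequential version.

This raises another question: how many smaller slices give the best performance? To answer that, we need to know the workload type.
add performs a CPU-Bound workload: the algorithm does pure arithmetic and never causes a Goroutine to enter a waiting state. This means one Goroutine per OS thread is enough for good throughput.

Concurrent Version

Here is a concurrent version, a function named addConcurrent. It takes noticeably more code than the sequential version.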

1 func addConcurrent(goroutines int, numbers []int) int {
2     var v int64
3     totalNumbers := len(numbers)
4     lastGoroutine := goroutines - 1
5     stride := totalNumbers / goroutines
6
7     var wg sync.WaitGroup
8     wg.Add(goroutines)
9
10     for g := 0; g < goroutines; g++ {
11         go func(g int) {
12             start := g * stride
13             end := start + stride
14             if g == lastGoroutine {
15                 end = totalNumbers
16             }
17
18             var lv int
19             for _, n := range numbers[start:end] {
20                 lv += n
21             }
22
23             atomic.AddInt64(&v, int64(lv))
24             wg.Done()
25         }(g)
26     }
27
28     wg.Wait()
29
30     return int(v)
31 }

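Line 5: computes the size of each Goroutine's sub-slice by dividing the length of the input slice by the number of Goroutines.
Line 10: creates the Goroutines that perform the sub-tasks.
Lines 14-16: the last Goroutine takes all remaining elements, so it may process more data than the others.
Line 23: atomically adds the partial sum to the final total.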
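
As a quick sanity check, the two functions can be put into one small program to confirm that they return the same sum. This is a sketch: the functions are copied from the listings above without line numbers, and the input (the numbers 1 to 1000) and the Goroutine count are made up for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// add sums the slice sequentially.
func add(numbers []int) int {
	var v int
	for _, n := range numbers {
		v += n
	}
	return v
}

// addConcurrent splits the slice into one chunk per Goroutine and
// combines the partial sums atomically.
func addConcurrent(goroutines int, numbers []int) int {
	var v int64
	totalNumbers := len(numbers)
	lastGoroutine := goroutines - 1
	stride := totalNumbers / goroutines

	var wg sync.WaitGroup
	wg.Add(goroutines)

	for g := 0; g < goroutines; g++ {
		go func(g int) {
			start := g * stride
			end := start + stride
			if g == lastGoroutine {
				end = totalNumbers // the last Goroutine takes the remainder
			}

			var lv int
			for _, n := range numbers[start:end] {
				lv += n
			}

			atomic.AddInt64(&v, int64(lv))
			wg.Done()
		}(g)
	}

	wg.Wait()
	return int(v)
}

func main() {
	// Illustrative input: the numbers 1 through 1000.
	numbers := make([]int, 1000)
	for i := range numbers {
		numbers[i] = i + 1
	}

	fmt.Println(add(numbers), addConcurrent(8, numbers)) // prints "500500 500500"
}
```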

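The concurrent version is certainly more complex than the sequential one, but does the performance gain justify the added complexity? Is it worth it? Let the facts speak: run a benchmark.

Benchmarks

For the benchmarks below, I used a slice of 10 million numbers with the garbage collector turned off. There is one benchmark for the sequential add function and one for the concurrent addConcurrent function.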

func BenchmarkSequential(b *testing.B) {
    for i := 0; i < b.N; i++ {
        add(numbers)
    }
}

func BenchmarkConcurrent(b *testing.B) {
    for i := 0; i < b.N; i++ {
        addConcurrent(runtime.NumCPU(), numbers)
    }
}

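Without Parallelism

These are the results when all Goroutines have only one hardware thread available. The sequential version uses 1 Goroutine; the concurrent version uses runtime.NumCPU Goroutines, which is 8 on my machine. In this case, the concurrent version runs without parallelism.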

10 Million Numbers using 8 goroutines with 1 core
2.9 GHz Intel 4 Core i7
Concurrency WITHOUT Parallelism
-----------------------------------------------------------------------------
$ GOGC=off go test -cpu 1 -run none -bench . -benchtime 3s
goos: darwin
goarch: amd64
pkg: github.com/ardanlabs/gotraining/topics/go/testing/benchmarks/cpu-bound
BenchmarkSequential              1000       5720764 ns/op : ~10% Faster
BenchmarkConcurrent              1000       6387344 ns/op
BenchmarkSequentialAgain         1000       5614666 ns/op : ~13% Faster
BenchmarkConcurrentAgain         1000       6482612 ns/op

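The results show that with a single OS thread for all Goroutines, the sequential version is about 10% to 13% faster than the concurrent one. This matches the earlier reasoning: the concurrent version pays for context switches on the single OS thread and for managing and scheduling the Goroutines.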
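
The article does not say how the "Faster" percentages were computed. One formula that reproduces the figures is the time saved relative to the slower run; the sketch below assumes that formula, and the helper name fasterBy is made up for illustration.

```go
package main

import "fmt"

// fasterBy returns, as a percentage of the slower time, how much time the
// faster run saves. This formula is an assumption that happens to match
// the percentages quoted next to the benchmark output.
func fasterBy(slowNs, fastNs float64) float64 {
	return (slowNs - fastNs) / slowNs * 100
}

func main() {
	// ns/op values taken from the single-core run above.
	fmt.Printf("first pair:  %.1f%%\n", fasterBy(6387344, 5720764)) // about 10%
	fmt.Printf("second pair: %.1f%%\n", fasterBy(6482612, 5614666)) // about 13%
}
```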

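With Parallelism

These are the results when each Goroutine has its own OS thread available. The sequential version uses 1 Goroutine; the concurrent version uses runtime.NumCPU Goroutines, which is 8 on my machine. In this case, the concurrent version takes advantage of parallelism.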

10 Million Numbers using 8 goroutines with 8 cores
2.9 GHz Intel 4 Core i7
Concurrency WITH Parallelism
-----------------------------------------------------------------------------
$ GOGC=off go test -cpu 8 -run none -bench . -benchtime 3s
goos: darwin
goarch: amd64
pkg: github.com/ardanlabs/gotraining/topics/go/testing/benchmarks/cpu-bound
BenchmarkSequential-8                1000       5910799 ns/op
BenchmarkConcurrent-8                2000       3362643 ns/op : ~43% Faster
BenchmarkSequentialAgain-8           1000       5933444 ns/op
BenchmarkConcurrentAgain-8           2000       3477253 ns/op : ~41% Faster

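The results show that with a separate OS thread per Goroutine, the concurrent version is about 41% to 43% faster than the sequential one. This also matches expectations: all Goroutines now run in parallel, meaning they really do execute at the same time.

Sorting

Not every CPU-Bound workload suits concurrency, though. It is a poor fit when splitting the input or combining the results is very expensive. The bubble sort below illustrates this.

Sequential Version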

package main

import "fmt"

// bubbleSort sorts numbers in place, stopping early once a full pass
// makes no swaps.
func bubbleSort(numbers []int) {
    n := len(numbers)
    for i := 0; i < n; i++ {
        if !sweep(numbers, i) {
            return
        }
    }
}

// sweep makes one pass over the unsorted part of the slice, swapping
// adjacent values that are out of order. It reports whether any swap occurred.
func sweep(numbers []int, currentPass int) bool {
    var idx int
    idxNext := idx + 1
    n := len(numbers)
    var swap bool

    for idxNext < (n - currentPass) {
        a := numbers[idx]
        b := numbers[idxNext]
        if a > b {
            numbers[idx] = b
            numbers[idxNext] = a
            swap = true
        }
        idx++
        idxNext = idx + 1
    }
    return swap
}

func main() {
    org := []int{1, 3, 2, 4, 8, 6, 7, 2, 3, 0}
    fmt.Println(org)

    bubbleSort(org)
    fmt.Println(org)
}

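This sorting algorithm sweeps through the slice, swapping values on each pass. Several passes over the slice may be needed before everything is sorted.

Question: is bubbleSort suited to concurrency? I believe the answer is no. The original slice can be split into smaller ones that are sorted at the same time. But once the concurrent work is done, there is no efficient way to sort the smaller slices together. Here is what a concurrent version looks like.

Concurrent Version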

01 func bubbleSortConcurrent(goroutines int, numbers []int) {
02     totalNumbers := len(numbers)
03     lastGoroutine := goroutines - 1
04     stride := totalNumbers / goroutines
05
06     var wg sync.WaitGroup
07     wg.Add(goroutines)
08
09     for g := 0; g < goroutines; g++ {
10         go func(g int) {
11             start := g * stride
12             end := start + stride
13             if g == lastGoroutine {
14                 end = totalNumbers
15             }
16
17             bubbleSort(numbers[start:end])
18             wg.Done()
19         }(g)
20     }
21
22     wg.Wait()
23
24     // Ugh, we have to sort the entire list again.
25     bubbleSort(numbers)
26 }

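bubbleSortConcurrent uses multiple Goroutines to sort portions of the input at the same time. The output below shows 36 numbers split into three groups of 12, as they stand after the Goroutines finish and before the final call on line 25: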

Before:
  25 51 15 57 87 10 10 85 90 32 98 53
  91 82 84 97 67 37 71 94 26  2 81 79
  66 70 93 86 19 81 52 75 85 10 87 49

After:
  10 10 15 25 32 51 53 57 85 87 90 98
   2 26 37 67 71 79 81 82 84 91 94 97
  10 19 49 52 66 70 75 81 85 86 87 93

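Because bubble sort works by sweeping through the list pass after pass, the call to bubbleSort on line 25 wipes out any potential gain from concurrency. The conclusion: concurrency brings no performance gain to bubble sort.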

Reading Files

The previous two examples were CPU-Bound. Now let's look at an IO-Bound one.

Sequential Version

01 func find(topic string, docs []string) int {
02     var found int
03     for _, doc := range docs {
04         items, err := read(doc)
05         if err != nil {
06             continue
07         }
08         for _, item := range items {
09             if strings.Contains(item.Description, topic) {
10                 found++
11             }
12         }
13     }
14     return found
15 }

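Line 2: declares a variable named found to count how many times the topic is found in the given documents.
Lines 3-4: iterate over the documents and read each one with the read function.
Lines 8-11: use strings.Contains to check whether each item's description contains the topic. If it does, found is incremented by 1.

Here is the implementation of read.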

01 func read(doc string) ([]item, error) {
02     time.Sleep(time.Millisecond) // simulate a blocking read
03     var d document
04     if err := xml.Unmarshal([]byte(file), &d); err != nil {
05         return nil, err
06     }
07     return d.Channel.Items, nil
08 }

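The function starts with a call to time.Sleep for 1 millisecond. The call simulates the latency we could see if we made a real system call to read the document from disk. Keeping this latency consistent is important for accurately measuring the performance difference between the sequential and concurrent versions of find.
Then, on lines 03-07, the mock XML document stored in the global variable file is unmarshaled into a struct value. Finally, the Items are returned to the caller.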
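
The listing above relies on the types item and document and on the global variable file, none of which are shown in this article. The sketch below fills them in so that read can be run on its own. The struct layout, the XML tags, and the sample XML content are all assumptions made for illustration, not the author's actual declarations.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"time"
)

// Assumed shapes: the article does not show these declarations.
type item struct {
	Title       string `xml:"title"`
	Description string `xml:"description"`
}

type channel struct {
	Items []item `xml:"item"`
}

type document struct {
	Channel channel `xml:"channel"`
}

// file is a made-up stand-in for the mock XML document.
var file = `<rss>
  <channel>
    <item><title>Go</title><description>a test of the scheduler</description></item>
    <item><title>Other</title><description>nothing relevant</description></item>
  </channel>
</rss>`

// read ignores doc and unmarshals the mock document, as in the listing above.
func read(doc string) ([]item, error) {
	time.Sleep(time.Millisecond) // simulate a blocking read
	var d document
	if err := xml.Unmarshal([]byte(file), &d); err != nil {
		return nil, err
	}
	return d.Channel.Items, nil
}

func main() {
	items, err := read("any-document-name")
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	for _, it := range items {
		fmt.Println(it.Title, "-", it.Description)
	}
}
```

Because every call returns the same mock content after the same delay, the sequential and concurrent versions of find do identical work per document, and only the scheduling differs.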

Concurrent Version

01 func findConcurrent(goroutines int, topic string, docs []string) int {
02     var found int64
03
04     ch := make(chan string, len(docs))
05     for _, doc := range docs {
06         ch <- doc
07     }
08     close(ch)
09
10     var wg sync.WaitGroup
11     wg.Add(goroutines)
12
13     for g := 0; g < goroutines; g++ {
14         go func() {
15             var lFound int64
16             for doc := range ch {
17                 items, err := read(doc)
18                 if err != nil {
19                     continue
20                 }
21                 for _, item := range items {
22                     if strings.Contains(item.Description, topic) {
23                         lFound++
24                     }
25                 }
26             }
27             atomic.AddInt64(&found, lFound)
28             wg.Done()
29         }()
30     }
31
32     wg.Wait()
33
34     return int(found)
35 }

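Lines 4-7: create a channel and send it every document to process.
Line 8: close the channel, so that each Goroutine's range loop exits once all documents have been received.
Lines 16-26: each Goroutine receives documents from the same channel; the read and strings.Contains logic matches the sequential version.
Line 27: adds each Goroutine's count into the final total.

Benchmarks

Again, we run benchmarks to check our conclusions.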

func BenchmarkSequential(b *testing.B) {
    for i := 0; i < b.N; i++ {
        find("test", docs)
    }
}

func BenchmarkConcurrent(b *testing.B) {
    for i := 0; i < b.N; i++ {
        findConcurrent(runtime.NumCPU(), "test", docs)
    }
}

Without Parallelism

10 Thousand Documents using 8 goroutines with 1 core
2.9 GHz Intel 4 Core i7
Concurrency WITHOUT Parallelism
-----------------------------------------------------------------------------
$ GOGC=off go test -cpu 1 -run none -bench . -benchtime 3s
goos: darwin
goarch: amd64
pkg: github.com/ardanlabs/gotraining/topics/go/testing/benchmarks/io-bound
BenchmarkSequential                 3    1483458120 ns/op
BenchmarkConcurrent                20     188941855 ns/op : ~87% Faster
BenchmarkSequentialAgain            2    1502682536 ns/op
BenchmarkConcurrentAgain           20     184037843 ns/op : ~88% Faster

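With a single OS thread, the concurrent version is about 87% to 88% faster than the sequential one. This matches expectations, because all the Goroutines share the single OS thread efficiently.

With Parallelism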

10 Thousand Documents using 8 goroutines with 8 cores
2.9 GHz Intel 4 Core i7
Concurrency WITH Parallelism
-----------------------------------------------------------------------------
$ GOGC=off go test -run none -bench . -benchtime 3s
goos: darwin
goarch: amd64
pkg: github.com/ardanlabs/gotraining/topics/go/testing/benchmarks/io-bound
BenchmarkSequential-8                   3    1490947198 ns/op
BenchmarkConcurrent-8                  20     187382200 ns/op : ~88% Faster
BenchmarkSequentialAgain-8              3    1416126029 ns/op
BenchmarkConcurrentAgain-8             20     185965460 ns/op : ~87% Faster

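Interestingly, adding OS threads to provide parallelism brings no real performance improvement. This confirms the claim made at the start.

Conclusion

We can clearly see that IO-Bound workloads do not need parallelism to get a big performance gain. This is the opposite of what we saw with CPU-Bound work. For algorithms like bubble sort, concurrency adds complexity with no real performance benefit. So, when considering a solution, first determine whether it suits concurrency, rather than assuming that more Goroutines will always improve performance.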
