Go 程序性能分析 101

時間 2020-01-27

原文原文鏈接

過早優化是萬惡之源。html

前言

本文來自 GopherCon 2019 的一個演講，首先從一個計算文本字數的 go 程序與 wc 對比，經過 profile CPU 與內存逐步優化性能，涉及 Go 的併發模型及逃逸分析。而後由一個繪製曼德勃羅分形圖的程序，討論併發程序的執行追蹤（trace）及併發程序能帶來的性能提高的邊界。git

pprof & trace

Go 內置的生態提供了大量的 API 及工具用於診斷程序的邏輯及性能問題。它們大體能夠被分爲如下幾類：github

Profiling：Profiling 工具（如 pprof）用於分析程序中的複雜度與開銷，如內存使用及函數調用頻率等，並經過它們來標識程序中開銷最大的部分。
Tracing：Tracing 用於分析某次調用或用戶請求的整個流程中的延遲。其支持跨進程，並展現整個系統中每一個組件所花費的時間。
Debugging：Debugging 可以檢驗程序狀態及執行流，可以暫停程序並檢驗其執行。
Runtime statistics and events：收集分析運行時統計及事件並提供程序健康情況的高級綜述。尖峯/度的傾角可以幫助咱們肯定吞吐量，利用率和性能的變化。

更多關於 pprof 與 trace 的信息見：golang.org/doc/diagnos…golang

使用 pprof 分析 CPU 或內存使用

以下是一個文本字數計算程序，帶着幾分 Code smells，咱們將經過 pprof 逐步分析程序中到底有哪些問題。web

package main

import (
	"fmt"
	"io"
	"log"
	"os"
	"unicode"
)

func readbyte(r io.Reader) (rune, error) {
	var buf [1]byte
	_, err := r.Read(buf[:])
	return rune(buf[0]), err
}

func main() {
	f, err := os.Open(os.Args[1])
	if err != nil {
		log.Fatalf("could not open file %q: %v", os.Args[1], err)
	}

	words := 0
	inword := false
	for {
		r, err := readbyte(f)
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatalf("could not read file %q: %v", os.Args[1], err)
		}
		if unicode.IsSpace(r) && inword {
			words++
			inword = false
		}
		inword = unicode.IsLetter(r)
	}
	fmt.Printf("%q: %d words\n", os.Args[1], words)
}
複製代碼

這裏咱們有一份用於測試程序的文本文件，1.2 M 大小。數組

$ ls -lh moby.txt
-rw-r--r--  1 f1renze  staff   1.2M Jan 19 16:32 moby.txt
複製代碼

咱們跑一下程序看看，讀取 + 單詞計數一共花了 2 秒鐘，這好像不太行。緩存

$ time go run main.go moby.txt
"moby.txt": 181275 words
        2.13 real         1.41 user         1.81 sys
複製代碼

咱們先 profile 一下 CPU，以下在代碼中加上生成 profile 文件的代碼段：bash

func main() {
  # 添加如下代碼
	cpuProfile, _ := os.Create("cpu_profile")
	pprof.StartCPUProfile(cpuProfile)
	defer pprof.StopCPUProfile()
	
	....
複製代碼

從新執行程序而後以 web 形式啓動 pprof（須要先安裝 Graphviz）：併發

go tool pprof -http=:8081 cpu_profile
複製代碼

默認界面就是一個詳細的函數調用關係、耗時圖。這裏主要有三個分支。svg

先來看最右邊，syscall.syscall 共佔用 cpu 0.93s，爲啥在 syscall 上花了這麼多時間嘞？咱們帶着問題看一下原始代碼，看到在 for 循環中程序不斷調用 readbyte 方法讀取文件：

...
for {
		r, err := readbyte(f)
		if err == io.EOF {
			break
		}
...
複製代碼

而在此方法中，每次都只讀取一個字節：

func readbyte(r io.Reader) (rune, error) {
	var buf [1]byte
	_, err := r.Read(buf[:])
	return rune(buf[0]), err
}
複製代碼

問題就出在這！

正是未緩衝的頻繁讀取致使了長時間的 syscall 佔用，這還會致使另一個問題，看調用棧的左邊，runtime.pthread_cond_signal 與 runtime.pthread_cond_wait 分別花費了 0.58s 與 1s。

因爲 go 的協程由內置的調度器進行調度，即 GMP 模型，每一個 P 分配一個 M （實際 OS 線程）執行 G，當執行 G 的 M 陷入同步阻塞系統調用（如文件 IO）時，P 會被轉移到新的 M （新建的 OS 線程或線程緩存）上。以下圖：

M1 執行 G 時陷入系統調用，調度器將 M1 與 P 分離（G1 仍與 M1 鏈接），併爲 P 分配 M2 ，而後從 P 的 Local Run Queue 選擇 G2 執行（此時在 M2 上進行上下文切換）。

當 G1 發起的同步阻塞調用執行完時，G1 會被移回 P 的 Local Run Queue 。而 M1 加入線程緩存中。

因此真正的緣由是：屢次系統調用帶來的開銷，加上陷入系統調用的 M 致使 P 的轉移，使得程序性能很是差。

查明緣由後咱們優化一下代碼：

words := 0
inword := false
# 返回帶緩衝的 io.Reader
b := bufio.NewReader(f)
for {
  # 由 buffer 中讀取
  r, err := readbyte(b)
  if err == io.EOF {
    break
  }
複製代碼

查看運行時間，僅爲以前的執行時間的零頭：

$ time go run main.go moby.txt
"moby.txt": 181275 words
        0.63 real         0.29 user         0.25 sys
複製代碼

再次查看 CPU profile ，發現只剩下一個分支了，由於程序中只有一次系統調用，耗時幾乎能夠忽略不計：

顯然針對 CPU 的 profile 不能提供給咱們更多信息了，接下來咱們 profile 一下程序內存使用。

爲了演示此處將採樣率修改成 1，profiler 會收集所有內存分配信息。（通常不推薦這麼作，由於這會拖慢程序運行效率）

func main() {
# 將以前 profile CPU 的代碼註釋掉後添加如下代碼
memProfile, _ := os.Create("mem_profile")
runtime.MemProfileRate = 1

defer func() {
  pprof.WriteHeapProfile(memProfile)
  memProfile.Close()
}()
...
複製代碼

查看調用棧及內存分配展現

能夠看到 main.readbyte 分配了 1.2 m 的內存，與 moby.txt 文件大小相同，點擊 source 查看內存分配具體位置：

對比一下 main 函數與 readbyte 函數，能夠看到 bufio 返回的 reader 僅佔 4kb 內存而 readbyte 中的 buf 數組卻分配了 1.2 mb 的內存！

因爲 buf 這個數組在函數中聲明，理論上應該被分配函數棧空間上，隨着函數返回直接釋放。然而 profile 結果彷佛不是這樣，咱們來看下編譯器的逃逸分析日誌：

$ go build -gcflags=-m main.go
# command-line-arguments
...
./main.go:15:6: moved to heap: buf
./main.go:43:21: moved to heap: buf
....
複製代碼

省略無關信息，此處能夠看到編譯器將本是函數中局部變量的 buf 分配到堆上。因爲變量被分配到堆或棧上是由 go 編譯器決定的，每次進入 readbyte 函數時聲明的 buf 都被分配到堆上，加起來的內存大小即爲文件自己的大小。

咱們將 buf 改成全局變量再次 profile，問題解決：

使用 Trace 分析 Groutine 執行過程

pprof 足夠強大，但一般 Go 程序中包含了大量的併發操做，此時 profile CPU 或內存或許已經不能提供更多有用的信息。而官方提供了超級好用的 trace 幫助咱們進一步分析高併發程序的性能。

這是一個支持多種工做模式的繪製曼德勃羅分形圖（Mandelbrot set）的程序，其主要工做是根據每一個像素的座標計算出值並填充顏色，無需瞭解具體細節由於本文的重點是 trace：

// mandelbrot example code adapted from Francesc Campoy's mandelbrot package.
// https://github.com/campoy/mandelbrot
package main

import (
	"flag"
	"image"
	"image/color"
	"image/png"
	"log"
	"os"
	"sync"
)

func main() {
	var (
		height  = flag.Int("h", 1024, "height of the output image in pixels")
		width   = flag.Int("w", 1024, "width of the output image in pixels")
		mode    = flag.String("mode", "seq", "mode: seq, px, row, workers")
		workers = flag.Int("workers", 1, "number of workers to use")
	)
	flag.Parse()

	const output = "mandelbrot.png"

	// open a new file
	f, err := os.Create(output)
	if err != nil {
		log.Fatal(err)
	}

	// create the image
	c := make([][]color.RGBA, *height)
	for i := range c {
		c[i] = make([]color.RGBA, *width)
	}

	img := &img{
		h: *height,
		w: *width,
		m: c,
	}

	switch *mode {
	case "seq":
		seqFillImg(img)
	case "px":
		oneToOneFillImg(img)
	case "row":
		onePerRowFillImg(img)
	case "workers":
		nWorkersPerRowFillImg(img, *workers)
	default:
		panic("unknown mode")
	}

	// and encoding it
	if err := png.Encode(f, img); err != nil {
		log.Fatal(err)
	}
}

type img struct {
	h, w int
	m    [][]color.RGBA
}

func (m *img) At(x, y int) color.Color { return m.m[x][y] }
func (m *img) ColorModel() color.Model { return color.RGBAModel }
func (m *img) Bounds() image.Rectangle { return image.Rect(0, 0, m.h, m.w) }

// SEQSTART OMIT
func seqFillImg(m *img) {
	for i, row := range m.m {
		for j := range row {
			fillPixel(m, i, j)
		}
	}
}

// SEQEND OMIT

func oneToOneFillImg(m *img) {
	var wg sync.WaitGroup
	wg.Add(m.h * m.w)
	for i, row := range m.m {
		for j := range row {
			go func(i, j int) {
				fillPixel(m, i, j)
				wg.Done()
			}(i, j)
		}
	}
	wg.Wait()
}

func onePerRowFillImg(m *img) {
	var wg sync.WaitGroup
	wg.Add(m.h)
	for i := range m.m {
		go func(i int) {
			for j := range m.m[i] {
				fillPixel(m, i, j)
			}
			wg.Done()
		}(i)
	}
	wg.Wait()
}

func nWorkersFillImg(m *img, workers int) {
	c := make(chan struct{ i, j int })
	for i := 0; i < workers; i++ {
		go func() {
			for t := range c {
				fillPixel(m, t.i, t.j)
			}
		}()
	}

	for i, row := range m.m {
		for j := range row {
			c <- struct{ i, j int }{i, j}
		}
	}
	close(c)
}

func nWorkersPerRowFillImg(m *img, workers int) {
	c := make(chan int, m.h)
	var wg sync.WaitGroup
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			for row := range c {
				for col := range m.m[row] {
					fillPixel(m, row, col)
				}
			}
			wg.Done()
		}()
	}

	for row := range m.m {
		c <- row
	}
	close(c)
	wg.Wait()
}

func fillPixel(m *img, x, y int) {
	const n = 1000
	const Limit = 2.0
	Zr, Zi, Tr, Ti := 0.0, 0.0, 0.0, 0.0
	Cr := (2*float64(x)/float64(n) - 1.5)
	Ci := (2*float64(y)/float64(n) - 1.0)

	for i := 0; i < n && (Tr+Ti <= Limit*Limit); i++ {
		Zi = 2*Zr*Zi + Ci
		Zr = Tr - Ti + Cr
		Tr = Zr * Zr
		Ti = Zi * Zi
	}
	paint(&m.m[x][y], Tr, Ti)
}

func paint(c *color.RGBA, x, y float64) {
	n := byte(x * y * 2)
	c.R, c.G, c.B, c.A = n, n, n, 255
}
複製代碼

咱們跑一下看看，有點慢：

$ time go run mandelbrot.go 
        1.93 real         1.70 user         0.27 sys
複製代碼

因爲默認工做模式是順序執行計算，程序自己除了建立文件及圖片編碼操做也沒有其餘額外的 IO 操做，代碼上進一步優化的空間有限。這裏咱們直接使用 trace 看一下分析結果，首先加入生成 trace 代碼段：

func main() {
	var (
		height  = flag.Int("h", 1024, "height of the output image in pixels")
		width   = flag.Int("w", 1024, "width of the output image in pixels")
		mode    = flag.String("mode", "seq", "mode: seq, px, row, workers")
		workers = flag.Int("workers", 1, "number of workers to use")
	)
	flag.Parse()
	
	# 加入如下代碼段
	var fn string
	switch *mode {
	case "seq":
		fn = "trace.seq"
	case "px":
		fn = "trace.px"
	case "row":
		fn = "trace.row"
	case "workers":
		fn = "trace.workers"
	}
	traceFile, _ := os.Create(fn)
	if err := trace.Start(traceFile); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()
	
	....
複製代碼

運行程序後使用 go tool trace trace.seq 在 Chrome 中查看：

Tips: Shift + ? 查看命令幫助，w/s 放大 / 縮小

細心的同窗可能會發現，只有 Proc 0 一直在工做，其餘幾乎都處於閒置狀態，因爲不一樣像素的計算任務彼此獨立，據此能夠將計算任務交由不一樣的 Goroutine 來完成。

既然如此，咱們換個工做模式，直接並行化計算每一個像素，程序中已經給出了實現：

func oneToOneFillImg(m *img) {
	var wg sync.WaitGroup
	wg.Add(m.h * m.w)
	for i, row := range m.m {
		for j := range row {
			go func(i, j int) {
				fillPixel(m, i, j)
				wg.Done()
			}(i, j)
		}
	}
	wg.Wait()
}
複製代碼

咱們切換到對應的 px 工做模式看看：

$ time go run mandelbrot.go -mode px
        2.01 real         7.26 user         2.90 sys
複製代碼

好吧，很是奇怪的是程序竟然更慢了，咱們使用 go tool trace trace.seq 看一下到底是怎麼一回事。

生成的 trace 文件大小取決於 Goroutine 數量，運行這條命令會比較慢

因爲 Goroutine 數量過多，能夠看到 trace 文件被分爲許多片斷，查看其中一個片斷：

這回沒有空閒的 Proc 了，但執行流彷佛斷斷續續的，放大後看很是明顯：

因此問題就出在這，併發粒度過小了，每一個 Goroutine 的工做量過小，甚至抵不上啓動及調度的開銷。天下沒有免費的午飯，過多的 Groutine 也會帶來額外的負擔。

接下來咱們調整一下併發粒度，將每一行的計算任務分配到不一樣的 Goroutine 上，對應的模式及源碼以下：

time go run mandelbrot.go -mode row
        0.85 real         1.85 user         0.32 sys
複製代碼

func onePerRowFillImg(m *img) {
	var wg sync.WaitGroup
	wg.Add(m.h)
	for i := range m.m {
		go func(i int) {
			for j := range m.m[i] {
				fillPixel(m, i, j)
			}
			wg.Done()
		}(i)
	}
	wg.Wait()
}
複製代碼

能夠看到明顯比以前兩個模式要快上許多，看下 trace 如何：

像素計算部分執行流看起來極度溫馨，而且充分利用了 CPU 資源，12 個線程所有跑滿（超線程的 6 核 CPU）。放大後能夠看到 Goroutine 之間幾乎沒有空隙，不由讚歎 Go 纔是世界上最好的語言😆。

固然，這種模式依然有其缺點，注意到 trace 界面最上方 Groutine 一欄中有一個明顯的波動，儘管咱們調整了併發粒度其依然會在短期內產生大量 Goroutine 而後依次執行它們。實際上因爲 CPU 核數是固定的，同一時間並行執行 Goroutine 的 OS 線程數也是固定的，由此咱們能夠將 Goroutine 池化複用，節省資源，其一般被稱爲 Worker Pool。

Worker Pool 有很是多種實現，可是卻大同小異，首先啓動必定的 Goroutine 做爲 Worker，而後 Worker 經過 Channel 消費須要執行的 Task。程序中給出了一個簡單實現以下：

func nWorkersPerRowFillImg(m *img, workers int) {
	c := make(chan int, m.h)
	var wg sync.WaitGroup
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			for row := range c {
				for col := range m.m[row] {
					fillPixel(m, row, col)
				}
			}
			wg.Done()
		}()
	}

	for row := range m.m {
		c <- row
	}
	close(c)
	wg.Wait()
}
複製代碼

咱們看一下執行時間，比 row 模式要快一些。這裏要注意的是須要手動指定 Worker 的數量，等同於當前硬件最大支持的線程數，能夠在 Go 中使用 fmt.Println(runtime.NumCPU()) 查看。

time go run mandelbrot.go -mode workers -workers 12
        0.74 real         1.86 user         0.26 sys
複製代碼

查看 trace 能夠看到計算階段始終保持着等同於 Worker 數量的 Groutine 在運行

Worker Pool 很是好用但也不是銀彈，不一樣方案適用於不一樣場景，此處不過多討論。回到當前程序自己，這幾種方案最終加速的都是計算任務部分，要想加速圖像編碼部分難度無疑會提高几個檔次。

Amdahl's law

而並行化可以帶來的加速也不是無限的。如上圖，阿姆達爾定律（Amdahl's law）表示了並行處理器數量與效率提高的關係，加速程序的關鍵取決於其中必須串行執行的部分（好比 mandelbrot 這個程序最快時也須要花費約 50% 的時間用於編碼圖片）。當程序中 95% 的執行均可以並行化時，即便將 CPU 數量提高到數千個，加速效率也只能限制在 20 倍。

總結

這真的是一個很是棒的演講，深刻淺出地講述了 Go 性能分析的三種方式，其中還涉及了許多底層的知識點。以爲意猶未盡的朋友能夠看看做者的相關文章 👉 dave.cheney.net/high-perfor…

相關標籤/搜索

每日一句

每一个你不满意的现在，都有一个你没有努力的曾经。