Learning the Mechanics of Go Through Memory Allocation

Date: 2020-06-22


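Introduction

In the previous post, I covered the basic scenarios of escape analysis. There are other scenarios that I did not cover. To explore them, I wrote a program to debug, and the way it allocates memory turned out to be rather surprising.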

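The Program

To learn more about the io package, I tried a quick project: find the string elvis in a stream of bytes and replace it with the capitalized version, Elvis.

The code lists two functions that solve this problem. This post focuses on algOne, because it is the one that uses the io package.

Below are the input and the output we expect after running the input through algOne.

Listing 1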

Input:
abcelvisaElvisabcelviseelvisaelvisaabeeeelvise l v i saa bb e l v i saa elvi
selvielviselvielvielviselvi1elvielviselvis

Output:
abcElvisaElvisabcElviseElvisaElvisaabeeeElvise l v i saa bb e l v i saa elvi
selviElviselvielviElviselvi1elviElvisElvis
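As a sanity check on the expected transformation, here is a minimal non-streaming baseline. It is not the algOne function discussed in this post: replaceBaseline is a hypothetical helper that leans on bytes.ReplaceAll, and the input is only the first part of the stream shown in Listing 1.

```go
package main

import (
	"bytes"
	"fmt"
)

// replaceBaseline is a hypothetical helper, not part of the original program.
// It returns a copy of data with every occurrence of find replaced by repl.
func replaceBaseline(data, find, repl []byte) []byte {
	return bytes.ReplaceAll(data, find, repl)
}

func main() {
	// First part of the input stream from Listing 1.
	in := []byte("abcelvisaElvisabcelviseelvisa")

	out := replaceBaseline(in, []byte("elvis"), []byte("Elvis"))
	fmt.Printf("%s\n", out) // prints "abcElvisaElvisabcElviseElvisa"
}
```

A baseline like this is handy for checking the output of a streaming implementation, but it needs the whole input in memory, which is exactly what a stream-based algorithm tries to avoid.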

Here is the algOne function.

Listing 2

80 func algOne(data []byte, find []byte, repl []byte, output *bytes.Buffer) {
 81
 82     // Use a bytes Buffer to provide a stream to process.
 83     input := bytes.NewBuffer(data)
 84
 85     // The number of bytes we are looking for.
 86     size := len(find)
 87
 88     // Declare the buffers we need to process the stream.
 89     buf := make([]byte, size)
 90     end := size - 1
 91
 92     // Read in an initial number of bytes we need to get started.
 93     if n, err := io.ReadFull(input, buf[:end]); err != nil {
 94         output.Write(buf[:n])
 95         return
 96     }
 97
 98     for {
 99
100         // Read in one byte from the input stream.
101         if _, err := io.ReadFull(input, buf[end:]); err != nil {
102
103             // Flush the rest of the bytes we have.
104             output.Write(buf[:end])
105             return
106         }
107
108         // If we have a match, replace the bytes.
109         if bytes.Compare(buf, find) == 0 {
110             output.Write(repl)
111
112             // Read a new initial number of bytes.
113             if n, err := io.ReadFull(input, buf[:end]); err != nil {
114                 output.Write(buf[:n])
115                 return
116             }
117
118             continue
119         }
120
121         // Write the front byte since it has been compared.
122         output.WriteByte(buf[0])
123
124         // Slice that front byte out.
125         copy(buf, buf[1:])
126     }
127 }

I wanted to know how well this function performs and how much pressure it puts on the heap. To find out, we need to run a benchmark.

Benchmarking

Here is the benchmark function that runs algOne over the data stream.

Listing 3

15 func BenchmarkAlgorithmOne(b *testing.B) {
16     var output bytes.Buffer
17     in := assembleInputStream()
18     find := []byte("elvis")
19     repl := []byte("Elvis")
20
21     b.ResetTimer()
22
23     for i := 0; i < b.N; i++ {
24         output.Reset()
25         algOne(in, find, repl, &output)
26     }
27 }

With this function in place, we can run go test with the -bench, -benchtime and -benchmem options.

Listing 4

$ go test -run none -bench AlgorithmOne -benchtime 3s -benchmem
BenchmarkAlgorithmOne-8    	2000000 	     2522 ns/op       117 B/op  	      2 allocs/op

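After running the benchmark, we can see that algOne performs two heap allocations per operation, totaling 117 bytes. That is already quite good, but we need to know which lines of code cause those allocations. To find out, we need to generate profiling data for the benchmark.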
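The same three figures (ns/op, B/op, allocs/op) can also be read from inside a program. The sketch below assumes nothing about the original stream.go: replaceInto is a hypothetical stand-in for algOne, and measure is a made-up helper. It uses testing.Benchmark from the standard library, which runs a benchmark function without the go test command.

```go
package main

import (
	"bytes"
	"fmt"
	"testing"
)

// replaceInto is a hypothetical stand-in for algOne. It is not the
// algorithm from this post; it only gives the benchmark something to run.
func replaceInto(data, find, repl []byte, output *bytes.Buffer) {
	output.Write(bytes.ReplaceAll(data, find, repl))
}

// measure runs the stand-in under testing.Benchmark and returns the result.
func measure() testing.BenchmarkResult {
	// Init registers the testing flags; it is needed when Benchmark is
	// called outside of "go test" and has no effect if already called.
	testing.Init()

	in := []byte("abcelvisaElvisabcelviseelvisa")
	find := []byte("elvis")
	repl := []byte("Elvis")

	return testing.Benchmark(func(b *testing.B) {
		var output bytes.Buffer
		b.ReportAllocs()
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			output.Reset()
			replaceInto(in, find, repl, &output)
		}
	})
}

func main() {
	r := measure()
	fmt.Printf("%d iterations  %d ns/op  %d B/op  %d allocs/op\n",
		r.N, r.NsPerOp(), r.AllocedBytesPerOp(), r.AllocsPerOp())
}
```

The numbers will differ from the listing above, since the stand-in is a different algorithm and the machine is different; the point is only where each column comes from.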

Profiling

To generate profile data, we run the benchmark again, this time adding the -memprofile option.

Listing 5

$ go test -run none -bench AlgorithmOne -benchtime 3s -benchmem -memprofile mem.out
BenchmarkAlgorithmOne-8    	2000000 	     2570 ns/op       117 B/op  	      2 allocs/op

Once the benchmark finishes, the test tool has produced two new files.

Listing 6

~/code/go/src/.../memcpu
$ ls -l
total 9248
-rw-r--r--  1 bill  staff      209 May 22 18:11 mem.out       (NEW)
-rwxr-xr-x  1 bill  staff  2847600 May 22 18:10 memcpu.test   (NEW)
-rw-r--r--  1 bill  staff     4761 May 22 18:01 stream.go
-rw-r--r--  1 bill  staff      880 May 22 14:49 stream_test.go

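The source code is in a folder named memcpu. The algOne function is in stream.go and the BenchmarkAlgorithmOne function is in stream_test.go. The two new files are mem.out and memcpu.test. The mem.out file contains the profile data. The memcpu.test file is the test binary, which we need in order to inspect the profile data.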
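Outside of go test, a comparable heap profile can be written by the program itself. The following is a rough sketch using runtime/pprof from the standard library; writeHeapProfile, defaultPath and the sink variable are names made up for this example, and the file is written to the system temp directory rather than next to the source.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"runtime"
	"runtime/pprof"
)

// sink keeps the allocations below reachable so they land on the heap.
var sink [][]byte

// defaultPath returns a location for the profile in the temp directory.
func defaultPath() string {
	return filepath.Join(os.TempDir(), "mem.out")
}

// writeHeapProfile allocates some memory, writes a heap profile to path
// and returns the size of the profile file in bytes.
func writeHeapProfile(path string) (int64, error) {
	for i := 0; i < 64; i++ {
		sink = append(sink, make([]byte, 1<<16))
	}

	f, err := os.Create(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	// Run a collection first so the profile reflects recent allocations.
	runtime.GC()
	if err := pprof.WriteHeapProfile(f); err != nil {
		return 0, err
	}

	info, err := f.Stat()
	if err != nil {
		return 0, err
	}
	return info.Size(), nil
}

func main() {
	path := defaultPath()
	size, err := writeHeapProfile(path)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("wrote %d bytes of profile data to %s\n", size, path)
}
```

A file produced this way can be opened with go tool pprof in the same way as the mem.out file generated by the benchmark.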

With the profile data and the test binary, we can run the pprof tool to study the profile data.

Listing 7

$ go tool pprof -alloc_space memcpu.test mem.out
Entering interactive mode (type "help" for commands)
(pprof) _

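When profiling memory and looking for low-hanging fruit, we want the -alloc_space option instead of the default -inuse_space option. It shows where every allocation happens, regardless of whether that memory is still in use at the time the profile is taken.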
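The difference between "everything ever allocated" and "what is still in use" can be illustrated with the counters in runtime.MemStats. This is only an analogy, not how pprof works internally: pprof profiles are sampled, while TotalAlloc and HeapAlloc are runtime totals. The churn function and the sink variable are names invented for this sketch.

```go
package main

import (
	"fmt"
	"runtime"
)

// sink forces the slices allocated below onto the heap.
var sink []byte

// churn allocates count slices of size bytes without keeping any of them,
// then reports how the cumulative and the in-use counters changed.
func churn(count, size int) (allocated uint64, inUse int64) {
	var before, after runtime.MemStats

	runtime.GC()
	runtime.ReadMemStats(&before)

	for i := 0; i < count; i++ {
		sink = make([]byte, size)
	}
	sink = nil // drop the last reference so nothing stays alive

	runtime.GC()
	runtime.ReadMemStats(&after)

	// TotalAlloc only ever grows, much like the alloc_space view.
	allocated = after.TotalAlloc - before.TotalAlloc
	// HeapAlloc goes down when objects are freed, like the inuse_space view.
	inUse = int64(after.HeapAlloc) - int64(before.HeapAlloc)
	return allocated, inUse
}

func main() {
	allocated, inUse := churn(100, 1<<20)
	fmt.Printf("allocated in total: %d bytes\n", allocated)
	fmt.Printf("change in in-use heap: %d bytes\n", inUse)
}
```

A program like this shows a large cumulative total even though almost nothing is retained, which is why short-lived allocations are easy to miss when looking only at in-use memory.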

At the (pprof) prompt, we can inspect algOne with the list command. The command takes a regular expression and lists the functions whose names match it.

Listing 8

(pprof) list algOne
Total: 335.03MB
ROUTINE ======================== .../memcpu.algOne in code/go/src/.../memcpu/stream.go
 335.03MB   335.03MB (flat, cum)   100% of Total
        .          .     78:
        .          .     79:// algOne is one way to solve the problem.
        .          .     80:func algOne(data []byte, find []byte, repl []byte, output *bytes.Buffer) {
        .          .     81:
        .          .     82: // Use a bytes Buffer to provide a stream to process.
 318.53MB   318.53MB     83: input := bytes.NewBuffer(data)
        .          .     84:
        .          .     85: // The number of bytes we are looking for.
        .          .     86: size := len(find)
        .          .     87:
        .          .     88: // Declare the buffers we need to process the stream.
  16.50MB    16.50MB     89: buf := make([]byte, size)
        .          .     90: end := size - 1
        .          .     91:
        .          .     92: // Read in an initial number of bytes we need to get started.
        .          .     93: if n, err := io.ReadFull(input, buf[:end]); err != nil || n < end {
        .          .     94:       output.Write(buf[:n])
(pprof) _

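Based on this profile, we know that input and the backing array of the buf slice are allocated on the heap. Since input is a pointer variable, the profile is really saying that the bytes.Buffer value that input points to is allocated on the heap. So let's focus first on the input allocation and understand why it happens.

It is tempting to assume the value escapes because bytes.NewBuffer creates it and then shares it with algOne. However, the value in the flat column (the first column in the pprof output) attributes the allocation to algOne itself: the value escapes because of the way algOne shares it.

The flat column shows allocations made directly by the function. Compare this with what the list command shows for the Benchmark function, which calls algOne.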

Listing 9

(pprof) list Benchmark
Total: 335.03MB
ROUTINE ======================== .../memcpu.BenchmarkAlgorithmOne in code/go/src/.../memcpu/stream_test.go
        0   335.03MB (flat, cum)   100% of Total
        .          .     18: find := []byte("elvis")
        .          .     19: repl := []byte("Elvis")
        .          .     20:
        .          .     21: b.ResetTimer()
        .          .     22:
        .   335.03MB     23: for i := 0; i < b.N; i++ {
        .          .     24:       output.Reset()
        .          .     25:       algOne(in, find, repl, &output)
        .          .     26: }
        .          .     27:}
        .          .     28:
(pprof) _

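Since only the second column, cum, has a value, the Benchmark function allocates nothing on the heap directly. All the allocations come from the function calls made inside the loop. You can see that the totals from the two list calls match (translator's note: $$318.53 + 16.50 = 335.03$$).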

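At this point we still don't know why the bytes.Buffer value is allocated on the heap. This is where the -gcflags "-m -m" option of the go build command comes in. The profiler can only tell us which values escape to the heap; go build can tell us why.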

Compiler Reporting

We can ask the compiler to report why values in the code escape to the heap.

Listing 10

$ go build -gcflags "-m -m"

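This command produces a lot of output. We need to search it for lines containing stream.go:83, because stream.go is the name of the file and line 83 contains the code that constructs the bytes.Buffer value. The search turns up the following six lines.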

Listing 11

./stream.go:83: inlining call to bytes.NewBuffer func([]byte) *bytes.Buffer { return &bytes.Buffer literal }

./stream.go:83: &bytes.Buffer literal escapes to heap
./stream.go:83:   from ~r0 (assign-pair) at ./stream.go:83
./stream.go:83:   from input (assigned) at ./stream.go:83
./stream.go:83:   from input (interface-converted) at ./stream.go:93
./stream.go:83:   from input (passed to call[argument escapes]) at ./stream.go:93

The first line is very interesting.

Listing 12

./stream.go:83: inlining call to bytes.NewBuffer func([]byte) *bytes.Buffer { return &bytes.Buffer literal }

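It tells us that the bytes.Buffer value does not escape because of the call to bytes.NewBuffer. In fact, bytes.NewBuffer is never called at all: the body of the function is inlined at the call site.

Here is the code on line 83.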

Listing 13

83     input := bytes.NewBuffer(data)

Because the compiler chose to inline bytes.NewBuffer, the code above is effectively compiled as follows.

Listing 14

input := &bytes.Buffer{buf: data}

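This means algOne constructs the bytes.Buffer value directly. So what causes the value that input points to to be allocated on the heap? The answer is in the remaining five lines of the report.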

Listing 15

./stream.go:83: &bytes.Buffer literal escapes to heap
./stream.go:83:   from ~r0 (assign-pair) at ./stream.go:83
./stream.go:83:   from input (assigned) at ./stream.go:83
./stream.go:83:   from input (interface-converted) at ./stream.go:93
./stream.go:83:   from input (passed to call[argument escapes]) at ./stream.go:93

These lines tell us that the code on line 93 causes the escape: the input variable is assigned to an interface value.

Interfaces

I don't remember assigning anything to an interface value in this code. But a look at line 93 makes it clear what is happening.

Listing 16

93     if n, err := io.ReadFull(input, buf[:end]); err != nil {
 94         output.Write(buf[:n])
 95         return
 96     }

The call to io.ReadFull causes the interface assignment. If you look at the definition of io.ReadFull, you can see that its first parameter is of an interface type.

Listing 17

type Reader interface {
      Read(p []byte) (n int, err error)
}

func ReadFull(r Reader, buf []byte) (n int, err error) {
      return ReadAtLeast(r, buf, len(buf))
}

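So passing the address of the bytes.Buffer value to the function, which stores it in a Reader interface value, causes the value to escape to the heap. Now we can see the cost of using an interface: a heap allocation plus indirection (a value that stays on the stack is faster to access). If using an interface does not make the code better, it is best not to use one. I follow these guidelines when deciding whether to use an interface.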

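Use an interface when:

Users of the API need to provide an implementation detail.
The API has multiple implementations that each need to be maintained.
Parts of the API that change over time have been identified and need decoupling.

Don't use an interface:

For the sake of using an interface.
To generalize an algorithm.
When users can declare their own interfaces.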
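The allocation cost described above can be observed in isolation with testing.AllocsPerRun. The sketch below is not the code from stream.go: reader, readVia, viaInterface, direct and allocs are names invented for the example, and the //go:noinline directive is an assumption added so the compiler cannot look through the interface call and optimize the escape away. Exact counts may vary by compiler version, so no numbers are claimed here.

```go
package main

import (
	"bytes"
	"fmt"
	"testing"
)

// reader mirrors the shape of io.Reader; it is declared locally so the
// sketch does not depend on how the compiler treats io.ReadFull.
type reader interface {
	Read(p []byte) (n int, err error)
}

// readVia calls Read through the interface. It is marked noinline so the
// compiler cannot see the concrete type behind r at the call site.
//
//go:noinline
func readVia(r reader, p []byte) (int, error) {
	return r.Read(p)
}

// viaInterface hands the *bytes.Buffer to a function that takes an interface.
func viaInterface(data []byte) int {
	input := bytes.NewBuffer(data)
	var buf [4]byte
	n, _ := readVia(input, buf[:])
	return n
}

// direct calls the Read method on the concrete *bytes.Buffer value.
func direct(data []byte) int {
	input := bytes.NewBuffer(data)
	var buf [4]byte
	n, _ := input.Read(buf[:])
	return n
}

// allocs reports the average heap allocations per call for both variants.
func allocs() (iface, concrete float64) {
	data := []byte("elvis")
	iface = testing.AllocsPerRun(1000, func() { viaInterface(data) })
	concrete = testing.AllocsPerRun(1000, func() { direct(data) })
	return iface, concrete
}

func main() {
	iface, concrete := allocs()
	fmt.Printf("via interface: %.0f allocs/call\n", iface)
	fmt.Printf("direct call:   %.0f allocs/call\n", concrete)
}
```

Running this should show more allocations per call for the interface variant than for the direct method call, which is the same effect the profile revealed on line 83.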

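Now we need to ask ourselves: does this algorithm really need io.ReadFull? The answer is no, because the bytes.Buffer type has a method set we can use directly, and calling those methods on the value keeps it from being allocated on the heap.

So we can stop using the io package and call the Read method directly on the input variable.

The code below no longer uses the io package. To keep the line numbers the same as before, the io import stays in the import block, declared with the blank identifier _ so that the compiler does not reject it as an unused import.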

Listing 18

12 import (
 13     "bytes"
 14     "fmt"
 15     _ "io"
 16 )

 80 func algOne(data []byte, find []byte, repl []byte, output *bytes.Buffer) {
 81
 82     // Use a bytes Buffer to provide a stream to process.
 83     input := bytes.NewBuffer(data)
 84
 85     // The number of bytes we are looking for.
 86     size := len(find)
 87
 88     // Declare the buffers we need to process the stream.
 89     buf := make([]byte, size)
 90     end := size - 1
 91
 92     // Read in an initial number of bytes we need to get started.
 93     if n, err := input.Read(buf[:end]); err != nil || n < end {
 94         output.Write(buf[:n])
 95         return
 96     }
 97
 98     for {
 99
100         // Read in one byte from the input stream.
101         if _, err := input.Read(buf[end:]); err != nil {
102
103             // Flush the rest of the bytes we have.
104             output.Write(buf[:end])
105             return
106         }
107
108         // If we have a match, replace the bytes.
109         if bytes.Compare(buf, find) == 0 {
110             output.Write(repl)
111
112             // Read a new initial number of bytes.
113             if n, err := input.Read(buf[:end]); err != nil || n < end {
114                 output.Write(buf[:n])
115                 return
116             }
117
118             continue
119         }
120
121         // Write the front byte since it has been compared.
122         output.WriteByte(buf[0])
123
124         // Slice that front byte out.
125         copy(buf, buf[1:])
126     }
127 }

When we run the benchmark again, we can see that the bytes.Buffer value is no longer allocated on the heap.

Listing 19

$ go test -run none -bench AlgorithmOne -benchtime 3s -benchmem -memprofile mem.out
BenchmarkAlgorithmOne-8    	2000000 	     1814 ns/op         5 B/op  	      1 allocs/op

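The output also shows a performance improvement of about 29%: the time per operation dropped from 2570 ns/op to 1814 ns/op. With that solved, we can focus on the heap allocation of the backing array for the buf slice. Running the profiler against the new profile data should show us what causes the remaining allocation.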

Listing 20

$ go tool pprof -alloc_space memcpu.test mem.out
Entering interactive mode (type "help" for commands)
(pprof) list algOne
Total: 7.50MB
ROUTINE ======================== .../memcpu.BenchmarkAlgorithmOne in code/go/src/.../memcpu/stream_test.go
     11MB       11MB (flat, cum)   100% of Total
        .          .     84:
        .          .     85: // The number of bytes we are looking for.
        .          .     86: size := len(find)
        .          .     87:
        .          .     88: // Declare the buffers we need to process the stream.
     11MB       11MB     89: buf := make([]byte, size)
        .          .     90: end := size - 1
        .          .     91:
        .          .     92: // Read in an initial number of bytes we need to get started.
        .          .     93: if n, err := input.Read(buf[:end]); err != nil || n < end {
        .          .     94:       output.Write(buf[:n])

The only allocation left is on line 89, and it is for the backing array of the slice.

Stack Frames

We want to know why the backing array for buf is allocated on the heap. Let's run go build again with the -gcflags "-m -m" option and search the output for stream.go:89.

Listing 21

$ go build -gcflags "-m -m"
./stream.go:89: make([]byte, size) escapes to heap
./stream.go:89:   from make([]byte, size) (too large for stack) at ./stream.go:89

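The report says the backing array is "too large for stack". This message is very misleading. The backing array is not too large; rather, the compiler does not know its size at compile time.

A value can be placed on the stack only if the compiler knows its size at compile time, because the size of every function's stack frame is calculated at compile time. If the compiler does not know the size of a value, it places the value on the heap.

To show this, let's temporarily hard-code the size of the slice to 5.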

Listing 22

89     buf := make([]byte, 5)

Run the benchmark again and all the heap allocations are gone.

Listing 23

$ go test -run none -bench AlgorithmOne -benchtime 3s -benchmem
BenchmarkAlgorithmOne-8    	3000000 	     1720 ns/op         0 B/op  	      0 allocs/op

If you look at the compiler report again, you will see that nothing escapes.

Listing 24

$ go build -gcflags "-m -m"
./stream.go:83: algOne &bytes.Buffer literal does not escape
./stream.go:89: algOne make([]byte, 5) does not escape

Obviously we can't hard-code the size of the slice, so this algorithm has to live with one heap allocation.
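The effect of a compile-time-known size can be reproduced in a small standalone program. This is a sketch under assumptions, not the code from stream.go: sumVariable, sumConstant, fill and allocs are invented names, and the size is 1024 bytes rather than 5 so that the result does not depend on optimizations some compiler versions may apply to very small buffers.

```go
package main

import (
	"fmt"
	"testing"
)

// fill writes to every element and returns the sum. It does not keep a
// reference to buf, so buf itself gives the compiler no reason to escape.
func fill(buf []byte) int {
	total := 0
	for i := range buf {
		buf[i] = byte(i)
		total += int(buf[i])
	}
	return total
}

// sumVariable makes a slice whose size is only known at run time.
// noinline keeps the compiler from substituting the caller's constant.
//
//go:noinline
func sumVariable(size int) int {
	buf := make([]byte, size)
	return fill(buf)
}

// sumConstant makes a slice whose size is a compile-time constant.
//
//go:noinline
func sumConstant() int {
	buf := make([]byte, 1024)
	return fill(buf)
}

// allocs reports the average heap allocations per call for both variants.
func allocs() (variable, constant float64) {
	variable = testing.AllocsPerRun(1000, func() { sumVariable(1024) })
	constant = testing.AllocsPerRun(1000, func() { sumConstant() })
	return variable, constant
}

func main() {
	variable, constant := allocs()
	fmt.Printf("variable size: %.0f allocs/call\n", variable)
	fmt.Printf("constant size: %.0f allocs/call\n", constant)
}
```

Both functions do the same work on the same number of bytes; the only difference is whether the compiler can see the size when it lays out the stack frame.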

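Allocations and Performance

We can now compare the performance of the three versions: the original code and the code after each of the two changes.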

Listing 25

Before any optimization
BenchmarkAlgorithmOne-8    	2000000 	     2570 ns/op       117 B/op  	      2 allocs/op

Removing the bytes.Buffer allocation
BenchmarkAlgorithmOne-8    	2000000 	     1814 ns/op         5 B/op  	      1 allocs/op

Removing the backing array allocation
BenchmarkAlgorithmOne-8    	3000000 	     1720 ns/op         0 B/op  	      0 allocs/op

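Removing the bytes.Buffer allocation improved performance by about 29% (2570 ns/op down to 1814 ns/op). With both allocations removed, the improvement over the original is about 33% (2570 ns/op down to 1720 ns/op). These numbers show that heap allocations do affect a program's performance.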
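The percentages follow directly from the ns/op figures in Listing 25. Here is the arithmetic as a small program; improvement is a helper name made up for this example.

```go
package main

import "fmt"

// improvement returns the percentage reduction in time per operation
// when going from before to after (both in ns/op).
func improvement(before, after float64) float64 {
	return (before - after) / before * 100
}

func main() {
	// Figures taken from Listing 25.
	fmt.Printf("bytes.Buffer allocation removed: %.1f%%\n", improvement(2570, 1814)) // 29.4%
	fmt.Printf("both allocations removed:        %.1f%%\n", improvement(2570, 1720)) // 33.1%
}
```

Both percentages are measured against the original 2570 ns/op, so the second change on its own accounts for the step from roughly 29% to roughly 33%.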