golang--性能測試和分析

時間 2019-11-08

標籤 golang 性能測試分析欄目 Go 简体版

原文原文鏈接

前言

測試分爲：壓力測試、負載測試、性能測試，功能測試等等，其中在開發過程當中開發人員常常要寫一些test case unit 本身的模塊進行功能測試測和性能。在分析出模塊的性能瓶頸後開發人員就須要針對性的調優，但須要提醒的是調優工程通常要放在最後在進行，過早地優化會浪費開發時間，並且有時在需求或者功能變更後就會變成無用功，這是顯而易見的。
隨着移動互聯網的興起，各類遠端服務的性能遇到巨大的挑戰，雖然硬件資源日益豐富，但成本的增長和用戶體驗的嚴苛，性能調優在軟件開發中佔的權重也愈來愈高，尤爲是一些常常被調用的模塊，如核心算法的部分，即便把每次執行的cpu運行的時間下降1%，把內存消耗下降1%，在面對海量的用戶請求時對性能也會帶來提高。
Go語言做爲一個「新興語言」，它自身爲咱們提供了一套用於性能測試和分析的工具和接口，做者這裏結合一個具體實例來說解一下簡單的調優過程。html

經常使用的測試命令

啓動指定的測試函數node
```
go test -run=xxx $packetpath
```
編譯一個測試binary文件git
```
go test -c -o $binaryname
```
啓動一個bench測試github
```
go test -bench=xxx -run=None
```
在 -bench 的選項上設置一個None表明着不運行功能測試函數即Test開頭的函數，-bench後面接的是一個匹配項，任何符合該匹配項的Bench測試函數都會被觸發執行。golang
```
go test -bench=xxx -run=None -benchmem
```
添加 -benchmem 會顯示出測試過程當中分配mem的數量。web
啓動profile分析算法

在go 自帶的工具中支持三種profile收集： cpu、block、memory，cpuprofile 是用來記錄各個事件或函數對cpu的消耗狀況的數據；blockprofile是爲了記錄攜程阻塞的操做，能夠用來分析每一個阻塞操做佔用的時間；memprofile是用於堆性能剖析，記錄攜程內部堆分配的具體狀況。windows
```
go test -bench=xxx -run=None -cpuprofile=cpu.out
    go test -bench=xxx -run=None -blockprofile=block.out
    go test -bench=xxx -run=None -memprofile=mem.out
```
不管是cpu分析、阻塞分析、堆分析，主要目的都是爲了消耗更少的計算資源、內存資源來取得更高的效率，但有時候時間和空間並不能兼得，用空間換時間是很經常使用的一種作法，須要根據具體狀況進行權衡。瀏覽器
枚舉包內的文件併發
```
# 枚舉一個包中的.go 文件
    go list -f={{.GoFiles}}

    # 枚舉一個包中的_test.go文件 
    go list -f={{.TestGoFiles}} 

    # 枚舉一個包中的外部測試文件
    go list -f={{.XTestGoFiles}}
```
外部測試文件：在一些特殊狀況下，外部測試包須要對待測試包有特殊的訪問權限，例如爲了不循環引用，一個白盒測試必須存在一個單獨的包中，在這種狀況下咱們使用了一種小技巧：在包內測試未見中添加一些函數聲明，將包內部的功能暴露給外部測試，這種作法爲爲包測試留了一個「後門」。若是一個源文件的存在的惟一目的就在於此，而且本身不包含任何測試，它們通常稱做export_test.go
測試覆蓋率
在測試的時候有一種最簡單的統計方法--語句覆蓋率，部分語句在一次測試過程當中至少執行了一次，則稱之爲覆蓋，不然爲非覆蓋。
```
# 執行測試並生成coverprofile 文件
    go test -run=TransferProofGenerator -cover -coverprofile=c.out

    # 查看覆蓋率，經過go tool 工具
    go tool cover -html=c.out // 會在瀏覽器中打開一個測試報告
```
若是在生成profile的命令中加入了 -covermode=count 的標記，則能夠標記出執行效率較高和較低的模塊。

性能剖析流程

這裏做者會結合一個示例來講明一下性能剖析的具體操做流程。性能調優首要的任務就是找出消耗時間最多的函數。

收集程序的profile

通常收集有兩種方法，一種是經過工具命令來收集，上面已經提到了；另外一種是經過調用相關模塊來收集，這裏面以cpuprofile爲例：
```
import( 
    "runtime/pprof"
    "testing"
    "os"
)
func BenchmarkXXX(b *testing.B){
    fd, _ := os.OpenFile("cpu.log", os.O_CREATE| os.O_RDWR, os.ModeSetuid)

    if err := pprof.StartCPUProfile(fd); err != nil{
        b.Fatalf("TestTransferProofValidator %s", err.Error())
    }

    defer func() {
        pprof.StopCPUProfile()
        fd.Close()
    }()
}
```
這樣在執行後自動就會生成cpuprofile文件，這裏是將函數放在裏Bench測試函數中，也能夠將它放在正常函數中好比main函數。有的時候須要實時的監聽web服務的運行狀態，這個時候就能夠經過引入net/http/pprof來收集profile：
```
package main

import (
    "net/http"
    "github.com/utxo/webserver/router"
    _ "net/http/pprof"
)

func main()  {
    handler:= http.FileServer(http.Dir("./template"));
    http.Handle("/", handler)
    http.HandleFunc("/transfer", router.Transfer);
    http.ListenAndServe("0.0.0.0:8082", nil);
}
```
此時會默認開啓proof收集服務，go tool做爲客戶端來訪問該router來收集profile:
```
go tool pprof  http://localhost:8082/debug/pprof/profile
```
在收集結束後會進入cmd命令界面（默認進行 30s 的 CPU Profiling收集），經過命令行來查看、處理收集到的數據。
查看profile數據
```
go tool pprof -text -nodecount=10  cpu.log
```
-nodecount 函數會爲咱們列出cpu時間消耗前十的函數。也能夠進入cmd模式，進行更詳細的剖析：
```
go tool pprof list  cpu.log
```

詳細示例

做者這裏展現的是一個零知識證實的生成和校驗的算法的性能分析過程，按照上節所說咱們先收集profile，使用的方法是調用pprof模塊的方法：

func BenchmarkTransferProofValidator(b *testing.B) {
    once.Do(testInit)
    fd, _ := os.OpenFile("cpu_validator.log", os.O_CREATE| os.O_RDWR, os.ModeSetuid)


    inputs, T := generateInputs(100, big.NewInt(1000000))
    proof, output, B, err := TransferProofGenerator(inputs, T, pairsA, pairsB,
        zkpKeyPairsA, zkpKeyPairsB, zkpRange)
    if err != nil{
        b.Fatalf("TestTransferProofValidator: %s", err.Error())
    }

    if err := pprof.StartCPUProfile(fd); err != nil{
        b.Fatalf("TestTransferProofValidator %s", err.Error())
    }
    defer func() {
        pprof.StopCPUProfile()
        fd.Close()
    }()

    for i := 0; i < b.N; i++ {
        err = TransferProofValidator(inputs, output, B, pairsA, pairsB,
            zkpKeyPairsA, zkpKeyPairsB, proof)
        if err != nil{
            b.Fatalf("TestTransferProofValidator: %s", err.Error())
        }
    }
}

執行以下命令：

go test -c -bench=BenchmarkTransferProofValidator -run=None

顯示結果以下：

D:\gopath\src\github.com\utxo\balance>go test -run=None -bench=Va
 goos: windows
 goarch: amd64
 pkg: github.com/utxo/balance
 BenchmarkTransferProofValidator-4              3         397296900 ns/op
 PASS
 ok      github.com/utxo/balance 3.237s

TransferProofValidator運行了三次，gomaxprocs = 4，這些都是go tool自動分配，暫時不用深究。能夠看到咱們的函數運行十分的緩慢，執行一次須要 0.4s 左右，它將做爲一個高頻調用的模塊遠遠達不到使用的需求。

接下來咱們要找到耗時最多的那個部分：

go tool pprof list  cpu_validator.log

顯示結果以下：

D:\gopath\src\github.com\utxo\balance> go tool pprof list  cpu_validator.log
 list: open list: The system cannot find the file specified.
 Fetched 1 source profiles out of 2
 Main binary filename not available.
 Type: cpu
 Time: Dec 20, 2018 at 11:15pm (CST)
 Duration: 625.81ms, Total samples = 420ms (67.11%)
 Entering interactive mode (type "help" for commands, "o" for options)
 (pprof) top
 Showing nodes accounting for 420ms, 100% of 420ms total
 Showing top 10 nodes out of 25
     flat  flat%   sum%        cum   cum%
     240ms 57.14% 57.14%      240ms 57.14%  math/big.addMulVVW
     110ms 26.19% 83.33%      380ms 90.48%  math/big.nat.montgomery
     30ms  7.14% 90.48%       30ms  7.14%  runtime.memmove
     10ms  2.38% 92.86%       10ms  2.38%  math/big.(*Int).QuoRem
     10ms  2.38% 95.24%       10ms  2.38%  math/big.nat.add
     10ms  2.38% 97.62%       10ms  2.38%  runtime.lock
     10ms  2.38%   100%       20ms  4.76%  runtime.newstack
         0     0%   100%      400ms 95.24%  github.com/utxo/balance.BenchmarkTransferProofValidator
         0     0%   100%      400ms 95.24%  github.com/utxo/balance.TransferProofValidator
         0     0%   100%      400ms 95.24%  github.com/utxo/crypto/base.Exp

每一行表示一個函數的信息。前兩列表示函數在 CPU 上運行的時間以及百分比；第三列是當前全部函數累加使用 CPU 的比例；第四列和第五列表明這個函數以及子函數運行所佔用的時間和比例（也被稱爲累加值 cumulative），應該大於等於前兩列的值；最後一列就是函數的名字。若是應用程序有性能問題，上面這些信息應該能告訴咱們時間都花費在哪些函數的執行上了。

以上面顯示的結果爲例，根據cum來觀察大部分的時間都花費在了Exp的執行過程當中（base.Exp 函數在其餘模塊內被調用），咱們能夠經過list命令來仔細分析一下Exp函數中在那部分比較耗時：

Total: 420ms
 ROUTINE ======================== github.com/utxo/crypto/base.Exp in D:\gopath\src\github.com\utxo\crypto\base\base.go
         0      400ms (flat, cum) 95.24% of Total
         .          .     91:           return E
         .          .     92:   }
         .          .     93:
         .          .     94:   if x.Sign() < 1{
         .          .     95:           x = (&big.Int{}).Neg(x)
         .       30ms     96:           E = (&big.Int{}).Exp(g, x, n)
         .          .     97:           if 1 != Gcd(E, n).Int64(){
         .          .     98:                   log.Printf("h_r is not prime with n")
         .          .     99:                   return nil
         .          .    100:           }
         .          .    101:
         .       20ms    102:           E = (&big.Int{}).ModInverse(E, n)
         .          .    103:           E = (&big.Int{}).Mod(E, n)
         .          .    104:
         .          .    105:   }else {
         .      350ms    106:           E = (&big.Int{}).Exp(g, x, n)
         .          .    107:   }
         .          .    108:
         .          .    109:   return E
         .          .    110:}
         .          .    111:

如上所示，base.Exp 總計佔用了 95.24% 的cpu時間，而整個函數內最耗時的就是 big.Exp() 函數。這就給做者製造了個難題，由於在本模塊中涉及到大量的指數運算，並且輸入數據都較大（平均 512bit）。

由於本模塊的功能相對比較簡單，因此比較容易的定位到了問題的所在，接下來咱們要想辦法進行調優。
由於本模塊目前是純粹的算法模塊，能夠考慮進行併發改造，來發揮go協程的特色，尤爲是在調用base.Exp的部分中。但做者將嘗試併發改造後發現改造的結果並不理想，由於使用channel進行同步致使阻塞，‘抵消’了多協程帶來的性能提高。

另一個思路就是儘可能避免big.Exp指數運算，即將指數型運算轉換爲其餘時間複雜度較低的運算，多協程是利用go自己的特性和cpu的多核運算。而這一種方法則純粹是從數學角度上進行優化，並不具備普適性。改造後再進行測試顯示結果以下：
```
D:\gopath\src\github.com\utxo\balance>go test -run=None -bench=Va
 goos: windows
 goarch: amd64
 pkg: github.com/utxo/balance
 BenchmarkTransferProofValidator-4            200           5096903 ns/op
 PASS
 ok      github.com/utxo/balance 3.753s
```
雖然在零知識證實校驗過程當中多協程並未帶來顯著的效果提高，但在零知識證實生成的過程當中起到了很大的做用。