Go Pitfalls: panic and recover

Date: 2020-04-19

Preface

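Since its release, Go has been known for high performance and strong concurrency support. Because the standard library ships with the net/http package, even programmers who have only just picked up the language can easily write an HTTP server. Still, everything has two sides. A language has strengths to be proud of, and it inevitably hides plenty of pitfalls as well. Newcomers who don't know about them fall in easily. This "Go Pitfalls" series of posts starts with panic and recover, and walks through the pitfalls I have run into and how to get out of them.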

A first look at panic and recover

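panic: In English, the word means sudden fear or alarm. Taken literally, in Go it stands for an extremely serious problem, the kind programmers dread most: once it occurs and nothing recovers from it, the program terminates and exits. panic is a built-in function (not a keyword) that is mainly used to raise an exception deliberately, much like the throw keyword in languages such as Java.
recover: In English, the word means to restore or get back to a normal state. Taken literally, in Go it stands for bringing the program back from a serious error to a normal state. recover is a built-in function (not a keyword) that is mainly used to catch an exception and let the program carry on normally, much like try ... catch in languages such as Java.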

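I spent six years doing C development on Linux. C has no notion of catching exceptions: no try ... catch, and no panic or recover either. The underlying idea is the same everywhere, though: the difference between exceptions and the "if error then return" style shows up mainly in how many levels of the function call stack are crossed at once.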

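Under normal control flow the call stack unwinds one frame at a time, whereas catching an exception can be thought of as a long-distance jump across the call stack. In C this is done with the two functions setjmp and longjmp, as in the following code: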

#include <setjmp.h>
#include <stdio.h>

static jmp_buf env;

double divide(double to, double by) {
    if(by == 0)
    {
        longjmp(env, 1);
    }
    return to / by;
}

void test_divide() {
    divide(2, 0);
    printf("done\n");
}

int main() {
    if (setjmp(env) == 0)
    {
        test_divide();
    }
    else
    {
        printf("Cannot / 0\n");
        return -1;
    }
    return 0;
}


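Because a long jump takes place, going straight from inside divide back into main and cutting the normal flow of execution short, the compiled program prints Cannot / 0 and never prints done. Rather magical, isn't it?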

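Mechanisms such as try ... catch, recover and setjmp save the current program state (mainly the CPU's stack pointer register sp and program counter pc; Go's recover relies on defer to keep track of sp and pc) into memory shared with throw, panic or longjmp. When an exception occurs, the saved sp and pc values are read back from that memory, the stack is unwound directly to the position sp points to, and execution continues at the instruction pc points to, bringing the program back from the exceptional state to a normal one.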
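For comparison, the same divide example can be sketched in Go with panic and recover. This is only an illustration of the idea, not code from the Go source: the names run, testDivide and errDivideByZero are made up for this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// errDivideByZero is a hypothetical error value used only in this sketch.
var errDivideByZero = errors.New("cannot divide by zero")

// divide panics instead of returning when the divisor is zero,
// playing the role of longjmp in the C version.
func divide(to, by float64) float64 {
	if by == 0 {
		panic(errDivideByZero)
	}
	return to / by
}

func testDivide() {
	divide(2, 0)
	fmt.Println("done") // never reached: the panic unwinds past this line
}

// run plays the role of setjmp: its deferred function is where control
// lands after the panic, several frames below divide.
func run() (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	testDivide()
	return nil
}

func main() {
	if err := run(); err != nil {
		fmt.Println(err)
	}
}
```

As in the C program, done is never printed; run returns an error describing the recovered panic instead.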

A deeper look at panic and recover

Source code

The source for panic and recover lives in src/runtime/panic.go in the Go source tree, in the functions gopanic and gorecover.

// Code of gopanic, src/runtime/panic.go, line 454

// The implementation of the predeclared function panic.
func gopanic(e interface{}) {
	gp := getg()
	if gp.m.curg != gp {
		print("panic: ")
		printany(e)
		print("\n")
		throw("panic on system stack")
	}

	if gp.m.mallocing != 0 {
		print("panic: ")
		printany(e)
		print("\n")
		throw("panic during malloc")
	}
	if gp.m.preemptoff != "" {
		print("panic: ")
		printany(e)
		print("\n")
		print("preempt off reason: ")
		print(gp.m.preemptoff)
		print("\n")
		throw("panic during preemptoff")
	}
	if gp.m.locks != 0 {
		print("panic: ")
		printany(e)
		print("\n")
		throw("panic holding locks")
	}

	var p _panic
	p.arg = e
	p.link = gp._panic
	gp._panic = (*_panic)(noescape(unsafe.Pointer(&p)))

	atomic.Xadd(&runningPanicDefers, 1)

	for {
		d := gp._defer
		if d == nil {
			break
		}

		// If defer was started by an earlier panic or Goexit (and, since we're back here, that triggered a new panic), take defer off the list. The earlier panic or Goexit will not continue running.
		if d.started {
			if d._panic != nil {
				d._panic.aborted = true
			}
			d._panic = nil
			d.fn = nil
			gp._defer = d.link
			freedefer(d)
			continue
		}

		// Mark defer as started, but keep it on the list, so that traceback can find and update the defer's argument frame if stack growth or a garbage collection happens before reflectcall starts executing d.fn.
		d.started = true

		// Record the panic that is running the defer. If a new panic is triggered inside the deferred call, that panic will find d in the list and mark d._panic (this panic) as aborted.
		d._panic = (*_panic)(noescape(unsafe.Pointer(&p)))

		p.argp = unsafe.Pointer(getargp(0))
		reflectcall(nil, unsafe.Pointer(d.fn), deferArgs(d), uint32(d.siz), uint32(d.siz))
		p.argp = nil

		// reflectcall did not panic. Remove d.
		if gp._defer != d {
			throw("bad defer entry in panic")
		}
		d._panic = nil
		d.fn = nil
		gp._defer = d.link

		// Calling GC() here triggers stack shrinkage to test stack copy. It is test-only code, hence commented out. See stack_test.go:TestStackPanic
		//GC()

		pc := d.pc
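		sp := unsafe.Pointer(d.sp) // must be a pointer so it gets adjusted during stack copy
		// Defer records are allocated dynamically and have to be freed once they have run. If defers never get to run (for example, when defer statements keep being created inside an endless loop), memory leaks.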
		freedefer(d)
		if p.recovered {
			atomic.Xadd(&runningPanicDefers, -1)

			gp._panic = p.link
			// Aborted panics are marked but remain on the g.panic list. Remove them from the list.
			for gp._panic != nil && gp._panic.aborted {
				gp._panic = gp._panic.link
			}
			if gp._panic == nil { // must be done with signal
				gp.sig = 0
			}
			// Pass information about the recovering frame to recovery.
			gp.sigcode0 = uintptr(sp)
			gp.sigcode1 = pc
			mcall(recovery)
			throw("recovery failed") // mcall should not return
		}
	}

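	// Reaching this point after walking all the defers means nothing recovered (as noted above, the mcall into recovery does not return), so the panic carries on: print the error message and the call stack.
	// Because it is unsafe to call arbitrary user code after freezing the world, we call preprintpanics to invoke all necessary Error and String methods to prepare the panic strings before startpanic.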
	preprintpanics(gp._panic)

	fatalpanic(gp._panic) // should not return
	*(*int)(nil) = 0      // not reached in normal operation, because fatalpanic should not return; if execution ever got here, this nil write would crash the program
}

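// Code of gorecover, src/runtime/panic.go, line 585

// The implementation of the predeclared function recover.
// Cannot split the stack because it needs to reliably find the stack segment of its caller.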
//
// TODO(rsc): Once we commit to CopyStackAlways,
// this doesn't need to be nosplit.
//go:nosplit
func gorecover(argp uintptr) interface{} {
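	// Must be in a function running as part of a deferred call during the panic.
	// Must be called from the topmost function of the call (the function used in the defer statement).
	// p.argp is the argument pointer of that topmost deferred function call.
	// Compare against argp reported by the caller. If they match, the caller is the one who can recover.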
	gp := getg()
	p := gp._panic
	if p != nil && !p.recovered && argp == uintptr(p.argp) {
		p.recovered = true
		return p.arg
	}
	return nil
}
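The argp comparison in gorecover has a visible consequence: recover only takes effect when the deferred function itself calls it. A minimal sketch of that behaviour follows; the function names indirect, viaHelper and direct are invented for this illustration.

```go
package main

import "fmt"

// indirect calls recover one frame too deep: it is called by the deferred
// closure instead of being the deferred function itself, so recover
// returns nil and the panic keeps propagating.
func indirect() interface{} {
	return recover()
}

// viaHelper panics and tries to recover through the helper above.
func viaHelper() {
	defer func() {
		fmt.Println("helper saw:", indirect())
	}()
	panic("boom")
}

// direct calls recover in the deferred function itself, which works.
func direct() (recovered interface{}) {
	defer func() {
		recovered = recover()
	}()
	viaHelper()
	return nil
}

func main() {
	fmt.Println("direct saw:", direct())
}
```

The helper prints <nil> because its recover call is not made directly by a deferred function; the panic then reaches direct, whose deferred function recovers the value boom.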

From the code we can see that the main flow inside panic is:

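Get the g, that is the goroutine, of the current caller
Walk the defer functions registered on g and run them one by one
If a defer function calls recover and a panic is in progress, the panic is marked as recovered
While walking the defers, if the panic turns out to be marked as recovered, the sp and pc of that defer are taken and stored in two status-code fields of g (sigcode0 and sigcode1)

runtime.mcall is called to switch to m->g0 and jump to the recovery function, passing the g obtained earlier as its argument. The code of runtime.mcall is in src/runtime/asm_xxx.s in the Go source tree, where xxx is the platform, for example amd64. The code is as follows: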

// src/runtime/asm_amd64.s, line 274

// func mcall(fn func(*g))
// Switch to m->g0's stack, call fn(g).
// Fn must never return. It should gogo(&g->sched)
// to keep running g.
TEXT runtime·mcall(SB), NOSPLIT, $0-8
    MOVQ	fn+0(FP), DI

    get_tls(CX)
    MOVQ	g(CX), AX	// save state in g->sched
    MOVQ	0(SP), BX	// caller's PC
    MOVQ	BX, (g_sched+gobuf_pc)(AX)
    LEAQ	fn+0(FP), BX	// caller's SP
    MOVQ	BX, (g_sched+gobuf_sp)(AX)
    MOVQ	AX, (g_sched+gobuf_g)(AX)
    MOVQ	BP, (g_sched+gobuf_bp)(AX)

    // switch to m->g0 & its stack, call fn
    MOVQ	g(CX), BX
    MOVQ	g_m(BX), BX
    MOVQ	m_g0(BX), SI
    CMPQ	SI, AX	// if g == m->g0 call badmcall
    JNE	3(PC)
    MOVQ	$runtime·badmcall(SB), AX
    JMP	AX
    MOVQ	SI, g(CX)	// g = m->g0
    MOVQ	(g_sched+gobuf_sp)(SI), SP	// sp = m->g0->sched.sp
    PUSHQ	AX
    MOVQ	DI, DX
    MOVQ	0(DI), DI
    CALL	DI
    POPQ	AX
    MOVQ	$runtime·badmcall2(SB), AX
    JMP	AX
    RET

The reason for switching to m->g0 is that the Go runtime has its own stack and goroutine, and recovery runs in the runtime's context, so execution first has to be scheduled onto m->g0 before recovery can run.

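Inside recovery, the two status codes stored in g are used to put the stack pointer sp and the program counter pc back into the goroutine's scheduling state, and gogo is called to reschedule g. g resumes as though the function that deferred the recover call had just returned, and the goroutine keeps running. The code is as follows: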

// Code of recovery, src/runtime/panic.go, line 637

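// Unwind the stack after a deferred function calls recover after a panic. Then arrange to continue running as though the caller of the deferred function returned normally.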
func recovery(gp *g) {
    // Info about defer passed in G struct.
    sp := gp.sigcode0
    pc := gp.sigcode1

    // d's arguments need to be in the stack (checked here by making sure sp lies within the goroutine's stack bounds).
    if sp != 0 && (sp < gp.stack.lo || gp.stack.hi < sp) {
        print("recover: ", hex(sp), " not in [", hex(gp.stack.lo), ", ", hex(gp.stack.hi), "]\n")
        throw("bad recovery")
    }

    // Make the deferproc for this d return again, this time returning 1. The calling function will jump to the standard return epilogue.
    gp.sched.sp = sp
    gp.sched.pc = pc
    gp.sched.lr = 0
    gp.sched.ret = 1
    gogo(&gp.sched)
}

// src/runtime/asm_amd64.s, line 274

// func gogo(buf *gobuf)
// restore state from Gobuf; longjmp
TEXT runtime·gogo(SB), NOSPLIT, $16-8
    MOVQ	buf+0(FP), BX		// gobuf
    MOVQ	gobuf_g(BX), DX
    MOVQ	0(DX), CX		// make sure g != nil
    get_tls(CX)
    MOVQ	DX, g(CX)
    MOVQ	gobuf_sp(BX), SP	// restore SP from gobuf ahead of the jump
    MOVQ	gobuf_ret(BX), AX
    MOVQ	gobuf_ctxt(BX), DX
    MOVQ	gobuf_bp(BX), BP
    MOVQ	$0, gobuf_sp(BX)	// clear gobuf from here on to help the garbage collector
    MOVQ	$0, gobuf_ret(BX)
    MOVQ	$0, gobuf_ctxt(BX)
    MOVQ	$0, gobuf_bp(BX)
    MOVQ	gobuf_pc(BX), BX    // load the saved pc from gobuf for the jump
    JMP	BX

That is how Go handles exceptions under the hood. Boiled down to three steps:

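A defer function calls recover
A panic is triggered, and the runtime takes over to obtain the sp and pc recorded for the defer that called recover
Execution resumes with the logic that follows recover in the defer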
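The three steps can be observed from ordinary Go code. The sketch below records the order of events in a slice; the names risky and run and the logged strings are hypothetical, chosen only for this illustration.

```go
package main

import "fmt"

// risky registers two deferred functions and then panics.
// Deferred functions run in last-in, first-out order.
func risky(log *[]string) {
	defer func() { *log = append(*log, "first defer runs last") }()
	defer func() {
		// Step 1: the deferred function calls recover.
		if r := recover(); r != nil {
			*log = append(*log, fmt.Sprintf("recovered: %v", r))
		}
	}()
	*log = append(*log, "before panic")
	panic("boom") // Step 2: the panic starts walking the defers.
}

// run shows step 3: once risky's defers have finished, execution
// continues in the caller as if risky had returned normally.
func run() []string {
	var log []string
	risky(&log)
	log = append(log, "caller resumes")
	return log
}

func main() {
	for _, line := range run() {
		fmt.Println(line)
	}
}
```

The recorded order is: before panic, recovered: boom, first defer runs last, caller resumes. The remaining deferred function still runs after the recovery, and the code after panic inside risky is never executed.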

What the pitfalls are

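As mentioned earlier, panic is mainly used to raise an exception on purpose. In business code, if initializing a resource fails during program startup, we can call panic ourselves to stop the program right away. For newcomers this poses no problem and is easy to get right.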
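A small sketch of that startup pattern, under the assumption that the resource being initialized is a port number read from configuration; mustParsePort is a hypothetical helper, not part of any library.

```go
package main

import (
	"fmt"
	"strconv"
)

// mustParsePort is a hypothetical startup helper: a configuration value
// that cannot be used stops the program immediately instead of being
// passed around as an error.
func mustParsePort(s string) int {
	port, err := strconv.Atoi(s)
	if err != nil || port < 1 || port > 65535 {
		panic(fmt.Sprintf("invalid port %q", s))
	}
	return port
}

func main() {
	port := mustParsePort("8080")
	fmt.Println("listening on port", port)
}
```

Helpers of this kind are best kept to program startup; once the service is running, returning an error is usually the safer choice.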

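Reality, however, tends to be harsh: many places in Go's runtime code call panic too, which amounts to a field of deep pits for newcomers who don't know how Go works underneath. Without knowing these pitfalls, writing robust Go code is impossible.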

Next, let me go through them one by one.

Array (slice) index out of range

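This one is fairly easy to understand. Go checks array and slice indexes at run time, and an out-of-range index triggers a panic. The following code demonstrates it: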

package main

import (
    "fmt"
)

func foo(){
    defer func(){
        if err := recover(); err != nil {
            fmt.Println(err)
        }
    }()
    var bar = []int{1}
    fmt.Println(bar[1])
}

func main(){
    foo()
    fmt.Println("exit")
}

Output:

runtime error: index out of range
exit

Because the code uses recover, the program recovers and prints exit.

If the recover lines are commented out, the following log is printed instead:

panic: runtime error: index out of range

goroutine 1 [running]:
main.foo()
    /home/letian/work/go/src/test/test.go:14 +0x3e
main.main()
    /home/letian/work/go/src/test/test.go:18 +0x22
exit status 2
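One way to stay out of this pit is to check the index before using it. The sketch below also shows that the value recovered from an out-of-range access implements runtime.Error; the helpers at and panicsWithRuntimeError are made up for this illustration.

```go
package main

import (
	"fmt"
	"runtime"
)

// at returns s[i] and true, or 0 and false when i is out of range.
func at(s []int, i int) (int, bool) {
	if i < 0 || i >= len(s) {
		return 0, false
	}
	return s[i], true
}

// panicsWithRuntimeError runs f and reports whether it panicked with a
// value that implements runtime.Error.
func panicsWithRuntimeError(f func()) (isRuntimeErr bool) {
	defer func() {
		// recover returns nil when f did not panic, in which case
		// the comma-ok assertion simply yields false.
		_, isRuntimeErr = recover().(runtime.Error)
	}()
	f()
	return false
}

func main() {
	bar := []int{1}
	idx := 1
	if _, ok := at(bar, idx); !ok {
		fmt.Println("index", idx, "is out of range")
	}
	fmt.Println(panicsWithRuntimeError(func() { _ = bar[idx] }))
}
```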

Dereferencing an uninitialized or nil pointer

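Anyone with C/C++ experience will find this easy to understand. For newcomers who have never used pointers, though, it is one of the most common mistakes. The following code demonstrates it: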

package main

import (
    "fmt"
)

func foo(){
    defer func(){
        if err := recover(); err != nil {
            fmt.Println(err)
        }
    }()
    var bar *int
    fmt.Println(*bar)
}

func main(){
    foo()
    fmt.Println("exit")
}

Output:

runtime error: invalid memory address or nil pointer dereference
exit

If the recover lines are commented out, the output is:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x4869ff]

goroutine 1 [running]:
main.foo()
    /home/letian/work/go/src/test/test.go:14 +0x3f
main.main()
    /home/letian/work/go/src/test/test.go:18 +0x22
exit status 2
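The usual defence is an explicit nil check before dereferencing. A minimal sketch, with a hypothetical helper named value and a hypothetical error errNilPointer:

```go
package main

import (
	"errors"
	"fmt"
)

// errNilPointer is a hypothetical error used only in this sketch.
var errNilPointer = errors.New("nil pointer")

// value dereferences p only after checking it, turning what would be a
// panic into an ordinary error.
func value(p *int) (int, error) {
	if p == nil {
		return 0, errNilPointer
	}
	return *p, nil
}

func main() {
	var bar *int
	if _, err := value(bar); err != nil {
		fmt.Println("error:", err)
	}

	n := 42
	v, _ := value(&n)
	fmt.Println(v)
}
```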

Sending data on a `chan` that is already closed

This is another mistake that people new to chan make easily. The following code demonstrates it:

package main

import (
    "fmt"
)

func foo(){
    defer func(){
        if err := recover(); err != nil {
            fmt.Println(err)
        }
    }()
    var bar = make(chan int, 1)
    close(bar)
    bar<-1
}

func main(){
    foo()
    fmt.Println("exit")
}

Output:

send on closed channel
exit

If recover is commented out, the output is:

panic: send on closed channel

goroutine 1 [running]:
main.foo()
    /home/letian/work/go/src/test/test.go:15 +0x83
main.main()
    /home/letian/work/go/src/test/test.go:19 +0x22
exit status 2

The logic that handles this is in the chansend function in src/runtime/chan.go, shown below:

// src/runtime/chan.go, line 269

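// If block is not nil, then the protocol will not sleep but return if it could not complete.
// Sleep can wake up with g.param == nil when a channel involved in the sleep has been closed.
// It is easiest to loop and re-run the operation; we'll see that it's now closed.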
func chansend(c *hchan, ep unsafe.Pointer, block bool, callerpc uintptr) bool {
    if c == nil {
        if !block {
            return false
        }
        gopark(nil, nil, waitReasonChanSendNilChan, traceEvGoStop, 2)
        throw("unreachable")
    }

    if debugChan {
        print("chansend: chan=", c, "\n")
    }

    if raceenabled {
        racereadpc(c.raceaddr(), callerpc, funcPC(chansend))
    }

    // Fast path: check for failed non-blocking operation without acquiring the lock.
    //
    // After observing that the channel is not closed, we observe that the channel is
    // not ready for sending. Each of these observations is a single word-sized read
    // (first c.closed and second c.recvq.first or c.qcount depending on kind of channel).
    // Because a closed channel cannot transition from 'ready for sending' to
    // 'not ready for sending', even if the channel is closed between the two observations,
    // they imply a moment between the two when the channel was both not yet closed
    // and not ready for sending. We behave as if we observed the channel at that moment,
    // and report that the send cannot proceed.
    //
    // It is okay if the reads are reordered here: if we observe that the channel is not
    // ready for sending and then observe that it is not closed, that implies that the
    // channel wasn't closed during the first observation.
    if !block && c.closed == 0 && ((c.dataqsiz == 0 && c.recvq.first == nil) ||
        (c.dataqsiz > 0 && c.qcount == c.dataqsiz)) {
        return false
    }

    var t0 int64
    if blockprofilerate > 0 {
        t0 = cputicks()
    }

    lock(&c.lock)

    if c.closed != 0 {
        unlock(&c.lock)
        panic(plainError("send on closed channel"))
    }

    if sg := c.recvq.dequeue(); sg != nil {
        // Found a waiting receiver. We pass the value we want to send
        // directly to the receiver, bypassing the channel buffer (if any).
        send(c, sg, ep, func() { unlock(&c.lock) }, 3)
        return true
    }

    if c.qcount < c.dataqsiz {
        // Space is available in the channel buffer. Enqueue the element to send.
        qp := chanbuf(c, c.sendx)
        if raceenabled {
            raceacquire(qp)
            racerelease(qp)
        }
        typedmemmove(c.elemtype, qp, ep)
        c.sendx++
        if c.sendx == c.dataqsiz {
            c.sendx = 0
        }
        c.qcount++
        unlock(&c.lock)
        return true
    }

    if !block {
        unlock(&c.lock)
        return false
    }

    // Block on the channel. Some receiver will complete our operation for us.
    gp := getg()
    mysg := acquireSudog()
    mysg.releasetime = 0
    if t0 != 0 {
        mysg.releasetime = -1
    }
    // No stack splits between assigning elem and enqueuing mysg
    // on gp.waiting where copystack can find it.
    mysg.elem = ep
    mysg.waitlink = nil
    mysg.g = gp
    mysg.isSelect = false
    mysg.c = c
    gp.waiting = mysg
    gp.param = nil
    c.sendq.enqueue(mysg)
    goparkunlock(&c.lock, waitReasonChanSend, traceEvGoBlockSend, 3)
    // Ensure the value being sent is kept alive until the
    // receiver copies it out. The sudog has a pointer to the
    // stack object, but sudogs aren't considered as roots of the
    // stack tracer.
    KeepAlive(ep)

    // someone woke us up.
    if mysg != gp.waiting {
        throw("G waiting list is corrupted")
    }
    gp.waiting = nil
    if gp.param == nil {
        if c.closed == 0 {
            throw("chansend: spurious wakeup")
        }
        panic(plainError("send on closed channel"))
    }
    gp.param = nil
    if mysg.releasetime > 0 {
        blockevent(mysg.releasetime-t0, 2)
    }
    mysg.c = nil
    releaseSudog(mysg)
    return true
}
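A common convention that avoids this panic is to let only the sending side close the channel, after its last send. The sketch below assumes a single producer; produce and sum are hypothetical names.

```go
package main

import "fmt"

// produce sends n values and closes the channel when it is done.
// Only this goroutine ever sends or closes, so a send can never hit a
// channel that is already closed.
func produce(n int) <-chan int {
	ch := make(chan int)
	go func() {
		defer close(ch)
		for i := 0; i < n; i++ {
			ch <- i
		}
	}()
	return ch
}

// sum receives until the channel is closed; receivers never close it.
func sum(n int) int {
	total := 0
	for v := range produce(n) {
		total += v
	}
	return total
}

func main() {
	fmt.Println(sum(4)) // 0 + 1 + 2 + 3
}
```

With several senders the same idea needs extra coordination, for example a sync.WaitGroup that closes the channel once every sender has finished.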

Concurrent reads and writes on the same map

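For those new to concurrent programming, concurrent map access is another problem that is easy to run into. The following code demonstrates it: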

package main

  import (
      "fmt"
  )

  func foo(){
      defer func(){
          if err := recover(); err != nil {
              fmt.Println(err)
          }
      }()
      var bar = make(map[int]int)
      go func(){
          defer func(){
              if err := recover(); err != nil {
                  fmt.Println(err)
              }
          }()
          for{
              _ = bar[1]
          }
      }()
      for{
          bar[1]=1
      }
  }

  func main(){
      foo()
      fmt.Println("exit")
  }

Output:

fatal error: concurrent map read and map write

  goroutine 5 [running]:
  runtime.throw(0x4bd8b0, 0x21)
      /home/letian/.gvm/gos/go1.12/src/runtime/panic.go:617 +0x72 fp=0xc00004c780 sp=0xc00004c750 pc=0x427f22
  runtime.mapaccess1_fast64(0x49eaa0, 0xc000088180, 0x1, 0xc0000260d8)
      /home/letian/.gvm/gos/go1.12/src/runtime/map_fast64.go:21 +0x1a8 fp=0xc00004c7a8 sp=0xc00004c780 pc=0x40eb58
  main.foo.func2(0xc000088180)
      /home/letian/work/go/src/test/test.go:21 +0x5c fp=0xc00004c7d8 sp=0xc00004c7a8 pc=0x48708c
  runtime.goexit()
      /home/letian/.gvm/gos/go1.12/src/runtime/asm_amd64.s:1337 +0x1 fp=0xc00004c7e0 sp=0xc00004c7d8 pc=0x450e51
  created by main.foo
      /home/letian/work/go/src/test/test.go:14 +0x68

  goroutine 1 [runnable]:
  main.foo()
      /home/letian/work/go/src/test/test.go:25 +0x8b
  main.main()
      /home/letian/work/go/src/test/test.go:30 +0x22
  exit status 2

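Attentive readers will notice that the log does not contain the exit we print at the end of the program; the call stacks are dumped straight away instead. A look at src/runtime/map.go turns up these lines: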

if h.flags&hashWriting != 0 {
      throw("concurrent map read and map write")
  }

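Unlike the cases above, an error raised by the runtime through throw is a fatal error that cannot be caught with recover in business code, which makes this the most lethal pitfall of all. So wherever a map is read and written concurrently, access to the map should be guarded by a lock.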
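A minimal sketch of such locking, using sync.RWMutex from the standard library; the safeMap type and its methods are invented for this example and are only one of several possible designs.

```go
package main

import (
	"fmt"
	"sync"
)

// safeMap is a hypothetical wrapper that guards a map with a RWMutex.
type safeMap struct {
	mu sync.RWMutex
	m  map[int]int
}

func newSafeMap() *safeMap {
	return &safeMap{m: make(map[int]int)}
}

// set takes the write lock, excluding all readers and writers.
func (s *safeMap) set(k, v int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[k] = v
}

// get takes the read lock, so several readers may run at once.
func (s *safeMap) get(k int) (int, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.m[k]
	return v, ok
}

// run reads and writes the same key from two goroutines.
func run() (int, bool) {
	bar := newSafeMap()
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			bar.set(1, i)
		}
	}()
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			_, _ = bar.get(1)
		}
	}()
	wg.Wait()
	return bar.get(1)
}

func main() {
	v, ok := run()
	fmt.Println(v, ok)
}
```

The standard library's sync.Map is another option for some access patterns.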

Type assertions

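Converting an interface value with a type assertion is another place where it is easy to slip, and even people who have used interfaces for a while tend to overlook it. The following code demonstrates it: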

package main

import (
    "fmt"
)

func foo(){
    defer func(){
        if err := recover(); err != nil {
            fmt.Println(err)
        }
    }()
    var i interface{} = "abc"
    _ = i.([]string)
}

func main(){
    foo()
    fmt.Println("exit")
}

Output:

interface conversion: interface {} is string, not []string
exit

The source is in src/runtime/iface.go, in the following two functions:

// panicdottypeE is called when doing an e.(T) conversion and the conversion fails.
// have = the dynamic type we have.
// want = the static type we're trying to convert to.
// iface = the static type we're converting from.
func panicdottypeE(have, want, iface *_type) {
    panic(&TypeAssertionError{iface, have, want, ""})
}

// panicdottypeI is called when doing an i.(T) conversion and the conversion fails.
// Same args as panicdottypeE, but "have" is the dynamic itab we have.
func panicdottypeI(have *itab, want, iface *_type) {
    var t *_type
    if have != nil {
        t = have._type
    }
    panicdottypeE(t, want, iface)
}
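The panic only happens with the single-value form of the assertion. The two-value ("comma ok") form and the type switch report a mismatch without panicking, as the sketch below shows; describe is a hypothetical helper.

```go
package main

import "fmt"

// describe uses a type switch, which never panics on a mismatch.
func describe(i interface{}) string {
	switch v := i.(type) {
	case string:
		return "string: " + v
	case []string:
		return fmt.Sprintf("[]string with %d elements", len(v))
	default:
		return fmt.Sprintf("unsupported type %T", i)
	}
}

func main() {
	var i interface{} = "abc"

	// The two-value form sets ok to false instead of panicking.
	if _, ok := i.([]string); !ok {
		fmt.Println("not a []string")
	}
	fmt.Println(describe(i))
}
```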