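A Hands-On Analysis of a Go Program That Hangs at Runtime

Preface

Recently a very enthusiastic reader suggested analyzing the goroutine scheduler with the help of a demo, and even provided the demo code. That suggestion led to this article, and I would like to thank this reader sincerely.

Some gophers may have seen this demo program before and already know the cause. This article nevertheless assumes that we are meeting the problem for the first time: we start from scratch, analyze and narrow it down step by step, and finally find the root cause and a solution.

The article does not require much background knowledge, but it helps to have used the gdb or delve debugger, and knowing some assembly language and how function call stacks work is even better.

We will focus on the following 3 topics.

  1. How to find the function call chain when the debugger cannot display the call stack accurately;

  2. How the runtime performs STOP THE WORLD when a GC happens;

  3. When preemptive scheduling does not take effect, and how to avoid that.

The experimental environment for this article is AMD64 Linux + go1.12.

Demo program and observed behavior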

package main

import (
    "fmt"
    "runtime"
    "time"
)

func deadloop() {
    for { // spins forever without calling any function
    }
}

func worker() {
    for {
        fmt.Println("worker is running")
        time.Sleep(time.Second * 1)
    }
}

func main() {
    fmt.Printf("There are %d cores.\n", runtime.NumCPU())

    go worker()

    go deadloop()

    i := 3
    for {
        fmt.Printf("main is running, i=%d\n", i)
        i--
        if i == 0 {
            runtime.GC() // force a garbage collection on the third pass
        }

        time.Sleep(time.Second * 1)
    }
}

 

Compile and run it. The output is:

bobo@ubuntu:~/study/go$ ./deadlock
There are 4 cores.
main is running, i=3
worker is running
main is running, i=2
worker is running
worker is running
main is running, i=1
worker is running

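After printing these few lines the program never produces any more output. It looks as if it has hung!

If this is the first time we have met such a problem, where should we start?

Analyzing the code

Let's first look at the code. After startup, the main function creates a worker goroutine and a deadloop goroutine; together with the main goroutine there should be 3 user goroutines in total:

  1. the deadloop goroutine executes an endless loop and does no real work;

  2. the worker goroutine loops, printing "worker is running" once per second;

  3. the main goroutine also runs a loop, printing "main is running" together with the value of i once per second and decrementing i; when i reaches 0 it calls runtime.GC to trigger a garbage collection.

With what we know so far, nothing looks wrong; everything should work normally. So why does the program hang?

Analyzing the log

Since the code gives no hint, we can only go back and read the output carefully. The log shows that the main goroutine and the worker were fine at first, but after printing i=1 the main goroutine produced no more output, and the worker printed only one more line before going silent as well.

From the code we know that after printing i=1, i is decremented to 0, and once i is 0 runtime.GC() is executed. So we have good reason to suspect that the hang is related to garbage collection. A suspicion is not proof, though; we need evidence that the two are really related. How do we find it?

Tracing the function call chain

Because the program has not exited but is stuck, the natural next step is to use a debugger to see what is going on. Here we use delve, a debugger built specifically for Go programs.

Use the pidof command to find the process ID of deadlock, attach to it with dlv attach, and list the goroutines in the program with the goroutines command: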

bobo@ubuntu:~/study/go$ pidof deadlock
2369
bobo@ubuntu:~/study/go$ sudo dlv attach 2369
Type 'help' for list of commands.
(dlv) goroutines
Goroutine 1 - User: /usr/local/go/src/runtime/mgc.go:1055 runtime.GC (0x416ab8)
Goroutine 2 - User: /usr/local/go/src/runtime/proc.go:302 runtime.gopark (0x429b2f)
Goroutine 3 - User: /usr/local/go/src/runtime/proc.go:302 runtime.gopark (0x429b2f)
Goroutine 4 - User: /usr/local/go/src/runtime/proc.go:302 runtime.gopark (0x429b2f)
Goroutine 5 - User: /usr/local/go/src/runtime/proc.go:307 time.Sleep (0x442a09)
Goroutine 6 - User: ./deadlock.go:10 main.deadloop (0x488f90) (thread 2372)
Goroutine 7 - User: /usr/local/go/src/runtime/proc.go:302 runtime.gopark (0x429b2f)
Goroutine 17 - User: /usr/local/go/src/runtime/proc.go:3005 runtime.exitsyscall (0x4307e6)
Goroutine 33 - User: /usr/local/go/src/runtime/proc.go:302 runtime.gopark (0x429b2f)
Goroutine 34 - User: /usr/local/go/src/runtime/proc.go:302 runtime.gopark (0x429b2f)
Goroutine 35 - User: /usr/local/go/src/runtime/proc.go:302 runtime.gopark (0x429b2f)
Goroutine 36 - User: /usr/local/go/src/runtime/proc.go:302 runtime.gopark (0x429b2f)
Goroutine 37 - User: /usr/local/go/src/runtime/proc.go:302 runtime.gopark (0x429b2f)
Goroutine 49 - User: /usr/local/go/src/runtime/proc.go:302 runtime.gopark (0x429b2f)
[14 goroutines]
(dlv) 

The output shows 14 goroutines in total. We can ignore the others and concentrate on the 3 user goroutines, which are easy to pick out:

Goroutine 1 - User: /usr/local/go/src/runtime/mgc.go:1055 runtime.GC (0x416ab8)    # main goroutine
Goroutine 5 - User: /usr/local/go/src/runtime/proc.go:307 time.Sleep (0x442a09)    # worker goroutine
Goroutine 6 - User: ./deadlock.go:10 main.deadloop (0x488f90) (thread 2372)        # deadloop goroutine

Because we suspect the hang is related to the runtime.GC() call, we switch to Goroutine 1 and use the backtrace command (bt for short) to look at the call stack of the main goroutine:

(dlv) goroutine 1
Switched from 0 to 1 (thread 2371)
(dlv) bt
0 0x0000000000453383 in runtime.futex at /usr/local/go/src/runtime/sys_linux_amd64.s:536
1 0x000000000044f5d0 in runtime.systemstack_switch at /usr/local/go/src/runtime/asm_amd64.s:311
2 0x0000000000416eb9 in runtime.gcStart at /usr/local/go/src/runtime/mgc.go:1284
3 0x0000000000416ab8 in runtime.GC at /usr/local/go/src/runtime/mgc.go:1055
4 0x00000000004891a6 in main.main at ./deadlock.go:39
5 0x000000000042974c in runtime.main at /usr/local/go/src/runtime/proc.go:200
6 0x0000000000451521 in runtime.goexit at /usr/local/go/src/runtime/asm_amd64.s:1337
(dlv) 

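The output shows the call chain of the main goroutine:

main()->runtime.GC()->runtime.gcStart()->runtime.systemstack_switch()->runtime.futex

Following this chain through the source code starting from main, we find that line 1284 of mgc.go is not a call to systemstack_switch but the statement systemstack(stopTheWorldWithSema). Here that statement switches from the main goroutine's stack to the g0 stack and runs stopTheWorldWithSema there. Yet stopTheWorldWithSema is nowhere to be seen in the call stack above. Perhaps the debugger does not handle the switch from the main goroutine's stack to the g0 stack well. Either way, to understand what is happening we need to find the call path from stopTheWorldWithSema to runtime.futex.

Tracing the call chain by hand

Since the call path shown by the debugger is incomplete, we have to find it by hand. First, disassemble to see the instruction that is about to run: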

(dlv) disass
TEXT runtime.futex(SB) /usr/local/go/src/runtime/sys_linux_amd64.s
        mov    rdi, qword ptr [rsp+0x8]
        mov    esi, dword ptr [rsp+0x10]
        mov    edx, dword ptr [rsp+0x14]
        mov    r10, qword ptr [rsp+0x18]
        mov    r8, qword ptr [rsp+0x20]
        mov    r9d, dword ptr [rsp+0x28]
        mov    eax, 0xca
        syscall
=>      mov    dword ptr [rsp+0x30], eax
        ret

 

The disassembly tells us that the next instruction to execute is the second-to-last instruction of the futex function in sys_linux_amd64.s:

==> mov    dword ptr [rsp+0x30], eax

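To find out who called futex, we let futex finish and return to its caller. After several si (single-instruction step) commands, the program returns to runtime.futexsleep, as shown below: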

(dlv) si
> runtime.futex() /usr/local/go/src/runtime/sys_linux_amd64.s:536 
      MOVQ    ts+16(FP), R10
      MOVQ    addr2+24(FP), R8
      MOVL    val3+32(FP), R9
      MOVL    $SYS_futex, AX
      SYSCALL
=>    MOVL     AX, ret+40(FP)
      RET
(dlv) si
> runtime.futex() /usr/local/go/src/runtime/sys_linux_amd64.s:537 
      MOVQ    addr2+24(FP), R8
      MOVL     val3+32(FP), R9
      MOVL     $SYS_futex, AX
      SYSCALL
      MOVL     AX, ret+40(FP)
=>    RET
(dlv) si
> runtime.futexsleep() /usr/local/go/src/runtime/os_linux.go:64 
          } else {
              ts.tv_nsec = 0
              ts.set_sec(int64(timediv(ns, 1000000000, (*int32)(unsafe.Pointer(&ts.tv_nsec)))))
          }
          futex(unsafe.Pointer(addr), _FUTEX_WAIT_PRIVATE, val, unsafe.Pointer(&ts), nil, 0)
=>  }
  
      // If any procs are sleeping on addr, wake up at most cnt.
      //go:nosplit
      func futexwakeup(addr *uint32, cnt uint32) {
           ret := futex(unsafe.Pointer(addr), _FUTEX_WAKE_PRIVATE, cnt, nil, nil, 0)
(dlv) 

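The program is now stopped at line 64 of os_linux.go (the => marker shows the current position), which is the last line of futexsleep. Using the n command to step one line of Go code, we return from runtime.futexsleep to runtime.notetsleep_internal: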

(dlv) n
>runtime.notetsleep_internal() /usr/local/go/src/runtime/lock_futex.go:194
              if *cgo_yield != nil && ns > 10e6 {
                  ns = 10e6
              }
              gp.m.blocked = true
              futexsleep(key32(&n.key), 0, ns)
=>            if *cgo_yield != nil {
                  asmcgocall(*cgo_yield, nil)
              }
              gp.m.blocked = false
              if atomic.Load(key32(&n.key)) != 0 {
                  break

A few more n commands inside runtime.notetsleep_internal take us back from runtime.notetsleep_internal to runtime.notetsleep:

(dlv) n
>runtime.notetsleep() /usr/local/go/src/runtime/lock_futex.go:210
=>func notetsleep(n *note, ns int64) bool {
          gp := getg()
          if gp != gp.m.g0 && gp.m.preemptoff != "" {
               throw("notetsleep not on g0")
          }
    
          return notetsleep_internal(n, ns)
      }

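To find out who called notetsleep we keep issuing n, and something strange happens: we cannot return from notetsleep to its caller. Execution keeps going round inside notetsleep as if it were calling itself recursively, as shown below: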

(dlv) n
>runtime.notetsleep() /usr/local/go/src/runtime/lock_futex.go:211
         func notetsleep(n *note, ns int64) bool {
=>          gp := getg()
            if gp != gp.m.g0 && gp.m.preemptoff != "" {
                throw("notetsleep not on g0")
            }
  
            return notetsleep_internal(n, ns)
        }
(dlv) n
>runtime.notetsleep() /usr/local/go/src/runtime/lock_futex.go:216
        func notetsleep(n *note, ns int64) bool {
            gp := getg()
            if gp != gp.m.g0 && gp.m.preemptoff != "" {
                throw("notetsleep not on g0")
            }
  
=>          return notetsleep_internal(n, ns)
        }
(dlv) n
>runtime.notetsleep() /usr/local/go/src/runtime/lock_futex.go:210
=>    func notetsleep(n *note, ns int64) bool {
             gp := getg()
             if gp != gp.m.g0 && gp.m.preemptoff != "" {
                 throw("notetsleep not on g0")
             }
  
             return notetsleep_internal(n, ns)
         }

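notetsleep is only a few lines long and contains no recursive call, so this is odd; the debugger does seem to have a problem here. Let's look at the disassembly: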

(dlv) disass
TEXT runtime.notetsleep(SB) /usr/local/go/src/runtime/lock_futex.go
=>      mov  rcx, qword ptr fs:[0xfffffff8]
        cmp  rsp, qword ptr [rcx+0x10]
        jbe    0x4095df
        sub   rsp, 0x20
        mov  qword ptr [rsp+0x18], rbp
        lea    rbp, ptr [rsp+0x18]
        mov  rax, qword ptr fs:[0xfffffff8]
        ......

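The program is now stopped at the first instruction of notetsleep. We know that immediately after a call instruction, before the callee has run any of its own instructions, the rsp register points to the return address, that is, the address where execution continues once this function finishes. So let's look at the value of rsp: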

(dlv) regs
    Rip=0x0000000000409560
    Rsp=0x000000c000045f60
    ......

With the value of rsp in hand, let's see what is stored in the memory it points to:

(dlv) p *(*int)(0x000000c000045f60)
4374697

If 4374697 really is a return address, we should be able to set a breakpoint there. Let's try:

(dlv) b *4374697
Breakpoint 1 set at 0x42c0a9 for runtime.stopTheWorldWithSema() /usr/local/go/src/runtime/proc.go:1050

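Persistence pays off: we have finally found stopTheWorldWithSema(). The breakpoint tells us that line 1050 of runtime/proc.go calls notetsleep, and the source code confirms that notetsleep is indeed called inside a loop at that point.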
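The decimal value printed by dlv (4374697) and the hexadecimal address in the breakpoint message (0x42c0a9) are the same number. A minimal Go check of that conversion; the helper name toHex is made up for this illustration:

```go
package main

import "fmt"

// toHex formats an address in the 0x-prefixed hexadecimal form that the
// breakpoint message uses.
func toHex(addr uint64) string {
	return fmt.Sprintf("%#x", addr)
}

func main() {
	// 4374697 is the value read from the memory that rsp pointed to.
	fmt.Println(toHex(4374697)) // prints 0x42c0a9
}
```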

We now have the complete call path of the main goroutine:

main()->runtime.GC()->runtime.gcStart()->runtime.stopTheWorldWithSema()->runtime.notetsleep()->runtime.notetsleep_internal()->runtime.futexsleep()->runtime.futex()

Analyzing the stopTheWorldWithSema function

Next, let's look closely at why stopTheWorldWithSema calls notetsleep and goes to sleep:

// stopTheWorldWithSema is the core implementation of stopTheWorld.
// The caller is responsible for acquiring worldsema and disabling
// preemption first and then should stopTheWorldWithSema on the system
// stack:
//
//semacquire(&worldsema, 0)
//m.preemptoff = "reason"
//systemstack(stopTheWorldWithSema)
//
// When finished, the caller must either call startTheWorld or undo
// these three operations separately:
//
//m.preemptoff = ""
//systemstack(startTheWorldWithSema)
//semrelease(&worldsema)
//
// It is allowed to acquire worldsema once and then execute multiple
// startTheWorldWithSema/stopTheWorldWithSema pairs.
// Other P's are able to execute between successive calls to
// startTheWorldWithSema and stopTheWorldWithSema.
// Holding worldsema causes any other goroutines invoking
// stopTheWorld to block.
func stopTheWorldWithSema() {
    _g_ := getg()  // running on the g0 stack, so _g_ = g0

    ......

    lock(&sched.lock)
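    sched.stopwait = gomaxprocs  // gomaxprocs is the number of Ps; we must wait for all of them to stop
    atomic.Store(&sched.gcwaiting, 1) // set the gcwaiting flag to indicate that we are waiting to run the GC
    preemptall()  // set the preemption mark, asking running goroutines to stop
    // stop current P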
    _g_.m.p.ptr().status = _Pgcstop // Pgcstop is only diagnostic.
    sched.stopwait--
    // try to retake all P's in Psyscall status
    for _, p := range allp {
        s := p.status
        // preempt goroutines that are in a system call by changing their P's status to _Pgcstop
        if s == _Psyscall && atomic.Cas(&p.status, s, _Pgcstop) {
            if trace.enabled {
                traceGoSysBlock(p)
                traceProcStop(p)
            }
            p.syscalltick++  
            sched.stopwait--
        }
    }
    // stop idle P's
    for { // set the Ps on the idle list to _Pgcstop so that worker threads can no longer pick them up
        p := pidleget()
        if p == nil {
            break
        }
        p.status = _Pgcstop
        sched.stopwait--
    }
    wait := sched.stopwait > 0
    unlock(&sched.lock)

    // wait for remaining P's to stop voluntarily
    if wait {
        for {
            // wait for 100us, then try to re-preempt in case of any races
            if notetsleep(&sched.stopnote, 100*1000) {  // in our scenario the program is stuck here
                noteclear(&sched.stopnote)
                break
            }
            preemptall() // set the preemption mark again on every pass
        }
    }

    ......
}

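The flow of stopTheWorldWithSema is fairly clear:

  1. call preemptall() to set the preemption mark on goroutines that are currently running Go code;

  2. stop the P bound to the current worker thread;

  3. use a CAS operation to change the status of Ps that are in a system call to _Pgcstop, which stops those Ps;

  4. change the status of the Ps on the idle list to _Pgcstop;

  5. wait for the Ps that are still running to stop.

From this flow we can see that stopTheWorldWithSema relies on two mechanisms to Stop The World:

  1. For Ps that are not running Go code at this moment, i.e. Ps on the idle list and Ps in a system call, it directly sets their status to _Pgcstop so that no worker thread can bind to them. This keeps memory references consistent: a worker thread must be bound to a P to run Go code, without a P it cannot run Go code, and without running Go code it cannot change the references between objects in memory;

  2. For Ps that are bound to a worker thread and running Go code right now, it cannot simply change their status; it can only set a preemption mark to request that they stop.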

From the earlier analysis we already know that the deadlock program is stuck in the following for loop:

for {
    // wait for 100us, then try to re-preempt in case of any races
    if notetsleep(&sched.stopnote, 100 * 1000) {  // in our scenario the program is stuck here
        noteclear(&sched.stopnote)
        break
    }
    preemptall() // set the preemption mark again on every pass
}

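The program keeps executing this for loop. Inside it, the code repeatedly calls preemptall() to set the preemption mark on running goroutines and then uses notetsleep to wait for them to pause. Judging from the program's behavior and our analysis, some goroutine has not paused, so the loop can never break.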

Finding the goroutine that has not paused

Let's look at our 3 user goroutines again:

Goroutine 1 - User: /usr/local/go/src/runtime/mgc.go:1055 runtime.GC (0x416ab8)    # main goroutine
Goroutine 5 - User: /usr/local/go/src/runtime/proc.go:307 time.Sleep (0x442a09)    # worker goroutine
Goroutine 6 - User: ./deadlock.go:10 main.deadloop (0x488f90) (thread 2372)        # deadloop goroutine

The worker thread of Goroutine 1 is the one executing the for loop above, so it cannot be the one that failed to stop. Next, Goroutine 5:

(dlv) goroutine 5
Switched from 0 to 5 (thread 2765)
(dlv) bt
0 0x0000000000429b2f in runtime.gopark at /usr/local/go/src/runtime/proc.go:302
1 0x0000000000442a09 in runtime.goparkunlock at /usr/local/go/src/runtime/proc.go:307
2 0x0000000000442a09 in time.Sleep at /usr/local/go/src/runtime/time.go:105
3 0x0000000000489023 in main.worker at ./deadlock.go:19
4 0x0000000000451521 in runtime.goexit at /usr/local/go/src/runtime/asm_amd64.s:1337

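The call stack shows that goroutine 5 is parked in gopark, so it must be goroutine 6 that has not stopped. We switch to goroutine 6 and look at its call stack and the instruction it is executing: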

(dlv) goroutine 6
Switched from 5 to 6 (thread 2768)
(dlv) bt
0 0x0000000000488f90 in main.deadloop at ./deadlock.go:10
1 0x0000000000451521 in runtime.goexit at /usr/local/go/src/runtime/asm_amd64.s:1337
(dlv) disass
TEXT main.deadloop(SB) /home/bobo/study/go/deadlock.go
=>      deadlock.go:10  0x488f90  ebfe  jmp $main.deadloop
(dlv) 

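We can see that the goroutine keeps executing a jmp instruction that jumps to its own address. To understand why it cannot be stopped, we need to look at how preemptall() actually asks a goroutine to pause.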

// Tell all goroutines that they have been preempted and they should stop.
// This function is purely best-effort. It can fail to inform a goroutine if a
// processor just started running it.
// No locks need to be held.
// Returns true if preemption request was issued to at least one goroutine.
func preemptall() bool {
    res := false
    for _, _p_ := range allp { // iterate over all Ps
        if _p_.status != _Prunning {
            continue
        }

        // only goroutines on Ps in the running state are asked to stop
        if preemptone(_p_) {
            res = true
        }
    }
    return res
}

Now the preemptone function:

// Tell the goroutine running on processor P to stop.
// This function is purely best-effort. It can incorrectly fail to inform the
// goroutine. It can send inform the wrong goroutine. Even if it informs the
// correct goroutine, that goroutine might ignore the request if it is
// simultaneously executing newstack.
// No lock needs to be held.
// Returns true if preemption request was issued.
// The actual preemption will happen at some point in the future
// and will be indicated by the gp->status no longer being
// Grunning
func preemptone(_p_ *p) bool {
    mp := _p_.m.ptr()
    if mp == nil || mp == getg().m {
        return false
    }
    gp := mp.curg // find the running goroutine through the P
    if gp == nil || gp == mp.g0 {
        return false
    }

    gp.preempt = true // set the preemption mark

    // Every call in a go routine checks for stack overflow by
    // comparing the current stack pointer to gp->stackguard0.
    // Setting gp->stackguard0 to StackPreempt folds
    // preemption into the normal stack overflow check.
    gp.stackguard0 = stackPreempt  // set the stack-growth mark; this makes the requested goroutine enter the stack-growth function
    return true
}

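preemptone shows that so-called preemption merely sets marks on the running goroutine; nothing forces it to stop. The requested goroutine therefore has to check the preempt and stackguard0 marks itself. But the assembly of deadloop above never checks them: the function consists of a single instruction that jumps to itself in an endless loop. It cannot react to the pause request and so never stops, which is why the for loop waiting for it never exits and the whole program appears to hang.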

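So far we have found the immediate cause of the apparent hang: the goroutine running deadloop does not pause, so the garbage collection cannot proceed, and the goroutines that have already paused cannot resume. But why can the other goroutines be paused while this one cannot? We need to dig further.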

Uncovering the truth

From the analysis above we know that preemptone requests a running goroutine to pause by setting

gp.preempt = true
gp.stackguard0 = stackPreempt //stackPreempt = 0xfffffffffffffade

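To find the code that checks these marks, we search the source tree for the three strings "preempt", "stackPreempt" and "stackguard0" with a text search tool. This leads to newstack(), the function that handles preemption requests: if it finds that the current goroutine has been asked to yield, it suspends that goroutine. Searching in turn for the callers of newstack, we arrive at the following call chain: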

morestack_noctxt()->morestack()->newstack()

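The comment on morestack in the source code tells us that it is called from compiler-inserted code in the function prologue (as the disassembly of main will show, the call instruction itself is placed at the very end of the function):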

// Called during function prolog when more stack is needed.
//
// The traceback routines see morestack on a g0 as being
// the top of a stack (for example, morestack calling newstack
// calling the scheduler calling newm calling gc), so we must
// record an argument size. For that purpose, it has no arguments.
TEXT runtime·morestack(SB),NOSPLIT,$0-0

To verify this comment, let's disassemble the main function:

TEXT main.main(SB) /home/bobo/study/go/deadlock.go
   0x0000000000489030 <+0>:     mov   %fs:0xfffffffffffffff8,%rcx
   0x0000000000489039 <+9>:     cmp   0x10(%rcx),%rsp
   0x000000000048903d <+13>:    jbe   0x4891b0 <main.main+384>
   0x0000000000489043 <+19>:    sub   $0x80,%rsp
   0x000000000048904a <+26>:    mov   %rbp,0x78(%rsp)
   0x000000000048904f <+31>:    lea   0x78(%rsp),%rbp
   ......
   0x00000000004891a1 <+369>:   callq 0x416a60 <runtime.GC>
   0x00000000004891a6 <+374>:   mov   0x50(%rsp),%rax
   0x00000000004891ab <+379>:   jmpq  0x489108 <main.main+216>
   0x00000000004891b0 <+384>:   callq 0x44f730 <runtime.morestack_noctxt>
   0x00000000004891b5 <+389>:   jmpq  0x489030 <main.main>

 

At the end of main we see the call to runtime.morestack_noctxt, and looking back we can see that it is reached through the jbe instruction, the third instruction of main.

0x000000000048903d <+13>:    jbe   0x4891b0 <main.main+384>
......
0x00000000004891b0 <+384>:   callq 0x44f730 <runtime.morestack_noctxt>

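jbe is a conditional jump; whether it jumps depends on the result of the previous instruction, which here is the second instruction of main. To see exactly what is going on, here are the first three instructions of main: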

0x0000000000489030 <+0>:    mov   %fs:0xfffffffffffffff8,%rcx  # first instruction of main
0x0000000000489039 <+9>:    cmp   0x10(%rcx),%rsp              # second instruction of main
0x000000000048903d <+13>:   jbe   0x4891b0 <main.main+384>     # third instruction of main

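As described in my series of articles analyzing the Go scheduler source code, Go uses the fs register to implement thread-local storage (TLS) for system threads. The first instruction of main reads the pointer to the currently running g from TLS into the rcx register. The second instruction has a memory operand with indirect addressing: it reads the value stored at offset 16 from the start of g and compares it with the rsp register. What is stored at offset 16 of g? Let's recall the definition of the g struct: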

type g struct {
    stack         stack  
    stackguard0   uintptr
    stackguard1   uintptr
    ......
}

type stack struct {
    lo  uintptr     //8 bytes
    hi  uintptr     //8 bytes
}

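The first member of g, stack, occupies 16 bytes (lo and hi are 8 bytes each), so offset 16 from the start of a g corresponds to the stackguard0 field. The second instruction of main therefore compares the stack pointer rsp with stackguard0, and the jbe is taken when rsp is less than or equal to stackguard0. In the normal case that means the stack of the current g is nearly used up and at risk of overflowing, so morestack_noctxt must be called to grow it. From the earlier analysis we know that preemptone, when setting the preemption mark, sets stackguard0 of the goroutine to be preempted to stackPreempt, and stackPreempt is a very large integer, 0xfffffffffffffade; the stack pointer of a goroutine can never be that large. So once stackguard0 of any goroutine's g has been set to the preemption mark, the next function call that executes the compiler-inserted check will call morestack_noctxt.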

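In our scenario, deadloop just keeps executing its jmp instruction and never calls another function, so it never gets a chance to check stackguard0 of its g. It therefore never goes through morestack_noctxt to reach newstack(), the function that handles preemption requests (and suspends the current goroutine when it finds one), and naturally it never stops.

Once the root cause is known, the fix is fairly simple: calling some other function inside the for loop of deadloop should be enough, and readers can verify this for themselves. One caveat: the compiler does not insert the stack-growth check into every function. Only functions that the compiler considers at risk of overflowing the stack get the check in the prologue and the matching morestack call at the end of the function that we have just analyzed.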

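Conclusion

This analysis shows that preemptive scheduling in Go (as of go1.12) is really a cooperative form of preemption: it only works with the cooperation of the goroutine being preempted, and that cooperation is implemented by check code that the compiler inserts into function prologues. It is also a reminder that Go code should avoid long-running, purely computational loops, which can make the program appear to hang or lead to long STW pauses.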
