golang----GC的實現原理

 

 

 

Golang從1.5開始引入了三色GC, 通過屢次改進, 當前的1.9版本的GC停頓時間已經能夠作到極短.
停頓時間的減小意味着"最大響應時間"的縮短, 這也讓go更適合編寫網絡服務程序.
接下來說解go中的三色GC的實現原理.---基於1.9.2版本html

基礎概念

內存結構

go在程序啓動時會分配一塊虛擬內存地址是連續的內存, 結構以下:node

這一塊內存分爲了3個區域, 在X64上大小分別是512M, 16G和512G, 它們的做用以下linux

arenac++

arena區域就是咱們一般說的heap, go從heap分配的內存都在這個區域中. git

bitmapgithub

bitmap區域用於表示arena區域中哪些地址保存了對象, 而且對象中哪些地址包含了指針.
bitmap區域中一個byte(8 bit)對應了arena區域中的四個指針大小的內存, 也就是2 bit對應一個指針大小的內存.
因此bitmap區域的大小是 512GB / 指針大小(8 byte) / 4 = 16GB.golang

bitmap區域中的一個byte對應arena區域的四個指針大小的內存的結構以下,
每個指針大小的內存都會有兩個bit分別表示是否應該繼續掃描和是否包含指針:web

bitmap中的byte和arena的對應關係從末尾開始, 也就是隨着內存分配會向兩邊擴展:算法

spansjson

spans區域用於表示arena區中的某一頁(Page)屬於哪一個span, 什麼是span將在下面介紹.
spans區域中一個指針(8 byte)對應了arena區域中的一頁(在go中一頁=8KB).
因此spans的大小是 512GB / 頁大小(8KB) * 指針大小(8 byte) = 512MB.

spans區域的一個指針對應arena區域的一頁的結構以下, 和bitmap不同的是對應關係會從開頭開始:

何時從Heap分配對象

不少講解go的文章和書籍中都提到過, go會自動肯定哪些對象應該放在棧上, 哪些對象應該放在堆上.
簡單的來講, 當一個對象的內容可能在生成該對象的函數結束後被訪問, 那麼這個對象就會分配在堆上.
在堆上分配對象的狀況包括:

    • 返回對象的指針
    • 傳遞了對象的指針到其餘函數
    • 在閉包中使用了對象而且須要修改對象
    • 使用new

在C語言中函數返回在棧上的對象的指針是很是危險的事情, 但在go中倒是安全的, 由於這個對象會自動在堆上分配.
go決定是否使用堆分配對象的過程也叫"逃逸分析".

 

GC Bitmap

GC在標記時須要知道哪些地方包含了指針, 例如上面提到的bitmap區域涵蓋了arena區域中的指針信息.
除此以外, GC還須要知道棧空間上哪些地方包含了指針,
由於棧空間不屬於arena區域, 棧空間的指針信息將會在函數信息裏面.
另外, GC在分配對象時也須要根據對象的類型設置bitmap區域, 來源的指針信息將會在類型信息裏面.

總結起來go中有如下的GC Bitmap:

  • bitmap區域: 涵蓋了arena區域, 使用2 bit表示一個指針大小的內存
  • 函數信息: 涵蓋了函數的棧空間, 使用1 bit表示一個指針大小的內存 (位於stackmap.bytedata)
  • 類型信息: 在分配對象時會複製到bitmap區域, 使用1 bit表示一個指針大小的內存 (位於_type.gcdata)

Span

span是用於分配對象的區塊, 下圖是簡單說明了Span的內部結構:

 

一般一個span包含了多個大小相同的元素, 一個元素會保存一個對象, 除非:

  • span用於保存大對象, 這種狀況span只有一個元素
  • span用於保存極小對象且不包含指針的對象(tiny object), 這種狀況span會用一個元素保存多個對象

span中有一個freeindex標記下一次分配對象時應該開始搜索的地址, 分配後freeindex會增長,
在freeindex以前的元素都是已分配的, 在freeindex以後的元素有可能已分配, 也有可能未分配.

span每次GC之後均可能會回收掉一些元素, allocBits用於標記哪些元素是已分配的, 哪些元素是未分配的.
使用freeindex + allocBits能夠在分配時跳過已分配的元素, 把對象設置在未分配的元素中,

但由於每次都去訪問allocBits效率會比較慢, span中有一個整數型的allocCache用於緩存freeindex開始的bitmap, 緩存的bit值與原值相反.

gcmarkBits用於在gc時標記哪些對象存活, 每次gc之後gcmarkBits會變爲allocBits.
須要注意的是span結構自己的內存是從系統分配的, 上面提到的spans區域和bitmap區域都只是一個索引.

Span的類型

span根據大小能夠分爲67個類型, 以下:

// class  bytes/obj  bytes/span  objects  tail waste  max waste
//     1          8        8192     1024           0     87.50%
//     2         16        8192      512           0     43.75%
//     3         32        8192      256           0     46.88%
//     4         48        8192      170          32     31.52%
//     5         64        8192      128           0     23.44%
//     6         80        8192      102          32     19.07%
//     7         96        8192       85          32     15.95%
//     8        112        8192       73          16     13.56%
//     9        128        8192       64           0     11.72%
//    10        144        8192       56         128     11.82%
//    11        160        8192       51          32      9.73%
//    12        176        8192       46          96      9.59%
//    13        192        8192       42         128      9.25%
//    14        208        8192       39          80      8.12%
//    15        224        8192       36         128      8.15%
//    16        240        8192       34          32      6.62%
//    17        256        8192       32           0      5.86%
//    18        288        8192       28         128     12.16%
//    19        320        8192       25         192     11.80%
//    20        352        8192       23          96      9.88%
//    21        384        8192       21         128      9.51%
//    22        416        8192       19         288     10.71%
//    23        448        8192       18         128      8.37%
//    24        480        8192       17          32      6.82%
//    25        512        8192       16           0      6.05%
//    26        576        8192       14         128     12.33%
//    27        640        8192       12         512     15.48%
//    28        704        8192       11         448     13.93%
//    29        768        8192       10         512     13.94%
//    30        896        8192        9         128     15.52%
//    31       1024        8192        8           0     12.40%
//    32       1152        8192        7         128     12.41%
//    33       1280        8192        6         512     15.55%
//    34       1408       16384       11         896     14.00%
//    35       1536        8192        5         512     14.00%
//    36       1792       16384        9         256     15.57%
//    37       2048        8192        4           0     12.45%
//    38       2304       16384        7         256     12.46%
//    39       2688        8192        3         128     15.59%
//    40       3072       24576        8           0     12.47%
//    41       3200       16384        5         384      6.22%
//    42       3456       24576        7         384      8.83%
//    43       4096        8192        2           0     15.60%
//    44       4864       24576        5         256     16.65%
//    45       5376       16384        3         256     10.92%
//    46       6144       24576        4           0     12.48%
//    47       6528       32768        5         128      6.23%
//    48       6784       40960        6         256      4.36%
//    49       6912       49152        7         768      3.37%
//    50       8192        8192        1           0     15.61%
//    51       9472       57344        6         512     14.28%
//    52       9728       49152        5         512      3.64%
//    53      10240       40960        4           0      4.99%
//    54      10880       32768        3         128      6.24%
//    55      12288       24576        2           0     11.45%
//    56      13568       40960        3         256      9.99%
//    57      14336       57344        4           0      5.35%
//    58      16384       16384        1           0     12.49%
//    59      18432       73728        4           0     11.11%
//    60      19072       57344        3         128      3.57%
//    61      20480       40960        2           0      6.87%
//    62      21760       65536        3         256      6.25%
//    63      24576       24576        1           0     11.45%
//    64      27264       81920        3         128     10.00%
//    65      28672       57344        2           0      4.91%
//    66      32768       32768        1           0     12.50%

在分配對象時, 會根據對象的大小決定使用什麼類型的span,
例如16 byte的對象會使用span 2, 17 byte的對象會使用span 3, 32 byte的對象會使用span 3.
從這個例子也能夠看到, 分配17和32 byte的對象都會使用span 3, 也就是說部分大小的對象在分配時會浪費必定的空間.

有人可能會注意到, 上面最大的span的元素大小是32K, 那麼分配超過32K的對象會在哪裏分配呢?
超過32K的對象稱爲"大對象", 分配大對象時, 會直接從heap分配一個特殊的span,
這個特殊的span的類型(class)是0, 只包含了一個大對象, span的大小由對象的大小決定.

特殊的span加上的66個標準的span, 一共組成了67個span類型.

Span的位置

前一篇中我提到了P是一個虛擬的資源, 同一時間只能有一個線程訪問同一個P, 因此P中的數據不須要鎖.
爲了分配對象時有更好的性能, 各個P中都有span的緩存(也叫mcache), 緩存的結構以下:

各個P中按span類型的不一樣, 有67*2=134個span的緩存,

其中scan和noscan的區別在於,
若是對象包含了指針, 分配對象時會使用scan的span,
若是對象不包含指針, 分配對象時會使用noscan的span.
把span分爲scan和noscan的意義在於,
GC掃描對象的時候對於noscan的span能夠不去查看bitmap區域來標記子對象, 這樣能夠大幅提高標記的效率.

在分配對象時將會從如下的位置獲取適合的span用於分配:

    • 首先從P的緩存(mcache)獲取, 若是有緩存的span而且未滿則使用, 這個步驟不須要鎖
    • 而後從全局緩存(mcentral)獲取, 若是獲取成功則設置到P, 這個步驟須要鎖
    • 最後從mheap獲取, 獲取後設置到全局緩存, 這個步驟須要鎖

在P中緩存span的作法跟CoreCLR中線程緩存分配上下文(Allocation Context)的作法類似,
均可以讓分配對象時大部分時候不須要線程鎖, 改進分配的性能.

分配對象的處理

分配對象的流程

go從堆分配對象時會調用newobject函數, 這個函數的流程大體以下:

首先會檢查GC是否在工做中, 若是GC在工做中而且當前的G分配了必定大小的內存則須要協助GC作必定的工做,
這個機制叫GC Assist, 用於防止分配內存太快致使GC回收跟不上的狀況發生.

以後會判斷是小對象仍是大對象, 若是是大對象則直接調用largeAlloc從堆中分配,
若是是小對象分3個階段獲取可用的span, 而後從span中分配對象:

  • 首先從P的緩存(mcache)獲取
  • 而後從全局緩存(mcentral)獲取, 全局緩存中有可用的span的列表
  • 最後從mheap獲取, mheap中也有span的自由列表, 若是都獲取失敗則從arena區域分配

這三個階段的詳細結構以下圖:

 

 

數據類型的定義:

分配對象涉及的數據類型包含:

p: 前一篇提到過, P是協程中的用於運行go代碼的虛擬資源
m: 前一篇提到過, M目前表明系統線程
g: 前一篇提到過, G就是goroutine
mspan: 用於分配對象的區塊
mcentral: 全局的mspan緩存, 一共有67*2=134個
mheap: 用於管理heap的對象, 全局只有一個

 

源代碼分析

go從堆分配對象時會調用newobject函數, 先從這個函數看起:

// implementation of new builtin
// compiler (both frontend and SSA backend) knows the signature
// of this function
func newobject(typ *_type) unsafe.Pointer {
    return mallocgc(typ.size, typ, true)
}

  newobject調用了mallocgc函數:

// Allocate an object of size bytes.
// Small objects are allocated from the per-P cache's free lists.
// Large objects (> 32 kB) are allocated straight from the heap.
func mallocgc(size uintptr, typ *_type, needzero bool) unsafe.Pointer {
    if gcphase == _GCmarktermination {
        throw("mallocgc called with gcphase == _GCmarktermination")
    }
 
    if size == 0 {
        return unsafe.Pointer(&zerobase)
    }
 
    if debug.sbrk != 0 {
        align := uintptr(16)
        if typ != nil {
            align = uintptr(typ.align)
        }
        return persistentalloc(size, align, &memstats.other_sys)
    }
 
    // 判斷是否要輔助GC工做
    // gcBlackenEnabled在GC的標記階段會開啓
    // assistG is the G to charge for this allocation, or nil if
    // GC is not currently active.
    var assistG *g
    if gcBlackenEnabled != 0 {
        // Charge the current user G for this allocation.
        assistG = getg()
        if assistG.m.curg != nil {
            assistG = assistG.m.curg
        }
        // Charge the allocation against the G. We'll account
        // for internal fragmentation at the end of mallocgc.
        assistG.gcAssistBytes -= int64(size)
 
        // 會按分配的大小判斷須要協助GC完成多少工做
        // 具體的算法將在下面講解收集器時說明
        if assistG.gcAssistBytes < 0 {
            // This G is in debt. Assist the GC to correct
            // this before allocating. This must happen
            // before disabling preemption.
            gcAssistAlloc(assistG)
        }
    }
 
    // 增長當前G對應的M的lock計數, 防止這個G被搶佔
    // Set mp.mallocing to keep from being preempted by GC.
    mp := acquirem()
    if mp.mallocing != 0 {
        throw("malloc deadlock")
    }
    if mp.gsignal == getg() {
        throw("malloc during signal")
    }
    mp.mallocing = 1
 
    shouldhelpgc := false
    dataSize := size
    // 獲取當前G對應的M對應的P的本地span緩存(mcache)
    // 由於M在擁有P後會把P的mcache設到M中, 這裏返回的是getg().m.mcache
    c := gomcache()
    var x unsafe.Pointer
    noscan := typ == nil || typ.kind&kindNoPointers != 0
    // 判斷是否小對象, maxSmallSize當前的值是32K
    if size <= maxSmallSize {
        // 若是對象不包含指針, 而且對象的大小小於16 bytes, 能夠作特殊處理
        // 這裏是針對很是小的對象的優化, 由於span的元素最小隻能是8 byte, 若是對象更小那麼不少空間都會被浪費掉
        // 很是小的對象能夠整合在"class 2 noscan"的元素(大小爲16 byte)中
        if noscan && size < maxTinySize {
            // Tiny allocator.
            //
            // Tiny allocator combines several tiny allocation requests
            // into a single memory block. The resulting memory block
            // is freed when all subobjects are unreachable. The subobjects
            // must be noscan (don't have pointers), this ensures that
            // the amount of potentially wasted memory is bounded.
            //
            // Size of the memory block used for combining (maxTinySize) is tunable.
            // Current setting is 16 bytes, which relates to 2x worst case memory
            // wastage (when all but one subobjects are unreachable).
            // 8 bytes would result in no wastage at all, but provides less
            // opportunities for combining.
            // 32 bytes provides more opportunities for combining,
            // but can lead to 4x worst case wastage.
            // The best case winning is 8x regardless of block size.
            //
            // Objects obtained from tiny allocator must not be freed explicitly.
            // So when an object will be freed explicitly, we ensure that
            // its size >= maxTinySize.
            //
            // SetFinalizer has a special case for objects potentially coming
            // from tiny allocator, it such case it allows to set finalizers
            // for an inner byte of a memory block.
            //
            // The main targets of tiny allocator are small strings and
            // standalone escaping variables. On a json benchmark
            // the allocator reduces number of allocations by ~12% and
            // reduces heap size by ~20%.
            off := c.tinyoffset
            // Align tiny pointer for required (conservative) alignment.
            if size&7 == 0 {
                off = round(off, 8)
            } else if size&3 == 0 {
                off = round(off, 4)
            } else if size&1 == 0 {
                off = round(off, 2)
            }
            if off+size <= maxTinySize && c.tiny != 0 {
                // The object fits into existing tiny block.
                x = unsafe.Pointer(c.tiny + off)
                c.tinyoffset = off + size
                c.local_tinyallocs++
                mp.mallocing = 0
                releasem(mp)
                return x
            }
            // Allocate a new maxTinySize block.
            span := c.alloc[tinySpanClass]
            v := nextFreeFast(span)
            if v == 0 {
                v, _, shouldhelpgc = c.nextFree(tinySpanClass)
            }
            x = unsafe.Pointer(v)
            (*[2]uint64)(x)[0] = 0
            (*[2]uint64)(x)[1] = 0
            // See if we need to replace the existing tiny block with the new one
            // based on amount of remaining free space.
            if size < c.tinyoffset || c.tiny == 0 {
                c.tiny = uintptr(x)
                c.tinyoffset = size
            }
            size = maxTinySize
        } else {
            // 不然按普通的小對象分配
            // 首先獲取對象的大小應該使用哪一個span類型
            var sizeclass uint8
            if size <= smallSizeMax-8 {
                sizeclass = size_to_class8[(size+smallSizeDiv-1)/smallSizeDiv]
            } else {
                sizeclass = size_to_class128[(size-smallSizeMax+largeSizeDiv-1)/largeSizeDiv]
            }
            size = uintptr(class_to_size[sizeclass])
            // 等於sizeclass * 2 + (noscan ? 1 : 0)
            spc := makeSpanClass(sizeclass, noscan)
            span := c.alloc[spc]
            // 嘗試快速的從這個span中分配
            v := nextFreeFast(span)
            if v == 0 {
                // 分配失敗, 可能須要從mcentral或者mheap中獲取
                // 若是從mcentral或者mheap獲取了新的span, 則shouldhelpgc會等於true
                // shouldhelpgc會等於true時會在下面判斷是否要觸發GC
                v, span, shouldhelpgc = c.nextFree(spc)
            }
            x = unsafe.Pointer(v)
            if needzero && span.needzero != 0 {
                memclrNoHeapPointers(unsafe.Pointer(v), size)
            }
        }
    } else {
        // 大對象直接從mheap分配, 這裏的s是一個特殊的span, 它的class是0
        var s *mspan
        shouldhelpgc = true
        systemstack(func() {
            s = largeAlloc(size, needzero, noscan)
        })
        s.freeindex = 1
        s.allocCount = 1
        x = unsafe.Pointer(s.base())
        size = s.elemsize
    }
 
    // 設置arena對應的bitmap, 記錄哪些位置包含了指針, GC會使用bitmap掃描全部可到達的對象
    var scanSize uintptr
    if !noscan {
        // If allocating a defer+arg block, now that we've picked a malloc size
        // large enough to hold everything, cut the "asked for" size down to
        // just the defer header, so that the GC bitmap will record the arg block
        // as containing nothing at all (as if it were unused space at the end of
        // a malloc block caused by size rounding).
        // The defer arg areas are scanned as part of scanstack.
        if typ == deferType {
            dataSize = unsafe.Sizeof(_defer{})
        }
        // 這個函數很是的長, 有興趣的能夠看
        // https://github.com/golang/go/blob/go1.9.2/src/runtime/mbitmap.go#L855
        // 雖然代碼很長可是設置的內容跟上面說過的bitmap區域的結構同樣
        // 根據類型信息設置scan bit跟pointer bit, scan bit成立表示應該繼續掃描, pointer bit成立表示該位置是指針
        // 須要注意的地方有
        // - 若是一個類型只有開頭的地方包含指針, 例如[ptr, ptr, large non-pointer data]
        //   那麼後面的部分的scan bit將會爲0, 這樣能夠大幅提高標記的效率
        // - 第二個slot的scan bit用途比較特殊, 它並不用於標記是否繼續scan, 而是標記checkmark
        // 什麼是checkmark
        // - 由於go的並行GC比較複雜, 爲了檢查實現是否正確, go須要在有一個檢查全部應該被標記的對象是否被標記的機制
        //   這個機制就是checkmark, 在開啓checkmark時go會在標記階段的最後中止整個世界而後從新執行一次標記
        //   上面的第二個slot的scan bit就是用於標記對象在checkmark標記中是否被標記的
        // - 有的人可能會發現第二個slot要求對象最少有兩個指針的大小, 那麼只有一個指針的大小的對象呢?
        //   只有一個指針的大小的對象能夠分爲兩種狀況
        //   對象就是指針, 由於大小恰好是1個指針因此並不須要看bitmap區域, 這時第一個slot就是checkmark
        //   對象不是指針, 由於有tiny alloc的機制, 不是指針且只有一個指針大小的對象會分配在兩個指針的span中
        //               這時候也不須要看bitmap區域, 因此和上面同樣第一個slot就是checkmark
        heapBitsSetType(uintptr(x), size, dataSize, typ)
        if dataSize > typ.size {
            // Array allocation. If there are any
            // pointers, GC has to scan to the last
            // element.
            if typ.ptrdata != 0 {
                scanSize = dataSize - typ.size + typ.ptrdata
            }
        } else {
            scanSize = typ.ptrdata
        }
        c.local_scan += scanSize
    }
 
    // 內存屏障, 由於x86和x64的store不會亂序因此這裏只是個針對編譯器的屏障, 彙編中是ret
    // Ensure that the stores above that initialize x to
    // type-safe memory and set the heap bits occur before
    // the caller can make x observable to the garbage
    // collector. Otherwise, on weakly ordered machines,
    // the garbage collector could follow a pointer to x,
    // but see uninitialized memory or stale heap bits.
    publicationBarrier()
 
    // 若是當前在GC中, 須要馬上標記分配後的對象爲"黑色", 防止它被回收
    // Allocate black during GC.
    // All slots hold nil so no scanning is needed.
    // This may be racing with GC so do it atomically if there can be
    // a race marking the bit.
    if gcphase != _GCoff {
        gcmarknewobject(uintptr(x), size, scanSize)
    }
 
    // Race Detector的處理(用於檢測線程衝突問題)
    if raceenabled {
        racemalloc(x, size)
    }
 
    // Memory Sanitizer的處理(用於檢測危險指針等內存問題)
    if msanenabled {
        msanmalloc(x, size)
    }
 
    // 從新容許當前的G被搶佔
    mp.mallocing = 0
    releasem(mp)
 
    // 除錯記錄
    if debug.allocfreetrace != 0 {
        tracealloc(x, size, typ)
    }
 
    // Profiler記錄
    if rate := MemProfileRate; rate > 0 {
        if size < uintptr(rate) && int32(size) < c.next_sample {
            c.next_sample -= int32(size)
        } else {
            mp := acquirem()
            profilealloc(mp, x, size)
            releasem(mp)
        }
    }
 
    // gcAssistBytes減去"實際分配大小 - 要求分配大小", 調整到準確值
    if assistG != nil {
        // Account for internal fragmentation in the assist
        // debt now that we know it.
        assistG.gcAssistBytes -= int64(size - dataSize)
    }
 
    // 若是以前獲取了新的span, 則判斷是否須要後臺啓動GC
    // 這裏的判斷邏輯(gcTrigger)會在下面詳細說明
    if shouldhelpgc {
        if t := (gcTrigger{kind: gcTriggerHeap}); t.test() {
            gcStart(gcBackgroundMode, t)
        }
    }
 
    return x
}

  接下來看看如何從span裏面分配對象, 首先會調用nextFreeFast嘗試快速分配:

// nextFreeFast returns the next free object if one is quickly available.
// Otherwise it returns 0.
func nextFreeFast(s *mspan) gclinkptr {
    // 獲取第一個非0的bit是第幾個bit, 也就是哪一個元素是未分配的
    theBit := sys.Ctz64(s.allocCache) // Is there a free object in the allocCache?
    // 找到未分配的元素
    if theBit < 64 {
        result := s.freeindex + uintptr(theBit)
        // 要求索引值小於元素數量
        if result < s.nelems {
            // 下一個freeindex
            freeidx := result + 1
            // 能夠被64整除時須要特殊處理(參考nextFree)
            if freeidx%64 == 0 && freeidx != s.nelems {
                return 0
            }
            // 更新freeindex和allocCache(高位都是0, 用盡之後會更新)
            s.allocCache >>= uint(theBit + 1)
            s.freeindex = freeidx
            // 返回元素所在的地址
            v := gclinkptr(result*s.elemsize + s.base())
            // 添加已分配的元素計數
            s.allocCount++
            return v
        }
    }
    return 0
}

  若是在freeindex後沒法快速找到未分配的元素, 就須要調用nextFree作出更復雜的處理:

// nextFree returns the next free object from the cached span if one is available.
// Otherwise it refills the cache with a span with an available object and
// returns that object along with a flag indicating that this was a heavy
// weight allocation. If it is a heavy weight allocation the caller must
// determine whether a new GC cycle needs to be started or if the GC is active
// whether this goroutine needs to assist the GC.
func (c *mcache) nextFree(spc spanClass) (v gclinkptr, s *mspan, shouldhelpgc bool) {
    // 找到下一個freeindex和更新allocCache
    s = c.alloc[spc]
    shouldhelpgc = false
    freeIndex := s.nextFreeIndex()
    // 若是span裏面全部元素都已分配, 則須要獲取新的span
    if freeIndex == s.nelems {
        // The span is full.
        if uintptr(s.allocCount) != s.nelems {
            println("runtime: s.allocCount=", s.allocCount, "s.nelems=", s.nelems)
            throw("s.allocCount != s.nelems && freeIndex == s.nelems")
        }
        // 申請新的span
        systemstack(func() {
            c.refill(spc)
        })
        // 獲取申請後的新的span, 並設置須要檢查是否執行GC
        shouldhelpgc = true
        s = c.alloc[spc]
 
        freeIndex = s.nextFreeIndex()
    }
 
    if freeIndex >= s.nelems {
        throw("freeIndex is not valid")
    }
 
    // 返回元素所在的地址
    v = gclinkptr(freeIndex*s.elemsize + s.base())
    // 添加已分配的元素計數
    s.allocCount++
    if uintptr(s.allocCount) > s.nelems {
        println("s.allocCount=", s.allocCount, "s.nelems=", s.nelems)
        throw("s.allocCount > s.nelems")
}

  若是mcache中指定類型的span已滿, 就須要調用refill函數申請新的span:

// Gets a span that has a free object in it and assigns it
// to be the cached span for the given sizeclass. Returns this span.
func (c *mcache) refill(spc spanClass) *mspan {
    _g_ := getg()
 
    // 防止G被搶佔
    _g_.m.locks++
    // Return the current cached span to the central lists.
    s := c.alloc[spc]
 
    // 確保當前的span全部元素都已分配
    if uintptr(s.allocCount) != s.nelems {
        throw("refill of span with free space remaining")
    }
 
    // 設置span的incache屬性, 除非是全局使用的空span(也就是mcache裏面span指針的默認值)
    if s != &emptymspan {
        s.incache = false
    }
 
    // 向mcentral申請一個新的span
    // Get a new cached span from the central lists.
    s = mheap_.central[spc].mcentral.cacheSpan()
    if s == nil {
        throw("out of memory")
    }
 
    if uintptr(s.allocCount) == s.nelems {
        throw("span has no free space")
    }
 
    // 設置新的span到mcache中
    c.alloc[spc] = s
    // 容許G被搶佔
    _g_.m.locks--
    return s
}

  

 向mcentral申請一個新的span會經過cacheSpan函數:
mcentral首先嚐試從內部的鏈表複用原有的span, 若是複用失敗則向mheap申請.

// Allocate a span to use in an MCache.
func (c *mcentral) cacheSpan() *mspan {
    // 讓當前G協助一部分的sweep工做
    // Deduct credit for this span allocation and sweep if necessary.
    spanBytes := uintptr(class_to_allocnpages[c.spanclass.sizeclass()]) * _PageSize
    deductSweepCredit(spanBytes, 0)
 
    // 對mcentral上鎖, 由於可能會有多個M(P)同時訪問
    lock(&c.lock)
    traceDone := false
    if trace.enabled {
        traceGCSweepStart()
    }
    sg := mheap_.sweepgen
retry:
    // mcentral裏面有兩個span的鏈表
    // - nonempty表示肯定該span最少有一個未分配的元素
    // - empty表示不肯定該span最少有一個未分配的元素
    // 這裏優先查找nonempty的鏈表
    // sweepgen每次GC都會增長2
    // - sweepgen == 全局sweepgen, 表示span已經sweep過
    // - sweepgen == 全局sweepgen-1, 表示span正在sweep
    // - sweepgen == 全局sweepgen-2, 表示span等待sweep
    var s *mspan
    for s = c.nonempty.first; s != nil; s = s.next {
        // 若是span等待sweep, 嘗試原子修改sweepgen爲全局sweepgen-1
        if s.sweepgen == sg-2 && atomic.Cas(&s.sweepgen, sg-2, sg-1) {
            // 修改爲功則把span移到empty鏈表, sweep它而後跳到havespan
            c.nonempty.remove(s)
            c.empty.insertBack(s)
            unlock(&c.lock)
            s.sweep(true)
            goto havespan
        }
        // 若是這個span正在被其餘線程sweep, 就跳過
        if s.sweepgen == sg-1 {
            // the span is being swept by background sweeper, skip
            continue
        }
        // span已經sweep過
        // 由於nonempty鏈表中的span肯定最少有一個未分配的元素, 這裏能夠直接使用它
        // we have a nonempty span that does not require sweeping, allocate from it
        c.nonempty.remove(s)
        c.empty.insertBack(s)
        unlock(&c.lock)
        goto havespan
    }
 
    // 查找empty的鏈表
    for s = c.empty.first; s != nil; s = s.next {
        // 若是span等待sweep, 嘗試原子修改sweepgen爲全局sweepgen-1
        if s.sweepgen == sg-2 && atomic.Cas(&s.sweepgen, sg-2, sg-1) {
            // 把span放到empty鏈表的最後
            // we have an empty span that requires sweeping,
            // sweep it and see if we can free some space in it
            c.empty.remove(s)
            // swept spans are at the end of the list
            c.empty.insertBack(s)
            unlock(&c.lock)
            // 嘗試sweep
            s.sweep(true)
            // sweep之後還須要檢測是否有未分配的對象, 若是有則可使用它
            freeIndex := s.nextFreeIndex()
            if freeIndex != s.nelems {
                s.freeindex = freeIndex
                goto havespan
            }
            lock(&c.lock)
            // the span is still empty after sweep
            // it is already in the empty list, so just retry
            goto retry
        }
        // 若是這個span正在被其餘線程sweep, 就跳過
        if s.sweepgen == sg-1 {
            // the span is being swept by background sweeper, skip
            continue
        }
        // 找不到有未分配對象的span
        // already swept empty span,
        // all subsequent ones must also be either swept or in process of sweeping
        break
    }
    if trace.enabled {
        traceGCSweepDone()
        traceDone = true
    }
    unlock(&c.lock)
 
    // 找不到有未分配對象的span, 須要從mheap分配
    // 分配完成後加到empty鏈表中
    // Replenish central list if empty.
    s = c.grow()
    if s == nil {
        return nil
    }
    lock(&c.lock)
    c.empty.insertBack(s)
    unlock(&c.lock)
 
    // At this point s is a non-empty span, queued at the end of the empty list,
    // c is unlocked.
havespan:
    if trace.enabled && !traceDone {
        traceGCSweepDone()
    }
    // 統計span中未分配的元素數量, 加到mcentral.nmalloc中
    // 統計span中未分配的元素總大小, 加到memstats.heap_live中
    cap := int32((s.npages << _PageShift) / s.elemsize)
    n := cap - int32(s.allocCount)
    if n == 0 || s.freeindex == s.nelems || uintptr(s.allocCount) == s.nelems {
        throw("span has no free objects")
    }
    // Assume all objects from this span will be allocated in the
    // mcache. If it gets uncached, we'll adjust this.
    atomic.Xadd64(&c.nmalloc, int64(n))
    usedBytes := uintptr(s.allocCount) * s.elemsize
    atomic.Xadd64(&memstats.heap_live, int64(spanBytes)-int64(usedBytes))
    // 跟蹤處理
    if trace.enabled {
        // heap_live changed.
        traceHeapAlloc()
    }
    // 若是當前在GC中, 由於heap_live改變了, 從新調整G輔助標記工做的值
    // 詳細請參考下面對revise函數的解析
    if gcBlackenEnabled != 0 {
        // heap_live changed.
        gcController.revise()
    }
    // 設置span的incache屬性, 表示span正在mcache中
    s.incache = true
    // 根據freeindex更新allocCache
    freeByteBase := s.freeindex &^ (64 - 1)
    whichByte := freeByteBase / 8
    // Init alloc bits cache.
    s.refillAllocCache(whichByte)
 
    // Adjust the allocCache so that s.freeindex corresponds to the low bit in
    // s.allocCache.
    s.allocCache >>= s.freeindex % 64
 
    return s
}

  mcentral向mheap申請一個新的span會使用grow函數:

// grow allocates a new empty span from the heap and initializes it for c's size class.
func (c *mcentral) grow() *mspan {
    // 根據mcentral的類型計算須要申請的span的大小(除以8K = 有多少頁)和能夠保存多少個元素
    npages := uintptr(class_to_allocnpages[c.spanclass.sizeclass()])
    size := uintptr(class_to_size[c.spanclass.sizeclass()])
    n := (npages << _PageShift) / size
 
    // 向mheap申請一個新的span, 以頁(8K)爲單位
    s := mheap_.alloc(npages, c.spanclass, false, true)
    if s == nil {
        return nil
    }
 
    p := s.base()
    s.limit = p + size*n
 
    // 分配並初始化span的allocBits和gcmarkBits
    heapBitsForSpan(s.base()).initSpan(s)
    return s
}

  mheap分配span的函數是alloc:

func (h *mheap) alloc(npage uintptr, spanclass spanClass, large bool, needzero bool) *mspan {
    // 在g0的棧空間中調用alloc_m函數
    // 關於systemstack的說明請看前一篇文章
    // Don't do any operations that lock the heap on the G stack.
    // It might trigger stack growth, and the stack growth code needs
    // to be able to allocate heap.
    var s *mspan
    systemstack(func() {
        s = h.alloc_m(npage, spanclass, large)
    })
 
    if s != nil {
        if needzero && s.needzero != 0 {
            memclrNoHeapPointers(unsafe.Pointer(s.base()), s.npages<<_PageShift)
        }
        s.needzero = 0
    }
    return s
}

  alloc函數會在g0的棧空間中調用alloc_m函數:

// Allocate a new span of npage pages from the heap for GC'd memory
// and record its size class in the HeapMap and HeapMapCache.
func (h *mheap) alloc_m(npage uintptr, spanclass spanClass, large bool) *mspan {
    _g_ := getg()
    if _g_ != _g_.m.g0 {
        throw("_mheap_alloc not on g0 stack")
    }
    // 對mheap上鎖, 這裏的鎖是全局鎖
    lock(&h.lock)
 
    // 爲了防止heap增速太快, 在分配n頁以前要先sweep和回收n頁
    // 會先枚舉busy列表而後再枚舉busyLarge列表進行sweep, 具體參考reclaim和reclaimList函數
    // To prevent excessive heap growth, before allocating n pages
    // we need to sweep and reclaim at least n pages.
    if h.sweepdone == 0 {
        // TODO(austin): This tends to sweep a large number of
        // spans in order to find a few completely free spans
        // (for example, in the garbage benchmark, this sweeps
        // ~30x the number of pages its trying to allocate).
        // If GC kept a bit for whether there were any marks
        // in a span, we could release these free spans
        // at the end of GC and eliminate this entirely.
        if trace.enabled {
            traceGCSweepStart()
        }
        h.reclaim(npage)
        if trace.enabled {
            traceGCSweepDone()
        }
    }
 
    // 把mcache中的本地統計數據加到全局
    // transfer stats from cache to global
    memstats.heap_scan += uint64(_g_.m.mcache.local_scan)
    _g_.m.mcache.local_scan = 0
    memstats.tinyallocs += uint64(_g_.m.mcache.local_tinyallocs)
    _g_.m.mcache.local_tinyallocs = 0
 
    // 調用allocSpanLocked分配span, allocSpanLocked函數要求當前已經對mheap上鎖
    s := h.allocSpanLocked(npage, &memstats.heap_inuse)
    if s != nil {
        // Record span info, because gc needs to be
        // able to map interior pointer to containing span.
        // 設置span的sweepgen = 全局sweepgen
        atomic.Store(&s.sweepgen, h.sweepgen)
        // 放到全局span列表中, 這裏的sweepSpans的長度是2
        // sweepSpans[h.sweepgen/2%2]保存當前正在使用的span列表
        // sweepSpans[1-h.sweepgen/2%2]保存等待sweep的span列表
        // 由於每次gcsweepgen都會加2, 每次gc這兩個列表都會交換
        h.sweepSpans[h.sweepgen/2%2].push(s) // Add to swept in-use list.
        // 初始化span成員
        s.state = _MSpanInUse
        s.allocCount = 0
        s.spanclass = spanclass
        if sizeclass := spanclass.sizeclass(); sizeclass == 0 {
            s.elemsize = s.npages << _PageShift
            s.divShift = 0
            s.divMul = 0
            s.divShift2 = 0
            s.baseMask = 0
        } else {
            s.elemsize = uintptr(class_to_size[sizeclass])
            m := &class_to_divmagic[sizeclass]
            s.divShift = m.shift
            s.divMul = m.mul
            s.divShift2 = m.shift2
            s.baseMask = m.baseMask
        }
 
        // update stats, sweep lists
        h.pagesInUse += uint64(npage)
        // 上面grow函數會傳入true, 也就是經過grow調用到這裏large會等於true
        // 添加已分配的span到busy列表, 若是頁數超過_MaxMHeapList(128頁=8K*128=1M)則放到busylarge列表
        if large {
            memstats.heap_objects++
            mheap_.largealloc += uint64(s.elemsize)
            mheap_.nlargealloc++
            atomic.Xadd64(&memstats.heap_live, int64(npage<<_PageShift))
            // Swept spans are at the end of lists.
            if s.npages < uintptr(len(h.busy)) {
                h.busy[s.npages].insertBack(s)
            } else {
                h.busylarge.insertBack(s)
            }
        }
    }
    // 若是當前在GC中, 由於heap_live改變了, 從新調整G輔助標記工做的值
    // 詳細請參考下面對revise函數的解析
    // heap_scan and heap_live were updated.
    if gcBlackenEnabled != 0 {
        gcController.revise()
    }
 
    // 跟蹤處理
    if trace.enabled {
        traceHeapAlloc()
    }
 
    // h.spans is accessed concurrently without synchronization
    // from other threads. Hence, there must be a store/store
    // barrier here to ensure the writes to h.spans above happen
    // before the caller can publish a pointer p to an object
    // allocated from s. As soon as this happens, the garbage
    // collector running on another processor could read p and
    // look up s in h.spans. The unlock acts as the barrier to
    // order these writes. On the read side, the data dependency
    // between p and the index in h.spans orders the reads.
    unlock(&h.lock)
    return s
}

  繼續查看allocSpanLocked函數:

// Allocates a span of the given size.  h must be locked.
// The returned span has been removed from the
// free list, but its state is still MSpanFree.
func (h *mheap) allocSpanLocked(npage uintptr, stat *uint64) *mspan {
    var list *mSpanList
    var s *mspan
 
    // 嘗試在mheap中的自由列表分配
    // 頁數小於_MaxMHeapList(128頁=1M)的自由span都會在free列表中
    // 頁數大於_MaxMHeapList的自由span都會在freelarge列表中
    // Try in fixed-size lists up to max.
    for i := int(npage); i < len(h.free); i++ {
        list = &h.free[i]
        if !list.isEmpty() {
            s = list.first
            list.remove(s)
            goto HaveSpan
        }
    }
    // free列表找不到則查找freelarge列表
    // 查找不到就向arena區域申請一個新的span加到freelarge中, 而後再查找freelarge列表
    // Best fit in list of large spans.
    s = h.allocLarge(npage) // allocLarge removed s from h.freelarge for us
    if s == nil {
        if !h.grow(npage) {
            return nil
        }
        s = h.allocLarge(npage)
        if s == nil {
            return nil
        }
    }
 
HaveSpan:
    // Mark span in use.
    if s.state != _MSpanFree {
        throw("MHeap_AllocLocked - MSpan not free")
    }
    if s.npages < npage {
        throw("MHeap_AllocLocked - bad npages")
    }
    // 若是span有已釋放(解除虛擬內存和物理內存關係)的頁, 提醒這些頁會被使用而後更新統計數據
    if s.npreleased > 0 {
        sysUsed(unsafe.Pointer(s.base()), s.npages<<_PageShift)
        memstats.heap_released -= uint64(s.npreleased << _PageShift)
        s.npreleased = 0
    }
 
    // 若是獲取到的span頁數比要求的頁數多
    // 分割剩餘的頁數到另外一個span而且放到自由列表中
    if s.npages > npage {
        // Trim extra and put it back in the heap.
        t := (*mspan)(h.spanalloc.alloc())
        t.init(s.base()+npage<<_PageShift, s.npages-npage)
        s.npages = npage
        p := (t.base() - h.arena_start) >> _PageShift
        if p > 0 {
            h.spans[p-1] = s
        }
        h.spans[p] = t
        h.spans[p+t.npages-1] = t
        t.needzero = s.needzero
        s.state = _MSpanManual // prevent coalescing with s
        t.state = _MSpanManual
        h.freeSpanLocked(t, false, false, s.unusedsince)
        s.state = _MSpanFree
    }
    s.unusedsince = 0
 
    // 設置spans區域, 哪些地址對應哪一個mspan對象
    p := (s.base() - h.arena_start) >> _PageShift
    for n := uintptr(0); n < npage; n++ {
        h.spans[p+n] = s
    }
 
    // 更新統計數據
    *stat += uint64(npage << _PageShift)
    memstats.heap_idle -= uint64(npage << _PageShift)
 
    //println("spanalloc", hex(s.start<<_PageShift))
    if s.inList() {
        throw("still in list")
    }
    return s
}

  繼續查看allocLarge函數:

// allocLarge allocates a span of at least npage pages from the treap of large spans.
// Returns nil if no such span currently exists.
func (h *mheap) allocLarge(npage uintptr) *mspan {
    // Search treap for smallest span with >= npage pages.
    return h.freelarge.remove(npage)
}

  freelarge的類型是mTreap, 調用remove函數會在樹裏面搜索一個至少npage且在樹中的最小的span返回:

// remove searches for, finds, removes from the treap, and returns the smallest
// span that can hold npages. If no span has at least npages return nil.
// This is slightly more complicated than a simple binary tree search
// since if an exact match is not found the next larger node is
// returned.
// If the last node inspected > npagesKey not holding
// a left node (a smaller npages) is the "best fit" node.
func (root *mTreap) remove(npages uintptr) *mspan {
    t := root.treap
    for t != nil {
        if t.spanKey == nil {
            throw("treap node with nil spanKey found")
        }
        if t.npagesKey < npages {
            t = t.right
        } else if t.left != nil && t.left.npagesKey >= npages {
            t = t.left
        } else {
            result := t.spanKey
            root.removeNode(t)
            return result
        }
    }
    return nil
}

  向arena區域申請新span的函數是mheap類的grow函數:

// Try to add at least npage pages of memory to the heap,
// returning whether it worked.
//
// h must be locked.
func (h *mheap) grow(npage uintptr) bool {
    // Ask for a big chunk, to reduce the number of mappings
    // the operating system needs to track; also amortizes
    // the overhead of an operating system mapping.
    // Allocate a multiple of 64kB.
    npage = round(npage, (64<<10)/_PageSize)
    ask := npage << _PageShift
    if ask < _HeapAllocChunk {
        ask = _HeapAllocChunk
    }
 
    // 調用mheap.sysAlloc函數申請
    v := h.sysAlloc(ask)
    if v == nil {
        if ask > npage<<_PageShift {
            ask = npage << _PageShift
            v = h.sysAlloc(ask)
        }
        if v == nil {
            print("runtime: out of memory: cannot allocate ", ask, "-byte block (", memstats.heap_sys, " in use)\n")
            return false
        }
    }
 
    // 建立一個新的span並加到自由列表中
    // Create a fake "in use" span and free it, so that the
    // right coalescing happens.
    s := (*mspan)(h.spanalloc.alloc())
    s.init(uintptr(v), ask>>_PageShift)
    p := (s.base() - h.arena_start) >> _PageShift
    for i := p; i < p+s.npages; i++ {
        h.spans[i] = s
    }
    atomic.Store(&s.sweepgen, h.sweepgen)
    s.state = _MSpanInUse
    h.pagesInUse += uint64(s.npages)
    h.freeSpanLocked(s, false, true, 0)
    return true
}

  繼續查看mheap的sysAlloc函數:

// sysAlloc allocates the next n bytes from the heap arena. The
// returned pointer is always _PageSize aligned and between
// h.arena_start and h.arena_end. sysAlloc returns nil on failure.
// There is no corresponding free function.
func (h *mheap) sysAlloc(n uintptr) unsafe.Pointer {
    // strandLimit is the maximum number of bytes to strand from
    // the current arena block. If we would need to strand more
    // than this, we fall back to sysAlloc'ing just enough for
    // this allocation.
    const strandLimit = 16 << 20
 
    // 若是arena區域當前已提交的區域不足, 則調用sysReserve預留更多的空間, 而後更新arena_end
    // sysReserve在linux上調用的是mmap函數
    // mmap(v, n, _PROT_NONE, _MAP_ANON|_MAP_PRIVATE, -1, 0)
    if n > h.arena_end-h.arena_alloc {
        // If we haven't grown the arena to _MaxMem yet, try
        // to reserve some more address space.
        p_size := round(n+_PageSize, 256<<20)
        new_end := h.arena_end + p_size // Careful: can overflow
        if h.arena_end <= new_end && new_end-h.arena_start-1 <= _MaxMem {
            // TODO: It would be bad if part of the arena
            // is reserved and part is not.
            var reserved bool
            p := uintptr(sysReserve(unsafe.Pointer(h.arena_end), p_size, &reserved))
            if p == 0 {
                // TODO: Try smaller reservation
                // growths in case we're in a crowded
                // 32-bit address space.
                goto reservationFailed
            }
            // p can be just about anywhere in the address
            // space, including before arena_end.
            if p == h.arena_end {
                // The new block is contiguous with
                // the current block. Extend the
                // current arena block.
                h.arena_end = new_end
                h.arena_reserved = reserved
            } else if h.arena_start <= p && p+p_size-h.arena_start-1 <= _MaxMem && h.arena_end-h.arena_alloc < strandLimit {
                // We were able to reserve more memory
                // within the arena space, but it's
                // not contiguous with our previous
                // reservation. It could be before or
                // after our current arena_used.
                //
                // Keep everything page-aligned.
                // Our pages are bigger than hardware pages.
                h.arena_end = p + p_size
                p = round(p, _PageSize)
                h.arena_alloc = p
                h.arena_reserved = reserved
            } else {
                // We got a mapping, but either
                //
                // 1) It's not in the arena, so we
                // can't use it. (This should never
                // happen on 32-bit.)
                //
                // 2) We would need to discard too
                // much of our current arena block to
                // use it.
                //
                // We haven't added this allocation to
                // the stats, so subtract it from a
                // fake stat (but avoid underflow).
                //
                // We'll fall back to a small sysAlloc.
                stat := uint64(p_size)
                sysFree(unsafe.Pointer(p), p_size, &stat)
            }
        }
    }
 
    // 預留的空間足夠時只須要增長arena_alloc
    if n <= h.arena_end-h.arena_alloc {
        // Keep taking from our reservation.
        p := h.arena_alloc
        sysMap(unsafe.Pointer(p), n, h.arena_reserved, &memstats.heap_sys)
        h.arena_alloc += n
        if h.arena_alloc > h.arena_used {
            h.setArenaUsed(h.arena_alloc, true)
        }
 
        if p&(_PageSize-1) != 0 {
            throw("misrounded allocation in MHeap_SysAlloc")
        }
        return unsafe.Pointer(p)
    }
 
    // 預留空間失敗後的處理
reservationFailed:
    // If using 64-bit, our reservation is all we have.
    if sys.PtrSize != 4 {
        return nil
    }
 
    // On 32-bit, once the reservation is gone we can
    // try to get memory at a location chosen by the OS.
    p_size := round(n, _PageSize) + _PageSize
    p := uintptr(sysAlloc(p_size, &memstats.heap_sys))
    if p == 0 {
        return nil
    }
 
    if p < h.arena_start || p+p_size-h.arena_start > _MaxMem {
        // This shouldn't be possible because _MaxMem is the
        // whole address space on 32-bit.
        top := uint64(h.arena_start) + _MaxMem
        print("runtime: memory allocated by OS (", hex(p), ") not in usable range [", hex(h.arena_start), ",", hex(top), ")\n")
        sysFree(unsafe.Pointer(p), p_size, &memstats.heap_sys)
        return nil
    }
 
    p += -p & (_PageSize - 1)
    if p+n > h.arena_used {
        h.setArenaUsed(p+n, true)
    }
 
    if p&(_PageSize-1) != 0 {
        throw("misrounded allocation in MHeap_SysAlloc")
    }
    return unsafe.Pointer(p)
}

  以上就是分配對象的完整流程了, 接下來分析GC標記和回收對象的處理.

回收對象的處理

回收對象的流程

GO的GC是並行GC, 也就是GC的大部分處理和普通的go代碼是同時運行的, 這讓GO的GC流程比較複雜.
首先GC有四個階段, 它們分別是:

  • Sweep Termination: 對未清掃的span進行清掃, 只有上一輪的GC的清掃工做完成才能夠開始新一輪的GC
  • Mark: 掃描全部根對象, 和根對象能夠到達的全部對象, 標記它們不被回收
  • Mark Termination: 完成標記工做, 從新掃描部分根對象(要求STW)
  • Sweep: 按標記結果清掃span

下圖是比較完整的GC流程, 並按顏色對這四個階段進行了分類:

在GC過程當中會有兩種後臺任務(G), 一種是標記用的後臺任務, 一種是清掃用的後臺任務.
標記用的後臺任務會在須要時啓動, 能夠同時工做的後臺任務數量大約是P的數量的25%, 也就是go所講的讓25%的cpu用在GC上的根據.
清掃用的後臺任務在程序啓動時會啓動一個, 進入清掃階段時喚醒.

目前整個GC流程會進行兩次STW(Stop The World), 第一次是Mark階段的開始, 第二次是Mark Termination階段.
第一次STW會準備根對象的掃描, 啓動寫屏障(Write Barrier)和輔助GC(mutator assist).
第二次STW會從新掃描部分根對象, 禁用寫屏障(Write Barrier)和輔助GC(mutator assist).
須要注意的是, 不是全部根對象的掃描都須要STW, 例如掃描棧上的對象只須要中止擁有該棧的G.
從go 1.9開始, 寫屏障的實現使用了Hybrid Write Barrier, 大幅減小了第二次STW的時間.

 

GC的觸發條件

GC在知足必定條件後會被觸發, 觸發條件有如下幾種:

  • gcTriggerAlways: 強制觸發GC
  • gcTriggerHeap: 當前分配的內存達到必定值就觸發GC
  • gcTriggerTime: 當必定時間沒有執行過GC就觸發GC
  • gcTriggerCycle: 要求啓動新一輪的GC, 已啓動則跳過, 手動觸發GC的runtime.GC()會使用這個條件

觸發條件的判斷在gctrigger的test函數.
其中gcTriggerHeap和gcTriggerTime這兩個條件是天然觸發的, gcTriggerHeap的判斷代碼以下:

 

return memstats.heap_live >= memstats.gc_trigger

  

heap_live的增長在上面對分配器的代碼分析中能夠看到, 當值達到gc_trigger就會觸發GC, 那麼gc_trigger是如何決定的?
gc_trigger的計算在gcSetTriggerRatio函數中, 公式是:

trigger = uint64(float64(memstats.heap_marked) * (1 + triggerRatio))

  

當前標記存活的大小乘以1+係數triggerRatio, 就是下次出發GC須要的分配量.
triggerRatio在每次GC後都會調整, 計算triggerRatio的函數是encCycle, 公式是:

const triggerGain = 0.5
// 目標Heap增加率, 默認是1.0
goalGrowthRatio := float64(gcpercent) / 100
// 實際Heap增加率, 等於總大小/存活大小-1
actualGrowthRatio := float64(memstats.heap_live)/float64(memstats.heap_marked) - 1
// GC標記階段的使用時間(由於endCycle是在Mark Termination階段調用的)
assistDuration := nanotime() - c.markStartTime
// GC標記階段的CPU佔用率, 目標值是0.25
utilization := gcGoalUtilization
if assistDuration > 0 {
    // assistTime是G輔助GC標記對象所使用的時間合計
    // (nanosecnds spent in mutator assists during this cycle)
    // 額外的CPU佔用率 = 輔助GC標記對象的總時間 / (GC標記使用時間 * P的數量)
    utilization += float64(c.assistTime) / float64(assistDuration*int64(gomaxprocs))
}
// 觸發係數偏移值 = 目標增加率 - 原觸發係數 - CPU佔用率 / 目標CPU佔用率 * (實際增加率 - 原觸發係數)
// 參數的分析:
// 實際增加率越大, 觸發係數偏移值越小, 小於0時下次觸發GC會提前
// CPU佔用率越大, 觸發係數偏移值越小, 小於0時下次觸發GC會提前
// 原觸發係數越大, 觸發係數偏移值越小, 小於0時下次觸發GC會提前
triggerError := goalGrowthRatio - memstats.triggerRatio - utilization/gcGoalUtilization*(actualGrowthRatio-memstats.triggerRatio)
// 根據偏移值調整觸發係數, 每次只調整偏移值的一半(漸進式調整)
triggerRatio := memstats.triggerRatio + triggerGain*triggerError

  公式中的"目標Heap增加率"能夠經過設置環境變量"GOGC"調整, 默認值是100, 增長它的值能夠減小GC的觸發.
設置"GOGC=off"能夠完全關掉GC.

gcTriggerTime的判斷代碼以下:

lastgc := int64(atomic.Load64(&memstats.last_gc_nanotime))
return lastgc != 0 && t.now-lastgc > forcegcperiod

  orcegcperiod的定義是2分鐘, 也就是2分鐘內沒有執行過GC就會強制觸發.

三色的定義(黑, 灰, 白)

我看過的對三色GC的"三色"這個概念解釋的最好的文章就是這一篇了, 強烈建議先看這一篇中的講解.
"三色"的概念能夠簡單的理解爲:

  • 黑色: 對象在此次GC中已標記, 且這個對象包含的子對象也已標記
  • 灰色: 對象在此次GC中已標記, 但這個對象包含的子對象未標記
  • 白色: 對象在此次GC中未標記

在go內部對象並無保存顏色的屬性, 三色只是對它們的狀態的描述,
白色的對象在它所在的span的gcmarkBits中對應的bit爲0,
灰色的對象在它所在的span的gcmarkBits中對應的bit爲1, 而且對象在標記隊列中,
黑色的對象在它所在的span的gcmarkBits中對應的bit爲1, 而且對象已經從標記隊列中取出並處理.
gc完成後, gcmarkBits會移動到allocBits而後從新分配一個所有爲0的bitmap, 這樣黑色的對象就變爲了白色.

 

寫屏障(Write Barrier)

由於go支持並行GC, GC的掃描和go代碼能夠同時運行, 這樣帶來的問題是GC掃描的過程當中go代碼有可能改變了對象的依賴樹,
例如開始掃描時發現根對象A和B, B擁有C的指針, GC先掃描A, 而後B把C的指針交給A, GC再掃描B, 這時C就不會被掃描到.
爲了不這個問題, go在GC的標記階段會啓用寫屏障(Write Barrier).

啓用了寫屏障(Write Barrier)後, 當B把C的指針交給A時, GC會認爲在這一輪的掃描中C的指針是存活的,
即便A可能會在稍後丟掉C, 那麼C就在下一輪迴收.
寫屏障只針對指針啓用, 並且只在GC的標記階段啓用, 平時會直接把值寫入到目標地址.

go在1.9開始啓用了混合寫屏障(Hybrid Write Barrier), 僞代碼以下:

 

writePointer(slot, ptr):
    shade(*slot)
    if any stack is grey:
        shade(ptr)
    *slot = ptr

  

混合寫屏障會同時標記指針寫入目標的"原指針"和「新指針".

標記原指針的緣由是, 其餘運行中的線程有可能會同時把這個指針的值複製到寄存器或者棧上的本地變量,
由於複製指針到寄存器或者棧上的本地變量不會通過寫屏障, 因此有可能會致使指針不被標記, 試想下面的狀況:

[go] b = obj
[go] oldx = nil
[gc] scan oldx...
[go] oldx = b.x // 複製b.x到本地變量, 不通過寫屏障
[go] b.x = ptr // 寫屏障應該標記b.x的原值
[gc] scan b...
若是寫屏障不標記原值, 那麼oldx就不會被掃描到.

  標記新指針的緣由是, 其餘運行中的線程有可能會轉移指針的位置, 試想下面的狀況:

[go] a = ptr
[go] b = obj
[gc] scan b...
[go] b.x = a // 寫屏障應該標記b.x的新值
[go] a = nil
[gc] scan a...
若是寫屏障不標記新值, 那麼ptr就不會被掃描到.

  混合寫屏障可讓GC在並行標記結束後不須要從新掃描各個G的堆棧, 能夠減小Mark Termination中的STW時間.
除了寫屏障外, 在GC的過程當中全部新分配的對象都會馬上變爲黑色, 在上面的mallocgc函數中能夠看到.

輔助GC(mutator assist)

爲了防止heap增速太快, 在GC執行的過程當中若是同時運行的G分配了內存, 那麼這個G會被要求輔助GC作一部分的工做.
在GC的過程當中同時運行的G稱爲"mutator", "mutator assist"機制就是G輔助GC作一部分工做的機制.

輔助GC作的工做有兩種類型, 一種是標記(Mark), 另外一種是清掃(Sweep).
輔助標記的觸發能夠查看上面的mallocgc函數, 觸發時G會幫助掃描"工做量"個對象, 工做量的計算公式是:

debtBytes * assistWorkPerByte

  意思是分配的大小乘以係數assistWorkPerByte, assistWorkPerByte的計算在函數revise中, 公式是:

// 等待掃描的對象數量 = 未掃描的對象數量 - 已掃描的對象數量
scanWorkExpected := int64(memstats.heap_scan) - c.scanWork
if scanWorkExpected < 1000 {
    scanWorkExpected = 1000
}
// 距離觸發GC的Heap大小 = 期待觸發GC的Heap大小 - 當前的Heap大小
// 注意next_gc的計算跟gc_trigger不同, next_gc等於heap_marked * (1 + gcpercent / 100)
heapDistance := int64(memstats.next_gc) - int64(atomic.Load64(&memstats.heap_live))
if heapDistance <= 0 {
    heapDistance = 1
}
// 每分配1 byte須要輔助掃描的對象數量 = 等待掃描的對象數量 / 距離觸發GC的Heap大小
c.assistWorkPerByte = float64(scanWorkExpected) / float64(heapDistance)
c.assistBytesPerWork = float64(heapDistance) / float64(scanWorkExpected)

  和輔助標記不同的是, 輔助清掃申請新span時纔會檢查, 而輔助標記是每次分配對象時都會檢查.
輔助清掃的觸發能夠看上面的cacheSpan函數, 觸發時G會幫助回收"工做量"頁的對象, 工做量的計算公式是:

spanBytes * sweepPagesPerByte // 不徹底相同, 具體看deductSweepCredit函數

  意思是分配的大小乘以係數sweepPagesPerByte, sweepPagesPerByte的計算在函數gcSetTriggerRatio中, 公式是:

// 當前的Heap大小
heapLiveBasis := atomic.Load64(&memstats.heap_live)
// 距離觸發GC的Heap大小 = 下次觸發GC的Heap大小 - 當前的Heap大小
heapDistance := int64(trigger) - int64(heapLiveBasis)
heapDistance -= 1024 * 1024
if heapDistance < _PageSize {
    heapDistance = _PageSize
}
// 已清掃的頁數
pagesSwept := atomic.Load64(&mheap_.pagesSwept)
// 未清掃的頁數 = 使用中的頁數 - 已清掃的頁數
sweepDistancePages := int64(mheap_.pagesInUse) - int64(pagesSwept)
if sweepDistancePages <= 0 {
    mheap_.sweepPagesPerByte = 0
} else {
    // 每分配1 byte(的span)須要輔助清掃的頁數 = 未清掃的頁數 / 距離觸發GC的Heap大小
    mheap_.sweepPagesPerByte = float64(sweepDistancePages) / float64(heapDistance)
}

  

根對象

在GC的標記階段首先須要標記的就是"根對象", 從根對象開始可到達的全部對象都會被認爲是存活的.
根對象包含了全局變量, 各個G的棧上的變量等, GC會先掃描根對象而後再掃描根對象可到達的全部對象.
掃描根對象包含了一系列的工做, 它們定義在[https://github.com/golang/go/blob/go1.9.2/src/runtime/mgcmark.go#L54]函數:

    • Fixed Roots: 特殊的掃描工做
      • fixedRootFinalizers: 掃描析構器隊列
      • fixedRootFreeGStacks: 釋放已停止的G的棧
    • Flush Cache Roots: 釋放mcache中的全部span, 要求STW
    • Data Roots: 掃描可讀寫的全局變量
    • BSS Roots: 掃描只讀的全局變量
    • Span Roots: 掃描各個span中特殊對象(析構器列表)
    • Stack Roots: 掃描各個G的棧
標記階段(Mark)會作其中的"Fixed Roots", "Data Roots", "BSS Roots", "Span Roots", "Stack Roots".
完成標記階段(Mark Termination)會作其中的"Fixed Roots", "Flush Cache Roots".

標記隊列

GC的標記階段會使用"標記隊列"來肯定全部可從根對象到達的對象都已標記, 上面提到的"灰色"的對象就是在標記隊列中的對象.
舉例來講, 若是當前有[A, B, C]這三個根對象, 那麼掃描根對象時就會把它們放到標記隊列:

work queue: [A, B, C]

後臺標記任務從標記隊列中取出A, 若是A引用了D, 則把D放入標記隊列:

work queue: [B, C, D]

後臺標記任務從標記隊列取出B, 若是B也引用了D, 這時由於D在gcmarkBits中對應的bit已是1因此會跳過:

work queue: [C, D]

若是並行運行的go代碼分配了一個對象E, 對象E會被馬上標記, 但不會進入標記隊列(由於肯定E沒有引用其餘對象).
而後並行運行的go代碼把對象F設置給對象E的成員, 寫屏障會標記對象F而後把對象F加到運行隊列:

work queue: [C, D, F]

後臺標記任務從標記隊列取出C, 若是C沒有引用其餘對象, 則不須要處理:

work queue: [D, F]

後臺標記任務從標記隊列取出D, 若是D引用了X, 則把X放入標記隊列:

work queue: [F, X]

後臺標記任務從標記隊列取出F, 若是F沒有引用其餘對象, 則不須要處理.
後臺標記任務從標記隊列取出X, 若是X沒有引用其餘對象, 則不須要處理.
最後標記隊列爲空, 標記完成, 存活的對象有[A, B, C, D, E, F, X].

實際的情況會比上面介紹的情況稍微複雜一點.
標記隊列會分爲全局標記隊列和各個P的本地標記隊列, 這點和協程中的運行隊列類似.
而且標記隊列爲空之後, 還須要中止整個世界並禁止寫屏障, 而後再次檢查是否爲空.

源代碼分析

go觸發gc會從gcStart函數開始:

// gcStart transitions the GC from _GCoff to _GCmark (if
// !mode.stwMark) or _GCmarktermination (if mode.stwMark) by
// performing sweep termination and GC initialization.
//
// This may return without performing this transition in some cases,
// such as when called on a system stack or with locks held.
func gcStart(mode gcMode, trigger gcTrigger) {
    // 判斷當前G是否可搶佔, 不可搶佔時不觸發GC
    // Since this is called from malloc and malloc is called in
    // the guts of a number of libraries that might be holding
    // locks, don't attempt to start GC in non-preemptible or
    // potentially unstable situations.
    mp := acquirem()
    if gp := getg(); gp == mp.g0 || mp.locks > 1 || mp.preemptoff != "" {
        releasem(mp)
        return
    }
    releasem(mp)
    mp = nil
 
    // 並行清掃上一輪GC未清掃的span
    // Pick up the remaining unswept/not being swept spans concurrently
    //
    // This shouldn't happen if we're being invoked in background
    // mode since proportional sweep should have just finished
    // sweeping everything, but rounding errors, etc, may leave a
    // few spans unswept. In forced mode, this is necessary since
    // GC can be forced at any point in the sweeping cycle.
    //
    // We check the transition condition continuously here in case
    // this G gets delayed in to the next GC cycle.
    for trigger.test() && gosweepone() != ^uintptr(0) {
        sweep.nbgsweep++
    }
 
    // 上鎖, 而後從新檢查gcTrigger的條件是否成立, 不成立時不觸發GC
    // Perform GC initialization and the sweep termination
    // transition.
    semacquire(&work.startSema)
    // Re-check transition condition under transition lock.
    if !trigger.test() {
        semrelease(&work.startSema)
        return
    }
 
    // 記錄是否強制觸發, gcTriggerCycle是runtime.GC用的
    // For stats, check if this GC was forced by the user.
    work.userForced = trigger.kind == gcTriggerAlways || trigger.kind == gcTriggerCycle
 
    // 判斷是否指定了禁止並行GC的參數
    // In gcstoptheworld debug mode, upgrade the mode accordingly.
    // We do this after re-checking the transition condition so
    // that multiple goroutines that detect the heap trigger don't
    // start multiple STW GCs.
    if mode == gcBackgroundMode {
        if debug.gcstoptheworld == 1 {
            mode = gcForceMode
        } else if debug.gcstoptheworld == 2 {
            mode = gcForceBlockMode
        }
    }
 
    // Ok, we're doing it!  Stop everybody else
    semacquire(&worldsema)
 
    // 跟蹤處理
    if trace.enabled {
        traceGCStart()
    }
 
    // 啓動後臺掃描任務(G)
    if mode == gcBackgroundMode {
        gcBgMarkStartWorkers()
    }
 
    // 重置標記相關的狀態
    gcResetMarkState()
 
    // 重置參數
    work.stwprocs, work.maxprocs = gcprocs(), gomaxprocs
    work.heap0 = atomic.Load64(&memstats.heap_live)
    work.pauseNS = 0
    work.mode = mode
 
    // 記錄開始時間
    now := nanotime()
    work.tSweepTerm = now
    work.pauseStart = now
    
    // 中止全部運行中的G, 並禁止它們運行
    systemstack(stopTheWorldWithSema)
    
    // !!!!!!!!!!!!!!!!
    // 世界已中止(STW)...
    // !!!!!!!!!!!!!!!!
    
    // 清掃上一輪GC未清掃的span, 確保上一輪GC已完成
    // Finish sweep before we start concurrent scan.
    systemstack(func() {
        finishsweep_m()
    })
    // 清掃sched.sudogcache和sched.deferpool
    // clearpools before we start the GC. If we wait they memory will not be
    // reclaimed until the next GC cycle.
    clearpools()
 
    // 增長GC計數
    work.cycles++
    
    // 判斷是否並行GC模式
    if mode == gcBackgroundMode { // Do as much work concurrently as possible
        // 標記新一輪GC已開始
        gcController.startCycle()
        work.heapGoal = memstats.next_gc
 
        // 設置全局變量中的GC狀態爲_GCmark
        // 而後啓用寫屏障
        // Enter concurrent mark phase and enable
        // write barriers.
        //
        // Because the world is stopped, all Ps will
        // observe that write barriers are enabled by
        // the time we start the world and begin
        // scanning.
        //
        // Write barriers must be enabled before assists are
        // enabled because they must be enabled before
        // any non-leaf heap objects are marked. Since
        // allocations are blocked until assists can
        // happen, we want enable assists as early as
        // possible.
        setGCPhase(_GCmark)
 
        // 重置後臺標記任務的計數
        gcBgMarkPrepare() // Must happen before assist enable.
 
        // 計算掃描根對象的任務數量
        gcMarkRootPrepare()
 
        // 標記全部tiny alloc等待合併的對象
        // Mark all active tinyalloc blocks. Since we're
        // allocating from these, they need to be black like
        // other allocations. The alternative is to blacken
        // the tiny block on every allocation from it, which
        // would slow down the tiny allocator.
        gcMarkTinyAllocs()
 
        // 啓用輔助GC
        // At this point all Ps have enabled the write
        // barrier, thus maintaining the no white to
        // black invariant. Enable mutator assists to
        // put back-pressure on fast allocating
        // mutators.
        atomic.Store(&gcBlackenEnabled, 1)
 
        // 記錄標記開始的時間
        // Assists and workers can start the moment we start
        // the world.
        gcController.markStartTime = now
 
        // 從新啓動世界
        // 前面建立的後臺標記任務會開始工做, 全部後臺標記任務都完成工做後, 進入完成標記階段
        // Concurrent mark.
        systemstack(startTheWorldWithSema)
        
        // !!!!!!!!!!!!!!!
        // 世界已從新啓動...
        // !!!!!!!!!!!!!!!
        
        // 記錄中止了多久, 和標記階段開始的時間
        now = nanotime()
        work.pauseNS += now - work.pauseStart
        work.tMark = now
    } else {
        // 不是並行GC模式
        // 記錄完成標記階段開始的時間
        t := nanotime()
        work.tMark, work.tMarkTerm = t, t
        work.heapGoal = work.heap0
 
        // 跳過標記階段, 執行完成標記階段
        // 全部標記工做都會在世界已中止的狀態執行
        // (標記階段會設置work.markrootDone=true, 若是跳過則它的值是false, 完成標記階段會執行全部工做)
        // 完成標記階段會從新啓動世界
        // Perform mark termination. This will restart the world.
        gcMarkTermination(memstats.triggerRatio)
    }
 
    semrelease(&work.startSema)
}

  

接下來一個個分析gcStart調用的函數, 建議配合上面的"回收對象的流程"中的圖理解.

函數gcBgMarkStartWorkers用於啓動後臺標記任務, 先分別對每一個P啓動一個:

// gcBgMarkStartWorkers prepares background mark worker goroutines.
// These goroutines will not run until the mark phase, but they must
// be started while the work is not stopped and from a regular G
// stack. The caller must hold worldsema.
func gcBgMarkStartWorkers() {
    // Background marking is performed by per-P G's. Ensure that
    // each P has a background GC G.
    for _, p := range &allp {
        if p == nil || p.status == _Pdead {
            break
        }
        // 若是已啓動則不重複啓動
        if p.gcBgMarkWorker == 0 {
            go gcBgMarkWorker(p)
            // 啓動後等待該任務通知信號量bgMarkReady再繼續
            notetsleepg(&work.bgMarkReady, -1)
            noteclear(&work.bgMarkReady)
        }
    }
}

 

這裏雖然爲每一個P啓動了一個後臺標記任務, 可是能夠同時工做的只有25%, 這個邏輯在協程M獲取G時調用的findRunnableGCWorker中:

 

// findRunnableGCWorker returns the background mark worker for _p_ if it
// should be run. This must only be called when gcBlackenEnabled != 0.
func (c *gcControllerState) findRunnableGCWorker(_p_ *p) *g {
    if gcBlackenEnabled == 0 {
        throw("gcControllerState.findRunnable: blackening not enabled")
    }
    if _p_.gcBgMarkWorker == 0 {
        // The mark worker associated with this P is blocked
        // performing a mark transition. We can't run it
        // because it may be on some other run or wait queue.
        return nil
    }
 
    if !gcMarkWorkAvailable(_p_) {
        // No work to be done right now. This can happen at
        // the end of the mark phase when there are still
        // assists tapering off. Don't bother running a worker
        // now because it'll just return immediately.
        return nil
    }
 
    // Atomically decrement the value; return true if the result is still >= 0, otherwise false
    decIfPositive := func(ptr *int64) bool {
        if *ptr > 0 {
            if atomic.Xaddint64(ptr, -1) >= 0 {
                return true
            }
            // We lost a race
            atomic.Xaddint64(ptr, +1)
        }
        return false
    }
 
    // Decrement dedicatedMarkWorkersNeeded; on success this worker runs in Dedicated mode.
    // dedicatedMarkWorkersNeeded is 25% of the number of Ps with the fraction dropped;
    // see the startCycle function for details.
    if decIfPositive(&c.dedicatedMarkWorkersNeeded) {
        // This P is now dedicated to marking until the end of
        // the concurrent mark phase.
        _p_.gcMarkWorkerMode = gcMarkWorkerDedicatedMode
    } else {
        // Decrement fractionalMarkWorkersNeeded; on success this worker runs in Fractional mode.
        // If the 25% above has a fractional part (it doesn't divide evenly),
        // fractionalMarkWorkersNeeded is 1, otherwise 0; see the startCycle function.
        // For example, 4 Ps run 1 Dedicated worker; 5 Ps run 1 Dedicated and 1 Fractional worker.
        if !decIfPositive(&c.fractionalMarkWorkersNeeded) {
            // No more workers are need right now.
            return nil
        }
 
        // Check whether GC CPU utilization is at or over the goal based on the mark time already used; if it would overshoot, don't start this worker
        // This P has picked the token for the fractional worker.
        // Is the GC currently under or at the utilization goal?
        // If so, do more work.
        //
        // We used to check whether doing one time slice of work
        // would remain under the utilization goal, but that has the
        // effect of delaying work until the mutator has run for
        // enough time slices to pay for the work. During those time
        // slices, write barriers are enabled, so the mutator is running slower.
        // Now instead we do the work whenever we're under or at the
        // utilization work and pay for it by letting the mutator run later.
        // This doesn't change the overall utilization averages, but it
        // front loads the GC work so that the GC finishes earlier and
        // write barriers can be turned off sooner, effectively giving
        // the mutator a faster machine.
        //
        // The old, slower behavior can be restored by setting
        //  gcForcePreemptNS = forcePreemptNS.
        const gcForcePreemptNS = 0
 
        // TODO(austin): We could fast path this and basically
        // eliminate contention on c.fractionalMarkWorkersNeeded by
        // precomputing the minimum time at which it's worth
        // next scheduling the fractional worker. Then Ps
        // don't have to fight in the window where we've
        // passed that deadline and no one has started the
        // worker yet.
        //
        // TODO(austin): Shorter preemption interval for mark
        // worker to improve fairness and give this
        // finer-grained control over schedule?
        now := nanotime() - gcController.markStartTime
        then := now + gcForcePreemptNS
        timeUsed := c.fractionalMarkTime + gcForcePreemptNS
        if then > 0 && float64(timeUsed)/float64(then) > c.fractionalUtilizationGoal {
            // Nope, we'd overshoot the utilization goal
            atomic.Xaddint64(&c.fractionalMarkWorkersNeeded, +1)
            return nil
        }
        _p_.gcMarkWorkerMode = gcMarkWorkerFractionalMode
    }
 
    // Schedule the background mark worker to run
    // Run the background mark worker
    gp := _p_.gcBgMarkWorker.ptr()
    casgstatus(gp, _Gwaiting, _Grunnable)
    if trace.enabled {
        traceGoUnpark(gp, 0)
    }
    return gp
}
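
The decIfPositive closure above is a small compare-and-race pattern worth isolating. A rough standalone equivalent using sync/atomic (the runtime version reads *ptr directly, without an atomic load) might look like this:

package main

import (
    "fmt"
    "sync/atomic"
)

// decIfPositive sketch: atomically take one token if the counter is positive.
func decIfPositive(ptr *int64) bool {
    if atomic.LoadInt64(ptr) > 0 {
        if atomic.AddInt64(ptr, -1) >= 0 {
            return true
        }
        // We lost a race with another decrementer; undo our decrement.
        atomic.AddInt64(ptr, +1)
    }
    return false
}

func main() {
    tokens := int64(1)
    fmt.Println(decIfPositive(&tokens), decIfPositive(&tokens)) // true false
}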

  The gcResetMarkState function resets mark-related state:

// gcResetMarkState resets global state prior to marking (concurrent
// or STW) and resets the stack scan state of all Gs.
//
// This is safe to do without the world stopped because any Gs created
// during or after this will start out in the reset state.
func gcResetMarkState() {
    // This may be called during a concurrent phase, so make sure
    // allgs doesn't change.
    lock(&allglock)
    for _, gp := range allgs {
        gp.gcscandone = false  // set to true in gcphasework
        gp.gcscanvalid = false // stack has not been scanned
        gp.gcAssistBytes = 0
    }
    unlock(&allglock)
 
    work.bytesMarked = 0
    work.initialHeapLive = atomic.Load64(&memstats.heap_live)
    work.markrootDone = false
}

  The stopTheWorldWithSema function stops the entire world; it must run on g0:

// stopTheWorldWithSema is the core implementation of stopTheWorld.
// The caller is responsible for acquiring worldsema and disabling
// preemption first and then should stopTheWorldWithSema on the system
// stack:
//
//  semacquire(&worldsema, 0)
//  m.preemptoff = "reason"
//  systemstack(stopTheWorldWithSema)
//
// When finished, the caller must either call startTheWorld or undo
// these three operations separately:
//
//  m.preemptoff = ""
//  systemstack(startTheWorldWithSema)
//  semrelease(&worldsema)
//
// It is allowed to acquire worldsema once and then execute multiple
// startTheWorldWithSema/stopTheWorldWithSema pairs.
// Other P's are able to execute between successive calls to
// startTheWorldWithSema and stopTheWorldWithSema.
// Holding worldsema causes any other goroutines invoking
// stopTheWorld to block.
func stopTheWorldWithSema() {
    _g_ := getg()
 
    // If we hold a lock, then we won't be able to stop another M
    // that is blocked trying to acquire the lock.
    if _g_.m.locks > 0 {
        throw("stopTheWorld: holding locks")
    }
 
    lock(&sched.lock)
    
    // The number of Ps that need to stop
    sched.stopwait = gomaxprocs
    
    // Set the gcwaiting flag; schedulers that see it will park
    atomic.Store(&sched.gcwaiting, 1)
    
    // Preempt all running Gs
    preemptall()
    
    // stop current P
    _g_.m.p.ptr().status = _Pgcstop // Pgcstop is only diagnostic.
    
    // Decrement the number of Ps to stop (the current P counts as one)
    sched.stopwait--
    
    // try to retake all P's in Psyscall status,
    // so they cannot rejoin scheduling
    for i := 0; i < int(gomaxprocs); i++ {
        p := allp[i]
        s := p.status
        if s == _Psyscall && atomic.Cas(&p.status, s, _Pgcstop) {
            if trace.enabled {
                traceGoSysBlock(p)
                traceProcStop(p)
            }
            p.syscalltick++
            sched.stopwait--
        }
    }
    
    // stop idle P's so they cannot rejoin scheduling
    for {
        p := pidleget()
        if p == nil {
            break
        }
        p.status = _Pgcstop
        sched.stopwait--
    }
    wait := sched.stopwait > 0
    unlock(&sched.lock)
 
    // wait for remaining P's to stop voluntarily
    if wait {
        for {
            // wait for 100us, then try to re-preempt in case of any races
            if notetsleep(&sched.stopnote, 100*1000) {
                noteclear(&sched.stopnote)
                break
            }
            preemptall()
        }
    }
 
    // sanity checks
    bad := ""
    if sched.stopwait != 0 {
        bad = "stopTheWorld: not stopped (stopwait != 0)"
    } else {
        for i := 0; i < int(gomaxprocs); i++ {
            p := allp[i]
            if p.status != _Pgcstop {
                bad = "stopTheWorld: not stopped (status != _Pgcstop)"
            }
        }
    }
    if atomic.Load(&freezing) != 0 {
        // Some other thread is panicking. This can cause the
        // sanity checks above to fail if the panic happens in
        // the signal handler on a stopped thread. Either way,
        // we should halt this thread.
        lock(&deadlock)
        lock(&deadlock)
    }
    if bad != "" {
        throw(bad)
    }
    
    // At this point every running G has become runnable, and no P can be acquired by an M.
    // In other words, all Go code (other than the current goroutine) has stopped, and no new Go code can run.
}

  The finishsweep_m function sweeps any spans the previous GC cycle left unswept, making sure the previous cycle has finished:

// finishsweep_m ensures that all spans are swept.
//
// The world must be stopped. This ensures there are no sweeps in
// progress.
//
//go:nowritebarrier
func finishsweep_m() {
    // sweepone takes one unswept span and sweeps it;
    // it is analyzed in detail in the sweep phase section below.
    // Sweeping must be complete before marking commences, so
    // sweep any unswept spans. If this is a concurrent GC, there
    // shouldn't be any spans left to sweep, so this should finish
    // instantly. If GC was forced before the concurrent sweep
    // finished, there may be spans to sweep.
    for sweepone() != ^uintptr(0) {
        sweep.npausesweep++
    }
 
    // Once all spans are swept, start a new markbit epoch.
    // This function is the key to how a span's gcmarkBits and allocBits are
    // allocated and reused; the flow is:
    // - the span allocates gcmarkBits and allocBits
    // - the span finishes sweeping
    //   - the old allocBits are no longer used
    //   - gcmarkBits become the new allocBits
    //   - fresh gcmarkBits are allocated
    // - a new markbit epoch begins
    // - the span finishes sweeping, as above
    // - a new markbit epoch begins
    //   - bitmaps from two epochs ago are no longer used and can be reused
    nextMarkBitArenaEpoch()
}
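
The bitmap rotation this relies on can be illustrated with a toy model (field names simplified; this is not the real mspan layout):

package main

import "fmt"

// Toy model of a span's two bitmaps and what finishing a sweep does to them.
type spanBits struct {
    allocBits  []byte // which elements are currently allocated
    gcmarkBits []byte // which elements survived the last mark
}

func (s *spanBits) finishSweep() {
    // After sweep, the mark bits become the new allocation bits,
    // and zeroed mark bits are installed for the next cycle.
    s.allocBits = s.gcmarkBits
    s.gcmarkBits = make([]byte, len(s.allocBits))
}

func main() {
    s := &spanBits{allocBits: []byte{0xff}, gcmarkBits: []byte{0x0f}}
    s.finishSweep()
    fmt.Printf("alloc=%x mark=%x\n", s.allocBits, s.gcmarkBits) // alloc=0f mark=00
}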

  The clearpools function clears sched.sudogcache and sched.deferpool so that their memory can be reclaimed:

func clearpools() {
    // clear sync.Pools
    if poolcleanup != nil {
        poolcleanup()
    }
 
    // Clear central sudog cache.
    // Leave per-P caches alone, they have strictly bounded size.
    // Disconnect cached list before dropping it on the floor,
    // so that a dangling ref to one entry does not pin all of them.
    lock(&sched.sudoglock)
    var sg, sgnext *sudog
    for sg = sched.sudogcache; sg != nil; sg = sgnext {
        sgnext = sg.next
        sg.next = nil
    }
    sched.sudogcache = nil
    unlock(&sched.sudoglock)
 
    // Clear central defer pools.
    // Leave per-P pools alone, they have strictly bounded size.
    lock(&sched.deferlock)
    for i := range sched.deferpool {
        // disconnect cached list before dropping it on the floor,
        // so that a dangling ref to one entry does not pin all of them.
        var d, dlink *_defer
        for d = sched.deferpool[i]; d != nil; d = dlink {
            dlink = d.link
            d.link = nil
        }
        sched.deferpool[i] = nil
    }
    unlock(&sched.deferlock)
}

  startCycle marks the beginning of a new GC cycle:

// startCycle resets the GC controller's state and computes estimates
// for a new GC cycle. The caller must hold worldsema.
func (c *gcControllerState) startCycle() {
    c.scanWork = 0
    c.bgScanCredit = 0
    c.assistTime = 0
    c.dedicatedMarkTime = 0
    c.fractionalMarkTime = 0
    c.idleMarkTime = 0
 
    // If gc_trigger is small, fake heap_marked to avoid a wrong adjustment of triggerRatio later
    // If this is the first GC cycle or we're operating on a very
    // small heap, fake heap_marked so it looks like gc_trigger is
    // the appropriate growth from heap_marked, even though the
    // real heap_marked may not have a meaningful value (on the
    // first cycle) or may be much smaller (resulting in a large
    // error response).
    if memstats.gc_trigger <= heapminimum {
        memstats.heap_marked = uint64(float64(memstats.gc_trigger) / (1 + memstats.triggerRatio))
    }
 
    // Recompute next_gc; note it is computed differently from gc_trigger
    // Re-compute the heap goal for this cycle in case something
    // changed. This is the same calculation we use elsewhere.
    memstats.next_gc = memstats.heap_marked + memstats.heap_marked*uint64(gcpercent)/100
    if gcpercent < 0 {
        memstats.next_gc = ^uint64(0)
    }
 
    // Ensure there is at least 1MB between next_gc and heap_live
    // Ensure that the heap goal is at least a little larger than
    // the current live heap size. This may not be the case if GC
    // start is delayed or if the allocation that pushed heap_live
    // over gc_trigger is large or if the trigger is really close to
    // GOGC. Assist is proportional to this distance, so enforce a
    // minimum distance, even if it means going over the GOGC goal
    // by a tiny bit.
    if memstats.next_gc < memstats.heap_live+1024*1024 {
        memstats.next_gc = memstats.heap_live + 1024*1024
    }
 
    // Compute how many background mark workers can run at the same time.
    // dedicatedMarkWorkersNeeded is 25% of the number of Ps with the fraction dropped;
    // if it divides evenly, fractionalMarkWorkersNeeded is 0, otherwise 1.
    // totalUtilizationGoal is the target number of Ps used by GC (e.g. 1.25 Ps out of 5).
    // fractionalUtilizationGoal is the target share of the Fractional-mode worker (e.g. 0.25 Ps out of 5).
    // Compute the total mark utilization goal and divide it among
    // dedicated and fractional workers.
    totalUtilizationGoal := float64(gomaxprocs) * gcGoalUtilization
    c.dedicatedMarkWorkersNeeded = int64(totalUtilizationGoal)
    c.fractionalUtilizationGoal = totalUtilizationGoal - float64(c.dedicatedMarkWorkersNeeded)
    if c.fractionalUtilizationGoal > 0 {
        c.fractionalMarkWorkersNeeded = 1
    } else {
        c.fractionalMarkWorkersNeeded = 0
    }
 
    // Reset each P's GC assist time statistics
    // Clear per-P state
    for _, p := range &allp {
        if p == nil {
            break
        }
        p.gcAssistTime = 0
    }
 
    // Compute the parameters for GC assists;
    // see the earlier analysis of the assistWorkPerByte formula.
    // Compute initial values for controls that are updated
    // throughout the cycle.
    c.revise()
 
    if debug.gcpacertrace > 0 {
        print("pacer: assist ratio=", c.assistWorkPerByte,
            " (scan ", memstats.heap_scan>>20, " MB in ",
            work.initialHeapLive>>20, "->",
            memstats.next_gc>>20, " MB)",
            " workers=", c.dedicatedMarkWorkersNeeded,
            "+", c.fractionalMarkWorkersNeeded, "\n")
    }
}
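
A worked example of the heap goal formula above (next_gc = heap_marked + heap_marked*GOGC/100, floored at heap_live+1MB; heapGoal is a made-up helper name):

package main

import "fmt"

func heapGoal(heapMarked, heapLive uint64, gcpercent int64) uint64 {
    if gcpercent < 0 {
        return ^uint64(0) // GC effectively disabled
    }
    goal := heapMarked + heapMarked*uint64(gcpercent)/100
    if min := heapLive + 1<<20; goal < min {
        goal = min // keep at least 1MB of headroom for assists
    }
    return goal
}

func main() {
    // 4MB marked last cycle, 5MB live now, GOGC=100 => goal is 8MB.
    fmt.Println(heapGoal(4<<20, 5<<20, 100))
}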

  The setGCPhase function updates the global variable holding the current GC phase and the globals that control the write barrier:

//go:nosplit
func setGCPhase(x uint32) {
    atomic.Store(&gcphase, x)
    writeBarrier.needed = gcphase == _GCmark || gcphase == _GCmarktermination
    writeBarrier.enabled = writeBarrier.needed || writeBarrier.cgo
}

  The gcBgMarkPrepare function resets the background mark worker counters:

// gcBgMarkPrepare sets up state for background marking.
// Mutator assists must not yet be enabled.
func gcBgMarkPrepare() {
    // Background marking will stop when the work queues are empty
    // and there are no more workers (note that, since this is
    // concurrent, this may be a transient state, but mark
    // termination will clean it up). Between background workers
    // and assists, we don't really know how many workers there
    // will be, so we pretend to have an arbitrarily large number
    // of workers, almost all of which are "waiting". While a
    // worker is working it decrements nwait. If nproc == nwait,
    // there are no workers.
    work.nproc = ^uint32(0)
    work.nwait = ^uint32(0)
}
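
The "pretend there are arbitrarily many workers, almost all waiting" trick means nproc == nwait reads as "no worker is currently running". A tiny sketch of that bookkeeping:

package main

import (
    "fmt"
    "sync/atomic"
)

func main() {
    nproc := ^uint32(0) // pretend arbitrarily many workers
    nwait := ^uint32(0) // ...all of which are waiting
    fmt.Println("idle:", nproc == atomic.LoadUint32(&nwait)) // true

    atomic.AddUint32(&nwait, ^uint32(0)) // a worker starts (decrement by 1)
    fmt.Println("idle:", nproc == atomic.LoadUint32(&nwait)) // false

    atomic.AddUint32(&nwait, 1) // the worker finishes
    fmt.Println("idle:", nproc == atomic.LoadUint32(&nwait)) // true
}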

  The gcMarkRootPrepare function computes the number of root-scanning jobs:

// gcMarkRootPrepare queues root scanning jobs (stacks, globals, and
// some miscellany) and initializes scanning-related state.
//
// The caller must have call gcCopySpans().
//
// The world must be stopped.
//
//go:nowritebarrier
func gcMarkRootPrepare() {
    // Jobs that flush all spans in the mcaches run only during mark termination
    if gcphase == _GCmarktermination {
        work.nFlushCacheRoots = int(gomaxprocs)
    } else {
        work.nFlushCacheRoots = 0
    }
 
    // Helper that computes the number of blocks; rootBlockBytes is 256KB
    // Compute how many data and BSS root blocks there are.
    nBlocks := func(bytes uintptr) int {
        return int((bytes + rootBlockBytes - 1) / rootBlockBytes)
    }
 
    work.nDataRoots = 0
    work.nBSSRoots = 0
 
    // data and bss are scanned only once per GC cycle.
    // In concurrent GC the background mark workers scan them and mark termination does not;
    // in non-concurrent GC they are scanned during mark termination.
    // Only scan globals once per cycle; preferably concurrently.
    if !work.markrootDone {
        // Compute the number of jobs for scanning initialized writable globals (the data sections)
        for _, datap := range activeModules() {
            nDataRoots := nBlocks(datap.edata - datap.data)
            if nDataRoots > work.nDataRoots {
                work.nDataRoots = nDataRoots
            }
        }
 
        // Compute the number of jobs for scanning zero-initialized globals (the bss sections)
        for _, datap := range activeModules() {
            nBSSRoots := nBlocks(datap.ebss - datap.bss)
            if nBSSRoots > work.nBSSRoots {
                work.nBSSRoots = nBSSRoots
            }
        }
    }
 
    // Finalizers in spans and each G's stack are likewise scanned only once per cycle,
    // following the same rule as above.
    if !work.markrootDone {
        // Compute the number of jobs for scanning finalizers in spans
        // On the first markroot, we need to scan span roots.
        // In concurrent GC, this happens during concurrent
        // mark and we depend on addfinalizer to ensure the
        // above invariants for objects that get finalizers
        // after concurrent mark. In STW GC, this will happen
        // during mark termination.
        //
        // We're only interested in scanning the in-use spans,
        // which will all be swept at this point. More spans
        // may be added to this list during concurrent GC, but
        // we only care about spans that were allocated before
        // this mark phase.
        work.nSpanRoots = mheap_.sweepSpans[mheap_.sweepgen/2%2].numBlocks()
 
        // Compute the number of jobs for scanning each G's stack
        // On the first markroot, we need to scan all Gs. Gs
        // may be created after this point, but it's okay that
        // we ignore them because they begin life without any
        // roots, so there's nothing to scan, and any roots
        // they create during the concurrent phase will be
        // scanned during mark termination. During mark
        // termination, allglen isn't changing, so we'll scan
        // all Gs.
        work.nStackRoots = int(atomic.Loaduintptr(&allglen))
    } else {
        // We've already scanned span roots and kept the scan
        // up-to-date during concurrent mark.
        work.nSpanRoots = 0
 
        // The hybrid barrier ensures that stacks can't
        // contain pointers to unmarked objects, so on the
        // second markroot, there's no need to scan stacks.
        work.nStackRoots = 0
 
        if debug.gcrescanstacks > 0 {
            // Scan stacks anyway for debugging.
            work.nStackRoots = int(atomic.Loaduintptr(&allglen))
        }
    }
 
    // Compute the total number of jobs.
    // Background mark workers atomically increment markrootNext to decide which job to take;
    // using a counter as a lock-free queue is a neat trick, even if the Go engineers
    // aren't fond of it (see the analysis of markroot below).
    work.markrootNext = 0
    work.markrootJobs = uint32(fixedRootCount + work.nFlushCacheRoots + work.nDataRoots + work.nBSSRoots + work.nSpanRoots + work.nStackRoots)
}
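
The counter-as-queue idea is simple enough to demonstrate standalone: each worker atomically claims the next job index until the job count is exhausted:

package main

import (
    "fmt"
    "sync/atomic"
)

func main() {
    var markrootNext uint32 // next job to claim
    const markrootJobs = 5  // total jobs, fixed before marking starts
    for {
        job := atomic.AddUint32(&markrootNext, 1) - 1
        if job >= markrootJobs {
            break // all jobs claimed
        }
        fmt.Println("run root job", job)
    }
}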

  The gcMarkTinyAllocs function marks all tiny alloc blocks that are waiting for more objects to be merged in:

// gcMarkTinyAllocs greys all active tiny alloc blocks.
//
// The world must be stopped.
func gcMarkTinyAllocs() {
    for _, p := range &allp {
        if p == nil || p.status == _Pdead {
            break
        }
        c := p.mcache
        if c == nil || c.tiny == 0 {
            continue
        }
        // Mark the tiny block in each P's mcache;
        // as seen in the mallocgc function above, tiny is the block currently accepting merged objects.
        _, hbits, span, objIndex := heapBitsForObject(c.tiny, 0, 0)
        gcw := &p.gcw
        // Mark the object alive and add it to the mark queue (the object turns grey)
        greyobject(c.tiny, 0, 0, hbits, span, gcw, objIndex)
        // gcBlackenPromptly indicates whether per-P work caches are disabled; if so, flush the mark work to the global queue
        if gcBlackenPromptly {
            gcw.dispose()
        }
    }
}

  The startTheWorldWithSema function restarts the world:

func startTheWorldWithSema() {
    _g_ := getg()
    
    _g_.m.locks++        // disable preemption because it can be holding p in a local var
    
    // Poll pending network events (fd readable/writable or error) and add the corresponding Gs to the run queue
    gp := netpoll(false) // non-blocking
    injectglist(gp)
    
    // Decide whether a gc helper needs to be started
    add := needaddgcproc()
    lock(&sched.lock)
    
    // If a gomaxprocs change was requested, adjust the number of Ps;
    // procresize returns a linked list of Ps that have runnable work.
    procs := gomaxprocs
    if newprocs != 0 {
        procs = newprocs
        newprocs = 0
    }
    p1 := procresize(procs)
    
    // Clear the gcwaiting flag
    sched.gcwaiting = 0
    
    // Wake sysmon if it is waiting
    if sched.sysmonwait != 0 {
        sched.sysmonwait = 0
        notewakeup(&sched.sysmonnote)
    }
    unlock(&sched.lock)
    
    // Wake up the Ps that have runnable work
    for p1 != nil {
        p := p1
        p1 = p1.link.ptr()
        if p.m != 0 {
            mp := p.m.ptr()
            p.m = 0
            if mp.nextp != 0 {
                throw("startTheWorld: inconsistent mp->nextp")
            }
            mp.nextp.set(p)
            notewakeup(&mp.park)
        } else {
            // Start M to run P.  Do not start another M below.
            newm(nil, p)
            add = false
        }
    }
    
    // If there are idle Ps and no spinning Ms, wake or create an M
    // Wakeup an additional proc in case we have excessive runnable goroutines
    // in local queues or in the global queue. If we don't, the proc will park itself.
    // If we have lots of excessive work, resetspinning will unpark additional procs as necessary.
    if atomic.Load(&sched.npidle) != 0 && atomic.Load(&sched.nmspinning) == 0 {
        wakep()
    }
    
    // Start the gc helper if needed
    if add {
        // If GC could have used another helper proc, start one now,
        // in the hope that it will be available next time.
        // It would have been even better to start it before the collection,
        // but doing so requires allocating memory, so it's tricky to
        // coordinate. This lazy approach works out in practice:
        // we don't mind if the first couple gc rounds don't have quite
        // the maximum number of procs.
        newm(mhelpgc, nil)
    }
    
    // Re-allow preemption of this G
    _g_.m.locks--
    
    // If the current G requested preemption, restore the request
    if _g_.m.locks == 0 && _g_.preempt { // restore the preemption request in case we've cleared it in newstack
        _g_.stackguard0 = stackPreempt
    }
}

After the world restarts, each M resumes scheduling. The scheduler prefers the findRunnableGCWorker function mentioned above, so roughly 25% of the Ps end up running background mark workers.
The background mark worker runs the gcBgMarkWorker function:

func gcBgMarkWorker(_p_ *p) {
    gp := getg()
    
    // Struct used to reattach to a P after parking
    type parkInfo struct {
        m      muintptr // Release this m on park.
        attach puintptr // If non-nil, attach to this p on park.
    }
    // We pass park to a gopark unlock function, so it can't be on
    // the stack (see gopark). Prevent deadlock from recursively
    // starting GC by disabling preemption.
    gp.m.preemptoff = "GC worker init"
    park := new(parkInfo)
    gp.m.preemptoff = ""
    
    // Record the current M and disable preemption
    park.m.set(acquirem())
    // Record the P this worker needs to attach to
    park.attach.set(_p_)
    
    // Notify gcBgMarkStartWorkers that it can continue
    // Inform gcBgMarkStartWorkers that this worker is ready.
    // After this point, the background mark worker is scheduled
    // cooperatively by gcController.findRunnable. Hence, it must
    // never be preempted, as this would put it into _Grunnable
    // and put it on a run queue. Instead, when the preempt flag
    // is set, this puts itself into _Gwaiting to be woken up by
    // gcController.findRunnable at the appropriate time.
    notewakeup(&work.bgMarkReady)
    
    for {
        // Put the current G to sleep
        // Go to sleep until woken by gcController.findRunnable.
        // We can't releasem yet since even the call to gopark
        // may be preempted.
        gopark(func(g *g, parkp unsafe.Pointer) bool {
            park := (*parkInfo)(parkp)
            
            // Re-allow preemption
            // The worker G is no longer running, so it's
            // now safe to allow preemption.
            releasem(park.m.ptr())
            
            // Attach to the P:
            // store this G in the P's gcBgMarkWorker field so the next findRunnableGCWorker can use it.
            // If attaching fails, don't park.
            // If the worker isn't attached to its P,
            // attach now. During initialization and after
            // a phase change, the worker may have been
            // running on a different P. As soon as we
            // attach, the owner P may schedule the
            // worker, so this must be done after the G is
            // stopped.
            if park.attach != 0 {
                p := park.attach.ptr()
                park.attach.set(nil)
                // cas the worker because we may be
                // racing with a new worker starting
                // on this P.
                if !p.gcBgMarkWorker.cas(0, guintptr(unsafe.Pointer(g))) {
                    // The P got a new worker.
                    // Exit this worker.
                    return false
                }
            }
            return true
        }, unsafe.Pointer(park), "GC worker (idle)", traceEvGoBlock, 0)
        
        // Check that the P's gcBgMarkWorker is still this G; if not, end this worker
        // Loop until the P dies and disassociates this
        // worker (the P may later be reused, in which case
        // it will get a new worker) or we failed to associate.
        if _p_.gcBgMarkWorker.ptr() != gp {
            break
        }
        
        // Disable preemption of this G
        // Disable preemption so we can use the gcw. If the
        // scheduler wants to preempt us, we'll stop draining,
        // dispose the gcw, and then preempt.
        park.m.set(acquirem())
        
        if gcBlackenEnabled == 0 {
            throw("gcBgMarkWorker: blackening not enabled")
        }
        
        // Record the start time
        startTime := nanotime()
        
        decnwait := atomic.Xadd(&work.nwait, -1)
        if decnwait == work.nproc {
            println("runtime: work.nwait=", decnwait, "work.nproc=", work.nproc)
            throw("work.nwait was > work.nproc")
        }
        
        // Switch to g0
        systemstack(func() {
            // Set the G's status to waiting so its stack can be scanned (two mark workers may scan each other's stacks)
            // Mark our goroutine preemptible so its stack
            // can be scanned. This lets two mark workers
            // scan each other (otherwise, they would
            // deadlock). We must not modify anything on
            // the G stack. However, stack shrinking is
            // disabled for mark workers, so it is safe to
            // read from the G stack.
            casgstatus(gp, _Grunning, _Gwaiting)
            
            // Dispatch on the worker's mode
            switch _p_.gcMarkWorkerMode {
            default:
                throw("gcBgMarkWorker: unexpected gcMarkWorkerMode")
            case gcMarkWorkerDedicatedMode:
                // In this mode the P is dedicated to marking.
                // Mark until preempted, crediting the background scan work so assists shrink and waiting Gs wake up.
                gcDrain(&_p_.gcw, gcDrainUntilPreempt|gcDrainFlushBgCredit)
                // When preempted, kick every G in the local run queue to the global run queue
                if gp.preempt {
                    // We were preempted. This is
                    // a useful signal to kick
                    // everything out of the run
                    // queue so it can run
                    // somewhere else.
                    lock(&sched.lock)
                    for {
                        gp, _ := runqget(_p_)
                        if gp == nil {
                            break
                        }
                        globrunqput(gp)
                    }
                    unlock(&sched.lock)
                }
                // Keep marking until no more work is available, with the same background credit accounting
                // Go back to draining, this time
                // without preemption.
                gcDrain(&_p_.gcw, gcDrainNoBlock|gcDrainFlushBgCredit)
            case gcMarkWorkerFractionalMode:
                // In this mode the P marks part of the time.
                // Mark until preempted, with background credit accounting.
                gcDrain(&_p_.gcw, gcDrainUntilPreempt|gcDrainFlushBgCredit)
            case gcMarkWorkerIdleMode:
                // In this mode the P marks only while idle.
                // Mark until preempted or until a certain amount of work is done, with background credit accounting.
                gcDrain(&_p_.gcw, gcDrainIdle|gcDrainUntilPreempt|gcDrainFlushBgCredit)
            }
            
            // Restore the G's status to running
            casgstatus(gp, _Gwaiting, _Grunning)
        })
        
        // If per-P mark work caches are disabled, flush this one to the global queue
        // If we are nearing the end of mark, dispose
        // of the cache promptly. We must do this
        // before signaling that we're no longer
        // working so that other workers can't observe
        // no workers and no work while we have this
        // cached, and before we compute done.
        if gcBlackenPromptly {
            _p_.gcw.dispose()
        }
        
        // Accumulate the time used
        // Account for time.
        duration := nanotime() - startTime
        switch _p_.gcMarkWorkerMode {
        case gcMarkWorkerDedicatedMode:
            atomic.Xaddint64(&gcController.dedicatedMarkTime, duration)
            atomic.Xaddint64(&gcController.dedicatedMarkWorkersNeeded, 1)
        case gcMarkWorkerFractionalMode:
            atomic.Xaddint64(&gcController.fractionalMarkTime, duration)
            atomic.Xaddint64(&gcController.fractionalMarkWorkersNeeded, 1)
        case gcMarkWorkerIdleMode:
            atomic.Xaddint64(&gcController.idleMarkTime, duration)
        }
        
        // Was this the last worker and did we run out
        // of work?
        incnwait := atomic.Xadd(&work.nwait, +1)
        if incnwait > work.nproc {
            println("runtime: p.gcMarkWorkerMode=", _p_.gcMarkWorkerMode,
                "work.nwait=", incnwait, "work.nproc=", work.nproc)
            throw("work.nwait > work.nproc")
        }
        
        // Check whether all background mark workers are done and no more work is available
        // If this worker reached a background mark completion
        // point, signal the main GC goroutine.
        if incnwait == work.nproc && !gcMarkWorkAvailable(nil) {
            // Detach from the P
            // Make this G preemptible and disassociate it
            // as the worker for this P so
            // findRunnableGCWorker doesn't try to
            // schedule it.
            _p_.gcBgMarkWorker.set(nil)
            
            // Allow preemption again
            releasem(park.m.ptr())
            
            // Prepare to enter the mark termination phase
            gcMarkDone()
            
            // Reattach to the P before parking again.
            // Preemption was allowed above, so we may be running on a
            // different P by now; if reattaching fails, this worker ends.
            // Disable preemption and prepare to reattach
            // to the P.
            //
            // We may be running on a different P at this
            // point, so we can't reattach until this G is
            // parked.
            park.m.set(acquirem())
            park.attach.set(_p_)
        }
    }
}

  The gcDrain function performs the actual marking work:

// gcDrain scans roots and objects in work buffers, blackening grey
// objects until all roots and work buffers have been drained.
//
// If flags&gcDrainUntilPreempt != 0, gcDrain returns when g.preempt
// is set. This implies gcDrainNoBlock.
//
// If flags&gcDrainIdle != 0, gcDrain returns when there is other work
// to do. This implies gcDrainNoBlock.
//
// If flags&gcDrainNoBlock != 0, gcDrain returns as soon as it is
// unable to get more work. Otherwise, it will block until all
// blocking calls are blocked in gcDrain.
//
// If flags&gcDrainFlushBgCredit != 0, gcDrain flushes scan work
// credit to gcController.bgScanCredit every gcCreditSlack units of
// scan work.
//
//go:nowritebarrier
func gcDrain(gcw *gcWork, flags gcDrainFlags) {
    if !writeBarrier.needed {
        throw("gcDrain phase incorrect")
    }
    
    gp := getg().m.curg
    
    // Whether to return when the preempt flag is seen
    preemptible := flags&gcDrainUntilPreempt != 0
    
    // Whether to block waiting for work when none is available
    blocking := flags&(gcDrainUntilPreempt|gcDrainIdle|gcDrainNoBlock) == 0
    
    // Whether to credit background scan work so assists shrink and waiting Gs wake up
    flushBgCredit := flags&gcDrainFlushBgCredit != 0
    
    // Whether to do only a limited amount of work (idle mode)
    idle := flags&gcDrainIdle != 0
    
    // Record the initial amount of scan work
    initScanWork := gcw.scanWork
    
    // After idleCheckThreshold (100000) units of scan work, check whether to return
    // idleCheck is the scan work at which to perform the next
    // idle check with the scheduler.
    idleCheck := initScanWork + idleCheckThreshold
    
    // If the roots aren't fully scanned yet, scan them first
    // Drain root marking jobs.
    if work.markrootNext < work.markrootJobs {
        // If preemptible is set, loop until preempted
        for !(preemptible && gp.preempt) {
            // Take a job from the root scan queue (atomic increment)
            job := atomic.Xadd(&work.markrootNext, +1) - 1
            if job >= work.markrootJobs {
                break
            }
            // Do the root scanning job
            markroot(gcw, job)
            // In idle mode, return if there is other work to do
            if idle && pollWork() {
                goto done
            }
        }
    }
    
    // The roots are now on the mark queue; consume the mark queue.
    // If preemptible is set, loop until preempted.
    // Drain heap marking jobs.
    for !(preemptible && gp.preempt) {
        // If the global mark queue is empty, hand part of the local queue's work to it
        // (move wbuf2 over if it is non-empty, otherwise move half of wbuf1).
        // Try to keep work available on the global queue. We used to
        // check if there were waiting workers, but it's better to
        // just keep work available than to make workers wait. In the
        // worst case, we'll do O(log(_WorkbufSize)) unnecessary
        // balances.
        if work.full == 0 {
            gcw.balance()
        }
        
        // Get an object from the local mark queue, falling back to the global queue
        var b uintptr
        if blocking {
            // Blocking get
            b = gcw.get()
        } else {
            // Non-blocking get
            b = gcw.tryGetFast()
            if b == 0 {
                b = gcw.tryGet()
            }
        }
        
        // No object available: the mark queues are empty, exit the loop
        if b == 0 {
            // work barrier reached or tryGet failed.
            break
        }
        
        // Scan the object we got
        scanobject(b, gcw)
        
        // Once a certain amount of scan work has accumulated (gcCreditSlack is 2000)
        // Flush background scan work credit to the global
        // account if we've accumulated enough locally so
        // mutator assists can draw on it.
        if gcw.scanWork >= gcCreditSlack {
            // Add the local scan work to the global total
            atomic.Xaddint64(&gcController.scanWork, gcw.scanWork)
            // Reduce the assist workload and wake up waiting Gs
            if flushBgCredit {
                gcFlushBgCredit(gcw.scanWork - initScanWork)
                initScanWork = 0
            }
            idleCheck -= gcw.scanWork
            gcw.scanWork = 0
            
            // In idle mode, once the check threshold is reached, look for other work (Gs) and exit the loop if any exists
            if idle && idleCheck <= 0 {
                idleCheck += idleCheckThreshold
                if pollWork() {
                    break
                }
            }
        }
    }
    
    // In blocking mode, write barriers are not allowed after this
    // point because we must preserve the condition that the work
    // buffers are empty.
    
done:
    // Add the remaining scan work to the global total
    // Flush remaining scan work credit.
    if gcw.scanWork > 0 {
        atomic.Xaddint64(&gcController.scanWork, gcw.scanWork)
        // Reduce the assist workload and wake up waiting Gs
        if flushBgCredit {
            gcFlushBgCredit(gcw.scanWork - initScanWork)
        }
        gcw.scanWork = 0
    }
}
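
How the flag bits combine into the behaviours used by the worker modes can be checked with a small sketch (the constant values and ordering here are assumed; the real ones live in runtime/mgcmark.go):

package main

import "fmt"

type gcDrainFlags int

const (
    gcDrainUntilPreempt gcDrainFlags = 1 << iota
    gcDrainNoBlock
    gcDrainFlushBgCredit
    gcDrainIdle
)

func describe(name string, flags gcDrainFlags) {
    preemptible := flags&gcDrainUntilPreempt != 0
    blocking := flags&(gcDrainUntilPreempt|gcDrainIdle|gcDrainNoBlock) == 0
    fmt.Printf("%s: preemptible=%v blocking=%v\n", name, preemptible, blocking)
}

func main() {
    describe("dedicated (1st drain)", gcDrainUntilPreempt|gcDrainFlushBgCredit)
    describe("dedicated (2nd drain)", gcDrainNoBlock|gcDrainFlushBgCredit)
    describe("idle", gcDrainIdle|gcDrainUntilPreempt|gcDrainFlushBgCredit)
}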

  The markroot function does the actual root scanning work:

// markroot scans the i'th root.
//
// Preemption must be disabled (because this uses a gcWork).
//
// nowritebarrier is only advisory here.
//
//go:nowritebarrier
func markroot(gcw *gcWork, i uint32) {
    // Work out which kind of job the index corresponds to
    // (the Go engineers themselves find this approach a bit ridiculous).
    // TODO(austin): This is a bit ridiculous. Compute and store
    // the bases in gcMarkRootPrepare instead of the counts.
    baseFlushCache := uint32(fixedRootCount)
    baseData := baseFlushCache + uint32(work.nFlushCacheRoots)
    baseBSS := baseData + uint32(work.nDataRoots)
    baseSpans := baseBSS + uint32(work.nBSSRoots)
    baseStacks := baseSpans + uint32(work.nSpanRoots)
    end := baseStacks + uint32(work.nStackRoots)
 
    // Note: if you add a case here, please also update heapdump.go:dumproots.
    switch {
    // Flush all spans in an mcache; requires STW
    case baseFlushCache <= i && i < baseData:
        flushmcache(int(i - baseFlushCache))
 
    // Scan initialized writable globals (the data sections).
    // Only the block corresponding to i is scanned, passing in the bitmap that says where the pointers are.
    case baseData <= i && i < baseBSS:
        for _, datap := range activeModules() {
            markrootBlock(datap.data, datap.edata-datap.data, datap.gcdatamask.bytedata, gcw, int(i-baseData))
        }
 
    // Scan zero-initialized globals (the bss sections).
    // Only the block corresponding to i is scanned, passing in the bitmap that says where the pointers are.
    case baseBSS <= i && i < baseSpans:
        for _, datap := range activeModules() {
            markrootBlock(datap.bss, datap.ebss-datap.bss, datap.gcbssmask.bytedata, gcw, int(i-baseBSS))
        }
 
    // Scan the finalizer queue
    case i == fixedRootFinalizers:
        // Only do this once per GC cycle since we don't call
        // queuefinalizer during marking.
        if work.markrootDone {
            break
        }
        for fb := allfin; fb != nil; fb = fb.alllink {
            cnt := uintptr(atomic.Load(&fb.cnt))
            scanblock(uintptr(unsafe.Pointer(&fb.fin[0])), cnt*unsafe.Sizeof(fb.fin[0]), &finptrmask[0], gcw)
        }
 
    // Free the stacks of dead Gs
    case i == fixedRootFreeGStacks:
        // Only do this once per GC cycle; preferably
        // concurrently.
        if !work.markrootDone {
            // Switch to the system stack so we can call
            // stackfree.
            systemstack(markrootFreeGStacks)
        }
 
    // Scan the special objects in each span (finalizer lists)
    case baseSpans <= i && i < baseStacks:
        // mark MSpan.specials
        markrootSpans(gcw, int(i-baseSpans))
 
    // Scan each G's stack
    default:
        // Get the G that needs scanning
        // the rest is scanning goroutine stacks
        var gp *g
        if baseStacks <= i && i < end {
            gp = allgs[i-baseStacks]
        } else {
            throw("markroot: bad index")
        }
 
        // remember when we've first observed the G blocked
        // needed only to output in traceback
        status := readgstatus(gp) // We are not in a scan state
        if (status == _Gwaiting || status == _Gsyscall) && gp.waitsince == 0 {
            gp.waitsince = work.tstart
        }
 
        // Switch to g0 (we might be scanning our own stack)
        // scang must be done on the system stack in case
        // we're trying to scan our own stack.
        systemstack(func() {
            // Check whether we are scanning our own stack
            // If this is a self-scan, put the user G in
            // _Gwaiting to prevent self-deadlock. It may
            // already be in _Gwaiting if this is a mark
            // worker or we're in mark termination.
            userG := getg().m.curg
            selfScan := gp == userG && readgstatus(userG) == _Grunning
            
            // If scanning our own stack, switch to waiting to avoid a self-deadlock
            if selfScan {
                casgstatus(userG, _Grunning, _Gwaiting)
                userG.waitreason = "garbage collection scan"
            }
            
            // Scan the G's stack
            // TODO: scang blocks until gp's stack has
            // been scanned, which may take a while for
            // running goroutines. Consider doing this in
            // two phases where the first is non-blocking:
            // we scan the stacks we can and ask running
            // goroutines to scan themselves; and the
            // second blocks.
            scang(gp, gcw)
            
            // If we scanned our own stack, switch back to running
            if selfScan {
                casgstatus(userG, _Gwaiting, _Grunning)
            }
        })
    }
}

  The scang function is responsible for scanning a G's stack:

// scang blocks until gp's stack has been scanned.
// It might be scanned by scang or it might be scanned by the goroutine itself.
// Either way, the stack scan has completed when scang returns.
func scang(gp *g, gcw *gcWork) {
    // Invariant; we (the caller, markroot for a specific goroutine) own gp.gcscandone.
    // Nothing is racing with us now, but gcscandone might be set to true left over
    // from an earlier round of stack scanning (we scan twice per GC).
    // We use gcscandone to record whether the scan has been done during this round.
 
    // Mark the scan as not yet done
    gp.gcscandone = false
 
    // See http://golang.org/cl/21503 for justification of the yield delay.
    const yieldDelay = 10 * 1000
    var nextYield int64
 
    // Loop until the scan completes
    // Endeavor to get gcscandone set to true,
    // either by doing the stack scan ourselves or by coercing gp to scan itself.
    // gp.gcscandone can transition from false to true when we're not looking
    // (if we asked for preemption), so any time we lock the status using
    // castogscanstatus we have to double-check that the scan is still not done.
loop:
    for i := 0; !gp.gcscandone; i++ {
        // Check the G's current status
        switch s := readgstatus(gp); s {
        default:
            dumpgstatus(gp)
            throw("stopg: invalid status")
 
        // The G is dead; no need to scan it
        case _Gdead:
            // No stack.
            gp.gcscandone = true
            break loop
 
        // The G's stack is being copied; try again next round
        case _Gcopystack:
        // Stack being switched. Go around again.
 
        // The G is not running; first keep it from running
        case _Grunnable, _Gsyscall, _Gwaiting:
            // Claim goroutine by setting scan bit.
            // Racing with execution or readying of gp.
            // The scan bit keeps them from running
            // the goroutine until we're done.
            if castogscanstatus(gp, s, s|_Gscan) {
                // If the atomic status switch succeeded, scan its stack
                if !gp.gcscandone {
                    scanstack(gp, gcw)
                    gp.gcscandone = true
                }
                // Restore the G's status and break out of the loop
                restartg(gp)
                break loop
            }
 
        // The G is scanning itself; wait for it to finish
        case _Gscanwaiting:
        // newstack is doing a scan for us right now. Wait.
 
        // The G is running
        case _Grunning:
            // Goroutine running. Try to preempt execution so it can scan itself.
            // The preemption handler (in newstack) does the actual scan.
 
            // If a preemption request is already pending, a successful preemption will do the scan for us
            // Optimization: if there is already a pending preemption request
            // (from the previous loop iteration), don't bother with the atomics.
            if gp.preemptscan && gp.preempt && gp.stackguard0 == stackPreempt {
                break
            }
 
            // Preempt the G; on success the G scans itself
            // Ask for preemption and self scan.
            if castogscanstatus(gp, _Grunning, _Gscanrunning) {
                if !gp.gcscandone {
                    gp.preemptscan = true
                    gp.preempt = true
                    gp.stackguard0 = stackPreempt
                }
                casfrom_Gscanstatus(gp, _Gscanrunning, _Grunning)
            }
        }
 
        // The first round yields for 10 microseconds, later rounds for 5 microseconds (yieldDelay is in nanoseconds)
        if i == 0 {
            nextYield = nanotime() + yieldDelay
        }
        if nanotime() < nextYield {
            procyield(10)
        } else {
            osyield()
            nextYield = nanotime() + yieldDelay/2
        }
    }
 
    // Scan complete; cancel the preempt-scan request
    gp.preemptscan = false // cancel scan request if no longer needed
}

  

After preemptscan is set, a successfully preempted G calls scanstack to scan its own stack (the preemption handler in newstack does the actual scan).
The function used to scan stacks is scanstack:

// scanstack scans gp's stack, greying all pointers found on the stack.
//
// scanstack is marked go:systemstack because it must not be preempted
// while using a workbuf.
//
//go:nowritebarrier
//go:systemstack
func scanstack(gp *g, gcw *gcWork) {
    if gp.gcscanvalid {
        return
    }
 
    if readgstatus(gp)&_Gscan == 0 {
        print("runtime:scanstack: gp=", gp, ", goid=", gp.goid, ", gp->atomicstatus=", hex(readgstatus(gp)), "\n")
        throw("scanstack - bad status")
    }
 
    switch readgstatus(gp) &^ _Gscan {
    default:
        print("runtime: gp=", gp, ", goid=", gp.goid, ", gp->atomicstatus=", readgstatus(gp), "\n")
        throw("mark - bad status")
    case _Gdead:
        return
    case _Grunning:
        print("runtime: gp=", gp, ", goid=", gp.goid, ", gp->atomicstatus=", readgstatus(gp), "\n")
        throw("scanstack: goroutine not stopped")
    case _Grunnable, _Gsyscall, _Gwaiting:
        // ok
    }
 
    if gp == getg() {
        throw("can't scan our own stack")
    }
    mp := gp.m
    if mp != nil && mp.helpgc != 0 {
        throw("can't scan gchelper stack")
    }
 
    // Shrink the stack if not much of it is being used. During
    // concurrent GC, we can do this during concurrent mark.
    if !work.markrootDone {
        shrinkstack(gp)
    }
 
    // Scan the stack.
    var cache pcvalueCache
    scanframe := func(frame *stkframe, unused unsafe.Pointer) bool {
        // scanframeworker looks up the function info from the code address (pc),
        // finds stackmap.bytedata in it, which records where the pointers live on the function's stack,
        // and then calls scanblock to scan the frame; the function's arguments are scanned the same way.
        scanframeworker(frame, &cache, gcw)
        return true
    }
    // Enumerate all call frames, invoking scanframe on each
    gentraceback(^uintptr(0), ^uintptr(0), 0, gp, 0, nil, 0x7fffffff, scanframe, nil, 0)
    // Enumerate all deferred call frames, invoking scanframe on each
    tracebackdefers(gp, scanframe, nil)
    gp.gcscanvalid = true
}

  scanblock is a general-purpose scanning function used for both globals and stack space; unlike scanobject, the pointer bitmap must be passed in explicitly:

// scanblock scans b as scanobject would, but using an explicit
// pointer bitmap instead of the heap bitmap.
//
// This is used to scan non-heap roots, so it does not update
// gcw.bytesMarked or gcw.scanWork.
//
//go:nowritebarrier
func scanblock(b0, n0 uintptr, ptrmask *uint8, gcw *gcWork) {
    // Use local copies of original parameters, so that a stack trace
    // due to one of the throws below shows the original block
    // base and extent.
    b := b0
    n := n0
 
    arena_start := mheap_.arena_start
    arena_used := mheap_.arena_used
 
    // Walk the addresses to scan
    for i := uintptr(0); i < n; {
        // Find the corresponding byte in the bitmap
        // Find bits for the next word.
        bits := uint32(*addb(ptrmask, i/(sys.PtrSize*8)))
        if bits == 0 {
            i += sys.PtrSize * 8
            continue
        }
        // Walk the bits in the byte
        for j := 0; j < 8 && i < n; j++ {
            // If this word holds a pointer
            if bits&1 != 0 {
                // Mark the object at that address alive and add it to the mark queue (it turns grey)
                // Same work as in scanobject; see comments there.
                obj := *(*uintptr)(unsafe.Pointer(b + i))
                if obj != 0 && arena_start <= obj && obj < arena_used {
                    // Find the span and bitmap for the object
                    if obj, hbits, span, objIndex := heapBitsForObject(obj, b, i); obj != 0 {
                        // Mark the object alive and add it to the mark queue (it turns grey)
                        greyobject(obj, b, i, hbits, span, gcw, objIndex)
                    }
                }
            }
            // Advance to the next bit and the next pointer-sized word
            bits >>= 1
            i += sys.PtrSize
        }
    }
}
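
The ptrmask walk (one bit per pointer-sized word, eight words per mask byte) is easy to reproduce in isolation; a sketch assuming a 64-bit platform where PtrSize is 8:

package main

import "fmt"

func main() {
    const ptrSize = 8
    ptrmask := []byte{0x05} // bits 0 and 2 set: words 0 and 2 hold pointers
    n := uintptr(3 * ptrSize)
    for i := uintptr(0); i < n; {
        bits := ptrmask[i/(ptrSize*8)] // one mask byte covers 8 words
        for j := 0; j < 8 && i < n; j++ {
            if bits&1 != 0 {
                fmt.Println("pointer-sized word with pointer at offset", i)
            }
            bits >>= 1
            i += ptrSize
        }
    }
}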

  greyobject marks an object alive and adds it to the mark queue (the object turns grey):

// obj is the start of an object with mark mbits.
// If it isn't already marked, mark it and enqueue into gcw.
// base and off are for debugging only and could be removed.
//go:nowritebarrierrec
func greyobject(obj, base, off uintptr, hbits heapBits, span *mspan, gcw *gcWork, objIndex uintptr) {
    // obj should be start of allocation, and so must be at least pointer-aligned.
    if obj&(sys.PtrSize-1) != 0 {
        throw("greyobject: obj not pointer-aligned")
    }
    mbits := span.markBitsForIndex(objIndex)
 
    if useCheckmark {
        // checkmark is a debug-only mechanism that verifies all reachable objects were marked correctly
        if !mbits.isMarked() {
            printlock()
            print("runtime:greyobject: checkmarks finds unexpected unmarked object obj=", hex(obj), "\n")
            print("runtime: found obj at *(", hex(base), "+", hex(off), ")\n")
 
            // Dump the source (base) object
            gcDumpObject("base", base, off)
 
            // Dump the object
            gcDumpObject("obj", obj, ^uintptr(0))
 
            getg().m.traceback = 2
            throw("checkmark found unmarked object")
        }
        if hbits.isCheckmarked(span.elemsize) {
            return
        }
        hbits.setCheckmarked(span.elemsize)
        if !hbits.isCheckmarked(span.elemsize) {
            throw("setCheckmarked and isCheckmarked disagree")
        }
    } else {
        if debug.gccheckmark > 0 && span.isFree(objIndex) {
            print("runtime: marking free object ", hex(obj), " found at *(", hex(base), "+", hex(off), ")\n")
            gcDumpObject("base", base, off)
            gcDumpObject("obj", obj, ^uintptr(0))
            getg().m.traceback = 2
            throw("marking free object")
        }
 
        // If the object's bit in its span's gcmarkBits is already set, we have nothing to do.
        if mbits.isMarked() {
            return
        }
        
        // Set the object's bit in the span's gcmarkBits
        // mbits.setMarked() // Avoid extra call overhead with manual inlining.
        atomic.Or8(mbits.bytep, mbits.mask)
        
        // If the object is known to contain no pointers (its span is noscan), it need not go on the mark queue
        // If this is a noscan object, fast-track it to black
        // instead of greying it.
        if span.spanclass.noscan() {
            gcw.bytesMarked += uint64(span.elemsize)
            return
        }
    }
 
    // Put the object on the mark queue:
    // try the local queue first; on failure, move part of the local work to the global queue, then retry the local queue.
    // Queue the obj for scanning. The PREFETCH(obj) logic has been removed but
    // seems like a nice optimization that can be added back in.
    // There needs to be time between the PREFETCH and the use.
    // Previously we put the obj in an 8 element buffer that is drained at a rate
    // to give the PREFETCH time to do its work.
    // Use of PREFETCHNTA might be more appropriate than PREFETCH
    if !gcw.putFast(obj) {
        gcw.put(obj)
    }
}
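
The manually inlined mbits.setMarked() boils down to an atomic OR of one bit into a shared word. The runtime uses its internal runtime/internal/atomic.Or8; user code has no Or8, so a standalone sketch has to emulate it with a compare-and-swap loop:

package main

import (
    "fmt"
    "sync/atomic"
)

// atomicOr sets the given bits in *addr atomically, emulating the
// runtime's Or8 with a CAS loop on a uint32.
func atomicOr(addr *uint32, mask uint32) {
    for {
        old := atomic.LoadUint32(addr)
        if atomic.CompareAndSwapUint32(addr, old, old|mask) {
            return
        }
    }
}

func main() {
    var markBits uint32
    atomicOr(&markBits, 1<<3) // mark the object with index 3
    fmt.Printf("%08b\n", markBits)
}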

  Once gcDrain has scanned the roots, it starts consuming the mark queue, calling scanobject on every object taken off the queue:

// scanobject scans the object starting at b, adding pointers to gcw.
// b must point to the beginning of a heap object or an oblet.
// scanobject consults the GC bitmap for the pointer mask and the
// spans for the size of the object.
//
//go:nowritebarrier
func scanobject(b uintptr, gcw *gcWork) {
    // Note that arena_used may change concurrently during
    // scanobject and hence scanobject may encounter a pointer to
    // a newly allocated heap object that is *not* in
    // [start,used). It will not mark this object; however, we
    // know that it was just installed by a mutator, which means
    // that mutator will execute a write barrier and take care of
    // marking it. This is even more pronounced on relaxed memory
    // architectures since we access arena_used without barriers
    // or synchronization, but the same logic applies.
    arena_start := mheap_.arena_start
    arena_used := mheap_.arena_used
 
    // Find the bits for b and the size of the object at b.
    //
    // b is either the beginning of an object, in which case this
    // is the size of the object to scan, or it points to an
    // oblet, in which case we compute the size to scan below.
    // Get the bitmap for the object
    hbits := heapBitsForAddr(b)
    
    // Get the span containing the object
    s := spanOfUnchecked(b)
    
    // Get the object's size
    n := s.elemsize
    if n == 0 {
        throw("scanobject n == 0")
    }
 
    // Oversized objects (maxObletBytes is 128KB) are split for scanning,
    // at most 128KB at a time.
    if n > maxObletBytes {
        // Large object. Break into oblets for better
        // parallelism and lower latency.
        if b == s.base() {
            // It's possible this is a noscan object (not
            // from greyobject, but from other code
            // paths), in which case we must *not* enqueue
            // oblets since their bitmaps will be
            // uninitialized.
            if s.spanclass.noscan() {
                // Bypass the whole scan.
                gcw.bytesMarked += uint64(n)
                return
            }
 
            // Enqueue the other oblets to scan later.
            // Some oblets may be in b's scalar tail, but
            // these will be marked as "no more pointers",
            // so we'll drop out immediately when we go to
            // scan those.
            for oblet := b + maxObletBytes; oblet < s.base()+s.elemsize; oblet += maxObletBytes {
                if !gcw.putFast(oblet) {
                    gcw.put(oblet)
                }
            }
        }
 
        // Compute the size of the oblet. Since this object
        // must be a large object, s.base() is the beginning
        // of the object.
        n = s.base() + s.elemsize - b
        if n > maxObletBytes {
            n = maxObletBytes
        }
    }
 
    // Scan the pointers in the object
    var i uintptr
    for i = 0; i < n; i += sys.PtrSize {
        // Get the corresponding bits
        // Find bits for this word.
        if i != 0 {
            // Avoid needless hbits.next() on last iteration.
            hbits = hbits.next()
        }
        // Load bits once. See CL 22712 and issue 16973 for discussion.
        bits := hbits.bits()
        
        // Check the scan bit to decide whether to keep scanning; note the second word's scan bit is the checkmark
        // During checkmarking, 1-word objects store the checkmark
        // in the type bit for the one word. The only one-word objects
        // are pointers, or else they'd be merged with other non-pointer
        // data into larger allocations.
        if i != 1*sys.PtrSize && bits&bitScan == 0 {
            break // no more pointers in this object
        }
        
        // Check the pointer bit; skip if this word isn't a pointer
        if bits&bitPointer == 0 {
            continue // not a pointer
        }
 
        // Load the pointer value
        // Work here is duplicated in scanblock and above.
        // If you make changes here, make changes there too.
        obj := *(*uintptr)(unsafe.Pointer(b + i))
 
        // If the pointer is inside the arena, call greyobject to mark the object and put it on the mark queue
        // At this point we have extracted the next potential pointer.
        // Check if it points into heap and not back at the current object.
        if obj != 0 && arena_start <= obj && obj < arena_used && obj-b >= n {
            // Mark the object.
            if obj, hbits, span, objIndex := heapBitsForObject(obj, b, i); obj != 0 {
                greyobject(obj, b, i, hbits, span, gcw, objIndex)
            }
        }
    }
    
    // Update the statistics: bytes marked and scan work done (bytes scanned)
    gcw.bytesMarked += uint64(n)
    gcw.scanWork += int64(i)
}
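The bitScan/bitPointer tests above read the bitmap's 2-bit-per-word encoding. Below is a standalone toy decoder, not the runtime's heapBits implementation; it assumes the 1.9 layout in which one bitmap byte covers four pointer-sized words, with the pointer bits in the low nibble and the scan bits in the high nibble:

package main

import "fmt"

// Toy decoder for a single bitmap byte (assumed layout: pointer bits in
// bits 0-3, scan bits in bits 4-7, describing four pointer-sized words).
const (
    bitPointer = 1 << 0
    bitScan    = 1 << 4
)

func decode(b byte) {
    for word := uint(0); word < 4; word++ {
        bits := b >> word
        fmt.Printf("word %d: pointer=%v scan=%v\n",
            word, bits&bitPointer != 0, bits&bitScan != 0)
    }
}

func main() {
    decode(0x31) // word 0 holds a pointer; the scan bits stop after word 1
}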

  

When all the background mark workers have drained the mark queue, the gcMarkDone function runs to prepare for entering mark termination:
in concurrent GC, gcMarkDone is executed twice; the first call disables the per-P local mark queues and restarts the background mark workers, and the second call enters mark termination.
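A toy model of this two-call structure (hypothetical, for intuition only; the real function follows):

package main

import "fmt"

// gcBlackenPromptly distinguishes the two calls: false means we are still
// in mark 1, true means mark 2 has already been set up.
func main() {
    gcBlackenPromptly := false
    for call := 1; call <= 2; call++ {
        if !gcBlackenPromptly {
            gcBlackenPromptly = true
            fmt.Println("call", call, ": mark 1 -> mark 2, flush per-P queues, keep marking")
        } else {
            fmt.Println("call", call, ": mark 2 -> mark termination, stop the world")
        }
    }
}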

// gcMarkDone transitions the GC from mark 1 to mark 2 and from mark 2
// to mark termination.
//
// This should be called when all mark work has been drained. In mark
// 1, this includes all root marking jobs, global work buffers, and
// active work buffers in assists and background workers; however,
// work may still be cached in per-P work buffers. In mark 2, per-P
// caches are disabled.
//
// The calling context must be preemptible.
//
// Note that it is explicitly okay to have write barriers in this
// function because completion of concurrent mark is best-effort
// anyway. Any work created by write barriers here will be cleaned up
// by mark termination.
func gcMarkDone() {
top:
    semacquire(&work.markDoneSema)
 
    // Re-check transition condition under transition lock.
    if !(gcphase == _GCmark && work.nwait == work.nproc && !gcMarkWorkAvailable(nil)) {
        semrelease(&work.markDoneSema)
        return
    }
 
    // Temporarily disallow starting new background mark workers
    // Disallow starting new workers so that any remaining workers
    // in the current mark phase will drain out.
    //
    // TODO(austin): Should dedicated workers keep an eye on this
    // and exit gcDrain promptly?
    atomic.Xaddint64(&gcController.dedicatedMarkWorkersNeeded, -0xffffffff)
    atomic.Xaddint64(&gcController.fractionalMarkWorkersNeeded, -0xffffffff)
 
    // Check whether the per-P local mark queues have been disabled yet
    if !gcBlackenPromptly {
        // The local mark queues are not disabled yet: disable them, then restart the background mark workers
        // Transition from mark 1 to mark 2.
        //
        // The global work list is empty, but there can still be work
        // sitting in the per-P work caches.
        // Flush and disable work caches.
 
        // Disable the local mark queues
        // Disallow caching workbufs and indicate that we're in mark 2.
        gcBlackenPromptly = true
 
        // Prevent completion of mark 2 until we've flushed
        // cached workbufs.
        atomic.Xadd(&work.nwait, -1)
 
        // GC is set up for mark 2. Let Gs blocked on the
        // transition lock go while we flush caches.
        semrelease(&work.markDoneSema)
 
        // Push all objects in the local mark queues to the global mark queue
        systemstack(func() {
            // Flush all currently cached workbufs and
            // ensure all Ps see gcBlackenPromptly. This
            // also blocks until any remaining mark 1
            // workers have exited their loop so we can
            // start new mark 2 workers.
            forEachP(func(_p_ *p) {
                _p_.gcw.dispose()
            })
        })
 
        // For debugging
        // Check that roots are marked. We should be able to
        // do this before the forEachP, but based on issue
        // #16083 there may be a (harmless) race where we can
        // enter mark 2 while some workers are still scanning
        // stacks. The forEachP ensures these scans are done.
        //
        // TODO(austin): Figure out the race and fix this
        // properly.
        gcMarkRootCheck()
 
        // Allow starting new background mark workers again
        // Now we can start up mark 2 workers.
        atomic.Xaddint64(&gcController.dedicatedMarkWorkersNeeded, 0xffffffff)
        atomic.Xaddint64(&gcController.fractionalMarkWorkersNeeded, 0xffffffff)
 
        // If we are sure there is no more work, jump straight back to the top of the function,
        // which then behaves as the second call
        incnwait := atomic.Xadd(&work.nwait, +1)
        if incnwait == work.nproc && !gcMarkWorkAvailable(nil) {
            // This loop will make progress because
            // gcBlackenPromptly is now true, so it won't
            // take this same "if" branch.
            goto top
        }
    } else {
        // Record the start time of mark termination and of the STW pause
        // Transition to mark termination.
        now := nanotime()
        work.tMarkTerm = now
        work.pauseStart = now
        
        // Prevent the current G from being preempted
        getg().m.preemptoff = "gcing"
        
        // Stop all running Gs and prevent them from running
        systemstack(stopTheWorldWithSema)
        
        // !!!!!!!!!!!!!!!!
        // The world is stopped (STW)...
        // !!!!!!!!!!!!!!!!
        
        // The gcphase is _GCmark, it will transition to _GCmarktermination
        // below. The important thing is that the wb remains active until
        // all marking is complete. This includes writes made by the GC.
        
        // Record that root scanning is complete; this affects the handling in gcMarkRootPrepare
        // Record that one root marking pass has completed.
        work.markrootDone = true
        
        // Disable mark assists and background mark workers
        // Disable assists and background workers. We must do
        // this before waking blocked assists.
        atomic.Store(&gcBlackenEnabled, 0)
        
        // Wake all Gs that went to sleep in mark assists
        // Wake all blocked assists. These will run when we
        // start the world again.
        gcWakeAllAssists()
        
        // Likewise, release the transition lock. Blocked
        // workers and assists will run when we start the
        // world again.
        semrelease(&work.markDoneSema)
        
        // Compute the trigger ratio used to derive the heap size that triggers the next GC
        // endCycle depends on all gcWork cache stats being
        // flushed. This is ensured by mark 2.
        nextTriggerRatio := gcController.endCycle()
        
        // Enter mark termination; this will restart the world
        // Perform mark termination. This will restart the world.
        gcMarkTermination(nextTriggerRatio)
    }
}

  The gcMarkTermination function enters the mark termination phase:

func gcMarkTermination(nextTriggerRatio float64) {
    // World is stopped.
    // Start marktermination which includes enabling the write barrier.
    // Disable mark assists and background mark workers
    atomic.Store(&gcBlackenEnabled, 0)
    
    // Re-allow local mark queues (for the next GC)
    gcBlackenPromptly = false
    
    // Set the current GC phase to mark termination, with the write barrier enabled
    setGCPhase(_GCmarktermination)
 
    // Record the start time
    work.heap1 = memstats.heap_live
    startTime := nanotime()
 
    // Prevent the current G from being preempted
    mp := acquirem()
    mp.preemptoff = "gcing"
    _g_ := getg()
    _g_.m.traceback = 2
    
    // Set the G's status to waiting so that its stack can be scanned
    gp := _g_.m.curg
    casgstatus(gp, _Grunning, _Gwaiting)
    gp.waitreason = "garbage collection"
 
    // Switch to g0
    // Run gc on the g0 stack. We do this so that the g stack
    // we're currently running on will no longer change. Cuts
    // the root set down a bit (g0 stacks are not scanned, and
    // we don't need to scan gc's internal state).  We also
    // need to switch to g0 so we can shrink the stack.
    systemstack(func() {
        // Perform the marking inside the STW
        gcMark(startTime)
        
        // Must return immediately: the outer G's stack may have moved, so its variables must not be accessed after this point
        // Must return immediately.
        // The outer function's stack may have moved
        // during gcMark (it shrinks stacks, including the
        // outer function's stack), so we must not refer
        // to any of its variables. Return back to the
        // non-system stack to pick up the new addresses
        // before continuing.
    })
 
    // Switch to g0 again
    systemstack(func() {
        work.heap2 = work.bytesMarked
        
        // If checkmark is enabled, verify that every reachable object was marked
        if debug.gccheckmark > 0 {
            // Run a full stop-the-world mark using checkmark bits,
            // to check that we didn't forget to mark anything during
            // the concurrent mark process.
            gcResetMarkState()
            initCheckmarks()
            gcMark(startTime)
            clearCheckmarks()
        }
 
        // Set the current GC phase to off and disable the write barrier
        // marking is complete so we can turn the write barrier off
        setGCPhase(_GCoff)
        
        // Wake the background sweep task; it will start running after the STW ends
        gcSweep(work.mode)
 
        // For debugging
        if debug.gctrace > 1 {
            startTime = nanotime()
            // The g stacks have been scanned so
            // they have gcscanvalid==true and gcworkdone==true.
            // Reset these so that all stacks will be rescanned.
            gcResetMarkState()
            finishsweep_m()
 
            // Still in STW but gcphase is _GCoff, reset to _GCmarktermination
            // At this point all objects will be found during the gcMark which
            // does a complete STW mark and object scan.
            setGCPhase(_GCmarktermination)
            gcMark(startTime)
            setGCPhase(_GCoff) // marking is done, turn off wb.
            gcSweep(work.mode)
        }
    })
 
    // Set the G's status back to running
    _g_.m.traceback = 0
    casgstatus(gp, _Gwaiting, _Grunning)
 
    // Trace handling
    if trace.enabled {
        traceGCDone()
    }
 
    // all done
    mp.preemptoff = ""
 
    if gcphase != _GCoff {
        throw("gc done but gcphase != _GCoff")
    }
 
    // Update the heap size that triggers the next GC (gc_trigger)
    // Update GC trigger and pacing for the next cycle.
    gcSetTriggerRatio(nextTriggerRatio)
 
    // Update the timing records
    // Update timing memstats
    now := nanotime()
    sec, nsec, _ := time_now()
    unixNow := sec*1e9 + int64(nsec)
    work.pauseNS += now - work.pauseStart
    work.tEnd = now
    atomic.Store64(&memstats.last_gc_unix, uint64(unixNow)) // must be Unix time to make sense to user
    atomic.Store64(&memstats.last_gc_nanotime, uint64(now)) // monotonic time for us
    memstats.pause_ns[memstats.numgc%uint32(len(memstats.pause_ns))] = uint64(work.pauseNS)
    memstats.pause_end[memstats.numgc%uint32(len(memstats.pause_end))] = uint64(unixNow)
    memstats.pause_total_ns += uint64(work.pauseNS)
 
    // Update the CPU usage records
    // Update work.totaltime.
    sweepTermCpu := int64(work.stwprocs) * (work.tMark - work.tSweepTerm)
    // We report idle marking time below, but omit it from the
    // overall utilization here since it's "free".
    markCpu := gcController.assistTime + gcController.dedicatedMarkTime + gcController.fractionalMarkTime
    markTermCpu := int64(work.stwprocs) * (work.tEnd - work.tMarkTerm)
    cycleCpu := sweepTermCpu + markCpu + markTermCpu
    work.totaltime += cycleCpu
 
    // Compute overall GC CPU utilization.
    totalCpu := sched.totaltime + (now-sched.procresizetime)*int64(gomaxprocs)
    memstats.gc_cpu_fraction = float64(work.totaltime) / float64(totalCpu)
 
    // Reset the sweep state
    // Reset sweep state.
    sweep.nbgsweep = 0
    sweep.npausesweep = 0
 
    // Count forced GCs
    if work.userForced {
        memstats.numforcedgc++
    }
 
    // Bump the GC cycle count, then wake the Gs waiting on sweep
    // Bump GC cycle count and wake goroutines waiting on sweep.
    lock(&work.sweepWaiters.lock)
    memstats.numgc++
    injectglist(work.sweepWaiters.head.ptr())
    work.sweepWaiters.head = 0
    unlock(&work.sweepWaiters.lock)
 
    // For profiling
    // Finish the current heap profiling cycle and start a new
    // heap profiling cycle. We do this before starting the world
    // so events don't leak into the wrong cycle.
    mProf_NextCycle()
 
    // Restart the world
    systemstack(startTheWorldWithSema)
 
    // !!!!!!!!!!!!!!!
    // The world is restarted...
    // !!!!!!!!!!!!!!!
 
    // For profiling
    // Flush the heap profile so we can start a new cycle next GC.
    // This is relatively expensive, so we don't do it with the
    // world stopped.
    mProf_Flush()
 
    // Move the buffers used by the mark queue to the free list so they can be reclaimed
    // Prepare workbufs for freeing by the sweeper. We do this
    // asynchronously because it can take non-trivial time.
    prepareFreeWorkbufs()
 
    // Free unused stack spans
    // Free stack spans. This must be done between GC cycles.
    systemstack(freeStackSpans)
 
    // For debugging
    // Print gctrace before dropping worldsema. As soon as we drop
    // worldsema another cycle could start and smash the stats
    // we're trying to print.
    if debug.gctrace > 0 {
        util := int(memstats.gc_cpu_fraction * 100)
 
        var sbuf [24]byte
        printlock()
        print("gc ", memstats.numgc,
            " @", string(itoaDiv(sbuf[:], uint64(work.tSweepTerm-runtimeInitTime)/1e6, 3)), "s ",
            util, "%: ")
        prev := work.tSweepTerm
        for i, ns := range []int64{work.tMark, work.tMarkTerm, work.tEnd} {
            if i != 0 {
                print("+")
            }
            print(string(fmtNSAsMS(sbuf[:], uint64(ns-prev))))
            prev = ns
        }
        print(" ms clock, ")
        for i, ns := range []int64{sweepTermCpu, gcController.assistTime, gcController.dedicatedMarkTime + gcController.fractionalMarkTime, gcController.idleMarkTime, markTermCpu} {
            if i == 2 || i == 3 {
                // Separate mark time components with /.
                print("/")
            } else if i != 0 {
                print("+")
            }
            print(string(fmtNSAsMS(sbuf[:], uint64(ns))))
        }
        print(" ms cpu, ",
            work.heap0>>20, "->", work.heap1>>20, "->", work.heap2>>20, " MB, ",
            work.heapGoal>>20, " MB goal, ",
            work.maxprocs, " P")
        if work.userForced {
            print(" (forced)")
        }
        print("\n")
        printunlock()
    }
 
    semrelease(&worldsema)
    // Careful: another GC cycle may start now.
 
    // Re-allow the current G to be preempted
    releasem(mp)
    mp = nil
 
    // If the GC is concurrent, let the current M continue running (it will return to gcBgMarkWorker and then sleep);
    // if not, let the current M reschedule
    // now that gc is done, kick off finalizer thread if needed
    if !concurrentSweep {
        // give the queued finalizers, if any, a chance to run
        Gosched()
    }
}
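With GODEBUG=gctrace=1 the print block above emits one line per GC cycle. An illustrative line (the numbers here are made up) looks like:

gc 7 @4.201s 0%: 0.15+2.6+0.17 ms clock, 0.61+0.40/2.4/5.1+0.68 ms cpu, 11->11->6 MB, 12 MB goal, 4 P

The three clock values are the sweep termination, concurrent mark, and mark termination wall times; the cpu values follow the pattern sweepTermCpu + assistTime / dedicated+fractional mark time / idle mark time + markTermCpu; and 11->11->6 MB corresponds to heap0->heap1->heap2 above.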

  The gcSweep function wakes the background sweep task.
The background sweep task is started at program startup by the gcenable function.
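For reference, gcenable in 1.9 looks roughly like this (a short quote from the runtime; bgsweep itself is listed after gcSweep below):

// gcenable is called after the bulk of the runtime initialization,
// just before we're about to start letting user code run.
// It kicks off the background sweeper goroutine and enables GC.
func gcenable() {
    c := make(chan int, 1)
    go bgsweep(c)
    <-c
    memstats.enablegc = true // now that runtime is initialized, GC is okay
}

The gcSweep function itself: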

func gcSweep(mode gcMode) {
    if gcphase != _GCoff {
        throw("gcSweep being done but phase is not GCoff")
    }
 
    // Increment sweepgen: the two queues in sweepSpans swap roles, and every span becomes a "to be swept" span
    lock(&mheap_.lock)
    mheap_.sweepgen += 2
    mheap_.sweepdone = 0
    if mheap_.sweepSpans[mheap_.sweepgen/2%2].index != 0 {
        // We should have drained this list during the last
        // sweep phase. We certainly need to start this phase
        // with an empty swept list.
        throw("non-empty swept list")
    }
    mheap_.pagesSwept = 0
    unlock(&mheap_.lock)
 
    // If the sweep is not concurrent, do all the work here (inside the STW)
    if !_ConcurrentSweep || mode == gcForceBlockMode {
        // Special case synchronous sweep.
        // Record that no proportional sweeping has to happen.
        lock(&mheap_.lock)
        mheap_.sweepPagesPerByte = 0
        unlock(&mheap_.lock)
        // Sweep all spans eagerly.
        for sweepone() != ^uintptr(0) {
            sweep.npausesweep++
        }
        // Free workbufs eagerly.
        prepareFreeWorkbufs()
        for freeSomeWbufs(false) {
        }
        // All "free" events for this mark/sweep cycle have
        // now happened, so we can make this profile cycle
        // available immediately.
        mProf_NextCycle()
        mProf_Flush()
        return
    }
 
    // Wake the background sweep task
    // Background sweep.
    lock(&sweep.lock)
    if sweep.parked {
        sweep.parked = false
        ready(sweep.g, 0, true)
    }
    unlock(&sweep.lock)
}
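The sweepgen/2%2 index arithmetic above is what makes the += 2 exchange the queue roles; a tiny standalone illustration (not runtime code):

package main

import "fmt"

// sweepSpans[sweepgen/2%2] holds the swept spans and the other queue holds
// the unswept ones, so adding 2 to sweepgen exchanges the two roles.
func main() {
    for _, sweepgen := range []uint32{2, 4} {
        swept := sweepgen / 2 % 2
        unswept := 1 - swept
        fmt.Printf("sweepgen=%d: swept queue=%d, unswept queue=%d\n",
            sweepgen, swept, unswept)
    }
}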

  The background sweep task runs the bgsweep function:

func bgsweep(c chan int) {
    sweep.g = getg()
 
    // Wait to be woken up
    lock(&sweep.lock)
    sweep.parked = true
    c <- 1
    goparkunlock(&sweep.lock, "GC sweep wait", traceEvGoBlock, 1)
 
    // Sweep loop
    for {
        // Sweep one span, then yield to the scheduler (do only a little work at a time)
        for gosweepone() != ^uintptr(0) {
            sweep.nbgsweep++
            Gosched()
        }
        // Free some unused mark queue buffers back to the heap
        for freeSomeWbufs(true) {
            Gosched()
        }
        // If sweeping is not finished, continue the loop
        lock(&sweep.lock)
        if !gosweepdone() {
            // This can happen if a GC runs between
            // gosweepone returning ^0 above
            // and the lock being acquired.
            unlock(&sweep.lock)
            continue
        }
        // Otherwise put the background sweep task to sleep; the current M continues scheduling
        sweep.parked = true
        goparkunlock(&sweep.lock, "GC sweep wait", traceEvGoBlock, 1)
    }
}
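Besides bgsweep, allocation itself also pays a "sweep tax" so that all spans are swept before the next GC triggers (proportional sweeping, paced by mheap_.sweepPagesPerByte, which the pacer print in sweepone below reports in pages/byte). A toy calculation with made-up numbers; in the real runtime this pacing lives around deductSweepCredit:

package main

import "fmt"

func main() {
    const sweepPagesPerByte = 0.0005 // assumed pacing ratio
    allocated := 2 << 20             // bytes allocated since the sweep phase began
    pagesOwed := float64(allocated) * sweepPagesPerByte
    fmt.Printf("after allocating %d bytes, roughly %.0f pages must be swept\n",
        allocated, pagesOwed)
}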

  The sweepone function is as follows:

// sweeps one span
// returns number of pages returned to heap, or ^uintptr(0) if there is nothing to sweep
//go:nowritebarrier
func sweepone() uintptr {
    _g_ := getg()
    sweepRatio := mheap_.sweepPagesPerByte // For debugging
 
    // Prevent the G from being preempted
    // increment locks to ensure that the goroutine is not preempted
    // in the middle of sweep thus leaving the span in an inconsistent state for next GC
    _g_.m.locks++
    
    // Check whether sweeping is already done
    if atomic.Load(&mheap_.sweepdone) != 0 {
        _g_.m.locks--
        return ^uintptr(0)
    }
    
    // Update the number of concurrently running sweepers
    atomic.Xadd(&mheap_.sweepers, +1)
 
    npages := ^uintptr(0)
    sg := mheap_.sweepgen
    for {
        // Pop a span from sweepSpans
        s := mheap_.sweepSpans[1-sg/2%2].pop()
        // Break out of the loop when everything has been swept
        if s == nil {
            atomic.Store(&mheap_.sweepdone, 1)
            break
        }
        // Skip the span if another M is already sweeping it
        if s.state != mSpanInUse {
            // This can happen if direct sweeping already
            // swept this span, but in that case the sweep
            // generation should always be up-to-date.
            if s.sweepgen != sg {
                print("runtime: bad span s.state=", s.state, " s.sweepgen=", s.sweepgen, " sweepgen=", sg, "\n")
                throw("non in-use span in unswept list")
            }
            continue
        }
        // Atomically advance the span's sweepgen; failure means another M has started sweeping it, so skip
        if s.sweepgen != sg-2 || !atomic.Cas(&s.sweepgen, sg-2, sg-1) {
            continue
        }
        // Sweep this span, then break out of the loop
        npages = s.npages
        if !s.sweep(false) {
            // Span is still in-use, so this returned no
            // pages to the heap and the span needs to
            // move to the swept in-use list.
            npages = 0
        }
        break
    }
 
    // Update the number of concurrently running sweepers
    // Decrement the number of active sweepers and if this is the
    // last one print trace information.
    if atomic.Xadd(&mheap_.sweepers, -1) == 0 && atomic.Load(&mheap_.sweepdone) != 0 {
        if debug.gcpacertrace > 0 {
            print("pacer: sweep done at heap size ", memstats.heap_live>>20, "MB; allocated ", (memstats.heap_live-mheap_.sweepHeapLiveBasis)>>20, "MB during sweep; swept ", mheap_.pagesSwept, " pages at ", sweepRatio, " pages/byte\n")
        }
    }
    // Allow the G to be preempted again
    _g_.m.locks--
    // Return the number of pages swept
    return npages
}
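The sweepgen comparisons above follow the mspan convention documented in the runtime sources:

// if sweepgen == h->sweepgen - 2, the span needs sweeping
// if sweepgen == h->sweepgen - 1, the span is being swept
// if sweepgen == h->sweepgen,     the span is swept and ready to use
// h->sweepgen is incremented by 2 after every GC

so the atomic.Cas from sg-2 to sg-1 in sweepone is what claims a span for sweeping.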

  The span's sweep function sweeps a single span:

// Sweep frees or collects finalizers for blocks not marked in the mark phase.
// It clears the mark bits in preparation for the next GC round.
// Returns true if the span was returned to heap.
// If preserve=true, don't return it to heap nor relink in MCentral lists;
// caller takes care of it.
//TODO go:nowritebarrier
func (s *mspan) sweep(preserve bool) bool {
    // It's critical that we enter this function with preemption disabled,
    // GC must not start while we are in the middle of this function.
    _g_ := getg()
    if _g_.m.locks == 0 && _g_.m.mallocing == 0 && _g_ != _g_.m.g0 {
        throw("MSpan_Sweep: m is not locked")
    }
    sweepgen := mheap_.sweepgen
    if s.state != mSpanInUse || s.sweepgen != sweepgen-1 {
        print("MSpan_Sweep: state=", s.state, " sweepgen=", s.sweepgen, " mheap.sweepgen=", sweepgen, "\n")
        throw("MSpan_Sweep: bad span state")
    }
 
    if trace.enabled {
        traceGCSweepSpan(s.npages * _PageSize)
    }
 
    // Count the pages swept
    atomic.Xadd64(&mheap_.pagesSwept, int64(s.npages))
 
    spc := s.spanclass
    size := s.elemsize
    res := false
 
    c := _g_.m.mcache
    freeToHeap := false
 
    // The allocBits indicate which unmarked objects don't need to be
    // processed since they were free at the end of the last GC cycle
    // and were not allocated since then.
    // If the allocBits index is >= s.freeindex and the bit
    // is not marked then the object remains unallocated
    // since the last GC.
    // This situation is analogous to being on a freelist.
 
    // Handle the finalizers in the special records: if the corresponding object is no longer alive, mark it live to prevent collection, then move the finalizer to the run queue
    // Unlink & free special records for any objects we're about to free.
    // Two complications here:
    // 1. An object can have both finalizer and profile special records.
    //    In such case we need to queue finalizer for execution,
    //    mark the object as live and preserve the profile special.
    // 2. A tiny object can have several finalizers setup for different offsets.
    //    If such object is not marked, we need to queue all finalizers at once.
    // Both 1 and 2 are possible at the same time.
    specialp := &s.specials
    special := *specialp
    for special != nil {
        // A finalizer can be set for an inner byte of an object, find object beginning.
        objIndex := uintptr(special.offset) / size
        p := s.base() + objIndex*size
        mbits := s.markBitsForIndex(objIndex)
        if !mbits.isMarked() {
            // This object is not marked and has at least one special record.
            // Pass 1: see if it has at least one finalizer.
            hasFin := false
            endOffset := p - s.base() + size
            for tmp := special; tmp != nil && uintptr(tmp.offset) < endOffset; tmp = tmp.next {
                if tmp.kind == _KindSpecialFinalizer {
                    // Stop freeing of object if it has a finalizer.
                    mbits.setMarkedNonAtomic()
                    hasFin = true
                    break
                }
            }
            // Pass 2: queue all finalizers _or_ handle profile record.
            for special != nil && uintptr(special.offset) < endOffset {
                // Find the exact byte for which the special was setup
                // (as opposed to object beginning).
                p := s.base() + uintptr(special.offset)
                if special.kind == _KindSpecialFinalizer || !hasFin {
                    // Splice out special record.
                    y := special
                    special = special.next
                    *specialp = special
                    freespecial(y, unsafe.Pointer(p), size)
                } else {
                    // This is profile record, but the object has finalizers (so kept alive).
                    // Keep special record.
                    specialp = &special.next
                    special = *specialp
                }
            }
        } else {
            // object is still live: keep special record
            specialp = &special.next
            special = *specialp
        }
    }
 
    // For debugging
    if debug.allocfreetrace != 0 || raceenabled || msanenabled {
        // Find all newly freed objects. This doesn't have to be
        // efficient; allocfreetrace has massive overhead.
        mbits := s.markBitsForBase()
        abits := s.allocBitsForIndex(0)
        for i := uintptr(0); i < s.nelems; i++ {
            if !mbits.isMarked() && (abits.index < s.freeindex || abits.isMarked()) {
                x := s.base() + i*s.elemsize
                if debug.allocfreetrace != 0 {
                    tracefree(unsafe.Pointer(x), size)
                }
                if raceenabled {
                    racefree(unsafe.Pointer(x), size)
                }
                if msanenabled {
                    msanfree(unsafe.Pointer(x), size)
                }
            }
            mbits.advance()
            abits.advance()
        }
    }
 
    // Count the objects freed
    // Count the number of free objects in this span.
    nalloc := uint16(s.countAlloc())
    if spc.sizeclass() == 0 && nalloc == 0 {
        // If the span's size class is 0 (large object) and its object is no longer alive, free the span to the heap
        s.needzero = 1
        freeToHeap = true
    }
    nfreed := s.allocCount - nalloc
    if nalloc > s.allocCount {
        print("runtime: nelems=", s.nelems, " nalloc=", nalloc, " previous allocCount=", s.allocCount, " nfreed=", nfreed, "\n")
        throw("sweep increased allocation count")
    }
 
    // Set the new allocCount
    s.allocCount = nalloc
 
    // Check whether the span has no unallocated elements (was full)
    wasempty := s.nextFreeIndex() == s.nelems
 
    // Reset freeindex; the next allocation searches from 0
    s.freeindex = 0 // reset allocation index to start of span.
    if trace.enabled {
        getg().m.p.ptr().traceReclaimed += uintptr(nfreed) * s.elemsize
    }
 
    // gcmarkBits becomes the new allocBits,
    // then a fresh all-zero gcmarkBits is allocated.
    // The next allocation can use allocBits to tell which elements are unallocated.
    // gcmarkBits becomes the allocBits.
    // get a fresh cleared gcmarkBits in preparation for next GC
    s.allocBits = s.gcmarkBits
    s.gcmarkBits = newMarkBits(s.nelems)
 
    // Update the allocCache starting from freeindex
    // Initialize alloc bits cache.
    s.refillAllocCache(0)
 
    // If the span is about to be freed to the heap, or no element was freed, update sweepgen to the latest value here;
    // otherwise the span is handed to mcentral or mheap below
    // We need to set s.sweepgen = h.sweepgen only when all blocks are swept,
    // because of the potential for a concurrent free/SetFinalizer.
    // But we need to set it before we make the span available for allocation
    // (return it to heap or mcentral), because allocation code assumes that a
    // span is already swept if available for allocation.
    if freeToHeap || nfreed == 0 {
        // The span must be in our exclusive ownership until we update sweepgen,
        // check for potential races.
        if s.state != mSpanInUse || s.sweepgen != sweepgen-1 {
            print("MSpan_Sweep: state=", s.state, " sweepgen=", s.sweepgen, " mheap.sweepgen=", sweepgen, "\n")
            throw("MSpan_Sweep: bad span state after sweep")
        }
        // Serialization point.
        // At this point the mark bits are cleared and allocation ready
        // to go so release the span.
        atomic.Store(&s.sweepgen, sweepgen)
    }
 
    if nfreed > 0 && spc.sizeclass() != 0 {
        // Hand the span back to mcentral; res indicates whether that succeeded
        c.local_nsmallfree[spc.sizeclass()] += uintptr(nfreed)
        res = mheap_.central[spc].mcentral.freeSpan(s, preserve, wasempty)
        // freeSpan updates sweepgen
        // MCentral_FreeSpan updates sweepgen
    } else if freeToHeap {
        // Free the span to mheap
        // Free large span to heap
 
        // NOTE(rsc,dvyukov): The original implementation of efence
        // in CL 22060046 used SysFree instead of SysFault, so that
        // the operating system would eventually give the memory
        // back to us again, so that an efence program could run
        // longer without running out of memory. Unfortunately,
        // calling SysFree here without any kind of adjustment of the
        // heap data structures means that when the memory does
        // come back to us, we have the wrong metadata for it, either in
        // the MSpan structures or in the garbage collection bitmap.
        // Using SysFault here means that the program will run out of
        // memory fairly quickly in efence mode, but at least it won't
        // have mysterious crashes due to confused memory reuse.
        // It should be possible to switch back to SysFree if we also
        // implement and then call some kind of MHeap_DeleteSpan.
        if debug.efence > 0 {
            s.limit = 0 // prevent mlookup from finding this span
            sysFault(unsafe.Pointer(s.base()), size)
        } else {
            mheap_.freeSpan(s, 1)
        }
        c.local_nlargefree++
        c.local_largefree += size
        res = true
    }
    
    // If the span was not handed to mcentral nor freed to mheap, it is still in use
    if !res {
        // Put the still-in-use span on the "swept" queue of sweepSpans
        // The span has been swept and is still in-use, so put
        // it on the swept in-use list.
        mheap_.sweepSpans[sweepgen/2%2].push(s)
    }
    return res
}
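After the sweep, allocBits is the previous gcmarkBits and freeindex is 0, so the next allocation walks allocBits looking for a clear bit. A simplified standalone sketch of that lookup (the real runtime also uses the allocCache fast path, which this ignores):

package main

import "fmt"

// nextFree returns the index of the first unallocated element at or
// after freeindex, or nelems if the span is full.
func nextFree(allocBits []byte, nelems, freeindex uintptr) uintptr {
    for i := freeindex; i < nelems; i++ {
        if allocBits[i/8]&(1<<(i%8)) == 0 {
            return i // unallocated element
        }
    }
    return nelems // span is full
}

func main() {
    // Suppose only elements 0 and 2 survived the last GC (bits 0 and 2 set).
    allocBits := []byte{0x05}
    fmt.Println(nextFree(allocBits, 8, 0)) // prints 1
}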

  

As bgsweep and the allocator code shown earlier make clear, the sweep phase is very lazy:
in practice a new GC cycle may need to start before the previous cycle's sweeping has finished,
which is why every GC cycle begins by completing the previous cycle's sweep work (the Sweep Termination phase).
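This can be observed from user code: runtime.GC() forces a blocking collection, and part of its time goes to finishing the previous cycle's sweep. A small demo using only the public API:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    var ms runtime.MemStats
    runtime.GC() // blocking GC; sweep termination runs before the new mark phase
    runtime.ReadMemStats(&ms)
    fmt.Printf("completed GC cycles: %d, total pause: %dns\n", ms.NumGC, ms.PauseTotalNs)
}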

That completes the analysis of the whole GC flow. Finally, here is the implementation of the write barrier function writebarrierptr:

// NOTE: Really dst *unsafe.Pointer, src unsafe.Pointer,
// but if we do that, Go inserts a write barrier on *dst = src.
//go:nosplit
func writebarrierptr(dst *uintptr, src uintptr) {
    if writeBarrier.cgo {
        cgoCheckWriteBarrier(dst, src)
    }
    if !writeBarrier.needed {
        *dst = src
        return
    }
    if src != 0 && src < minPhysPageSize {
        systemstack(func() {
            print("runtime: writebarrierptr *", dst, " = ", hex(src), "\n")
            throw("bad pointer in write barrier")
        })
    }
    // Mark the pointers (via writebarrierptr_prewrite1)
    writebarrierptr_prewrite1(dst, src)
    // Store the pointer to the destination
    *dst = src
}

  writebarrierptr_prewrite1 eventually calls gcmarkwb_m on the system stack; the gcmarkwb_m function is as follows:

func gcmarkwb_m(slot *uintptr, ptr uintptr) {
    if writeBarrier.needed {
        // Note: This turns bad pointer writes into bad
        // pointer reads, which could be confusing. We avoid
        // reading from obviously bad pointers, which should
        // take care of the vast majority of these. We could
        // patch this up in the signal handler, or use XCHG to
        // combine the read and the write. Checking inheap is
        // insufficient since we need to track changes to
        // roots outside the heap.
        //
        // Note: profbuf.go omits a barrier during signal handler
        // profile logging; that's safe only because this deletion barrier exists.
        // If we remove the deletion barrier, we'll have to work out
        // a new way to handle the profile logging.
        if slot1 := uintptr(unsafe.Pointer(slot)); slot1 >= minPhysPageSize {
            if optr := *slot; optr != 0 {
                // Shade the old pointer
                shade(optr)
            }
        }
        // TODO: Make this conditional on the caller's stack color.
        if ptr != 0 && inheap(ptr) {
            // Shade the new pointer
            shade(ptr)
        }
    }
}

  The shade function is as follows:

// Shade the object if it isn't already.
// The object is not nil and known to be in the heap.
// Preemption must be disabled.
//go:nowritebarrier
func shade(b uintptr) {
    if obj, hbits, span, objIndex := heapBitsForObject(b, 0, 0); obj != 0 {
        gcw := &getg().m.p.ptr().gcw
        // Mark the object alive and add it to the mark queue (the object turns grey)
        greyobject(obj, 0, 0, hbits, span, gcw, objIndex)
        // If local mark queues are disabled, flush to the global mark queue
        if gcphase == _GCmarktermination || gcBlackenPromptly {
            // Ps aren't allowed to cache work during mark
            // termination.
            gcw.dispose()
        }
    }
}
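gcmarkwb_m shades both the old and the new pointer value; this is the 1.9 form of the hybrid write barrier from the "eliminate STW stack re-scanning" proposal (linked in the reference list below), whose pseudocode is:

writePointer(slot, ptr):
    shade(*slot)
    if current stack is grey:
        shade(ptr)
    *slot = ptr

The "current stack is grey" condition corresponds to the TODO in gcmarkwb_m about making the ptr shade conditional on the caller's stack color; 1.9 simply shades ptr unconditionally.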

  

Reference links

https://github.com/golang/go
https://making.pusher.com/golangs-real-time-gc-in-theory-and-practice
https://github.com/golang/proposal/blob/master/design/17503-eliminate-rescan.md
https://golang.org/s/go15gcpacing
https://golang.org/ref/mem
https://talks.golang.org/2015/go-gc.pdf
https://docs.google.com/document/d/1ETuA2IOmnaQ4j81AtTGT40Y4_Jr6_IDASEKg0t0dBR8/edit#heading=h.x4kziklnb8fr
https://go-review.googlesource.com/c/go/+/21503
http://www.cnblogs.com/diegodu/p/5803202.html
http://legendtkl.com/2017/04/28/golang-gc
https://lengzzz.com/note/gc-in-golang

Comparing Golang's GC with CoreCLR's GC

Since I have previously analyzed CoreCLR's GC (see my two earlier articles on it), here is a brief comparison of the CoreCLR and Go GC implementations:

  • CoreCLR objects carry type information; Go objects do not, and instead the bitmap area records which words contain pointers
  • CoreCLR allocates objects noticeably faster; allocating in Go requires looking up a span and writing to the bitmap area
  • CoreCLR's collector has to do much more work than Go's
    • CoreCLR puts objects of all sizes in the same segment and can only scan it linearly
    • CoreCLR has to consult type information to find object references, while Go only consults the bitmap
    • CoreCLR's sweep has to mark freed objects one by one, while Go only swaps in allocBits
  • CoreCLR's pause times are longer than Go's
    • Although CoreCLR supports concurrent GC, it is not as thorough as Go's; Go does not even need a full stop to scan the root objects
  • CoreCLR supports generational GC
    • Although CoreCLR's full GC is less efficient than Go's, most of the time CoreCLR can scan only generation 0 and 1 objects
    • Because of generational GC, CoreCLR usually spends less CPU time on GC than Go does

CoreCLR's allocator and collector are generally more efficient than Go's, which means CoreCLR delivers higher throughput.
But CoreCLR's maximum pause time is not as short as Go's, because Go's entire GC design revolves around minimizing pause time.

Distributed computing and horizontal scaling are more and more popular these days.
Rather than chasing single-machine throughput, it is clearly wiser to pursue low latency and let distribution solve the throughput problem, and Go's design goals make it better suited than other languages for writing network services.
