Golang Memory Management

Date: 2019-11-29


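Original article: http://legendtkl.com/2017/04/02/golang-alloc/
Golang's memory management is based on tcmalloc, so it started from a strong foundation. Golang's implementation nevertheless adds many optimizations of its own. Below we walk through the source code to see how Golang's memory management is implemented. The analysis is based on go1.8rc3.
For background on tcmalloc, see the article "tcmalloc 介紹" (an introduction to tcmalloc); the original document is "TCMalloc : Thread-Caching Malloc".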

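1. Golang Memory Management

1.1 Background

First, a brief introduction to Golang's runtime scheduling. It has three basic concepts: G, M, and P.

G: a goroutine's execution context.
M: an operating system thread.
P: Processor. The key to scheduling; it can be thought of as roughly equivalent to a CPU.
Running a goroutine requires all three: G + P + M. That is enough for now; a later article covers the details.

1.2 Escape analysis

In languages with manual memory management, such as C/C++, variables obtained with malloc or new are allocated on the heap. Golang is different, even though the language also has new: the compiler runs escape analysis to decide where each variable should be allocated. Here is a simple example.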

package main

import ()

func foo() *int {
    var x int
    return &x
}

func bar() int {
    x := new(int)
    *x = 1
    return *x
}

func main() {}

Save the file above as escape.go and run the following command.

$ go run -gcflags '-m -l' escape.go
./escape.go:6: moved to heap: x
./escape.go:7: &x escape to heap
./escape.go:11: bar new(int) does not escape

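This output means that x in foo() ends up allocated on the heap, while x in bar() ends up on the stack. The FAQ on the official site (golang.org) has an entry about where variables are allocated: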

How do I know whether a variable is allocated on the heap or the stack?
From a correctness standpoint, you don’t need to know. Each variable in Go exists as long as there are references to it. The storage location chosen by the implementation is irrelevant to the semantics of the language.

The storage location does have an effect on writing efficient programs. When possible, the Go compilers will allocate variables that are local to a function in that function’s stack frame. However, if the compiler cannot prove that the variable is not referenced after the function returns, then the compiler must allocate the variable on the garbage-collected heap to avoid dangling pointer errors. Also, if a local variable is very large, it might make more sense to store it on the heap rather than the stack.

In the current compilers, if a variable has its address taken, that variable is a candidate for allocation on the heap. However, a basic escape analysis recognizes some cases when such variables will not live past the return from the function and can reside on the stack.

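In short: for correctness you do not need to know where a variable lives, because the storage location is chosen by the implementation and does not affect the semantics of the language. It does matter for performance. The compiler keeps a function's local variables in that function's stack frame when it can, and moves a variable to the garbage-collected heap when it cannot prove that the variable is unreferenced after the function returns, or when the variable is very large. Taking a variable's address makes it a candidate for heap allocation, but escape analysis still keeps it on the stack when it provably does not outlive the function.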
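
The effect of escape analysis can also be observed at run time rather than through compiler flags. The following is a minimal sketch, not part of the original article: it reuses foo() and bar() from the example above and counts heap allocations per call with testing.AllocsPerRun. The helper names allocsFoo and allocsBar and the variable sink are made up for this illustration, and the //go:noinline directives are an assumption added so the compiler does not inline the calls away.

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps the pointer returned by foo alive so the call has an effect.
var sink *int

//go:noinline
func foo() *int {
	var x int
	return &x // x outlives foo, so it has to live on the heap
}

//go:noinline
func bar() int {
	x := new(int) // never leaves bar, so it can stay on the stack
	*x = 1
	return *x
}

// allocsFoo and allocsBar report the average number of heap
// allocations made by one call of foo and bar respectively.
func allocsFoo() float64 {
	return testing.AllocsPerRun(100, func() { sink = foo() })
}

func allocsBar() float64 {
	return testing.AllocsPerRun(100, func() { _ = bar() })
}

func main() {
	fmt.Println("foo allocs/call:", allocsFoo())
	fmt.Println("bar allocs/call:", allocsBar())
}
```

With the standard toolchain one would expect foo to allocate on every call and bar not to allocate at all, which matches the -gcflags '-m -l' output shown earlier.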

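2. Key data structures

The key components are:

mcache: a per-P cache; think of it as the local cache.
mcentral: the global cache; mcache requests from mcentral when it runs short.
mheap: when mcentral also runs short, it requests memory from the operating system through mheap.
Together they can be viewed as a multi-level memory allocator.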
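
To make the three tiers concrete, here is a deliberately simplified toy model. It is a sketch under stated assumptions, not the runtime's code: the types span, heap, central, and cache and the function simulate are invented for this illustration, a span only counts its free objects, and there is a single size class. It shows the idea that the local cache allocates without a lock and falls back to the shared, locked tiers only when its current span is exhausted.

```go
package main

import (
	"fmt"
	"sync"
)

// span is a toy stand-in for mspan: it only counts free objects.
type span struct{ free int }

// heap is the last tier; grown counts how many spans it handed out.
type heap struct {
	mu    sync.Mutex
	grown int
}

func (h *heap) allocSpan(objects int) *span {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.grown++
	return &span{free: objects}
}

// central is shared by all caches, so every access takes its lock.
type central struct {
	mu       sync.Mutex
	nonempty []*span
	heap     *heap
	perSpan  int
}

func (c *central) cacheSpan() *span {
	c.mu.Lock()
	defer c.mu.Unlock()
	if n := len(c.nonempty); n > 0 {
		s := c.nonempty[n-1]
		c.nonempty = c.nonempty[:n-1]
		return s
	}
	// Nothing available: grow by asking the heap for a new span.
	return c.heap.allocSpan(c.perSpan)
}

// cache is owned by a single P, so its fast path needs no lock.
type cache struct {
	cur     *span
	central *central
	refills int
}

func (m *cache) alloc() {
	if m.cur == nil || m.cur.free == 0 {
		m.cur = m.central.cacheSpan()
		m.refills++
	}
	m.cur.free--
}

// simulate performs n allocations with perSpan objects per span and
// reports how often the cache was refilled and the heap had to grow.
func simulate(n, perSpan int) (refills, grown int) {
	h := &heap{}
	c := &central{heap: h, perSpan: perSpan}
	m := &cache{central: c}
	for i := 0; i < n; i++ {
		m.alloc()
	}
	return m.refills, h.grown
}

func main() {
	refills, grown := simulate(10, 4)
	fmt.Println("refills:", refills, "heap growths:", grown)
}
```

Only every fourth allocation in this run touches a lock; the rest are served from the span the cache already holds.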

2.1 mcache

Every goroutine runs bound to a P, and mcache is that P's cache. The benefit is that allocating from it needs no lock. The mcache structure is as follows.

// Per-thread (in Go, per-P) cache for small objects.
// No locking needed because it is per-thread (per-P).
type mcache struct {
    // The following members are accessed on every malloc,
    // so they are grouped here for better caching.
    next_sample int32   // trigger heap sample after allocating this many bytes
    local_scan  uintptr // bytes of scannable heap allocated

    // Tiny allocator: pointer-free objects smaller than 16 bytes are allocated through tiny.
    tiny             uintptr
    tinyoffset       uintptr
    local_tinyallocs uintptr // number of tiny allocs not counted in other stats

    // The rest is not accessed on every malloc.
    alloc [_NumSizeClasses]*mspan // spans to allocate from

    stackcache [_NumStackOrders]stackfreelist

    // Local allocator stats, flushed during GC.
    local_nlookup    uintptr                  // number of pointer lookups
    local_largefree  uintptr                  // bytes freed for large objects (>maxsmallsize)
    local_nlargefree uintptr                  // number of frees for large objects (>maxsmallsize)
    local_nsmallfree [_NumSizeClasses]uintptr // number of frees for small objects (<=maxsmallsize)
}

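For now we can focus on alloc [_NumSizeClasses]*mspan. It is an array of 67 pointers to mspan (_NumSizeClasses = 67), and each element serves blocks of one particular size. To allocate memory for an object, the allocator picks the matching element of alloc. The 67 block sizes are 0, 8 bytes, 16 bytes, and so on, which differs slightly from tcmalloc.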

//file: sizeclasses.go
var class_to_size = [_NumSizeClasses]uint16{0, 8, 16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 576, 640, 704, 768, 896, 1024, 1152, 1280, 1408, 1536, 1792, 2048, 2304, 2688, 3072, 3200, 3456, 4096, 4864, 5376, 6144, 6528, 6784, 6912, 8192, 9472, 9728, 10240, 10880, 12288, 13568, 14336, 16384, 18432, 19072, 20480, 21760, 24576, 27264, 28672, 32768}

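There is a small puzzle here. alloc resembles the free-list array of a memory pool, where each element would normally be a linked list of blocks of one size. Here, however, every element is an mspan, so the mspan itself must record the block size it serves. Let's look at the mspan structure.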

2.2 mspan

In tcmalloc, the span is the basic unit of memory management. Golang's mspan is defined as follows, with some parts omitted.

type mspan struct {
    next *mspan     // next span in list, or nil if none
    prev *mspan     // previous span in list, or nil if none
    list *mSpanList // For debugging. TODO: Remove.

    startAddr     uintptr   // address of first byte of span aka s.base()
    npages        uintptr   // number of pages in span
    stackfreelist gclinkptr // list of free stacks, avoids overloading freelist
    // freeindex is the slot index between 0 and nelems at which to begin scanning
    // for the next free object in this span.
    freeindex uintptr
    // TODO: Look up nelems from sizeclass and remove this field if it
    // helps performance.
    nelems uintptr // number of object in the span.
    ...
    // Bitmap used to find free objects; a 1 bit means the slot is free.
    allocCache uint64
    
    ...
    sizeclass   uint8      // size class
    ...
    elemsize    uintptr    // computed from sizeclass or from npages
    ...
}

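From this structure we can see:

next, prev: link fields, because mspans are usually kept in linked lists.
npages: the size of an mspan is a whole multiple of the page size.
sizeclass: a value from 0 to _NumSizeClasses - 1, which answers the question above. For example, with sizeclass = 3 the mspan is divided into 32-byte blocks.
elemsize: computed from sizeclass or npages. For example, sizeclass = 3 gives elemsize = 32 bytes. Allocations larger than 32 KB always get a whole number of pages, so elemsize = page_size * npages.
nelems: the total number of blocks in the span.
freeindex: a value from 0 to nelems, the slot index at which the search for the next free block begins.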

2.3 mcentral

As noted above, mcache requests from mcentral when it runs short. So let's look at mcentral next.

type mcentral struct {
    lock      mutex
    sizeclass int32
    nonempty  mSpanList // list of spans with a free object, ie a nonempty free list
    empty     mSpanList // list of spans with no free objects (or cached in an mcache)
}

type mSpanList struct {
    first *mspan
    last  *mspan
}

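Notes on mcentral:

sizeclass: mcentral also has a sizeclass member, so are there 67 mcentrals as well? Yes.
lock: needed because several Ps may compete for the same mcentral.
nonempty: a doubly linked list of mspans, the spans in this mcentral that still have free objects.
empty: spans with no free objects, or spans currently cached in an mcache; it can be seen as a way to keep track of the spans that are in use.
Where does mcentral live? Although mcentral and mheap were presented above as two separate parts, both are global structures and can be defined together. That is in fact the case: mcentral is embedded in mheap.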

2.4 mheap

The mheap structure in Golang is defined as follows.

type mheap struct {
    lock      mutex
    free      [_MaxMHeapList]mSpanList // free lists of given length
    freelarge mSpanList                // free lists length >= _MaxMHeapList
    busy      [_MaxMHeapList]mSpanList // busy lists of large objects of given length
    busylarge mSpanList                // busy lists of large objects length >= _MaxMHeapList
    sweepgen  uint32                   // sweep generation, see comment in mspan
    sweepdone uint32                   // all spans are swept

    // allspans is a slice of all mspans ever created. Each mspan
    // appears exactly once.
    //
    // The memory for allspans is manually managed and can be
    // reallocated and move as the heap grows.
    //
    // In general, allspans is protected by mheap_.lock, which
    // prevents concurrent access as well as freeing the backing
    // store. Accesses during STW might not hold the lock, but
    // must ensure that allocation cannot happen around the
    // access (since that may free the backing store).
    allspans []*mspan // all spans out there

    // spans is a lookup table to map virtual address page IDs to *mspan.
    // For allocated spans, their pages map to the span itself.
    // For free spans, only the lowest and highest pages map to the span itself.
    // Internal pages map to an arbitrary span.
    // For pages that have never been allocated, spans entries are nil.
    //
    // This is backed by a reserved region of the address space so
    // it can grow without moving. The memory up to len(spans) is
    // mapped. cap(spans) indicates the total reserved memory.
    spans []*mspan

    // sweepSpans contains two mspan stacks: one of swept in-use
    // spans, and one of unswept in-use spans. These two trade
    // roles on each GC cycle. Since the sweepgen increases by 2
    // on each cycle, this means the swept spans are in
    // sweepSpans[sweepgen/2%2] and the unswept spans are in
    // sweepSpans[1-sweepgen/2%2]. Sweeping pops spans from the
    // unswept stack and pushes spans that are still in-use on the
    // swept stack. Likewise, allocating an in-use span pushes it
    // on the swept stack.
    sweepSpans [2]gcSweepBuf

    _ uint32 // align uint64 fields on 32-bit for atomics

    // Proportional sweep
    pagesInUse        uint64  // pages of spans in stats _MSpanInUse; R/W with mheap.lock
    spanBytesAlloc    uint64  // bytes of spans allocated this cycle; updated atomically
    pagesSwept        uint64  // pages swept this cycle; updated atomically
    sweepPagesPerByte float64 // proportional sweep ratio; written with lock, read without
    // TODO(austin): pagesInUse should be a uintptr, but the 386
    // compiler can't 8-byte align fields.

    // Malloc stats.
    largefree  uint64                  // bytes freed for large objects (>maxsmallsize)
    nlargefree uint64                  // number of frees for large objects (>maxsmallsize)
    nsmallfree [_NumSizeClasses]uint64 // number of frees for small objects (<=maxsmallsize)

    // range of addresses we might see in the heap
    bitmap         uintptr // Points to one byte past the end of the bitmap
    bitmap_mapped  uintptr
    arena_start    uintptr
    arena_used     uintptr // always mHeap_Map{Bits,Spans} before updating
    arena_end      uintptr
    arena_reserved bool

    // central free lists for small size classes.
    // the padding makes sure that the MCentrals are
    // spaced CacheLineSize bytes apart, so that each MCentral.lock
    // gets its own cache line.
    central [_NumSizeClasses]struct {
        mcentral mcentral
        pad      [sys.CacheLineSize]byte
    }

    spanalloc             fixalloc // allocator for span*
    cachealloc            fixalloc // allocator for mcache*
    specialfinalizeralloc fixalloc // allocator for specialfinalizer*
    specialprofilealloc   fixalloc // allocator for specialprofile*
    speciallock           mutex    // lock for special record allocators.
}
var mheap_ mheap

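mheap_ is a global variable that is initialized during system startup (in the function mallocinit()). Let's go through the members of mheap.

allspans []*mspan: every span is obtained through mheap_, and every mspan ever created is recorded in allspans. The lock in the struct protects it against concurrent access. The comment mentions STW; that will be covered in a later article on Golang's GC.

central [_NumSizeClasses]…: these are the mcentrals introduced earlier, one per block size. pad is byte padding that avoids false sharing. See the Wikipedia article on false sharing for details.

sweepgen, sweepdone: GC-related. (Golang's GC strategy is mark and sweep; these fields track the sweep phase, which is not explored further here.)

free [_MaxMHeapList]mSpanList: an array of span lists in which the mspans of each list consist of 1 to 127 (_MaxMHeapList - 1) pages. For example, free[3] is the list of mspans that contain 3 pages. free means free list, that is, unallocated spans. There is a matching busy list.

freelarge mSpanList: a list of mspans that each contain more than 127 pages. There is a matching busylarge.

spans []*mspan: records the mapping from arena page numbers to mspans.

arena_start, arena_end, arena_used: explaining these requires explaining the arena first. The arena is the contiguous virtual address region from which Golang allocates memory. It is managed by mheap, and all memory allocated on the heap comes from it. How is memory marked as available? Operating systems commonly use one of two approaches: linking all available memory into a list, or using a bitmap to mark whether each block is available. Together with the spans member above, the memory layout looks like this.

+-----------------------+---------------------+-----------------------+
|         spans         |       bitmap        |         arena         |
+-----------------------+---------------------+-----------------------+

spanalloc, cachealloc fixalloc: fixalloc is a free-list allocator for blocks of one fixed size.

The remaining fields hold statistics and GC information; they are left for a dedicated article.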

3. Initialization

During system startup, the structures described above are initialized. Let's go straight to the initialization code, mallocinit().

func mallocinit() {
    // Some system checks are omitted here.
    var p, bitmapSize, spansSize, pSize, limit uintptr
    var reserved bool

    // limit = runtime.memlimit();
    // See https://golang.org/issue/5049
    // TODO(rsc): Fix after 1.1.
    limit = 0
  
    // sys.PtrSize == 8 means this is a 64-bit system.
    if sys.PtrSize == 8 && (limit == 0 || limit > 1<<30) {
        // arenaSize, bitmapSize, and spansSize are the sizes of the arena, bitmap, and spans regions described in the mheap section.
        arenaSize := round(_MaxMem, _PageSize)
        bitmapSize = arenaSize / (sys.PtrSize * 8 / 2)
        spansSize = arenaSize / _PageSize * sys.PtrSize
        spansSize = round(spansSize, _PageSize)
        // Try several different starting addresses.
        for i := 0; i <= 0x7f; i++ {
            switch {
            case GOARCH == "arm64" && GOOS == "darwin":
                p = uintptr(i)<<40 | uintptrMask&(0x0013<<28)
            case GOARCH == "arm64":
                p = uintptr(i)<<40 | uintptrMask&(0x0040<<32)
            default:
                p = uintptr(i)<<40 | uintptrMask&(0x00c0<<32)
            }
            pSize = bitmapSize + spansSize + arenaSize + _PageSize
            // Ask the OS to reserve pSize bytes of contiguous virtual address space.
            p = uintptr(sysReserve(unsafe.Pointer(p), pSize, &reserved))
            if p != 0 {
                break
            }
        }
    }
    // The 32-bit code path is omitted here.
    ...
    
    p1 := round(p, _PageSize)

    spansStart := p1
    mheap_.bitmap = p1 + spansSize + bitmapSize
    if sys.PtrSize == 4 {
        // Set arena_start such that we can accept memory
        // reservations located anywhere in the 4GB virtual space.
        mheap_.arena_start = 0
    } else {
        mheap_.arena_start = p1 + (spansSize + bitmapSize)
    }
    mheap_.arena_end = p + pSize
    mheap_.arena_used = p1 + (spansSize + bitmapSize)
    mheap_.arena_reserved = reserved

    if mheap_.arena_start&(_PageSize-1) != 0 {
        println("bad pagesize", hex(p), hex(p1), hex(spansSize), hex(bitmapSize), hex(_PageSize), "start", hex(mheap_.arena_start))
        throw("misrounded allocation in mallocinit")
    }

    // Initialize the rest of the allocator.
    mheap_.init(spansStart, spansSize)
    _g_ := getg()
    _g_.m.mcache = allocmcache()
}

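The code above carries brief comments; the following subsections explain parts of it in detail.

3.1 The arena

The sizes of the arena-related regions are initialized as follows.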

arenaSize := round(_MaxMem, _PageSize)
bitmapSize = arenaSize / (sys.PtrSize * 8 / 2)
spansSize = arenaSize / _PageSize * sys.PtrSize
spansSize = round(spansSize, _PageSize)

_MaxMem = uintptr(1<<_MHeapMap_TotalBits - 1)

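First, the constant _MaxMem (it is defined in terms of _MHeapMap_TotalBits, whose definition is not listed here). Put simply, _MaxMem is the size reserved for the arena region: 512 GB on 64-bit systems, except 64-bit Windows, where the arena is 32 GB. round is an alignment helper that rounds up to a multiple of _PageSize. Its implementation is neat: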

// round n up to a multiple of a.  a must be a power of 2.
func round(n, a uintptr) uintptr {
    return (n + a - 1) &^ (a - 1)
}

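The bitmap keeps two bits of metadata for each pointer-sized word in the arena, so for a 512 GB arena the bitmap is 512 GB / (8 * 8 / 2) = 16 GB. Readers of the Golang source will notice that the comment on this code gives the bitmap size as 32 GB. That comment had simply not been updated in a long time: earlier versions used 4 bits per word.

spans records the mapping from arena page numbers to mspan pointers. For an arena address x, the page number is page_num = (x - arena_start) / page_size, and spans[page_num] holds a pointer to the mspan that contains x. For a 512 GB arena, the spans region is 512 GB / 8 KB * 8 = 512 MB. Note that Golang's memory manager uses a page size of 8 KB.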

_PageSize = 1 << _PageShift

_PageShift = 13
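
The arithmetic above can be checked with a few lines of code. This is a sketch under the assumptions stated in the text (64-bit, non-Windows, so _MaxMem = 1<<39 - 1); the names regionSizes and pageIndex and the sample arena start address 0xc000000000 are hypothetical and chosen only for this illustration.

```go
package main

import "fmt"

const (
	ptrSize   = 8         // bytes per pointer on a 64-bit system
	pageShift = 13        // same value as _PageShift
	pageSize  = 1 << pageShift
	maxMem    = 1<<39 - 1 // _MaxMem on 64-bit non-Windows systems
)

// round rounds n up to a multiple of a; a must be a power of 2.
func round(n, a uint64) uint64 { return (n + a - 1) &^ (a - 1) }

// regionSizes repeats the size computation from mallocinit.
func regionSizes() (arena, bitmap, spans uint64) {
	arena = round(maxMem, pageSize)
	bitmap = arena / (ptrSize * 8 / 2)              // 2 bits per word
	spans = round(arena/pageSize*ptrSize, pageSize) // one pointer per page
	return arena, bitmap, spans
}

// pageIndex returns the index into spans for the arena address x.
func pageIndex(x, arenaStart uint64) uint64 {
	return (x - arenaStart) >> pageShift
}

func main() {
	arena, bitmap, spans := regionSizes()
	fmt.Println(arena>>30, "GB arena,", bitmap>>30, "GB bitmap,", spans>>20, "MB spans")
	// An address 16 KB past the (hypothetical) arena start is in page 2.
	fmt.Println(pageIndex(0xc000004000, 0xc000000000))
}
```

The round helper works because a is a power of two: adding a - 1 pushes n past the next boundary unless it is already aligned, and clearing the low bits with &^ (a - 1) snaps the result back onto that boundary.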

So the layout of this contiguous virtual address range (64-bit) is:

+-----------------------+---------------------+-----------------------+
|      spans 512M       |     bitmap 16G      |      arena 512G       |
+-----------------------+---------------------+-----------------------+

3.2 Reserving virtual address space

The main code is the following.

// Try several different starting addresses.

for i := 0; i <= 0x7f; i++ {
    switch {
    case GOARCH == "arm64" && GOOS == "darwin":
        p = uintptr(i)<<40 | uintptrMask&(0x0013<<28)
    case GOARCH == "arm64":
        p = uintptr(i)<<40 | uintptrMask&(0x0040<<32)
    default:
        p = uintptr(i)<<40 | uintptrMask&(0x00c0<<32)
    }
    pSize = bitmapSize + spansSize + arenaSize + _PageSize
    // Ask the OS to reserve pSize bytes of contiguous virtual address space.
    p = uintptr(sysReserve(unsafe.Pointer(p), pSize, &reserved))
    if p != 0 {
        break
    }
}

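During initialization, Golang asks the operating system for one contiguous range of address space: the spans + bitmap + arena layout above. p is the start address of that range, and its candidate values differ by platform. The request uses a different system call on each OS: Unix-like systems call mmap (which asks the kernel for a new virtual address range, with an optional start address and a length), and Windows calls VirtualAlloc (similar to mmap).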
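
To see which start addresses the loop actually tries, the expression from the default branch can be evaluated on its own. This is only an illustrative sketch: the function name hint is made up, and uintptrMask is left out on the assumption of a 64-bit system, where it does not change the value.

```go
package main

import "fmt"

// hint returns the i-th candidate start address for the default
// (non-arm64) case: uintptr(i)<<40 | 0x00c0<<32.
func hint(i uint64) uint64 { return i<<40 | 0x00c0<<32 }

func main() {
	for i := uint64(0); i < 3; i++ {
		fmt.Printf("%#x\n", hint(i))
	}
}
```

The candidates run from 0xc000000000 up to 0x7fc000000000, one per iteration, and the loop stops at the first address the OS accepts.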

//bsd
func sysReserve(v unsafe.Pointer, n uintptr, reserved *bool) unsafe.Pointer {
    if sys.PtrSize == 8 && uint64(n) > 1<<32 || sys.GoosNacl != 0 {
        *reserved = false
        return v
    }

    p := mmap(v, n, _PROT_NONE, _MAP_ANON|_MAP_PRIVATE, -1, 0)
    if uintptr(p) < 4096 {
        return nil
    }
    *reserved = true
    return p
}

//darwin
func sysReserve(v unsafe.Pointer, n uintptr, reserved *bool) unsafe.Pointer {
    *reserved = true
    p := mmap(v, n, _PROT_NONE, _MAP_ANON|_MAP_PRIVATE, -1, 0)
    if uintptr(p) < 4096 {
        return nil
    }
    return p
}

//linux
func sysReserve(v unsafe.Pointer, n uintptr, reserved *bool) unsafe.Pointer {
    ...
    p := mmap(v, n, _PROT_NONE, _MAP_ANON|_MAP_PRIVATE, -1, 0)
    if uintptr(p) < 4096 {
        return nil
    }
    *reserved = true
    return p
}
//windows
func sysReserve(v unsafe.Pointer, n uintptr, reserved *bool) unsafe.Pointer {
    *reserved = true
    // v is just a hint.
    // First try at v.
    v = unsafe.Pointer(stdcall4(_VirtualAlloc, uintptr(v), n, _MEM_RESERVE, _PAGE_READWRITE))
    if v != nil {
        return v
    }

    // Next let the kernel choose the address.
    return unsafe.Pointer(stdcall4(_VirtualAlloc, 0, n, _MEM_RESERVE, _PAGE_READWRITE))
}

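3.3 mheap initialization

As the description of the mheap structure showed, spans, bitmap, and arena are all tracked by mheap, so once the address space has been reserved, the next step is to initialize mheap.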

func mallocinit() {
    ...
    p1 := round(p, _PageSize)

    spansStart := p1
    mheap_.bitmap = p1 + spansSize + bitmapSize
    if sys.PtrSize == 4 {
        // Set arena_start such that we can accept memory
        // reservations located anywhere in the 4GB virtual space.
        mheap_.arena_start = 0
    } else {
        mheap_.arena_start = p1 + (spansSize + bitmapSize)
    }
    mheap_.arena_end = p + pSize
    mheap_.arena_used = p1 + (spansSize + bitmapSize)
    mheap_.arena_reserved = reserved

    if mheap_.arena_start&(_PageSize-1) != 0 {
        println("bad pagesize", hex(p), hex(p1), hex(spansSize), hex(bitmapSize), hex(_PageSize), "start", hex(mheap_.arena_start))
        throw("misrounded allocation in mallocinit")
    }

    // Initialize the rest of the allocator.
    mheap_.init(spansStart, spansSize)
    // Get the current G.
    _g_ := getg()
    // Allocate an mcache for the M that the current G is running on.
    _g_.m.mcache = allocmcache()
}

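p is the start of the contiguous virtual address range. It is aligned first, and then the spans, bitmap, and arena addresses are set from it. mheap_.init() initializes fixalloc and related members, as well as every mcentral.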

func (h *mheap) init(spansStart, spansBytes uintptr) {
    h.spanalloc.init(unsafe.Sizeof(mspan{}), recordspan, unsafe.Pointer(h), &memstats.mspan_sys)
    h.cachealloc.init(unsafe.Sizeof(mcache{}), nil, nil, &memstats.mcache_sys)
    h.specialfinalizeralloc.init(unsafe.Sizeof(specialfinalizer{}), nil, nil, &memstats.other_sys)
    h.specialprofilealloc.init(unsafe.Sizeof(specialprofile{}), nil, nil, &memstats.other_sys)

    h.spanalloc.zero = false

    // h->mapcache needs no init
    for i := range h.free {
        h.free[i].init()
        h.busy[i].init()
    }

    h.freelarge.init()
    h.busylarge.init()
    for i := range h.central {
        h.central[i].mcentral.init(int32(i))
    }

    sp := (*slice)(unsafe.Pointer(&h.spans))
    sp.array = unsafe.Pointer(spansStart)
    sp.len = 0
    sp.cap = int(spansBytes / sys.PtrSize)
}

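After mheap is initialized, the current thread, that is, the current M, is initialized.

// Get the current G.
g := getg()
// Allocate an mcache for the M that the current G is running on.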
g.m.mcache = allocmcache()

3.4 per-P mcache initialization

Nothing so far has initialized an mcache for each P, because no P exists yet at this point. Let's look at the bootstrap code.

func schedinit() {
    ...
    mallocinit()
    ...
    
    if procs > _MaxGomaxprocs {
        procs = _MaxGomaxprocs
    }
    if procresize(procs) != nil {
        ...
    }
}

mallocinit() was covered above. The Ps are initialized in procresize(); below we look only at the memory-related part.

func procresize(nprocs int32) *p {
    ...
    // initialize new P's
    for i := int32(0); i < nprocs; i++ {
        pp := allp[i]
        if pp == nil {
            pp = new(p)
            pp.id = i
            pp.status = _Pgcstop
            pp.sudogcache = pp.sudogbuf[:0]
            for i := range pp.deferpool {
                pp.deferpool[i] = pp.deferpoolbuf[i][:0]
            }
            atomicstorep(unsafe.Pointer(&allp[i]), unsafe.Pointer(pp))
        }
        // Initialize the P's mcache.
        if pp.mcache == nil {
            if old == 0 && i == 0 {
                if getg().m.mcache == nil {
                    throw("missing mcache?")
                }
                // P[0] takes over the mcache that mallocinit() allocated for the bootstrap M.
                pp.mcache = getg().m.mcache // bootstrap
            } else {
                // Every other P allocates its own mcache.
                pp.mcache = allocmcache()
            }
        }
        ...
    }
    ...
}

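All Ps are stored in the global array allp. procresize() initializes the Ps in allp that will be used and strips the resources from any surplus Ps.

4. Memory allocation

The main flow for allocating memory for an object is:

object size > 32 KB: allocated directly from mheap.
object size < 16 bytes (and the object contains no pointers): allocated by the tiny allocator in mcache. (tiny is really just a pointer, but let's put it this way for now.)
16 bytes <= object size <= 32 KB: allocated from the span of the matching size class in mcache.
If the span for that size class in mcache has no free block, mcache requests one from mcentral.
If mcentral has no free block either, it requests a span from mheap and splits it up.
If mheap has no suitable span, it requests memory from the operating system.
Now let's look at the functions that allocate memory on the heap, that is, in the arena region.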

package main

func foo() *int {
    x := 1
    return &x
}

func main() {
    x := foo()
    println(*x)
}

By the escape analysis described earlier, x in foo() is allocated on the heap. Save the code above as test1.go and look at the assembly.

$ go build -gcflags '-l' -o test1 test1.go
$ go tool objdump -s "main\.foo" test1
TEXT main.foo(SB) /Users/didi/code/go/malloc_example/test2.go
    test2.go:3  0x2040  65488b0c25a0080000  GS MOVQ GS:0x8a0, CX
    test2.go:3  0x2049  483b6110        CMPQ 0x10(CX), SP
    test2.go:3  0x204d  762a            JBE 0x2079
    test2.go:3  0x204f  4883ec10        SUBQ $0x10, SP
    test2.go:4  0x2053  488d1d66460500      LEAQ 0x54666(IP), BX
    test2.go:4  0x205a  48891c24        MOVQ BX, 0(SP)
    test2.go:4  0x205e  e82d8f0000      CALL runtime.newobject(SB)
    test2.go:4  0x2063  488b442408      MOVQ 0x8(SP), AX
    test2.go:4  0x2068  48c70001000000      MOVQ $0x1, 0(AX)
    test2.go:5  0x206f  4889442418      MOVQ AX, 0x18(SP)
    test2.go:5  0x2074  4883c410        ADDQ $0x10, SP
    test2.go:5  0x2078  c3          RET
    test2.go:3  0x2079  e8a28d0400      CALL runtime.morestack_noctxt(SB)
    test2.go:3  0x207e  ebc0            JMP main.foo(SB)

Heap allocation calls the newobject function in the runtime package.

func newobject(typ *_type) unsafe.Pointer {
    return mallocgc(typ.size, typ, true)
}

func mallocgc(size uintptr, typ *_type, needzero bool) unsafe.Pointer {
    ... 
    c := gomcache()
    var x unsafe.Pointer
    noscan := typ == nil || typ.kind&kindNoPointers != 0
    if size <= maxSmallSize {
        // object size <= 32K
        if noscan && size < maxTinySize {
            // Tiny allocation: a pointer-free object smaller than 16 bytes.
            off := c.tinyoffset
            // Align tiny pointer for required (conservative) alignment.
            if size&7 == 0 {
                off = round(off, 8)
            } else if size&3 == 0 {
                off = round(off, 4)
            } else if size&1 == 0 {
                off = round(off, 2)
            }
            if off+size <= maxTinySize && c.tiny != 0 {
                // The object fits into existing tiny block.
                x = unsafe.Pointer(c.tiny + off)
                c.tinyoffset = off + size
                c.local_tinyallocs++
                mp.mallocing = 0
                releasem(mp)
                return x
            }
            // Allocate a new maxTinySize block.
            span := c.alloc[tinySizeClass]
            v := nextFreeFast(span)
            if v == 0 {
                v, _, shouldhelpgc = c.nextFree(tinySizeClass)
            }
            x = unsafe.Pointer(v)
            (*[2]uint64)(x)[0] = 0
            (*[2]uint64)(x)[1] = 0
            // See if we need to replace the existing tiny block with the new one
            // based on amount of remaining free space.
            if size < c.tinyoffset || c.tiny == 0 {
                c.tiny = uintptr(x)
                c.tinyoffset = size
            }
            size = maxTinySize
        } else {
            // object size >= 16 byte  && object size <= 32K byte
            var sizeclass uint8
            if size <= smallSizeMax-8 {
                sizeclass = size_to_class8[(size+smallSizeDiv-1)/smallSizeDiv]
            } else {
                sizeclass = size_to_class128[(size-smallSizeMax+largeSizeDiv-1)/largeSizeDiv]
            }
            size = uintptr(class_to_size[sizeclass])
            span := c.alloc[sizeclass]
            v := nextFreeFast(span)
            if v == 0 {
                v, span, shouldhelpgc = c.nextFree(sizeclass)
            }
            x = unsafe.Pointer(v)
            if needzero && span.needzero != 0 {
                memclrNoHeapPointers(unsafe.Pointer(v), size)
            }
        }
    } else {
        //object size > 32K byte
        var s *mspan
        shouldhelpgc = true
        systemstack(func() {
            s = largeAlloc(size, needzero)
        })
        s.freeindex = 1
        s.allocCount = 1
        x = unsafe.Pointer(s.base())
        size = s.elemsize
    }
}

The allocation path splits into three cases by object size: size < 16 bytes, 16 bytes <= size <= 32 KB, and size > 32 KB.
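
The branch structure of mallocgc can be condensed into a small decision function. This is a sketch for illustration only; allocPath is a made-up name, and the constants mirror maxTinySize (16) and maxSmallSize (32 KB) from the code above. Note that, as in mallocgc, the tiny path also requires the object to be pointer-free (noscan).

```go
package main

import "fmt"

const (
	maxTinySize  = 16
	maxSmallSize = 32 << 10
)

// allocPath names the branch of mallocgc that would serve an
// allocation of the given size; noscan means "contains no pointers".
func allocPath(size uintptr, noscan bool) string {
	switch {
	case size > maxSmallSize:
		return "large" // straight from mheap
	case noscan && size < maxTinySize:
		return "tiny" // packed into the 16-byte tiny block
	default:
		return "small" // from the size-class span in mcache
	}
}

func main() {
	fmt.Println(allocPath(8, true), allocPath(8, false), allocPath(1024, true), allocPath(64<<10, true))
}
```

An 8-byte object that contains a pointer therefore takes the small path even though it is below 16 bytes.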

4.1 Size below 16 bytes

For blocks smaller than 16 bytes, mcache has a dedicated region, tiny, to allocate from; tiny is a pointer to the start of that region.

func mallocgc(...) {
    ...
            off := c.tinyoffset
            // Align the offset.
            if size&7 == 0 {
                off = round(off, 8)
            } else if size&3 == 0 {
                off = round(off, 4)
            } else if size&1 == 0 {
                off = round(off, 2)
            }
            // Allocate from the current tiny block.
            if off+size <= maxTinySize && c.tiny != 0 {
                // The object fits into existing tiny block.
                x = unsafe.Pointer(c.tiny + off)
                c.tinyoffset = off + size
                c.local_tinyallocs++
                mp.mallocing = 0
                releasem(mp)
                return x
            }
            // The tiny block has run out; get a new 16-byte block for it.
            span := c.alloc[tinySizeClass]
            v := nextFreeFast(span)
            if v == 0 {
                v, _, shouldhelpgc = c.nextFree(tinySizeClass)
            }
            x = unsafe.Pointer(v)
            // Zero the whole new block.
            (*[2]uint64)(x)[0] = 0
            (*[2]uint64)(x)[1] = 0
            // See if we need to replace the existing tiny block with the new one
            // based on amount of remaining free space.
            // If the new block has more free space left than the old one, it becomes the tiny block; tinyoffset records how much of it is used.
            if size < c.tinyoffset || c.tiny == 0 {
                c.tiny = uintptr(x)
                c.tinyoffset = size
            }
            size = maxTinySize
}

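As shown above, tinyoffset records how far into the tiny block allocation has progressed, and later allocations are addressed from it. The offset is first aligned according to the size of the object; for example, if size is a multiple of 8, tinyoffset is aligned to 8. Then the allocation is made. If the space left in tiny is not enough, a new 16-byte block is obtained and the object is placed in it. If the new block has more space left over than the old one, it is recorded as the new tiny block.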

4.2 Size above 32 KB

Allocations larger than 32 KB skip mcache and mcentral and go straight to mheap.

func mallocgc(...) {
    } else {
        var s *mspan
        shouldhelpgc = true
        systemstack(func() {
            s = largeAlloc(size, needzero)
        })
        s.freeindex = 1
        s.allocCount = 1
        x = unsafe.Pointer(s.base())
        size = s.elemsize
    }
    ...
}

func largeAlloc(size uintptr, needzero bool) *mspan {
    ...
    npages := size >> _PageShift
    if size&_PageMask != 0 {
        npages++
    }
    ...
    s := mheap_.alloc(npages, 0, true, needzero)
    if s == nil {
        throw("out of memory")
    }
    s.limit = s.base() + size
    heapBitsForSpan(s.base()).initSpan(s)
    return s
}

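Allocations larger than 32 KB always receive a whole number of pages: the size is shifted right by _PageShift, and one more page is added if any of the low bits (size & _PageMask) are set.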
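
As a quick worked example of that page computation, here is the same logic as a standalone function. The name pagesFor is hypothetical; the constants correspond to _PageShift = 13 and _PageMask = _PageSize - 1.

```go
package main

import "fmt"

const (
	pageShift = 13
	pageSize  = 1 << pageShift
	pageMask  = pageSize - 1
)

// pagesFor returns how many 8 KB pages a large allocation needs.
func pagesFor(size uintptr) uintptr {
	npages := size >> pageShift // whole pages
	if size&pageMask != 0 {     // a partial page remains
		npages++
	}
	return npages
}

func main() {
	fmt.Println(pagesFor(32<<10+1), pagesFor(40<<10), pagesFor(40<<10+1))
}
```

A request of 32 KB + 1 byte therefore occupies five pages (40 KB), which is why elemsize for large objects is a multiple of the page size.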

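4.3 Size between 16 bytes and 32 KB

For sizes from 16 bytes to 32 KB, the allocator first computes the sizeclass and then allocates from alloc[sizeclass] in mcache. If mcache.alloc[sizeclass] cannot satisfy the request, mcache obtains a span from mcentral and then allocates from it. When mcentral has no span to hand out, it grows by requesting one from mheap.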

func mallocgc(...) {
        ...
        } else {
            var sizeclass uint8
            
            // Compute the sizeclass.
            if size <= smallSizeMax-8 {
                sizeclass = size_to_class8[(size+smallSizeDiv-1)/smallSizeDiv]
            } else {
                sizeclass = size_to_class128[(size-smallSizeMax+largeSizeDiv-1)/largeSizeDiv]
            }
            size = uintptr(class_to_size[sizeclass])
            span := c.alloc[sizeclass]
            // Allocate one object from the corresponding span.
            v := nextFreeFast(span)
            if v == 0 {
                v, span, shouldhelpgc = c.nextFree(sizeclass)
            }
            x = unsafe.Pointer(v)
            if needzero && span.needzero != 0 {
                memclrNoHeapPointers(unsafe.Pointer(v), size)
            }
        }
}

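First, how is sizeclass computed? Two lookup tables are predefined: size_to_class8 and size_to_class128. Entry i of size_to_class8 is the sizeclass for sizes in the range ((i-1)*8, i*8] (smallSizeDiv = 8); size_to_class128 works the same way with a step of 128. Sizes up to 1024 - 8 = 1016 (smallSizeMax = 1024) use size_to_class8; larger sizes use size_to_class128. The table begins 0, 1, 2, 3, 3, 4, 4, … For example, to allocate 17 bytes (pointer-free objects under 16 bytes go through mcache.tiny instead), sizeclass = size_to_class8[(17+7)/8] = size_to_class8[3] = 3. Trading space for time this way clearly speeds up the lookup.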
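
The lookup table itself can be derived from class_to_size. The following sketch rebuilds the small table at start-up and then performs the same constant-time lookup; it is an illustration rather than the runtime's code, where the tables are generated ahead of time. The names classToSize, sizeToClass8, buildTable, and sizeClass are chosen for this example, and only the first 32 size classes (up to 1024 bytes) are included.

```go
package main

import "fmt"

const (
	smallSizeDiv = 8
	smallSizeMax = 1024
)

// classToSize holds the first 32 entries of class_to_size (sizes up to 1024).
var classToSize = []uint16{0, 8, 16, 32, 48, 64, 80, 96, 112, 128, 144, 160,
	176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512,
	576, 640, 704, 768, 896, 1024}

// sizeToClass8[i] is the smallest size class whose block size is >= i*8.
var sizeToClass8 = buildTable()

func buildTable() []uint8 {
	table := make([]uint8, smallSizeMax/smallSizeDiv+1)
	class := 0
	for i := range table {
		// Advance to the first class large enough for i*8 bytes.
		for int(classToSize[class]) < i*smallSizeDiv {
			class++
		}
		table[i] = uint8(class)
	}
	return table
}

// sizeClass maps a size of at most smallSizeMax-8 bytes to its class
// with one division and one table read.
func sizeClass(size uintptr) uint8 {
	return sizeToClass8[(size+smallSizeDiv-1)/smallSizeDiv]
}

func main() {
	fmt.Println(sizeToClass8[:7])
	c := sizeClass(17)
	fmt.Println("17 bytes -> class", c, "block size", classToSize[c])
}
```

A 17-byte request lands in class 3 and receives a 32-byte block, so up to 15 bytes of it go unused; that internal fragmentation is the price of having only a fixed set of block sizes.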

With the sizeclass in hand, the allocation can be made from mcache.alloc[sizeclass]. Note that this is an mspan pointer; the function that actually allocates is nextFreeFast(), shown below.

// nextFreeFast returns the next free object if one is quickly available.
// Otherwise it returns 0.
func nextFreeFast(s *mspan) gclinkptr {
    theBit := sys.Ctz64(s.allocCache) // Is there a free object in the allocCache?
    if theBit < 64 {
        result := s.freeindex + uintptr(theBit)
        if result < s.nelems {
            freeidx := result + 1
            if freeidx%64 == 0 && freeidx != s.nelems {
                return 0
            }
            s.allocCache >>= (theBit + 1)
            s.freeindex = freeidx
            v := gclinkptr(result*s.elemsize + s.base())
            s.allocCount++
            return v
        }
    }
    return 0
}

allocCache is a bitmap that records which slots are free; a 1 bit means free. The address is then computed from the span's freeindex and elemsize.
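
The bitmap trick can be tried outside the runtime with math/bits. This is a simplified sketch, assuming a span of at most 64 objects so that the 64-bit cache never has to be refilled (the real nextFreeFast returns 0 at that boundary and lets nextFree do the refill); the type toySpan and its field names are illustrative stand-ins for mspan.

```go
package main

import (
	"fmt"
	"math/bits"
)

// toySpan is a cut-down mspan holding at most 64 objects.
type toySpan struct {
	base       uintptr // address of the first object
	elemsize   uintptr // size of each object
	nelems     uintptr // number of objects in the span
	freeindex  uintptr // slot that bit 0 of allocCache refers to
	allocCache uint64  // bit i set means slot freeindex+i is free
}

// nextFreeFast returns the address of the next free object, or 0 if none is left.
func (s *toySpan) nextFreeFast() uintptr {
	theBit := bits.TrailingZeros64(s.allocCache) // distance to the next free slot
	if theBit < 64 {
		result := s.freeindex + uintptr(theBit)
		if result < s.nelems {
			s.allocCache >>= uint(theBit + 1) // drop the slots just passed
			s.freeindex = result + 1
			return s.base + result*s.elemsize
		}
	}
	return 0
}

func main() {
	// Eight 32-byte slots; slot 1 is already in use (its bit is cleared).
	s := &toySpan{base: 0x1000, elemsize: 32, nelems: 8, allocCache: ^uint64(0) &^ 0b10}
	for {
		p := s.nextFreeFast()
		if p == 0 {
			break
		}
		fmt.Printf("%#x\n", p)
	}
}
```

Counting trailing zeros finds the next free slot in a single step, so the fast path needs neither a loop over slots nor a linked free list.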

If mcache.alloc[sizeclass] has run out, mcache obtains more memory from mcentral.

// nextFree returns the next free object from the cached span if one is available.
// Otherwise it refills the cache with a span with an available object and
// returns that object along with a flag indicating that this was a heavy
// weight allocation. If it is a heavy weight allocation the caller must
// determine whether a new GC cycle needs to be started or if the GC is active
// whether this goroutine needs to assist the GC.
func (c *mcache) nextFree(sizeclass uint8) (v gclinkptr, s *mspan, shouldhelpgc bool) {
    s = c.alloc[sizeclass]
    shouldhelpgc = false
    freeIndex := s.nextFreeIndex()
    if freeIndex == s.nelems {
        // The span is full.
        if uintptr(s.allocCount) != s.nelems {
            println("runtime: s.allocCount=", s.allocCount, "s.nelems=", s.nelems)
            throw("s.allocCount != s.nelems && freeIndex == s.nelems")
        }
        systemstack(func() {
            // mcache requests a new span from mcentral here.
            c.refill(int32(sizeclass))
        })
        shouldhelpgc = true
        s = c.alloc[sizeclass]
        // After the refill, allocate from mcache again.
        freeIndex = s.nextFreeIndex()
    }

    ...
}

// nextFreeIndex returns the index of the next free object in s at
// or after s.freeindex.
// There are hardware instructions that can be used to make this
// faster if profiling warrants it.
// This function is somewhat redundant with nextFreeFast.
func (s *mspan) nextFreeIndex() uintptr {
    ...
}

mcache requests a span from mcentral; if mcentral has none, it requests one from mheap.

func (c *mcache) refill(sizeclass int32) *mspan {
    ...
    // Request a span from mcentral.
    s = mheap_.central[sizeclass].mcentral.cacheSpan()
    ...
    return s
}

// Allocate a span to use in an MCache.
func (c *mcentral) cacheSpan() *mspan {
    ...
    // Replenish central list if empty.
    s = c.grow()
}

func (c *mcentral) grow() *mspan {
    npages := uintptr(class_to_allocnpages[c.sizeclass])
    size := uintptr(class_to_size[c.sizeclass])
    n := (npages << _PageShift) / size
    
    // Request a span from mheap here.
    s := mheap_.alloc(npages, c.sizeclass, false, true)
    ...
    return s
}

If mheap does not have enough either, it requests memory from the OS. Continuing from the code above, here is mheap_.alloc().

func (h *mheap) alloc(npage uintptr, sizeclass int32, large bool, needzero bool) *mspan {
    ...
    var s *mspan
    systemstack(func() {
        s = h.alloc_m(npage, sizeclass, large)
    })
    ...
}

func (h *mheap) alloc_m(npage uintptr, sizeclass int32, large bool) *mspan {
    ... 
    s := h.allocSpanLocked(npage)
    ...
}

func (h *mheap) allocSpanLocked(npage uintptr) *mspan {
    ...
    s = h.allocLarge(npage)
    if s == nil {
        if !h.grow(npage) {
            return nil
        }
        s = h.allocLarge(npage)
        if s == nil {
            return nil
        }
    }
    ...
}

func (h *mheap) grow(npage uintptr) bool {
    // Ask for a big chunk, to reduce the number of mappings
    // the operating system needs to track; also amortizes
    // the overhead of an operating system mapping.
    // Allocate a multiple of 64kB.
    npage = round(npage, (64<<10)/_PageSize)
    ask := npage << _PageShift
    if ask < _HeapAllocChunk {
        ask = _HeapAllocChunk
    }

    v := h.sysAlloc(ask)
    ...
}

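That is the whole call chain. At the end, sysAlloc makes a system call (mmap or VirtualAlloc, much as during initialization) to request memory from the operating system.

5. Memory reclamation

This section covers only the simple parts of memory reclamation; garbage collection will get a detailed article of its own.

5.1 Reclaiming an mcache

Reclaiming an mcache has two parts. The first returns the unused memory in alloc to the corresponding mcentral.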

func freemcache(c *mcache) {
    systemstack(func() {
        c.releaseAll()
        ...

        lock(&mheap_.lock)
        purgecachedstats(c)
        mheap_.cachealloc.free(unsafe.Pointer(c))
        unlock(&mheap_.lock)
    })
}

func (c *mcache) releaseAll() {
    for i := 0; i < _NumSizeClasses; i++ {
        s := c.alloc[i]
        if s != &emptymspan {
            mheap_.central[i].mcentral.uncacheSpan(s)
            c.alloc[i] = &emptymspan
        }
    }
    // Clear tinyalloc pool.
    c.tiny = 0
    c.tinyoffset = 0
}

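releaseAll() returns the mspan of each sizeclass in mcache.alloc to its mcentral. Returning a span to mcentral requires taking a lock, because mcentral is global. The second part returns what is left of the mcache (essentially an empty shell) to mheap.cachealloc, which simply pushes the mcache onto the head of a free list.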

func (f *fixalloc) free(p unsafe.Pointer) {
    f.inuse -= f.size
    v := (*mlink)(p)
    v.next = f.list
    f.list = v
}
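
The push-to-head behaviour of fixalloc.free can be imitated with an ordinary linked list. The sketch below is a toy under stated assumptions: the types block and freeList are invented here, and the link lives in a separate next field, whereas the real fixalloc reuses the first word of the freed memory itself (the mlink cast above) so that free blocks cost no extra space.

```go
package main

import "fmt"

// block is a fixed-size chunk; while it is free it is linked through next.
type block struct {
	next *block
	data [56]byte
}

// freeList hands out blocks and recycles freed ones in LIFO order.
type freeList struct {
	head *block
}

// alloc pops the most recently freed block, or creates a new one.
func (f *freeList) alloc() *block {
	if b := f.head; b != nil {
		f.head = b.next
		b.next = nil
		return b
	}
	return new(block)
}

// free pushes b onto the head of the list, like fixalloc.free.
func (f *freeList) free(b *block) {
	b.next = f.head
	f.head = b
}

func main() {
	f := &freeList{}
	a := f.alloc()
	f.free(a)
	fmt.Println("block reused:", f.alloc() == a)
}
```

Because both free and alloc touch only the head of the list, each is a constant-time operation, and the block freed most recently is the first one to be reused.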

5.2 Reclaiming from mcentral

When an mspan no longer holds any allocated object (allocCount == 0), mcentral returns the mspan to mheap.

func (c *mcentral) freeSpan(s *mspan, preserve bool, wasempty bool) bool {
    ...
    lock(&c.lock)
    ...
    if s.allocCount != 0 {
        unlock(&c.lock)
        return false
    }

    c.nonempty.remove(s)
    unlock(&c.lock)
    mheap_.freeSpan(s, 0)
    return true
}

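5.3 mheap

mheap does not hand a freed span straight back to the operating system; it keeps the span and may operate on it, for example by merging it with adjacent free spans.

6. Summary

tcmalloc is a design; putting it into practice also means dealing with engineering concerns. Studying the Golang source teaches more than how the allocator works. It also shows many interesting techniques, such as padding a variable out to a cache line to avoid false sharing, and using a de Bruijn sequence to count trailing zeros (in sys.Ctz64). I think that is what makes reading source code worthwhile.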