Tracking down a memory leak in a Node.js service: an unexpected culprit

Date: 2021-08-13

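Background

Our team recently migrated two projects to degg 2.0, and both developed fairly serious memory leaks. This article walks through the investigation, using the event-tracking service I maintain as the example. On the service's memory chart, the window in which degg 2.0 ran in production (marked with a red box) shows memory usage climbing to 50% within just 36 hours, whereas it normally stays between 20% and 30%. A memory leak was almost certain.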

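Approach

Because both services that adopted degg 2.0 leaked memory, I initially narrowed the search to the base components that degg 2.0 introduces or rewrites, with the nodex-logger component as the prime suspect. To investigate a leak we also need heap snapshots (heapsnapshot) of the running service process. For how to capture them, see hyj1991's article on quickly locating memory leaks in production Node.js services: https://zhuanlan.zhihu.com/p/36340263 .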

Investigation

1. Capturing heap snapshots

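I used alinode to capture the heap snapshots. After starting the service, I warmed it up with light traffic for a minute or two and recorded the first snapshot (2020-4-16-15:46). I then put the service under load at 125 QPS and captured the second snapshot roughly an hour later (2020-4-16-16:52). Loading both snapshots into Chrome DevTools showed that after only an hour of running, the snapshot file had grown by 45MB from an initial size of just 39.7MB. Sorting by the Retained Size column quickly revealed a suspect: generator. This entry accounted for 55% of the retained size while its Shallow Size was 0%. I expanded it level by level until I reached one particular row (highlighted in the screenshot), but expanding further showed only 0%, and the trail went cold.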

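Staring at generator, I was puzzled: my service code does not use generator syntax at all, so why would generator objects be leaking? I turned my attention to the node_modules directory. I had been optimizing the nodex-kafka component recently and sometimes edited its code directly inside node_modules for debugging, so a snippet that appears at the top of almost every file caught my eye: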

"use strict";
var __awaiter = (this && this.__awaiter) || function (thisArg, _arguments, P, generator) {
    function adopt(value) { return value instanceof P ? value : new P(function (resolve) { resolve(value); }); }
    return new (P || (P = Promise))(function (resolve, reject) {
        function fulfilled(value) { try { step(generator.next(value)); } catch (e) { reject(e); } }
        function rejected(value) { try { step(generator["throw"](value)); } catch (e) { reject(e); } }
        function step(result) { result.done ? resolve(result.value) : adopt(result.value).then(fulfilled, rejected); }
        step((generator = generator.apply(thisArg, _arguments || [])).next());
    });
};

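This code is emitted by the TypeScript compiler. Because the source uses async/await, it is compiled to the __awaiter form: wherever the source has an async function, the compiled output wraps it in __awaiter:

// Before compilation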
(async function() {
  await Promise.resolve(1);
  await Promise.resolve(2);
})()

// After compilation
(function () {
  return __awaiter(this, void 0, void 0, function* () {
    yield Promise.resolve(1);
    yield Promise.resolve(2);
  });
})();

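A Node.js issue about a generator memory leak, #30753 "generator functions - memory leak" (https://github.com/nodejs/node/issues/30753), also caught my attention: both the Node.js version and the symptoms it describes closely matched what I was seeing. So I searched the project's node_modules for every occurrence of the string __awaiter and found three modules whose compiled output contains the code above: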

nodex-logger
nodex-kafka
nodex-apollo
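
The search described above can also be scripted. The following is a minimal sketch, not the command actually used in this investigation: findAwaiterPackages is a hypothetical helper that walks a node_modules directory and reports every package containing a .js file with the string __awaiter. It assumes a conventional layout (scoped packages under @scope/name), skips symlinks, and attributes hits in nested node_modules directories to the enclosing top-level package.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper (not from the original investigation): returns the
// sorted names of packages under `nodeModulesDir` whose .js files contain
// the string "__awaiter".
function findAwaiterPackages(nodeModulesDir) {
  const hits = new Set();

  // Recursively scan one package directory. A symlink Dirent is neither a
  // file nor a directory, so symlinks are skipped (this avoids cycles).
  function scan(dir, packageName) {
    for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
      const fullPath = path.join(dir, entry.name);
      if (entry.isDirectory()) {
        scan(fullPath, packageName);
      } else if (entry.isFile() && entry.name.endsWith('.js')) {
        if (fs.readFileSync(fullPath, 'utf8').includes('__awaiter')) {
          hits.add(packageName);
        }
      }
    }
  }

  for (const entry of fs.readdirSync(nodeModulesDir, { withFileTypes: true })) {
    if (!entry.isDirectory() || entry.name.startsWith('.')) continue;
    const entryPath = path.join(nodeModulesDir, entry.name);
    if (entry.name.startsWith('@')) {
      // Scoped packages live one level deeper: node_modules/@scope/name.
      for (const scoped of fs.readdirSync(entryPath, { withFileTypes: true })) {
        if (scoped.isDirectory()) {
          scan(path.join(entryPath, scoped.name), `${entry.name}/${scoped.name}`);
        }
      }
    } else {
      scan(entryPath, entry.name);
    }
  }
  return [...hits].sort();
}
```

Calling findAwaiterPackages(path.join(process.cwd(), 'node_modules')) from the project root would list the candidate packages to inspect.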

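These modules set the target field of tsconfig.json to es6, which is why tsc uses generators to emulate async/await. Node.js has had 100% support for ES2017 features since v8.10.0, so there was no need to down-level async/await in the first place. I changed the target of these three modules to es2017 so that tsc leaves async/await untouched.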
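
Whether a given target causes async functions to be rewritten can be captured in a small check. The sketch below assumes the documented tsc behavior that targets older than ES2017 down-level async/await. downlevelsAsync is a hypothetical helper, not part of TypeScript.

```javascript
// Targets older than ES2017, for which tsc rewrites async functions
// (es6 is an alias of es2015).
const PRE_ES2017_TARGETS = new Set(['es3', 'es5', 'es6', 'es2015', 'es2016']);

// Hypothetical helper: true if the given tsconfig "target" value makes tsc
// down-level async/await. Target names are case-insensitive in tsconfig.json.
// A missing target falls back to the compiler default, which this sketch
// does not model, so callers should pass an explicit string.
function downlevelsAsync(target) {
  return PRE_ES2017_TARGETS.has(String(target).toLowerCase());
}

console.log(downlevelsAsync('es6'));    // true
console.log(downlevelsAsync('es2017')); // false
```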

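2. Verification

I repeated the heap snapshot steps and was surprised to find that memory did not grow even after a full day, and generator no longer held on to unreleased memory.

With that, the memory leak was fixed. But how can this problem be avoided in the first place?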

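How to avoid it

1. Steps

Step 1

The problem only exists in specific Node.js versions. Use a Node.js version outside the range (v11.0.0 - v12.16.0), so that generator syntax in second-party and third-party npm packages cannot cause trouble in your service.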

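Step 2

Compile your own TypeScript with a target of es2017 or higher, and prefer async/await over generator syntax wherever possible, so that your npm package does not cause a memory leak for others who run it on a version in the range (v11.0.0 - v12.16.0).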
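
Step 1 can be turned into a startup check. The sketch below is an illustration under stated assumptions: it encodes only the range reported in this article (introduced in v11.0.0, fixed in v12.16.0) and does not model other release lines separately. isAffectedByGeneratorLeak and its helpers are hypothetical names.

```javascript
// Parse "v12.16.0" or "12.16.0" into [major, minor, patch].
function parseVersion(version) {
  const match = /^v?(\d+)\.(\d+)\.(\d+)/.exec(version);
  if (!match) throw new TypeError(`Unrecognized version: ${version}`);
  return match.slice(1).map(Number);
}

// Negative if a < b, zero if equal, positive if a > b.
function compareVersions(a, b) {
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] - b[i];
  }
  return 0;
}

// Assumption from this article: affected from v11.0.0 (inclusive) up to,
// but not including, v12.16.0, the release that contains the fix.
function isAffectedByGeneratorLeak(version = process.version) {
  const parsed = parseVersion(version);
  return compareVersions(parsed, [11, 0, 0]) >= 0 &&
         compareVersions(parsed, [12, 16, 0]) < 0;
}

if (isAffectedByGeneratorLeak()) {
  console.warn(`Node.js ${process.version} is in the range affected by the generator leak`);
}
```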

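2. Details

As noted earlier, Node.js has supported async/await since v8.10.0, which was released on 2018-03-06. Not every service could switch to the new version at once, so code had to be compiled to es6 to stay compatible with Node.js v6 environments. Now that the LTS release is already v12, however, it is entirely feasible to check whether the npm packages written in TypeScript all compile to es2017, and even to consider targeting es2019.

Which version introduced this memory leak, and has it been fixed? The following code reproduces it: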

// no-leak.js
const heapdump = require('heapdump')

class Async {
  async run() {
      return null;
  }
}
const run = async () => {
  for (let index = 0; index < 10000000; index++) {
      if (index % 1000000 === 0)
          console.log(Math.floor(process.memoryUsage().heapUsed / 10000), index);
      const doer = new Async();
      await doer.run();
  }
  heapdump.writeSnapshot((err, filename) => {
    console.log("Heap dump written to", filename);
  });
};
run();

// leak.js, compiled from no-leak.js
var __awaiter = (this && this.__awaiter) || function (thisArg, _arguments, P, generator) {
    function adopt(value) { return value instanceof P ? value : new P(function (resolve) { resolve(value); }); }
    return new (P || (P = Promise))(function (resolve, reject) {
        function fulfilled(value) { try { step(generator.next(value)); } catch (e) { reject(e); } }
        function rejected(value) { try { step(generator["throw"](value)); } catch (e) { reject(e); } }
        function step(result) { result.done ? resolve(result.value) : adopt(result.value).then(fulfilled, rejected); }
        step((generator = generator.apply(thisArg, _arguments || [])).next());
    });
};
class Async {
    run() {
        return __awaiter(this, void 0, void 0, function* () {
            return null;
        });
    }
}
const run = () => __awaiter(this, void 0, void 0, function* () {
    const now = Date.now();
    console.log('Total iterations: ', 10000000);
    for (let index = 0; index < 10000000; index++) {
        if (index % 1000000 === 0) {
            console.log('Iteration %d, heap used: %d MB', index, Math.floor(process.memoryUsage().heapUsed / 1000000));
        }
        const instance = new Async();
        yield instance.run();
    }
    console.log('Total time: %d s', (Date.now() - now) / 1000);
});
run();

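Bisecting across releases showed that the leak was introduced in v11.0.0 and fixed in v12.16.0. On an affected version, memory usage climbs steadily while the script runs until the process crashes. On an unaffected version, memory is reclaimed promptly.

Root cause

The root cause is a bug in V8. Related links: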

v8 issue: https://bugs.chromium.org/p/v8/issues/detail?id=10031

v8 commit: https://chromium.googlesource.com/v8/v8.git/+/d3a1a5b6c4916f22e076e3349ed3619bfb014f29

node issue: https://github.com/nodejs/node/issues/30753

node commit: https://github.com/nodejs/node/pull/31005/files

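In the fixed code, when an entry is added to the WeakArrayList, a "no empty slots" marker (kNoEmptySlotsMarker) is no longer taken at face value. ScanForEmptySlots is called to rescan the array, because some of its elements may have been collected by the GC, and those cleared slots can be reused. The array only grows when the marker is still kNoEmptySlotsMarker after the rescan, that is, when the array contains no GC-cleared elements: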

// https://github.com/targos/node/blob/cceb2a87295724b7aa843363460ffcd10cda05b5/deps/v8/src/objects/objects.cc#L4042
// static
Handle<WeakArrayList> PrototypeUsers::Add(Isolate* isolate,
                                          Handle<WeakArrayList> array,
                                          Handle<Map> value,
                                          int* assigned_index) {
  int length = array->length();
  if (length == 0) {
    // Uninitialized WeakArrayList; need to initialize empty_slot_index.
    array = WeakArrayList::EnsureSpace(isolate, array, kFirstIndex + 1);
    set_empty_slot_index(*array, kNoEmptySlotsMarker);
    array->Set(kFirstIndex, HeapObjectReference::Weak(*value));
    array->set_length(kFirstIndex + 1);
    if (assigned_index != nullptr) *assigned_index = kFirstIndex;
    return array;
  }

  // If the array has unfilled space at the end, use it.
  if (!array->IsFull()) {
    array->Set(length, HeapObjectReference::Weak(*value));
    array->set_length(length + 1);
    if (assigned_index != nullptr) *assigned_index = length;
    return array;
  }

  // If there are empty slots, use one of them.
  int empty_slot = Smi::ToInt(empty_slot_index(*array));

  if (empty_slot == kNoEmptySlotsMarker) {
    // GCs might have cleared some references, rescan the array for empty slots.
    PrototypeUsers::ScanForEmptySlots(*array);
    empty_slot = Smi::ToInt(empty_slot_index(*array));
  }
  if (empty_slot != kNoEmptySlotsMarker) {
    DCHECK_GE(empty_slot, kFirstIndex);
    CHECK_LT(empty_slot, array->length());
    int next_empty_slot = array->Get(empty_slot).ToSmi().value();

    array->Set(empty_slot, HeapObjectReference::Weak(*value));
    if (assigned_index != nullptr) *assigned_index = empty_slot;

    set_empty_slot_index(*array, next_empty_slot);
    return array;
  } else {
    DCHECK_EQ(empty_slot, kNoEmptySlotsMarker);
  }
  
  // Array full and no empty slots. Grow the array.
  array = WeakArrayList::EnsureSpace(isolate, array, length + 1);
  array->Set(length, HeapObjectReference::Weak(*value));
  array->set_length(length + 1);
  if (assigned_index != nullptr) *assigned_index = length;
  return array;
}

// static
void PrototypeUsers::ScanForEmptySlots(WeakArrayList array) {
  for (int i = kFirstIndex; i < array.length(); i++) {
    if (array.Get(i)->IsCleared()) {
      PrototypeUsers::MarkSlotEmpty(array, i);
    }
  }
}
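
To make the effect of the rescan concrete, here is a toy JavaScript model of the slot bookkeeping. It is a simplification written to illustrate the explanation above, not V8's implementation: ToySlotList, CLEARED, EmptySlot, and the rescan flag are hypothetical names, a GC clearing a weak reference is simulated by hand, and capacity handling is omitted.

```javascript
const NO_EMPTY_SLOTS = -1;
const CLEARED = Symbol('cleared'); // stands in for a weak reference cleared by the GC

// An empty slot stores the index of the next empty slot (a free list).
class EmptySlot {
  constructor(next) { this.next = next; }
}

class ToySlotList {
  constructor(rescan) {
    this.rescan = rescan;        // true models the fixed behavior
    this.slots = [];
    this.emptySlotIndex = NO_EMPTY_SLOTS;
  }

  add(value) {
    // Fixed behavior: the GC may have cleared entries, so rescan before growing.
    if (this.emptySlotIndex === NO_EMPTY_SLOTS && this.rescan) {
      this.scanForEmptySlots();
    }
    if (this.emptySlotIndex !== NO_EMPTY_SLOTS) {
      const index = this.emptySlotIndex;
      this.emptySlotIndex = this.slots[index].next;
      this.slots[index] = value;
      return index;
    }
    // No reusable slot: grow the list.
    this.slots.push(value);
    return this.slots.length - 1;
  }

  // Simulate the GC clearing the weak reference stored at `index`.
  simulateGcClear(index) {
    this.slots[index] = CLEARED;
  }

  scanForEmptySlots() {
    for (let i = 0; i < this.slots.length; i++) {
      if (this.slots[i] === CLEARED) {
        this.slots[i] = new EmptySlot(this.emptySlotIndex);
        this.emptySlotIndex = i;
      }
    }
  }
}

// Add `iterations` short-lived entries, each cleared right after it is added,
// and report how long the list ends up.
function finalLength(rescan, iterations) {
  const list = new ToySlotList(rescan);
  for (let i = 0; i < iterations; i++) {
    list.simulateGcClear(list.add(`map-${i}`));
  }
  return list.slots.length;
}

console.log(finalLength(true, 1000));  // 1
console.log(finalLength(false, 1000)); // 1000
```

With the rescan, the single cleared slot is reused on every call. Without it, every add grows the list even though all earlier entries are dead, which is the unbounded growth pattern described above.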

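More than a memory leak

While testing the leak I noticed something else. Comparing the code that leaks (leak.js above) with the code that does not (no-leak.js above), generator syntax still has two problems even on Node.js v12.16.2, where the bug is already fixed:

Memory is reclaimed inefficiently, so a considerable amount of memory is still in use after the run finishes;
Execution is very slow: the async/await version needs only 0.953 seconds, while the generator version needs 17.754 seconds.

This shows that async/await has an overwhelming advantage over generator syntax in both execution speed and memory usage. How large is the gap in execution speed? Let's measure it with a benchmark tool: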

// benchmark.js
const __awaiter = (this && this.__awaiter) || function (thisArg, _arguments, P, generator) {
  function adopt(value) { return value instanceof P ? value : new P(function (resolve) { resolve(value); }); }
  return new (P || (P = Promise))(function (resolve, reject) {
      function fulfilled(value) { try { step(generator.next(value)); } catch (e) { reject(e); } }
      function rejected(value) { try { step(generator["throw"](value)); } catch (e) { reject(e); } }
      function step(result) { result.done ? resolve(result.value) : adopt(result.value).then(fulfilled, rejected); }
      step((generator = generator.apply(thisArg, _arguments || [])).next());
  });
};

const Benchmark = require('benchmark');
const suite = new Benchmark.Suite;

suite
  .add('generator', {
    defer: true,
    fn: function (deferred) {
      (function () {
        return __awaiter(this, void 0, void 0, function* () {
            yield Promise.resolve(1);
            yield Promise.resolve(2);
            // Test finished
            deferred.resolve();
        });
      })();
    }
  })
  .add('async/await', {
    defer: true,
    fn: function(deferred) {
      (async function() {
        await Promise.resolve(1);
        await Promise.resolve(2);

        // Test finished
        deferred.resolve();
      })()
    }
  })
  .on('cycle', function(event) {
    console.log(String(event.target));
  })
  .run({
    'async': false
  });

Results on Node.js v12.16.2:

generator x 443,891 ops/sec ±4.12% (75 runs sampled)
async/await x 4,567,163 ops/sec ±1.96% (79 runs sampled)

generator ran 443,891 operations per second, while async/await ran 4,567,163 operations per second, more than 10 times as many. Let's see how other Node.js versions perform:

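Test machine: MacBook Pro (13-inch, 2017, Two Thunderbolt 3 ports)

Both variants get faster with newer Node.js versions, and Node.js v12 made a big leap of a full order of magnitude. This is thanks to a new feature in V8 7.2, which the official site describes in a dedicated article, https://v8.dev/blog/fast-async#await-under-the-hood , for anyone interested.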

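Is Chrome affected too?

The latest version at the time of writing, 81.0.4044.113 (Official Build) (64-bit), already has the fix.

Since this is a V8 bug, the Chrome browser is affected as well. This can be checked by opening a blank tab and running the leak.js code given above.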

Author: @林樂揚 | Original: https://zhuanlan.zhihu.com/p/252689936
