Exploring systemtap (Part 2): The C Code Generated from Probes

Date: 2019-12-08

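In the previous article I briefly introduced systemtap's workflow and covered its first and second phases. Starting with this article, we move on to the centerpiece of the series: the third phase, which generates C code.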

We can run a command such as stap -v test.stp -p3 > out.c to have stap redirect the generated C code into out.c.

hello, world

As is customary, let's start with a "hello world" example.

probe begin {
    printf("hello")
}

probe oneshot {
    printf(" wor")
}

probe end {
    printf("ld\n")
}

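Purely for my own amusement, a complete hello world is split into three pieces here. By searching for these specific strings, we can quickly find the code generated for the three probes in the C output.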

static void probe_3646 (struct context * __restrict__ c) {
  __label__ deref_fault;
  __label__ out;
  struct probe_3646_locals * __restrict__ l = & c->probe_locals.probe_3646;
  (void) l;
  if (c->actionremaining < 1) { c->last_error = "MAXACTION exceeded"; goto out; }
  (void)
  ({
    _stp_print ("hello");
  });
deref_fault: __attribute__((unused));
out:
  _stp_print_flush();
}

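The code above corresponds to probe begin.

As we can see, every probe handler is passed a context argument when it runs. Each context contains a struct probe_<id>_locals member (struct probe_3646_locals here), which stores the probe's local variables. Our hello world example uses no local variables, so these structs are empty.

Next comes the MAXACTION exceeded check. According to the systemtap documentation, MAXACTION caps the number of statements a single probe hit may execute. This bounds how long a probe can run and keeps the kernel from becoming unresponsive.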
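To make the check concrete, here is a minimal user-space sketch of an action budget. The names toy_context and toy_probe are hypothetical, and the way the budget is decremented is my assumption; the generated code above only shows the comparison and the jump to out.

```c
#include <stddef.h>

/* Hypothetical stand-in for the generated `struct context`. */
struct toy_context {
    int actionremaining;      /* statements this probe hit may still run */
    const char *last_error;   /* NULL while no error has occurred */
    int executed;             /* how many statements actually ran */
};

/* Runs up to `nstatements` one-statement steps, checking the budget first. */
void toy_probe(struct toy_context *c, int nstatements)
{
    for (int i = 0; i < nstatements; i++) {
        if (c->actionremaining < 1) {
            c->last_error = "MAXACTION exceeded";
            return;               /* plays the role of `goto out` */
        }
        c->actionremaining -= 1;  /* pay for the statement (assumed accounting) */
        c->executed += 1;         /* the statement body would go here */
    }
}
```

With a budget of 3, asking toy_probe to run 5 statements stops after the third one and leaves the error string set, which is the behaviour the generated check is meant to enforce.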

Next comes:

(void)
({
  _stp_print ("hello");
});

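As we can see, the printf statement is compiled into a call to the corresponding built-in function. To avoid polluting the surrounding scope, the compiled form of each statement is also wrapped in a layer of parentheses and braces.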
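The ({ ... }) form is a GNU C statement expression: a braced block wrapped in parentheses that can be used as an expression, and whose value is that of its last expression statement. A small illustration follows; the function scoped_sum is made up for this example, and it needs GCC or Clang.

```c
/* Statement expressions are a GNU extension (GCC/Clang), not ISO C. */
int scoped_sum(int a, int b)
{
    int tmp = 100;                 /* outer variable */
    int result = ({
        int tmp = a + b;           /* shadows the outer tmp inside the block only */
        tmp * 2;                   /* value of the whole ({ ... }) expression */
    });
    return result + tmp;           /* the outer tmp is still 100 here */
}
```

Because names declared inside the block never escape it, a code generator can emit temporaries for each statement without worrying about clashes between statements.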

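The other two probes are much the same, except that probe oneshot additionally calls function___global_exit__overload_0, which in turn calls the built-in _stp_exit function.

Each of these probes has a corresponding struct stap_be_probe instance. The generated code shows that the enter_be_probe function executes the probe's handler, specifically on this line: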

(*stp->probe->ph) (c);

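The code before this line is preparation. The code after it checks whether an error occurred during execution, collects timing statistics, and so on. Note that the context passed to the probe function is reused.

enter_be_probe is called from systemtap_module_init and systemtap_module_exit. Specifically, probe begin and probe oneshot are invoked in systemtap_module_init (the type of their struct stap_be_probe is 0), while probe end is invoked in systemtap_module_exit (type 1). As the names suggest, systemtap_module_init and systemtap_module_exit are called when the session starts and ends, respectively. The exact call flow is described in runtime/transport/transport.txt in the systemtap source tree.

You can think of it this way: the systemtap runtime has a begin phase and an end phase. probe begin and probe oneshot both run in the begin phase. The latter calls _stp_exit, which marks the transition to the end phase. Finally, probe end runs in the end phase.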
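The two phases can be sketched in user-space C as a table of probes tagged with a type and dispatched through a function pointer. Everything prefixed with toy_ is hypothetical; only the type values 0 and 1 and the shape of the indirect call come from the generated code.

```c
#include <stddef.h>

struct toy_ctx { int calls; };            /* hypothetical context, reused across probes */

typedef void (*toy_handler)(struct toy_ctx *);

struct toy_be_probe {
    int type;            /* 0 = begin-phase probe, 1 = end-phase probe */
    toy_handler ph;      /* the probe handler */
};

void toy_begin(struct toy_ctx *c) { c->calls += 1; }
void toy_end(struct toy_ctx *c)   { c->calls += 10; }

/* Runs every probe of the given type, loosely in the manner of enter_be_probe. */
void toy_run_phase(const struct toy_be_probe *probes, size_t n,
                   int type, struct toy_ctx *c)
{
    for (size_t i = 0; i < n; i++) {
        if (probes[i].type == type)
            (*probes[i].ph)(c);      /* same shape as (*stp->probe->ph) (c) */
    }
}
```

A module init function would call toy_run_phase with type 0 and a module exit function with type 1, passing the same context both times.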

So is there an intermediate phase between begin and end? The answer is of course yes. Next, let's look at an example that includes a timer.

timer

Replace probe oneshot with probe timer.ms(149):

probe timer.ms(149) {
    printf(" wor")
    exit()
}

Comparing the generated C code, the probe handlers are essentially the same as before. Outside the handlers, however, there are two differences.

First, there is no struct stap_be_probe for probe timer.ms(149), because it runs in neither the begin nor the end phase.

Second, there is a new struct stap_hrtimer_probe type, which is the probe type for probe timer.ms(149). The generated code shows a call to _stp_hrtimer_create in systemtap_module_init. That call registers _stp_hrtimer_notify_function, which is almost a copy of enter_be_probe.

Notably, _stp_hrtimer_notify_function performs one extra check when it accounts for execution time:

if (interval > STP_OVERLOAD_INTERVAL) {
  if (c->cycles_sum > STP_OVERLOAD_THRESHOLD) {
    _stp_error ("probe overhead exceeded threshold");
    atomic_set (session_state(), STAP_SESSION_ERROR);
    atomic_inc (error_count());
  }
  c->cycles_base = cycles_atend;
  c->cycles_sum = 0;
}

This check keeps systemtap from consuming too much time within a given interval, which prevents the kernel from becoming unresponsive.
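The windowed accounting can be tried out in isolation. The sketch below is a user-space model, not the runtime's code: the toy_ names and the two limit values are invented, and I am assuming that the elapsed cycles of each run are added to cycles_sum before the check, which the excerpt does not show.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limits; the real STP_OVERLOAD_* values come from the runtime. */
#define TOY_OVERLOAD_INTERVAL  1000u
#define TOY_OVERLOAD_THRESHOLD 500u

struct toy_cycles {
    uint64_t cycles_base;   /* start of the current accounting window */
    uint64_t cycles_sum;    /* cycles spent in probes within the window */
};

/* Accounts one probe run; returns true if the overhead limit was exceeded. */
bool toy_account(struct toy_cycles *c, uint64_t atstart, uint64_t atend)
{
    bool overloaded = false;
    c->cycles_sum += atend - atstart;            /* assumed accumulation step */
    uint64_t interval = atend - c->cycles_base;
    if (interval > TOY_OVERLOAD_INTERVAL) {
        if (c->cycles_sum > TOY_OVERLOAD_THRESHOLD)
            overloaded = true;                   /* real code reports an error here */
        c->cycles_base = atend;                  /* start a new window */
        c->cycles_sum = 0;
    }
    return overloaded;
}
```

The threshold is only examined once a full interval has passed, so a single slow run inside a quiet window does not trip it.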

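In the C code generated for the script with a timer, the module does not switch to the end phase via _stp_exit right after the begin phase. Instead it registers a timer and runs the probe's logic inside it. Only after the timer handler calls _stp_exit does the module enter the end phase.

Next, let's look at an example with a uprobe.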

uprobe

probe process("/usr/local/openresty/luajit/bin/luajit").function("lj_str_new") {
    printf(" wor")
    exit()
}

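The stp code above attaches to the lj_str_new function in the luajit executable. Note that to run this script, you need to make sure luajit's debuginfo is available.

In the generated C code, the type for this probe is stapiu_consumer.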

static struct stapiu_consumer stap_inode_uprobe_consumers[] = {
  { .target=&stap_inode_uprobe_targets[0], .offset=(loff_t)0x6a55ULL, .probe=(&stap_probes[1]), },
};

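The odd part is the 0x6a55. That number appears nowhere in the script, so where does it come from?

Running readelf -s /usr/local/openresty/luajit/bin/luajit | grep lj_str_new shows that the function's address is 0x406a55. The actual runtime address would be X + 0x406a55, where X is random. Because 0x400000 is the base address fixed at link time, we can regard the address of lj_str_new as X + 0x400000 + 0x6a55. In other words, using 0x6a55 as the offset is enough to locate lj_str_new. This is also why luajit's debuginfo is required: without it, the address of lj_str_new cannot be determined.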
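The arithmetic is simple enough to write down directly. The two helper functions below are hypothetical names for this illustration; the constants are the ones quoted above.

```c
#include <stdint.h>

/* Offset of a symbol relative to the link-time base address. */
uint64_t offset_from_base(uint64_t symbol_addr, uint64_t link_base)
{
    return symbol_addr - link_base;
}

/* Address of the symbol at run time, for a given random shift X. */
uint64_t runtime_addr(uint64_t x, uint64_t link_base, uint64_t offset)
{
    return x + link_base + offset;
}
```

Calling offset_from_base(0x406a55, 0x400000) yields 0x6a55, the value that appears in the .offset field of the generated stapiu_consumer.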

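A stapiu_consumer is executed in stapiu_probe_handler, and the process is the same as for the previous two probe types. systemtap checks all existing and newly created processes; if a process's executable matches a probe, systemtap registers that probe through the kernel API. When the kernel fires the callback, that function runs.

It is worth stressing that the probe fires in every matching process. Specifying -x PID only sets the value of target(). If you do not want the probe to be triggered by multiple processes, you have to handle that yourself in the stp code: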

probe process("/usr/local/openresty/luajit/bin/luajit").function("lj_str_new") {
    _target = target();
    if (pid() != _target) {
        next;
    }

    printf(" wor")
    exit()
}

The same goes for -c CMD: that option simply creates a child process and uses the child's PID as the value of target().

uretprobe

Finally, let's look at uretprobe, the counterpart of uprobe.

probe process("/usr/local/openresty/luajit/bin/luajit").function("lj_str_new").return {
    printf(" wor")
    exit()
}

The C code generated from the stp code above is largely similar to the uprobe case. Only the stapiu_consumer differs slightly:

static struct stapiu_consumer stap_inode_uprobe_consumers[] = {
  { .return_p=1, .target=&stap_inode_uprobe_targets[0], .offset=(loff_t)0x6a55ULL, .probe=(&stap_probes[1]), },
};

There is an additional return_p=1.
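One plausible way such a flag gets used is to decide which callback slot is filled when the consumer is registered. The sketch below is an assumption on my part, not the runtime's actual code: toy_consumer, toy_probe_hit and toy_register are invented names, and only the return_p field comes from the generated output.

```c
#include <stddef.h>

typedef int (*toy_cb)(void);

/* Hypothetical consumer: one slot for entry hits, one for return hits. */
struct toy_consumer {
    unsigned return_p;     /* 1 for a .return probe, 0 otherwise */
    toy_cb on_entry;
    toy_cb on_return;
};

int toy_probe_hit(void) { return 1; }

/* Fills exactly one callback slot depending on return_p. */
void toy_register(struct toy_consumer *c)
{
    c->on_entry  = c->return_p ? NULL : toy_probe_hit;
    c->on_return = c->return_p ? toy_probe_hit : NULL;
}
```

Under this model the same probe body serves both cases, and the flag alone determines whether it runs on function entry or on function return.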