Detecting Multithreading Bugs with Valgrind and ThreadSanitizer

Date: 2020-06-11

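While working on my graduation project, I ran into a multithreading bug. It behaved rather strangely, corrupting data at random. Since I could not find any pattern, it gave me quite a headache at first. After digging for a while, I noticed that the relevant logs showed signs of unsynchronized access to shared data across threads (a so-called data race), so I quickly concluded that the multithreaded code had a logic error. The fix was simple: review the relevant code, find the racy parts, and correct them. The bug was fixed, but I still wanted an automated tool that could detect potential thread-safety problems in code, so that such bugs could be nipped in the bud rather than hunted down after the fact.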

A quick search turned up two tools suited to the job: Valgrind and ThreadSanitizer. This post introduces both.

Valgrind

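Valgrind is usually used to detect memory leaks and out-of-bounds memory accesses, but it can also check for data races and some simple multithreading problems. Within the Valgrind tool suite, both helgrind and drd can perform this kind of check; enable them with valgrind --tool=helgrind or valgrind --tool=drd. Both work as long as the application's threading model is POSIX threads (pthread). The two tools differ little, so the usage below is based on helgrind: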

First, some deliberately faulty sample code:

// raceCondition.cpp
#include <pthread.h>

void *write_buffer(void *args)
{
    pthread_t *buffer = static_cast<pthread_t *>(args);
    *buffer = pthread_self();
    pthread_exit(0);
    return NULL;
}

int main()
{
    pthread_t *buffer = new pthread_t[2];
    pthread_t a, b;

    pthread_create(&a, NULL, write_buffer, buffer);
    pthread_create(&b, NULL, write_buffer, buffer);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    delete []buffer;

    return 0;
}

This code has a deliberate problem: threads a and b both write to the same buffer location, with no synchronization between them.

Valgrind detects the problem:

==5697== ---Thread-Announcement------------------------------------------
==5697==
==5697== Thread #3 was created
==5697==    at 0x545943E: clone (clone.S:74)
==5697==    by 0x5148199: do_clone.constprop.3 (createthread.c:75)
==5697==    by 0x51498BA: create_thread (createthread.c:245)
==5697==    by 0x51498BA: pthread_create@@GLIBC_2.2.5 (pthread_create.c:611)
==5697==    by 0x4C30E0D: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==5697==    by 0x400928: main (in /home/lzx/C/thread_error/a.out)
==5697==
==5697== ---Thread-Announcement------------------------------------------
==5697==
==5697== Thread #2 was created
==5697==    at 0x545943E: clone (clone.S:74)
==5697==    by 0x5148199: do_clone.constprop.3 (createthread.c:75)
==5697==    by 0x51498BA: create_thread (createthread.c:245)
==5697==    by 0x51498BA: pthread_create@@GLIBC_2.2.5 (pthread_create.c:611)
==5697==    by 0x4C30E0D: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==5697==    by 0x40090B: main (in /home/lzx/C/thread_error/a.out)
==5697==
==5697== ---Thread-Announcement------------------------------------------
==5697==
==5697== Thread #1 is the program's root thread
==5697==
==5697== ----------------------------------------------------------------
==5697==
==5697== Possible data race during write of size 8 at 0x5C40040 by thread #3
==5697== Locks held: none
==5697==    at 0x4008BB: write_buffer(void*) (in /home/lzx/C/thread_error/a.out)
==5697==    by 0x4C30FA6: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==5697==    by 0x5149181: start_thread (pthread_create.c:312)
==5697==    by 0x545947C: clone (clone.S:111)
==5697==
==5697== This conflicts with a previous write of size 8 by thread #2
==5697== Locks held: none
==5697==    at 0x4008BB: write_buffer(void*) (in /home/lzx/C/thread_error/a.out)
==5697==    by 0x4C30FA6: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==5697==    by 0x5149181: start_thread (pthread_create.c:312)
==5697==    by 0x545947C: clone (clone.S:111)
==5697==  Address 0x5c40040 is 0 bytes inside a block of size 16 alloc'd
==5697==    at 0x4C2CC20: operator new[](unsigned long) (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==5697==    by 0x4008EA: main (in /home/lzx/C/thread_error/a.out)
==5697==  Block was alloc'd by thread #1

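The output includes the memory location of the data race, the size of the access, the threads involved, and the call stacks.
If the program is compiled with the -g option, the call stacks in the output show exact source locations: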

==7993== This conflicts with a previous write of size 8 by thread #2
==7993== Locks held: none
==7993==    at 0x4008BB: write_buffer(void*) (raceCondition.cpp:8)
==7993==    by 0x4C30FA6: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==7993==    by 0x5149181: start_thread (pthread_create.c:312)
==7993==    by 0x545947C: clone (clone.S:111)
==7993==  Address 0x5c40040 is 0 bytes inside a block of size 16 alloc'd
==7993==    at 0x4C2CC20: operator new[](unsigned long) (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==7993==    by 0x4008EA: main (raceCondition.cpp:17)
==7993==  Block was alloc'd by thread #1

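Valgrind records each thread's memory accesses. If several threads access the same memory address, at least one of the accesses is a write, and nothing defines an order between them (a happens-before relationship, in memory-model terms), the accesses are reported as a "Possible data race".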
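
To illustrate what "a defined order" means, here is a minimal sketch of the same two-writer pattern with the shared write guarded by a pthread mutex. The SharedSlot type and the names write_buffer_locked and run_locked_writers are hypothetical, introduced only for this illustration. Because each write happens between a lock and an unlock of the same mutex, the two writes are ordered, so helgrind should no longer flag them.

```cpp
#include <pthread.h>

// Hypothetical helper type (not part of the original example):
// the shared value together with the mutex that guards it.
struct SharedSlot {
    pthread_mutex_t lock;
    pthread_t value;
    int writes;
};

void *write_buffer_locked(void *args)
{
    SharedSlot *slot = static_cast<SharedSlot *>(args);
    // Lock/unlock on the same mutex orders the two threads' writes.
    pthread_mutex_lock(&slot->lock);
    slot->value = pthread_self();
    ++slot->writes;
    pthread_mutex_unlock(&slot->lock);
    return NULL;
}

// Starts two writer threads and returns how many guarded writes happened.
int run_locked_writers()
{
    SharedSlot slot;
    pthread_mutex_init(&slot.lock, NULL);
    slot.writes = 0;

    pthread_t a, b;
    pthread_create(&a, NULL, write_buffer_locked, &slot);
    pthread_create(&b, NULL, write_buffer_locked, &slot);
    pthread_join(a, NULL);
    pthread_join(b, NULL);

    pthread_mutex_destroy(&slot.lock);
    return slot.writes;
}
```

Calling run_locked_writers() from main and running the program under valgrind --tool=helgrind is one way to compare the report with the unguarded version above.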

Valgrind's detection also works for the C++11 thread library (as long as it is implemented on top of pthread):

#include <thread>

using namespace std;

void write_buffer(thread::id *buffer)
{
    *buffer = this_thread::get_id();
}

int main()
{
    thread::id *buffer = new thread::id[2];
    thread a(write_buffer, buffer);
    thread b(write_buffer, buffer);
    a.join();
    b.join();
    delete[] buffer;
    return 0;
}

The report is much the same as for the pthread version. It is too long to reproduce here.
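
Another way to remove this race, sketched under the assumption that the two-element array was meant to hold one id per thread, is to give each thread its own element. The names write_slot and run_disjoint_writers are hypothetical. Since the threads now write to different addresses, there is no conflicting access left to report.

```cpp
#include <thread>

// Each thread records its id in the slot it was given.
void write_slot(std::thread::id *slot)
{
    *slot = std::this_thread::get_id();
}

// Hypothetical driver: runs two threads on disjoint slots and returns
// true when both slots were filled with two different thread ids.
bool run_disjoint_writers()
{
    std::thread::id slots[2];
    std::thread a(write_slot, &slots[0]);
    std::thread b(write_slot, &slots[1]);
    a.join();
    b.join();

    const std::thread::id unset;  // default id: "no thread"
    return slots[0] != unset && slots[1] != unset && slots[0] != slots[1];
}
```

Reading the slots after join() is safe because join() orders the threads' writes before the caller's reads.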

Besides data races, Valgrind can also detect some simple multithreading problems, such as a thread exiting without releasing a lock:

#include <pthread.h>

pthread_mutex_t mutex;

void *still_locked(void *args)
{
    (void)args;
    pthread_mutex_lock(&mutex);
    pthread_exit(0);
    return NULL;
}

int main()
{
    pthread_mutex_init(&mutex, NULL);
    pthread_t a;
    pthread_create(&a, NULL, still_locked, NULL);
    pthread_join(a, NULL);
    return 0;
}

==6316== Thread #2: Exiting thread still holds 1 lock
==6316==    at 0x4E4521F: start_thread (pthread_create.c:457)

This is detected even when the thread is detached:

#include <pthread.h>

pthread_mutex_t mutex;

void *still_locked(void *args)
{
    (void)args;
    pthread_detach(pthread_self());
    pthread_mutex_lock(&mutex);
    pthread_exit(0);
    return NULL;
}

int main()
{
    pthread_mutex_init(&mutex, NULL);
    pthread_t a;
    pthread_create(&a, NULL, still_locked, NULL);
    return 0;
}

==6574== Thread #2: Exiting thread still holds 1 lock
==6574==    at 0x4E4521F: start_thread (pthread_create.c:457)
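
For comparison, a minimal sketch of a worker that releases the lock before it finishes. The names g_lock, lock_then_release and run_released_worker are hypothetical, and the counter exists only to give the critical section something to do. With the unlock in place, the "still holds 1 lock" report should not appear.

```cpp
#include <pthread.h>

// Hypothetical global lock for this sketch, statically initialized.
pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;

void *lock_then_release(void *args)
{
    int *counter = static_cast<int *>(args);
    pthread_mutex_lock(&g_lock);
    ++*counter;
    pthread_mutex_unlock(&g_lock);  // release before the thread ends
    return NULL;
}

// Runs the worker, then probes the mutex from the calling thread.
// Returns 0 when the lock was free again after the worker finished.
int run_released_worker()
{
    int counter = 0;
    pthread_t a;
    pthread_create(&a, NULL, lock_then_release, &counter);
    pthread_join(a, NULL);

    int rc = pthread_mutex_trylock(&g_lock);
    if (rc == 0) {
        pthread_mutex_unlock(&g_lock);
    }
    return rc;
}
```

In the faulty version the same pthread_mutex_trylock probe would fail with EBUSY, because the exited thread never gave the lock back.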

ThreadSanitizer

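ThreadSanitizer is another tool for detecting multithreading problems, integrated into gcc 4.8 and clang 3.2 and later.
In other words, as long as your compiler is not too old, you can enable it right away.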

For clang, use the following compile/link options:

clang -fsanitize=thread -fPIE -pie -g

For gcc, you may also need to add -ltsan:

gcc -fsanitize=thread -fPIE -pie -g -ltsan

If you get a link error, check whether the libtsan library is installed.

Taking the first code sample from the previous section as an example:

$ g++ raceCondition.cpp -fsanitize=thread -fPIE -pie -g -ltsan
$ ./a.out
==================
WARNING: ThreadSanitizer: data race (pid=8425)
  Write of size 8 at 0x7d020000eff0 by thread T2:
    #0 write_buffer(void*) /home/lzx/C/thread_error/raceCondition.cpp:8 (exe+0x000000000c1b)
    #1 __tsan_write_range ??:0 (libtsan.so.0+0x00000001b1c9)

  Previous write of size 8 at 0x7d020000eff0 by thread T1:
    #0 write_buffer(void*) /home/lzx/C/thread_error/raceCondition.cpp:8 (exe+0x000000000c1b)
    #1 __tsan_write_range ??:0 (libtsan.so.0+0x00000001b1c9)

  Location is heap block of size 16 at 0x7d020000eff0 allocated by main thread:
    #0 operator new[](unsigned long) ??:0 (libtsan.so.0+0x00000001cfe2)
    #1 main /home/lzx/C/thread_error/raceCondition.cpp:17 (exe+0x000000000c5b)

  Thread T2 (tid=8427, running) created by main thread at:
    #0 pthread_create ??:0 (libtsan.so.0+0x00000001eccb)
    #1 main /home/lzx/C/thread_error/raceCondition.cpp:21 (exe+0x000000000c9d)

  Thread T1 (tid=8426, finished) created by main thread at:
    #0 pthread_create ??:0 (libtsan.so.0+0x00000001eccb)
    #1 main /home/lzx/C/thread_error/raceCondition.cpp:20 (exe+0x000000000c7e)

SUMMARY: ThreadSanitizer: data race /home/lzx/C/thread_error/raceCondition.cpp:8 write_buffer(void*)
==================
ThreadSanitizer: reported 1 warnings

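The output is much like Valgrind's. ThreadSanitizer's detection mechanism is similar too: it checks whether the threads' memory accesses are ordered. The difference is that ThreadSanitizer injects monitoring instructions into specific memory access operations at compile time, rather than monitoring every memory access at run time. As a result, ThreadSanitizer's memory footprint and performance overhead are much lower than Valgrind's, which is its main selling point.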

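Another selling point of ThreadSanitizer is that it can detect more kinds of data races than Valgrind.
However, in some areas (such as the case above of a thread exiting without releasing a lock), ThreadSanitizer's detection falls short of Valgrind's.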

Conclusion

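Between the two tools, I prefer Valgrind. In terms of resource usage, unless your project has reached the scale of Chrome, the running time of your tests is not much of a concern. In terms of features, the two differ little. And ThreadSanitizer is somewhat more cumbersome to use: it requires specific compiler options, which becomes a real pain once they conflict with your existing build setup.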

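In fact, if I had to score the two tools, I could give them only 70 out of 100.
The output of both is vague. "Possible data race"? "Previous write by thread X"? In real applications, the problem spots are far harder to understand than in the sample code above. Moreover, Valgrind's multithreading checks can produce false positives (I ran into this in my graduation project).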

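Also, data race detection is triggered only when the memory accesses actually take place. For data races that occur only with low probability, the two tools will not necessarily detect them.
Take randomRaceCondition.cpp as an example: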

#include <cstdlib>
#include <ctime>
#include <pthread.h>

void *write_buffer(void *args) {
    pthread_t *buffer = static_cast<pthread_t *>(args);
    if (rand() % 2 == 0) { // now the probability that both thread a and thread b perform the write is 1/4
        *buffer = pthread_self();
    }
    pthread_exit(0);
    return NULL;
}

int main()
{
    srand(time(0));
    pthread_t *buffer = new pthread_t[2];
    pthread_t a, b;

    pthread_create(&a, NULL, write_buffer, buffer);
    pthread_create(&b, NULL, write_buffer, buffer);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    delete []buffer;

    return 0;
}

With either Valgrind or ThreadSanitizer, detecting the data race in a single run is now a matter of chance.
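
A rough way to quantify this, sketched under two assumptions: each run is independent of the others, and the tool reports the race whenever both threads perform the write. If the per-run probability is p, the chance that at least one of n runs exposes the race is 1 - (1 - p)^n. The function name detection_chance is hypothetical.

```cpp
#include <cmath>

// Chance that at least one of `runs` independent runs has both threads
// perform the racy write, given the per-run probability `p`.
// Illustrative arithmetic only; it does not model the tools themselves.
double detection_chance(double p, int runs)
{
    return 1.0 - std::pow(1.0 - p, runs);
}
```

With p = 1/4 as in the example, detection_chance(0.25, 16) comes out at about 0.99, so repeating the test run a dozen or so times makes a miss unlikely. Note that the example seeds rand() with time(0), so runs started within the same second share a seed and do not count as independent.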

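Finally, the threading problems that Valgrind and ThreadSanitizer can detect are only a small fraction of what can go wrong. For many tricky multithreading problems, they are of no help either. A clean report from the tools does not guarantee correct code; writing thread-safe code still takes extra care.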