Troubleshooting cudaErrorCudartUnloading and Suggested Fixes

Date: 2019-11-11

Heads up: I am also looking for development positions in heterogeneous computing, high-performance computing, CUDA, or ARM optimization.

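For a while now I have been responsible for optimizing our company's neural network forward-pass (inference) framework library. A few days ago I received a bug report whose error message looked roughly like this: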

Program hit cudaErrorCudartUnloading (error 29) due to "driver shutting down" on CUDA API call to cudaFreeHost.

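Executables linked against the same library sometimes hit this problem and sometimes did not, so at first I naturally assumed the bug was in the application using the library. After ruling that out, the cudaFreeHost at the end of the message led me to assume it was a memory-related problem, and only after some fiddling did I realize I was on the wrong track yet again. I also found that no matter which CUDA runtime API I called just before the failing code, the same error appeared.
Later I searched online for related information. One bug report I found offered no concrete solution, but its similar call stack made me suspect it was the same problem I had run into, and it shifted my attention from cudaFreeHost to "driver shutting down".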

Forcibly preventing "driver shutting down"?

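A seemingly obvious first idea: can we keep the CUDA driver from being shut down while we are using the CUDA API? The question is what "driver shutting down" actually refers to. Taken literally, cudaErrorCudartUnloading most likely means that the CUDA runtime library has been unloaded.
Since we use shared libraries, I tried adding a dlopen before the failing call to force libcudart.so to be loaded. Right after making the change I realized this could not be right: if the shared library had really been unloaded, calling a CUDA API should fail because the relevant symbols are undefined. We should not be able to call into the library normally and then get an error code back as a runtime error.
In addition, strace showed that other shared libraries such as libcuda.so and libnvidia-fatbinaryloader.so are loaded as well, and trying them all one by one is not realistic. Besides, there are quite a few CUDA-related shared libraries (see "Chapter 5. Listing of Installed Components" in the NVIDIA Accelerated Linux Graphics Driver README and Installation Guide), and different programs depend on different ones, so even if this approach worked it would be hard to generalize.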

As it happens, a developer on the NVIDIA developer forum had a similar idea, and an NVIDIA representative rejected it:

For instance, can I have my class maintain certain variables/handles that will force cuda run time library to stay loaded.
No. It is a bad design practice to put calls to the CUDA runtime API in constructors that may run before main and destructors that may run after main.

How do we get the CUDA runtime API to work properly?

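As CUDA application developers, we usually issue our commands to the GPU device by calling the CUDA runtime API. So let us first look at which components get involved when a program calls the CUDA runtime API. I borrowed a figure from Nicholas Wilt's The CUDA Handbook: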

{% img http://galoisplusplus.coding.... %}

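As we can see, the main components are the CUDART (CUDA Runtime) library running in the operating system's user mode (as a shared library, this is the libcudart.so mentioned above), the CUDA driver library (as a shared library, the libcuda.so mentioned above), and the CUDA driver kernel modules running in kernel mode. Our CUDA applications run in user mode and cannot operate the GPU hardware directly; within the operating system, it is the kernel modules running in kernel mode that have the authority to control the GPU hardware. (Off topic: as CUDA users we rarely notice these kernel modules, and they are probably most noticeable when we hit a "Driver/library version mismatch" error XD.) On Linux we can list these kernel modules with lsmod | grep nvidia. Typically they include nvidia_uvm, which manages Unified Memory, nvidia_drm, the display driver for the Linux kernel's Direct Rendering Manager, and nvidia_modeset. The component that talks to these kernel modules is the CUDA driver library running in user mode: the CUDA runtime API calls we make are translated by the CUDART library into a series of CUDA driver API calls, which are handed to the CUDA driver library, the intermediary that connects the CUDA kernel modules with the other CUDA libraries running in user mode.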

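So for the instructions expressed by CUDA runtime API calls to reach the GPU properly, all of the components above have to cooperate. This naturally raises a question: while our program is running, when do these components start and stop working, and when are they initialized? Let us use strace to look at the system calls of a CUDA application:
First, shared libraries such as libcudart.so, libcuda.so, and libnvidia-fatbinaryloader.so are loaded. The file /proc/modules, which lists the kernel modules currently loaded, is read; since modules such as nvidia_uvm and nvidia_drm were already loaded, no additional insmod is needed. Next, the device parameter file /proc/driver/nvidia/params is read, and the relevant devices, such as /dev/nvidia0 (GPU 0), /dev/nvidia-uvm (judging by the name it is related to Unified Memory, possibly the Page Migration Engine of Pascal-architecture NVIDIA GPUs), and /dev/nvidiactl, are opened and configured through ioctl. (In addition, some files under ~/.nv/ComputeCache in the home directory are used. This directory caches the fat binaries produced by JIT-compiling PTX pseudo-assembly. It is unrelated to the problem at hand; interested readers can refer to Mark Harris's "CUDA Pro Tip: Understand Fat Binaries and JIT Caching".) For CUDA runtime API calls to execute properly, the shared library loading, kernel module loading, and GPU device setup described above all have to be completed.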

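But the above is only a necessary condition seen from the system call perspective. There is another condition that anyone who has written CUDA will be familiar with: the CUDA context (if it does not ring a bell, revisit the parts of the official CUDA guide on initialization and contexts). We all know that all CUDA resources (allocated memory, CUDA events, and so on) and operations are valid only within a CUDA context, and that on the first call to a CUDA runtime API, if the current device has no CUDA context yet, a new one is created as the primary context of the current device. These operations are hidden from users of the CUDA runtime API, so who performs them? Let me quote the community wiki answer to a question on Stack Overflow: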

The CUDA front end invoked by nvcc silently adds a lot of boilerplate code and translation unit scope objects which perform CUDA context setup and teardown. That code must run before any API calls which rely on a CUDA context can be executed. If your object containing CUDA runtime API calls in its destructor invokes the API after the context is torn down, your code may fail with a runtime error.

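This passage tells us a few things. First, nvcc inserts code that does the preparation needed for creating and destroying the CUDA context. Second, calling a CUDA runtime API after the CUDA context has been destroyed may lead to undefined behaviour (UB), such as a runtime error.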

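Next, let us dig a little deeper. Suppose we have several .o files produced by compiling .cu files with nvcc, and the executable exe produced by linking them. Inspecting the .o files with tools such as nm, it is easy to see that a function whose name starts with __sti____cudaRegisterAll_ has been inserted into the text segment of each of them. If we set breakpoints on the functions involved in gdb <exe> and single-step, we see call stacks like this: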

(gdb) bt
#0  0x00002aaab16695c0 in __cudaRegisterFatBinary () at /usr/local/cuda/lib64/libcudart.so.8.0
#1  0x00002aaaaad3eee1 in __sti____cudaRegisterAll_53_tmpxft_000017c3_00000000_19_im2col_compute_61_cpp1_ii_a0760701() ()
    at /tmp/tmpxft_000017c3_00000000-4_im2col.compute_61.cudafe1.stub.c:98
#2  0x00002aaaaaaba3a3 in _dl_init_internal () at /lib64/ld-linux-x86-64.so.2
#3  0x00002aaaaaaac46a in _dl_start_user () at /lib64/ld-linux-x86-64.so.2
#4  0x0000000000000001 in  ()
#5  0x00007fffffffe2a8 in  ()
#6  0x0000000000000000 in  ()

After a few more steps, the call stack becomes:

(gdb) bt
#0  0x00002aaab16692b0 in __cudaRegisterFunction () at /usr/local/cuda/lib64/libcudart.so.8.0
#1  0x00002aaaaad3ef3e in __sti____cudaRegisterAll_53_tmpxft_000017c3_00000000_19_im2col_compute_61_cpp1_ii_a0760701() (__T263=0x7c4b30)
    at /tmp/tmpxft_000017c3_00000000-4_im2col.compute_61.cudafe1.stub.c:97
#2  0x00002aaaaad3ef3e in __sti____cudaRegisterAll_53_tmpxft_000017c3_00000000_19_im2col_compute_61_cpp1_ii_a0760701() ()
    at /tmp/tmpxft_000017c3_00000000-4_im2col.compute_61.cudafe1.stub.c:98
#3  0x00002aaaaaaba3a3 in _dl_init_internal () at /lib64/ld-linux-x86-64.so.2
#4  0x00002aaaaaaac46a in _dl_start_user () at /lib64/ld-linux-x86-64.so.2
#5  0x0000000000000001 in  ()
#6  0x00007fffffffe2a8 in  ()
#7  0x0000000000000000 in  ()

(gdb) bt
#0  0x00002aaaaae8ea20 in atexit () at XXX.so
#1  0x00002aaaaaaba3a3 in _dl_init_internal () at /lib64/ld-linux-x86-64.so.2
#2  0x00002aaaaaaac46a in _dl_start_user () at /lib64/ld-linux-x86-64.so.2
#3  0x0000000000000001 in  ()
#4  0x00007fffffffe2a8 in  ()
#5  0x0000000000000000 in  ()

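So when is the CUDA context actually created? Setting a breakpoint on cuInit shows that, consistent with the official guide, it happens when the first CUDA runtime API is called after entering main: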

(gdb) bt
#0  0x00002aaab1ab7440 in cuInit () at /lib64/libcuda.so.1
#1  0x00002aaab167add5 in  () at /usr/local/cuda/lib64/libcudart.so.8.0
#2  0x00002aaab167ae31 in  () at /usr/local/cuda/lib64/libcudart.so.8.0
#3  0x00002aaabe416bb0 in pthread_once () at /lib64/libpthread.so.0
#4  0x00002aaab16ad919 in  () at /usr/local/cuda/lib64/libcudart.so.8.0
#5  0x00002aaab167700a in  () at /usr/local/cuda/lib64/libcudart.so.8.0
#6  0x00002aaab167aceb in  () at /usr/local/cuda/lib64/libcudart.so.8.0
#7  0x00002aaab16a000a in cudaGetDevice () at /usr/local/cuda/lib64/libcudart.so.8.0
...
#10 0x0000000000405d77 in main(int, char**) (argc=<optimized out>, argv=<optimized out>)

Several of the functions related to context creation are declared in ${CUDA_PATH}/include/crt/host_runtime.h:

#define __cudaRegisterBinary(X)                                                   \
        __cudaFatCubinHandle = __cudaRegisterFatBinary((void*)&__fatDeviceText); \
        { void (*callback_fp)(void **) =  (void (*)(void **))(X); (*callback_fp)(__cudaFatCubinHandle); }\
        atexit(__cudaUnregisterBinaryUtil)
       

extern "C" {
extern void** CUDARTAPI __cudaRegisterFatBinary(
  void *fatCubin
);

extern void CUDARTAPI __cudaUnregisterFatBinary(
  void **fatCubinHandle
);

extern void CUDARTAPI __cudaRegisterFunction(
        void   **fatCubinHandle,
  const char    *hostFun,
        char    *deviceFun,
  const char    *deviceName,
        int      thread_limit,
        uint3   *tid,
        uint3   *bid,
        dim3    *bDim,
        dim3    *gDim,
        int     *wSize
);
}

static void **__cudaFatCubinHandle;

static void __cdecl __cudaUnregisterBinaryUtil(void)
{
  ____nv_dummy_param_ref((void *)&__cudaFatCubinHandle);
  __cudaUnregisterFatBinary(__cudaFatCubinHandle);
}

However, none of these functions are documented. Dr. Yong Li's "GPGPU-SIM Code Study" is somewhat more detailed, so I will simply quote it here:

The simplest way to look at how nvcc compiles the ECS (Execution Configuration Syntax) and manages kernel code is to use nvcc’s --cuda switch. This generates a .cu.c file that can be compiled and linked without any support from NVIDIA proprietary tools. It can be thought of as CUDA source files in open source C. Inspection of this file verified how the ECS is managed, and showed how kernel code was managed.

Device code is embedded as a fat binary object in the executable’s .rodata section. It has variable length depending on the kernel code.

For each kernel, a host function with the same name as the kernel is added to the source code.

Before main(..) is called, a function called cudaRegisterAll(..) performs the following work:

• Calls a registration function, cudaRegisterFatBinary(..), with a void pointer to the fat binary data. This is where we can access the kernel code directly.

• For each kernel in the source file, a device function registration function, cudaRegisterFunction(..), is called. With the list of parameters is a pointer to the function mentioned in step 2.

As aforementioned, each ECS is replaced with the following function calls from the execution management category of the CUDA runtime API.

• cudaConfigureCall(..) is called once to set up the launch configuration.

• The function from the second step is called. This calls another function, in which, cudaSetupArgument(..) is called once for each kernel parameter. Then, cudaLaunch(..) launches the kernel with a pointer to the function from the second step.

An unregister function, cudaUnregisterBinaryUtil(..), is called with a handle to the fatbin data on program exit.

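Of these, cudaConfigureCall, cudaSetupArgument, and cudaLaunch have been deprecated since CUDA 7.5. Since they are not APIs that get called before entering main, we can ignore them. What we need to pay attention to is that, before main is called, the internal initialization code added by nvcc does the following (we can confirm this against the interfaces exposed by the host_runtime.h header above and the related call stacks):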

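• Register the fat binary entry function through __cudaRegisterFatBinary. This is part of the preparation for CUDA context creation; calling a CUDA runtime API before __cudaRegisterFatBinary has run will most likely also result in UB. There is a question on Stack Overflow where the asker launched a kernel in the constructor of a static object and got an "invalid device function" error. The answer by talonmies, a CUDA expert on Stack Overflow, examines the order in which static object constructors and __cudaRegisterFatBinary are called and the problems this causes. It is highly recommended reading.
• Register each device kernel function through __cudaRegisterFunction.
• Register the unregistration function __cudaUnregisterBinaryUtil through atexit. This function is part of the cleanup for CUDA context destruction. As mentioned earlier, once the CUDA context is destroyed the CUDA runtime API most likely can no longer be used normally; in other words, calling a CUDA runtime API after __cudaUnregisterBinaryUtil has run may be UB. When __cudaUnregisterBinaryUtil gets called follows the rules of atexit: it is called at some stage of program exit after main has finished (for how main is executed, see the article linked from the original post). This is the key to understanding and solving the cudaErrorCudartUnloading problem.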

{% img http://galoisplusplus.coding.... %}
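
To make the atexit rule concrete, here is a minimal C++ sketch with no CUDA dependency. It relies on the standard C++ rule that functions registered with std::atexit and destructors of static objects run in reverse order of registration and construction. Tracer and register_exit_handlers are hypothetical names made up for this illustration; they are not part of CUDA or of our library.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper type: prints a message when it is destroyed.
struct Tracer {
    const char* name;
    explicit Tracer(const char* n) : name(n) { assert(n != nullptr); }
    ~Tracer() { std::printf("destroy %s\n", name); }
};

static void first_handler()  { std::printf("handler registered first\n"); }
static void second_handler() { std::printf("handler registered second\n"); }

// Registers two exit handlers, with a function-local static object
// constructed in between. Returns the number of successful registrations.
// Intended to be called once.
int register_exit_handlers() {
    int ok = 0;
    if (std::atexit(first_handler) == 0) ++ok;
    static Tracer between("static object constructed in between");
    if (std::atexit(second_handler) == 0) ++ok;
    return ok;
}
```

Calling register_exit_handlers() once from main should print "handler registered second", then "destroy static object constructed in between", then "handler registered first" when the program exits: whatever was registered or constructed last is torn down first.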

It is all the global object's fault

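Once you have digested my long-winded theory above, tracking down the cudaErrorCudartUnloading problem in the code becomes simple. It turned out to be similar to the Stack Overflow question mentioned earlier: our code also used a global static singleton object, and the singleton's destructor called CUDA runtime APIs to free memory and so on. As we know, static objects are destroyed during exit after main has finished, and as noted above __cudaUnregisterBinaryUtil is also called during this stage; the order of the two is undefined. If context cleanup such as __cudaUnregisterBinaryUtil runs before the static object is destroyed, the cudaErrorCudartUnloading error is produced. This UB also explains why, among the different executables linked against our library, some hit the problem and others did not.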
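
One way the ordering can go wrong can be reproduced without a GPU. The sketch below is only a model: it assumes a fake runtime that initializes lazily on first use and registers its own teardown with atexit at that moment, which is an assumption for illustration and not a statement about how CUDART is implemented. FakeError, fake_free, fake_teardown, and LeakySingleton are all made-up names.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Stand-ins for a GPU runtime. None of these names are real CUDA APIs;
// the value 29 only mirrors the "error 29" from the bug report.
enum FakeError { kFakeSuccess = 0, kFakeUnloading = 29 };

static bool g_initialized = false;
static bool g_torn_down = false;

static void fake_teardown() {
    assert(g_initialized);  // teardown is only registered after initialization
    g_torn_down = true;
}

// ASSUMPTION: the fake runtime initializes lazily on first use and registers
// its own teardown with atexit at that moment.
FakeError fake_free() {
    if (g_torn_down) return kFakeUnloading;
    if (!g_initialized) {
        g_initialized = true;
        std::atexit(fake_teardown);
    }
    return kFakeSuccess;
}

class LeakySingleton {
public:
    static LeakySingleton& instance() {
        static LeakySingleton s;  // constructed before the runtime is first used
        return s;
    }
    FakeError release() { return fake_free(); }

private:
    LeakySingleton() = default;
    ~LeakySingleton() {
        // fake_teardown was registered after construction finished, so it has
        // already run by the time this destructor is called.
        std::printf("destructor got error %d\n", static_cast<int>(fake_free()));
    }
};
```

If main calls LeakySingleton::instance().release() once, the call succeeds during main, but at exit the fake teardown runs first and the destructor then reports error 29. In this model, whether the failure appears depends only on what happened to be initialized first.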

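Solutions

Searching GitHub for patches related to cudaErrorCudartUnloading turns up all kinds of approaches. Here are a few.

Skip the `cudaErrorCudartUnloading` check

For example, this patch from the arrayfire project. Sure, very laid-back (smirk)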

-    CUDA_CHECK(cudaFree(ptr));
+    cudaError_t err = cudaFree(ptr);
+    if (err != cudaErrorCudartUnloading) // see issue #167
+        CUDA_CHECK(err);
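
The same "tolerate only this one status" idea can be sketched in plain C++. This is not arrayfire's code: StatusCode, check_tolerating_unload, and count_successes are hypothetical names, and the numeric values are stand-ins chosen to mirror the "error 29" in the report.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical status codes; not the real cudaError_t enum.
enum StatusCode { kOk = 0, kOutOfMemory = 2, kRuntimeUnloading = 29 };

// Returns true on success, false when the runtime is already unloading
// (that status is swallowed), and throws for every other failure.
bool check_tolerating_unload(StatusCode err) {
    if (err == kOk) return true;
    if (err == kRuntimeUnloading) return false;
    throw std::runtime_error("runtime call failed with status " +
                             std::to_string(static_cast<int>(err)));
}

// Usage: count how many of the given results were real successes.
int count_successes(const StatusCode* results, int n) {
    assert(results != nullptr && n >= 0);
    int ok = 0;
    for (int i = 0; i < n; ++i) {
        if (check_tolerating_unload(results[i])) ++ok;
    }
    return ok;
}
```

The trade-off is visible in the return value: the caller can still see that a release did not really happen, but nothing is done about it.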

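Simply remove the CUDA runtime API calls that might hit `cudaErrorCudartUnloading`

For example, this issue and PR in the kaldi project. When it comes to being laid-back, you take the crown (smirk)

Move the CUDA runtime API calls into a separate de-initialisation or finalize interface and have users call it before `main` `return`s

For example, MXNotifyShutdown in the MXNet project (see c_api.cc). After all that laid-back handling, here at last is an "elegant" solution that suits this programmer's taste (smirk)

As it happens, in another Stack Overflow question, talonmies (aha, talonmies again!) expressed the same idea in a comment, and I could not agree more: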

The obvious answer is don't put CUDA API calls in the destructor. In your class you have an explicit intialisation method not called through the constructor, so why not have an explicit de-initialisation method as well? That way scope becomes a non-issue
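
A minimal sketch of that explicit init/de-init shape, in plain C++ so it runs without a GPU. ResourcePool and run_with_explicit_lifetime are hypothetical names; the places where real code would call the CUDA runtime are marked only by comments.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical resource owner with explicit lifetime methods.
class ResourcePool {
public:
    // Acquire resources. Real code would make its runtime allocation calls here.
    bool init() {
        ready_ = true;
        return true;
    }

    // Release resources. Safe to call more than once.
    void shutdown() {
        if (!ready_) return;
        // Real code would call the runtime's free functions here.
        ready_ = false;
        ++release_count_;
    }

    bool ready() const { return ready_; }
    int release_count() const { return release_count_; }

    ~ResourcePool() {
        // No runtime calls in the destructor: only report a missed shutdown().
        if (ready_) std::fprintf(stderr, "warning: shutdown() was never called\n");
    }

private:
    bool ready_ = false;
    int release_count_ = 0;
};

// Usage: acquire and release inside one scope, as a caller would inside main.
int run_with_explicit_lifetime() {
    ResourcePool pool;
    if (!pool.init()) return -1;
    assert(pool.ready());
    pool.shutdown();
    pool.shutdown();  // second call is a no-op
    return pool.release_count();
}
```

Because the destructor never touches the runtime, it no longer matters when a global instance of such a class is destroyed; the cost is that the caller has to remember to call shutdown().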

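The approach above is "elegant", but it leaves the library maintainer with a new worry: what if adding an interface starts a fight with users? (smirk) What if users simply ignore you and never call it before main returns? If you tell them they are using it wrong, they can turn around and criticize the library for not being robust enough. If you share this worry, read on for the "black magic" below.

A homegrown hack (smirk)

First of all, CUDA runtime API calls still cannot go in the destructor of a global object, so where should they go? After all, we do not know which API the library user will call last. What we can know, however, is which APIs the library user calls while inside the scope of main, where a valid CUDA context can be created and CUDA runtime APIs work normally. What does that have to do with the CUDA runtime API calls in our destructor? As you may recall, the internal initialization code added by nvcc registers the unregistration function __cudaUnregisterBinaryUtil through atexit, and we can do the same: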

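// First call a "harmless" CUDA runtime API so that the CUDA context has been created before `atexit` is called.
// This ensures that the function we register with `atexit` runs before the CUDA context teardown functions (such as `__cudaUnregisterBinaryUtil`).
// "Harmless" here means a function with no side effects such as increased memory usage; I used `cudaGetDeviceCount`.
// The CUDA Handbook recommends `cudaFree(0);` to make CUDART initialize the CUDA context, which also works.
int gpu_num;
cudaError_t err = cudaGetDeviceCount(&gpu_num);

std::atexit([](){
    // Call the CUDA runtime APIs that used to be in the global object's destructor
});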

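So where should the code above be inserted? Whoever tied the knot must untie it: our cudaErrorCudartUnloading problem came from the static singleton object, but the singleton's lazy initialization below also gives us a perfect entry point:

// Off topic: readers as long in the tooth as I am may remember
// how painful it used to be to implement a thread-safe singleton in C++,
// with slightly nasty idioms such as Double-Checked Locking.
// Since C++11, though, my back no longer aches, my legs no longer cramp, and I have the energy to code again (smirk).
// The implementation below is guaranteed to be thread-safe by the C++11 standard.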
static Singleton& instance()
{
     static Singleton s;
     return s;
}

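Because library users only access the singleton object through this interface from within main, all we need to do is initialize the CUDA context and register the cleanup function with atexit inside this interface. Of course, as a rigorous library author you may ask: we cannot count on library users at all, so what if someone calls it during the initialization of some global variable? Bingo! All I can say is that with our current business workflow, library users would not want to write it that way and make life hard for themselves... (facepalm) If some user really does that, this method stops working, and the user will see an error similar to the one in the Stack Overflow question mentioned earlier. After all, black magic is not a cure-all!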
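
Putting the two pieces together, the hack might look like the following GPU-free sketch. It is a model under a stated assumption: the mock runtime initializes lazily and registers its own teardown with atexit on first use, which is not claimed to be how CUDART works internally. MockError, mock_call, mock_teardown, and SafeSingleton are hypothetical names; in real code the "harmless" call would be something like cudaGetDeviceCount.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Stand-ins for a GPU runtime; none of these names are real CUDA APIs.
enum MockError { kMockSuccess = 0, kMockUnloading = 29 };

static bool g_mock_initialized = false;
static bool g_mock_torn_down = false;

static void mock_teardown() { g_mock_torn_down = true; }

// ASSUMPTION: the mock runtime initializes lazily on first use and registers
// its own teardown with atexit at that moment.
MockError mock_call() {
    if (g_mock_torn_down) return kMockUnloading;
    if (!g_mock_initialized) {
        g_mock_initialized = true;
        std::atexit(mock_teardown);
    }
    return kMockSuccess;
}

class SafeSingleton {
public:
    static SafeSingleton& instance() {
        static SafeSingleton s;
        // Runs exactly once, after s has been fully constructed.
        static const bool registered = register_cleanup();
        (void)registered;
        return s;
    }
    static bool cleanup_registered() { return cleanup_registered_; }

private:
    SafeSingleton() = default;
    ~SafeSingleton() = default;  // deliberately makes no runtime calls

    static bool register_cleanup() {
        mock_call();  // 1. "harmless" call: the runtime registers its teardown first
        // 2. Registered later, so at exit it runs before mock_teardown.
        cleanup_registered_ = (std::atexit(release_at_exit) == 0);
        return cleanup_registered_;
    }

    static void release_at_exit() {
        const MockError err = mock_call();
        assert(err == kMockSuccess);  // the runtime is still alive at this point
        std::printf("cleanup got error %d\n", static_cast<int>(err));
    }

    static bool cleanup_registered_;
};

bool SafeSingleton::cleanup_registered_ = false;
```

The cleanup is registered from instance() after the static object is constructed, not from the constructor, so the handler is guaranteed to run while the object still exists. If instance() is first called from within main, the cleanup should report error 0 at exit.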

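Afterword

After solving the cudaErrorCudartUnloading problem, I was handed another firefighting task: tracking down a crash caused by using a hardware dongle API. The dongle problem and cudaErrorCudartUnloading seem completely unrelated, yet in essence they turned out to be similar: again the same kind of UB, again a global object, and again dongle API calls made during the construction and destruction of a global object, with an undefined order relative to the dongle's internal initialization and teardown functions. It seems that staying out of such pits still takes some basic common sense: when using interfaces related to peripheral devices, make sure the calls happen within the scope of main!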