Debugging story: a global variable destructed more than once

Date: 2019-12-07


Our team has a server program written in C++. We recently found that it crashes every time it exits. The stack from the core dump file looks like this:

(gdb) bt
#0 0x0000003ea4e32925 in raise () from /lib64/libc.so.6
#1 0x0000003ea4e34105 in abort () from /lib64/libc.so.6
#2 0x0000003ea4e70837 in __libc_message () from /lib64/libc.so.6
#3 0x0000003ea4e76166 in malloc_printerr () from /lib64/libc.so.6
#4 0x0000003ea729d4c9 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() ()
    from /usr/lib64/libstdc++.so.6
#5 0x0000003ea4e35e22 in exit () from /lib64/libc.so.6
#6 0x0000003ea4e1ed24 in __libc_start_main () from /lib64/libc.so.6
#7 0x0000000000400629 in _start ()

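Here is how I tracked down the offending code.

Note that this stack is incomplete because of compiler optimization. With the debug symbols installed, the stack would look like this: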

(gdb) bt
#0 0x0000003ea4e32925 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x0000003ea4e34105 in abort () at abort.c:92
#2 0x0000003ea4e70837 in __libc_message (do_abort=2, fmt=0x3ea4f58aa0 "*** glibc detected *** %s: %s: 0x%s ***\n")
    at ../sysdeps/unix/sysv/linux/libc_fatal.c:198
#3 0x0000003ea4e76166 in malloc_printerr (action=3, str=0x3ea4f58d48 "double free or corruption (fasttop)",
    ptr=<value optimized out>) at malloc.c:6336
#4 0x0000003ea729d4c9 in _M_dispose (this=<value optimized out>, __in_chrg=<value optimized out>)
    at /usr/src/debug/gcc-4.4.7-20120601/obj-x86_64-redhat-linux/x86_64-redhat-linux/libstdc++-v3/include/bits/basic_string.h:236
#5 std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string (this=<value optimized out>,
    __in_chrg=<value optimized out>)
    at /usr/src/debug/gcc-4.4.7-20120601/obj-x86_64-redhat-linux/x86_64-redhat-linux/libstdc++-v3/include/bits/basic_string.h:503
#6 0x0000003ea4e35e22 in __run_exit_handlers (status=0) at exit.c:78
#7 exit (status=0) at exit.c:100
#8 0x0000003ea4e1ed24 in __libc_start_main (main=0x4006f0 <main(int, char**)>, argc=1, ubp_av=0x7fffffffe488,
    init=<value optimized out>, fini=<value optimized out>, rtld_fini=<value optimized out>, stack_end=0x7fffffffe478)
    at libc-start.c:258
#9 0x0000000000400629 in _start ()

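Unfortunately, the glibc on our servers is a custom build and no debug symbols were available for it, so I had to work in the dark. The crash happens after main returns: exit() runs the destructors of the global variables and then the process dies. It looked related to freeing memory, but even valgrind gave me nothing useful. All I knew was that the same block of memory was freed twice, that it belonged to a std::string, and that the string was a global variable. But our codebase has several hundred thousand lines, so where to start? The first task was to find the pointer to this std::string and the contents of the string it holds.

So I first switched to the frame that belongs to std::basic_string: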

(gdb) f 5
#5 std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string (this=<value optimized out>,
    __in_chrg=<value optimized out>)
    at /usr/src/debug/gcc-4.4.7-20120601/obj-x86_64-redhat-linux/x86_64-redhat-linux/libstdc++-v3/include/bits/basic_string.h:503
503 { _M_rep()->_M_dispose(this->get_allocator()); }
(gdb) p *this
No symbol "this" in current context.

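Unfortunately the this pointer had been optimized out. If I could get the value of the this pointer, gdb's info sym command would give me the symbol name directly, and I would know which variable was at fault. Either way, the first step was to stop the process here.

But every time I tried to set a breakpoint on this function, gdb itself crashed: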

../../gdb/breakpoint.c:5721: internal-error: set_raw_breakpoint: Assertion `sal.pspace != NULL' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n) y

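I later worked out how to set the breakpoint without crashing gdb (break on the address directly with b *addr), but that came afterwards. Even with a breakpoint here, it would have been hard to attach a condition to it, and this function is called hundreds of times during exit.

So I set the breakpoint one level further down instead: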

(gdb) b _ZNSs4_Rep10_M_destroyERKSaIcE
Breakpoint 1 at 0x3ea729c320

Now, each time execution stops here, the this pointer and what it points to are visible:

Breakpoint 1, std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Rep::_M_destroy (this=0x601e10,
    __a=...)
    at /usr/src/debug/gcc-4.4.7-20120601/obj-x86_64-redhat-linux/x86_64-redhat-linux/libstdc++-v3/include/bits/basic_string.tcc:450
450 _Raw_bytes_alloc(__a).deallocate(reinterpret_cast<char*>(this), __size);
(gdb) p *this
$1 = {<std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Rep_base> = {_M_length = 3,
    _M_capacity = 3, _M_refcount = -1}, static _S_max_size = 4611686018427387897, static _S_terminal = 0 '\000',
  static _S_empty_rep_storage = {0, 0, 0, 0}}

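_M_refcount is a reference counter. In the normal case, _M_refcount equals -1 by the time execution reaches this point. If the string is being destructed a second time, _M_refcount is something other than -1. I wanted the process to stop exactly when that abnormal case occurs, but this function is called far too often to check by hand, so I added a conditional breakpoint to catch it.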
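Why -1 is the normal value and -2 signals a repeat can be seen in a small model. The sketch below is not the libstdc++ source: RepModel, dispose, destroy and refcount_after_disposes are made-up names, and the rule "decrement, then free if the value before the decrement was <= 0" is a simplified reading of what _M_dispose does in this version of the library.

```cpp
// Simplified model of the reference-counted string header. All names are
// illustrative; this is not the real libstdc++ declaration.
struct RepModel {
    int refcount = 0;       // 0 means "exactly one owner" in this scheme
    int destroy_calls = 0;  // how many times the buffer would have been freed
    int last_seen = 0;      // refcount observed inside the latest destroy()

    // Stands in for _M_destroy: records the call instead of freeing memory.
    void destroy() {
        ++destroy_calls;
        last_seen = refcount;
    }

    // Stands in for _M_dispose: decrement, and free the buffer when the
    // value before the decrement was <= 0.
    void dispose() {
        int before = refcount;
        refcount = before - 1;
        if (before <= 0) destroy();
    }
};

// Runs the destructor path `times` times on one header and returns the
// refcount that the last destroy() call observed.
int refcount_after_disposes(int times) {
    RepModel rep;
    for (int i = 0; i < times; ++i) rep.dispose();
    return rep.last_seen;
}
```

In this model refcount_after_disposes(1) yields -1, the healthy case, while refcount_after_disposes(2) yields -2, which is the value the conditional breakpoint is meant to catch.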

(gdb) b _ZNSs4_Rep10_M_destroyERKSaIcE if this->_M_refcount==-2

That failed. When the program ran, gdb reported:

Error in testing breakpoint condition:
value has been optimized out

So I dumped the object's memory to work out the offset of this member:

(gdb) x/16 this
0x601e10: 3 0 3 0
0x601e20: -1 0 7895160 0
0x601e30: 0 0 131537 0
0x601e40: 0 0 0 0

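The output shows that _M_refcount sits 4 ints past the address in the this pointer. At this breakpoint the this pointer is held in $edi, so this->_M_refcount lives at (int*)$edi+4. (Strictly speaking, on x86-64 the full pointer is in $rdi and $edi is only its low 32 bits; that is enough here because these heap addresses fit in 32 bits.)

So the condition "if this->_M_refcount==-2" can be rewritten as "if *((int*)$edi+4)!=-1":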

(gdb) b _ZNSs4_Rep10_M_destroyERKSaIcE if *((int*)$edi+4)!=-1
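The "4 ints" offset can be double-checked outside the debugger. The sketch below assumes a 64-bit Linux target where size_t is 8 bytes and int is 4 bytes; RepHeader is a hypothetical stand-in that only mirrors the three fields gdb printed, and the two helper functions are invented for this illustration.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the three header fields shown by gdb
// (_M_length, _M_capacity, _M_refcount). Not the real declaration.
struct RepHeader {
    std::size_t length;
    std::size_t capacity;
    int refcount;
};

// Index of refcount when the header is viewed as an array of int,
// i.e. the N in *((int*)ptr + N).
constexpr std::size_t refcount_int_index() {
    return offsetof(RepHeader, refcount) / sizeof(int);
}

// Reads refcount through a raw byte offset, the same way the gdb
// condition does, instead of going through the member name.
int read_refcount_like_gdb(const RepHeader* rep) {
    int value = 0;
    std::memcpy(&value,
                reinterpret_cast<const char*>(rep) +
                    refcount_int_index() * sizeof(int),
                sizeof value);
    return value;
}
```

Under those assumptions the two 8-byte fields take up 16 bytes, so refcount_int_index() comes out as 4, matching the offset read off the x/16 dump.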

This time the breakpoint caught it:

Breakpoint 12, std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Rep::_M_destroy (this=0x4b4b510, __a=...) at /usr/src/debug/gcc-4.4.7-20120601/obj-x86_64-redhat-linux/x86_64-redhat-linux/libstdc++-v3/include/bits/basic_string.tcc:450
450 _Raw_bytes_alloc(__a).deallocate(reinterpret_cast<char*>(this), __size);
$4201 = {
  <std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Rep_base> = {
    _M_length = 78993104,
    _M_capacity = 30,
    _M_refcount = -2
  members of std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Rep:
  static _S_max_size = 4611686018427387897,
  static _S_terminal = 0 '\000',
  static _S_empty_rep_storage = {0, 0, 0, 0}
}

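But where are the contents of the string? A look at the library source showed that they are stored immediately after the _Rep object. gdb's x command can print them directly, but I wanted something easier on the eyes, a dump laid out the way a memory editor such as 金山遊俠 shows it, and I found a trick for that online.

First, define a gdb command named xxd: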

(gdb) define xxd
Redefine command "xxd"? (y or n) y
Type commands for definition of "xxd".
End with a line saying just "end".
>dump binary memory dump.bin $arg0 ((char*)$arg0)+$arg1
>shell xxd dump.bin
>end

The first argument is the start address of the memory to print and the second is the length of the region in bytes.

Then invoke the command:

(gdb) xxd this 64
0000000: 0000 0000 0000 0000 1e00 0000 0000 0000  ................
0000010: feff ffff 0000 0000 6170 706c 6963 6174  ........applicat
0000020: 696f 6e2f 786d 6c3b 2063 6861 7273 6574  ion/xml; charset
0000030: 3d55 5446 2d38 0000 b101 0200 0000 0000  =UTF-8..........
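To make the dump easier to read, here is a rough reconstruction of the layout it shows: a header followed directly by the characters. This is a sketch under assumptions (64-bit little-endian target, 8-byte size_t, 4-byte int); RepHeader and the three helper functions are hypothetical names, not library code.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for the header fields; not the real declaration.
struct RepHeader {
    std::size_t length;
    std::size_t capacity;
    int refcount;
};

// Builds one contiguous buffer: the header first, then the characters
// immediately after it, then a terminating zero byte.
std::vector<unsigned char> make_rep_buffer(const std::string& text, int refcount) {
    RepHeader header{};  // value-initialized so padding bytes are zero
    header.length = text.size();
    header.capacity = text.size();
    header.refcount = refcount;
    std::vector<unsigned char> buf(sizeof header + text.size() + 1, 0);
    std::memcpy(buf.data(), &header, sizeof header);
    std::memcpy(buf.data() + sizeof header, text.data(), text.size());
    return buf;
}

// Returns the text that starts right after the header.
std::string text_after_header(const std::vector<unsigned char>& buf) {
    return std::string(
        reinterpret_cast<const char*>(buf.data() + sizeof(RepHeader)));
}

// Formats `count` bytes starting at `begin` as lowercase hex digits.
std::string hex_bytes(const std::vector<unsigned char>& buf,
                      std::size_t begin, std::size_t count) {
    std::string out;
    char tmp[3];
    for (std::size_t i = begin; i < begin + count && i < buf.size(); ++i) {
        std::snprintf(tmp, sizeof tmp, "%02x", static_cast<unsigned>(buf[i]));
        out += tmp;
    }
    return out;
}
```

With the 30-character string from the dump and a refcount of -2, the capacity byte at offset 8 is 1e, the four bytes at offset 16 are feff ffff, and the text begins at offset 0x18, which lines up with the xxd output.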

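It is plain to see: the string "application/xml; charset=UTF-8" was destructed twice. From there I grepped the code for it. It is a pity I could not get the address of the std::string object itself; with that, info sym (addr) would have given me the symbol name directly and no grep would have been needed.

Below is the offending code, boiled down to a minimal example:

main.cpp: it only loads two shared objects and does nothing else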

#include <stdio.h>
#include <dlfcn.h>

int main(int argc,char* argv[]){
  void* h1=dlopen("./libt1.so",RTLD_NOW|RTLD_GLOBAL);
  if (!h1) {
    fprintf(stderr, "%s:%d load t1 fail %s\n", __FILE__,(int)__LINE__,dlerror());
    return -1;
  }
  void* h2=dlopen("./libt2.so",RTLD_NOW|RTLD_GLOBAL);
  if (!h2) {
    fprintf(stderr, "%s:%d load t2 fail %s\n", __FILE__,(int)__LINE__,dlerror());
    dlclose(h1);
    return -1;
  }
  return 0;
}

libt1.cpp: the source file for libt1.so

#include <string>

std::string msg("application/xml; charset=UTF-8");

libt2.cpp: the source file for libt2.so, identical to libt1.cpp

Build commands

$ g++ -g3 -o t main.cpp -ldl

$ g++ -O2 -fPIC -shared -g3 -o libt1.so libt1.cpp

$ g++ -O2 -fPIC -shared -g3 -o libt2.so libt1.cpp

Then run the main program. It crashes every time.

$ ./t

Aborted (core dumped)

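The reason is that msg is a global, exported variable. Both libraries are loaded with RTLD_GLOBAL, so when the symbol is resolved the dynamic linker binds msg in both of them to one and the same object, and one copy shadows the other (under the usual ELF lookup order, it is the copy from the library loaded first that wins). On top of that, each shared object registers a hook with exit when it is loaded, because the global has to be destructed after main returns. exit keeps its hooks in a list, and each of the two shared objects adds one entry. So when exit runs, both entries destruct the same msg, and the memory it points to is freed twice. I have not explained this very clearly; you can try the code above yourself, it only takes a moment.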
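The sequence of events can be replayed in a toy model without dlopen. Everything in the sketch below is hypothetical (ToyString, ToyProcess, max_frees); it imitates only the bookkeeping: each library's static initializer constructs msg at whatever address the symbol resolved to and registers one exit handler for that address. Which of the two copies wins the symbol lookup does not matter to the model, only that both initializers end up with the same address.

```cpp
#include <map>
#include <vector>

// Stands in for std::string: it only remembers which buffer it owns.
struct ToyString {
    int buffer_id = 0;
};

// Toy process state; all names are made up for this illustration.
struct ToyProcess {
    int next_buffer = 1;
    std::map<int, int> free_count;          // buffer id -> times freed
    std::vector<ToyString*> exit_handlers;  // objects to destruct at exit

    // What one library's static initializer does for msg: construct the
    // object at the resolved address, then register its destructor.
    void run_static_init(ToyString* resolved_msg) {
        resolved_msg->buffer_id = next_buffer++;  // "allocate" a new buffer
        exit_handlers.push_back(resolved_msg);
    }

    // Runs the handlers in reverse registration order and returns the
    // highest number of times any single buffer was freed.
    int run_exit() {
        int worst = 0;
        for (auto it = exit_handlers.rbegin(); it != exit_handlers.rend(); ++it) {
            int n = ++free_count[(*it)->buffer_id];
            if (n > worst) worst = n;
        }
        return worst;
    }
};

// shared_symbol == true models both libraries binding msg to one object;
// false models each library using its own private copy.
int max_frees(bool shared_symbol) {
    ToyString copy_a, copy_b;
    ToyProcess process;
    process.run_static_init(&copy_a);
    process.run_static_init(shared_symbol ? &copy_a : &copy_b);
    return process.run_exit();
}
```

In the model, max_frees(true) reports a buffer freed twice, and the buffer from the first construction is never freed at all, while max_frees(false) frees each buffer exactly once.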

This article is from: https://www.sunchangming.com/blog/post/4648.html

Source: Linux教程網
