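After the server went live, it crashed three times. It turned out to be a fairly typical memory-bug debugging experience, so I am writing it down for future reference. Below are the stack traces from the coredumps of the three crashes.
Coredump from the first crash: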
#0 0x00007f6f02d845f7 in raise () from /lib64/libc.so.6
#1 0x00007f6f02d85ce8 in abort () from /lib64/libc.so.6
#2 0x00007f6f02dc4317 in __libc_message () from /lib64/libc.so.6
#3 0x00007f6f02dcd86d in _int_malloc () from /lib64/libc.so.6
#4 0x00007f6f02dce8dc in malloc () from /lib64/libc.so.6
#5 0x00007f6f038a30cd in operator new(unsigned long) () from /lib64/libstdc++.so.6
#6 0x00007f6f03901c69 in std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) () from /lib64/libstdc++.so.6
#7 0x00007f6f03903521 in char* std::string::_S_construct<char const*>(char const*, char const*, std::allocator<char> const&, std::forward_iterator_tag) () from /lib64/libstdc++.so.6
#8 0x00007f6f03903958 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&) () from /lib64/libstdc++.so.6
#9 0x00000000004b8880 in DataMng::WritePlayerBaseData(slither::PBPlayerBase const&) ()
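The stack shows that a game-logic function was constructing a string object and the crash happened inside malloc, starting at _int_malloc. Frames #0 to #3 show _int_malloc reporting an error through __libc_message and then calling abort.
Coredump from the second crash: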
#0 0x00007f381f9ffa1d in malloc_consolidate () from /lib64/libc.so.6
#1 0x00007f381fa01ea5 in _int_malloc () from /lib64/libc.so.6
#2 0x00007f381fa038dc in malloc () from /lib64/libc.so.6
#3 0x00007f38207f2845 in my_malloc (size=size@entry=8160, my_flags=my_flags@entry=1040) at /export/home3/pb2/build/sb_0-19016729-1464156289.52/rpm/BUILD/mysql-5.7.13/mysql-5.6.25/mysys/my_malloc.c:38
#4 0x00007f38207f0cfc in alloc_root (mem_root=mem_root@entry=0x1a40c30, length=length@entry=24) at /export/home3/pb2/build/sb_0-19016729-1464156289.52/rpm/BUILD/mysql-5.7.13/mysql-5.6.25/mysys/my_alloc.c:224
#5 0x00007f38207c1e78 in cli_read_rows (mysql=mysql@entry=0x1a0d960, mysql_fields=mysql_fields@entry=0x0, fields=7) at /export/home3/pb2/build/sb_0-19016729-1464156289.52/rpm/BUILD/mysql-5.7.13/mysql-5.6.25/sql-common/client.c:1544
#6 0x00007f38207c306d in cli_read_query_result (mysql=0x1a0d960) at /export/home3/pb2/build/sb_0-19016729-1464156289.52/rpm/BUILD/mysql-5.7.13/mysql-5.6.25/sql-common/client.c:4144
#7 0x00007f38207c48f6 in mysql_real_query (mysql=0x1a0d960, query=<optimized out>, length=<optimized out>) at /export/home3/pb2/build/sb_0-19016729-1464156289.52/rpm/BUILD/mysql-5.7.13/mysql-5.6.25/sql-common/client.c:4181
#8 0x00000000004f90b3 in cputil::CMysqlConnection::Execute(char const*, cputil::CRecordSet&) ()
#9 0x00000000004b731b in DataMng::FindPlayerBaseInfoByPlayerId(unsigned long, slither::PBPlayerBase*) ()
The second core, in fact, also crashed inside malloc.
Coredump from the third crash:
#0 0x00007f48bc609d28 in _int_free () from /lib64/libc.so.6
#1 0x00000000004f49a1 in cpnet::Connection::HandleSend(boost::system::error_code const&, unsigned long) ()
#2 0x00000000004f7197 in boost::asio::detail::write_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::stream_socket_service<boost::asio::ip::tcp> >, boost::asio::const_buffers_1, boost::asio::detail::transfer_all_t, boost::_bi::bind_t<void, boost::_mfi::mf2<void, cpnet::Connection, boost::system::error_code const&, unsigned long>, boost::_bi::list3<boost::_bi::value<boost::shared_ptr<cpnet::Connection> >, boost::arg<1> (*)(), boost::arg<2> (*)()> > >::operator()(boost::system::error_code const&, unsigned long, int) ()
#3 0x00000000004f74ff in boost::asio::detail::reactive_socket_send_op<boost::asio::const_buffers_1, boost::asio::detail::write_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::stream_socket_service<boost::asio::ip::tcp> >, boost::asio::const_buffers_1, boost::asio::detail::transfer_all_t, boost::_bi::bind_t<void, boost::_mfi::mf2<void, cpnet::Connection, boost::system::error_code const&, unsigned long>, boost::_bi::list3<boost::_bi::value<boost::shared_ptr<cpnet::Connection> >, boost::arg<1> (*)(), boost::arg<2> (*)()> > > >::do_complete(boost::asio::detail::task_io_service*, boost::asio::detail::task_io_service_operation*, boost::system::error_code const&, unsigned long) ()
#4 0x00000000004e8f90 in boost::asio::detail::task_io_service::run(boost::system::error_code&) ()
#5 0x00000000004eab85 in boost::asio::io_service::run() ()
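The third core crashed in free, that is, while releasing memory. Judging from the three stacks, all of them are memory-related bugs.
The first problem was resolved quickly. When calling mysql_escape_string, the code did not allocate a larger buffer for the escaped output but used one the same size as the input, so the write ran past the end of the buffer and corrupted memory. Escaping can double the length of the input, which is why the MySQL documentation requires the destination buffer to be at least length*2+1 bytes.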
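The sizing rule can be sketched as follows. This is only an illustration, not the server's actual code: `escape_into` is a hypothetical, simplified stand-in for mysql_escape_string (it escapes only quotes and backslashes, so the example builds without the MySQL client library), and `escape_sql` is a hypothetical wrapper that sizes the destination for the worst case.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for mysql_escape_string: writes the escaped form of
// from[0..len) into `to`, appends '\0', and returns the escaped length.
// Like the real function, it may emit up to 2 bytes per input byte.
std::size_t escape_into(char* to, const char* from, std::size_t len) {
    std::size_t out = 0;
    for (std::size_t i = 0; i < len; ++i) {
        const char c = from[i];
        if (c == '\'' || c == '"' || c == '\\') {
            to[out++] = '\\';  // a special character grows to two bytes
        }
        to[out++] = c;
    }
    to[out] = '\0';
    return out;
}

// Hypothetical wrapper: the destination holds the worst case of
// 2 * len + 1 bytes, so the escaped output can never overrun it.
std::string escape_sql(const std::string& in) {
    std::vector<char> buf(in.size() * 2 + 1);
    const std::size_t n = escape_into(buf.data(), in.data(), in.size());
    return std::string(buf.data(), n);
}
```

With a destination of only `in.size()` bytes, an input such as `'''` would produce six bytes of output plus a terminator in a three-byte buffer, silently overwriting whatever the allocator had placed next to it.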
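The other two coredumps were more interesting. Logically the code had no problem at all, and threading issues could be ruled out as well. One crashed in _int_malloc, that is, while allocating memory, inside a MySQL API call, and the state of the MySQL object was normal. The other happened while the network layer was working through its message queue to send data, and it failed in _int_free. An _int_free failure is usually caused by freeing the same memory twice, but that function certainly does not free memory by hand; the only relevant operation is popping a string object off a queue.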
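Thinking back to the first coredump, I suspected that under some condition a memory overflow or similar error was damaging memory used by later operations, and that a later operation which happened to touch that memory would then crash. At that point there was no way to verify this; the updated version had to run for a while to see how stable it was after the fix.
After the fix had been running for some time, no further crashes occurred. It is fairly safe to conclude that all three crashes came from the same problem: once memory has been corrupted, the eventual crash can show up in all sorts of odd forms.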
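The delayed-crash pattern described above is hard to debug because the crash site is far from the bad write. One way to catch an overrun closer to its source is to place guard bytes after a buffer and check them later. The following is a minimal sketch of that idea; `GuardedBuffer`, `kGuardSize`, and `kPattern` are hypothetical names introduced here and are not part of the original server code.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

const std::size_t kGuardSize = 16;     // number of guard bytes after the buffer
const unsigned char kPattern = 0xAB;   // fill value expected in the guard bytes

// Hypothetical debugging helper: a buffer of `size` usable bytes followed by
// guard bytes. A write that runs past the usable part lands in the guard
// region, where it can be detected instead of damaging a neighbouring object.
class GuardedBuffer {
public:
    explicit GuardedBuffer(std::size_t size)
        : size_(size), storage_(size + kGuardSize, '\0') {
        std::memset(storage_.data() + size_, kPattern, kGuardSize);
    }

    char* data() { return storage_.data(); }
    std::size_t size() const { return size_; }

    // True while every guard byte still holds the fill pattern.
    bool intact() const {
        for (std::size_t i = 0; i < kGuardSize; ++i) {
            if (static_cast<unsigned char>(storage_[size_ + i]) != kPattern) {
                return false;
            }
        }
        return true;
    }

private:
    std::size_t size_;
    std::vector<char> storage_;
};
```

Calling `intact()` right after a suspicious write, for example after escaping into `data()`, points at the faulty call rather than at an unrelated malloc or free much later. The sketch only catches overruns shorter than the guard region; in practice, tools such as AddressSanitizer (`-fsanitize=address` in GCC and Clang) or Valgrind report the bad write itself without code changes.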