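I have recently been running stability tests on our server. Its network layer is built on boost asio, with our own wrapper on top to better fit our needs.
The server crashed once last night. Over the previous half month or so of testing it had already exited once for no apparent reason, but because this was a new test machine I had forgotten to change ulimit -c, so no coredump was written. This time it happened again.
The backtrace from the coredump is as follows: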
#0 0x0000000000000091 in ?? ()
#1 0x0000000000459729 in ClientHandler::HandleConnect(cpnet::IConnection*) ()
#2 0x00000000004a0bbc in boost::asio::detail::completion_handler<boost::_bi::bind_t<void, boost::_mfi::mf1<void, cpnet::IMsgHandler, cpnet::IConnection*>, boost::_bi::list2<boost::_bi::value<cpnet::IMsgHandler*>, boost::_bi::value<cpnet::Connection*> > > >::do_complete(boost::asio::detail::task_io_service*, boost::asio::detail::task_io_service_operation*, boost::system::error_code const&, unsigned long) ()
#3 0x0000000000492e25 in boost::asio::detail::strand_service::do_complete(boost::asio::detail::task_io_service*, boost::asio::detail::task_io_service_operation*, boost::system::error_code const&, unsigned long) ()
#4 0x0000000000493f20 in boost::asio::detail::task_io_service::run(boost::system::error_code&) ()
#5 0x0000000000495bb5 in boost::asio::io_service::run() ()
#6 0x00007ff3798153cf in thread_proxy () from /home/slither/slither/depends/libboost_thread.so.1.58.0
#7 0x00007ff3788d7df5 in start_thread () from /lib64/libpthread.so.0
#8 0x00007ff3786051ad in clone () from /lib64/libc.so.6
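The whole network library is a wrapper around boost asio. The program is a release build and I had not generated symbols for this version, so all I can see is the stack and thread information at the time of the crash. I have no good way to inspect anything else.
The backtrace shows that the fault itself is a jump to an invalid address (frame 0), and that the caller in frame 1 is our own function, the callback invoked when a connection is established. At first the stack for this callback looked odd to me. We dispatch it ourselves through a strand, so I expected its caller to be our own code as well, yet the caller is asio's strand_service::do_complete. The backtrace is unlikely to be wrong, so it had to be my understanding of dispatch that was wrong.
HandleConnect is dispatched through a boost strand, and this is where my misreading of the strand dispatch interface comes in. I had not read the documentation carefully enough and assumed that strand dispatch always executes the handler immediately. That is not the case. The documentation for dispatch reads: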
/**
 * This function is used to ask the strand to execute the given handler.
 *
 * The strand object guarantees that handlers posted or dispatched through
 * the strand will not be executed concurrently. The handler may be executed
 * inside this function if the guarantee can be met. If this function is
 * called from within a handler that was posted or dispatched through the same
 * strand, then the new handler will be executed immediately.
 *
 * The strand's guarantee is in addition to the guarantee provided by the
 * underlying io_service. The io_service guarantees that the handler will only
 * be called in a thread in which the io_service's run member function is
 * currently being invoked.
 *
 * @param handler The handler to be called. The strand will make a copy of the
 * handler object as required. The function signature of the handler must be:
 * @code void handler(); @endcode
 */
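Note this sentence: "If this function is called from within a handler that was posted or dispatched through the same strand, then the new handler will be executed immediately." Immediate execution is only promised when dispatch is called from inside a handler already running in that strand. In a multithreaded program, if another thread is already running a handler in the strand when our thread calls dispatch, our handler does not run immediately. It is put on the waiting queue instead.
The dispatch code in asio looks like this: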
template <typename Handler>
void strand_service::dispatch(strand_service::implementation_type& impl,
    Handler& handler)
{
  // If we are already in the strand then the handler can run immediately.
  if (call_stack<strand_impl>::contains(impl))
  {
    fenced_block b(fenced_block::full);
    boost_asio_handler_invoke_helpers::invoke(handler, handler);
    return;
  }

  // Allocate and construct an operation to wrap the handler.
  typedef completion_handler<Handler> op;
  typename op::ptr p = { boost::asio::detail::addressof(handler),
    boost_asio_handler_alloc_helpers::allocate(
      sizeof(op), handler), 0 };
  p.p = new (p.v) op(handler);

  BOOST_ASIO_HANDLER_CREATION((p.p, "strand", impl, "dispatch"));

  // do_dispatch decides whether the handler can run immediately.
  bool dispatch_immediately = do_dispatch(impl, p.p);
  operation* o = p.p;
  p.v = p.p = 0;

  if (dispatch_immediately)
  {
    // Indicate that this strand is executing on the current thread.
    call_stack<strand_impl>::context ctx(impl);

    // Ensure the next handler, if any, is scheduled on block exit.
    on_dispatch_exit on_exit = { &io_service_, impl };
    (void)on_exit;

    completion_handler<Handler>::do_complete(
        &io_service_, o, boost::system::error_code(), 0);
  }
}
Now look at the code for do_dispatch:
bool strand_service::do_dispatch(implementation_type& impl, operation* op)
{
  // If we are running inside the io_service, and no other handler already
  // holds the strand lock, then the handler can run immediately.
  bool can_dispatch = io_service_.can_dispatch();
  impl->mutex_.lock();
  if (can_dispatch && !impl->locked_)
  {
    // Immediate invocation is allowed.
    impl->locked_ = true;
    impl->mutex_.unlock();
    return true;
  }

  if (impl->locked_)
  {
    // Some other handler already holds the strand lock. Enqueue for later.
    impl->waiting_queue_.push(op);
    impl->mutex_.unlock();
  }
  else
  {
    // The handler is acquiring the strand lock and so is responsible for
    // scheduling the strand.
    impl->locked_ = true;
    impl->mutex_.unlock();
    impl->ready_queue_.push(op);
    io_service_.post_immediate_completion(impl, false);
  }

  return false;
}
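The decision made in do_dispatch can be condensed into a toy model. The sketch below is not asio code: ToyStrand, set_busy, and drain are hypothetical names, everything runs on one thread, and the busy flag merely stands in for "another thread currently holds the strand". It only illustrates that a handler dispatched to a busy strand runs after dispatch has already returned.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Toy, single-threaded model of the strand's dispatch decision.
// Hypothetical class for illustration only; this is NOT asio's implementation.
class ToyStrand {
public:
    // Returns true if the handler ran inside dispatch(), false if it was queued.
    bool dispatch(std::function<void()> handler) {
        if (inside_) {
            // Called from a handler already running in this strand: run now.
            handler();
            return true;
        }
        if (busy_) {
            // Someone else holds the strand: enqueue for later.
            waiting_.push(std::move(handler));
            return false;
        }
        run(handler);
        return true;
    }

    // Stand-in for "another thread is running a handler in this strand".
    void set_busy(bool busy) { busy_ = busy; }

    // Stand-in for the strand being scheduled again: run the queued handlers.
    void drain() {
        while (!waiting_.empty()) {
            std::function<void()> next = std::move(waiting_.front());
            waiting_.pop();
            run(next);
        }
    }

private:
    void run(const std::function<void()>& handler) {
        inside_ = true;
        handler();
        inside_ = false;
    }

    bool inside_ = false;  // a handler of this strand is executing right now
    bool busy_ = false;    // simulated "strand is locked by another thread"
    std::queue<std::function<void()>> waiting_;
};

// Records the order of events when dispatch() is called on a busy strand.
std::vector<std::string> dispatch_on_busy_strand() {
    std::vector<std::string> log;
    ToyStrand strand;
    strand.set_busy(true);
    bool ran_now = strand.dispatch([&log] { log.push_back("handler"); });
    assert(!ran_now);  // the handler was queued, not executed
    log.push_back("dispatch returned");
    strand.set_busy(false);
    strand.drain();    // only now does the handler run
    return log;
}
```

In this model, dispatch_on_busy_strand() records "dispatch returned" before "handler", which is exactly the ordering the calling code has to be prepared for.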
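The asio source for strand dispatch makes it clear that a handler we dispatch may not be executed immediately. Because of our mistaken understanding of dispatch, we had already started reading from the socket before the dispatched handler had actually run. In one fairly unusual case, where a client connects and then disconnects right away, our read function returns an error immediately. My Connection wrapper is managed through shared_ptr, so it is destroyed as soon as nothing references it. By the time the previously dispatched handler is taken off the queue and executed, the Connection pointer passed to it is already dangling, and the program crashes.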
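The backtrace shows the handler bound a raw cpnet::Connection*, so the queued handler did nothing to keep the object alive. One common remedy, not necessarily the fix applied in the real code, is to let the handler hold a shared_ptr copy. The sketch below uses only the standard library: Connection here is a hypothetical stand-in, and a plain std::queue plays the role of the strand's waiting queue.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <queue>

// Hypothetical stand-in for the article's Connection wrapper.
struct Connection {
    int id = 7;
};

// Simulates a handler that is queued and only runs after the original owner
// has dropped its reference (the "connect, then disconnect at once" case).
// Returns the id the handler observed.
int run_deferred_handler_with_shared_owner() {
    std::queue<std::function<void()>> pending;  // stand-in for the waiting queue
    auto conn = std::make_shared<Connection>();
    std::weak_ptr<Connection> observer = conn;  // lets us watch the lifetime
    int seen_id = -1;

    // The handler captures the shared_ptr by value, so it co-owns the object.
    // Capturing conn.get() instead would reproduce the dangling pointer.
    pending.push([conn, &seen_id] { seen_id = conn->id; });

    conn.reset();                 // the read-error path releases its reference
    assert(!observer.expired());  // still alive: the queued handler owns it

    pending.front()();            // the deferred handler finally runs, safely
    pending.pop();                // destroying the handler drops the last owner
    assert(observer.expired());
    return seen_id;
}
```

With asio itself the same idea is usually expressed by binding a shared_ptr (for example one obtained from shared_from_this) into the handler instead of a raw pointer, at the cost of the Connection living slightly longer than its last I/O operation.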
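Intermittent bugs like this are hard to catch in testing. They usually only surface when we run varied stress tests ourselves. It is also a reminder to myself to understand the API more thoroughly when using a third-party library.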