Analysis of a network library crash and boost asio strand dispatch

Date: 2019-12-06


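Recently I have been running stability tests on our server. The server's network layer is built on boost asio, with our own wrapper on top to better fit our needs.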

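The server crashed once last night. It had been under test for more than half a month before that, and there had been one earlier unexplained exit, but because it was a new test machine I had forgotten to change ulimit -c, so no core dump was produced. This time it happened again.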

The backtrace from the core dump is as follows:

#0  0x0000000000000091 in ?? ()
#1  0x0000000000459729 in ClientHandler::HandleConnect(cpnet::IConnection*) ()
#2  0x00000000004a0bbc in boost::asio::detail::completion_handler<boost::_bi::bind_t<void, boost::_mfi::mf1<void, cpnet::IMsgHandler, cpnet::IConnection*>, boost::_bi::list2<boost::_bi::value<cpnet::IMsgHandler*>, boost::_bi::value<cpnet::Connection*> > > >::do_complete(boost::asio::detail::task_io_service*, boost::asio::detail::task_io_service_operation*, boost::system::error_code const&, unsigned long) ()
#3  0x0000000000492e25 in boost::asio::detail::strand_service::do_complete(boost::asio::detail::task_io_service*, boost::asio::detail::task_io_service_operation*, boost::system::error_code const&, unsigned long) ()
#4  0x0000000000493f20 in boost::asio::detail::task_io_service::run(boost::system::error_code&) ()
#5  0x0000000000495bb5 in boost::asio::io_service::run() ()
#6  0x00007ff3798153cf in thread_proxy () from /home/slither/slither/depends/libboost_thread.so.1.58.0
#7  0x00007ff3788d7df5 in start_thread () from /lib64/libpthread.so.0
#8  0x00007ff3786051ad in clone () from /lib64/libc.so.6

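The whole network library is a wrapper around boost asio. Because the program was a release build and I had not generated symbols for this build, all I could see was the stack and thread information at the time of the crash; I had no good way to inspect anything else.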

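The backtrace shows that the crash finally happened at a jump to an invalid address, but frame 1, which made that call, is our own function: the callback invoked when a connection is established. At first the stack for this callback looked strange to me. We dispatch it ourselves through a strand, so I expected the function calling it to be our own code as well, yet that is not what the backtrace shows. It was unlikely that the backtrace was wrong, so my understanding of dispatch had to be wrong.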

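I dispatch HandleConnect through a boost strand, and this is where I had misread the strand dispatch interface. I had not read the documentation carefully enough and assumed that strand dispatch always executes the handler immediately. That is not correct. The documentation of the dispatch interface reads: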

/**
   * This function is used to ask the strand to execute the given handler.
   *
   * The strand object guarantees that handlers posted or dispatched through
   * the strand will not be executed concurrently. The handler may be executed
   * inside this function if the guarantee can be met. If this function is
   * called from within a handler that was posted or dispatched through the same
   * strand, then the new handler will be executed immediately.
   *
   * The strand's guarantee is in addition to the guarantee provided by the
   * underlying io_service. The io_service guarantees that the handler will only
   * be called in a thread in which the io_service's run member function is
   * currently being invoked.
   *
   * @param handler The handler to be called. The strand will make a copy of the
   * handler object as required. The function signature of the handler must be:
   * @code void handler(); @endcode
   */

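Note the sentence "If this function is called from within a handler that was posted or dispatched through the same strand, then the new handler will be executed immediately." The handler is guaranteed to run immediately only when dispatch is called from inside a handler that is already running on the same strand. In a multithreaded program, if another thread is already running a handler on that strand when our thread calls dispatch, our handler is not executed immediately; it is put on the waiting queue instead.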

The dispatch code in asio looks like this:

template <typename Handler>
void strand_service::dispatch(strand_service::implementation_type& impl,
    Handler& handler)
{
  // If we are already in the strand then the handler can run immediately.

  if (call_stack<strand_impl>::contains(impl))
  {
    fenced_block b(fenced_block::full);
    boost_asio_handler_invoke_helpers::invoke(handler, handler);
    return;
  }

  // Allocate and construct an operation to wrap the handler.
  typedef completion_handler<Handler> op;
  typename op::ptr p = { boost::asio::detail::addressof(handler),
    boost_asio_handler_alloc_helpers::allocate(
      sizeof(op), handler), 0 };
  p.p = new (p.v) op(handler);

  BOOST_ASIO_HANDLER_CREATION((p.p, "strand", impl, "dispatch"));
　
 
  // do_dispatch decides whether the handler can be executed immediately.
  bool dispatch_immediately = do_dispatch(impl, p.p);
  operation* o = p.p;
  p.v = p.p = 0;

  if (dispatch_immediately)
  {
    // Indicate that this strand is executing on the current thread.
    call_stack<strand_impl>::context ctx(impl);

    // Ensure the next handler, if any, is scheduled on block exit.
    on_dispatch_exit on_exit = { &io_service_, impl };
    (void)on_exit;

    completion_handler<Handler>::do_complete(
        &io_service_, o, boost::system::error_code(), 0);
  }
}

Now let's look at the code of do_dispatch:

bool strand_service::do_dispatch(implementation_type& impl, operation* op)
{
  // If we are running inside the io_service, and no other handler already
  // holds the strand lock, then the handler can run immediately.

  bool can_dispatch = io_service_.can_dispatch();
  impl->mutex_.lock();
  if (can_dispatch && !impl->locked_)
  {
    // Immediate invocation is allowed.
    impl->locked_ = true;
    impl->mutex_.unlock();
    return true;
  }

  if (impl->locked_)
  {
    // Some other handler already holds the strand lock. Enqueue for later.

    impl->waiting_queue_.push(op);
    impl->mutex_.unlock();
  }
  else
  {
    // The handler is acquiring the strand lock and so is responsible for
    // scheduling the strand.
    impl->locked_ = true;
    impl->mutex_.unlock();
    impl->ready_queue_.push(op);
    io_service_.post_immediate_completion(impl, false);
  }

  return false;
}
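
The outcomes of dispatch in the two listings above can be modeled with a small toy class. This is only a sketch under strong assumptions: ToyStrand, hold() and release() are hypothetical names, the model is single-threaded, and it leaves out the mutex, the io_service and the can_dispatch() check that the real code uses. hold() stands in for "another thread is executing a handler in this strand".

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical single-threaded model of the outcomes of strand dispatch.
// This is not Boost.Asio code: there is no mutex and no io_service, and
// "another thread holds the strand" is simulated with hold() and release().
class ToyStrand {
 public:
  using Handler = std::function<void()>;

  // Returns true if the handler ran inside this call, false if it was queued.
  bool dispatch(Handler h) {
    if (in_handler_) {  // called from a handler of this same strand
      h();
      return true;
    }
    if (held_) {  // some other handler holds the strand: enqueue for later
      waiting_.push(std::move(h));
      return false;
    }
    run(h);  // the strand is free: run right away
    return true;
  }

  // Simulate another thread starting to execute a handler in this strand.
  void hold() { held_ = true; }

  // Simulate that handler finishing: the waiting handlers now get their turn.
  void release() {
    held_ = false;
    while (!waiting_.empty()) {
      Handler h = std::move(waiting_.front());
      waiting_.pop();
      run(h);
    }
  }

 private:
  void run(Handler& h) {
    assert(!in_handler_);  // nested calls take the first branch of dispatch()
    in_handler_ = true;
    h();
    in_handler_ = false;
  }

  bool in_handler_ = false;
  bool held_ = false;
  std::queue<Handler> waiting_;
};
```

In this model, a dispatch on a free strand or from inside one of the strand's own handlers returns true, while a dispatch made between hold() and release() returns false and the handler only runs during release(). That last case corresponds to the one behind the crash described in this post.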

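From the asio strand dispatch source we can see that a dispatched handler may well not be executed immediately. Because of my mistaken understanding of dispatch, we had already started reading network data before dispatching the handler. In one rather special case, where a client connects and then disconnects right away, our read function returns an error immediately. My Connection wrapper is managed through shared_ptr and is destroyed as soon as nothing references it, so by the time the previously dispatched handler was taken from the queue and executed, the Connection pointer passed to it was already dangling, and the program crashed.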
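
The lifetime problem can be sketched without Boost. The code below is an illustration under assumptions, not the fix actually applied in this project: Connection is a hypothetical stand-in for cpnet::Connection, and HandlerQueue stands in for the strand's waiting queue. It shows two common ways to make a deferred handler safe: let the handler own a shared_ptr, or let it hold a weak_ptr and skip the work if the object is gone.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <queue>
#include <utility>

// Hypothetical stand-in for the article's cpnet::Connection.
struct Connection {
  int id = 0;
};

// Hypothetical stand-in for a strand's waiting queue: handlers are stored
// now and invoked later, possibly after the caller dropped its references.
using HandlerQueue = std::queue<std::function<void()>>;

// Queue a handler that owns a shared_ptr to the connection. The copy held
// inside the lambda keeps the object alive until the handler has run.
inline void queue_owning_handler(HandlerQueue& q,
                                 const std::shared_ptr<Connection>& conn,
                                 int& seen_id) {
  assert(conn && "connection must be alive when the handler is queued");
  q.push([conn, &seen_id] { seen_id = conn->id; });
}

// Queue a handler that only observes the connection. If the connection is
// already gone when the handler runs, the handler does nothing.
inline void queue_observing_handler(HandlerQueue& q,
                                    const std::shared_ptr<Connection>& conn,
                                    int& seen_id) {
  std::weak_ptr<Connection> weak = conn;
  q.push([weak, &seen_id] {
    if (auto alive = weak.lock()) seen_id = alive->id;
  });
}

// Run every queued handler, as the strand would once it becomes free.
inline void drain(HandlerQueue& q) {
  while (!q.empty()) {
    std::function<void()> h = std::move(q.front());
    q.pop();
    h();
  }
}
```

In the backtrace above the handler is bound with a raw cpnet::Connection* (boost::_bi::value<cpnet::Connection*>), so the queued handler does not keep the object alive. Binding a shared_ptr instead, for example one obtained through shared_from_this(), would correspond to queue_owning_handler here.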

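This kind of intermittent bug is hard to catch in testing; usually it only shows up when we run varied stress tests ourselves. It is also a reminder to myself to read the API of any third-party library more carefully before relying on it.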