【Original】Problems you may run into when using the heartbeat feature in rabbitmq


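【Problem scenario】
      A client subscribes as a consumer to a queue on a rabbitmq server. In the AMQP Connection.Tune-Ok method, the client sets heartbeat to 0, i.e. it asks the server not to enable the heartbeat feature. The server then stops serving because of an abnormal power loss, and the client cannot detect in the short term that the server has failed.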

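       When this problem first showed up, testers and application developers came to me and said: the modified rabbitmq-c library may have a serious bug. The server has been shut down, so how can the client keep working as if nothing had happened? On hearing this, I only had to ask three questions to arrive at the answer:
  • Is the application running purely as a consumer?
  • Can you confirm that the server stopped serving because of an abnormal power loss?
  • Is there any intermediate routing device between the server and the application?
The answers I got were: yes, yes, and no. So the answer was already settled. Have you worked it out?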

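【Problem analysis】
The problem can be analysed at two levels:
1. The TCP level
      At this level, the problem is a textbook case of the TCP "half-open" connection, typically described as follows:
If one end has closed or aborted the connection without the other end knowing, the TCP connection is said to be half-open. This can happen whenever the host at either end fails abnormally. As long as there is no attempt to transfer data over a half-open connection, the end that is still up will not detect that the other end has failed.
A common cause of a half-open connection is a client host suddenly losing power, instead of the client application being terminated normally before the host is shut down. Of course, the "client host" here does not refer only to the client side.
      When this happens, the side of the TCP link that only receives and never sends can rely only on TCP's own keepalive mechanism to check whether the link is still healthy. Keepalive has to be enabled on the socket (SO_KEEPALIVE), and with typical default settings it takes about 2 hours of idle time before it triggers.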
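
As a rough illustration of the keepalive point above, the sketch below enables keepalive on a TCP socket and shortens its schedule so that a dead peer is noticed in about a minute instead of hours. It is not from the original post: the function name `enable_fast_keepalive` and the values used in the note after it are assumptions, and `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` are Linux-specific socket options.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: enable TCP keepalive on an existing TCP socket and
 * shorten the probe schedule (Linux-specific options).
 *   idle_s     - seconds of idle time before the first probe is sent
 *   interval_s - seconds between unanswered probes
 *   probes     - unanswered probes before the connection is reported dead
 * Returns 0 on success, -1 if any setsockopt() call fails. */
int enable_fast_keepalive(int fd, int idle_s, int interval_s, int probes)
{
    int on = 1;

    assert(idle_s > 0 && interval_s > 0 && probes > 0);

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) != 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof idle_s) != 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_s, sizeof interval_s) != 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof probes) != 0)
        return -1;
    return 0;
}
```

Called as `enable_fast_keepalive(fd, 30, 10, 3)`, an unresponsive peer would be reported after roughly 30 + 3 × 10 = 60 seconds of silence. This only covers the TCP level and does not replace AMQP heartbeats.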

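2. The AMQP level
      At this level, because the client is subscribed to the queue as a consumer, it never actively sends data to the rabbitmq server over that AMQP/TCP connection. When the server stops serving because of an abnormal power loss, the consumer receives no AMQP-level close method, so it has no way of knowing the state of its peer.
      One possible fix is for the client, after N consecutive receive timeouts, to send an AMQP Heartbeat frame to check whether the server is still alive.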


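      The scenario description says "in the AMQP Connection.Tune-Ok method, the client sets heartbeat to 0". What happens if heartbeat is set to 30 instead? The answer is that the heartbeat feature is activated on both the server and the client. The server sends a heartbeat frame to the client whenever it has had no data to send for a certain period; and if it has received no data at all for a certain period, it treats this as a heartbeat timeout and eventually closes the TCP connection (see here). The client likewise has to maintain two heartbeat timers, one for sending and one for receiving, which are used to detect send-side and receive-side timeouts respectively.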
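
The two client-side timers described above could be tracked roughly as follows. This is a minimal sketch, not rabbitmq-c code: the `hb_timers` type and its functions are hypothetical, and the thresholds (send a heartbeat after one interval without sending anything, treat the peer as dead after two intervals without receiving anything) are assumptions chosen to match the logs later in this post.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-connection bookkeeping. All times are in seconds and
 * should come from a monotonic clock. */
typedef struct {
    long interval_s; /* heartbeat value sent in Connection.Tune-Ok, 0 = disabled */
    long last_sent;  /* time the last frame of any kind was sent */
    long last_recv;  /* time the last frame of any kind was received */
} hb_timers;

void hb_init(hb_timers *t, long interval_s, long now)
{
    assert(t != NULL && interval_s >= 0);
    t->interval_s = interval_s;
    t->last_sent = now;
    t->last_recv = now;
}

/* Call after every frame sent (data or heartbeat). */
void hb_on_send(hb_timers *t, long now) { t->last_sent = now; }

/* Call after every frame received (data or heartbeat). */
void hb_on_recv(hb_timers *t, long now) { t->last_recv = now; }

/* True when the send side has been idle for a full interval. */
bool hb_send_due(const hb_timers *t, long now)
{
    return t->interval_s > 0 && now - t->last_sent >= t->interval_s;
}

/* True when nothing has been received for two full intervals. */
bool hb_peer_timed_out(const hb_timers *t, long now)
{
    return t->interval_s > 0 && now - t->last_recv >= 2 * t->interval_s;
}
```

With heartbeat set to 0 both checks always return false, which is exactly the situation in the problem scenario: nothing ever prompts the client to probe or give up on the connection.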

The amqp.h header file shows the current state of heartbeat support in rabbitmq-c:
 * \param [in] heartbeat the number of seconds between heartbeat frames to
 *             request of the broker. A value of 0 disables heartbeats.
 *             Note rabbitmq-c only has partial support for heartbeats, as of
 *             v0.4.0 heartbeats are only serviced during amqp_basic_publish(),
 *             and amqp_simple_wait_frame()/amqp_simple_wait_frame_noblock()
The rabbitmq-c 0.4.1 release currently on GitHub supports heartbeats only in the 3 APIs listed above.

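      So the problem to be solved can be stated as follows: after the client has subscribed as a consumer to a queue on the server, and while there is no application data to process, it needs to watch for Heartbeat frames in order to decide whether the server is in an abnormal state (in other words, whether its own TCP connection has become "half-open").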
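
For reference, an AMQP 0-9-1 heartbeat frame is 8 bytes on the wire: frame type 8, channel 0, payload size 0, and the frame-end octet 0xCE. The sketch below builds and recognises such a frame on a raw byte buffer. The function and constant names are hypothetical and are not part of rabbitmq-c, which does its own framing internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    HB_FRAME_TYPE = 8,    /* AMQP 0-9-1 frame type for heartbeat */
    HB_FRAME_END  = 0xCE, /* frame-end octet that terminates every frame */
    HB_FRAME_LEN  = 8     /* type(1) + channel(2) + size(4) + frame-end(1) */
};

/* Write a heartbeat frame into buf.
 * Returns the number of bytes written, or 0 if buf is too small. */
size_t heartbeat_frame_encode(uint8_t *buf, size_t cap)
{
    assert(buf != NULL);
    if (cap < HB_FRAME_LEN)
        return 0;
    memset(buf, 0, HB_FRAME_LEN); /* channel 0 and payload size 0 */
    buf[0] = HB_FRAME_TYPE;
    buf[HB_FRAME_LEN - 1] = HB_FRAME_END;
    return HB_FRAME_LEN;
}

/* Return 1 if the first len bytes of buf form exactly one heartbeat frame. */
int is_heartbeat_frame(const uint8_t *buf, size_t len)
{
    static const uint8_t expected[HB_FRAME_LEN] =
        { HB_FRAME_TYPE, 0, 0, 0, 0, 0, 0, HB_FRAME_END };

    assert(buf != NULL);
    return len == HB_FRAME_LEN && memcmp(buf, expected, HB_FRAME_LEN) == 0;
}
```

Because the frame carries no payload, receiving one says nothing beyond "the peer and the path to it were alive a moment ago", which is all the consumer needs here.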


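【Solution】
The recommended solution is as follows:
  • The client must enable the heartbeat feature (the basis for solving the "half-open" problem);
  • The client needs to send heartbeats when its sending side is idle (because at present the client, as a producer, holds a long-lived connection to the rabbitmq server);
  • The client needs to use the heartbeat frames sent by the server, while its receiving side is idle, to decide whether the server (or the network) is still healthy (because the client, as a consumer, also holds a long-lived connection to the rabbitmq server and never actively sends data to it).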

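Summary:
      As long as the client enables heartbeat, the server will periodically send heartbeat frames to the client whenever "certain conditions" are met, and will also check whether it has received a heartbeat frame once its idle time reaches the configured limit. On the client side, when acting as a consumer, the client needs to check whether it has received any data (ordinary data or heartbeat frames alike); if nothing has been received within a certain period, it should assume that the current link may be broken. The application can then trigger re-establishing the consume subscription.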
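
The consumer-side rule in this summary can be sketched as a small event-driven watchdog. This is an illustration only, not the code that produced the logs in this post: the type and function names are hypothetical, and the 10-second wait slice and 60-second silence limit are assumptions read off the consumer log.

```c
#include <assert.h>
#include <stddef.h>

/* What the receive loop observed in one iteration. */
typedef enum { EV_RECV_TIMEOUT, EV_HEARTBEAT, EV_DELIVER } consumer_event;

/* What the caller should do next. */
typedef enum {
    ACT_KEEP_WAITING,
    ACT_SEND_HEARTBEAT,        /* answer the broker's heartbeat */
    ACT_HANDLE_MESSAGE,        /* process a Basic.Deliver */
    ACT_CLOSE_AND_RESUBSCRIBE  /* link presumed dead, re-establish consume */
} consumer_action;

typedef struct {
    int wait_slice_s; /* length of one receive wait, e.g. 10 */
    int dead_after_s; /* silence limit, e.g. 60 */
    int silent_s;     /* silence accumulated so far */
} consumer_watchdog;

/* Feed one event into the watchdog and get the next action. */
consumer_action consumer_step(consumer_watchdog *w, consumer_event ev)
{
    assert(w != NULL && w->wait_slice_s > 0 && w->dead_after_s > 0);

    switch (ev) {
    case EV_HEARTBEAT: /* any inbound frame proves the link is alive */
        w->silent_s = 0;
        return ACT_SEND_HEARTBEAT;
    case EV_DELIVER:
        w->silent_s = 0;
        return ACT_HANDLE_MESSAGE;
    case EV_RECV_TIMEOUT:
        w->silent_s += w->wait_slice_s;
        if (w->silent_s >= w->dead_after_s)
            return ACT_CLOSE_AND_RESUBSCRIBE;
        return ACT_KEEP_WAITING;
    }
    return ACT_KEEP_WAITING; /* not reached for valid events */
}
```

Both heartbeats and ordinary deliveries reset the silence counter, so a busy queue never trips the watchdog even if no heartbeat frame happens to arrive.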


      Below is the output printed with the heartbeat feature enabled:
Output when the network is disconnected while running as a consumer
[warn] evsignal_init: socketpair: No error
drive_machine: [conn_init]  ---  TCP 3-way handshake start! --> [172.16.81.111:5672][s:53144]
drive_machine: [conn_connecting]  ---  connection timeout 1 time on socket(53144)
drive_machine: [conn_connected]  ---  connected on socket(53144)
53144: conn_state change   connected ==> snd_protocol_header
  --> Send Protocol.Header!
53144: conn_state change   snd_protocol_header ==> rcv_connection_start_method
[53144] drive_machine: wait for Connection.Start method another 10 seconds!!
  <-- Recv Connection.Start Method frame!
53144: conn_state change   rcv_connection_start_method ==> snd_connection_start_rsp_method
  --> Send Connection.Start-Ok Method frame!
53144: conn_state change   snd_connection_start_rsp_method ==> rcv_connection_tune_method
  <-- Recv Connection.Tune Method frame!
53144: conn_state change   rcv_connection_tune_method ==> snd_connection_tune_rsp_method
  --> Send Connection.Tune-Ok Method frame!
53144: conn_state change   snd_connection_tune_rsp_method ==> snd_connection_open_method
  --> Send Connection.Open Method frame!
53144: conn_state change   snd_connection_open_method ==> rcv_connection_open_rsp_method
  <-- Recv Connection.Open-Ok Method frame!
53144: conn_state change   rcv_connection_open_rsp_method ==> snd_channel_open_method
  --> Send Channel.Open Method frame!
53144: conn_state change   snd_channel_open_method ==> rcv_channel_open_rsp_method
[53144] drive_machine: wait for Channel.Open-Ok method another 10 seconds!!
  <-- Recv Channel.Open-Ok Method frame!
53144: conn_state change   rcv_channel_open_rsp_method ==> idle
[53144] drive_machine: [conn_idle]  ---  [CONSUMER]: Queue Declaring!
53144: conn_state change   idle ==> snd_queue_declare_method
  --> Send Queue.Declare Method frame!
53144: conn_state change   snd_queue_declare_method ==> rcv_queue_declare_rsp_method
[53144] drive_machine: wait for Queue.Declare-Ok method another 10 seconds!!
  <-- Recv Queue.Declare-Ok Method frame!
53144: conn_state change   rcv_queue_declare_rsp_method ==> idle
[53144] drive_machine: [conn_idle]  ---  [CONSUMER]: Queue Binding!
53144: conn_state change   idle ==> snd_queue_bind_method
  --> Send Queue.Bind Method frame!
53144: conn_state change   snd_queue_bind_method ==> rcv_queue_bind_rsp_method
[53144] drive_machine: wait for Queue.Bind method another 10 seconds!!
  <-- Recv Queue.Bind Method frame!
need to code something!
53144: conn_state change   rcv_queue_bind_rsp_method ==> idle
[53144] drive_machine: [conn_idle]  ---  [CONSUMER]: Basic QoS!
53144: conn_state change   idle ==> snd_basic_qos_method
  --> Send Basic.Qos Method frame!
53144: conn_state change   snd_basic_qos_method ==> rcv_basic_qos_rsp_method
  <-- Recv Queue.Qos-Ok Method frame!
need to code something!
53144: conn_state change   rcv_basic_qos_rsp_method ==> idle
[53144] drive_machine: [conn_idle]  ---  [CONSUMER]: Basic Consuming!
53144: conn_state change   idle ==> snd_basic_consume_method
  --> Send Basic.Consume Method frame!
53144: conn_state change   snd_basic_consume_method ==> rcv_basic_consume_rsp_method
  <-- Recv Basic.Consume-Ok Method frame!
need to code something!
53144: conn_state change   rcv_basic_consume_rsp_method ==> idle
[53144] drive_machine: [conn_idle]  ---  [CONSUMER]: Start waiting to recv!
53144: conn_state change   idle ==> rcv_basic_deliver_method
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
      ### Recv AMQP_FRAME_HEARTBEAT frame! ###
  <-- Recv Heartbeat frame!
53144: conn_state change   rcv_basic_deliver_method ==> snd_heartbeat
  --> Send Heartbeat frame!
53144: conn_state change   snd_heartbeat ==> rcv_basic_deliver_method
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
      ### Recv AMQP_FRAME_HEARTBEAT frame! ###
  <-- Recv Heartbeat frame!
53144: conn_state change   rcv_basic_deliver_method ==> snd_heartbeat
  --> Send Heartbeat frame!
53144: conn_state change   snd_heartbeat ==> rcv_basic_deliver_method
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
      ### Recv AMQP_FRAME_HEARTBEAT frame! ###
  <-- Recv Heartbeat frame!
53144: conn_state change   rcv_basic_deliver_method ==> snd_heartbeat
  --> Send Heartbeat frame!
53144: conn_state change   snd_heartbeat ==> rcv_basic_deliver_method
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: wait for Basic.Deliver method another 10 seconds!!
[53144] drive_machine: Recv nothing for 60s!
[53144] drive_machine: Maybe network broken or rabbitmq server fucked! Plz retry consuming!
53144: conn_state change   rcv_basic_deliver_method ==> close
[53144] drive_machine: [conn_close]  ---  Connection Disconnect!
### CB: Connection Disconnect!    Msg : [Connection Disconnect]



Output when the network is disconnected while running as a producer
[warn] evsignal_init: socketpair: No error
drive_machine: [conn_init]  ---  TCP 3-way handshake start! --> [172.16.81.111:5672][s:12184]
drive_machine: [conn_connecting]  ---  connection timeout 1 time on socket(12184)
drive_machine: [conn_connected]  ---  connected on socket(12184)
12184: conn_state change   connected ==> snd_protocol_header
  --> Send Protocol.Header!
12184: conn_state change   snd_protocol_header ==> rcv_connection_start_method
  <-- Recv Connection.Start Method frame!
12184: conn_state change   rcv_connection_start_method ==> snd_connection_start_rsp_method
  --> Send Connection.Start-Ok Method frame!
12184: conn_state change   snd_connection_start_rsp_method ==> rcv_connection_tune_method
[12184] drive_machine: wait for Connection.Tune method another 10 seconds!!
  <-- Recv Connection.Tune Method frame!
12184: conn_state change   rcv_connection_tune_method ==> snd_connection_tune_rsp_method
  --> Send Connection.Tune-Ok Method frame!
12184: conn_state change   snd_connection_tune_rsp_method ==> snd_connection_open_method
  --> Send Connection.Open Method frame!
12184: conn_state change   snd_connection_open_method ==> rcv_connection_open_rsp_method
[12184] drive_machine: wait for Connection.Open-Ok method another 10 seconds!!
  <-- Recv Connection.Open-Ok Method frame!
12184: conn_state change   rcv_connection_open_rsp_method ==> snd_channel_open_method
  --> Send Channel.Open Method frame!
12184: conn_state change   snd_channel_open_method ==> rcv_channel_open_rsp_method
[12184] drive_machine: wait for Channel.Open-Ok method another 10 seconds!!
  <-- Recv Channel.Open-Ok Method frame!
12184: conn_state change   rcv_channel_open_rsp_method ==> snd_channel_confirm_select_method
  --> Send Confirm.Select Method frame!
12184: conn_state change   snd_channel_confirm_select_method ==> rcv_channel_confirm_select_rsp_method
  <-- Recv Confirm.Select-Ok Method frame!
Channel in Confirm Mode!
12184: conn_state change   rcv_channel_confirm_select_rsp_method ==> idle
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find msg to send!
12184: conn_state change   idle ==> snd_basic_publish_method
  --> Send Basic.Publish Method frame!
12184: conn_state change   snd_basic_publish_method ==> snd_basic_content_header
  --> Send Content-Header frame!
12184: conn_state change   snd_basic_content_header ==> snd_basic_content_body
  --> Send Content-Body frame!
12184: conn_state change   snd_basic_content_body ==> rcv_basic_ack_method
  <-- Recv Basic.Ack Method frame!
### CB: Publisher Confirm -- [Basic.Ack]  Delivery_Tag:[1]  multiple:[0]
12184: conn_state change   rcv_basic_ack_method ==> idle
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
12184: conn_state change   idle ==> snd_heartbeat
  --> Send Heartbeat frame!
12184: conn_state change   snd_heartbeat ==> idle
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
[12184] drive_machine: [conn_idle]  ---  [PRODUCER]: Find no msg to send! wait another 1 seconds
12184: conn_state change   idle ==> snd_heartbeat
[12184] drive_machine: Send Heartbeat failed! status = -9
12184: conn_state change   snd_heartbeat ==> close
[12184] drive_machine: [conn_close]  ---  Connection Disconnect!
### CB: Connection Disconnect!    Msg : [Connection Disconnect]