Analysis of why Redis closes the connection ("Connection closed by server")

Scenario: five agents start up and throw a flood of exceptions like the following:

2017-11-24 13:57:54,760 - 8 - ERROR: - Error while reading from socket: ('Connection closed by server.',)
Traceback (most recent call last):
  File "./inject_agent/agent.py", line 78, in execute
    self._execute()
  File "./inject_agent/agent.py", line 87, in _execute
    task = self.get_task()
  File "./inject_agent/agent.py", line 44, in get_task
    body = self.db.redis_con.lpop(redis_conf["proxy_task_queue"])
  File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 1329, in lpop
    return self.execute_command('LPOP', name)
  File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 673, in execute_command
    connection.send_command(*args)
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 610, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 585, in send_packed_command
    self.connect()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 493, in connect
    self.on_connect()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 561, in on_connect
    if nativestr(self.read_response()) != 'OK':
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 624, in read_response
    response = self._parser.read_response()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 284, in read_response
    response = self._buffer.readline()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 216, in readline
    self._read_from_socket()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 191, in _read_from_socket
    (e.args,))

The setup: an Alibaba Cloud (Aliyun) server with a Redis instance deployed on it.

Five machines consume tasks from this Redis, and each agent pulls with 100 threads concurrently!
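A hypothetical sketch of that consumption pattern (the names `get_task`, `worker`, and `start_agent` are assumptions for illustration, not the actual `inject_agent` code): each agent starts many worker threads, and each worker pops tasks from a shared Redis list with LPOP, just as the traceback above shows.

```python
# Sketch of the agents' consumption loop (hypothetical names, not the real code).
import threading

def get_task(redis_con, queue_name):
    """Pop one task from the queue; LPOP returns None when the list is empty."""
    return redis_con.lpop(queue_name)

def worker(redis_con, queue_name, handle):
    while True:
        task = get_task(redis_con, queue_name)
        if task is None:
            break  # queue drained, worker exits
        handle(task)

def start_agent(redis_con, queue_name, handle, num_threads=100):
    # One agent = many worker threads hammering the same Redis list.
    threads = [
        threading.Thread(target=worker, args=(redis_con, queue_name, handle))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

With five such agents, roughly 500 threads open connections and poll the same list, which is what triggered the investigation below.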

    This raises a question:

Why would Redis actively close the socket?

    First suspicion: too many connections on the Redis side, since 500 threads connect at once. But Redis can handle tens of thousands of concurrent reads without trouble, and a check showed only around 2,000 connections, so that was not the problem.

Checking the server's kernel log for the same time window:

Feb 18 12:28:38 i-*** kernel: TCP: time wait bucket table overflow
Feb 18 12:28:44 i-*** kernel: printk: 227 messages suppressed.
Feb 18 12:28:44 i-*** kernel: TCP: time wait bucket table overflow
Feb 18 12:28:52 i-*** kernel: printk: 121 messages suppressed.
Feb 18 12:28:52 i-*** kernel: TCP: time wait bucket table overflow
Feb 18 12:28:53 i-*** kernel: printk: 351 messages suppressed.
Feb 18 12:28:53 i-*** kernel: TCP: time wait bucket table overflow
Feb 18 12:28:59 i-*** kernel: printk: 319 messages suppressed.

    Clearly, the number of TIME_WAIT sockets exceeded a limit. A quick check confirmed it: the Aliyun kernel image sets net.ipv4.tcp_max_tw_buckets to 5000, while our own servers use 2,000,000. After raising the parameter, the service returned to normal.
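To verify this kind of diagnosis on a Linux host, one way is to count sockets in the TIME_WAIT state by parsing /proc/net/tcp, where the connection state is the fourth column and the hex code 06 means TIME_WAIT. A minimal sketch:

```python
# Count TIME_WAIT sockets from the text of /proc/net/tcp (Linux).
# State codes are defined by the kernel; 06 == TIME_WAIT.
TIME_WAIT = "06"

def count_time_wait(proc_net_tcp_text):
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) > 3 and fields[3] == TIME_WAIT:
            count += 1
    return count

# On a live host:
#   count_time_wait(open("/proc/net/tcp").read())
```

If this count approaches net.ipv4.tcp_max_tw_buckets, the kernel starts killing TIME_WAIT sockets and logging "time wait bucket table overflow", exactly as in the log above.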

    We then noticed that this failure mode loses tasks: the server side has already popped a task (business logic confirmed it was gone), but the agent never received it. We hit a similar problem with RabbitMQ before. So for critical workloads, mind the consumption pattern and use an acknowledgement (ack) mechanism or similar.
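One common ack-style pattern on Redis (an assumed illustration, not the post's actual fix) is the reliable queue built on RPOPLPUSH: the pop atomically moves the task into a per-consumer "processing" list, so a task that dies in transit can later be re-queued from that list, and LREM acts as the acknowledgement once handling succeeds. A minimal sketch (redis-py 3.x argument order for `lrem`):

```python
# Reliable-queue sketch: RPOPLPUSH as fetch, LREM as ack.
# Queue names here are hypothetical examples.

def fetch(redis_con, task_queue, processing_queue):
    # Atomically move one task from the main queue to the processing queue.
    return redis_con.rpoplpush(task_queue, processing_queue)

def ack(redis_con, processing_queue, task):
    # Remove the finished task from the processing queue: the "ack".
    redis_con.lrem(processing_queue, 1, task)

def consume(redis_con, task_queue, processing_queue, handle):
    task = fetch(redis_con, task_queue, processing_queue)
    if task is not None:
        handle(task)
        ack(redis_con, processing_queue, task)
    return task
```

If a consumer crashes between fetch and ack, the task survives in the processing list and a recovery job can push it back, unlike a bare LPOP where the task is gone the moment the pop succeeds.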
