Scenario: five agents raised a flood of exceptions on startup. The exception is as follows:
2017-11-24 13:57:54,760 - 8 - ERROR: - Error while reading from socket: ('Connection closed by server.',)
Traceback (most recent call last):
  File "./inject_agent/agent.py", line 78, in execute
    self._execute()
  File "./inject_agent/agent.py", line 87, in _execute
    task = self.get_task()
  File "./inject_agent/agent.py", line 44, in get_task
    body = self.db.redis_con.lpop(redis_conf["proxy_task_queue"])
  File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 1329, in lpop
    return self.execute_command('LPOP', name)
  File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 673, in execute_command
    connection.send_command(*args)
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 610, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 585, in send_packed_command
    self.connect()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 493, in connect
    self.on_connect()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 561, in on_connect
    if nativestr(self.read_response()) != 'OK':
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 624, in read_response
    response = self._parser.read_response()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 284, in read_response
    response = self._buffer.readline()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 216, in readline
    self._read_from_socket()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 191, in _read_from_socket
    (e.args,))
The machine is an Alibaba Cloud server with one Redis instance deployed on it.
The setup: five machines pull tasks from Redis and consume them, and each agent runs 100 threads at the same time!
So here is the question: why would Redis actively close the socket?
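My first suspicion was Redis itself: too many connections, since 500 threads were connecting at once. But Redis handles tens of thousands of concurrent reads without trouble, and a check showed the connection count was only around 2000, so that was not the cause.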
Next I checked the server log for the same time period:
Feb 18 12:28:38 i-*** kernel: TCP: time wait bucket table overflow
Feb 18 12:28:44 i-*** kernel: printk: 227 messages suppressed.
Feb 18 12:28:44 i-*** kernel: TCP: time wait bucket table overflow
Feb 18 12:28:52 i-*** kernel: printk: 121 messages suppressed.
Feb 18 12:28:52 i-*** kernel: TCP: time wait bucket table overflow
Feb 18 12:28:53 i-*** kernel: printk: 351 messages suppressed.
Feb 18 12:28:53 i-*** kernel: TCP: time wait bucket table overflow
Feb 18 12:28:59 i-*** kernel: printk: 319 messages suppressed.
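Clearly, the number of TIME_WAIT sockets had exceeded the limit. A quick check confirmed it: the Alibaba kernel sets net.ipv4.tcp_max_tw_buckets to 5000, whereas my own servers set it to 2,000,000. After fixing the parameter, the service returned to normal.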
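To compare the current TIME_WAIT count against the limit, something like the following could be used. This is only a rough sketch, assuming a Linux host: it relies on the fact that the `st` column of /proc/net/tcp is `06` for sockets in TIME_WAIT and that the limit is readable from /proc/sys/net/ipv4/tcp_max_tw_buckets. The function names `count_time_wait` and `read_tw_limit` are made up for illustration, and only IPv4 sockets are counted.

```python
TIME_WAIT_STATE = "06"  # hex code for TIME_WAIT in the "st" column of /proc/net/tcp


def count_time_wait(proc_net_tcp_lines):
    """Count TIME_WAIT sockets in lines formatted like /proc/net/tcp."""
    count = 0
    for line in proc_net_tcp_lines:
        fields = line.split()
        # Data rows start with a slot number such as "12:"; the header row does not.
        if len(fields) > 3 and fields[0].endswith(":") and fields[3] == TIME_WAIT_STATE:
            count += 1
    return count


def read_tw_limit(path="/proc/sys/net/ipv4/tcp_max_tw_buckets"):
    """Read the kernel limit on TIME_WAIT sockets (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())
```

On the server itself, `count_time_wait(open("/proc/net/tcp"))` can then be compared with `read_tw_limit()`; a count sitting at or near the limit matches the "time wait bucket table overflow" messages above.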
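I then found that tasks can be lost in this situation: the agent never receives the task, yet the server has already removed it from the queue (the business logic confirmed that it really was lost). I ran into a similar problem earlier when using RabbitMQ. So for critically important workloads, pay attention to how the queue is consumed, and use something like an ack.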
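One common way to get ack-like behaviour from a Redis list is to move each task to a "processing" list atomically with RPOPLPUSH and remove it only after the work succeeds. The sketch below is an illustration under assumptions, not the code used by the agents above: `consume_one` and the queue names are hypothetical, `FakeRedis` is a tiny in-memory stand-in so the example runs without a server, and the `lrem(name, count, value)` argument order assumes redis-py 3.x or later. Note that RPOPLPUSH takes from the tail of the list, whereas the original code uses LPOP on the head.

```python
class FakeRedis(object):
    """In-memory stand-in that mimics the few list commands used below."""

    def __init__(self):
        self.lists = {}

    def rpush(self, name, value):
        self.lists.setdefault(name, []).append(value)

    def rpoplpush(self, src, dst):
        # Pop from the tail of src and push onto the head of dst in one step.
        items = self.lists.get(src)
        if not items:
            return None
        value = items.pop()
        self.lists.setdefault(dst, []).insert(0, value)
        return value

    def lrem(self, name, count, value):
        # Only the count=1 case (remove the first match) is modelled here.
        items = self.lists.get(name, [])
        if value in items:
            items.remove(value)
            return 1
        return 0


def consume_one(con, queue, processing, handler):
    """Take one task, run handler on it, and ack by removing it afterwards.

    If handler raises, or the connection drops before the task reaches the
    agent, the task is still in the processing list and can be re-queued.
    """
    body = con.rpoplpush(queue, processing)
    if body is None:
        return False  # queue is empty
    handler(body)
    con.lrem(processing, 1, body)  # the "ack"
    return True
```

With a real client, `con` would be the redis-py connection object instead of `FakeRedis`. A separate recovery job would still be needed to move stale entries from the processing list back onto the main queue.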