kafka 0.8.1
On December 22 the application suddenly began throwing errors:
[2014/12/22 11:52:32.738]java.net.SocketException: Too many open files
[2014/12/22 11:52:32.738] at java.net.Socket.createImpl(Socket.java:447)
[2014/12/22 11:52:32.738] at java.net.Socket.connect(Socket.java:577)
[2014/12/22 11:52:32.738] at java.net.Socket.connect(Socket.java:528)
[2014/12/22 11:52:32.738] at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
[2014/12/22 11:52:32.738] at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
[2014/12/22 11:52:32.738] at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
[2014/12/22 11:52:32.738] at sun.net.www.http.HttpClient.&lt;init&gt;(HttpClient.java:211)
[2014/12/22 11:52:32.738] at sun.net.www.http.HttpClient.New(HttpClient.java:308)
[2014/12/22 11:52:32.738] at sun.net.www.http.HttpClient.New(HttpClient.java:326)
[2014/12/22 11:52:32.738] at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:996)
[2014/12/22 11:52:32.738] at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
[2014/12/22 11:52:32.738] at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850)
[2014/12/22 11:52:32.738] at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1091)
[2014/12/22 11:52:32.738] at com.rbc.http.client.RbcHttpClient.callCptService(RbcHttpClient.java:50)
[2014/12/22 11:52:32.738] at com.rbc.http.client.RbcHttpClient.callCptService(RbcHttpClient.java:144)
[2014/12/22 11:52:32.738] at com.rbc.http.bean.BaseBean.call(BaseBean.java:54)
[2014/12/22 11:52:32.739] at app.package.APPClass.send2Center(APPClass.java:852)
[2014/12/22 11:52:32.739] at sun.reflect.GeneratedMethodAccessor4312.invoke(Unknown Source)
[2014/12/22 11:52:32.739] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[2014/12/22 11:52:32.739] at java.lang.reflect.Method.invoke(Method.java:606)
[2014/12/22 11:52:32.739] at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:307)
[2014/12/22 11:52:32.739] at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:198)
[2014/12/22 11:52:32.739] at com.sun.proxy.$Proxy157.send2Center(Unknown Source)
[2014/12/22 11:52:32.739] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
[2014/12/22 11:52:32.739] at java.util.concurrent.FutureTask.run(FutureTask.java:262)
[2014/12/22 11:52:32.739] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
[2014/12/22 11:52:32.739] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[2014/12/22 11:52:32.739] at java.lang.Thread.run(Thread.java:745)
The cause: a switch had been turned on that morning so that the application would ship its access logs to the Kafka messaging platform, but the application servers had no host mapping configured for the Kafka brokers, which triggered the failure.
To reproduce: configure the Kafka client's broker-list with a nonexistent host, keep sending messages, and run:
lsof -p $pid|wc -l
The Java process's file-descriptor count keeps growing, with a large number of socket entries among them, until it finally exceeds the limit. The limit can be checked with:
ulimit -a|grep 'open files'
Once the limit is exceeded, the process keeps throwing:
java.net.SocketException: Too many open files
which reproduces the problem.
However, if broker-list is configured with a nonexistent IP address instead of a hostname, the problem does not reproduce.
Generally, a file descriptor is an index for an entry in a kernel-resident array data structure containing the details of open files. In POSIX this data structure is called a file descriptor table, and each process has its own file descriptor table. The process passes the file descriptor to the kernel through a system call, and the kernel will access the file on behalf of the process. The process itself cannot read or write the file descriptor table directly.
On Linux, the set of file descriptors open in a process can be accessed under the path /proc/PID/fd/, where PID is the process identifier.
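For example, on Linux the current shell's descriptors can be listed straight from /proc (here `$$` is the shell's own PID; substitute any PID to inspect another process):

```shell
# list this shell's open file descriptors via /proc (Linux only)
ls /proc/$$/fd
# count them -- equivalent in spirit to `lsof -p $pid | wc -l`
ls /proc/$$/fd | wc -l
```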
ulimit -a|grep 'open files'
view the per-process file-descriptor limit
cat /proc/sys/fs/file-max
view the system-wide file-descriptor limit
lsof -p $pid
list the file descriptors a given process has open
ulimit -n $num
temporarily change the file-descriptor limit
To keep Kafka problems from affecting the application, use Kafka's asynchronous producer: a send appends the message to an in-memory queue and returns immediately, while a background thread keeps draining the queue and sending the messages to Kafka.
When the Kafka brokers are unreachable and messages cannot be sent, setting queue.enqueue.timeout.ms=0 makes the producer simply drop messages once the in-memory queue is full, rather than block.
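For reference, an async-producer configuration along these lines would be used (a sketch based on the 0.8 producer property names; broker addresses are placeholders):

```properties
# kafka 0.8 producer configuration (sketch; broker addresses are placeholders)
metadata.broker.list=kafka-broker-1:9092,kafka-broker-2:9092
producer.type=async
# when the in-memory queue is full, drop the message immediately instead of blocking
queue.enqueue.timeout.ms=0
# capacity of the in-memory queue drained by the background send thread
queue.buffering.max.messages=10000
```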
Before sending any messages, the Kafka client first connects to one of the brokers in broker-list to fetch topic metadata:
ERROR [2014-12-22 14:05:04,687] (Logging.scala:103) - Producer connection to $kafka_server_1:9096 unsuccessful
java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:127)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:613)
at kafka.network.BlockingChannel.connect(BlockingChannel.scala:57)
at kafka.producer.SyncProducer.connect(SyncProducer.scala:141)
at kafka.producer.SyncProducer.getOrMakeConnection(SyncProducer.scala:156)
at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:68)
at kafka.producer.SyncProducer.send(SyncProducer.scala:112)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:53)
The UnresolvedAddressException is thrown during this metadata fetch, inside SyncProducer and BlockingChannel. The root cause: in BlockingChannel's connect method, the channel.connect call throws while the connected flag is still false, and SyncProducer's disconnect only calls BlockingChannel's disconnect when connected is true. The client therefore keeps calling BlockingChannel.connect without ever calling disconnect, and every connect call invokes SocketChannel.open.
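The leak can be reproduced outside Kafka with a few lines of NIO. This sketch mimics the buggy pattern (the class and method names are invented for illustration; the descriptor count is read from /proc, so it is Linux-only):

```java
import java.io.File;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SocketChannel;
import java.nio.channels.UnresolvedAddressException;
import java.util.ArrayList;
import java.util.List;

public class FdLeakDemo {

    // Count this process's open file descriptors (Linux-specific).
    static int openFds() {
        String[] fds = new File("/proc/self/fd").list();
        return fds == null ? -1 : fds.length;
    }

    // Repeat the 0.8.1 client's mistake n times; return the descriptor growth.
    static int leak(int n) {
        // createUnresolved guarantees an address that fails name resolution,
        // just like a broker-list hostname with no host mapping configured.
        InetSocketAddress addr =
            InetSocketAddress.createUnresolved("no-such-host.invalid", 9092);
        List<SocketChannel> leaked = new ArrayList<>();
        int before = openFds();
        for (int i = 0; i < n; i++) {
            try {
                SocketChannel ch = SocketChannel.open(); // allocates a descriptor
                leaked.add(ch);     // keep a reference so nothing reclaims the fd
                ch.connect(addr);   // throws before any cleanup code runs
            } catch (UnresolvedAddressException e) {
                // mimic the buggy client: swallow the error and retry, never close()
            } catch (IOException e) {
                break; // open() itself failed, e.g. the fd limit was reached
            }
        }
        int growth = openFds() - before;
        // tidy up so the demo itself doesn't exhaust the limit
        for (SocketChannel ch : leaked) {
            try { ch.close(); } catch (IOException ignored) { }
        }
        return growth;
    }

    public static void main(String[] args) {
        System.out.println("fd growth after 50 failed connects: " + leak(50));
    }
}
```

Each loop iteration leaves one descriptor behind, exactly the growth that `lsof -p $pid|wc -l` shows against the real producer.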
Each call to SocketChannel.open() allocates a new file descriptor (open() goes through the selector provider down to Net.socket(), which creates the descriptor), so the descriptor count grows without bound. Why does this only happen with a hostname and not with an IP address? The answer is in SocketChannelImpl.connect.
With a hostname, Net.checkAddress throws UnresolvedAddressException and connect exits immediately; with an IP address, the failure comes later, when Net.connect times out, and in that case the subsequent finally and catch blocks do run, including readerCleanup, which ends up closing the channel and releasing its file descriptor. So even though disconnect is never called, the descriptor is freed, and configuring an IP address does not leak descriptors.
The fix modifies kafka.network.BlockingChannel: wrap the connect in a try-catch, and call disconnect when UnresolvedAddressException is thrown. That resolves the problem.
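A sketch of the corrected pattern in Java (the actual patch is in Scala inside BlockingChannel; the class and method names here are invented for illustration):

```java
import java.io.File;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SocketChannel;
import java.nio.channels.UnresolvedAddressException;

public class FixedConnect {

    // If connect() throws before the channel is marked connected, close the
    // channel ourselves so its descriptor is released -- the cleanup that
    // kafka 0.8.1's BlockingChannel.connect was missing.
    static void connectOrCleanup(InetSocketAddress addr) throws IOException {
        SocketChannel ch = SocketChannel.open();
        try {
            ch.connect(addr);
        } catch (UnresolvedAddressException e) {
            ch.close(); // release the descriptor before propagating the error
            throw e;
        }
    }

    // Count this process's open file descriptors (Linux-specific).
    static int openFds() {
        String[] fds = new File("/proc/self/fd").list();
        return fds == null ? -1 : fds.length;
    }

    public static void main(String[] args) {
        InetSocketAddress bad =
            InetSocketAddress.createUnresolved("no-such-host.invalid", 9092);
        int before = openFds();
        for (int i = 0; i < 50; i++) {
            try {
                connectOrCleanup(bad);
            } catch (UnresolvedAddressException | IOException expected) {
                // retry, as the producer's send loop would
            }
        }
        System.out.println("fd growth with cleanup: " + (openFds() - before));
    }
}
```

With the explicit close in the failure path, repeated connect attempts no longer accumulate descriptors.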
This is the issue filed on the Kafka JIRA after the problem was found: https://issues.apache.org/jira/browse/KAFKA-1832