一個作BI
數據展現的服務在一個晚上重啓了5次,因爲是經過k8s容器編排,服務掛了之後會自動重啓,因此服務還能繼續提供服務。java
第一時間先上日誌系統查看錯誤日誌,發現以下報錯:react
java.lang.OutOfMemoryError ERROR java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:717) at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase.start(CloseableHttpAsyncClientBase.java:83) at org.elasticsearch.client.RestClientBuilder.build(RestClientBuilder.java:190) at com.xiaohongshu.fls.jbds.handler.JbdsImpl.get_realtime_trend_data_from_es(JbdsImpl.java:637) at com.xiaohongshu.fls.jbds.handler.JbdsImpl.get_seller_trend(JbdsImpl.java:316) at com.xiaohongshu.fls.jbds.handler.JbdsImpl$$FastClassBySpringCGLIB$$bd2466f7.invoke(<generated>) at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204) at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:738) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:157) at org.springframework.aop.aspectj.MethodInvocationProceedingJoinPoint.proceed(MethodInvocationProceedingJoinPoint.java:85) at com.xiaohongshu.fls.jbds.aspect.RpcMethodExceptionAspect.requestControllerLog(RpcMethodExceptionAspect.java:33) at sun.reflect.GeneratedMethodAccessor54.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethodWithGivenArgs(AbstractAspectJAdvice.java:629) at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethod(AbstractAspectJAdvice.java:618) at org.springframework.aop.aspectj.AspectJAroundAdvice.invoke(AspectJAroundAdvice.java:70) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179) at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:92) at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179) at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:673) at com.xiaohongshu.fls.jbds.handler.JbdsImpl$$EnhancerBySpringCGLIB$$d374b954.get_seller_trend(<generated>) at sun.reflect.GeneratedMethodAccessor81.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at com.xiaohongshu.infra.rpc.core.ThriftServerBaseProxy.call(ThriftServerBaseProxy.java:119) at com.xiaohongshu.infra.rpc.core.ThriftServerCGlibProxy.intercept(ThriftServerCGlibProxy.java:27) at com.xiaohongshu.fls.jbds.handler.JbdsImpl$$EnhancerByCGLIB$$19473b8f.get_seller_trend(<generated>) at com.xiaohongshu.fls.rpc.jbds.JBusinessDataService$Processor$get_seller_trend.getResult(JBusinessDataService.java:1450) at com.xiaohongshu.fls.rpc.jbds.JBusinessDataService$Processor$get_seller_trend.getResult(JBusinessDataService.java:1435) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:518) at org.apache.thrift.server.Invocation.run(Invocation.java:18) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)
發現是OOM
的錯誤,而且有unable to create new native thread
的錯誤信息,筆者的第一直覺是建立了大量線程從而致使堆外內存空間不足。spring
隨即去CAT
監控系統上去查看線程的活躍狀況,以下圖:apache
發現服務的活躍線程數達到了1W左右,這顯然有問題。服務自己是一個Thrift Server
,開起的worker
線程數爲200,再加上一些其餘IO
線程,總的線程數不會超過300(服務內本身沒有顯示建立線程或者使用線程池)。segmentfault
筆者登陸到線上的Docker實例上,經過jmap -histo:live pid
命令,查看JVM中存活的對象,輸出以下內容:安全
num #instances #bytes class name ---------------------------------------------- 1: 19303 656995672 [B 2: 116670 13942144 [C 3: 15645 5298648 [I 4: 10098 3796848 java.lang.Thread 5: 75817 3639216 java.util.HashMap 6: 113889 2733336 java.lang.String 7: 27521 2421848 java.lang.reflect.Method 8: 68662 2197184 java.util.concurrent.ConcurrentHashMap$Node 9: 10041 1767032 [J 10: 372 1525568 [Ljava.nio.ByteBuffer; 11: 36807 1472280 java.util.LinkedHashMap$Entry 12: 16504 1352488 [Ljava.util.HashMap$Node; 13: 40687 1301984 java.lang.ThreadLocal$ThreadLocalMap$Entry 14: 11495 1277344 java.lang.Class 15: 29903 1196120 java.util.WeakHashMap$Entry 16: 70474 1127584 java.lang.Object 17: 15346 918592 [Ljava.lang.Object; 18: 51556 824896 java.util.HashSet 19: 25528 816896 java.util.HashMap$Node 20: 9989 812176 [Ljava.lang.ThreadLocal$ThreadLocalMap$Entry; 21: 9661 792032 [Ljava.util.WeakHashMap$Entry; 22: 9504 760320 org.apache.http.impl.nio.reactor.BaseIOReactor 23: 30551 733224 java.util.concurrent.ConcurrentLinkedQueue$Node 24: 9914 713808 sun.nio.ch.EPollArrayWrapper 25: 9914 713808 sun.nio.ch.EPollSelectorImpl 26: 29682 712368 java.util.concurrent.ConcurrentLinkedQueue 27: 3525 694672 [Ljava.util.concurrent.ConcurrentHashMap$Node; 28: 15664 501248 java.lang.ref.WeakReference 29: 8364 468384 java.util.LinkedHashMap 30: 9656 463488 java.util.WeakHashMap 31: 20064 436536 [Ljava.lang.Class; 32: 9267 370680 java.security.AccessControlContext 33: 4856 349632 java.lang.reflect.Field 34: 10278 328896 java.lang.ref.ReferenceQueue 35: 9914 317248 sun.nio.ch.AllocatedNativeObject 36: 4955 317120 java.util.concurrent.ConcurrentHashMap 37: 7597 303880 java.lang.ref.SoftReference 38: 9245 293568 [Ljava.security.ProtectionDomain; 39: 11484 275616 java.util.ArrayList 40: 3330 239760 org.springframework.core.annotation.AnnotationAttributes 41: 9989 239736 java.lang.ThreadLocal$ThreadLocalMap 42: 9951 238824 java.util.Collections$SynchronizedSet 43: 9922 238128 java.util.BitSet 44: 9504 228096 org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker 45: 5460 207944 [Ljava.lang.String; 46: 2459 196720 java.lang.reflect.Constructor 47: 12103 193648 java.util.HashMap$KeySet 48: 8003 192072 java.beans.MethodRef 49: 3657 175536 org.aspectj.weaver.reflect.ShadowMatchImpl 50: 10339 165424 java.util.concurrent.atomic.AtomicBoolean 51: 10281 164496 java.lang.ref.ReferenceQueue$Lock 52: 10203 163248 java.util.Collections$UnmodifiableSet 53: 2913 163128 java.beans.MethodDescriptor 54: 2547 161536 [Ljava.lang.reflect.Method; 55: 9914 158624 java.nio.channels.spi.AbstractSelector$1 56: 9913 158608 sun.nio.ch.Util$3 57: 5481 131544 org.springframework.core.MethodClassKey 58: 3707 118624 java.util.LinkedList 59: 3637 116384 org.aspectj.weaver.patterns.ExposedState 60: 2047 114632 java.lang.Class$ReflectionData 61: 1155 110880 org.springframework.beans.GenericTypeAwarePropertyDescriptor 62: 399 109984 [Ljava.lang.Thread; 63: 4387 105288 sun.reflect.generics.tree.SimpleClassTypeSignature 64: 1388 99936 java.beans.PropertyDescriptor 65: 4087 90560 [Ljava.lang.reflect.Type; 66: 5361 85776 org.springframework.core.annotation.AnnotationUtils$DefaultValueHolder 67: 1296 82944 io.netty.buffer.PoolSubpage 68: 4387 82520 [Lsun.reflect.generics.tree.TypeArgument; 69: 674 75488 org.springframework.boot.loader.jar.JarEntry 70: 1848 73920 java.util.TreeMap$Entry 71: 3061 73464 sun.reflect.annotation.AnnotationInvocationHandler 72: 2737 65688 java.util.LinkedList$Node 73: 3989 63824 sun.reflect.generics.tree.ClassTypeSignature 74: 858 61776 org.apache.ibatis.mapping.ResultMapping 75: 832 53248 org.springframework.core.MethodParameter 76: 78 51168 io.netty.util.internal.shaded.org.jctools.queues.MpscArrayQueue 77: 2120 50880 java.util.Collections$UnmodifiableRandomAccessList 78: 3165 50640 java.util.LinkedHashMap$LinkedKeySet 79: 2419 49728 [Lsun.reflect.generics.tree.FieldTypeSignature;
標紅的這個類org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker
很是值得懷疑,有9504個實例,從類名看是一個實現了Reactor
模式的http
客戶端,瞭解Reactor
模式的同窗都知道,其包含了一個accept
線程處理鏈接事件,N
個IO bounding
的線程處理read
、write
事件,和M
個worker
線程處理業務邏輯。多線程
接下來須要查找哪裏使用了這個http客戶端,在build.gradle的文件裏看到依賴了ES
的http client
app
group: 'org.elasticsearch.client', name: 'elasticsearch-rest-client', version: '6.3.2'
經過全局搜索,發現了以下的代碼:dom
private String get_realtime_data_from_es (String seller_id) throws IOException{ Date current_date = new Date( ); SimpleDateFormat sdf = new SimpleDateFormat ("yyyyMMdd"); String date=sdf.format(current_date); RestClient restClient = RestClient.builder( new HttpHost("xxxxxxxxx", 9200, "http"), new HttpHost("xxxxxxxxx", 9201, "http")).build(); org.elasticsearch.client.Response response = restClient.performRequest("GET", "/seller/"+date+"/"+seller_id); restClient.close(); return EntityUtils.toString(response.getEntity()); }
筆者發現這裏調用ES
的RestClient
的很是可疑,竟然是每次調用建立一個對象,而不是使用單例模式。elasticsearch
使用過RestTemplate
或者HttpClient
的同窗都知道,Http
客戶端通常都是經過單例的方式注入到Spring
容器中,客戶端都是線程安全的,多線程狀況下不會出現串包的狀況。(具體線程安全的實現機制,可參考筆者以前的博客:《Http請求鏈接池-HttpClient的AbstractConnPool源碼分析》,地址:https://segmentfault.com/a/11...)
直接進到RestClientBuilder
類的build()
方法中一探究竟,代碼以下:
public RestClient build() { if (failureListener == null) { failureListener = new RestClient.FailureListener(); } CloseableHttpAsyncClient httpClient = AccessController.doPrivileged(new PrivilegedAction<CloseableHttpAsyncClient>() { @Override public CloseableHttpAsyncClient run() { return createHttpClient(); } }); RestClient restClient = new RestClient(httpClient, maxRetryTimeout, defaultHeaders, hosts, pathPrefix, failureListener); httpClient.start(); return restClient; }
build
方法是實際建立RestClient
的地方,設置了超時時間、host等參數。再點進start
方法,他是抽象類CloseableHttpAsyncClient
的一個抽象方法,具體實如今實現類CloseableHttpAsyncClientBase
中,以下:
@Override public void start() { if (this.status.compareAndSet(Status.INACTIVE, Status.ACTIVE)) { if (this.reactorThread != null) { this.reactorThread.start(); } } }
其調用了成員變量reactorThread
的start
方法,而成員變量reactorThread
是一個Thread類,它在構造方法中初始化,以下:
public CloseableHttpAsyncClientBase( final NHttpClientConnectionManager connmgr, final ThreadFactory threadFactory, final NHttpClientEventHandler handler) { super(); this.connmgr = connmgr; if (threadFactory != null && handler != null) { this.reactorThread = threadFactory.newThread(new Runnable() { @Override public void run() { try { final IOEventDispatch ioEventDispatch = new InternalIODispatch(handler); connmgr.execute(ioEventDispatch); } catch (final Exception ex) { log.error("I/O reactor terminated abnormally", ex); } finally { status.set(Status.STOPPED); } } }); } else { this.reactorThread = null; } this.status = new AtomicReference<Status>(Status.INACTIVE); }
可見,每建立一個CloseableHttpAsyncClient
對象,就會建立一個reactorThread
線程,而connmgr.execute(ioEventDispatch)
是一個永久for循環執行的方法,因此線程的run方法不會主動退出,即,reactorThread
線程不會銷燬。
找到bug所在後,結局方案很簡單,只需將ES
的RestClient
經過單例的方式注入到服務中,便可解決堆外內存泄漏的問題。
查找bug時,須要從多個方面去定位,經過儘量多的現場信息去定量分析,監控、線上機器、源碼的查看都須要,深刻下去,會有很多收穫。