本文基於Spring Cloud Fincheley SR4, Arthas 3.1.4java
最近發現業務上返回慢,而且feign的fallback被觸發了。查看日誌,發現觸發了重試,是什麼觸發的重試呢,經過異常堆棧發現:mysql
Caused by: feign.RetryableException: connect timed out executing GET http://test-service/test-api at feign.FeignException.errorExecuting(FeignException.java:65) ~[feign-core-9.7.0.jar!/:?] at feign.SynchronousMethodHandler.executeAndDecode(SynchronousMethodHandler.java:105) ~[feign-core-9.7.0.jar!/:?] at feign.SynchronousMethodHandler.invoke(SynchronousMethodHandler.java:77) ~[feign-core-9.7.0.jar!/:?] at feign.hystrix.HystrixInvocationHandler$1.run(HystrixInvocationHandler.java:107) ~[feign-hystrix-9.7.0.jar!/:?] at com.netflix.hystrix.HystrixCommand$2.call(HystrixCommand.java:302) ~[hystrix-core-1.5.18.jar!/:1.5.18] at com.netflix.hystrix.HystrixCommand$2.call(HystrixCommand.java:298) ~[hystrix-core-1.5.18.jar!/:1.5.18] at rx.internal.operators.OnSubscribeDefer.call(OnSubscribeDefer.java:46) ~[rxjava-1.3.8.jar!/:1.3.8] ... 27 more Caused by: java.net.SocketTimeoutException: connect timed out at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:?] at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:399) ~[?:?] at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:242) ~[?:?] at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:224) ~[?:?] at java.net.Socket.connect(Socket.java:591) ~[?:?] at sun.net.NetworkClient.doConnect(NetworkClient.java:177) ~[?:?] at sun.net.www.http.HttpClient.openServer(HttpClient.java:474) ~[?:?] at sun.net.www.http.HttpClient.openServer(HttpClient.java:569) ~[?:?] at sun.net.www.http.HttpClient.<init>(HttpClient.java:242) ~[?:?] at sun.net.www.http.HttpClient.New(HttpClient.java:341) ~[?:?] at sun.net.www.http.HttpClient.New(HttpClient.java:362) ~[?:?] at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1248) ~[?:?] at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1187) ~[?:?] at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1081) ~[?:?] at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:1015) ~[?:?] at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1587) ~[?:?] at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1515) ~[?:?] at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:527) ~[?:?] at feign.Client$Default.convertResponse(Client.java:150) ~[feign-core-9.7.0.jar!/:?] at feign.Client$Default.execute(Client.java:72) ~[feign-core-9.7.0.jar!/:?] at org.springframework.cloud.sleuth.instrument.web.client.feign.TracingFeignClient.execute(TracingFeignClient.java:91) ~[spring-cloud-sleuth-core-2.0.3.RELEASE.jar!/:2.0.3.RELEASE] at org.springframework.cloud.sleuth.instrument.web.client.feign.LazyTracingFeignClient.execute(LazyTracingFeignClient.java:55) ~[spring-cloud-sleuth-core-2.0.3.RELEASE.jar!/:2.0.3.RELEASE] at org.springframework.cloud.openfeign.ribbon.RetryableFeignLoadBalancer$1.doWithRetry(RetryableFeignLoadBalancer.java:103) ~[spring-cloud-openfeign-core-2.0.3.RELEASE.jar!/:2.0.3.RELEASE] at org.springframework.cloud.openfeign.ribbon.RetryableFeignLoadBalancer$1.doWithRetry(RetryableFeignLoadBalancer.java:88) ~[spring-cloud-openfeign-core-2.0.3.RELEASE.jar!/:2.0.3.RELEASE] at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:287) ~[spring-retry-1.2.4.RELEASE.jar!/:?] at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:180) ~[spring-retry-1.2.4.RELEASE.jar!/:?] at org.springframework.cloud.openfeign.ribbon.RetryableFeignLoadBalancer.execute(RetryableFeignLoadBalancer.java:88) ~[spring-cloud-openfeign-core-2.0.3.RELEASE.jar!/:2.0.3.RELEASE] at org.springframework.cloud.openfeign.ribbon.RetryableFeignLoadBalancer.execute(RetryableFeignLoadBalancer.java:54) ~[spring-cloud-openfeign-core-2.0.3.RELEASE.jar!/:2.0.3.RELEASE] at com.netflix.client.AbstractLoadBalancerAwareClient$1.call(AbstractLoadBalancerAwareClient.java:104) ~[ribbon-loadbalancer-2.2.5.jar!/:2.2.5] at com.netflix.loadbalancer.reactive.LoadBalancerCommand$3$1.call(LoadBalancerCommand.java:303) ~[ribbon-loadbalancer-2.2.5.jar!/:2.2.5] at com.netflix.loadbalancer.reactive.LoadBalancerCommand$3$1.call(LoadBalancerCommand.java:287) ~[ribbon-loadbalancer-2.2.5.jar!/:2.2.5] at rx.internal.util.ScalarSynchronousObservable$3.call(ScalarSynchronousObservable.java:231) ~[rxjava-1.3.8.jar!/:1.3.8] at rx.internal.util.ScalarSynchronousObservable$3.call(ScalarSynchronousObservable.java:228) ~[rxjava-1.3.8.jar!/:1.3.8] at rx.Observable.unsafeSubscribe(Observable.java:10327) ~[rxjava-1.3.8.jar!/:1.3.8]
發現調用微服務test-service鏈接超時,注意是鏈接超時,而不是讀取超時。接下來,首先查看ribbon鏈接超時配置,看是否配置的鏈接超時太短,發現是:react
#ribbon默認鏈接超時 ribbon.ConnectTimeout=500
500ms創建tcp鏈接已經不短了,因此配置沒有問題。從當前實例手動調用微服務實例的接口,看是不是通的:git
curl http://test-service的ip:test-service的端口
發現訪問成功,沒有阻塞。證實網絡鏈接沒有問題,咱們來看下netstat
網絡鏈接狀態:github
Active Internet connections (w/o servers) Proto Recv-Q Send-Q Local Address Foreign Address State tcp 0 0 test-5c665cc496-6jkx5:54644 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:54666 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:54680 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:54088 ip-10-238-8-68.eu-central-1.compute.internal:8211 ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:54674 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:54682 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:54652 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED tcp 32 0 test-5c665cc496-6jkx5:47092 244-213-201-91.test.com:https CLOSE_WAIT tcp 0 0 test-5c665cc496-6jkx5:54662 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:54624 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED tcp 32 0 test-5c665cc496-6jkx5:47098 244-213-201-91.test.com:https CLOSE_WAIT tcp 0 0 test-5c665cc496-6jkx5:54672 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:54630 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:58758 ip-10-238-9-71.eu-central-1.compute.internal:mysql ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:8251 10-238-8-117.api-gateway.test1.svc.cluster.local:43132 FIN_WAIT2 tcp 0 0 test-5c665cc496-6jkx5:54648 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:58778 ip-10-238-9-71.eu-central-1.compute.internal:mysql ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:54646 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:54628 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:54650 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:54632 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:54638 ip-10-238-8-119.eu-central-1.compute.internal:6379 ESTABLISHED tcp 0 0 test-5c665cc496-6jkx5:59434 ip-10-238-9-71.eu-central-1.compute.internal:mysql ESTABLISHED tcp 32 0 test-5c665cc496-6jkx5:47104 244-213-201-91.test.com:https CLOSE_WAIT tcp 1 0 test-5c665cc496-6jkx5:54146 ip-10-238-8-68.eu-central-1.compute.internal:8211 CLOSE_WAIT tcp 32 0 test-5c665cc496-6jkx5:47100 244-213-201-91.test.com:https CLOSE_WAIT tcp 32 0 test-5c665cc496-6jkx5:47106 244-213-201-91.test.com:https CLOSE_WAIT
發現:web
猜測,多是調用的service-test的ip地址不是最新的。spring
如今該Athas登場了,啓動Arthas,咱們查看下feign調用的真實ip: 因爲咱們使用了sleuth,因此這裏用sleuth監控的feigh client去看調用ip:sql
watch org.springframework.cloud.sleuth.instrument.web.client.feign.TracingFeignClient execute params[0].url()
對應的代碼是:api
package org.springframework.cloud.sleuth.instrument.web.client.feign; final class TracingFeignClient implements Client { @Override public Response execute(Request request, Request.Options options) throws IOException { //..... } }
經過觀察,咱們發現,確實是之前的ip。 那麼爲何沒有更新呢,咱們來查看下EurekaClient相關的代碼,參考我另外一篇文章:Spring Cloud Eureka 全解 (4) - 核心流程-服務與實例列表獲取詳解. 咱們來看下PollingServerListUpdater
這個類,負責更新的線程池是:網絡
int coreSize = poolSizeProp.get(); ThreadFactory factory = (new ThreadFactoryBuilder()) .setNameFormat("PollingServerListUpdater-%d") .setDaemon(true) .build(); _serverListRefreshExecutor = new ScheduledThreadPoolExecutor(coreSize, factory);
謝天謝地,這個線程是有名字的。咱們經過Arthas看下線程列表:
[arthas@24]$ thread thread Threads Total: 736, NEW: 0, RUNNABLE: 81, BLOCKED: 0, WAITING: 634, TIMED_WAITIN G: 21, TERMINATED: 0 ID NAME GROUP PRIORI STATE %CPU TIME INTER DAEMON 11 Log4j2-TF-2-AsyncLo main 5 TIMED 39 138:46 false true 17 Log4j2-TF-5-AsyncLo main 5 TIMED 37 137:27 false true 1045 as-command-execute- system 10 RUNNA 19 0:0 false true 1027 AsyncAppender-Worke system 9 WAITI 0 0:0 false true 68 AsyncResolver-boots main 5 TIMED 0 0:0 false true 264 AsyncResolver-boots main 5 WAITI 0 0:0 false true 1025 Attach Listener system 9 RUNNA 0 0:0 false true 9 Common-Cleaner InnocuousThr 8 TIMED 0 0:0 false true 151 DataPublisher main 5 TIMED 0 0:2 false true 107 DestroyJavaVM main 5 RUNNA 0 0:37 false false 69 DiscoveryClient-0 main 5 TIMED 0 0:1 false true 70 DiscoveryClient-1 main 5 WAITI 0 0:1 false true 748 DiscoveryClient-Cac main 5 WAITI 0 0:9 false true 751 DiscoveryClient-Hea main 5 WAITI 0 0:7 false true 71 DiscoveryClient-Ins main 5 TIMED 0 0:0 false true 24 Druid-ConnectionPoo main 5 WAITI 0 0:0 false true 21 Druid-ConnectionPoo main 5 WAITI 0 0:0 false true 25 Druid-ConnectionPoo main 5 TIMED 0 0:0 false true 22 Druid-ConnectionPoo main 5 TIMED 0 0:0 false true 20 Druid-ConnectionPoo main 5 TIMED 0 0:0 false true 67 Eureka-JerseyClient main 5 WAITI 0 0:0 false true 3 Finalizer system 8 WAITI 0 0:0 false true 119 ForkJoinPool.common main 5 WAITI 0 0:9 false true 30 I/O dispatcher 1 main 5 RUNNA 0 0:2 false false 155 NFLoadBalancer-Ping main 5 TIMED 0 0:0 false true 150 PollingServerListUp main 5 WAITI 0 0:0 false true 157 PollingServerListUp main 5 WAITI 0 0:0 false true 2 Reference Handler system 10 RUNNA 0 0:0 false true 146 RibbonApacheHttpCli main 5 TIMED 0 0:0 false true 135 RxComputationSchedu main 5 TIMED 0 1:44 false true 132 RxIoScheduler-1 (Ev main 5 TIMED 0 0:0 false true 4 Signal Dispatcher system 9 RUNNA 0 0:0 false true 114 SimplePauseDetector main 5 TIMED 0 2:10 false true 115 SimplePauseDetector main 5 TIMED 0 2:10 false true 116 SimplePauseDetector main 5 TIMED 0 2:12 false true
發現關鍵字PollingServerListUp
有兩個線程,150和157,分別看看他們在幹什麼:
[arthas@24]$ thread 150 thread 150 "PollingServerListUpdater-0" Id=150 WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@778af81b at java.base@11.0.4/jdk.internal.misc.Unsafe.park(Native Method) - waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@778af81b at java.base@11.0.4/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194) at java.base@11.0.4/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2081) at java.base@11.0.4/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1170) at java.base@11.0.4/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:899) at java.base@11.0.4/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1054) at java.base@11.0.4/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114) at java.base@11.0.4/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base@11.0.4/java.lang.Thread.run(Thread.java:834) Affect(row-cnt:0) cost in 322 ms. [arthas@24]$ thread 157 thread 157 "PollingServerListUpdater-1" Id=157 WAITING on java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@3adcda2 at java.base@11.0.4/jdk.internal.misc.Unsafe.park(Native Method) - waiting on java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@3adcda2 at java.base@11.0.4/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194) at java.base@11.0.4/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885) at java.base@11.0.4/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:917) at java.base@11.0.4/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1240) at java.base@11.0.4/java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:959) at com.netflix.loadbalancer.BaseLoadBalancer.setServersList(BaseLoadBalancer.java:475) at com.netflix.loadbalancer.DynamicServerListLoadBalancer.setServersList(DynamicServerListLoadBalancer.java:156) at com.netflix.loadbalancer.DynamicServerListLoadBalancer.updateAllServerList(DynamicServerListLoadBalancer.java:267) at com.netflix.loadbalancer.DynamicServerListLoadBalancer.updateListOfServers(DynamicServerListLoadBalancer.java:250) at com.netflix.loadbalancer.DynamicServerListLoadBalancer$1.doUpdate(DynamicServerListLoadBalancer.java:62) at com.netflix.loadbalancer.PollingServerListUpdater$1.run(PollingServerListUpdater.java:116) at java.base@11.0.4/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) at java.base@11.0.4/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) at java.base@11.0.4/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) at java.base@11.0.4/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base@11.0.4/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base@11.0.4/java.lang.Thread.run(Thread.java:834) Number of locked synchronizers = 1 - java.util.concurrent.ThreadPoolExecutor$Worker@6ab1c5f9
發現PollingServerListUpdater-1
一直等待獲取鎖,看來問題就在這裏了,查看對應的Ribbon代碼,發現:
PollingServerListUpdater-1
須要獲取allServerLock的寫鎖那麼是否是Ping出了問題,咱們看下對應的獲取鎖的代碼BaseLoadBalancer
:
public void runPinger() throws Exception { //省略無用代碼 allLock = allServerLock.readLock(); allLock.lock(); allServers = allServerList.toArray(new Server[allServerList.size()]); allLock.unlock();
咱們看到,這裏的代碼,並無用try{lock} finally {unlock}
的套路,若是中間代碼異常,則會鎖不能釋放。 而且這裏是重入鎖,lock多少次就要unlock多少次,少一次,其餘線程都獲取不到。
allServers = allServerList.toArray(new Server[allServerList.size()]);
並無涉及任何IO,只是數據變換,也沒有空指針異常。猜測是發生了OOM異常,致使內存沒有分配。檢查日誌,果真發現了OOM。
這件事告訴咱們,對於鎖,必定要try{lock} finally {unlock}
。就算代碼不會拋出任何異常,發生OOM時,也有可能致使鎖不能釋放
感受這個代碼仍是修一下吧,提了個issue給ribbon