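Problem description
Over the past three months, the notification application has hit a DNS cache problem twice. After the first occurrence, a restart made the problem go away, so we did not pay much attention to it. It recently happened again, so a thorough investigation seemed necessary. The error log is as follows: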
2018-03-16 18:53:59,501 ERROR [DefaultMessageListenerContainer-1] (com.bill99.asap.service.CryptoClient.seal(CryptoClient.java:34))- null
java.lang.NullPointerException
    at java.net.InetAddress$Cache.put(InetAddress.java:779) ~[?:1.7.0_79]
    at java.net.InetAddress.cacheAddresses(InetAddress.java:858) ~[?:1.7.0_79]
    at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1334) ~[?:1.7.0_79]
    at java.net.InetAddress.getAllByName0(InetAddress.java:1248) ~[?:1.7.0_79]
    at java.net.InetAddress.getAllByName(InetAddress.java:1164) ~[?:1.7.0_79]
    at java.net.InetAddress.getAllByName(InetAddress.java:1098) ~[?:1.7.0_79]
    at java.net.InetAddress.getByName(InetAddress.java:1048) ~[?:1.7.0_79]
    at java.net.InetSocketAddress.<init>(InetSocketAddress.java:220) ~[?:1.7.0_79]
    at sun.net.NetworkClient.doConnect(NetworkClient.java:180) ~[?:1.7.0_79]
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) ~[?:1.7.0_79]
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) ~[?:1.7.0_79]
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:211) ~[?:1.7.0_79]
    at sun.net.www.http.HttpClient.New(HttpClient.java:308) ~[?:1.7.0_79]
    at sun.net.www.http.HttpClient.New(HttpClient.java:326) ~[?:1.7.0_79]
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:997) ~[?:1.7.0_79]
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:933) ~[?:1.7.0_79]
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:851) ~[?:1.7.0_79]
    at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1092) ~[?:1.7.0_79]
    at org.springframework.ws.transport.http.HttpUrlConnection.getRequestOutputStream(HttpUrlConnection.java:81) ~[spring-ws-core.jar:1.5.6]
    at org.springframework.ws.transport.AbstractSenderConnection$RequestTransportOutputStream.createOutputStream(AbstractSenderConnection.java:101) ~[spring-ws-core.jar:1.5.6]
    at org.springframework.ws.transport.TransportOutputStream.getOutputStream(TransportOutputStream.java:41) ~[spring-ws-core.jar:1.5.6]
    at org.springframework.ws.transport.TransportOutputStream.write(TransportOutputStream.java:60) ~[spring-ws-core.jar:1.5.6]
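The symptom is that once this exception appears, it keeps recurring in large numbers, and the affected system node becomes unavailable.
Problem analysis
1. Since InetAddress$Cache.put throws the NullPointerException, let's look at its source code: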
if (policy != InetAddressCachePolicy.FOREVER) {

    // As we iterate in insertion order we can
    // terminate when a non-expired entry is found.
    LinkedList<String> expired = new LinkedList<>();
    long now = System.currentTimeMillis();
    for (String key : cache.keySet()) {
        CacheEntry entry = cache.get(key);

        if (entry.expiration >= 0 && entry.expiration < now) {
            expired.add(key);
        } else {
            break;
        }
    }

    for (String key : expired) {
        cache.remove(key);
    }
}
The NullPointerException is thrown at entry.expiration, which means the entry fetched from cache is null. Next, look at where cache is written:
CacheEntry entry = new CacheEntry(addresses, expiration);
cache.put(host, entry);
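Every write creates a new CacheEntry and then puts it into cache, so null is never written. The next guess was a multithreading problem: an entry is removed while cache.keySet() is being traversed, so cache.get(key) returns null. The source shows that entries are removed from cache in two places, the put method and the get method. The put method, shown above, checks for expiry while traversing and then removes all expired entries together. The other place is the get method: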
public CacheEntry get(String host) {
    int policy = getPolicy();
    if (policy == InetAddressCachePolicy.NEVER) {
        return null;
    }
    CacheEntry entry = cache.get(host);

    // check if entry has expired
    if (entry != null && policy != InetAddressCachePolicy.FOREVER) {
        if (entry.expiration >= 0 &&
            entry.expiration < System.currentTimeMillis()) {
            cache.remove(host);
            entry = null;
        }
    }

    return entry;
}
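Like put, get checks the expiry time on each call and removes the entry if it has expired.
So if there is a multithreading problem, it would arise from one of two cases: 1. put and get are called concurrently; 2. multiple threads call put concurrently. Continuing through the source, put and get are called from three places: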
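The suspected interleaving in the put loop can be replayed in a single thread, which makes the failure mode easier to see. The following is only a sketch: `NullEntryReplay`, its nested `Entry` class, and the host name `example.test` are hypothetical stand-ins, not JDK code, and the "other thread" is simulated by an inline `remove` placed exactly between the iterator returning the key and `cache.get(key)`.

```java
import java.util.LinkedHashMap;
import java.util.LinkedList;
import java.util.Map;

final class NullEntryReplay {
    /** Hypothetical stand-in for the JDK's private CacheEntry. */
    static final class Entry {
        final long expiration;
        Entry(long expiration) { this.expiration = expiration; }
    }

    /**
     * Replays one possible interleaving sequentially and reports whether
     * the expiry check fails with a NullPointerException.
     */
    static boolean replayThrowsNpe() {
        Map<String, Entry> cache = new LinkedHashMap<>();
        cache.put("example.test", new Entry(System.currentTimeMillis() - 1));

        LinkedList<String> expired = new LinkedList<>();
        long now = System.currentTimeMillis();
        try {
            for (String key : cache.keySet()) {
                // Stands in for a second thread removing the key right here.
                cache.remove(key);
                Entry entry = cache.get(key); // the key is gone, so this is null
                if (entry.expiration >= 0 && entry.expiration < now) {
                    expired.add(key);
                } else {
                    break;
                }
            }
        } catch (NullPointerException e) {
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("NPE reproduced: " + replayThrowsNpe());
    }
}
```

This only shows that a removal at that point is sufficient to produce the same exception; it does not prove that this exact interleaving is what happened in production.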
private static void cacheInitIfNeeded() {
    assert Thread.holdsLock(addressCache);
    if (addressCacheInit) {
        return;
    }
    unknown_array = new InetAddress[1];
    unknown_array[0] = impl.anyLocalAddress();

    addressCache.put(impl.anyLocalAddress().getHostName(),
                     unknown_array);

    addressCacheInit = true;
}

/*
 * Cache the given hostname and addresses.
 */
private static void cacheAddresses(String hostname,
                                   InetAddress[] addresses,
                                   boolean success) {
    hostname = hostname.toLowerCase();
    synchronized (addressCache) {
        cacheInitIfNeeded();
        if (success) {
            addressCache.put(hostname, addresses);
        } else {
            negativeCache.put(hostname, addresses);
        }
    }
}

/*
 * Lookup hostname in cache (positive & negative cache). If
 * found return addresses, null if not found.
 */
private static InetAddress[] getCachedAddresses(String hostname) {
    hostname = hostname.toLowerCase();

    // search both positive & negative caches

    synchronized (addressCache) {
        cacheInitIfNeeded();

        CacheEntry entry = addressCache.get(hostname);
        if (entry == null) {
            entry = negativeCache.get(hostname);
        }

        if (entry != null) {
            return entry.addresses;
        }
    }

    // not found
    return null;
}
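cacheInitIfNeeded is only called from cacheAddresses and getCachedAddresses, to check whether the cache has been initialized. The other two methods both synchronize on the addressCache object, so there should be no multithreading problem here.
2. Guess: external code accesses addressCache directly instead of going through the methods provided internally
The source shows that addressCache is a private field with no externally accessible accessor: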
private static Cache addressCache = new Cache(Cache.Type.Positive);
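So business code should not be able to use it directly, unless it uses reflection. A quick global search of the codebase for the keyword "addressCache" turned up code similar to the following: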
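static {
    try {
        Class clazz = java.net.InetAddress.class;
        // InetAddress.addressCache is a private static field
        final Field cacheField = clazz.getDeclaredField("addressCache");
        cacheField.setAccessible(true);
        final Object o = cacheField.get(clazz);
        // the Cache object keeps its entries in a private "cache" map
        Class clazz2 = o.getClass();
        final Field cacheMapField = clazz2.getDeclaredField("cache");
        cacheMapField.setAccessible(true);
        final Map cacheMap = (Map) cacheMapField.get(o);
    } catch (ReflectiveOperationException e) {
        // getDeclaredField and Field.get throw checked exceptions
        throw new ExceptionInInitializerError(e);
    }
}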
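To make the consequence concrete, here is a minimal sketch of the same reflective trick applied to a stand-in class. `ReflectionBypassDemo`, `Holder`, and the sample host and address are hypothetical names used only for illustration. The sketch deliberately avoids `java.net.InetAddress`, because its private fields differ between JDK versions and newer JDKs restrict reflective access to them. It shows that once the raw map has been extracted, callers can modify it without ever taking the lock the owning class relies on.

```java
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

final class ReflectionBypassDemo {
    /** Hypothetical stand-in for a class that guards a private map with a lock. */
    static final class Holder {
        private static final Map<String, String> cache = new LinkedHashMap<>();

        static void put(String host, String address) {
            synchronized (cache) { cache.put(host, address); }
        }

        static int size() {
            synchronized (cache) { return cache.size(); }
        }
    }

    /** Extracts Holder's private map via reflection, as the business code did. */
    @SuppressWarnings("unchecked")
    static Map<String, String> stealCache() {
        try {
            Field field = Holder.class.getDeclaredField("cache");
            field.setAccessible(true);
            // For a static field the instance argument is ignored.
            return (Map<String, String>) field.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Holder.put("example.test", "192.0.2.1");
        Map<String, String> raw = stealCache();
        raw.remove("example.test"); // no synchronized block: Holder's lock is bypassed
        System.out.println("entries left: " + Holder.size());
    }
}
```

The `private` modifier therefore protects nothing here; any thread holding the raw reference can race with the synchronized methods of the owning class.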
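The static block in the business code obtains the addressCache object through reflection, then obtains its internal cache object (a LinkedHashMap), and also provides a method that operates on this map directly. The situation can be simulated with a test like the following: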
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Random;

public class TEst {
    public static void main(String[] args) throws IOException, InterruptedException {
        // shared map, deliberately accessed without any synchronization
        final LinkedHashMap<Integer, HH> map = new LinkedHashMap<>();

        // one writer thread, standing in for InetAddress$Cache.put
        new Thread(new Runnable() {
            @Override
            public void run() {
                for (int i = 0; i < 2000; i++) {
                    map.put(new Random().nextInt(1000), new HH(new Random(100).nextInt()));
                }
            }
        }).start();

        // 100 remover threads, standing in for the business code
        for (int i = 0; i < 100; i++) {
            new Thread(new Runnable() {
                @Override
                public void run() {
                    for (int i = 0; i < 500; i++) {
                        map.remove(new Random().nextInt(1000));
                    }
                }
            }).start();
        }

        Thread.sleep(2000);
        System.out.println("size=" + map.keySet().size() + "," + map.keySet());
        // every key returned by keySet() should map to a non-null value
        for (Integer s : map.keySet()) {
            System.out.println(map.get(s));
        }
    }
}

class HH {
    private int k;

    public HH(int k) {
        this.k = k;
    }

    public int getK() {
        return k;
    }

    public void setK(int k) {
        this.k = k;
    }
}
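This simulates a single thread doing put operations while multiple business threads do remove operations at the same time. Running it produces output like the following (run it several times to see different results):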
size=0,[121, 517, 208]
null
null
null
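This confirms the guess. The size field of HashMap is itself not thread-safe, so under concurrent access it can end up as 0, which makes get return null for every key. HashMap has many other multithreading problems as well, since it was never designed for concurrent use. At this point the cause is roughly understood.
Solution
Guard the reflectively obtained cache object with the same lock that cacheAddresses uses, or simply do not touch the cache object in business code at all. For reference, see Alibaba's open-source project on GitHub for manipulating the DNS cache: https://github.com/alibaba/ja...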
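The first option can be sketched on a stand-in map: one writer and several removers, with every access wrapped in a synchronized block on one shared lock. `GuardedCacheDemo`, `countInconsistencies`, and the thread counts are hypothetical choices for illustration. In the real fix the lock would presumably have to be the reflectively obtained addressCache object itself, since that is the monitor cacheAddresses synchronizes on; a private lock object is used here only so the example is self-contained. Java 8 lambdas are used for brevity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

final class GuardedCacheDemo {
    /**
     * Runs one writer and eight removers against a shared LinkedHashMap,
     * taking the same lock for every access, and returns the number of
     * inconsistencies observed afterwards.
     */
    static int countInconsistencies() {
        final Map<Integer, Integer> map = new LinkedHashMap<>();
        final Object lock = new Object(); // stand-in for the addressCache monitor

        List<Thread> threads = new ArrayList<>();
        threads.add(new Thread(() -> {
            Random random = new Random();
            for (int i = 0; i < 2000; i++) {
                synchronized (lock) { map.put(random.nextInt(1000), i); }
            }
        }));
        for (int t = 0; t < 8; t++) {
            threads.add(new Thread(() -> {
                Random random = new Random();
                for (int i = 0; i < 500; i++) {
                    synchronized (lock) { map.remove(random.nextInt(1000)); }
                }
            }));
        }

        for (Thread thread : threads) {
            thread.start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }

        int bad = 0;
        synchronized (lock) {
            int seen = 0;
            for (Integer key : map.keySet()) {
                seen++;
                if (map.get(key) == null) {
                    bad++; // a key without a value, as in the failing run
                }
            }
            if (seen != map.size()) {
                bad++; // size field disagrees with the actual contents
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        System.out.println("inconsistencies: " + countInconsistencies());
    }
}
```

With every access going through the same monitor, the map's size and contents stay consistent, so the check finds nothing to report. The second option, not touching the cache from business code at all, avoids the need for any of this.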
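Summary
Some time in this investigation was spent checking whether the JDK classes themselves had a bug, which was something of a waste. The other lesson is not to rule out any possibility during troubleshooting: problems often hide in the places we take for granted.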