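This article was first published on the vivo Internet Technology WeChat official account.
Link: https://mp.weixin.qq.com/s/8f34CaTp--Wz5pTHKA0Xeg
Author: vivo Official Website Mall Development Team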
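As everyone knows, Oracle JDK is the de facto authority for the Java language, and the JDK and the Java language are often treated as nearly the same concept. Even so, we should stay fact-driven and be willing to question it. This article records a production troubleshooting exercise, covering the core steps of analyzing the problem, resolving it, and submitting an Oracle JDK bug report.
In short, after a certain system went live, the number of connections in CLOSE_WAIT grew steadily over time and kept triggering multiple alerts.
We deployed a single node to reproduce the problem.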
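Step1 Narrowing down the problem
First, check which IPs the connections in CLOSE_WAIT are between. The system also calls third parties, so both endpoints of each connection need to be confirmed.
Command:
netstat -np | grep tcp | grep "CLOSE_WAIT"
Result:
(Note: xxx, yyy, and zzz have no meaning; the IPs are masked for information-security reasons.)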
tcp 3547 0 10.107.17.xxx:34602 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:59088 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:58028 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:51962 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 3563 0 10.107.17.xxx:46962 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:34608 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:46496 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:50774 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:59904 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:40208 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:41064 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:36994 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 3547 0 10.107.17.xxx:45080 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 6235 0 10.107.17.xxx:60966 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:56178 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 3547 0 10.107.17.xxx:39922 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:43270 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:40926 zzz.202.32.242:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:44472 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 2891 0 10.107.17.xxx:43036 zzz.202.32.241:443 CLOSE_WAIT 19819/java
........ ........
tcp 38 0 10.107.17.xxx:33472 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:51976 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:57788 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:35638 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:43778 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:46418 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:49914 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:49258 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:48718 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:51480 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:59816 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:49266 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:50246 yyy.12.230.115:443 CLOSE_WAIT 19819/java
tcp 38 0 10.107.17.xxx:39324 yyy.12.230.115:443 CLOSE_WAIT 19819/java
In summary:
yyy.12.230.115
zzz.202.32.241
zzz.202.32.242
These three remote IPs, all on port 443, are the trigger.
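When the output is long, grouping it by remote address makes the dominant peers obvious. The following is a minimal sketch of that grouping, assuming lines in the column layout shown above (protocol, Recv-Q, Send-Q, local address, foreign address, state, PID/program). The class and method names are hypothetical and are not part of the original investigation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CloseWaitByRemote {

    /** Counts CLOSE_WAIT connections per remote IP from netstat-style lines. */
    public static Map<String, Integer> countByRemoteIp(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String[] cols = line.trim().split("\\s+");
            // Assumed columns: proto, recv-q, send-q, local, foreign, state, pid/program.
            if (cols.length < 6 || !cols[0].startsWith("tcp") || !"CLOSE_WAIT".equals(cols[5])) {
                continue; // skip headers, elisions, and other states
            }
            String foreign = cols[4];
            int colon = foreign.lastIndexOf(':');
            String ip = colon > 0 ? foreign.substring(0, colon) : foreign;
            counts.merge(ip, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // A few lines copied from the masked output above.
        List<String> sample = Arrays.asList(
            "tcp 3547 0 10.107.17.xxx:34602 yyy.12.230.115:443 CLOSE_WAIT 19819/java",
            "tcp 38 0 10.107.17.xxx:59088 yyy.12.230.115:443 CLOSE_WAIT 19819/java",
            "tcp 38 0 10.107.17.xxx:40926 zzz.202.32.242:443 CLOSE_WAIT 19819/java",
            "tcp 2891 0 10.107.17.xxx:43036 zzz.202.32.241:443 CLOSE_WAIT 19819/java");
        countByRemoteIp(sample).forEach((ip, n) -> System.out.println(ip + " " + n));
    }
}
```

In practice the input would be the captured netstat output rather than a hard-coded list; the sample lines here only keep the sketch self-contained.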
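Step2 Problem analysis
Who do these three IPs belong to? Which interface exactly is being requested?
We could not tell directly at this point, so the most direct lead went cold. We then looked for more information from other angles.
We checked external resources, threads, and so on, and found no obvious anomalies.
It was time to capture packets for more clues. Not having worked at the TCP layer for a long time, this was hard going.
Clue obtained: a large number of RST packets.
So what kind of operation leads to CLOSE_WAIT? And what kind of connection leads to a large number of RSTs (see the common causes of RST)?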
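On the first question, a socket enters CLOSE_WAIT when the peer has closed its side of the connection (sent FIN) and the local application has not yet called close() on its own socket. The following minimal sketch illustrates that with loopback sockets only. The class name is hypothetical, and it demonstrates general TCP behavior rather than the production code path.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class CloseWaitDemo {

    /** Returns what read() yields on the client after the peer has closed its side. */
    public static int readAfterPeerClose() throws IOException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
            // The peer accepts and immediately closes, which sends FIN to the client.
            server.accept().close();
            // read() returns -1 once the FIN has arrived. From that moment until
            // client.close() runs, the OS reports the client socket as CLOSE_WAIT.
            return client.getInputStream().read();
        } // try-with-resources closes the client here, so it leaves CLOSE_WAIT
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read() after peer close: " + readAfterPeerClose());
    }
}
```

If the client never reached its close() call, for example because the code that owns the socket lost track of it, the socket would stay in CLOSE_WAIT for as long as the process held it open.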
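Step3 Code analysis and localization
With help from our ops colleagues, we learned that these three IPs belong to an image CDN service.
At this point the problem could be traced to specific code logic: the code that issues image CDN requests.
After careful analysis of this source code, we suspected the following: the server issues a URL request for a resource that does not exist, an exception is thrown, and nothing in the JDK closes the Socket.
javax.imageio.ImageIO.read(URL)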
/**
 * Returns a <code>BufferedImage</code> as the result of decoding
 * a supplied <code>URL</code> with an <code>ImageReader</code>
 * chosen automatically from among those currently registered. An
 * <code>InputStream</code> is obtained from the <code>URL</code>,
 * which is wrapped in an <code>ImageInputStream</code>. If no
 * registered <code>ImageReader</code> claims to be able to read
 * the resulting stream, <code>null</code> is returned.
 *
 * <p> The current cache settings from <code>getUseCache</code>and
 * <code>getCacheDirectory</code> will be used to control caching in the
 * <code>ImageInputStream</code> that is created.
 *
 * <p> This method does not attempt to locate
 * <code>ImageReader</code>s that can read directly from a
 * <code>URL</code>; that may be accomplished using
 * <code>IIORegistry</code> and <code>ImageReaderSpi</code>.
 *
 * @param input a <code>URL</code> to read from.
 *
 * @return a <code>BufferedImage</code> containing the decoded
 * contents of the input, or <code>null</code>.
 *
 * @exception IllegalArgumentException if <code>input</code> is
 * <code>null</code>.
 * @exception IOException if an error occurs during reading.
 */
public static BufferedImage read(URL input) throws IOException {
    if (input == null) {
        throw new IllegalArgumentException("input == null!");
    }

    InputStream istream = null;
    try {
        // The TCP connection is established here and the stream is fetched directly.
        // Because the requested resource does not exist, execution enters the catch
        // block and the exception is rethrown!
        istream = input.openStream();
    } catch (IOException e) {
        throw new IIOException("Can't get input stream from URL!", e);
    }
    ImageInputStream stream = createImageInputStream(istream);
    BufferedImage bi;
    try {
        bi = read(stream);
        if (bi == null) {
            stream.close();
        }
    } finally {
        istream.close();
    }
    return bi;
}
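As the code shows, when openStream() fails, the JDK does not close the Socket connection wrapped inside ImageIO.read(url)! Does the CDN then close the connection after a request timeout, leaving the server side in CLOSE_WAIT? With limited networking experience, I could not be 100% sure of this idea, so let's simulate it.
Step4 Reproduction and simulation
A quick simulation based on the system's business source code: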
import java.awt.image.BufferedImage;
import java.io.File;
import java.net.URL;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javax.imageio.ImageIO;

public class CloseWaitRepro {

    public static void main(String[] args) {
        // 100 worker threads issue 5000 requests in total.
        // The pool is not shut down, so the JVM stays alive and the sockets
        // can be inspected with netstat.
        ExecutorService ex = Executors.newFixedThreadPool(100);
        for (int i = 0; i < 5000; i++) {
            ex.execute(task());
        }
    }

    /** Builds a task that requests a non-existent image from an existing domain. */
    private static Runnable task() {
        return new Runnable() {
            @Override
            public void run() {
                // The domain must exist, but the file does not.
                String vivofsUrl = "https://vivobbs.xx.yy.zz/wiwNWYCFW9ieGbWq/20181129/3a2adfde12cd328d81f965088890eeffff.jpg";
                File file = null;
                BufferedImage image = null;
                try {
                    file = File.createTempFile("abc", "jpg");
                    URL url1 = new URL(vivofsUrl);
                    image = ImageIO.read(url1);
                } catch (Throwable e) {
                    e.printStackTrace();
                } finally {
                    if (null != file) {
                        file.delete();
                    }
                    if (null != image) {
                        image.flush();
                        image = null;
                    }
                }
            }
        };
    }
}
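Packet capture
TCP state check
Problem reproduced!
Step5 Reporting the bug after discussion
We reported it to Oracle.
After the report was filed, Oracle contacted me to discuss it. Part of the email exchange is excerpted here, for reference only.
The report has been accepted.
One lesson: we were not familiar enough with how the TCP state machine transitions, so some problems could not be analyzed and reasoned about from the state machine. Thorough command of this knowledge takes continual improvement.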
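Until a JDK containing a fix is deployed, application code can avoid ImageIO.read(URL) by opening the connection itself and releasing it in a finally block. The following is a minimal sketch of that idea, not the change Oracle made and not necessarily what the production system adopted. The class name SafeImageReader is hypothetical, and the handling of the error stream assumes the connection is an HttpURLConnection.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;
import javax.imageio.ImageIO;

public class SafeImageReader {

    /**
     * Reads an image from a URL and releases the connection on every path,
     * including the one where the resource does not exist.
     */
    public static BufferedImage read(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        // ImageIO.read(InputStream) does not close the stream it is given,
        // so try-with-resources closes it here.
        try (InputStream in = conn.getInputStream()) {
            return ImageIO.read(in);
        } finally {
            if (conn instanceof HttpURLConnection) {
                HttpURLConnection http = (HttpURLConnection) conn;
                // On an error response such as 404, the body is exposed through
                // the error stream; close it if present.
                InputStream err = http.getErrorStream();
                if (err != null) {
                    err.close();
                }
                // Signal that this connection is no longer needed.
                http.disconnect();
            }
        }
    }
}
```

A caller would replace ImageIO.read(url1) in the simulation with SafeImageReader.read(url1). Whether this fully removes the CLOSE_WAIT build-up in a given environment should be confirmed with the same netstat check used earlier.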
Note: before reposting this article, please contact WeChat ID Labs2020.