做者 | yhf20071java
【Arthas 官方社區正在舉行徵文活動,參加即有獎品拿~點擊投稿】node
公司新搭 HDFS 集羣,namenode作ha,可是在啓動 StandbyNamenode 節點的時候出現奇怪的現象:空集羣加載 Editlog 很慢,每次重啓幾乎耗時都在二三十分鐘apache
爲了方便你們理解,大體說下 StandbyNamenode(如下簡稱 SNN)啓動過程:網絡
問題就出在第 2 步,在從 JournalNode 拉取 EditLog 過程當中出現固定 15s 延遲。通常來講,空集羣幾乎沒有操做, editlog 不會太大,不該該出現每次從 JournalNode 拉取 EditLog 都耗費 15s 的時間,日誌以下(爲了方便觀察截取部分日誌):socket
2020-11-04 18:27:27,577 INFO namenode.RedundantEditLogInputStream (RedundantEditLogInputStream.java:nextOp(177)) - Fast-forwarding stream 'http://cbdp-online1.sdns.fin ancial.cloud:8480/getJournal?jid=hdfs-ha&segmentTxId=213656&storageInfo=-64%3A272699407%3A1603893889358%3ACID-aa8ec1b5-a501-4195-9299-e14abefbdc11&inProgressOk=true' to transaction ID 184269 2020-11-04 18:27:42,582 INFO namenode.FSEditLogLoader (FSEditLogLoader.java:loadEditRecords(289)) - replaying edit log: 1/44 transactions completed. (2%) 2020-11-04 18:27:42,583 INFO namenode.FSImage (FSEditLogLoader.java:loadFSEdits(162)) - Edits file http://cbdp-online1.sdns.financial.cloud:8480/getJournal?jid=hdfs-ha &segmentTxId=213656&storageInfo=-64%3A272699407%3A1603893889358%3ACID-aa8ec1b5-a501-4195-9299-e14abefbdc11&inProgressOk=true, http://cbdp-online2.sdns.financial.cloud:8 480/getJournal?jid=hdfs-ha&segmentTxId=213656&storageInfo=-64%3A272699407%3A1603893889358%3ACID-aa8ec1b5-a501-4195-9299-e14abefbdc11&inProgressOk=true, http://cbdp-onli ne3.sdns.financial.cloud:8480/getJournal?jid=hdfs-ha&segmentTxId=213656&storageInfo=-64%3A272699407%3A1603893889358%3ACID-aa8ec1b5-a501-4195-9299-e14abefbdc11&inProgres sOk=true of size 5981 edits # 44 loaded in 15 seconds ...... 2020-11-04 18:27:42,583 INFO namenode.RedundantEditLogInputStream (RedundantEditLogInputStream.java:nextOp(177)) - Fast-forwarding stream 'http://cbdp-online1.sdns.financial.cloud:8480/getJournal?jid=hdfs-ha&;segmentTxId=213700&storageInfo=-64%3A272699407%3A1603893889358%3ACID-aa8ec1b5-a501-4195-9299-e14abefbdc11&inProgressOk=true' to transaction ID 184269 2020-11-04 18:27:57,588 INFO namenode.FSEditLogLoader (FSEditLogLoader.java:loadEditRecords(289)) - replaying edit log: 1/53 transactions completed. (2%) 2020-11-04 18:27:57,589 INFO namenode.FSImage (FSEditLogLoader.java:loadFSEdits(162)) - Edits file http://cbdp-online1.sdns.financial.cloud:8480/getJournal?jid=hdfs-ha&;segmentTxId=213700&storageInfo=-64%3A272699407%3A1603893889358%3ACID-aa8ec1b5-a501-4195-9299-e14abefbdc11&inProgressOk=true, http://cbdp-online2.sdns.financial.cloud:8480/getJournal?jid=hdfs-ha&;segmentTxId=213700&storageInfo=-64%3A272699407%3A1603893889358%3ACID-aa8ec1b5-a501-4195-9299-e14abefbdc11&inProgressOk=true, http://cbdp-online3.sdns.financial.cloud:8480/getJournal?jid=hdfs-ha&;segmentTxId=213700&storageInfo=-64%3A272699407%3A1603893889358%3ACID-aa8ec1b5-a501-4195-9299-e14abefbdc11&inProgressOk=true of size 7088 edits # 53 loaded in 15 seconds
trace org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader loadFSEdits
profiler start/stop
函數
trace org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream$URLLog$1 run
trace --skipJDKMethods false sun.net.www.http.HttpClient parseHTTPHeader
trace --skipJDKMethods false java.net.SocketInputStream socktRead '#cost > 10000'
stack *SocketInputStream socketRead "#cost > 10000"
發現因爲 StandbyNameNode 的網絡讀取數據形成阻塞,到此已經碰到 native 函數,在 java 層面已經沒有有效方法進行分析。oop
這時我看到 StandbyNameNode 的日誌:spa
2020-11-04 18:27:42,583 INFO namenode.RedundantEditLogInputStream (RedundantEditLogInputStream.java:nextOp(177)) - Fast-forwarding stream '
http://cbdp-online1.sdns.financial.cloud:8480/getJournal?jid=hdfs-ha&;segmentTxId=213700&storageInfo=-64%3A272699407%3A1603893889358%3ACID-aa8ec1b5-a501-4195-9299-e14abefbdc11&inProgressOk=true
' to transaction ID 184269
.net
同時想起了 @赫炎 提出的思路,有多是在 JournalNode 端讀取 EditLog 文件的時候有阻塞。調試
trace --skipJDKMethods false org.apache.hadoop.hdfs.qjournal.server.GetJournalEditServlet doGet '#cost > 10000'
發如今調用 java.net.InetSocketAddress.getHostName
處耗時 15s,至此找到了罪魁禍首。
爲了驗證猜測,在每一個 JournalNode 節點 hosts 文件配置 0.0.0.0 0.0.0.0,重啓 SNN,速度提高了 20 倍
不得不說,Arthas 做爲動態追蹤調試 java 進程的神器,真的很方便開發人員定位問題。贊一個!