I recently ran into a problem while bulk-loading large volumes of data into ES: with small data volumes, indexing worked fine; with large volumes, after the load had been running for a while, some data nodes would drop out of the cluster. The log shows:
[2018-04-09T21:08:48,481][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch_25] [gc][young][4312117][1523595] duration [11.4s], collections [1]/[1.6s], total [11.4s]/[12.9m], memory [27.7gb]->[15.3gb]/[31.8gb], all_pools {[young] [17.4gb]->[13.4mb]/[1.4gb]}{[survivor] [46mb]->[191.3mb]/[191.3mb]}{[old] [10gb]->[14.6gb]/[30.1gb]}
[2018-04-09T21:08:48,481][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch_25] [gc][4312117] overhead, spent [11.4s] collecting in the last [12s]
[2018-04-09T21:08:54,654][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch_25] [gc][4312123] overhead, spent [412ms] collecting in the last [1.1s]
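When scanning logs for pauses like these, it can help to pull the collection time out of each `JvmGcMonitorService` line. A minimal sketch with `sed` (the regex is my own, matched against the log format above):

```shell
# Sample "overhead" line from the ES log above
line='[2018-04-09T21:08:48,481][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch_25] [gc][4312117] overhead, spent [11.4s] collecting in the last [12s]'

# Extract the value after "spent [" — the time the JVM spent collecting
dur=$(echo "$line" | sed -n 's/.*spent \[\([0-9.]*s\)\].*/\1/p')
echo "$dur"   # → 11.4s
```

Run over a whole log file (`sed -n '…p' elasticsearch.log`), this gives a quick list of GC pause durations to gauge how often the node stalls.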
Clearly, JVM full GC is taking far too long, which has a lot to do with the heap size being set to 32 GB:
ES memory usage and GC metrics: by default, the master checks the status of the other nodes every 30 seconds; if any node's garbage collection duration exceeds 30 seconds, the master will treat that node as having left the cluster.
Setting the heap too large leads to long GC runs, and these long stop-the-world pauses make the cluster wrongly conclude that the node has dropped out.
But if the heap size is set too small, GC runs too frequently, which hurts ES indexing and search performance.
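A common compromise (my own sketch, not from the original post) is to keep the heap a little below the ~32 GB compressed-oops cutoff, with min and max pinned to the same value so the heap never resizes, in `config/jvm.options`:

```
# config/jvm.options — fix min and max heap to the same value,
# staying under ~32 GB so the JVM can keep using compressed object pointers
-Xms30g
-Xmx30g
```

Staying under the cutoff keeps object pointers at 4 bytes, so a 30 GB heap can hold more live objects than a 32 GB one while each GC cycle has less memory to walk.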
From the official documentation ( https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-zen.html ) and this blog post ( http://www.cnblogs.com/bonelee/p/8063915.html ), the relevant fault-detection settings are:
| Setting | Description |
| --- | --- |
| `discovery.zen.fd.ping_interval` | How often a node gets pinged. Defaults to `1s`. |
| `discovery.zen.fd.ping_timeout` | How long to wait for a ping response. Defaults to `30s`. |
| `discovery.zen.fd.ping_retries` | How many ping failures / timeouts cause a node to be considered failed. Defaults to `3`. |
因此經過增長ping_timeout的時間,和增長ping_retries的次數來防止節點錯誤的脫離集羣,能夠使節點有充足的時間進行full GC。
discovery.zen.fd.ping_timeout: 1000s
discovery.zen.fd.ping_retries: 10
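For reference, the full fault-detection block in `elasticsearch.yml` might look like this (a sketch; `ping_interval` is left at its default, and the comments are my own reading of the settings):

```yaml
# elasticsearch.yml — zen fault-detection tuning
discovery.zen.fd.ping_interval: 1s     # how often the master pings each node (default)
discovery.zen.fd.ping_timeout: 1000s   # wait up to 1000s for each ping response
discovery.zen.fd.ping_retries: 10      # mark a node failed only after 10 consecutive timeouts
```

With these values a node is only declared failed after 10 consecutive timeouts of up to 1000 s each, which leaves ample room for even a very long full GC.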