I recently ran into a problem while bulk-loading large volumes of data into ES: with small data volumes, indexing worked fine; with large volumes, after the load had been running for a while, some data nodes would drop out of the cluster. The log shows:
[2018-04-09T21:08:48,481][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch_25] [gc][young][4312117][1523595] duration [11.4s], collections [1]/[1.6s], total [11.4s]/[12.9m], memory [27.7gb]->[15.3gb]/[31.8gb], all_pools {[young] [17.4gb]->[13.4mb]/[1.4gb]}{[survivor] [46mb]->[191.3mb]/[191.3mb]}{[old] [10gb]->[14.6gb]/[30.1gb]}
[2018-04-09T21:08:48,481][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch_25] [gc][4312117] overhead, spent [11.4s] collecting in the last [12s]
[2018-04-09T21:08:54,654][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch_25] [gc][4312123] overhead, spent [412ms] collecting in the last [1.1s]
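When scanning logs for pauses like these, it can help to pull the collection time out of each `JvmGcMonitorService` line. A minimal sketch with `sed` (the regex is my own, matched against the log format above):

```shell
# Sample "overhead" line from the ES log above
line='[2018-04-09T21:08:48,481][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch_25] [gc][4312117] overhead, spent [11.4s] collecting in the last [12s]'

# Extract the value after "spent [" — the time the JVM spent collecting
dur=$(echo "$line" | sed -n 's/.*spent \[\([0-9.]*s\)\].*/\1/p')
echo "$dur"   # → 11.4s
```

Run over a whole log file (`sed -n '…p' elasticsearch.log`), this gives a quick list of GC pause durations to gauge how often the node stalls.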
Clearly, JVM full GC is taking far too long, which has a lot to do with the heap size being set to 32 GB:
ES memory usage and GC metrics: by default, the master checks the status of the other nodes every 30 seconds; if any node's garbage collection duration exceeds 30 seconds, the master will treat that node as having left the cluster.
Setting the heap too large leads to long GC runs, and these long stop-the-world pauses make the cluster wrongly conclude that the node has dropped out.
But if the heap size is set too small, GC runs too frequently, which hurts ES indexing and search performance.
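A common compromise (my own sketch, not from the original post) is to keep the heap a little below the ~32 GB compressed-oops cutoff, with min and max pinned to the same value so the heap never resizes, in `config/jvm.options`:

```
# config/jvm.options — fix min and max heap to the same value,
# staying under ~32 GB so the JVM can keep using compressed object pointers
-Xms30g
-Xmx30g
```

Staying under the cutoff keeps object pointers at 4 bytes, so a 30 GB heap can hold more live objects than a 32 GB one while each GC cycle has less memory to walk.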
From the official documentation ( https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-zen.html ) and this blog post ( http://www.cnblogs.com/bonelee/p/8063915.html ), the relevant fault-detection settings are:
| Setting | Description |
| --- | --- |
| `discovery.zen.fd.ping_interval` | How often a node gets pinged. Defaults to `1s`. |
| `discovery.zen.fd.ping_timeout` | How long to wait for a ping response. Defaults to `30s`. |
| `discovery.zen.fd.ping_retries` | How many ping failures / timeouts cause a node to be considered failed. Defaults to `3`. |
因此經過增長ping_timeout的時間,和增長ping_retries的次數來防止節點錯誤的脫離集羣,能夠使節點有充足的時間進行full GC。
discovery.zen.fd.ping_timeout: 1000s
discovery.zen.fd.ping_retries: 10
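For reference, the full fault-detection block in `elasticsearch.yml` might look like this (a sketch; `ping_interval` is left at its default, and the comments are my own reading of the settings):

```yaml
# elasticsearch.yml — zen fault-detection tuning
discovery.zen.fd.ping_interval: 1s     # how often the master pings each node (default)
discovery.zen.fd.ping_timeout: 1000s   # wait up to 1000s for each ping response
discovery.zen.fd.ping_retries: 10      # mark a node failed only after 10 consecutive timeouts
```

With these values a node is only declared failed after 10 consecutive timeouts of up to 1000 s each, which leaves ample room for even a very long full GC.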