While going through the daily statistics reports today, I discovered the ES cluster had met with disaster...... From yesterday afternoon until this morning it kept throwing errors and wrote 1 GB of error logs >_<# (no monitoring in place yet....)
Current state: single machine, single node (a one-node cluster), ~2 million documents, 500+ shards, about 3 GB of data.
Below is how each of the problems was handled.
A large number of unassigned shards
The cluster had actually been status: yellow ever since it was first set up (all primary shards available, but some replica shards unavailable): with only one node, the primary shards started up and ran normally and requests were handled successfully, but unassigned_shards remained, i.e. replica shards with no node to be assigned to (there was only one node.....). The data volume was small at the time, so I let it slide. Then, as time went on, a large number of unassigned shards piled up.
curl -XGET http://localhost:9200/_cluster/health\?pretty
{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 538,
  "active_shards" : 538,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 558,
  "number_of_pending_tasks" : 0
}
Fix: I found another machine on the internal network and deployed a second node there (just make sure cluster.name matches; the nodes discover each other automatically, which is a nice touch). Of course, if resources are tight and you only have one machine, starting a second ES instance with the same command works as well. Checking cluster health again, unassigned_shards had decreased and active_shards had increased.
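For reference, a minimal sketch of what bringing up the second node looks like (assuming an ES 1.x tarball unpacked on the new machine; the -d flag runs it as a daemon):

# on the second machine: cluster.name in config/elasticsearch.yml must match
# the value used by node 1 (here, the default "elasticsearch")
grep cluster.name config/elasticsearch.yml
# start the node; multicast discovery (the 1.x default) lets it join automatically
bin/elasticsearch -d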
After this, cluster health recovered from yellow to green.
status: red
Then cluster health deteriorated further......
This time the check showed status: red (some primary shards unavailable):
curl -XGET http://localhost:9200/_cluster/health\?pretty
{
  "cluster_name" : "elasticsearch",
  "status" : "red",            // missing some primary shards
  "timed_out" : false,
  "number_of_nodes" : 4,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 538,
  "active_shards" : 1076,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 20,    // where your lost primary shards are.
  "number_of_pending_tasks" : 0
}
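To narrow down which indices are affected, the _cat/indices endpoint (available since ES 1.0) lists per-index health; a quick sketch:

# the health column shows "red" for indices missing primary shards
curl -s 'http://localhost:9200/_cat/indices?v' | grep -E 'health|red'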
fix unassigned shards
Time to start repairing.
Check the status of all shards:
curl -XGET http://localhost:9200/_cat/shards
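The same endpoint also accepts a ?v parameter to print column headers, which makes the output easier to read:

curl -XGET 'http://localhost:9200/_cat/shards?v'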
Find the UNASSIGNED shards:
curl -s "http://localhost:9200/_cat/shards" | grep UNASSIGNED
pv-2015.05.22 3 p UNASSIGNED
pv-2015.05.22 3 r UNASSIGNED
pv-2015.05.22 1 p UNASSIGNED
pv-2015.05.22 1 r UNASSIGNED
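Each row shows the index name, the shard number, whether the copy is a primary (p) or a replica (r), and the shard state.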
Query for the master node's unique identifier:
curl 'localhost:9200/_nodes/process?pretty'
{
  "cluster_name" : "elasticsearch",
  "nodes" : {
    "AfUyuXmGTESHXpwi4OExxx" : {
      "name" : "Master",
      ....
      "attributes" : {
        "master" : "true"
      },
      .....
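With output this verbose, a grep can pull the IDs out; a rough sketch that relies on each node ID appearing on the line directly above its "name" field, as in the JSON above:

curl -s 'localhost:9200/_nodes/process?pretty' | grep -B 1 '"name"'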
Run reroute (repeat as needed, changing the shard value to the shard numbers from the UNASSIGNED query; in the step above those were 1 and 3). Note that allow_primary: true forces an empty primary shard to be allocated when no copy of the data survives, so whatever was on a lost primary is not recovered:
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands" : [ {
    "allocate" : {
      "index" : "pv-2015.05.22",
      "shard" : 1,
      "node" : "AfUyuXmGTESHXpwi4OExxx",
      "allow_primary" : true
    }
  } ]
}'
A script to handle them in bulk (when there are many; remember to substitute your own node name):
#!/bin/bash
# Reroute every UNASSIGNED shard to the node named "Master".
for index in $(curl -s 'http://localhost:9200/_cat/shards' | grep UNASSIGNED | awk '{print $1}' | sort | uniq); do
  for shard in $(curl -s 'http://localhost:9200/_cat/shards' | grep UNASSIGNED | grep "$index" | awk '{print $2}' | sort | uniq); do
    echo "$index" "$shard"
    curl -XPOST 'localhost:9200/_cluster/reroute' -d "{
      \"commands\" : [ {
        \"allocate\" : {
          \"index\" : \"$index\",
          \"shard\" : $shard,
          \"node\" : \"Master\",
          \"allow_primary\" : true
        }
      } ]
    }"
    sleep 5
  done
done
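Once the script finishes, cluster health can be re-checked; unassigned_shards should drop back to 0 and the status should leave red:

curl -s 'http://localhost:9200/_cluster/health?pretty' | grep -E '"status"|unassigned'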
"Too many open files"
This error was appearing all over the logs.
Run:
curl http://localhost:9200/_nodes/process\?pretty
and the output shows:
"max_file_descriptors" : 4096,
From the official documentation:
Make sure to increase the number of open files descriptors on the machine (or for the user running elasticsearch). Setting it to 32k or even 64k is recommended.
The limit can be raised at the system level so that it takes effect globally.
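A sketch of that route, via /etc/security/limits.conf (this assumes ES runs as a user named elasticsearch, which is an assumption here; the new limits only apply to sessions started after the change):

# append to /etc/security/limits.conf, then re-login before restarting ES
elasticsearch soft nofile 64000
elasticsearch hard nofile 64000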
The simplest approach, though, is to add the following near the top of the bin/elasticsearch script:
ulimit -n 64000
Then restart ES and query again to see:
"max_file_descriptors" : 64000,
Problem solved.