While going through the daily statistics reports today, I discovered the ES cluster had met with disaster...... From yesterday afternoon until this morning it kept throwing errors and wrote 1 GB of error logs >_<# (no monitoring in place yet....)
Current state: single machine, single node (a one-node cluster), ~2 million documents, 500+ shards, about 3 GB of data.
Below is how each of the problems was handled.
A large number of unassigned shards
The cluster had actually been status: yellow ever since it was first set up (all primary shards available, but some replica shards unavailable): with only one node, the primary shards started up and ran normally and requests were handled successfully, but unassigned_shards remained, i.e. replica shards with no node to be assigned to (there was only one node.....). The data volume was small at the time, so I let it slide. Then, as time went on, a large number of unassigned shards piled up.
curl -XGET http://localhost:9200/_cluster/health\?pretty
{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 538,
  "active_shards" : 538,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 558,
  "number_of_pending_tasks" : 0
}
Fix: I found another machine on the internal network and deployed a second node there (just make sure cluster.name matches; the nodes discover each other automatically, which is a nice touch). Of course, if resources are tight and you only have one machine, starting a second ES instance with the same command works as well. Checking cluster health again, unassigned_shards had decreased and active_shards had increased.
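For reference, a minimal sketch of what bringing up the second node looks like (assuming an ES 1.x tarball unpacked on the new machine; the -d flag runs it as a daemon):

# on the second machine: cluster.name in config/elasticsearch.yml must match
# the value used by node 1 (here, the default "elasticsearch")
grep cluster.name config/elasticsearch.yml
# start the node; multicast discovery (the 1.x default) lets it join automatically
bin/elasticsearch -d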
After this, cluster health recovered from yellow to green.
status: red
Then cluster health deteriorated further......
This time the check showed status: red (some primary shards unavailable):
curl -XGET http://localhost:9200/_cluster/health\?pretty
{
  "cluster_name" : "elasticsearch",
  "status" : "red",            // missing some primary shards
  "timed_out" : false,
  "number_of_nodes" : 4,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 538,
  "active_shards" : 1076,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 20,    // where your lost primary shards are.
  "number_of_pending_tasks" : 0
}
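To narrow down which indices are affected, the _cat/indices endpoint (available since ES 1.0) lists per-index health; a quick sketch:

# the health column shows "red" for indices missing primary shards
curl -s 'http://localhost:9200/_cat/indices?v' | grep -E 'health|red'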
fix unassigned shards
Time to start repairing.
Check the status of all shards:
curl -XGET http://localhost:9200/_cat/shards
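The same endpoint also accepts a ?v parameter to print column headers, which makes the output easier to read:

curl -XGET 'http://localhost:9200/_cat/shards?v'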
Find the UNASSIGNED shards:
curl -s "http://localhost:9200/_cat/shards" | grep UNASSIGNED
pv-2015.05.22 3 p UNASSIGNED
pv-2015.05.22 3 r UNASSIGNED
pv-2015.05.22 1 p UNASSIGNED
pv-2015.05.22 1 r UNASSIGNED
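Each row shows the index name, the shard number, whether the copy is a primary (p) or a replica (r), and the shard state.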
Query for the master node's unique identifier:
curl 'localhost:9200/_nodes/process?pretty'
{
  "cluster_name" : "elasticsearch",
  "nodes" : {
    "AfUyuXmGTESHXpwi4OExxx" : {
      "name" : "Master",
      ....
      "attributes" : {
        "master" : "true"
      },
      .....
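With output this verbose, a grep can pull the IDs out; a rough sketch that relies on each node ID appearing on the line directly above its "name" field, as in the JSON above:

curl -s 'localhost:9200/_nodes/process?pretty' | grep -B 1 '"name"'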
Run reroute (repeat as needed, changing the shard value to the shard numbers from the UNASSIGNED query; in the step above those were 1 and 3). Note that allow_primary: true forces an empty primary shard to be allocated when no copy of the data survives, so whatever was on a lost primary is not recovered:
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands" : [ {
    "allocate" : {
      "index" : "pv-2015.05.22",
      "shard" : 1,
      "node" : "AfUyuXmGTESHXpwi4OExxx",
      "allow_primary" : true
    }
  } ]
}'
A script to handle them in bulk (when there are many; remember to substitute your own node name):
#!/bin/bash
# Reroute every UNASSIGNED shard to the node named "Master".
for index in $(curl -s 'http://localhost:9200/_cat/shards' | grep UNASSIGNED | awk '{print $1}' | sort | uniq); do
  for shard in $(curl -s 'http://localhost:9200/_cat/shards' | grep UNASSIGNED | grep "$index" | awk '{print $2}' | sort | uniq); do
    echo "$index" "$shard"
    curl -XPOST 'localhost:9200/_cluster/reroute' -d "{
      \"commands\" : [ {
        \"allocate\" : {
          \"index\" : \"$index\",
          \"shard\" : $shard,
          \"node\" : \"Master\",
          \"allow_primary\" : true
        }
      } ]
    }"
    sleep 5
  done
done
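Once the script finishes, cluster health can be re-checked; unassigned_shards should drop back to 0 and the status should leave red:

curl -s 'http://localhost:9200/_cluster/health?pretty' | grep -E '"status"|unassigned'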
"Too many open files"
This error was appearing all over the logs.
Run:
curl http://localhost:9200/_nodes/process\?pretty
and the output shows:
"max_file_descriptors" : 4096,
From the official documentation:
Make sure to increase the number of open files descriptors on the machine (or for the user running elasticsearch). Setting it to 32k or even 64k is recommended.
The limit can be raised at the system level so that it takes effect globally.
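A sketch of that route, via /etc/security/limits.conf (this assumes ES runs as a user named elasticsearch, which is an assumption here; the new limits only apply to sessions started after the change):

# append to /etc/security/limits.conf, then re-login before restarting ES
elasticsearch soft nofile 64000
elasticsearch hard nofile 64000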
The simplest approach, though, is to add the following near the top of the bin/elasticsearch script:
ulimit -n 64000
Then restart ES and query again to see:
"max_file_descriptors" : 64000,
Problem solved.