Elasticsearch: Troubleshooting unassigned_shards on a Single Node

Date: 2021-01-13

Symptoms

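On a single-machine ELK deployment, connecting to Kibana produced the error below. Even after restarting the whole stack, Kibana still reported "Kibana server is not ready".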

{"message":"all shards failed: [search_phase_execution_exception] all shards failed","statusCode":503,"error":"Service Unavailable"}

Investigation

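The ELK services had been running normally until recently. Pinging the IPs from inside the containers worked, and every service was in the Up state. Elasticsearch was also reachable at http://localhost:9200/, yet Kibana could not connect to Elasticsearch.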

The Kibana log then showed the following entry. It contains no_shard_available_action_exception, which points to a shard problem.

{
    "type": "error",
    "@timestamp": "2020-09-15T00:41:09Z",
    "tags": [
        "warning",
        "stats-collection"
    ],
    "pid": 1,
    "level": "error",
    "error": {
        "message": "[no_shard_available_action_exception] No shard available for [get [.kibana][doc][config:6.8.11]: routing [null]]",
        "name": "Error",
        "stack": "[no_shard_available_action_exception] No shard available for [get [.kibana][doc][config:6.8.11]: routing [null]] :: {\"path\":\"/.kibana/doc/config%3A6.8.11\",\"query\":{},\"statusCode\":503,\"response\":\"{\\\"error\\\":{\\\"root_cause\\\":[{\\\"type\\\":\\\"no_shard_available_action_exception\\\",\\\"reason\\\":\\\"No shard available for [get [.kibana][doc][config:6.8.11]: routing [null]]\\\"}],routing [null]]"
    }
}

Inspecting the cluster with cerebro, an ES visualization tool

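At the time the cluster status was actually red, not the yellow seen later. Heap, disk, CPU, and load were mostly red as well, possibly because I had manually deleted several indices.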

With a yellow status Kibana can reach ES again, but yellow still means ES is not fully healthy.

Check the health of the single-node Elasticsearch

curl -XGET http://localhost:9200/_cluster/health\?pretty

{
  "cluster_name" : "elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 677,
  "active_shards" : 677,
  "relocating_shards" : 0,
  "initializing_shards" : 4,
  "unassigned_shards" : 948,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 5,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 599,
  "active_shards_percent_as_number" : 41.559238796807854
}

The unassigned_shards value above shows that a large number of shards have not been allocated. At the time, the number I actually saw was over 1000.
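As a sanity check on the health output, the reported percentage can be recomputed from the shard counts. This is a minimal sketch that assumes the percentage is active shards divided by active + initializing + unassigned shards, which is consistent with the numbers shown; `active_percent` is a helper name made up for this illustration.

```shell
# Share of active shards as a percentage, rounded to two decimals.
# Arguments: active, initializing, and unassigned shard counts.
# (Helper name is hypothetical; the formula is an assumption that matches
# the active_shards_percent_as_number value in the health output.)
active_percent() {
  awk -v a="$1" -v i="$2" -v u="$3" \
    'BEGIN { printf "%.2f\n", a * 100 / (a + i + u) }'
}

# Counts taken from the health output above: prints 41.56
active_percent 677 4 948
```

Fewer than half of the shards being active is in line with the red status reported by the node.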

List the names of the indices that have UNASSIGNED shards

curl -XGET http://localhost:9200/_cat/shards
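With hundreds of shards, the raw `_cat/shards` listing is long. One way to reduce it to index names is to filter on the state column. The sketch below assumes the column order index, shard, prirep, state (the default leading columns, also requested explicitly in the usage comment); `unassigned_indices` is a hypothetical helper name.

```shell
# Print the distinct index names that have at least one UNASSIGNED shard.
# Reads _cat/shards text on stdin; assumes columns: index shard prirep state ...
# (Helper name is hypothetical, not part of Elasticsearch.)
unassigned_indices() {
  awk '$4 == "UNASSIGNED" { print $1 }' | sort -u
}

# Usage against a live node (address assumed, as in the article):
#   curl -s 'http://localhost:9200/_cat/shards?h=index,shard,prirep,state' | unassigned_indices
```

In the prirep column, `p` marks a primary and `r` a replica, so the same listing also shows whether only replicas are unassigned.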

The cause is now roughly clear: it is most likely the unassigned shards. The next step is to fix them.

Solution

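In a multi-node cluster, you could consider using POST /_cluster/reroute to force the problem shards onto one of the nodes.
In the current single-node environment, however, the cerebro view above shows 5 unassigned shards. An index created with 5 shards and 1 replica leaves the cluster yellow right after creation. The root cause is that the cluster has replica shards that are never activated: a replica is not allocated to the same node as its primary, so on a single node the replicas have nowhere to go.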

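The fix on a single-node Elasticsearch cluster is to delete the indices that have replica shards and to create new indices with the replica count set to 0. Then check the cluster status again.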
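To make new indices default to 0 replicas, one option is an index template. The sketch below assumes the legacy `_template` endpoint with `index_patterns`, as available in ES 6.x (the Kibana log above mentions 6.8.11). The template name `zero_replicas`, the `ES_URL` variable, and both function names are hypothetical. The `*` pattern applies to every newly created index, which is only reasonable on a single node.

```shell
# Assumed node address, the same one used throughout the article.
ES_URL="${ES_URL:-http://localhost:9200}"

# Body of a legacy index template that gives new indices 0 replicas.
zero_replica_template_body() {
  cat <<'EOF'
{
  "index_patterns": ["*"],
  "order": 0,
  "settings": { "number_of_replicas": 0 }
}
EOF
}

# Register the template on the node ("zero_replicas" is a made-up name).
put_zero_replica_template() {
  zero_replica_template_body |
    curl -s -XPUT "$ES_URL/_template/zero_replicas" \
      -H 'Content-Type: application/json' --data-binary @-
}

# Usage: put_zero_replica_template
```

A template only affects indices created after it is registered. Existing indices keep their current replica count until their settings are changed.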

Run the following command to set number_of_replicas to 0 on the existing indices. Afterwards, ES turned green.
```
# Set the replica count to 0 on all existing indices
curl -XPUT 'http://localhost:9200/_settings' -H 'Content-Type: application/json' -d '
{
  "number_of_replicas": 0
}'
```
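To confirm the result from the command line rather than from a UI, the health endpoint can be polled until the status is green. This is a minimal sketch: `is_green` is a hypothetical helper that only inspects the JSON text, and the usage line relies on the `wait_for_status` and `timeout` parameters of `_cluster/health`.

```shell
# Succeeds (exit 0) when the health JSON on stdin reports status "green".
# (Helper name is hypothetical; it matches the text, it does not parse JSON.)
is_green() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"green"'
}

# Usage against a live node (address assumed, as in the article):
#   curl -s 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=30s' \
#     | is_green && echo "cluster is green"
```

If the request times out before the cluster turns green, the response still carries the current status, so the check fails instead of hanging.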