Previously we introduced and tried out the es 6.5 series on a single node; now let's build an es 6.5 cluster.
Environment: three nodes: master (172.16.23.128), node1 (172.16.23.129), node2 (172.16.23.130). First, check the status of the elasticsearch service:
[root@master ~]# ansible all_nodes -m shell -a "systemctl status elasticsearch"|grep -i running Active: active (running) since 六 2018-12-29 12:06:55 CST; 3h 33min ago Active: active (running) since 六 2018-12-29 12:07:43 CST; 3h 32min ago Active: active (running) since 六 2018-12-29 15:38:47 CST; 1min 42s ago
Check the elasticsearch configuration file on each node:
[root@master ~]# ansible all_nodes -m shell -a 'cat /etc/elasticsearch/elasticsearch.yml|egrep -v "^$|^#"'
172.16.23.128 | CHANGED | rc=0 >>
cluster.name: estest
node.name: esnode2
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 0.0.0.0
http.port: 9200
discovery.zen.ping.unicast.hosts: ["172.16.23.128", "172.16.23.131"]
172.16.23.130 | CHANGED | rc=0 >>
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
172.16.23.129 | CHANGED | rc=0 >>
cluster.name: es
node.name: node1
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 0.0.0.0
http.port: 9200
The three configs above are inconsistent, so now configure the cluster based on discovery.zen, following https://www.elastic.co/guide/en/elasticsearch/reference/6.5/modules-discovery-zen.html. The resulting configuration is as follows:
[root@master ~]# ansible all_nodes -m shell -a 'cat /etc/elasticsearch/elasticsearch.yml|egrep -v "^$|^#"'
172.16.23.128 | CHANGED | rc=0 >>
cluster.name: estest
node.name: master
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 0.0.0.0
http.port: 9200
discovery.zen.ping.unicast.hosts: ["172.16.23.128", "172.16.23.129", "172.16.23.130"]
172.16.23.130 | CHANGED | rc=0 >>
cluster.name: estest
node.name: node2
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 0.0.0.0
http.port: 9200
discovery.zen.ping.unicast.hosts: ["172.16.23.128", "172.16.23.129", "172.16.23.130"]
172.16.23.129 | CHANGED | rc=0 >>
cluster.name: estest
node.name: node1
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 0.0.0.0
http.port: 9200
discovery.zen.ping.unicast.hosts: ["172.16.23.128", "172.16.23.129", "172.16.23.130"]
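One zen-discovery setting worth calling out, although it is not in the configs above: with three master-eligible nodes, the referenced docs recommend setting discovery.zen.minimum_master_nodes to a quorum, (3 / 2) + 1 = 2, to guard against split brain. A sketch of the extra line you would add to each node's elasticsearch.yml:

# quorum of master-eligible nodes: (3 / 2) + 1 = 2, prevents split brain
discovery.zen.minimum_master_nodes: 2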
Restart the elasticsearch service:
[root@master ~]# ansible all_nodes -m shell -a "systemctl restart elasticsearch" 172.16.23.130 | CHANGED | rc=0 >> 172.16.23.128 | CHANGED | rc=0 >> 172.16.23.129 | CHANGED | rc=0 >>
Then check the cluster status:
[root@master ~]# curl -X GET "localhost:9200/_cluster/health" -s|python -m json.tool { "active_primary_shards": 0, "active_shards": 0, "active_shards_percent_as_number": 100.0, "cluster_name": "estest", "delayed_unassigned_shards": 0, "initializing_shards": 0, "number_of_data_nodes": 3, "number_of_in_flight_fetch": 0, "number_of_nodes": 3, "number_of_pending_tasks": 0, "relocating_shards": 0, "status": "green", "task_max_waiting_in_queue_millis": 0, "timed_out": false, "unassigned_shards": 0 }
Check the nodes in the cluster:
[root@master ~]# curl -X GET "localhost:9200/_cat/nodes?v" ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name 172.16.23.128 28 71 3 0.04 0.11 0.08 mdi * master 172.16.23.130 29 67 4 0.04 0.11 0.10 mdi - node2 172.16.23.129 28 58 4 0.12 0.20 0.13 mdi - node1
Look at just the master node:
[root@master ~]# curl -X GET "localhost:9200/_cat/master?v" id host ip node hVY-U_ocQueMtcryoGGbTg 172.16.23.128 172.16.23.128 master
Check cluster health:
[root@master ~]# curl -X GET "localhost:9200/_cat/health?v" epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent 1546070536 08:02:16 estest green 3 3 0 0 0 0 0 0 - 100.0%
Check node attributes:
[root@master ~]# curl -X GET "localhost:9200/_cat/nodeattrs?v" node host ip attr value master 172.16.23.128 172.16.23.128 ml.machine_memory 3956293632 master 172.16.23.128 172.16.23.128 xpack.installed true master 172.16.23.128 172.16.23.128 ml.max_open_jobs 20 master 172.16.23.128 172.16.23.128 ml.enabled true node2 172.16.23.130 172.16.23.130 ml.machine_memory 3956293632 node2 172.16.23.130 172.16.23.130 ml.max_open_jobs 20 node2 172.16.23.130 172.16.23.130 xpack.installed true node2 172.16.23.130 172.16.23.130 ml.enabled true node1 172.16.23.129 172.16.23.129 ml.machine_memory 3956293632 node1 172.16.23.129 172.16.23.129 ml.max_open_jobs 20 node1 172.16.23.129 172.16.23.129 xpack.installed true node1 172.16.23.129 172.16.23.129 ml.enabled true
Now manually create an index named test:
# curl -X PUT "localhost:9200/test"
Then check the index from each node:
[root@master ~]# ansible all_nodes -m shell -a 'curl -X GET "localhost:9200/_cat/indices?v" -s'
 [WARNING]: Consider using the get_url or uri module rather than running curl. If you need to use command because get_url or uri is insufficient you can add warn=False to this command task or set command_warnings=False in ansible.cfg to get rid of this message.
172.16.23.128 | CHANGED | rc=0 >>
health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   test  l0Js1PJLTPSFEdXhanVSHA   5   1          0            0      1.7kb          1.1kb
172.16.23.130 | CHANGED | rc=0 >>
health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   test  l0Js1PJLTPSFEdXhanVSHA   5   1          0            0      1.7kb          1.1kb
172.16.23.129 | CHANGED | rc=0 >>
health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   test  l0Js1PJLTPSFEdXhanVSHA   5   1          0            0      1.7kb          1.1kb
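As the warning suggests, the same check can be done with ansible's uri module instead of shelling out to curl; a minimal sketch:

# query each node's _cat/indices via the uri module rather than curl
ansible all_nodes -m uri -a "url=http://localhost:9200/_cat/indices?v return_content=yes"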
Check the index's shard allocation:
[root@master ~]# curl -X GET "localhost:9200/_cat/shards?v" index shard prirep state docs store ip node test 3 p STARTED 0 230b 172.16.23.128 master test 3 r STARTED 0 230b 172.16.23.130 node2 test 2 r STARTED 0 230b 172.16.23.129 node1 test 2 p STARTED 0 230b 172.16.23.130 node2 test 1 p STARTED 0 230b 172.16.23.129 node1 test 1 r UNASSIGNED test 4 p STARTED 0 230b 172.16.23.129 node1 test 4 r UNASSIGNED test 0 p STARTED 0 230b 172.16.23.128 master test 0 r STARTED 0 230b 172.16.23.130 node2
From the above we can see that two shards are in the UNASSIGNED state. Check cluster health:
[root@master ~]# curl -X GET "localhost:9200/_cat/health?v" epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent 1546071645 08:20:45 estest yellow 3 3 8 5 0 0 2 0 - 80.0%
Use the following command to locate the problem shards and the reason they are unassigned:
[root@master ~]# curl -XGET localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason -s | grep UNASSIGNED
test 1 r UNASSIGNED INDEX_CREATED
test 4 r UNASSIGNED INDEX_CREATED
Get more detail on a problem shard:
[root@master ~]# curl -XGET localhost:9200/_cluster/allocation/explain?pretty
{
  "index" : "test",
  "shard" : 1,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "INDEX_CREATED",
    "at" : "2018-12-29T08:14:47.378Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "hVY-U_ocQueMtcryoGGbTg",
      "node_name" : "master",
      "transport_address" : "172.16.23.128:9300",
      "node_attributes" : {
        "ml.machine_memory" : "3956293632",
        "xpack.installed" : "true",
        "ml.max_open_jobs" : "20",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "weight_ranking" : 1,
      "deciders" : [
        {
          "decider" : "node_version",
          "decision" : "NO",
          "explanation" : "cannot allocate replica shard to a node with version [6.5.2] since this is older than the primary version [6.5.4]"
        }
      ]
    },
    {
      "node_id" : "q95yZ4W4Tj6PaXyzLZZYDQ",
      "node_name" : "node1",
      "transport_address" : "172.16.23.129:9300",
      "node_attributes" : {
        "ml.machine_memory" : "3956293632",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "weight_ranking" : 2,
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[test][1], node[q95yZ4W4Tj6PaXyzLZZYDQ], [P], s[STARTED], a[id=j7V8PBUvQnOZzISPAxK9Uw]]"
        }
      ]
    },
    {
      "node_id" : "_ADSWG04TEqNfX_88ejtzQ",
      "node_name" : "node2",
      "transport_address" : "172.16.23.130:9300",
      "node_attributes" : {
        "ml.machine_memory" : "3956293632",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "weight_ranking" : 3,
      "deciders" : [
        {
          "decider" : "node_version",
          "decision" : "NO",
          "explanation" : "cannot allocate replica shard to a node with version [6.5.2] since this is older than the primary version [6.5.4]"
        }
      ]
    }
  ]
}
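Note that without a request body, _cluster/allocation/explain reports on the first unassigned shard it finds; to ask about a specific shard (say shard 4's replica here) you can pass one, for example:

curl -XGET "localhost:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d'
{
  "index": "test",
  "shard": 4,
  "primary": false
}'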
From the above results we can see that node1's es version differs from the version on master and node2, and a replica cannot be allocated to a node running an older version than its primary:
[root@master ~]# ansible all_nodes -m shell -a 'rpm -qa|grep elasticsearch'
 [WARNING]: Consider using the yum, dnf or zypper module rather than running rpm. If you need to use command because yum, dnf or zypper is insufficient you can add warn=False to this command task or set command_warnings=False in ansible.cfg to get rid of this message.
172.16.23.128 | CHANGED | rc=0 >>
elasticsearch-6.5.2-1.noarch
172.16.23.130 | CHANGED | rc=0 >>
elasticsearch-6.5.2-1.noarch
172.16.23.129 | CHANGED | rc=0 >>
elasticsearch-6.5.4-1.noarch
After replacing the mismatched package so that all three nodes run the same elasticsearch version, start the es service again, then watch the cluster and shard status:
[root@master ~]# curl -X GET "localhost:9200/_cat/health?v" epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent 1546073143 08:45:43 estest red 1 1 2 2 0 0 8 0 - 20.0% [root@master ~]# curl -X GET "localhost:9200/_cat/health?v" epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent 1546073274 08:47:54 estest green 3 3 10 5 0 0 0 0 - 100.0%
[root@master ~]# curl -X GET "localhost:9200/_cat/shards?v" index shard prirep state docs store ip node test 3 p STARTED 0 261b 172.16.23.128 master test 3 r STARTED 0 261b 172.16.23.130 node2 test 4 r STARTED 0 261b 172.16.23.128 master test 4 p STARTED 0 261b 172.16.23.129 node1 test 2 r STARTED 0 261b 172.16.23.129 node1 test 2 p STARTED 0 261b 172.16.23.130 node2 test 1 p STARTED 0 261b 172.16.23.129 node1 test 1 r STARTED 0 261b 172.16.23.130 node2 test 0 p STARTED 0 261b 172.16.23.128 master test 0 r STARTED 0 261b 172.16.23.130 node2
The test index consists of 10 shards: 5 primary shards and 5 replica shards. A replica shard is a copy of its primary shard; it provides fault tolerance and serves read requests. The number of primary shards is fixed when the index is created, while the number of replica shards can be changed at any time. The defaults are 5 primary shards and 1 replica (an example of setting these explicitly at creation time follows the outputs below):
[root@master ~]# curl -XGET localhost:9200/test?pretty
{
  "test" : {
    "aliases" : { },
    "mappings" : { },
    "settings" : {
      "index" : {
        "creation_date" : "1546071287243",
        "number_of_shards" : "5",
        "number_of_replicas" : "1",
        "uuid" : "l0Js1PJLTPSFEdXhanVSHA",
        "version" : {
          "created" : "6050299"
        },
        "provided_name" : "test"
      }
    }
  }
}
[root@master ~]# curl -XGET localhost:9200/_cat/indices?v
health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   test  l0Js1PJLTPSFEdXhanVSHA   5   1          0            0      2.5kb          1.2kb
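Since number_of_shards is fixed at creation time, you set it (and the initial replica count) in the create-index request. A minimal sketch, using a hypothetical index name test2:

# create an index with 3 primaries and 2 replicas instead of the 5/1 defaults
curl -X PUT "localhost:9200/test2" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  }
}'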
A primary shard cannot be placed on the same node as its own replica shard (otherwise, if that node went down, both the primary and its replica would be lost and the replica would provide no fault tolerance), but it can share a node with replicas of other primary shards.
For more on allocating nodes and shard counts, see: https://blog.csdn.net/qq_38486203/article/details/80077844
Finally, let's go over some basic concepts in es:
1.cluster:
Cluster: an ES cluster consists of one or more nodes (Node), and each cluster is identified by a cluster name.
2.node:
Node: a single ES instance is a node. One machine can run multiple instances, and a cluster is made up of multiple nodes; in most cases each node runs in its own environment or virtual machine.
3.index:
Index: a collection of documents.
4.shard:
Shard: ES is a distributed search engine, and each index has one or more shards. The index's data is spread across its shards, like one bucket of water poured into N cups.
Shards enable horizontal scaling. N shards are spread as evenly as possible across the nodes (rebalancing). For example, with 2 nodes and 4 primary shards (ignoring replicas), each node holds 2 shards; if you later add 2 more nodes, each of the 4 nodes ends up holding 1 shard. This process is called relocation, and ES performs it automatically once it detects the change.
Shards are independent: a search request is executed on every shard. Each shard is also a Lucene index, so a single shard can hold at most Integer.MAX_VALUE - 128 = 2,147,483,519 docs.
5.replica:
Replica: can be understood as a backup shard, as opposed to the primary shard.
A primary shard and its replica are never placed on the same node (to avoid a single point of failure). By default an index is created with 5 primary shards and 1 replica each (i.e. 5 primary + 5 replica = 10 shards).
If you have only one node, none of the 5 replicas can be allocated (unassigned) and the cluster status turns Yellow (on a single-node box you can drop the replica count to 0, as sketched below).
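A minimal sketch of that single-node workaround, reusing the test index from above (the replica count, unlike the primary count, can be updated live):

# hypothetical single-node dev box: drop replicas so the index can go green
curl -X PUT "localhost:9200/test/_settings" -H 'Content-Type: application/json' -d'
{
  "index": { "number_of_replicas": 0 }
}'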
The three ES cluster states:
Green: all primary and replica shards are allocated and ready. Even if one machine fails (assuming one instance per machine), no data is lost, though the cluster then turns yellow.
Yellow: all primary shards are ready, but at least one primary shard (call it A) has a replica that is not ready. The cluster is in a warning state: high availability and disaster tolerance are degraded. If the machine hosting A then fails and you configured only one replica (which is still unassigned), A's data is lost (queries become incomplete) and the cluster turns Red.
Red: at least one primary shard is not ready (the direct cause being that no replica shard could be found to promote to a new primary); query results will be missing data (incomplete).
Disaster tolerance: when a primary shard is lost, one of its replica shards is promoted to become the new primary, and a new replica is then created from it; the cluster's data stays intact.
Better query performance: a replica holds the same data as its primary, so a query can be answered by either; within reason, more replicas improve read performance (at the cost of more resource usage: cpu/disk/heap). Index requests, however, are executed only on primary shards; a replica cannot serve an index request.
For a given index, the number of primary shards (number_of_shards) cannot be changed without rebuilding the index, but the number of replicas (number_of_replicas) can be adjusted at any time.
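So to change the primary shard count you create a new index with the desired number_of_shards and copy the data over; a minimal sketch with a hypothetical target index test_v2:

# 1. create the target index with the new primary shard count
curl -X PUT "localhost:9200/test_v2" -H 'Content-Type: application/json' -d'
{
  "settings": { "number_of_shards": 10, "number_of_replicas": 1 }
}'
# 2. copy the documents from the old index into it
curl -X POST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d'
{
  "source": { "index": "test" },
  "dest":   { "index": "test_v2" }
}'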