Learning Elasticsearch 2: Getting Started

Date: 2020-06-11

1. Basic Concepts

Index
- Document
- Type
Node
- Shard

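1.1 Document

1.1.1 Documents

Elasticsearch is document-oriented: a document is the smallest unit of searchable data.
Documents are serialized as JSON and stored in Elasticsearch.
- A JSON object is made up of fields.
- Every field has a field type (string, numeric, boolean, date, binary, range).
Every document has a unique ID.
- You can assign the ID yourself.
- Elasticsearch can also generate it automatically.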

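1.1.2 JSON Documents

A document is comparable to a row in a database table.
The JSON document format is flexible; no schema has to be defined in advance.
- A field's type can be specified explicitly or inferred automatically by Elasticsearch.
- Arrays and nested objects are supported.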
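
To make the idea of automatic type inference concrete, here is a minimal Python sketch. It is a simplification written for illustration, not Elasticsearch's implementation: the function name infer_type is hypothetical, and date detection and numeric-string detection are deliberately left out.

```python
def infer_type(value):
    """Guess a field type from a JSON value (simplified sketch)."""
    if isinstance(value, bool):   # check bool first: in Python, bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "text"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # An array takes the type of its first non-null element.
        for item in value:
            if item is not None:
                return infer_type(item)
    return None  # null, or an empty/all-null array: nothing to infer from


doc = {"year": 1923, "title": "Our Hospitality", "genre": ["Comedy"]}
mapping = {field: infer_type(value) for field, value in doc.items()}
print(mapping)  # {'year': 'long', 'title': 'text', 'genre': 'text'}
```

Note that an array does not get a type of its own in this sketch; it simply reuses the type of its elements.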

1.1.3 Document Metadata

{
  "_index" : "movies",
  "_type" : "_doc",
  "_id" : "8609",
  "_score" : 1.0,
  "_source" : {
    "year" : 1923,
    "title" : "Our Hospitality",
    "@version" : "1",
    "id" : "8609",
    "genre" : [
      "Comedy"
    ]
  }
}

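_index: name of the index the document belongs to
_type: name of the type the document belongs to
_id: unique ID of the document
_score: relevance score of the document
_source: the original JSON of the document
_version: version of the document (not included in the search hit above)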

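1.2 Index

1.2.1 Indices

An index is a container for documents: a collection of documents of one kind.
- Index: a logical namespace. Every index has its own Mapping, which defines the field names and field types of the documents it contains.
- Shard: the physical storage. The data in an index is distributed across shards.
Mapping and Settings of an index
- Mapping: defines the types of the document fields.
- Settings: define how the data is distributed.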

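1.2.2 The Two Meanings of "Index"

Noun: an Elasticsearch cluster can hold many different indices.
Verb: saving a document into Elasticsearch is also called indexing.
- During indexing, Elasticsearch builds the inverted index.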

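1.3 Type

Before 7.0, an index could be given multiple types (indices created in 6.x were already limited to one).
From 7.0, an index can have only one type: _doc.
- Types have been deprecated since 6.0.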

1.4 REST API

GET /_cat/indices?v : list the indices
GET /_cat/indices?v&health=green : list the indices whose health is green
GET /_cat/indices?v&s=docs.count:desc : sort the indices by document count, descending

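1.5 Cluster

1.5.1 Distributed Characteristics

High availability
- Service availability: the cluster keeps serving requests when some nodes stop.
- Data availability: no data is lost when some nodes are lost.
Scalability
- Handles growing request volume and data size by distributing the data across all nodes.

1.5.2 Clusters

Clusters are distinguished by name; the default name is elasticsearch.
The name can be changed in the configuration file or set on the command line with -E cluster.name=demo.
A cluster can have one or more nodes.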

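1.6 Node

1.6.1 Nodes

A node is an Elasticsearch instance.
- In essence it is a Java process.
- One machine can run several instances, but in production one Elasticsearch instance per machine is recommended.
Every node has a name, set in the configuration file or at startup with -E node.name=node1.
After it starts, every node is assigned a UID, which is stored in the data directory.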

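1.6.2 Master-eligible Node and Master Node

After startup, every node is a master-eligible node by default.
- Set node.master: false to disable this.
A master-eligible node can take part in the master election and become the master node.
When the first node starts, it elects itself as the master node.
Every node keeps a copy of the cluster state, but only the master node may modify it.
- If any node could modify it, the data would become inconsistent.
- The cluster state holds the information a cluster needs:
  - all node information
  - all indices with their Mapping and Settings
  - shard routing information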

1.6.3 Data Node & Coordinating Node

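Data Node
- A node that can store data is called a data node.
- It is responsible for storing shard data.
- It plays an important role in scaling data.
Coordinating Node
- Accepts client requests, dispatches them to the appropriate nodes, and gathers the results.
- By default, every node acts as a coordinating node.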

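1.6.4 Configuring Node Types

In a development environment, one node can take on several roles.

In production, nodes should be configured with a single role.

Node type	Configuration parameter	Default
master eligible	node.master	true
data	node.data	true
ingest	node.ingest	true
coordinating only	/	every node is a coordinating node by default
machine learning	node.ml	true (requires X-Pack to be enabled)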

1.7 Shard

1.7.1 Primary Shard & Replica Shard

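Primary shard:
- Solves horizontal scaling of data: primary shards let the data be distributed across all nodes in the cluster.
- A primary shard is a running Lucene instance, that is, a Lucene index in its own right.
- The number of primary shards is set when the index is created and cannot be changed afterwards except by reindexing.
Replica shard:
- Solves high availability of data.
- A replica shard is a copy of a primary shard.
- The number of replica shards can be adjusted dynamically.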
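
One way to see why the number of primary shards is fixed is to look at how a document could be routed to a shard by hashing its ID. The following is a simplified sketch under stated assumptions: the function route is hypothetical, and zlib.crc32 is only a stand-in hash chosen because it ships with Python (Elasticsearch uses a different hash function and a more elaborate formula).

```python
import zlib


def route(doc_id: str, num_primary_shards: int) -> int:
    """Pick a shard for a document ID: hash of the ID modulo the shard count."""
    # Stand-in hash for illustration only.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


ids = [str(i) for i in range(1000)]
with_5 = {doc_id: route(doc_id, 5) for doc_id in ids}
with_6 = {doc_id: route(doc_id, 6) for doc_id in ids}

# Count documents whose target shard changes when the shard count goes from 5 to 6.
moved = sum(1 for doc_id in ids if with_5[doc_id] != with_6[doc_id])
print(f"{moved} of {len(ids)} documents would map to a different shard")
```

Because the shard count is part of the routing calculation, changing it would send lookups for existing documents to the wrong shard, which is why a reindex is needed. Replicas do not take part in this calculation, so their number can change freely.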

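1.7.2 Shard Settings

Shard settings in production require capacity planning in advance.

Too few shards
- Nodes cannot be added to scale horizontally.
- Each shard holds too much data, so reallocating data takes a long time.
Too many shards
- Since 7.0 the default number of primary shards is 1, which addresses over-sharding.
- Relevance scoring of search results and the accuracy of statistical results are affected.
- Too many shards on a single node waste resources and hurt performance.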

1.7.3 Checking Cluster Health

GET /_cluster/health

{
  "cluster_name" : "learn_es",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 7,
  "active_shards" : 14,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

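green : all primary and replica shards are allocated.
yellow : all primary shards are allocated, but at least one replica shard is not.
red : at least one primary shard is not allocated.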
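
The three status rules above can be expressed as a small function. This is an illustrative sketch only: the function name cluster_status and the shard dictionaries with the keys primary and assigned are assumptions made for the example, not an Elasticsearch data structure.

```python
def cluster_status(shards):
    """Derive green/yellow/red from a list of shard descriptions.

    Each shard is a dict with two boolean keys (hypothetical format):
    'primary' (is it a primary shard?) and 'assigned' (is it allocated?).
    """
    if any(s["primary"] and not s["assigned"] for s in shards):
        return "red"      # at least one primary shard is unallocated
    if any(not s["assigned"] for s in shards):
        return "yellow"   # all primaries are fine, some replica is unallocated
    return "green"        # everything is allocated


shards = [
    {"primary": True, "assigned": True},
    {"primary": False, "assigned": False},  # a replica without a home
]
print(cluster_status(shards))  # yellow
```

A single-node cluster with one replica configured ends up in exactly this yellow state, since a replica is never placed on the same node as its primary.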

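2. Document CRUD

2.1 Basic APIs

2.1.1 Overview

The four basic operations:

Index: add a document
- POST <index>/_doc: the ID of the new document is generated by the system.
- PUT <index>/_doc/<_id>: adds the document if no document has this ID; otherwise replaces it and increments the version (the _version field).
- POST <index>/_create/<_id>: fails if a document with this ID already exists.
- PUT <index>/_create/<_id>: fails if a document with this ID already exists.
Get: read a document
- GET <index>/_doc/<_id>: returns the document together with its metadata.
- GET <index>/_source/<_id>: returns only the _source field of the document.
- HEAD <index>/_doc/<_id>: checks whether the document exists; returns 200 if it does, 404 if not.
- HEAD <index>/_source/<_id>: checks whether the document's _source exists; returns 200 if it does, 404 if not.
Update: update a document
- POST <index>/_update/<_id>: partially updates the document; the request body uses the doc field.
Delete: delete a document
- DELETE /<index>/_doc/<_id>: deletes the document with this ID; if it does not exist, nothing is deleted.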
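
The difference between indexing with _doc and with _create is easiest to see in a toy model. The class below is a purely in-memory sketch written for illustration; TinyIndex and VersionConflict are hypothetical names, and the version handling is simplified (for example, a re-created document starts again at version 1 here).

```python
class VersionConflict(Exception):
    """Raised when create() targets an ID that already exists (HTTP 409 in Elasticsearch)."""


class TinyIndex:
    """In-memory stand-in for one index, keyed by document ID."""

    def __init__(self):
        self._docs = {}  # _id -> (version, source)

    def index(self, doc_id, source):
        """Like PUT <index>/_doc/<_id>: create, or replace and bump the version."""
        if doc_id in self._docs:
            version, result = self._docs[doc_id][0] + 1, "updated"
        else:
            version, result = 1, "created"
        self._docs[doc_id] = (version, dict(source))
        return {"_id": doc_id, "_version": version, "result": result}

    def create(self, doc_id, source):
        """Like PUT <index>/_create/<_id>: refuse to overwrite an existing document."""
        if doc_id in self._docs:
            raise VersionConflict(f"[{doc_id}]: document already exists")
        return self.index(doc_id, source)

    def get(self, doc_id):
        """Like GET <index>/_doc/<_id>."""
        if doc_id not in self._docs:
            return {"_id": doc_id, "found": False}
        version, source = self._docs[doc_id]
        return {"_id": doc_id, "_version": version, "found": True, "_source": dict(source)}

    def delete(self, doc_id):
        """Like DELETE <index>/_doc/<_id>."""
        if doc_id not in self._docs:
            return {"_id": doc_id, "result": "not_found"}
        version, _ = self._docs.pop(doc_id)
        return {"_id": doc_id, "_version": version + 1, "result": "deleted"}


users = TinyIndex()
print(users.index("1", {"user": "John"})["result"])    # created
print(users.index("1", {"user": "Johnny"})["result"])  # updated
```

Calling users.create("1", ...) at this point would raise VersionConflict, mirroring the 409 response of the real API.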

2.1.2 Index API

Document IDs can be generated automatically or specified explicitly.

Automatically generated document ID.

Syntax: POST <index>/_doc

demo:

POST users/_doc
{
  "user" : "Mike",
  "phone" : "15512345678"
}

-----------

{
  "_index" : "users",
  "_type" : "_doc",
  "_id" : "RfXT_28B5V-KMglJX8bm",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 3,
  "_primary_term" : 1
}

Explicit document ID

Syntax: PUT <index>/_doc/<_id> or POST | PUT <index>/_create/<_id>

demo 1: PUT <index>/_doc/<_id>

PUT users/_doc/1
{
  "user" : "John",
  "phone" : "15812345678"
}

--------
# No document with this ID exists: the document is created
{
  "_index" : "users",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 4,
  "_primary_term" : 1
}

# A document with this ID exists: it is replaced (the existing one is deleted, a new one is created, version +1)
{
  "_index" : "users",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 23,
  "result" : "updated",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 26,
  "_primary_term" : 1
}

demo 2: POST | PUT <index>/_create/<_id>

POST users/_create/2
{
  "user" : "Dave",
  "phone" : "15912345678"
}

---------
# No document with this ID exists: the document is created
{
  "_index" : "users",
  "_type" : "_doc",
  "_id" : "2",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 27,
  "_primary_term" : 1
}

# A document with this ID exists: version conflict, the request fails
{
  "error": {
    "root_cause": [
      {
        "type": "version_conflict_engine_exception",
        "reason": "[2]: version conflict, document already exists (current version [1])",
        "index_uuid": "mjgjxIROT72xLMHnYNiUxw",
        "shard": "0",
        "index": "users"
      }
    ],
    "type": "version_conflict_engine_exception",
    "reason": "[2]: version conflict, document already exists (current version [1])",
    "index_uuid": "mjgjxIROT72xLMHnYNiUxw",
    "shard": "0",
    "index": "users"
  },
  "status": 409
}

2.1.3 Get API

Look up a document by ID

Syntax: GET <index>/_doc/<_id>

demo:

GET users/_doc/2

--------

# The document exists: its metadata and source are returned
{
  "_index" : "users",
  "_type" : "_doc",
  "_id" : "2",
  "_version" : 1,
  "_seq_no" : 27,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "user" : "Dave",
    "phone" : "15912345678"
  }
}

# The document does not exist: found is false
{
  "_index" : "users",
  "_type" : "_doc",
  "_id" : "2",
  "found" : false
}

2.1.4 Update API

Update the document with the given ID:

Syntax: POST <index>/_update/<_id>

demo: partial update

POST users/_update/1
{
  "doc": {
    "age":28
  }
}

--------
# If the document exists and a field value changes, the document is updated; if it exists and nothing changes, result is noop
{
  "_index" : "users",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 27,
  "result" : "updated",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 34,
  "_primary_term" : 1
}

demo2: update the document with a script

# index the doc
PUT users/_doc/2
{
  "name" : "John",
  "counter" : 1
}

{
  "_index" : "users",
  "_type" : "_doc",
  "_id" : "2",
  "_version" : 6,
  "result" : "updated",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 53,
  "_primary_term" : 2
}

--------

# update the doc
POST users/_update/2
{
  "script": {
    "source": "ctx._source.counter += params.count",
    "params": {
      "count":2
    }
  }
}

{
  "_index" : "users",
  "_type" : "_doc",
  "_id" : "2",
  "_version" : 7,
  "result" : "updated",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 54,
  "_primary_term" : 2
}
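
The partial update with doc, including the noop result when nothing changes, can be pictured as a recursive merge. The following Python sketch is an approximation for illustration; merge and partial_update are hypothetical helper names, not part of any Elasticsearch client.

```python
import copy


def merge(target, patch):
    """Recursively merge patch into target in place; return True if anything changed."""
    changed = False
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            # Both sides are objects: merge field by field instead of replacing.
            changed = merge(target[key], value) or changed
        elif key not in target or target[key] != value:
            target[key] = copy.deepcopy(value)
            changed = True
    return changed


def partial_update(source, doc):
    """Return (new_source, result) where result is 'updated' or 'noop'."""
    new_source = copy.deepcopy(source)  # leave the stored document untouched
    changed = merge(new_source, doc)
    return new_source, ("updated" if changed else "noop")


stored = {"user": "John", "phone": "15812345678"}
stored, result = partial_update(stored, {"age": 28})
print(result, stored)  # updated {'user': 'John', 'phone': '15812345678', 'age': 28}
stored, result = partial_update(stored, {"age": 28})
print(result)          # noop
```

Fields that are not mentioned in doc are kept, which is the key difference from PUT <index>/_doc/<_id>, where the whole document is replaced.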

2.1.5 Delete API

Delete a document by ID

Syntax: DELETE <index>/_doc/<_id>

demo:

DELETE users/_doc/2

--------

# The document exists: it is deleted
{
  "_index" : "users",
  "_type" : "_doc",
  "_id" : "2",
  "_version" : 2,
  "result" : "deleted",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 31,
  "_primary_term" : 1
}

# The document does not exist: nothing is deleted
{
  "_index" : "users",
  "_type" : "_doc",
  "_id" : "2",
  "_version" : 3,
  "result" : "not_found",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 32,
  "_primary_term" : 1
}

2.2 Other APIs

2.2.1 Bulk API

A single API call can operate on different indices.
A failure of one operation does not affect the others.
The response contains the result of every operation.
Four operation types are supported.
- Index
- Create
- Update
- Delete

Syntax: POST _bulk or POST <index>/_bulk

Newline-delimited JSON (NDJSON) structure:

action_and_meta_data\n
optional_source\n
action_and_meta_data\n
optional_source\n
....
action_and_meta_data\n
optional_source\n

demo:

POST _bulk
# index, create: the next line must contain the source
{ "index" : { "_index" : "test", "_id" : "1" } }
{ "field1" : "value1" }
{ "create" : { "_index" : "test", "_id" : "2" } }
{ "field2" : "value2" }
# update: the next line must contain doc or script
{ "update" : {"_id" : "1", "_index" : "test"} }
{ "doc" : {"field3" : "value3"} }
# delete: same as the standard delete API, no source line follows
{ "delete" : { "_index" : "test", "_id" : "2" } }
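
When a bulk request is assembled in code rather than typed into the console, the NDJSON body has to be built line by line. Below is a minimal sketch using only Python's standard library; the helper build_bulk_body and its tuple format are assumptions made for this example.

```python
import json


def build_bulk_body(operations):
    """Build an NDJSON bulk body.

    operations: list of (action, metadata, payload) tuples, where payload is
    the source (index/create), a {"doc": ...} or {"script": ...} dict (update),
    or None (delete).
    """
    lines = []
    for action, meta, payload in operations:
        lines.append(json.dumps({action: meta}))   # action_and_meta_data line
        if action == "delete":
            continue                               # delete has no second line
        if payload is None:
            raise ValueError(f"{action} needs a payload line")
        lines.append(json.dumps(payload))          # optional_source line
    # Every line, including the last one, must end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body([
    ("index",  {"_index": "test", "_id": "1"}, {"field1": "value1"}),
    ("create", {"_index": "test", "_id": "2"}, {"field2": "value2"}),
    ("update", {"_index": "test", "_id": "1"}, {"doc": {"field3": "value3"}}),
    ("delete", {"_index": "test", "_id": "2"}, None),
])
print(body, end="")
```

Each JSON object is kept on a single line because json.dumps does not emit newlines by default; pretty-printed JSON would break the NDJSON format.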

2.2.2 mget API

Read multiple documents in one request.
Syntax: GET _mget or GET <index>/_mget

demo:

GET /_mget
{
    "docs" : [
        {
            "_index" : "users",
            "_id" : "1"
        },
        {
            "_index" : "twitter",
            "_id" : "2"
        }
    ]
}

--------

{
  "docs" : [
    {
      "_index" : "users",
      "_type" : "_doc",
      "_id" : "1",
      "_version" : 31,
      "_seq_no" : 38,
      "_primary_term" : 2,
      "found" : true,
      "_source" : {
        "user" : "abc",
        "class" : 8,
        "age" : 28,
        "gender" : "male",
        "field1" : "value1"
      }
    },
    {
      "_index" : "twitter",
      "_type" : null,
      "_id" : "2",
      "error" : {
        "root_cause" : [
          {
            "type" : "index_not_found_exception",
            "reason" : "no such index [twitter]",
            "resource.type" : "index_expression",
            "resource.id" : "twitter",
            "index_uuid" : "_na_",
            "index" : "twitter"
          }
        ],
        "type" : "index_not_found_exception",
        "reason" : "no such index [twitter]",
        "resource.type" : "index_expression",
        "resource.id" : "twitter",
        "index_uuid" : "_na_",
        "index" : "twitter"
      }
    }
  ]
}

demo2:

GET users/_mget
{
  "docs": [
    {
      "_id" : "2"
    },
    {
      "_id" : "3"
    }
  ]
}

GET users/_mget
{
  "ids" : ["2", "3"]
}

--------

{
  "docs" : [
    {
      "_index" : "users",
      "_type" : "_doc",
      "_id" : "2",
      "_version" : 7,
      "_seq_no" : 54,
      "_primary_term" : 2,
      "found" : true,
      "_source" : {
        "name" : "John",
        "counter" : 3
      }
    },
    {
      "_index" : "users",
      "_type" : "_doc",
      "_id" : "3",
      "found" : false
    }
  ]
}

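3. Inverted Index

3.1 Structure

An inverted index has two parts:

Term dictionary:
- Records every term that occurs in the documents and maps each term to its posting list.
- It is usually large and is implemented with a B+ tree or a hash table with separate chaining to keep inserts and lookups fast.
Posting list:
- Records the set of documents that contain the term; it is made up of postings.
- A posting contains:
  - Document ID
  - Term frequency (TF): how many times the term occurs in the document; used for relevance scoring.
  - Position: the position of the term among the document's tokens; used for phrase queries.
  - Offset: the start and end character offsets of the term; used for highlighting.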

3.2 Example

Searching for Elasticsearch in the following documents.

Documents

Document ID	Content
1	Mastering Elasticsearch
2	Elasticsearch Server
3	Elasticsearch Essentials

Posting list

Document ID	TF	Position	Offset
1	1	1	<10,23>
2	1	0	<0,13>
3	1	0	<0,13>
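
The posting list above can be reproduced with a few lines of Python. This is a teaching sketch, not how Lucene stores its index: the function build_inverted_index is hypothetical, a plain dict plays the role of the term dictionary, and tokenization is reduced to lowercased runs of word characters.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to its postings: doc ID, TF, positions and character offsets."""
    index = defaultdict(list)  # term dictionary: term -> posting list
    for doc_id, text in docs.items():
        occurrences = defaultdict(list)
        for position, match in enumerate(re.finditer(r"\w+", text)):
            term = match.group().lower()
            occurrences[term].append((position, match.start(), match.end()))
        for term, hits in occurrences.items():
            index[term].append({
                "doc_id": doc_id,
                "tf": len(hits),
                "positions": [pos for pos, _, _ in hits],
                "offsets": [(start, end) for _, start, end in hits],
            })
    return index


docs = {
    1: "Mastering Elasticsearch",
    2: "Elasticsearch Server",
    3: "Elasticsearch Essentials",
}
index = build_inverted_index(docs)
for posting in index["elasticsearch"]:
    print(posting)
# {'doc_id': 1, 'tf': 1, 'positions': [1], 'offsets': [(10, 23)]}
# {'doc_id': 2, 'tf': 1, 'positions': [0], 'offsets': [(0, 13)]}
# {'doc_id': 3, 'tf': 1, 'positions': [0], 'offsets': [(0, 13)]}
```

A search for a term is then a single dictionary lookup followed by a walk over the posting list, with no need to scan the documents themselves.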

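3.3 Inverted Indices in Elasticsearch

Every field of a JSON document in Elasticsearch has its own inverted index.
Indexing can be turned off for individual fields.
- Advantage: saves storage space.
- Disadvantage: the field cannot be searched.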

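4. Tokenizing Text with Analyzers

4.1 Analysis and Analyzer

Analysis:
- Text analysis converts full text into a sequence of terms (tokens); it is also called tokenization.
- It is carried out by an analyzer, either a built-in one or a custom one.
- Terms are produced when data is written, and the query string must be analyzed with the same analyzer at search time.
Analyzer:
- An analyzer consists of three parts: optional character filters, exactly one tokenizer, and zero or more token filters.
  - Character Filters: transform the characters of the raw text.
  - Tokenizer: splits the text into one or more tokens.
  - Token Filters: transform each token (lowercasing, stop-word removal, synonyms).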
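
The three stages can be sketched as a small pipeline of plain functions. Everything here is illustrative: the function names are made up for this example, and each stage is a much simpler stand-in for its Elasticsearch counterpart.

```python
import re


def strip_html(text):
    """Character filter: remove HTML tags from the raw text."""
    return re.sub(r"<[^>]+>", "", text)


def whitespace_tokenizer(text):
    """Tokenizer: split the text on whitespace."""
    return text.split()


def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


def remove_stop_words(tokens, stop_words=frozenset({"the", "a", "is"})):
    """Token filter: drop stop words."""
    return [token for token in tokens if token not in stop_words]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run character filters, then one tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<b>The</b> Quick fox is a HERO",
    char_filters=[strip_html],
    tokenizer=whitespace_tokenizer,
    token_filters=[lowercase, remove_stop_words],
)
print(tokens)  # ['quick', 'fox', 'hero']
```

The order of the token filters matters: lowercasing runs before stop-word removal, otherwise "The" would not match the stop word "the".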

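4.2 Built-in Analyzers

Standard Analyzer: the default analyzer; splits on word boundaries and lowercases.
Simple Analyzer: splits on non-letter characters (symbols are dropped) and lowercases.
Stop Analyzer: lowercases and removes stop words (the, a, is).
Whitespace Analyzer: splits on whitespace; does not lowercase.
Keyword Analyzer: no tokenization; the input is emitted as a single token.
Pattern Analyzer: splits by regular expression; the default pattern is \W+ (non-word characters).
Language Analyzers: analyzers for more than 30 common languages.
Custom Analyzer: a user-defined analyzer.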
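
To get a feel for how differently these analyzers treat the same input, three of them can be approximated in Python. These functions only imitate the behavior described above with regular expressions; they are rough stand-ins, and real analyzers may differ on edge cases.

```python
import re


def whitespace_analyzer(text):
    """Split on whitespace, keep case and punctuation."""
    return text.split()


def simple_analyzer(text):
    """Split on anything that is not a letter, then lowercase."""
    return re.findall(r"[^\W\d_]+", text.lower())


def keyword_analyzer(text):
    """No tokenization: the whole input is one token."""
    return [text]


text = "The 2 QUICK Brown-Foxes"
print(whitespace_analyzer(text))  # ['The', '2', 'QUICK', 'Brown-Foxes']
print(simple_analyzer(text))      # ['the', 'quick', 'brown', 'foxes']
print(keyword_analyzer(text))     # ['The 2 QUICK Brown-Foxes']
```

The digit 2 disappears under the simple approximation because it is not a letter, while the whitespace version keeps it as a token of its own.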

4.2.1 _analyze API

GET /_analyze
POST /_analyze
GET /<index>/_analyze
POST /<index>/_analyze

demo:

POST _analyze
  {
    "analyzer": "standard",
    "text": ["share your experience with NoSql & big data technologies"]
  }
