Elasticsearch Mapping Configuration

Date: 2020-12-06


This article is also collected in my GitHub repository. Stars and issues are welcome.

https://github.com/midou-tech/articles

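A side note
The original plan for this post was to write data into ES with Filebeat, and then cover query syntax and some query operations in the next post.

Just as I was about to write the data, I realised something was wrong. I knew nothing about the mapping configuration. I had simply pushed data in, with no idea how it was structured and stored, or how to query it.

Normally, when a business system is hooked up to ES, you tell the ES team which fields you have and which of them need to be searchable, and they configure a mapping to fit your use case.

To learn ES, follow Long Shu, who will walk you into the world of ES.

Main text
For ES developers and maintainers, configuration is basic knowledge that must be mastered. For business teams and other ES users, understanding the configuration helps you use ES better in your own scenario.

This article covers the ES configuration file and mapping configuration.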

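Elasticsearch directory structure

This is the directory structure of Elasticsearch 7.7.0:

bin: script files, including those for starting ES and installing plugins
config: elasticsearch.yml (ES configuration), jvm.options (JVM configuration), the logging configuration, and so on
data: created when ES starts and used to store document data; its location is configurable
jdk.app: the bundled JDK; 7.7.0 ships with OpenJDK 14
lib: libraries
logs: log files
modules: all ES modules, including X-Pack
plugins: installed plugins; none by default

The config directory
Anyone familiar with software projects knows that directory layouts are much the same from one project to the next.

The bin directory holds the necessary binaries or startup scripts, config holds the configuration files the project needs, log holds the log files, lib holds the libraries the project depends on, and script holds scripts.

elasticsearch.yml is the key configuration file for Elasticsearch. Some of its settings are shown below.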

# ---------------------------------- Cluster -----------------------------------
# Use a descriptive name for your cluster:
cluster.name: my-application
# ------------------------------------ Node ------------------------------------
# Use a descriptive name for the node:
node.name: node-1
# Add custom attributes to the node:
node.attr.rack: r1
# ----------------------------------- Paths ------------------------------------
# Path to directory where to store the data (separate multiple locations by comma):
path.data: /path/to/data
# Path to log files:
path.logs: /path/to/logs
# ----------------------------------- Memory -----------------------------------
# Lock the memory on startup:
bootstrap.memory_lock: true
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this limit.
# Elasticsearch performs poorly when the system is swapping the memory.
# ---------------------------------- Network -----------------------------------
# Set the bind address to a specific IP (IPv4 or IPv6):
network.host: 192.168.0.1
# Set a custom port for HTTP:
http.port: 9200
# For more information, consult the network module documentation.
# --------------------------------- Discovery ----------------------------------
# Pass an initial list of hosts to perform discovery when this node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
discovery.seed_hosts: ["host1", "host2"]
# Bootstrap the cluster using an initial set of master-eligible nodes:
cluster.initial_master_nodes: ["node-1", "node-2"]
# For more information, consult the discovery and cluster formation module documentation.
# ---------------------------------- Gateway -----------------------------------
# Block initial recovery after a full cluster restart until N nodes are started:
gateway.recover_after_nodes: 3
# For more information, consult the gateway module documentation.
# ---------------------------------- Various -----------------------------------
# Require explicit names when deleting indices:
action.destructive_requires_name: true

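I had planned to annotate these settings, but the English comments in the file are easy enough to follow, so I will not repeat them here.

ES Mapping
A mapping is similar to a table definition in a database: to create a table you first define its structure.

In ES, every index has a mapping.

Its main roles are:

Define the field names under an index
Define the field types, such as numeric, string, and boolean
Define inverted-index settings, such as whether a field is indexed and whether positions are recorded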
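
As a minimal sketch of these three roles, the request body below declares field names, field types, and one inverted-index setting. It is written as a Python dictionary so that it can be serialised with the standard library alone. The index name `user` and the field `internal_note` are assumptions made for illustration, not something taken from a real cluster.

```python
import json

# Hypothetical explicit mapping for an index named "user".
explicit_mapping = {
    "mappings": {
        "properties": {
            # Field name + type: a full-text field, analyzed into terms.
            "name": {"type": "text"},
            # Field name + type: a numeric field.
            "fans": {"type": "integer"},
            # Inverted-index setting: keep the value in _source but do not index it.
            "internal_note": {"type": "keyword", "index": False},
        }
    }
}

# JSON body that could be sent with `PUT /user` when creating the index.
body = json.dumps(explicit_mapping, indent=2)
print(body)
```

Declaring the mapping up front, rather than letting ES infer it, is what lets you choose types and indexing behaviour per field.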
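
A complete mapping can be divided into four parts:
Field datatypes
Meta-fields
Mapping parameters
Dynamic mapping
These four parts are explained below.

The previous ES post wrote data into ES without configuring any mapping; you may remember that data.

That post, "Never mind the theory, get it running first", contains the syntax for inserting data.

The insert is too long to reproduce here; see that post if you are interested.

GET /user/_mapping returns the mapping of the user index: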

{
  "user" : {
    "mappings" : {
      "properties" : {
        "desc" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "name" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "title" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

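As you can see, inserting data creates the mapping automatically by default.

All three fields, name, title, and desc, were detected as text, and each was also given a keyword sub-field of type keyword.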
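
The inference seen in the response above can be imitated with a few lines of Python. This is a simplified sketch, not the real dynamic-mapping logic: it handles only flat documents with string, boolean, integer, and float values, and it ignores date detection. The function names and the sample document are hypothetical.

```python
def infer_string_mapping():
    """Mapping that default dynamic mapping assigns to a plain string value."""
    return {
        "type": "text",
        "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
    }


def infer_properties(doc):
    """Simplified imitation of dynamic mapping for a flat JSON document."""
    props = {}
    for field, value in doc.items():
        if isinstance(value, bool):  # check bool first: bool is a subclass of int
            props[field] = {"type": "boolean"}
        elif isinstance(value, int):
            props[field] = {"type": "long"}
        elif isinstance(value, float):
            props[field] = {"type": "float"}
        elif isinstance(value, str):
            props[field] = infer_string_mapping()
        else:
            raise TypeError(f"unsupported value for field {field!r}")
    return props


# Hypothetical document with the same three fields as the user index.
sample = {"name": "longshu", "title": "engineer", "desc": "writes about es"}
inferred = infer_properties(sample)
print(inferred["title"])
```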

GET /user/_mapping/field/title returns the mapping of the title field:

{
  "user" : {
    "mappings" : {
      "title" : {
        "full_name" : "title",
        "mapping" : {
          "title" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    }
  }
}

The above shows how to retrieve a mapping, either for a whole index or for a single field.
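
The two request paths used above follow one pattern, which a small helper can capture. This is only a sketch: `mapping_path` is a hypothetical helper name, and it builds the path without contacting any cluster.

```python
from urllib.parse import quote


def mapping_path(index, field=None):
    """Build the GET path for an index mapping or a single field's mapping."""
    path = f"/{quote(index, safe='')}/_mapping"
    if field is not None:
        path += f"/field/{quote(field, safe='')}"
    return path


print(mapping_path("user"))
print(mapping_path("user", "title"))
```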

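Data types
Core data types
string: two kinds, text and keyword

text: what is stored in the index is not the original string as a whole. An analyzer splits the content into a series of terms, which are stored one by one in the index's inverted index.
keyword: the original input is stored in the inverted index as a single term. Unlike a text field, it is not run through an analyzer.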
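
The difference between the two string types can be seen with a toy inverted index. In this sketch `analyze_text` is a rough stand-in for an analyzer (it lowercases and splits on non-alphanumeric characters, which is an assumption and much simpler than what ES does), while `analyze_keyword` keeps the whole value as one term. The sample documents are made up.

```python
import re
from collections import defaultdict


def analyze_text(value):
    """Rough stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", value.lower()) if t]


def analyze_keyword(value):
    """keyword fields are not analyzed: the whole value is a single term."""
    return [value]


def build_inverted_index(docs, analyze):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, value in docs.items():
        for term in analyze(value):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = {1: "Quick brown fox", 2: "Brown bear"}
text_index = build_inverted_index(docs, analyze_text)
keyword_index = build_inverted_index(docs, analyze_keyword)

print(text_index["brown"])          # [1, 2]
print("brown" in keyword_index)     # False
print(keyword_index["Brown bear"])  # [2]
```

A search for "brown" finds both documents in the text index, while the keyword index only matches the exact original value.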
Numeric

long, integer, short, byte, double, float, half_float, scaled_float

Date

date: the date type

Date nanoseconds

date_nanos

Boolean

boolean

Binary

binary

Range

integer_range, float_range, long_range, double_range, date_range, ip_range

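Complex data types
Object: a single JSON object

Nested: an array of JSON objects, each of which can be queried independently

Geo data types
Geo-point: latitude/longitude points

Geo-shape: complex shapes such as polygons

Specialised data types
IP

Supports IPv4 and IPv6 addresses. This type is used a lot, especially in log queries.

Completion datatype

A type designed to optimise auto-completion of input while searching. Auto-completion will be covered in detail in the posts on querying.

Token count

Accepts a string, analyzes it, and indexes the number of tokens it produces.

mapper-murmur3

Requires the mapper-murmur3 plugin. It computes a hash of the field value at index time and stores it in the index, which improves the performance of aggregations such as cardinality.

Join

The join type lets you define several kinds of documents in the same index (for example student documents and class documents) that are related one-to-many, as parent and child.

Alias

Defines an alternative name for a field.

Arrays

There is no dedicated array type: any field can hold zero or more values, and all values in an array must share the same data type.

range datatype

A single field represents a range of values, for example an integer_range or a date_range as listed above.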

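Meta-Fields
Identity meta-fields

_index, _type, _id

_index: the index the document belongs to. It is indexed automatically and can be used in queries, aggregations, sorting, and scripts.

_type: the mapping type the document belongs to (deprecated in 7.x). It is indexed automatically and can be used in queries, aggregations, sorting, and scripts.

_id: the document's unique identifier, supplied at index time or generated by ES. It can be used in queries such as term and ids; using it for aggregations or sorting is discouraged. The _uid field that older documentation refers to was removed in 7.0.

Document source meta-fields

_source, _size

_source: the original JSON of the document. It is not indexed and is used to return field values. Storing it makes the index larger; to keep it while limiting index size, you can enable stronger index compression (the best_compression codec).

_size: the size of the _source field in bytes (provided by the mapper-size plugin)

Indexing meta-fields

_field_names, _ignored

_field_names: indexes the name of every field in a document that contains a non-null value. The exists query can use it to find documents that have a value for a given field (the older missing query has been replaced by must_not combined with exists).

_ignored: by default, trying to index a value of the wrong data type into a field raises an exception and rejects the whole document. If the ignore_malformed parameter is set to true, the exception is suppressed: the malformed field is not indexed, but the other fields in the document are processed normally. The _ignored field records the names of the fields that were skipped this way.

Routing meta-field

_routing

Other meta-fields

_meta

Mapping parameters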

analyzer
boost
coerce
copy_to
doc_values
dynamic
eager_global_ordinals
enabled
fielddata
fields
format
ignore_above
ignore_malformed
index_options
index_phrases
index_prefixes
index
meta
normalizer
norms
null_value
position_increment_gap
properties
search_analyzer
similarity
store
term_vector
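There are a great many mapping parameters. A few commonly used ones are described here; look up the others in the documentation when you need them.

analyzer

Specifies the analyzer. Elasticsearch is a distributed storage system with full-text search. For a text field, an analyzer first splits the content into terms, which are stored one by one in the inverted index; later queries mostly search on those terms.

The analyzer parameter can be set per query, per field, and per index. At index time the order of precedence is (earlier wins):
1. The analyzer defined on the field
2. The analyzer defined in the index settings
3. The default analyzer (standard)

At search time the analyzer is looked up in this order:
1. The analyzer defined in the full-text query
2. The search_analyzer defined in the field mapping
3. The analyzer defined in the field mapping
4. The default_search analyzer in the index settings
5. The default analyzer in the index settings
6. The standard analyzer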
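
The search-time lookup order can be written as a small resolution function. This is a sketch of the precedence rule only, under the assumption that the query, the field mapping, and the index's analyzer settings are passed in as plain dictionaries; `resolve_search_analyzer` is a hypothetical name, not an Elasticsearch API.

```python
def resolve_search_analyzer(query=None, field_mapping=None, index_analyzers=None):
    """Return the analyzer name chosen at search time, first match wins."""
    query = query or {}
    field_mapping = field_mapping or {}
    index_analyzers = index_analyzers or {}
    candidates = [
        query.get("analyzer"),                   # 1. set in the full-text query
        field_mapping.get("search_analyzer"),    # 2. search_analyzer on the field
        field_mapping.get("analyzer"),           # 3. analyzer on the field
        "default_search" if "default_search" in index_analyzers else None,  # 4.
        "default" if "default" in index_analyzers else None,                # 5.
        "standard",                              # 6. built-in fallback
    ]
    return next(name for name in candidates if name is not None)


field = {"type": "text", "analyzer": "english", "search_analyzer": "simple"}
print(resolve_search_analyzer(field_mapping=field))
print(resolve_search_analyzer(query={"analyzer": "whitespace"}, field_mapping=field))
```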

coerce

Whether to perform implicit type conversion. For example:

"fans":{
  "type":"integer"
}

fans is declared as an integer. With coerce: true, the string "10000" can also be written to the fans field; with coerce: false, fans must be given the number 10000.
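
The effect of coerce on an integer field can be sketched as follows. This is a simplified imitation written for illustration, not the actual ES parsing code: `index_integer` is a hypothetical function that returns the value that would be indexed, or raises `ValueError` where the document would be rejected.

```python
def index_integer(value, coerce=True):
    """Return the integer to index, or raise ValueError if the value is rejected."""
    if isinstance(value, bool):
        raise ValueError("booleans are not integers")
    if isinstance(value, int):
        return value  # already the declared type, accepted either way
    if not coerce:
        raise ValueError(f"coerce is false, refusing {value!r}")
    if isinstance(value, str):
        return int(value)  # "10000" -> 10000; non-numeric text raises ValueError
    if isinstance(value, float):
        return int(value)  # the fractional part is truncated
    raise ValueError(f"cannot coerce {value!r}")


print(index_integer("10000"))               # 10000
print(index_integer(10000, coerce=False))   # 10000
```

With `coerce=False`, calling `index_integer("10000", coerce=False)` raises `ValueError`, which mirrors the rejected document.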

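boost

A weight that raises a field's contribution to the relevance score at query time. The default is 1.0. It applies only to term queries; prefix, range, and fuzzy queries are not boosted.

copy_to

The copy_to parameter lets you create custom _all fields. In other words, the values of several fields can be copied into a single field.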

{
  "user" : {
    "mappings" : {
      "properties" : {
        "desc" : {
          "type" : "text",
          "analyzer" : "standard",
          "copy_to" : "content"
        },
        "fans" : {
          "type" : "integer",
          "copy_to" : "content"
        },
        "name" : {
          "type" : "text",
          "analyzer" : "standard",
          "copy_to" : "content"
        },
        "content" : {
          "type" : "text"
        }
      }
    }
  }
}

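The content field receives the values of the desc, fans, and name fields.

Notes on copy_to:
1. The original field value is copied, not the analyzed terms.
2. The target field is not included in _source, but it can be used in queries.
3. One field can be copied to several fields, written as "copy_to": ["field_1", "field_2"]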
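
To make the copy_to behaviour concrete, the sketch below walks a mapping's properties and collects the values each target field would receive. It is a toy model under stated assumptions: `apply_copy_to` is a hypothetical helper, the sample document values are made up, and real ES copies values at index time inside the engine rather than returning them like this.

```python
def apply_copy_to(properties, doc):
    """Return the original values each copy_to target field would be indexed with."""
    copied = {}
    for field, config in properties.items():
        targets = config.get("copy_to", [])
        if isinstance(targets, str):  # a single target may be given as a string
            targets = [targets]
        if field in doc:
            for target in targets:
                copied.setdefault(target, []).append(doc[field])
    return copied


properties = {
    "desc": {"type": "text", "copy_to": "content"},
    "fans": {"type": "integer", "copy_to": "content"},
    "name": {"type": "text", "copy_to": "content"},
    "content": {"type": "text"},
}
doc = {"name": "longshu", "desc": "writes about es", "fans": 10000}

copied = apply_copy_to(properties, doc)
print(copied)            # {'content': ['writes about es', 10000, 'longshu']}
print("content" in doc)  # False: the source document is left unchanged
```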

doc_values

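When sorting on a field, ES has to collect the sort-field values of every document in the matching result set and then sort them. The inverted index is very efficient for search, but it is poorly suited to sorting. doc_values stores field values in a column-oriented structure on disk at index time to serve this access pattern.

dynamic

Whether new fields may be added to the mapping implicitly when a document contains them. It accepts true (the default), false, and strict.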
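
The dynamic setting takes the values true, false, and strict, and the sketch below models what each one does to the set of mapped fields when a document with an unknown field arrives. `mapped_fields_after_indexing` is a hypothetical helper written for illustration; it tracks field names only and does not infer types.

```python
def mapped_fields_after_indexing(known_fields, doc, dynamic="true"):
    """Return the set of mapped field names after indexing doc."""
    if dynamic not in ("true", "false", "strict"):
        raise ValueError(f"unknown dynamic setting: {dynamic!r}")
    known = set(known_fields)
    new_fields = set(doc) - known
    if dynamic == "strict" and new_fields:
        # strict: the whole document is rejected
        raise ValueError(f"unmapped fields rejected: {sorted(new_fields)}")
    if dynamic == "true":
        # true: new fields are added to the mapping
        return known | new_fields
    # false: new fields are ignored by the mapping (they stay in _source only)
    return known


doc = {"name": "longshu", "age": 30}
print(sorted(mapped_fields_after_indexing({"name"}, doc, "true")))   # ['age', 'name']
print(sorted(mapped_fields_after_indexing({"name"}, doc, "false")))  # ['name']
```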

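normalizer

Normalisation, mainly for keyword fields. Before the field is indexed or queried, the raw value goes through some simple processing such as lowercasing, and the result is stored in the inverted index as a single term.

enabled

Whether the field is indexed. By default ES tries to index every field, but some fields do not need indexing and are only there to be stored.

eager_global_ordinals

Global ordinals assign each unique term an incrementing number in lexicographic order. This parameter controls whether they are built eagerly at refresh time rather than on first use.

fielddata

To support sorting and aggregations, Elasticsearch provides doc_values for column-oriented storage, but doc_values does not support the text field type.

A text field has to be analyzed first, which does not fit the doc_values columnar layout. To support sorting and aggregations on text fields, ES provides a separate data structure, fielddata, which is held in memory. It is disabled by default because it can use a lot of heap.

format

In JSON documents, dates are represented as strings. The format parameter tells ES which date pattern to expect.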

{
  "user": {
    "mappings": {
      "properties": {
        "desc": {
          "type": "text",
          "analyzer": "standard",
          "copy_to": "content"
        },
        "fans": {
          "type": "integer",
          "copy_to": "content"
        },
        "name": {
          "type": "text",
          "analyzer": "standard",
          "copy_to": "content"
        },
        "content": {
          "type": "text"
        },
        "date": {
          "type": "date",
          "format": "yyyy-MM-dd HH:mm:ss"
        }
      }
    }
  }
}
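
A client writing to this index has to produce date strings that match the declared format. As a minimal sketch, the Python `strftime` pattern below corresponds to the mapping format `yyyy-MM-dd HH:mm:ss`; the helper name `to_es_date` and the sample document are assumptions for illustration.

```python
from datetime import datetime

# Python strftime equivalent of the mapping format "yyyy-MM-dd HH:mm:ss".
PY_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_es_date(moment):
    """Render a datetime as a string matching the mapping's date format."""
    return moment.strftime(PY_FORMAT)


doc = {"name": "longshu", "date": to_es_date(datetime(2020, 12, 6, 9, 30, 0))}
print(doc["date"])  # 2020-12-06 09:30:00
```

A value in any other layout, for example an ISO 8601 string with a `T` separator, would not match this format and would be rejected unless the mapping lists additional formats.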

fields

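fields lets the same field be indexed in more than one way, for example as text for full-text search and as keyword for sorting and aggregations.

index

Whether the field is indexed: true means indexed, false means not indexed (the field then cannot be queried). The default is true.

index_options

Controls what additional information is added to the inverted index. The options are:

docs: only the document number is indexed.
freqs: document numbers and term frequencies.
positions: document numbers, term frequencies, and term positions (order); proximity and phrase queries need this level.
offsets: document numbers, term frequencies, positions, and start and end character offsets; used to speed up highlighting.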
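
Each index_options level stores everything the previous level stores plus one more piece of information. The sketch below builds a toy postings entry per term to show what each level records. It is an illustration only: `postings` is a hypothetical function, and the tokenisation (lowercase, split on word characters) is an assumption that is far simpler than a real analyzer.

```python
import re

LEVELS = ["docs", "freqs", "positions", "offsets"]


def postings(doc_id, text, index_options="positions"):
    """Return {term: entry} holding only what the chosen level records."""
    depth = LEVELS.index(index_options)  # raises ValueError for an unknown option
    result = {}
    for position, match in enumerate(re.finditer(r"\w+", text.lower())):
        entry = result.setdefault(match.group(), {"doc": doc_id})
        if depth >= 1:  # freqs: how often the term occurs in the document
            entry["freq"] = entry.get("freq", 0) + 1
        if depth >= 2:  # positions: ordinal position of each occurrence
            entry.setdefault("positions", []).append(position)
        if depth >= 3:  # offsets: start and end character offsets
            entry.setdefault("offsets", []).append((match.start(), match.end()))
    return result


print(postings(1, "to be or not to be", "freqs")["to"])
# {'doc': 1, 'freq': 2}
print(postings(1, "to be or not to be", "offsets")["be"])
# {'doc': 1, 'freq': 2, 'positions': [1, 5], 'offsets': [(3, 5), (16, 18)]}
```

A phrase query such as "to be" needs the positions list to check that the two terms are adjacent, which is why that level is required for phrase and proximity queries.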

norms

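Normalization factors used in scoring. Storing them speeds up score computation at query time.

null_value

Replaces an explicit null value with the specified value, so that the field can be indexed and searched.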

search_analyzer