Elasticsearch 6.x Mapping Settings

Date: 2019-11-09

Tags: elasticsearch, 6.x, mapping, settings; Category: log analysis


Mapping

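A mapping is similar to a table schema definition in a database. Its main purposes are:

Define the field names under an index
Define the field types, such as numeric, string and boolean
Define how the inverted index is built for each field, such as whether the field is indexed and whether positions are recorded

Note that defining too many fields in an index can lead to a mapping explosion, which can cause out-of-memory errors and situations that are difficult to recover from. The following settings limit this:

index.mapping.total_fields.limit: the maximum number of fields in an index; defaults to 1000
index.mapping.depth.limit: the maximum depth of a field, measured as the number of inner objects; defaults to 20
index.mapping.nested_fields.limit: the maximum number of nested fields in an index; defaults to 50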

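Data types

Core data types

String - text
- Used for full-text search. The field value is passed through an analyzer, and the resulting terms are used to build the index
String - keyword
- Not analyzed; the field can only be searched by its exact, complete value. Typically used for filtering, sorting and aggregations
Numeric
- long: signed 64-bit integer, -2^63 ~ 2^63 - 1
- integer: signed 32-bit integer, -2^31 ~ 2^31 - 1
- short: signed 16-bit integer, -32768 ~ 32767
- byte: signed 8-bit integer, -128 ~ 127
- double: 64-bit IEEE 754 floating point number
- float: 32-bit IEEE 754 floating point number
- half_float: 16-bit IEEE 754 floating point number
- scaled_float: a floating point number stored as a long, scaled by a fixed scaling_factor
Boolean - boolean
- Accepted values: false, "false", true, "true"
Date - date
- JSON has no date type, so ES decides whether a string is a date by checking whether it matches the pattern defined by format
- The default format is strict_date_optional_time||epoch_millis
Binary - binary
- Accepts the value as a Base64-encoded string; not stored by default and not searchable
Range types
- A range type holds a range of values rather than a single value
- For example, if age is of type integer_range, its value can be {"gte" : 10, "lte" : 20}. The query "term" : {"age": 15} matches it, and so does "range": {"age": {"gte":11, "lte": 15}}
- The relation parameter of the range query sets the matching mode
  - INTERSECTS: the default mode; matches as soon as the query range and the field value overlap
  - WITHIN: the field value must lie entirely within the query range, that is, the field value is a subset of the query range
  - CONTAINS: the opposite of WITHIN; matches only documents whose field value entirely contains the query range
- integer_range
- float_range
- long_range
- double_range
- date_range: unsigned 64-bit integer timestamps in milliseconds
- ip_range: IPv4 or IPv6 addresses given as strings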
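
The three relation modes listed above can be thought of as comparisons between closed intervals. The following is a minimal Python sketch of that idea, not Elasticsearch code; the function name matches and the (low, high) tuple representation are assumptions made purely for illustration.

```python
def matches(field, query, relation="intersects"):
    """Toy model of the `relation` parameter on closed ranges (low, high)."""
    f_lo, f_hi = field
    q_lo, q_hi = query
    if relation == "intersects":   # the two ranges share at least one value
        return f_lo <= q_hi and q_lo <= f_hi
    if relation == "within":       # the field range lies inside the query range
        return q_lo <= f_lo and f_hi <= q_hi
    if relation == "contains":     # the field range covers the whole query range
        return f_lo <= q_lo and q_hi <= f_hi
    raise ValueError(f"unknown relation: {relation}")


age = (10, 20)  # stands for the stored value {"gte": 10, "lte": 20}
print(matches(age, (11, 15), "intersects"))  # True
print(matches(age, (11, 15), "within"))      # False
print(matches(age, (11, 15), "contains"))    # True
print(matches(age, (5, 30), "within"))       # True
```

Under this model a term query for 15 is simply the degenerate query range (15, 15).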

# Create an index with range fields
PUT range_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "expected_attendees": {
          "type": "integer_range"
        },
        "time_frame": {
          "type": "date_range", 
          "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
        }
      }
    }
  }
}

# Index a document
PUT range_index/_doc/1
{
  "expected_attendees" : { 
    "gte" : 10,
    "lte" : 20
  },
  "time_frame" : { 
    "gte" : "2015-10-31 12:00:00", 
    "lte" : "2015-11-05"
  }
}

# 12 falls within 10 ~ 20, so document 1 is returned
GET range_index/_search
{
  "query" : {
    "term" : {
      "expected_attendees" : {
        "value": 12
      }
    }
  }
}

# Range query on the date_range field using the within relation
# Change the dates and the relation to compare how CONTAINS, WITHIN and INTERSECTS behave
GET range_index/_search
{
  "query" : {
    "range" : {
      "time_frame" : { 
        "gte" : "2015-11-02",
        "lte" : "2015-11-03",
        "relation" : "within" 
      }
    }
  }
}

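Complex data types

Array
- Array of strings: [ "one", "two" ]
- Array of integers: [ 1, 2 ]
- Array of arrays: [ 1, [ 2, 3 ]], which is equivalent to [ 1, 2, 3 ]
- Array of objects: [ { "name": "Mary", "age": 12 }, { "name": "John", "age": 10 }]
- All values in an array must be of the same data type; mixing is not allowed, so [ 10, "some string" ] is invalid
- A null value in an array is either replaced by the value configured in null_value or skipped
- An empty array [] is treated as a missing field
Object
- An object may contain inner objects
- Inner fields are indexed under flattened paths such as manager.name.first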
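
The array rules above (nested arrays are flattened, nulls are replaced or skipped) can be pictured with a small Python sketch. It is only an illustration of the described behaviour; the helper name flatten and its null_value argument are hypothetical and not part of any Elasticsearch API.

```python
def flatten(values, null_value=None):
    """Flatten nested arrays; replace nulls with null_value, or drop them if it is unset."""
    out = []
    for v in values:
        if isinstance(v, list):
            out.extend(flatten(v, null_value))   # [1, [2, 3]] behaves like [1, 2, 3]
        elif v is None:
            if null_value is not None:
                out.append(null_value)           # null replaced by the configured value
        else:
            out.append(v)
    return out


print(flatten([1, [2, 3]]))          # [1, 2, 3]
print(flatten([1, None, 2]))         # [1, 2]
print(flatten(["a", None], "NULL"))  # ['a', 'NULL']
```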

# tags is an array of strings; lists is an array of objects
PUT my_index/_doc/1
{
  "message": "some arrays in this document...",
  "tags":  [ "elasticsearch", "wow" ], 
  "lists": [ 
    {
      "name": "prog_list",
      "description": "programming list"
    },
    {
      "name": "cool_list",
      "description": "cool stuff list"
    }
  ]
}


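Nested
- The nested type is a specialised version of the object type that allows arrays of objects to be indexed so that each object can be queried independently

Difference between nested and object

An example illustrates this:

Index a document without defining a mapping. The user field is then dynamically mapped as an array of objects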

DELETE my_index

PUT my_index/_doc/1
{
  "group" : "fans",
  "user" : [ 
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ]
}


Search for documents where user.first is Alice and user.last is Smith. Ideally no document should match,
yet document 1 is returned. Why?

GET my_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "user.first": "Alice" }},
        { "match": { "user.last":  "Smith" }}
      ]
    }
  }
}

This is because the object type is flattened internally into a document like this:

{
  "group" :        "fans",
  "user.first" : [ "alice", "john" ],
  "user.last" :  [ "smith", "white" ]
}

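user.first and user.last are flattened into multi-valued fields, and the association between alice and white is lost. As a result, the document incorrectly matches a query for alice and smith.
What if user is mapped as nested from the start?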

DELETE my_index
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "user": {
          "type": "nested" 
        }
      }
    }
  }
}

PUT my_index/_doc/1
{
  "group": "fans",
  "user": [
    {
      "first": "John",
      "last": "Smith"
    },
    {
      "first": "Alice",
      "last": "White"
    }
  ]
}

Run the queries again: the first one below returns no documents and the second returns document 1, as expected

GET my_index/_search
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.first": "Alice" }},
            { "match": { "user.last":  "Smith" }} 
          ]
        }
      }
    }
  }
}

GET my_index/_search
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.first": "Alice" }},
            { "match": { "user.last":  "White" }} 
          ]
        }
      },
      "inner_hits": { 
        "highlight": {
          "fields": {
            "user.first": {}
          }
        }
      }
    }
  }
}

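A nested field indexes each object in the array as a separate hidden document, which means each nested object can be queried independently
Note that nested fields must be accessed through dedicated features:

Query them with the nested query
Analyse them with the nested and reverse_nested aggregations
Sort on them with nested sorting
Retrieve and highlight them with nested inner hits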
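
The difference between object and nested matching described in this section can be modelled in a few lines of Python. This is a simplified sketch of the idea only: object_match and nested_match are hypothetical helper names, and lowercasing stands in for the analysis step.

```python
doc_users = [
    {"first": "John", "last": "Smith"},
    {"first": "Alice", "last": "White"},
]


def object_match(users, first, last):
    """Object type: values are flattened into one multi-valued set per field."""
    firsts = {u["first"].lower() for u in users}
    lasts = {u["last"].lower() for u in users}
    return first.lower() in firsts and last.lower() in lasts


def nested_match(users, first, last):
    """Nested type: both conditions must hold inside the same inner object."""
    return any(
        u["first"].lower() == first.lower() and u["last"].lower() == last.lower()
        for u in users
    )


print(object_match(doc_users, "Alice", "Smith"))  # True (the false positive)
print(nested_match(doc_users, "Alice", "Smith"))  # False
print(nested_match(doc_users, "Alice", "White"))  # True
```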

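Geo data types

geo_point
- A geographic point, whose value can be given in the following four forms:
  - Object: "location": {"lat": 41.12, "lon": -71.34}
  - String in "lat,lon" order: "location": "41.12,-71.34"
  - Geohash: "location": "drm3btev3e86"
  - Array in [ lon, lat ] order: "location": [ -71.34, 41.12 ]
- Can be searched with, for example, the Geo Bounding Box Query
geo_shape
- For complex shapes such as polygons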
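
The geo_point input forms differ mainly in the order of latitude and longitude, which is easy to get wrong. The Python sketch below normalises three of the four forms to a (lat, lon) pair; parse_geo_point is a hypothetical helper, and geohash decoding is deliberately left out.

```python
def parse_geo_point(value):
    """Return (lat, lon) for the object, string and array forms of a geo_point."""
    if isinstance(value, dict):              # {"lat": ..., "lon": ...}
        return float(value["lat"]), float(value["lon"])
    if isinstance(value, str):               # "lat,lon"
        lat, lon = value.split(",")
        return float(lat), float(lon)
    if isinstance(value, (list, tuple)):     # [lon, lat], note the reversed order
        lon, lat = value
        return float(lat), float(lon)
    raise TypeError("unsupported geo_point format")


print(parse_geo_point({"lat": 41.12, "lon": -71.34}))  # (41.12, -71.34)
print(parse_geo_point("41.12,-71.34"))                 # (41.12, -71.34)
print(parse_geo_point([-71.34, 41.12]))                # (41.12, -71.34)
```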

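Specialised data types

ip: stores IPv4 and IPv6 addresses
completion: provides auto-complete suggestions
token_count: counts the number of tokens in a string
murmur3: computes and stores a hash of the value at index time (provided by the mapper-murmur3 plugin)
Percolator: stores queries

# ip type: stores an IP address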
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "ip_addr": {
          "type": "ip"
        }
      }
    }
  }
}

PUT my_index/_doc/1
{
  "ip_addr": "192.168.1.1"
}

GET my_index/_search
{
  "query": {
    "term": {
      "ip_addr": "192.168.0.0/16"
    }
  }
}
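
# The term query above uses CIDR notation. As a rough illustration of why 192.168.1.1
# matches 192.168.0.0/16, here is the same containment check in Python with the
# standard ipaddress module. ip_term_matches is a hypothetical helper, not an ES API.

```python
import ipaddress


def ip_term_matches(stored_ip, term):
    """True if stored_ip equals term, or falls inside term when term is in CIDR notation."""
    addr = ipaddress.ip_address(stored_ip)
    if "/" in term:
        return addr in ipaddress.ip_network(term, strict=False)
    return addr == ipaddress.ip_address(term)


print(ip_term_matches("192.168.1.1", "192.168.0.0/16"))  # True
print(ip_term_matches("192.168.1.1", "10.0.0.0/8"))      # False
```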


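Multi-fields

Multi-fields allow the same field to be indexed in different ways, for example with different analyzers. A common example is pinyin search on personal names, which only needs an additional pinyin sub-field under the name field
Configured through the fields parameter

Mapping settings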

GET my_index/_mapping

# Result
{
  "my_index": {
    "mappings": {
      "doc": {
        "properties": {
          "age": {
            "type": "integer"
          },
          "created": {
            "type": "date"
          },
          "name": {
            "type": "text"
          },
          "title": {
            "type": "text"
          }
        }
      }
    }
  }
}

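Mapping parameters

analyzer

The analyzer used to tokenize the field both at index time and at search time; defaults to the standard analyzer

boost

Field weight; defaults to 1.0

dynamic

Once a field's type has been set in the mapping, it cannot be changed directly, because the inverted index built by Lucene cannot be modified after it has been generated
The only option is to create a new index and reindex the data
Adding new fields is allowed by default
The dynamic parameter controls how new fields are handled:
- true (default): new fields are added to the mapping automatically
- false: new fields are not added to the mapping; the document is still written, but the new fields cannot be searched
- strict: the document is rejected with an error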
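
The three dynamic modes can be summarised with a toy Python model. It only mirrors the behaviour described in the list above; index_document and its arguments are hypothetical names chosen for this sketch.

```python
def index_document(mapped_fields, doc, dynamic="true"):
    """Toy model of `dynamic`: returns (fields in the mapping, searchable fields of doc)."""
    mapped = set(mapped_fields)
    unknown = [name for name in doc if name not in mapped]
    if unknown and dynamic == "strict":
        # strict: the whole document is rejected
        raise ValueError(f"strict mapping, unknown fields: {unknown}")
    if dynamic == "true":
        mapped.update(unknown)   # true: new fields are added to the mapping
    # false: the document is accepted but unknown fields stay out of the mapping
    searchable = sorted(name for name in doc if name in mapped)
    return sorted(mapped), searchable


doc = {"name": "whirly", "age": 22}
print(index_document(["name"], doc, "true"))   # (['age', 'name'], ['age', 'name'])
print(index_document(["name"], doc, "false"))  # (['name'], ['name'])
```

Calling it with "strict" and the same document raises ValueError, which corresponds to the rejected write.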

PUT my_index
{
  "mappings": {
    "_doc": {
      "dynamic": false, 
      "properties": {
        "user": { 
          "properties": {
            "name": {
              "type": "text"
            },
            "social_networks": { 
              "dynamic": true,
              "properties": {}
            }
          }
        }
      }
    }
  }
}

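With this mapping, new fields are not added automatically at the top level of my_index, but new sub-fields can be added automatically under user.social_networks

copy_to

Copies the field's value into a target field, giving an effect similar to _all
The copied values do not appear in _source; the target field is used only for searching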

DELETE my_index
PUT my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "first_name": {
          "type": "text",
          "copy_to": "full_name" 
        },
        "last_name": {
          "type": "text",
          "copy_to": "full_name" 
        },
        "full_name": {
          "type": "text"
        }
      }
    }
  }
}

PUT my_index/doc/1
{
  "first_name": "John",
  "last_name": "Smith"
}

GET my_index/_search
{
  "query": {
    "match": {
      "full_name": { 
        "query": "John Smith",
        "operator": "and"
      }
    }
  }
}

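index

Controls whether the field is indexed. Defaults to true; when set to false the field is not indexed and therefore not searchable

index_options

Controls what information is added to the inverted index for search and highlighting. Possible values are docs, freqs, positions and offsets
docs: only the doc id is indexed
freqs: doc id and term frequency are indexed; term frequency may be used for scoring
positions: doc id, term frequency and term positions are indexed; positions are used for proximity and phrase queries
offsets: doc id, term frequency, positions, and start and end character offsets are indexed; offsets are used to speed up highlighting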
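
To make the four index_options levels concrete, the sketch below builds a toy inverted index in Python that records only as much as the chosen level asks for. It is an illustration under simplifying assumptions (lowercasing plus a \w+ regex as the tokenizer; build_postings is a hypothetical name), not how Lucene stores postings.

```python
import re
from collections import defaultdict


def build_postings(docs, index_options="positions"):
    """Build a toy inverted index holding only what `index_options` asks for."""
    levels = ["docs", "freqs", "positions", "offsets"]
    level = levels.index(index_options)
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, m in enumerate(re.finditer(r"\w+", text.lower())):
            entry = postings[m.group()].setdefault(doc_id, {})  # doc id is always kept
            if level >= 1:
                entry["freq"] = entry.get("freq", 0) + 1
            if level >= 2:
                entry.setdefault("positions", []).append(pos)
            if level >= 3:
                entry.setdefault("offsets", []).append((m.start(), m.end()))
    return dict(postings)


docs = {1: "quick brown fox", 2: "quick quick dog"}
print(build_postings(docs, "docs")["quick"])
# {1: {}, 2: {}}
print(build_postings(docs, "positions")["quick"])
# {1: {'freq': 1, 'positions': [0]}, 2: {'freq': 2, 'positions': [0, 1]}}
```

A phrase query needs the positions level, which is why a field indexed with docs or freqs cannot serve one in this model.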

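fielddata

Whether fielddata is enabled on a text field; defaults to false
The first time fielddata is needed, Elasticsearch loads the inverted index of every segment for that field into memory
If we have some 5 GB segments and want to load 10 GB of fielddata into memory, this can take tens of seconds
Preloading fielddata moves this loading cost from query time to index refresh time, which greatly improves the search experience. In 6.x, setting fielddata to true only enables fielddata on the field; the preloading option described in the reference below comes from earlier releases
Reference: Preloading fielddata

eager_global_ordinals

Whether global ordinals are built eagerly; defaults to false
Reference: Eager global ordinals

doc_values

Reference: Doc Values and Fielddata

fields

This parameter exists to support multi-fields
One field, multiple data types
For example, a city field of type text is used for full-text search, and fields can define a keyword sub-field for it that is used for sorting and aggregations

# Define the mapping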
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "city": {
          "type": "text",
          "fields": {
            "raw": { 
              "type":  "keyword"
            }
          }
        }
      }
    }
  }
}

# Index two documents
PUT my_index/_doc/1
{
  "city": "New York"
}

PUT my_index/_doc/2
{
  "city": "York"
}

# Query: city is used for the full-text match; city.raw is used for sorting and aggregations
GET my_index/_search
{
  "query": {
    "match": {
      "city": "york" 
    }
  },
  "sort": {
    "city.raw": "asc" 
  },
  "aggs": {
    "Cities": {
      "terms": {
        "field": "city.raw" 
      }
    }
  }
}

format

JSON has no date type, so Elasticsearch uses the format parameter to define date patterns in advance; strings that match are recognised as dates and converted to timestamps in milliseconds
The default format is strict_date_optional_time||epoch_millis
Built-in date formats in Elasticsearch:

Name	Format
epoch_millis	timestamp in milliseconds
epoch_second	timestamp in seconds
date_optional_time	ISO date with an optional time part
basic_date	yyyyMMdd
basic_date_time	yyyyMMdd'T'HHmmss.SSSZ
basic_date_time_no_millis	yyyyMMdd'T'HHmmssZ
basic_ordinal_date	yyyyDDD
basic_ordinal_date_time	yyyyDDD'T'HHmmss.SSSZ
basic_ordinal_date_time_no_millis	yyyyDDD'T'HHmmssZ
basic_time	HHmmss.SSSZ
basic_time_no_millis	HHmmssZ
basic_t_time	'T'HHmmss.SSSZ
basic_t_time_no_millis	'T'HHmmssZ

Names above prefixed with strict_ denote the strict variant
See the documentation for more formats
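
As a rough picture of what "recognised as a date and converted to a timestamp in milliseconds" means, the Python sketch below accepts either an ISO-style string or a numeric epoch value. It only approximates the default strict_date_optional_time||epoch_millis behaviour: to_epoch_millis is a hypothetical helper, Python's datetime.fromisoformat is used as a stand-in parser, and a missing time zone is assumed to mean UTC.

```python
from datetime import datetime, timezone


def to_epoch_millis(value):
    """Convert an ISO-style date string or a numeric epoch value to epoch milliseconds."""
    if isinstance(value, (int, float)):
        return int(value)                        # already epoch milliseconds
    dt = datetime.fromisoformat(value)           # e.g. 2015-01-01 or 2015-01-01T12:10:30
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)     # assumption: no zone means UTC
    return int(dt.timestamp() * 1000)


print(to_epoch_millis("2015-01-01"))            # 1420070400000
print(to_epoch_millis("2015-01-01T12:10:30"))   # 1420114230000
print(to_epoch_millis(1420070400000))           # 1420070400000
```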

properties

Defines the sub-fields of the mapping type (_doc), of object fields and of nested fields

PUT my_index
{
  "mappings": {
    "_doc": { 
      "properties": {
        "manager": { 
          "properties": {
            "age":  { "type": "integer" },
            "name": { "type": "text"  }
          }
        },
        "employees": { 
          "type": "nested",
          "properties": {
            "age":  { "type": "integer" },
            "name": { "type": "text"  }
          }
        }
      }
    }
  }
}

PUT my_index/_doc/1 
{
  "region": "US",
  "manager": {
    "name": "Alice White",
    "age": 30
  },
  "employees": [
    {
      "name": "John Smith",
      "age": 34
    },
    {
      "name": "Peter Brown",
      "age": 26
    }
  ]
}


normalizer

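A normalizer is similar to an analyzer, except that an analyzer is used for text fields and produces multiple tokens, whereas a normalizer is used for keyword fields and produces a single token (the whole field value is one token rather than being split into several)
Define a custom normalizer that uses the uppercase and asciifolding filters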

PUT test_index_4
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["uppercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "foo": {
          "type": "keyword",
          "normalizer": "my_normalizer"
        }
      }
    }
  }
}

# Index documents
POST test_index_4/_doc/1
{
  "foo": "hello world"
}

POST test_index_4/_doc/2
{
  "foo": "Hello World"
}

POST test_index_4/_doc/3
{
  "foo": "hello elasticsearch"
}

# Searching for hello returns nothing, not 3 documents
GET test_index_4/_search
{
  "query": {
    "match": {
      "foo": "hello"
    }
  }
}

# Searching for hello world returns 2 documents, 1 and 2
GET test_index_4/_search
{
  "query": {
    "match": {
      "foo": "hello world"
    }
  }
}
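
# The results of the two searches above follow from the normalizer producing exactly one
# token per value. The Python sketch below imitates that; my_normalizer approximates the
# uppercase and asciifolding filters with the standard unicodedata module, and match is a
# hypothetical stand-in for a match query on the keyword field.

```python
import unicodedata


def my_normalizer(value):
    """Approximate uppercase + asciifolding; the whole value stays ONE token."""
    decomposed = unicodedata.normalize("NFKD", value)
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return folded.upper()


indexed = {
    doc_id: my_normalizer(text)
    for doc_id, text in {1: "hello world", 2: "Hello World", 3: "hello elasticsearch"}.items()
}


def match(query):
    """The query text is normalized as well, then compared with the whole token."""
    token = my_normalizer(query)
    return sorted(doc_id for doc_id, stored in indexed.items() if stored == token)


print(match("hello"))        # []
print(match("hello world"))  # [1, 2]
```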

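Other parameters

coerce
- Coerces values: converts the value in the JSON document to the data type of the ES field, for example the string "5" to the integer 5
- coerce defaults to true
- If coerce is set to false, a JSON value that does not match the ES field type is rejected
- The index-level default can be set with "settings": { "index.mapping.coerce": false }
enabled
- Whether the field is parsed and indexed; defaults to true
- Can be set on the mapping type (_doc) and on object fields
ignore_above
- Sets the maximum length of a string that will be indexed
- Strings longer than this are not indexed, so they cannot be searched and do not appear in a terms aggregation, although they remain in _source
null_value
- Defines how a null value is handled. It defaults to null, in which case ES ignores the value
- Setting it provides a substitute value that is indexed when the field is null
ignore_malformed
- When a value does not match the field's data type and cannot be coerced, by default an exception is thrown and the whole document is rejected
- If set to true, the exception is ignored: the malformed field is not indexed, while the other fields are processed as usual
norms
- norms store various normalisation factors that are later used at query time to compute a document's score for a query
- norms are useful for scoring but take up a lot of disk space
- If scoring on a field is not needed, norms can be disabled for that field
position_increment_gap
- Relates to proximity queries and phrase queries
- Defaults to 100
search_analyzer
- The analyzer used at search time
- Defaults to the same as analyzer
similarity
- Sets the similarity (relevance scoring) algorithm; the default in ES 5.x and ES 6.x is BM25
- classic and boolean can also be chosen
store
- Whether the field is stored separately in addition to _source; defaults to false
- When ES stores a document, the JSON object is kept in the _source field, which holds all fields as a single stored document (reading it takes one I/O). A single field is extracted through source filtering
- When there are many fields or the content is large, and not all fields need to be retrieved, store can be set to true on specific fields so they are stored separately (reading them takes one I/O), with excludes configured on _source
- For more on this parameter, see: Setting the store attribute in an ES mapping
term_vector
- Relates to the inverted index: controls whether term vectors are stored for an analyzed field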
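
How coerce and ignore_malformed interact on an integer field can be sketched in Python as follows. This is a simplified model of the rules listed above, not the actual parsing logic; index_integer is a hypothetical helper, and edge cases such as range checks are ignored.

```python
def index_integer(value, coerce=True, ignore_malformed=False):
    """Return the int to index, None if the field is skipped; raise if the document is rejected."""
    if isinstance(value, int) and not isinstance(value, bool):
        return value                          # already an integer, nothing to do
    if coerce and not isinstance(value, bool):
        try:
            return int(float(value))          # "5" -> 5, 5.7 -> 5 (fraction truncated)
        except (TypeError, ValueError):
            pass                              # cannot be converted: malformed
    if ignore_malformed:
        return None                           # field skipped, rest of the document is kept
    raise ValueError(f"malformed value for integer field: {value!r}")


print(index_integer("5"))                            # 5
print(index_integer(5.7))                            # 5
print(index_integer("abc", ignore_malformed=True))   # None
```

With coerce=False and the default ignore_malformed=False, index_integer("5") raises ValueError, which corresponds to the rejected document.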

Dynamic Mapping

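ES relies on the types of the values in a JSON document to detect field types automatically. The supported types are:

JSON type	ES type
null	ignored (no field is added)
boolean	boolean
floating point number	float
integer	long
object	object
array	determined by the type of the first non-null value
string	date if the string matches a date format (enabled by default); float or long if it matches a number (disabled by default); otherwise text with a keyword sub-field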
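
The detection table above can be expressed as a small Python function. It is a sketch under stated assumptions: detect_type is a hypothetical name, date detection is reduced to a simple ISO-style regular expression rather than the real dynamic_date_formats, and "text" stands for text with a keyword sub-field.

```python
import re

# Simplified stand-in for strict_date_optional_time.
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2})?)?$")


def detect_type(value, date_detection=True, numeric_detection=False):
    """Return the ES type a JSON value would be mapped to, or None if no field is added."""
    if value is None:
        return None
    if isinstance(value, bool):        # checked before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):        # first non-null element decides
        for item in value:
            detected = detect_type(item, date_detection, numeric_detection)
            if detected is not None:
                return detected
        return None
    if date_detection and ISO_DATE.match(value):
        return "date"
    if numeric_detection and re.fullmatch(r"-?\d+", value):
        return "long"
    if numeric_detection and re.fullmatch(r"-?\d+\.\d+", value):
        return "float"
    return "text"                      # text with a keyword sub-field


doc = {"username": "whirly", "age": 22, "birthday": "1995-01-01"}
print({name: detect_type(v) for name, v in doc.items()})
# {'username': 'text', 'age': 'long', 'birthday': 'date'}
```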

For example

POST my_index/doc
{
  "username":"whirly",
  "age":22,
  "birthday":"1995-01-01"
}
GET my_index/_mapping

# Result
{
  "my_index": {
    "mappings": {
      "doc": {
        "properties": {
          "age": {
            "type": "long"
          },
          "birthday": {
            "type": "date"
          },
          "username": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}

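Date detection

The dynamic_date_formats parameter lists the date formats that are detected automatically; it defaults to [ "strict_date_optional_time","yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z"]
date_detection can be set to false to disable date detection

# Custom date detection formats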
PUT my_index
{
  "mappings": {
    "_doc": {
      "dynamic_date_formats": ["MM/dd/yyyy"]
    }
  }
}
# Disable date detection
PUT my_index
{
  "mappings": {
    "_doc": {
      "date_detection": false
    }
  }
}

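Numeric detection

A string containing a number is not detected as a numeric type by default, because numbers appearing in strings are entirely legitimate
The numeric_detection parameter enables detection of numbers in strings

Dynamic templates

Dynamic templates set field types dynamically based on the data type detected by ES, the field name and so on. They can achieve effects such as:

Map all string fields as keyword, that is, not analyzed
Map all fields whose names start with message as text, that is, analyzed
Map all fields whose names start with long_ as long
Map all fields detected as double as float to save space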

Dynamic templates API

"dynamic_templates": [
    {
      "my_template_name": { 
        ...  match conditions ... 
        "mapping": { ... } 
      }
    },
    ...
]

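Match conditions usually use the following parameters:

match_mapping_type: matches on the data type detected by ES, such as boolean, long or string
match, unmatch: match on the field name
match_pattern: switches match and unmatch from wildcards to regular expressions
path_match, path_unmatch: match on the full dotted path of the field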
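
Dynamic templates are tried in order and the first one whose conditions all match is used. The Python sketch below models that selection for match_mapping_type, match and unmatch only; pick_template and the template names are hypothetical, and fnmatch wildcards stand in for the simple * patterns.

```python
from fnmatch import fnmatchcase


def pick_template(templates, field_name, detected_type):
    """Return (name, mapping) of the first matching template, or None if none matches."""
    for template in templates:
        name, rule = next(iter(template.items()))
        if "match_mapping_type" in rule and rule["match_mapping_type"] != detected_type:
            continue
        if "match" in rule and not fnmatchcase(field_name, rule["match"]):
            continue
        if "unmatch" in rule and fnmatchcase(field_name, rule["unmatch"]):
            continue
        return name, rule["mapping"]
    return None


templates = [
    {"message_as_text": {"match": "message*", "mapping": {"type": "text"}}},
    {"long_prefix": {"match": "long_*", "mapping": {"type": "long"}}},
    {"strings_as_keyword": {"match_mapping_type": "string", "mapping": {"type": "keyword"}}},
    {"doubles_as_float": {"match_mapping_type": "double", "mapping": {"type": "float"}}},
]
print(pick_template(templates, "message_body", "string"))  # ('message_as_text', {'type': 'text'})
print(pick_template(templates, "title", "string"))         # ('strings_as_keyword', {'type': 'keyword'})
print(pick_template(templates, "price", "double"))         # ('doubles_as_float', {'type': 'float'})
```

Because the first match wins, the more specific name-based templates are listed before the catch-all string template.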

# Map fields detected as double to float to save space
PUT my_index
{
  "mappings": {
    "_doc": {
      "dynamic_templates": [
        {
          "integers": {
            "match_mapping_type": "double",
            "mapping": {
              "type": "float"
            }
          }
        }
      ]
    }
  }
}

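Recommendations for custom mappings

1. Write a document to a temporary index in ES and retrieve the mapping that ES generates automatically
2. Modify the mapping from step 1 and customise the relevant settings
3. Create the actual index using the mapping from step 2

Index Template

An index template automatically applies predefined configuration when a new index is created, which simplifies index creation
- It can define both the settings and the mappings of an index
- Several templates can apply to one index; they are merged according to order, and settings from a higher order override those from a lower one
The index template API uses the _template endpoint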
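
The way order decides which template wins can be sketched in Python. This models only the merging of flat settings for templates whose index_patterns match the new index name; resolve_settings and the sample templates are hypothetical, and real templates also merge mappings and aliases.

```python
from fnmatch import fnmatchcase


def resolve_settings(templates, index_name):
    """Merge settings of all matching templates; a higher `order` overrides a lower one."""
    matching = [
        t for t in templates
        if any(fnmatchcase(index_name, pattern) for pattern in t["index_patterns"])
    ]
    merged = {}
    for template in sorted(matching, key=lambda t: t.get("order", 0)):
        merged.update(template.get("settings", {}))   # later (higher order) wins
    return merged


templates = [
    {"index_patterns": ["test-*"], "order": 0,
     "settings": {"number_of_shards": 5, "number_of_replicas": 1}},
    {"index_patterns": ["test-index-map*"], "order": 2,
     "settings": {"number_of_shards": 1}},
]
print(resolve_settings(templates, "test-index-map_1"))
# {'number_of_shards': 1, 'number_of_replicas': 1}
print(resolve_settings(templates, "other"))  # {}
```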

# Create an index template that matches indices whose names start with test-index-map
PUT _template/template_1
{
  "index_patterns": ["test-index-map*"],
  "order": 2,
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "doc": {
      "_source": {
        "enabled": false
      },
      "properties": {
        "name": {
          "type": "keyword"
        },
        "created_at": {
          "type": "date",
          "format": "YYYY/MM/dd HH:mm:ss"
        }
      }
    }
  }
}

# Index a document
POST test-index-map_1/doc
{
  "name" : "小旋鋒",
  "created_at": "2018/08/16 20:11:11"
}

# Retrieve the index; its settings and mappings match those defined in the index template
GET test-index-map_1

# Retrieve the template
GET /_template/template_1

# Delete the template
DELETE /_template/template_1