Elasticsearch Data Types and Their Properties

Date: 2019-11-12


1. Data types

Field type overview

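Category	Subcategory	Types
Core types	String types	string, text, keyword
Core types	Integer types	integer, long, short, byte
Core types	Floating-point types	double, float, half_float, scaled_float
Core types	Boolean type	boolean
Core types	Date type	date
Core types	Range types	range
Core types	Binary type	binary
Complex types	Array type	array
Complex types	Object type	object
Complex types	Nested type	nested
Geo types	Geo-point type	geo_point
Geo types	Geo-shape type	geo_shape
Specialized types	IP type	ip
Specialized types	Completion type	completion
Specialized types	Token count type	token_count
Specialized types	Attachment type	attachment
Specialized types	Percolator type	percolator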

Core types

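1. String types
string: widely used in older Elasticsearch versions. It was deprecated in Elasticsearch 5.x and removed in 6.0, replaced by the text and keyword types.
text: use text when a field needs full-text search, such as an email body or a product description. The content of a text field is analyzed: before the inverted index is built, an analyzer splits the string into individual terms. text fields are not used for sorting and are rarely used for aggregations.
keyword: suited to indexing structured content such as email addresses, hostnames, status codes, and tags. Use keyword when a field is needed for filtering (for example, finding blog posts whose status is published), sorting, or aggregations. A keyword field can only be searched by its exact value.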

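2. Integer types

Type	Value range
byte	-128 ~ 127
short	-32768 ~ 32767
integer	-2^31 ~ 2^31-1
long	-2^63 ~ 2^63-1

When it meets the requirement, choose the type with the smallest range. For example, if a field's maximum value never exceeds 100, byte is enough. For an age field, the longest verified human lifespan is 122 years, so byte (maximum 127) already covers it and short leaves ample headroom. Picking the smallest sufficient type helps indexing and search efficiency.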
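
The selection rule above can be sketched in Python. This is only an illustration: the helper name `smallest_integer_type` is hypothetical and not part of any Elasticsearch client; it simply walks the four integer types from narrowest to widest.

```python
# Signed integer types from narrowest to widest, with their inclusive bounds.
INTEGER_TYPES = [
    ("byte", -2**7, 2**7 - 1),
    ("short", -2**15, 2**15 - 1),
    ("integer", -2**31, 2**31 - 1),
    ("long", -2**63, 2**63 - 1),
]


def smallest_integer_type(lowest: int, highest: int) -> str:
    """Return the narrowest type whose range covers [lowest, highest]."""
    for name, type_min, type_max in INTEGER_TYPES:
        if type_min <= lowest and highest <= type_max:
            return name
    raise ValueError("range does not fit in a signed 64-bit integer")


if __name__ == "__main__":
    print(smallest_integer_type(0, 100))    # a value capped at 100
    print(smallest_integer_type(0, 2**31))  # one past the integer maximum
```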

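3. Floating-point types

Type	Value range
double	64-bit double-precision IEEE 754 floating point
float	32-bit single-precision IEEE 754 floating point
half_float	16-bit half-precision IEEE 754 floating point
scaled_float	Floating point backed by a long, scaled by a fixed scaling factor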

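The double, float, and half_float types treat -0.0 and +0.0 as different values. A term query for -0.0 does not match +0.0 and vice versa; likewise, a range query whose upper bound is -0.0 does not match +0.0, and one whose lower bound is +0.0 does not match -0.0.

With scaled_float, a price that only needs to be accurate to the cent can use a scaling factor of 100: a price of 57.34 is stored as 5734.
When a fixed precision is acceptable, prefer scaled_float with a scaling factor.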
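
The scaling arithmetic can be shown with a short Python sketch. The function names `to_scaled` and `from_scaled` are hypothetical, and the rounding shown here (multiply, then round to the nearest integer) is an assumption about the general idea rather than a copy of Elasticsearch's internal code.

```python
def to_scaled(value: float, scaling_factor: int) -> int:
    """Encode a decimal value as the integer a scaled representation would keep."""
    return round(value * scaling_factor)


def from_scaled(stored: int, scaling_factor: int) -> float:
    """Decode the stored integer back to a floating-point value."""
    return stored / scaling_factor


if __name__ == "__main__":
    stored = to_scaled(57.34, 100)
    print(stored, from_scaled(stored, 100))
```

Precision beyond the scaling factor is lost: with a factor of 100, 57.341 and 57.34 encode to the same integer.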

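4. date type
A date can be represented in any of the following forms:
(1) A formatted date string, such as "2018-01-13" or "2018-01-13 12:10:30"
(2) A long number of milliseconds since the epoch (the epoch is 1970-01-01 00:00:00 UTC, the start of Unix time)
(3) An integer number of seconds since the epoch
5. boolean type: true and false
6. binary type
A binary field holds binary data, such as an image, as a Base64-encoded string. By default the value is not indexed, so it cannot be searched. In old versions the binary type supported only the index_name property.
7. array type
(1) String array: [ "one", "two" ]
(2) Integer array: "productid": [ 1, 2 ]
(3) Array of objects (documents): "user": [ { "name": "Mary", "age": 12 }, { "name": "John", "age": 10 } ]
Note: Elasticsearch does not support arrays that mix data types, such as [ 10, "some string" ]
8. object type
A JSON object; a document may contain nested inner objects.
9. ip type
An ip field stores IPv4 or IPv6 addresses.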

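2. Properties supported in mappings

1. enabled: when set to false, the field is kept only in _source and is not available for search or aggregations
```
"enabled": true (default) | false
```
2. index: whether to build an inverted index for the field; when set to false, the field is not indexed and cannot be searched
```
"index": true (default) | false
```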

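3. index_options: which information is stored in the inverted index

Four options:
      docs: document IDs only
      freqs: document IDs + term frequencies
      positions: document IDs + term frequencies + positions, typically used for proximity and phrase queries
      offsets: document IDs + term frequencies + positions + character offsets, typically used for highlighting
  Analyzed fields default to positions; all other fields default to docs.
  
  "index_options": "docs"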

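4. norms: whether to store the normalization factors used for scoring; can be disabled if the field is only used for filtering and aggregations
Enabled by default for analyzed fields and disabled for non-analyzed fields. Norms store the length factor and the index-time boost. Keep them for fields that take part in scoring; they add overhead for every document. Since 5.0 the setting is a plain boolean; earlier versions used an object form with enabled and loading keys.
```
"norms": true | false
```
5. doc_values: whether to enable doc values, which are used for sorting and aggregations
Enabled by default for non-analyzed fields and not available for analyzed fields. Doc values substantially improve sorting and aggregation performance and save heap memory.
```
"doc_values": true (default) | false
```
6. fielddata: whether to enable fielddata on a text field so that it can be sorted and aggregated on
Applies to analyzed fields only. Fielddata is built in heap memory and can be expensive, so it is disabled by default since 5.0. For non-analyzed fields, use doc_values instead.
```
"fielddata": false (default) | true
```
7. store: whether to store the field value separately from the _source field. With the default false, the field can be searched, but its value can only be retrieved through _source
```
"store": false (default) | true
```
8. coerce: whether to enable automatic type conversion, for example string to number or floating point to integer
```
"coerce": true (default) | false
```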
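9. multifields: index one field in several ways to serve different business needs
11. dynamic: controls whether the mapping is updated automatically when new fields appear
```
"dynamic": true (default) | false | strict
```

12. date_detection: whether to detect date fields automatically
```
"date_detection": true (default) | false
```

For details on dynamic and date_detection, see: Elasticsearch dynamic mapping strategies.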

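13. analyzer: the analyzer to use; the default is the standard analyzer (the ik analyzer in the example comes from the IK analysis plugin and is not built in)
```
"analyzer": "ik"
```
14. boost: field-level score boost; the default is 1.0
```
"boost": 1.23
```
15. fields: index the same field value in several ways, for example once analyzed and once not analyzed
```
"fields": {"raw": {"type": "keyword"}}
```
16. ignore_above: strings longer than this limit (100 characters here) are ignored and not indexed
```
"ignore_above": 100
```
17. include_in_all: whether the field is included in the _all field; the default is true unless indexing is disabled for the field. The _all field is deprecated as of 6.0
```
"include_in_all": true
```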
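18. null_value: a replacement value indexed in place of an explicit null so that the field can still be searched; it must be of the same type as the field
```
"null_value": "NULL"
```
19. position_increment_gap: affects proximity and phrase queries on multi-valued or analyzed fields by inserting a gap between the positions of successive values; queries can specify a slop; the default is 100
```
"position_increment_gap": 0
```
20. search_analyzer: the analyzer used at search time; by default it is the same as analyzer. For example, index with standard + ngram and search with standard to implement autocomplete
```
"search_analyzer": "ik"
```
21. similarity: the scoring algorithm for the field; applies only to string (analyzed) fields. The default is BM25 since 5.0; before that it was TF/IDF
```
"similarity": "BM25"
```
22. term_vector: term vectors are not stored by default. Supported values: no, yes (terms), with_positions (terms + positions), with_offsets (terms + offsets), with_positions_offsets (terms + positions + offsets). Term vectors speed up the fast vector highlighter but increase index size, so they are a poor fit for large data volumes
```
"term_vector": "no"
```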

3. Mapping field configuration flow

[Figure: mapping field configuration flowchart; image not available]

----------------------------

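A note up front: every field in Elasticsearch must map to exactly one data type.
All examples in this article were run on Elasticsearch 6.6.0. APIs may differ or be unsupported in other versions, so keep that in mind.

1 Core data types

1.1 String type - string (no longer supported)

(1) Example:

PUT website
{
    "mappings": {
        "blog": {
            "properties": {
                "title": {"type": "string"},    // full text
                "tags": {"type": "string", "index": "not_analyzed"} // keyword, not analyzed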
            }
        }
    }
}

(2) Response in ES 5.6.10:

#! Deprecation: The [string] field is deprecated, please use [text] or [keyword] instead on [tags]
#! Deprecation: The [string] field is deprecated, please use [text] or [keyword] instead on [title]
{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "website"
}

(3) Response in ES 6.6.0:

{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "No handler for type [string] declared on field [title]"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "Failed to parse mapping [blog]: No handler for type [string] declared on field [title]",
    "caused_by": {
      "type": "mapper_parsing_exception",
      "reason": "No handler for type [string] declared on field [title]"
    }
  },
  "status": 400
}

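So the string field type has been removed; use text or keyword instead.

1.1.1 Text type - text

Starting with Elasticsearch 5.0, text replaces the analyzed string.

When a field is used for full-text search (and is therefore analyzed), such as a product name or product description, use the text type.

The content of a text field is analyzed. Whether the field is indexed, and therefore searchable, is controlled with "index": "true|false".
text fields cannot be used for sorting by default and are rarely used for aggregations.

Example: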

PUT website
{
    "mappings": {
        "blog": {
            "properties": {
                "summary": {"type": "text", "index": "true"}
            }
        }
    }
}

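1.1.2 Keyword type - keyword

Starting with Elasticsearch 5.0, keyword replaces the non-analyzed string.

When a field is used for filtering, sorting, or aggregating on exact values, use the keyword type.

The content of a keyword field is not analyzed. Whether the field is indexed, and therefore searchable, is controlled with "index": "true|false".

Example: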

PUT website
{
    "mappings": {
        "blog": {
            "properties": {
                "tags": {"type": "keyword", "index": "true"}
            }
        }
    }
}

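1.2 Numeric types - 8 kinds

The numeric types are:

Type	Description
byte	Signed 8-bit integer, range: [-128 ~ 127]
short	Signed 16-bit integer, range: [-32768 ~ 32767]
integer	Signed 32-bit integer, range: [$-2^{31}$ ~ $2^{31}$-1]
long	Signed 64-bit integer, range: [$-2^{63}$ ~ $2^{63}$-1]
float	32-bit single-precision IEEE 754 floating point
double	64-bit double-precision IEEE 754 floating point
half_float	16-bit half-precision IEEE 754 floating point
scaled_float	Floating point stored as a long using a scaling factor; for example, a price field accurate to the cent with a scaling factor of 100 stores 57.34 as 5734

Usage notes:

Choose the smallest type that fits; the smallest sufficient type makes indexing and searching more efficient;
Prefer the floating-point type with a scaling factor (scaled_float) when a fixed precision is acceptable.

Example:

PUT shop
{
    "mappings": {
        "book": {
            "properties": {
                "name": {"type": "text"},
                "quantity": {"type": "integer"},  // integer type
                "price": {
                    "type": "scaled_float",       // scaled_float type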
                    "scaling_factor": 100
                }
            }
        }
    }
}

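1.3 Date type - date

JSON has no date data type, so in ES a date can be:

A string containing a formatted date, such as "2018-10-01" or "2018/10/01 12:10:30".
A long number representing milliseconds since the epoch.
An integer representing seconds since the epoch.

Internally, a date is converted to UTC (if a time zone is given) and stored as a long number of milliseconds since the epoch.
A custom date format can be specified; if none is given, the default is used: strict_date_optional_time||epoch_millis

(1) Example with the default date format:

// create the mapping
PUT website
{
    "mappings": {
        "blog": {
            "properties": {
                "pub_date": {"type": "date"}   // date type
            }
        }
    }
}

// index documents
PUT website/blog/11
{ "pub_date": "2018-10-10" }

PUT website/blog/12
{ "pub_date": "2018-10-10T12:00:00Z" }  // ISO 8601 form, also the default date format in Solr

PUT website/blog/13
{ "pub_date": "1589584930103" }         // milliseconds since the epoch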

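(2) Multiple date formats:

Multiple formats are separated with a double pipe ||. Each format is tried in turn until one matches.
The first format is also used to convert the milliseconds-since-the-epoch value back to a string.

Example:

// create the mapping
PUT website
{
    "mappings": {
        "blog": {
            "properties": {
                "date": {
                    "type": "date",  // accepts any of the formats below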
                    "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
                }
            }
        }
    }
}
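
The "try each format in turn" behaviour can be sketched in Python. This is a rough illustration only: `FORMATS` uses `strptime` patterns that I assume correspond to `yyyy-MM-dd HH:mm:ss` and `yyyy-MM-dd`, the function name `to_epoch_millis` is hypothetical, and all dates are treated as UTC.

```python
from datetime import datetime, timezone

# strptime equivalents of "yyyy-MM-dd HH:mm:ss" and "yyyy-MM-dd" (assumed mapping).
FORMATS = ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d"]


def to_epoch_millis(value) -> int:
    """Try each date format in order, then fall back to epoch milliseconds."""
    if isinstance(value, int):
        return value
    text = str(value)
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # this format did not match, try the next one
        return int(parsed.replace(tzinfo=timezone.utc).timestamp() * 1000)
    # Last alternative: a numeric string of milliseconds since the epoch.
    return int(text)


if __name__ == "__main__":
    for sample in ["2018-10-10 12:10:30", "2018-10-10", "1589584930103"]:
        print(sample, "->", to_epoch_millis(sample))
```

A value that matches none of the alternatives raises `ValueError`, which mirrors the idea that a document with an unparseable date is rejected.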

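1.4 Boolean type - boolean

Accepts JSON true and false as well as the strings "true" and "false". Versions before 6.0 also accepted other strings and numbers that represent true or false:

True values: true, "true", "on", "yes", "1"...
False values: false, "false", "off", "no", "0", "" (empty string), 0.0, 0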

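1.5 Binary type - binary

The binary type accepts a binary value as a Base64-encoded string. The field is not stored by default and cannot be searched.

Example:

// create the mapping
PUT website
{
    "mappings": {
        "blog": {
            "properties": {
                "blob": {"type": "binary"}   // binary
            }
        }
    }
}
// index a document
PUT website/blog/1
{
    "title": "Some binary blog",
    "blob": "hED903KSrA084fRiD5JLgY=="
}

Note: the Base64-encoded binary value must not contain embedded newlines \n.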

1.6 Range types - range

The following range types are supported:

Type	Range
integer_range	$-2^{31}$ ~ $2^{31}-1$
long_range	$-2^{63}$ ~ $2^{63}-1$
float_range	32-bit single-precision floating point
double_range	64-bit double-precision floating point
date_range	Dates represented as 64-bit integer milliseconds since the epoch
ip_range	A range of IP values, supporting IPv4, IPv6, or a mix of both

(1) Create the mapping:

PUT company
{
    "mappings": {
        "department": {
            "properties": {
                "expected_number": {  // expected headcount
                    "type": "integer_range"
                },
                "time_frame": {       // development timeline
                    "type": "date_range", 
                    "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
                },
                "ip_whitelist": {     // IP whitelist
                    "type": "ip_range"
                }
            }
        }
    }
}

(2) Index a document:

PUT company/department/1
{
    "expected_number" : {
        "gte" : 10,
        "lte" : 20
    },
    "time_frame" : { 
        "gte" : "2018-10-01 12:00:00", 
        "lte" : "2018-11-01"
    }, 
    "ip_whitelist": "192.168.0.0/16"
}

(3) Query:

GET company/department/_search
{
    "query": {
        "term": {
            "expected_number": {
                "value": 12
            }
        }
    }
}
GET company/department/_search
{
    "query": {
        "range": {
            "time_frame": {
                "gte": "2018-08-01",
                "lte": "2018-12-01",
                "relation": "within" 
            }
        }
    }
}

Query result:

{
  "took": 26,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.0,
    "hits": [
      {
        "_index": "company",
        "_type": "department",
        "_id": "1",
        "_score": 1.0,
        "_source": {
          "expected_number": {
            "gte": 10,
            "lte": 20
          },
          "time_frame": {
            "gte": "2018-10-01 12:00:00",
            "lte": "2018-11-01"
          },
          "ip_whitelist" : "192.168.0.0/16"
        }
      }
    ]
  }
}
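
The relation parameter in the range query above decides how the query interval is compared with the stored interval. Below is a minimal Python sketch of the three relations (intersects, within, contains) on closed numeric intervals. The function name `range_matches` is hypothetical, and this only illustrates the comparison logic; it is not how Elasticsearch implements it.

```python
def range_matches(field, query, relation="intersects") -> bool:
    """Compare a stored closed interval with a query closed interval.

    field and query are (low, high) tuples with inclusive bounds.
    """
    field_low, field_high = field
    query_low, query_high = query
    if relation == "within":
        # The stored range lies entirely inside the query range.
        return query_low <= field_low and field_high <= query_high
    if relation == "contains":
        # The stored range entirely covers the query range.
        return field_low <= query_low and query_high <= field_high
    if relation == "intersects":
        # The two ranges share at least one point.
        return field_low <= query_high and query_low <= field_high
    raise ValueError(f"unknown relation: {relation}")


if __name__ == "__main__":
    expected_number = (10, 20)  # the stored integer_range from the example
    print(range_matches(expected_number, (12, 12), "contains"))
    print(range_matches(expected_number, (5, 25), "within"))
    print(range_matches(expected_number, (15, 25), "within"))
```

A term query for a single value, such as 12 against expected_number, behaves like a contains check with a one-point interval.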

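2 Complex data types

2.1 Array type - array

ES has no dedicated array type; any field can hold multiple values, written with [].

All values in an array must be of the same data type; arrays of mixed types are not supported:

① String array: ["one", "two"];
② Integer array: [1, 2];
③ Array of arrays: [1, [2, 3]], equivalent to [1, 2, 3];
④ Array of objects: [{"name": "Tom", "age": 20}, {"name": "Jerry", "age": 18}].

Notes:

When a field is added dynamically, the type of the first value in the array determines the type of the field;

Mixed-type arrays such as [1, "abc"] are not supported;

An array may contain null values; an empty array [] is treated as a missing field, that is, a field with no values.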
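
The flattening of nested arrays and the same-type rule can be illustrated with a small Python sketch. Both helper names are hypothetical, and the type check is a simplification: it compares Python types directly and ignores the coercion Elasticsearch may apply (for example between integers and floats).

```python
def flatten(values: list) -> list:
    """Flatten nested lists, so [1, [2, 3]] becomes [1, 2, 3]."""
    flat = []
    for item in values:
        if isinstance(item, list):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


def is_homogeneous(values: list) -> bool:
    """Return True when all non-null values share one Python type."""
    kinds = {type(v) for v in flatten(values) if v is not None}
    return len(kinds) <= 1


if __name__ == "__main__":
    print(flatten([1, [2, 3]]))
    print(is_homogeneous([1, "abc"]))
    print(is_homogeneous([1, None, 2]))
```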

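2.2 Object type - object

JSON documents are hierarchical: a document can contain inner objects, and inner objects can contain inner objects of their own.

(1) Index example: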

PUT employee/developer/1
{
    "name": "ma_shoufeng",
    "address": {
        "region": "China",
        "location": {"province": "GuangDong", "city": "GuangZhou"}
    }
}

(2) How it is stored:

{
    "name":                       "ma_shoufeng",
    "address.region":             "China",
    "address.location.province":  "GuangDong", 
    "address.location.city":      "GuangZhou"
}
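
The dotted-path layout shown above can be reproduced with a short recursive function. This is a sketch of the idea only; `flatten_object` is a hypothetical helper, not an Elasticsearch API.

```python
def flatten_object(doc: dict, prefix: str = "") -> dict:
    """Flatten inner objects into a single level of dotted field paths."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into the inner object, extending the path prefix.
            flat.update(flatten_object(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


if __name__ == "__main__":
    employee = {
        "name": "ma_shoufeng",
        "address": {
            "region": "China",
            "location": {"province": "GuangDong", "city": "GuangZhou"},
        },
    }
    for field, value in flatten_object(employee).items():
        print(field, "=", value)
```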

(3) The mapping for this document looks roughly like this:

PUT employee
{
    "mappings": {
        "developer": {
            "properties": {
                "name": { "type": "text", "index": "true" }, 
                "address": {
                    "properties": {
                        "region": { "type": "keyword", "index": "true" },
                        "location": {
                            "properties": {
                                "province": { "type": "keyword", "index": "true" },
                                "city": { "type": "keyword", "index": "true" }
                            }
                        }
                    }
                }
            }
        }
    }
}

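2.3 Nested type - nested

The nested type is a specialized version of the object type that allows the objects in an array to be indexed and queried independently of each other.

2.3.1 How arrays of objects are stored

① Index a document: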

PUT game_of_thrones/role/1
{
    "group": "stark",
    "performer": [
        {"first": "John", "last": "Snow"},
        {"first": "Sansa", "last": "Stark"}
    ]
}

② Internal storage structure:

{
    "group":             "stark",
    "performer.first": [ "john", "sansa" ],
    "performer.last":  [ "snow", "stark" ]
}

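③ Analysis:

As shown, performer.first and performer.last are flattened into multi-value fields, so the association between John and Snow is lost.

A query can therefore match John Stark, a combination that does not exist in the data.

2.3.2 Using nested to overcome the limits of object

If you need to index an array of objects while keeping each object in the array independent, use the nested data type.

Internally, each nested object is split out and indexed as a separate hidden document.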
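
The difference between the two storage models can be simulated in a few lines of Python. This is a conceptual sketch with hypothetical function names: `match_flattened` mimics the flattened multi-value fields of the object type, and `match_nested` mimics per-object matching as the nested type provides.

```python
performers = [
    {"first": "John", "last": "Snow"},
    {"first": "Sansa", "last": "Stark"},
]


def match_flattened(objects: list, first: str, last: str) -> bool:
    """Object-type behaviour: each field is one bag of values across all objects."""
    firsts = {o["first"].lower() for o in objects}
    lasts = {o["last"].lower() for o in objects}
    return first.lower() in firsts and last.lower() in lasts


def match_nested(objects: list, first: str, last: str) -> bool:
    """Nested-type behaviour: both conditions must hold within the same object."""
    return any(
        o["first"].lower() == first.lower() and o["last"].lower() == last.lower()
        for o in objects
    )


if __name__ == "__main__":
    print(match_flattened(performers, "John", "Stark"))  # false positive
    print(match_nested(performers, "John", "Stark"))
    print(match_nested(performers, "John", "Snow"))
```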

① Create the mapping:

PUT game_of_thrones
{
    "mappings": {
        "role": {
            "properties": {
                "performer": {"type": "nested" }
            }
        }
    }
}

② Index a document:

PUT game_of_thrones/role/1
{
    "group" : "stark",
    "performer" : [
        {"first": "John", "last": "Snow"},
        {"first": "Sansa", "last": "Stark"}
    ]
}

③ Search:

GET game_of_thrones/_search
{
    "query": {
        "nested": {
            "path": "performer",
            "query": {
                "bool": {
                    "must": [
                        { "match": { "performer.first": "John" }},
                        { "match": { "performer.last":  "Snow" }} 
                    ]
                }
            }, 
            "inner_hits": {
                "highlight": {
                    "fields": {"performer.first": {}}
                }
            }
        }
    }
}

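3 Geo data types

3.1 Geo-point type - geo_point

A geo-point field stores a latitude-longitude pair and can be used to:

Find geo-points within a given area;

Aggregate documents by location or by distance from a central point;

Integrate distance into a document's relevance score;

Sort documents by distance.

(1) Create the mapping: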

PUT employee
{
    "mappings": {
        "developer": {
            "properties": {
                "location": {"type": "geo_point"}
            }
        }
    }
}

(2) Index geo-points:

// Option 1: object with latitude and longitude keys
PUT employee/developer/1
{
    "text": "Canton Tower - geo-point as an object", 
    "location": {
        "lat": 23.11, "lon": 113.33     // lat: latitude, lon: longitude
    }
}

// Option 2: "lat, lon" string
PUT employee/developer/2
{
  "text": "Canton Tower - geo-point as a string",
  "location": "23.11, 113.33"           // latitude, longitude
}

// Option 3: [lon, lat] array
PUT employee/developer/3
{
  "text": "Canton Tower - geo-point as an array",
  "location": [ 113.33, 23.11 ]         // longitude, latitude
}

(3) Query example:

GET employee/_search
{
    "query": { 
        "geo_bounding_box": { 
            "location": {
                "top_left": { "lat": 24, "lon": 113 },      // top-left corner of the bounding box
                "bottom_right": { "lat": 22, "lon": 114 }   // bottom-right corner of the bounding box
            }
        }
    }
}
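
The three input forms and the bounding-box test can be sketched in Python. The helper names are hypothetical, and the box check is a simplification that assumes the box does not cross the antimeridian (180 degrees longitude). Note the ordering trap: the string form is "lat, lon" while the array form is [lon, lat].

```python
def parse_geo_point(value) -> dict:
    """Normalize the object, string, and array forms to {"lat": ..., "lon": ...}."""
    if isinstance(value, dict):
        return {"lat": float(value["lat"]), "lon": float(value["lon"])}
    if isinstance(value, str):
        lat, lon = (float(part) for part in value.split(","))  # "lat, lon"
        return {"lat": lat, "lon": lon}
    lon, lat = value  # array form is [lon, lat]
    return {"lat": float(lat), "lon": float(lon)}


def in_bounding_box(point: dict, top_left: dict, bottom_right: dict) -> bool:
    """Check whether a point lies inside a box given by two corners."""
    lat_ok = bottom_right["lat"] <= point["lat"] <= top_left["lat"]
    lon_ok = top_left["lon"] <= point["lon"] <= bottom_right["lon"]
    return lat_ok and lon_ok


if __name__ == "__main__":
    box = ({"lat": 24, "lon": 113}, {"lat": 22, "lon": 114})
    for raw in [{"lat": 23.11, "lon": 113.33}, "23.11, 113.33", [113.33, 23.11]]:
        print(in_bounding_box(parse_geo_point(raw), *box))
```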

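3.2 Geo-shape type - geo_shape

Used for complex shapes such as polygons. It is used less often and is omitted here.

See this article for reference: Elasticsearch geolocation summary

4 Specialized data types

4.1 IP type

An ip field stores IPv4 or IPv6 addresses. Before 5.0 the type was IPv4 only and was essentially a long field.

(1) Create the mapping: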

PUT employee
{
    "mappings": {
        "customer": {
            "properties": {
                "ip_addr": { "type": "ip" }
            }
        }
    }
}

(2) Index a document:

PUT employee/customer/1
{ "ip_addr": "192.168.1.1" }

(3) Query:

GET employee/customer/_search
{
    "query": {
        "term": { "ip_addr": "192.168.0.0/16" }
    }
}
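
The term query above matches because 192.168.1.1 falls inside the 192.168.0.0/16 block. The same membership test can be checked locally with Python's standard `ipaddress` module; the wrapper name `ip_in_cidr` is hypothetical.

```python
import ipaddress


def ip_in_cidr(address: str, cidr: str) -> bool:
    """Return True when the address belongs to the CIDR block."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(cidr)


if __name__ == "__main__":
    print(ip_in_cidr("192.168.1.1", "192.168.0.0/16"))
    print(ip_in_cidr("10.0.0.1", "192.168.0.0/16"))
```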

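4.2 Token count type - token_count

The token_count type counts the number of tokens in a string.

It is essentially an integer field that accepts a string value, analyzes it, and then indexes the number of tokens in the string.

(1) Create the mapping: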

PUT employee
{
    "mappings": {
        "customer": {
            "properties": {
                "name": { 
                    "type": "text",
                    "fields": {
                        "length": {
                            "type": "token_count", 
                            "analyzer": "standard"
                        }
                    }
                }
            }
        }
    }
}

(2) Index documents:

PUT employee/customer/1
{ "name": "John Snow" }
PUT employee/customer/2
{ "name": "Tyrion Lannister" }

(3) Query:

GET employee/customer/_search
{
    "query": {
        "term": { "name.length": 2 }
    }
}
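
Both sample names contain two tokens, so the query above returns both documents. A very rough Python approximation of the count follows. It assumes that splitting on runs of word characters is close enough to the standard analyzer for simple names like these; real analyzers follow Unicode text segmentation rules and can differ. The function name `token_count` is hypothetical.

```python
import re


def token_count(text: str) -> int:
    """Count runs of word characters as a stand-in for analyzer tokens."""
    return len(re.findall(r"\w+", text))


if __name__ == "__main__":
    for name in ["John Snow", "Tyrion Lannister"]:
        print(name, "->", token_count(name))
```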


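Author: yongfutian. Link: https://www.jianshu.com/p/01f489c46c38. Source: Jianshu. Copyright belongs to the author; for any form of reproduction, contact the author for authorization and credit the source.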
