Elasticsearch: Creating an Index and Searching Data

Date: 2019-12-05

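Since ES 6.0, having multiple types under one index is no longer recommended: an index created in 6.0 cannot contain more than one type. Types are deprecated in the 7.x APIs and removed entirely in 8.0. The field type conflicts and reduced search efficiency that types cause are the reasons for their removal. (5.x to 6.x)
The _all field has also been deprecated; use copy_to to define a custom combined field. (5.x to 6.x)
type: text/keyword decides whether a field is analyzed, and index: true/false decides whether it is indexed. (2.x to 5.x)
analyzer sets the analyzer for a field separately. (2.x to 5.x)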

Creating the index

First install the ik analysis plugin, then restart the service.

# Install with elasticsearch-plugin
elasticsearch-plugin install \
https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.6.2/elasticsearch-analysis-ik-6.6.2.zip

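Field datatype reference:
https://www.elastic.co/guide/...

Reference for other mapping parameters (different field types may have their own specific attributes):
https://www.elastic.co/guide/...

We create a new index named news:

Set the default analyzer to the ik analyzer to handle Chinese text
Use the default name _doc as the type
Deliberately disable _source storage (to verify the store option)
title is not stored, author is not analyzed, content is stored

For the meaning of the _source field, see this blog post: https://blog.csdn.net/napoay/...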

PUT /news
{
    "settings": {
        "number_of_shards": 5,
        "number_of_replicas": 1,
        "index": {
            "analysis.analyzer.default.type" : "ik_smart"
        }
    },
    "mappings": {
        "_doc": {
            "_source": {
                "enabled": false
            },
            "properties": {
                "news_id": {
                    "type": "integer",
                    "index": true
                },
                "title": {
                    "type": "text",
                    "store": false
                },
                "author": {
                    "type": "keyword"
                },
                "content": {
                    "type": "text",
                    "store": true
                },
                "created_at": {
                    "type": "date",
                    "format": "yyyy-MM-dd HH:mm:ss"
                }
            }
        }
    }
}
# View the created mapping
GET /news/_mapping
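
One detail in the created_at format deserves care: in Java/Joda-style date patterns, HH is the 24-hour hour of day while hh is the 12-hour clock hour, so timestamps such as 12:00:00 or 13:05:00 need HH to be read as intended. The sketch below uses Python's strptime purely as an analogy (%H and %I are Python directives, not Elasticsearch pattern letters, and the helper names are made up) to show how a 12-hour directive misreads 24-hour input.

```python
from datetime import datetime

def parse_24h(text):
    # %H is the 24-hour hour (00-23), the analogue of HH in the mapping format
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")

def parse_12h(text):
    # %I is the 12-hour clock hour (01-12), the analogue of hh
    return datetime.strptime(text, "%Y-%m-%d %I:%M:%S")

noon = "2020-12-05 12:00:00"
print(parse_24h(noon).hour)  # 12
print(parse_12h(noon).hour)  # 0: with no AM/PM marker, 12 is read as midnight

try:
    parse_12h("2019-03-26 13:05:00")
    afternoon_ok = True
except ValueError:
    # 13 is not a valid 12-hour clock value, so parsing fails
    afternoon_ok = False
print(afternoon_ok)  # False
```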

Verify that the analyzer works

# Verify that the analysis plugin works
GET /_analyze
{
    "analyzer": "ik_smart",
    "text": "我熱愛祖國"
}
GET /_analyze
{
    "analyzer": "ik_max_word",
    "text": "我熱愛祖國"
}

# The index's default analyzer
GET /news/_analyze
{
    "text": "我熱愛祖國！"
}

# Specify a field: the text is analyzed according to that field's mapping
# author is a keyword field, so it is not analyzed
GET /news/_analyze
{
    "field": "author",
    "text": "我熱愛祖國！"
}
# Analysis result for title
GET /news/_analyze
{
    "field": "title",
    "text": "我熱愛祖國！"
}

Adding documents

For demonstration; the queries below use these documents as examples.

POST /news/_doc
{
    "news_id": 1,
    "title": "咱們一塊兒學旺叫",
    "author": "才華橫溢王大貓",
    "content": "咱們一塊兒學旺叫，一塊兒旺旺旺旺旺，在你面撒個嬌，哎呦旺旺旺旺旺，個人尾巴可勁兒搖",
    "created_at": "2019-03-26 11:55:20"
}
POST /news/_doc
{
    "news_id": 2,
    "title": "咱們一塊兒學貓叫",
    "author": "王大貓不會被分詞",
    "content": "咱們一塊兒學貓叫，仍是旺旺旺旺旺，在你面撒個嬌，哎呦旺旺旺旺旺，個人尾巴可勁兒搖",
    "created_at": "2019-03-26 11:55:20"
}
POST /news/_doc
{
    "news_id": 3,
    "title": "實在編不出來了",
    "author": "王大貓",
    "content": "實在編不出來了，隨便寫點數據作測試吧，旺旺旺",
    "created_at": "2019-03-26 11:55:20"
}

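Searching data

GET /news/_doc/_search is the endpoint for searching documents of type _doc under news. We demonstrate with the REST API plus the query DSL.

match_all

Fetches all data with no search condition.

# Unconditional paged search, sorted by news_id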
GET /news/_doc/_search
{
    "query": {
        "match_all": {}
    },
    "from": 0,
    "size": 2,
    "sort": {
        "news_id": "desc"
    }
}
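
To make from, size and sort concrete, here is a minimal Python sketch of what the request above asks for, run against an in-memory list instead of Elasticsearch. The helper name match_all_page and the sample list are assumptions made for illustration.

```python
def match_all_page(docs, sort_field, order="asc", from_=0, size=10):
    """Sort every document by one field, then return the requested page."""
    ordered = sorted(docs, key=lambda d: d[sort_field], reverse=(order == "desc"))
    return ordered[from_:from_ + size]

docs = [{"news_id": 1}, {"news_id": 2}, {"news_id": 3}]

# Same parameters as the request: from 0, size 2, news_id descending
page = match_all_page(docs, "news_id", order="desc", from_=0, size=2)
print([d["news_id"] for d in page])  # [3, 2]

# The next page starts at from = 2
next_page = match_all_page(docs, "news_id", order="desc", from_=2, size=2)
print([d["news_id"] for d in next_page])  # [1]
```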

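Because we turned off the _source field, ES only builds the inverted index for the data and does not store the original document, so the results contain no original document content. The main reason for turning it off is to demonstrate the highlight mechanism.

match

A plain search. Many articles say that a match query analyzes the query text, but that is not entirely accurate: a match query also depends on the type of the field being searched. If the field is a keyword (not_analyzed) field, which is never analyzed, then match is equivalent to a term query.

We can use the analyzer to see how text for a given field will be processed:

GET /news/_analyze
{
    "field": "title",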
    "text": "我會被如何處理呢？分詞？不分詞？"
}

Query

GET /news/_doc/_search
{
    "query": {
        "match": {
            "title": "咱們會被分詞"
        }
    },
    "highlight": {
        "fields": {
            "title": {}
        }
    }
}

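With highlight, the matched keywords can be returned in highlighted form together with their surrounding context. If _source is disabled, the field's store attribute must be enabled so that the field's original data is stored; only then can highlighting work. Otherwise there is no original content, and so no way to highlight the keywords.

multi_match

Searches multiple fields. For example, to find documents whose title or content contains the keyword 咱們, do the following: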

GET /news/_doc/_search
{
    "query": {
        "multi_match": {
            "query": "咱們是好人",
            "fields": ["title", "content"]
        }
    },
    "highlight": {
        "fields": {
            "title": {},
            "content": {}
        }
    }
}

match_phrase

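This one needs careful thought. match_phrase is a phrase query. What is a phrase query? Simply put, the queried document field must contain all the keywords produced by analyzing the query text, and the positional distance between those keywords in the document must satisfy the threshold set by slop. slop is the number of times keywords may be shifted to match their distribution in the document; if slop is large enough, then even keywords scattered far apart in the document can be matched by shifting.

content: i love china
match_phrase: i china
slop: 0 // no match: china has to be shifted by 1 position, turning i china into i - china, before it matches
slop: 1 // match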
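
The shifting idea can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Lucene's actual implementation: each query term's shift is its position in the document minus its position in the query, and the slop needed is the spread between the largest and the smallest shift. The function name min_slop is made up for this example, and the model ignores repeated query terms.

```python
from itertools import product

def min_slop(doc_tokens, query_tokens):
    """Return the smallest slop under which the phrase matches, or None."""
    positions = []
    for term in query_tokens:
        found = [i for i, token in enumerate(doc_tokens) if token == term]
        if not found:
            return None  # every query term must occur in the document
        positions.append(found)

    best = None
    # Try every way of picking one document position per query term
    for combo in product(*positions):
        shifts = [doc_pos - query_pos for query_pos, doc_pos in enumerate(combo)]
        needed = max(shifts) - min(shifts)
        if best is None or needed < best:
            best = needed
    return best

doc = "i love china".split()
print(min_slop(doc, ["i", "china"]))     # 1: china must move one position
print(min_slop(doc, ["i", "love"]))      # 0: already adjacent and in order
print(min_slop(doc, ["china", "love"]))  # 2: swapping neighbours costs two moves
print(min_slop(doc, ["i", "japan"]))     # None: a term is missing
```

A document matches a match_phrase query in this model when min_slop is not None and is less than or equal to the configured slop.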

Test example

# First see how the query text is analyzed into tokens
GET /news/_analyze
{
    "field": "title",
    "text": "咱們學"
}
# response
{
    "tokens": [
        {
            "token": "咱們",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "學",
            "start_offset": 2,
            "end_offset": 3,
            "type": "CN_CHAR",
            "position": 1
        }
    ]
}

# Then see how a document's title is analyzed for the inverted index
GET /news/_analyze
{
    "field": "title",
    "text": "咱們一塊兒學旺叫"
}
# response
{
    "tokens": [
        {
            "token": "咱們",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "一塊兒",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "學",
            "start_offset": 4,
            "end_offset": 5,
            "type": "CN_CHAR",
            "position": 2
        },
        ...
    ]
}

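Note the position field. The query tokens 咱們 and 學 sit at adjacent positions (0 and 1), while in the document they sit at positions 0 and 2. The document matches only when slop is at least the number of position shifts needed to line the query tokens up with their positions in the document.

The query text is analyzed into ["咱們", "學"], and in the document these two keywords are one shift away from being adjacent, so slop must be greater than or equal to 1 for this document to be found.

Using the phrase query mode: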

GET /news/_doc/_search
{
    "query": {
        "match_phrase": {
            "title": {
                "query": "咱們學",
                "slop": 1
            }
        }
    },
    "highlight": {
        "fields": {
            "title": {}
        }
    }
}

Query result:

{
            ...
            {
                "_index": "news",
                "_type": "_doc",
                "_id": "if-CuGkBddO9SrfVBoil",
                "_score": 0.37229446,
                "highlight": {
                    "title": [
                        "<em>咱們</em>一塊兒<em>學</em>貓叫"
                    ]
                }
            },
            {
                "_index": "news",
                "_type": "_doc",
                "_id": "iP-AuGkBddO9SrfVOIg3",
                "_score": 0.37229446,
                "highlight": {
                    "title": [
                        "<em>咱們</em>一塊兒<em>學</em>旺叫"
                    ]
                }
            }
            ...
}

term

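The key to understanding term is that it simply does not analyze the query condition; it searches the index with the condition as a single keyword. Whether a field was analyzed when the document was indexed is determined by the mapping. The index may contain the two terms ["咱們", "一塊兒"] but no term ["咱們一塊兒"], in which case the query finds nothing. A keyword field is not analyzed at index time, so the complete value is indexed as one term, and the query condition is not analyzed at query time either, so the match is exact.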

GET /news/_doc/_search
{
    "query": {
        "term": {
           "title": "咱們一塊兒" 
        }
    },
    "highlight": {
        "fields": {
            "title": {}
        }
    }
}
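
The difference between term and match can be reproduced with a toy inverted index in Python. Whitespace splitting stands in for a real analyzer here (an assumption; ik segments Chinese text very differently), and all helper names and sample titles are made up for the sketch.

```python
def analyze(text):
    # Stand-in analyzer: lowercase and split on whitespace
    return text.lower().split()

def build_index(docs, analyzed):
    """Map each indexed term to the set of document ids containing it."""
    index = {}
    for doc_id, value in docs.items():
        # A keyword-like field indexes the whole value as one term
        terms = analyze(value) if analyzed else [value]
        for term in terms:
            index.setdefault(term, set()).add(doc_id)
    return index

def term_query(index, value):
    # term: look the value up as-is, with no analysis
    return index.get(value, set())

def match_query(index, text):
    # match on an analyzed field: analyze the query, then OR the lookups
    hits = set()
    for term in analyze(text):
        hits |= term_query(index, term)
    return hits

titles = {1: "we learn together", 2: "nothing left"}
text_index = build_index(titles, analyzed=True)
keyword_index = build_index(titles, analyzed=False)

print(term_query(text_index, "we learn"))              # set(): no such single term
print(match_query(text_index, "we learn"))             # {1}
print(term_query(keyword_index, "we learn together"))  # {1}: exact whole value
print(term_query(keyword_index, "we"))                 # set(): partial value
```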

terms

terms takes several keywords, much like tokenizing by hand.

GET /news/_doc/_search
{
    "query": {
        "terms": {
           "title": ["咱們", "一塊兒"]
        }
    },
    "highlight": {
        "fields": {
            "title": {}
        }
    }
}

Documents containing any of the keywords ["咱們", "一塊兒"] are returned.
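
In terms of posting lists, this "any of" behaviour is a set union. A minimal sketch, assuming a hypothetical index that maps each term to the ids of the documents containing it:

```python
def terms_query(index, values):
    """Return ids of documents containing at least one of the given terms."""
    hits = set()
    for value in values:
        hits |= index.get(value, set())
    return hits

# Hypothetical postings: term -> ids of documents whose title contains it
index = {"we": {1, 2}, "together": {1}, "nothing": {3}}

print(sorted(terms_query(index, ["we", "together"])))  # [1, 2]
print(sorted(terms_query(index, ["together", "nothing"])))  # [1, 3]
print(sorted(terms_query(index, ["missing"])))  # []
```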

wildcard

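Shell-style wildcard query: ? matches exactly one character and * matches zero or more characters. It looks for terms in the inverted index that match the pattern.

Find documents that have a two-character term:

GET /news/_doc/_search
{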
   "query": {
       "wildcard": {
               "title": "??"
       }
   },
   "highlight": {
        "fields": {
            "title": {},
            "content": {}
        }
    }
}
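
As a rough illustration of how ? and * behave against individual terms, the sketch below uses Python's fnmatch.fnmatchcase. This is an analogy, not the Elasticsearch implementation: fnmatch also understands [seq] character classes, which the wildcard query does not. The term list is invented.

```python
from fnmatch import fnmatchcase

def wildcard_terms(terms, pattern):
    """Return the terms that the pattern matches in full."""
    return [term for term in terms if fnmatchcase(term, pattern)]

terms = ["a", "ab", "abc", "abcd"]

print(wildcard_terms(terms, "??"))    # ['ab']
print(wildcard_terms(terms, "a*"))    # ['a', 'ab', 'abc', 'abcd']
print(wildcard_terms(terms, "a?c*"))  # ['abc', 'abcd']
```

Note that the pattern is compared with each term as a whole, which is why "??" finds only terms of exactly two characters.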

prefix

Prefix query: looks for terms in the inverted index that start with the given prefix.

GET /news/_doc/_search
{
   "query": {
       "prefix": {
               "title": "我"
       }
   },
   "highlight": {
        "fields": {
            "title": {},
            "content": {}
        }
    }
}

regexp

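Regular expression query: looks for terms in the inverted index that match the pattern.

Find documents that have a term of 2 to 3 characters:

GET /news/_doc/_search
{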
   "query": {
       "regexp": {
               "title": ".{2,3}"
       }
   },
   "highlight": {
        "fields": {
            "title": {},
            "content": {}
        }
    }
}
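
One point worth illustrating: the regexp query has to match an entire term, not a substring of it. Python's re.fullmatch behaves the same way for a simple pattern like .{2,3}, so it is used below as a stand-in. Lucene's regular expression syntax differs from Python's in other respects, and the term list is invented.

```python
import re

def regexp_terms(terms, pattern):
    """Return the terms matched in full by the pattern."""
    compiled = re.compile(pattern)
    return [term for term in terms if compiled.fullmatch(term)]

terms = ["a", "ab", "abc", "abcd"]

print(regexp_terms(terms, ".{2,3}"))  # ['ab', 'abc']
print(regexp_terms(terms, "ab"))      # ['ab']: 'abc' and 'abcd' only contain it
```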

bool

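A bool query combines multiple queries:
must: all clauses must match
must_not: no clause may match
should: matching any one is enough; when must or filter clauses are present, should clauses become optional and only affect the score

GET /news/_doc/_search
{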
   "query": {
        "bool": {
            "must": {
                "match": {
                    "title": "絕對要有咱們"
                }
            },
            "must_not": {
                "term": {
                    "title": "絕對不能有我"
                }
            },
            "should": [
                {
                    "match": {
                        "content": "咱們"
                    }
                },
                {
                    "multi_match": {
                        "query": "知足",
                        "fields": ["title", "content"]
                    }
                },
                {
                    "match_phrase": {
                        "title": "一個便可"
                    }
                }
            ],
            "filter": {
                "range": {
                    "created_at": {
                        "lt": "2020-12-05 12:00:00",
                        "gt": "2019-01-05 12:00:00"
                    }
                }
            }
        }
   },
   "highlight": {
        "fields": {
            "title": {},
            "content": {}
        }
    }
}
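
The matching logic of these clauses can be modelled in Python with plain predicates. This is a sketch under assumptions: it ignores scoring entirely, it assumes the default minimum_should_match behaviour in query context, and the function name bool_match and the sample document are made up.

```python
def bool_match(doc, must=(), must_not=(), should=(), filter_=()):
    """Evaluate bool-style clauses, each clause being a predicate on the document."""
    if not all(clause(doc) for clause in must):
        return False
    if not all(clause(doc) for clause in filter_):
        return False
    if any(clause(doc) for clause in must_not):
        return False
    # should is required only when there is no must or filter clause
    if should and not must and not filter_:
        return any(clause(doc) for clause in should)
    return True

doc = {"title": "we learn", "year": 2019}

def has_we(d):
    return "we" in d["title"].split()

def has_cat(d):
    return "cat" in d["title"].split()

def recent(d):
    return d["year"] >= 2019

# should does not match, but must and filter are present, so the document still matches
print(bool_match(doc, must=[has_we], must_not=[has_cat], should=[has_cat], filter_=[recent]))  # True
# With should alone, at least one should clause has to match
print(bool_match(doc, should=[has_cat]))  # False
# must_not vetoes the document even though must matches
print(bool_match(doc, must=[has_we], must_not=[has_we]))  # False
```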

filter

filter is usually combined with queries such as match to filter the documents that satisfy the query conditions.

GET /news/_doc/_search
{
   "query": {
        "bool": {
            "must": {
                "match_all": {}
            },
            "filter": {
                "range": {
                    "created_at": {
                        "lt": "2020-12-05 12:00:00",
                        "gt": "2017-12-05 12:00:00"
                    }
                }
            }
        }
   }
}

Or use it on its own:

GET /news/_doc/_search
{
   "query": {
       "constant_score" : {
            "filter": {
                "range": {
                    "created_at": {
                        "lt": "2020-12-05 12:00:00",
                        "gt": "2017-12-05 12:00:00"
                    }
                }
            }
       }
   }
}

Multiple filter conditions: 2017-12-05 12:00:00 < created_at < 2020-12-05 12:00:00 and news_id >= 2

GET /news/_doc/_search
{
   "query": {
       "constant_score" : {
            "filter": {
                "bool": {
                    "must": [
                        {
                            "range": {
                                "created_at": {
                                    "lt": "2020-12-05 12:00:00",
                                    "gt": "2017-12-05 12:00:00"
                                }
                            }
                        },
                        {
                            "range": {
                                "news_id": {
                                    "gte": 2
                                }
                            }
                        }
                    ]
                }
            }
       }
   }
}
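
The difference between the exclusive bounds gt/lt and the inclusive bounds gte/lte is easy to get wrong, so here is a small Python model of the two range filters above. The helper names in_range and keep are assumptions made for this sketch, and the documents are reduced to the two fields the filters look at.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def in_range(value, gt=None, gte=None, lt=None, lte=None):
    """Check a value against optional exclusive (gt, lt) and inclusive (gte, lte) bounds."""
    if gt is not None and not value > gt:
        return False
    if gte is not None and not value >= gte:
        return False
    if lt is not None and not value < lt:
        return False
    if lte is not None and not value <= lte:
        return False
    return True

def keep(doc):
    # Both range conditions must hold, like the two clauses under bool.must
    created = datetime.strptime(doc["created_at"], FMT)
    lower = datetime.strptime("2017-12-05 12:00:00", FMT)
    upper = datetime.strptime("2020-12-05 12:00:00", FMT)
    return in_range(created, gt=lower, lt=upper) and in_range(doc["news_id"], gte=2)

docs = [
    {"news_id": 1, "created_at": "2019-03-26 11:55:20"},
    {"news_id": 2, "created_at": "2019-03-26 11:55:20"},
    {"news_id": 3, "created_at": "2019-03-26 11:55:20"},
]

print([d["news_id"] for d in docs if keep(d)])  # [2, 3]
print(in_range(5, gt=5), in_range(5, gte=5))    # False True
```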