Elasticsearch Index Query Optimization and an In-Depth Look at Mapping and Analysis: Search Systems in Production

Date: 2019-11-24

Tags: elasticsearch, index, query optimization, mapping, in-depth analysis of tokenization, search systems in production. Category: log analysis


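The author of this technical column (秦凱新, Qin Kaixin) focuses on demystifying the core technologies of big data and container clouds, has 5 years of experience building industrial-grade IoT big data cloud platforms, and can provide full-stack consulting on big data and cloud-native platforms. Please keep following this blog series. QQ email: 1120746959@qq.com; feel free to get in touch for any technical exchange.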

1 Multi-index and multi-type search patterns

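/_search: search all data across all indices and all types
/index1/_search: search all types within one index
/index1,index2/_search: search two indices at once
/*1,*2/_search: match multiple indices with wildcards
/index1/type1/_search: search one type within one index
/index1/type1,type2/_search: search multiple types within one index
/index1,index2/type1,type2/_search: search multiple types across multiple indices
/_all/type1,type2/_search: _all stands for all indices, so this searches the given types across every index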
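
The URL patterns above all follow one rule: an optional comma-separated index list, an optional comma-separated type list, then _search. Here is a minimal Python sketch of that rule, assuming the type-based URL layout used in this article. The helper name search_path is hypothetical, and it only builds the path string without contacting a cluster.

```python
def search_path(indices=(), types=()):
    """Build a search URL path from index names and type names.

    Hypothetical helper for illustration only; it mirrors the
    /index/type/_search layout listed in this section.
    """
    parts = []
    if indices:
        parts.append(",".join(indices))
    elif types:
        # A type list needs an index segment in front of it; _all means every index.
        parts.append("_all")
    if types:
        parts.append(",".join(types))
    parts.append("_search")
    return "/" + "/".join(parts)


if __name__ == "__main__":
    print(search_path())
    print(search_path(["index1", "index2"]))
    print(search_path(["index1"], ["type1", "type2"]))
    print(search_path(types=["type1", "type2"]))
```

Wildcards such as *1 or *2 can be passed as ordinary index names, since they are resolved on the server side.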

2 Paginated search (avoiding deep paging)

size is the page size; from is the offset (starting at 0) of the first document to return.

GET /_search?size=10
        GET /_search?size=10&from=0
        GET /_search?size=10&from=20



        GET /test_index/test_type/_search
        
        "hits": {
            "total": 9,
            "max_score": 1,

Suppose we split these 9 documents into 3 pages of 3 documents each and try out paginated search.

GET /test_index/test_type/_search?from=0&size=3
        
        {
          "took": 2,
          "timed_out": false,
          "_shards": {
            "total": 5,
            "successful": 5,
            "failed": 0
          },
          "hits": {
            "total": 9,
            "max_score": 1,
            "hits": [
              {
                "_index": "test_index",
                "_type": "test_type",
                "_id": "8",
                "_score": 1,
                "_source": {
                  "test_field": "test client 2"
                }
              },
              {
                "_index": "test_index",
                "_type": "test_type",
                "_id": "6",
                "_score": 1,
                "_source": {
                  "test_field": "tes test"
                }
              },
              {
                "_index": "test_index",
                "_type": "test_type",
                "_id": "4",
                "_score": 1,
                "_source": {
                  "test_field": "test4"
                }
              }
            ]
          }
        }
        
        Page 1: id=8,6,4
        
        GET /test_index/test_type/_search?from=3&size=3
        
        Page 2: id=2, (auto-generated id), 7
        
        GET /test_index/test_type/_search?from=6&size=3
        
        Page 3: id=1,11,3
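
Page numbers map to from and size with simple arithmetic, and the same arithmetic shows why deep paging is expensive. To serve a page, every shard has to return its own top from + size documents to the coordinating node, which then sorts all of them and keeps only size. The Python sketch below assumes 1-based page numbers; both function names are hypothetical.

```python
def page_params(page, size):
    """Translate a 1-based page number into from/size query parameters."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    return {"from": (page - 1) * size, "size": size}


def docs_gathered(from_, size, shards):
    """Upper bound on documents the coordinating node must collect and sort.

    Each shard returns its own top (from + size) candidates.
    """
    return (from_ + size) * shards


if __name__ == "__main__":
    for page in (1, 2, 3):
        print(page, page_params(page, 3))
    # Shallow page versus deep page on the 5-shard index from the example.
    print(docs_gathered(0, 10, 5))
    print(docs_gathered(10000, 10, 5))
```

With 5 shards, the first page of 10 results needs at most 50 candidates, while from=10000 needs 50050, which is why very deep from values are best avoided.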

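3 How mapping works

3.1 A first look at dynamic mapping

A mapping is the data structure and related configuration created, automatically or manually, for a type in an index.
With dynamic mapping, ES automatically creates the index, the type, and the type's mapping for us. The mapping holds each field's data type along with settings such as how the field is analyzed. As covered later, we can also create the index, the type, and the type's mapping by hand before inserting any data.
Insert a few documents and let ES build the index structure for us automatically: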

PUT /website/article/1
        {
          "post_date": "2017-01-01",
          "title": "my first article",
          "content": "this is my first article in this website",
          "author_id": 11400
        }
        
        PUT /website/article/2
        {
          "post_date": "2017-01-02",
          "title": "my second article",
          "content": "this is my second article in this website",
          "author_id": 11400
        }
        
        PUT /website/article/3
        {
          "post_date": "2017-01-03",
          "title": "my third article",
          "content": "this is my third article in this website",
          "author_id": 11400
        }

Try a few searches (the results are not exact matches):

GET /website/article/_search?q=2017                      3 results
GET /website/article/_search?q=2017-01-01                3 results
GET /website/article/_search?q=post_date:2017-01-01      1 result
GET /website/article/_search?q=post_date:2017            1 result

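Looking at the mapping explains why the results differ. When ES builds a mapping automatically, it assigns different data types to different fields, and different data types are analyzed and searched differently. That is why searching the _all field and searching the post_date field behave so differently.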

GET /website/_mapping/article
        
        {
          "website": {
            "mappings": {
              "article": {
                "properties": {
                  "author_id": {
                    "type": "long"
                  },
                  "content": {
                    "type": "text",
                    "fields": {
                      "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                      }
                    }
                  },
                  "post_date": {
                    "type": "date"
                  },
                  "title": {
                    "type": "text",
                    "fields": {
                      "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                      }
                    }
                  }
                }
              }
            }
          }
        }

3.2 Dynamic mapping demystified

Tokenization and a first look at building the inverted index

doc1: I really liked my small dogs, and I think my mom also liked them.
doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.

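Run a search

mother like little dog

This returns no results at all.
Is that what we want? Definitely not. To us, is there any difference between mother and mom? No, they are synonyms.
Is there any difference between like and liked? No, they mean the same thing; one is present tense and the other is past tense.
What about little and small? Synonyms again. And dog and dogs? The same animal, one singular and one plural.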

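Normalization: when the inverted index is built, each token produced by tokenization is processed so that later searches have a better chance of finding related documents.

Tense conversion, singular/plural conversion, synonym conversion, case conversion
        mom --> mother
        liked --> like
        small --> little
        dogs --> dog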

After rebuilding the inverted index with normalization, searching again for mother liked little dog finds the documents.
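
The effect of normalization on the inverted index can be reproduced in a few lines of Python. This is only a toy sketch, not how Lucene works internally. The rewrite table is copied from the four examples above, the function names are hypothetical, and lowercasing is applied in both modes to keep the code short.

```python
import re
from collections import defaultdict

DOCS = {
    "doc1": "I really liked my small dogs, and I think my mom also liked them.",
    "doc2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

# Normalization rules taken from the examples in this section.
NORMALIZE = {"mom": "mother", "liked": "like", "small": "little", "dogs": "dog"}


def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def normalize(token):
    """Map a token to its normalized form, or return it unchanged."""
    return NORMALIZE.get(token, token)


def build_index(docs, use_normalization):
    """Build term -> set of doc ids, optionally normalizing each token."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            term = normalize(token) if use_normalization else token
            index[term].add(doc_id)
    return index


def search(index, query, use_normalization):
    """Return ids of documents containing at least one query term."""
    hits = set()
    for token in tokenize(query):
        term = normalize(token) if use_normalization else token
        hits |= index.get(term, set())
    return hits


if __name__ == "__main__":
    query = "mother like little dog"
    print(sorted(search(build_index(DOCS, False), query, False)))
    print(sorted(search(build_index(DOCS, True), query, True)))
```

Without normalization none of the four query terms exists in the index, so nothing matches. With normalization applied at both index time and query time, both documents are found.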

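3.3 What is an analyzer

An analyzer splits text into tokens and normalizes them, which improves recall. Given a sentence, it breaks the sentence into individual words and normalizes each one (tense conversion, singular/plural conversion). Recall here means increasing the number of relevant results a search is able to find.

character filter: preprocesses the text before tokenization. The most common cases are stripping HTML tags (<span>hello</span> --> hello) and replacing & with and (I&you --> I and you)

tokenizer: splits the text into tokens, hello you and me --> hello, you, and, me

token filter: lowercase, stop word, synonym: dogs --> dog, liked --> like, Tom --> tom, a/the/an --> removed, mother --> mom, small --> little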

The analyzer matters a great deal: it runs the text through all of these steps, and only the processed result is used to build the inverted index.
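
The three stages can be chained as plain functions. The Python sketch below is a simplified stand-in for a real analyzer: the stop word list and rewrite table come from the examples above, the tokenizer is a plain whitespace split, and all function names are hypothetical.

```python
import re

STOP_WORDS = {"a", "an", "the"}
# Token rewrites taken from the token filter examples in this section.
REWRITES = {"dogs": "dog", "liked": "like", "mother": "mom", "small": "little"}


def char_filter(text):
    """Preprocess raw text: strip HTML tags and spell out ampersands."""
    text = re.sub(r"<[^>]+>", "", text)
    return text.replace("&", " and ")


def tokenizer(text):
    """Split the filtered text into tokens on whitespace."""
    return text.split()


def token_filter(tokens):
    """Lowercase tokens, drop stop words, and apply the rewrite table."""
    result = []
    for token in tokens:
        token = token.lower()
        if token in STOP_WORDS:
            continue
        result.append(REWRITES.get(token, token))
    return result


def analyze(text):
    """Run character filter, tokenizer, and token filter in order."""
    return token_filter(tokenizer(char_filter(text)))


if __name__ == "__main__":
    print(analyze("<span>hello</span> I&you"))
    print(analyze("Tom liked the small dogs"))
```

The order matters: the character filter sees raw text, the tokenizer sees cleaned text, and the token filters only ever see individual tokens.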
Built-in analyzers

Set the shape to semi-transparent by calling set_trans(5)

standard analyzer: set, the, shape, to, semi, transparent, by, calling, set_trans, 5 (standard is the default)

simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans

whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)

language analyzer (language-specific analyzers, for example english): set, shape, semi, transpar, call, set_tran, 5
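
Two of these analyzers are simple enough to approximate in Python, which makes the difference in their output easy to see. The sketch below is an ASCII-only approximation with hypothetical function names. It does not attempt the standard or english analyzers, which rely on Unicode segmentation rules and stemming.

```python
import re

TEXT = "Set the shape to semi-transparent by calling set_trans(5)"


def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()


def simple_analyzer(text):
    """Split on anything that is not a letter, then lowercase (ASCII only here)."""
    return re.findall(r"[a-z]+", text.lower())


if __name__ == "__main__":
    print(whitespace_analyzer(TEXT))
    print(simple_analyzer(TEXT))
```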

3.4 Resolving the earlier puzzle

Query string analysis

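The query string must be analyzed with the same analyzer that was used when the index was built.
The query string is treated differently for exact value fields and full text fields.

Key point: depending on its type, a field is either full text or exact value.

post_date, date: exact value
_all: full text, tokenized and normalized

GET /_search?q=2017

This searches the _all field. All fields of a document are concatenated into one large string, which is then tokenized, for example:

2017-01-02 my second article this is my second article in this website 11400

        doc1    doc2    doc3
2017    *       *       *
01      *       *       *
02              *
03                      *

Searching _all for 2017 therefore matches all 3 documents.

GET /_search?q=2017-01-01

Against _all, the query string 2017-01-01 is tokenized with the same analyzer used to build the inverted index:

2017
01
01

Every document contains the token 2017 (and the month token 01), so all 3 documents match.

GET /_search?q=post_date:2017-01-01

A date field is indexed as an exact value:

              doc1    doc2    doc3
2017-01-01    *
2017-01-02            *
2017-01-03                    *

post_date:2017-01-01 is matched against the exact value 2017-01-01, so only doc1 is returned.

GET /_search?q=post_date:2017 is not explained here, because its behavior comes from an optimization made after ES 5.2.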
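
The difference between the two kinds of field can be simulated with the three articles inserted earlier. The Python sketch below is a toy model: the _all field is imitated by joining every field value into one string, the tokenizer is a simple regular expression, and the function names are hypothetical.

```python
import re

DOCS = {
    1: {"post_date": "2017-01-01", "title": "my first article",
        "content": "this is my first article in this website", "author_id": 11400},
    2: {"post_date": "2017-01-02", "title": "my second article",
        "content": "this is my second article in this website", "author_id": 11400},
    3: {"post_date": "2017-01-03", "title": "my third article",
        "content": "this is my third article in this website", "author_id": 11400},
}


def tokenize(text):
    """Lowercase and split into alphanumeric tokens, so 2017-01-01 -> 2017, 01, 01."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search_all(docs, query):
    """Full text search over a simulated _all field: any shared token is a hit."""
    query_tokens = set(tokenize(query))
    hits = []
    for doc_id, doc in docs.items():
        all_text = " ".join(str(value) for value in doc.values())
        if query_tokens & set(tokenize(all_text)):
            hits.append(doc_id)
    return hits


def search_exact(docs, field, value):
    """Exact value search: the whole field value must equal the query value."""
    return [doc_id for doc_id, doc in docs.items() if doc[field] == value]


if __name__ == "__main__":
    print(search_all(DOCS, "2017"))
    print(search_all(DOCS, "2017-01-01"))
    print(search_exact(DOCS, "post_date", "2017-01-01"))
```

The first two searches hit all three documents, and the exact value search hits only the first. These are the same counts the real queries returned in section 3.1.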

Testing an analyzer

GET /_analyze
    {
      "analyzer": "standard",
      "text": "Text to analyze"
    }

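3.5 How mapping works

When data is inserted directly into ES, ES automatically creates the index along with the type and its mapping.
The mapping automatically defines the data type of each field.
Depending on its data type (for example text versus date), a field is either exact value or full text.
For an exact value field, the whole value is added to the inverted index as a single term. A full text field goes through tokenization and normalization (tense, synonym, and case conversion) before its tokens are added to the inverted index.
Whether a field is exact value or full text also determines how a search against it behaves, and that behavior is kept consistent with how the inverted index was built. A search on an exact value field matches against the whole value, while a full text query string is itself tokenized and normalized before it is looked up in the inverted index.
You can rely on ES dynamic mapping to create the mapping automatically, data types included, or you can create the mapping for the index and type by hand in advance and configure each field yourself, including its data type, indexing behavior, and analyzer.
A mapping is the metadata of a type in an index. Each type has its own mapping, which determines the data types, how the inverted index is built, and how searches behave.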

3.6 Mapping data structure

Core data types

string
byte, short, integer, long
float, double
boolean
date

Dynamic mapping

true or false    -->  boolean
123              -->  long
123.45           -->  double
2017-01-01       -->  date
"hello world"    -->  string/text
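
The inference table above can be written as a small Python function. This sketch only mirrors the table as given in this article. The function name is hypothetical, date detection is reduced to the single yyyy-MM-dd format as a simplifying assumption, and the types a real cluster picks can differ between ES versions.

```python
from datetime import datetime


def infer_field_type(value):
    """Guess a field type for a JSON value, following the table in this section."""
    # bool must be checked before int, because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        try:
            # Assumption: only the yyyy-MM-dd pattern is treated as a date here.
            datetime.strptime(value, "%Y-%m-%d")
            return "date"
        except ValueError:
            return "text"
    raise TypeError(f"unsupported value: {value!r}")


if __name__ == "__main__":
    for sample in (True, 123, 123.45, "2017-01-01", "hello world"):
        print(repr(sample), "-->", infer_field_type(sample))
```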

Viewing the mapping

GET /index/_mapping/type

3.7 Creating the mapping manually

How a field is indexed

analyzed
    not_analyzed
    no

Creating the mapping along with the index

PUT /website
    {
      "mappings": {
        "article": {
          "properties": {
            "author_id": {
              "type": "long"
            },
            "title": {
              "type": "text",
              "analyzer": "english"
            },
            "content": {
              "type": "text"
            },
            "post_date": {
              "type": "date"
            },
            "publisher_id": {
              "type": "text",
              "index": "not_analyzed"
            }
          }
        }
      }
    }

Trying to modify the mapping as follows fails with an error, because the index already exists:

PUT /website
        {
          "mappings": {
            "article": {
              "properties": {
                "author_id": {
                  "type": "text"
                }
              }
            }
          }
        }
        
        {
          "error": {
            "root_cause": [
              {
                "type": "index_already_exists_exception",
                "reason": "index [website/co1dgJ-uTYGBEEOOL8GsQQ] already exists",
                "index_uuid": "co1dgJ-uTYGBEEOOL8GsQQ",
                "index": "website"
              }
            ],
            "type": "index_already_exists_exception",
            "reason": "index [website/co1dgJ-uTYGBEEOOL8GsQQ] already exists",
            "index_uuid": "co1dgJ-uTYGBEEOOL8GsQQ",
            "index": "website"
          },
          "status": 400
        }

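The correct way to modify the mapping (a new field can be added; the mapping of an existing field cannot be changed)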

PUT /website/_mapping/article
        {
          "properties" : {
            "new_field" : {
              "type" :    "string",
              "index":    "not_analyzed"
            }
          }
        }

Testing the mapping

GET /website/_analyze
        {
          "field": "content",
          "text": "my-dogs" 
        }
        
        GET website/_analyze
        {
          "field": "new_field",
          "text": "my dogs"
        }
        
        {
          "error": {
            "root_cause": [
              {
                "type": "remote_transport_exception",
                "reason": "[4onsTYV][127.0.0.1:9300][indices:admin/analyze[s]]"
              }
            ],
            "type": "illegal_argument_exception",
            "reason": "Can't process field [new_field], Analysis requests are only supported on tokenized fields"
          },
          "status": 400
        }

Multi-value field: it is indexed the same way as a single string value, and the values in the array must not mix data types

{ "tags": [ "tag1", "tag2" ]}

Empty field

null, [], [null]

object field

PUT /company/employee/1
        {
          "address": {
            "country": "china",
            "province": "guangdong",
            "city": "guangzhou"
          },
          "name": "jack",
          "age": 27,
          "join_date": "2017-01-01"
        }
        
        address: object type
        
        GET /company/_mapping/employee
        
        {
          "company": {
            "mappings": {
              "employee": {
                "properties": {
                  "address": {
                    "properties": {
                      "city": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      },
                      "country": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      },
                      "province": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      }
                    }
                  },
                  "age": {
                    "type": "long"
                  },
                  "join_date": {
                    "type": "date"
                  },
                  "name": {
                    "type": "text",
                    "fields": {
                      "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                      }
                    }
                  }
                }
              }
            }
          }
        }
        

How ES stores it internally, example 1

{
          "address": {
            "country": "china",
            "province": "guangdong",
            "city": "guangzhou"
          },
          "name": "jack",
          "age": 27,
          "join_date": "2017-01-01"
        }
        
        {
            "name":             [jack],
            "age":              [27],
            "join_date":        [2017-01-01],
            "address.country":  [china],
            "address.province": [guangdong],
            "address.city":     [guangzhou]
        }

How ES stores it internally, example 2

{
            "authors": [
                { "age": 26, "name": "Jack White"},
                { "age": 55, "name": "Tom Jones"},
                { "age": 39, "name": "Kitty Smith"}
            ]
        }
        
        {
            "authors.age":    [26, 55, 39],
            "authors.name":   [jack, white, tom, jones, kitty, smith]
        }
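
Both internal layouts come from the same flattening step: nested objects become dotted field names, and arrays of objects are merged into one value list per field. The Python sketch below covers that step only, under the assumption that analysis happens afterwards on each field's values. That later step is why authors.name ends up as jack, white, tom, jones, kitty, smith. The function name flatten is hypothetical.

```python
def flatten(doc, prefix=""):
    """Flatten nested objects and arrays of objects into dotted field -> value list."""
    flat = {}
    for key, value in doc.items():
        path = prefix + key
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, dict):
                # Recurse into the inner object and merge its fields under the dotted path.
                for sub_path, sub_values in flatten(item, path + ".").items():
                    flat.setdefault(sub_path, []).extend(sub_values)
            else:
                flat.setdefault(path, []).append(item)
    return flat


EMPLOYEE = {
    "address": {"country": "china", "province": "guangdong", "city": "guangzhou"},
    "name": "jack",
    "age": 27,
    "join_date": "2017-01-01",
}

BOOK = {
    "authors": [
        {"age": 26, "name": "Jack White"},
        {"age": 55, "name": "Tom Jones"},
        {"age": 39, "name": "Kitty Smith"},
    ]
}

if __name__ == "__main__":
    print(flatten(EMPLOYEE))
    print(flatten(BOOK))
```

After flattening, the link between a particular age and a particular name inside authors is no longer visible in the value lists.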

4 Query DSL analysis

Using Query DSL

GET /_search
        {
            "query": {
                "match_all": {}
            }
        }

Basic Query DSL syntax

{
            QUERY_NAME: {
                ARGUMENT: VALUE,
                ARGUMENT: VALUE,...
            }
        }
        
        {
            QUERY_NAME: {
                FIELD_NAME: {
                    ARGUMENT: VALUE,
                    ARGUMENT: VALUE,...
                }
            }
        }
        
        GET /test_index/test_type/_search 
        {
          "query": {
            "match": {
              "test_field": "test"
            }
          }
        }

Combining multiple search conditions. Requirement: title must contain elasticsearch, content may or may not contain elasticsearch, and author_id must not be 111.

GET /website/article/_search
        {
          "query": {
            "bool": {
              "must": [
                {
                  "match": {
                    "title": "elasticsearch"
                  }
                }
              ],
              "should": [
                {
                  "match": {
                    "content": "elasticsearch"
                  }
                }
              ],
              "must_not": [
                {
                  "match": {
                    "author_id": 111
                  }
                }
              ]
            }
          }
        }
        
        GET /test_index/_search
        {
            "query": {
                    "bool": {
                        "must": { "match":   { "name": "tom" }},
                        "should": [
                            { "match":       { "hired": true }},
                            { "bool": {
                                "must":      { "match": { "personality": "good" }},
                                "must_not":  { "match": { "rude": true }}
                            }}
                        ],
                        "minimum_should_match": 1
                    }
            }
        }
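
The meaning of must, should, and must_not can be checked against a handful of in-memory documents. The Python sketch below is only a toy model of the bool query. Its matches function is a crude stand-in for a match query, the score is simply the number of should clauses that matched (real ES computes a relevance score), and the sample documents and function names are made up for illustration.

```python
def matches(doc, field, value):
    """Toy match: token containment for strings, equality for other values."""
    field_value = doc.get(field)
    if isinstance(field_value, str):
        return str(value).lower() in field_value.lower().split()
    return field_value == value


def bool_search(docs, must=(), should=(), must_not=()):
    """Return doc ids that pass must/must_not, ordered by matched should clauses."""
    hits = []
    for doc in docs:
        if not all(matches(doc, field, value) for field, value in must):
            continue  # every must clause is required
        if any(matches(doc, field, value) for field, value in must_not):
            continue  # any must_not clause excludes the document
        score = sum(matches(doc, field, value) for field, value in should)
        hits.append((score, doc["id"]))
    hits.sort(key=lambda hit: -hit[0])  # stable sort keeps input order on ties
    return [doc_id for _, doc_id in hits]


# Hypothetical sample data for the requirement stated in this section.
ARTICLES = [
    {"id": 1, "title": "elasticsearch basics", "content": "intro to elasticsearch", "author_id": 110},
    {"id": 2, "title": "elasticsearch mapping", "content": "field types explained", "author_id": 112},
    {"id": 3, "title": "elasticsearch tips", "content": "elasticsearch tuning", "author_id": 111},
    {"id": 4, "title": "lucene internals", "content": "how elasticsearch uses lucene", "author_id": 110},
]

if __name__ == "__main__":
    print(bool_search(
        ARTICLES,
        must=[("title", "elasticsearch")],
        should=[("content", "elasticsearch")],
        must_not=[("author_id", 111)],
    ))
```

Document 3 is dropped by must_not and document 4 fails must. Documents 1 and 2 both qualify, and document 1 ranks first because it also satisfies the optional should clause.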