Elasticsearch Index Query Optimization and an In-Depth Look at Mapping and Analysis: Search Systems in Production

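The author of this technical column (Qin Kaixin, 秦凱新) focuses on the core technologies behind big data and container clouds, has five years of experience building industrial-grade IoT big data cloud platforms, and offers full-stack consulting on big data and cloud-native platforms. Please keep following this blog series. QQ mail: 1120746959@qq.com; feel free to get in touch for any academic exchange.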

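1 Multi-index and multi-type search patterns

    /_search: search all data in every type of every index
    /index1/_search: search all types under one specified index
    /index1,index2/_search: search two indices at once
    /*1,*2/_search: match multiple indices with wildcards
    /index1/type1/_search: search one specified type under one index
    /index1/type1,type2/_search: search several types under one index
    /index1,index2/type1,type2/_search: search several types across several indices
    /_all/type1,type2/_search: _all stands for every index, so this searches the specified types in all indices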
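The URL patterns above all follow one rule: an optional comma-separated index list, an optional comma-separated type list, then _search. A minimal Python sketch of that rule follows. The helper name search_path is hypothetical and is not part of any Elasticsearch client; it only builds the path string and sends nothing.

```python
def search_path(indices=None, types=None):
    """Build a _search path from optional index and type lists.

    Hypothetical helper for illustration only.
    """
    parts = []
    if indices:
        parts.append(",".join(indices))
    elif types:
        # A type list needs an index segment in front of it; _all means every index.
        parts.append("_all")
    if types:
        parts.append(",".join(types))
    parts.append("_search")
    return "/" + "/".join(parts)


print(search_path())                                    # /_search
print(search_path(["index1", "index2"]))                # /index1,index2/_search
print(search_path(["index1"], ["type1", "type2"]))      # /index1/type1,type2/_search
print(search_path(types=["type1", "type2"]))            # /_all/type1,type2/_search
```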

2 Paginated search (and avoiding deep paging)

  • size is the page size; from is the offset of the first document to return
        GET /_search?size=10
        GET /_search?size=10&from=0
        GET /_search?size=10&from=20

        GET /test_index/test_type/_search
        
        "hits": {
            "total": 9,
            "max_score": 1,
  • Suppose we split these 9 documents into 3 pages of 3 documents each and try out paginated search
        GET /test_index/test_type/_search?from=0&size=3
        
        {
          "took": 2,
          "timed_out": false,
          "_shards": {
            "total": 5,
            "successful": 5,
            "failed": 0
          },
          "hits": {
            "total": 9,
            "max_score": 1,
            "hits": [
              {
                "_index": "test_index",
                "_type": "test_type",
                "_id": "8",
                "_score": 1,
                "_source": {
                  "test_field": "test client 2"
                }
              },
              {
                "_index": "test_index",
                "_type": "test_type",
                "_id": "6",
                "_score": 1,
                "_source": {
                  "test_field": "tes test"
                }
              },
              {
                "_index": "test_index",
                "_type": "test_type",
                "_id": "4",
                "_score": 1,
                "_source": {
                  "test_field": "test4"
                }
              }
            ]
          }
        }
        
        Page 1: id=8,6,4

        GET /test_index/test_type/_search?from=3&size=3

        Page 2: id=2, an auto-generated id, 7

        GET /test_index/test_type/_search?from=6&size=3

        Page 3: id=1,11,3
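The from values used above (0, 3, 6) are simply (page - 1) * size. The sketch below, with hypothetical helper names, computes them and also estimates why deep paging is expensive: each shard has to return its top from + size hits, and the coordinating node merges all of them before keeping a single page. The shard count of 5 is taken from the _shards section of the response above; the figures are an upper bound for illustration, not a measurement.

```python
def page_params(page, size):
    """Translate a 1-based page number into the from/size query parameters."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    return {"from": (page - 1) * size, "size": size}


def docs_collected(page, size, shards):
    """Upper bound on hits the coordinating node gathers for one page.

    Each shard returns its top from+size hits; the coordinating node
    merges them all and then throws away everything except one page.
    """
    params = page_params(page, size)
    return shards * (params["from"] + params["size"])


print(page_params(3, 3))              # {'from': 6, 'size': 3}
print(docs_collected(3, 3, 5))        # 45
print(docs_collected(1000, 10, 5))    # 50000
```

Page 3 of the small example costs at most 45 collected hits, while page 1000 with size 10 already costs up to 50000, which is why very large from values should be avoided.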

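3 How mapping works

3.1 A first look at dynamic mapping

  • A mapping is the data structure and related configuration, created automatically or manually, for a type in an index.

  • With dynamic mapping, Elasticsearch automatically creates the index, the type, and the type's mapping for us. The mapping records the data type of every field and settings such as how the field is analyzed. As covered later, we can also create the index, the type, and its mapping by hand before indexing any data.

  • Index a few documents and let ES build the index structure for us automatically

        PUT /website/article/1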
        {
          "post_date": "2017-01-01",
          "title": "my first article",
          "content": "this is my first article in this website",
          "author_id": 11400
        }
        
        PUT /website/article/2
        {
          "post_date": "2017-01-02",
          "title": "my second article",
          "content": "this is my second article in this website",
          "author_id": 11400
        }
        
        PUT /website/article/3
        {
          "post_date": "2017-01-03",
          "title": "my third article",
          "content": "this is my third article in this website",
          "author_id": 11400
        }
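  • Try a few searches (the results are not the exact matches one might expect)
        GET /website/article/_search?q=2017                     3 results
        GET /website/article/_search?q=2017-01-01               3 results
        GET /website/article/_search?q=post_date:2017-01-01     1 result
        GET /website/article/_search?q=post_date:2017           1 result
  • Look at the mapping. The results differ because, when ES built the mapping automatically, it assigned different data types to different fields, and different data types are analyzed and searched differently. That is why searching the _all field and searching the post_date field behave completely differently.
        GET /website/_mapping/article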
        
        {
          "website": {
            "mappings": {
              "article": {
                "properties": {
                  "author_id": {
                    "type": "long"
                  },
                  "content": {
                    "type": "text",
                    "fields": {
                      "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                      }
                    }
                  },
                  "post_date": {
                    "type": "date"
                  },
                  "title": {
                    "type": "text",
                    "fields": {
                      "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                      }
                    }
                  }
                }
              }
            }
          }
        }

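3.2 Dynamic mapping demystified

  • Analysis and a first, naive inverted index
    doc1: I really liked my small dogs, and I think my mom also liked them.
    doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.

  • Run a search
    mother like little dog
    This returns nothing, because none of the four query terms appears in the index verbatim.
    Is that the result we want? Definitely not. To us, mother and mom are synonyms.
    like and liked mean the same thing; one is present tense and the other past tense.
    little and small are synonyms, and dog and dogs differ only in being singular and plural.
  • Normalization: while building the inverted index, ES processes each token it has split out, so that later searches are more likely to find related documents
    tense conversion, singular/plural conversion, synonym conversion, case conversion
    mom --> mother
    liked --> like
    small --> little
    dogs --> dog
  • Rebuild the inverted index with normalization applied, search for mother like little dog again, and the documents can now be found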
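The effect of normalization on the two sample documents can be reproduced with a toy inverted index. This is a minimal sketch, not how Elasticsearch is implemented: the NORMALIZE table is hand-written from the four conversions listed above, tokenization is a simple lowercase letter split, and a document counts as a hit if it contains at least one query term.

```python
import re
from collections import defaultdict

# Hand-written normalization table taken from the examples above (an assumption,
# standing in for real stemming and synonym filters).
NORMALIZE = {"mom": "mother", "liked": "like", "small": "little", "dogs": "dog"}


def tokenize(text, normalize):
    """Lowercase the text, split it into words, optionally normalize each word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if normalize:
        tokens = [NORMALIZE.get(t, t) for t in tokens]
    return tokens


def build_index(docs, normalize):
    """Map every token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text, normalize):
            index[token].add(doc_id)
    return index


def search(index, query, normalize):
    """Return ids of documents containing at least one query term."""
    hits = set()
    for token in tokenize(query, normalize):
        hits |= index.get(token, set())
    return sorted(hits)


DOCS = {
    "doc1": "I really liked my small dogs, and I think my mom also liked them.",
    "doc2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

raw_index = build_index(DOCS, normalize=False)
normalized_index = build_index(DOCS, normalize=True)

print(search(raw_index, "mother like little dog", normalize=False))        # []
print(search(normalized_index, "mother like little dog", normalize=True))  # ['doc1', 'doc2']
```

Note that the query is normalized with the same table as the documents; if only one side were normalized, the terms would still fail to line up.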

3.3 What is an analyzer

  • An analyzer splits text into terms and normalizes them (which improves recall). Given a sentence, it breaks the sentence into individual words and normalizes each word (tense conversion, singular/plural conversion). Recall: when searching, increasing the number of results that can be found.
    character filter: pre-processes the text before it is tokenized; the most common cases are stripping
    HTML tags (<span>hello</span> --> hello) and replacing & with and (I&you --> I and you)

    tokenizer: splits the text into tokens, hello you and me --> hello, you, and, me

    token filter: lowercase, stop word, synonym, e.g. dogs --> dog, liked --> like, Tom --> tom, a/the/an --> removed, mom --> mother, small --> little
  • The analyzer matters a great deal: it runs the text through all of this processing, and only the final result is used to build the inverted index

  • The built-in analyzers

    Set the shape to semi-transparent by calling set_trans(5)

    standard analyzer: set, the, shape, to, semi, transparent, by, calling, set_trans, 5 (standard is the default)

    simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans

    whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)

    language analyzer (an analyzer for a specific language, for example english): set, shape, semi, transpar, call, set_tran, 5
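The three stages described in this section (character filter, tokenizer, token filter) can be chained in a few lines of Python. This is only a sketch of the pipeline idea: the stop-word, stemming, and synonym tables are tiny hand-written assumptions, and the output will not match any built-in Elasticsearch analyzer token for token.

```python
import re

# Tiny illustrative tables (assumptions, not Elasticsearch defaults).
STOP_WORDS = {"a", "an", "the"}
STEMS = {"dogs": "dog", "liked": "like"}       # stand-in for a real stemmer
SYNONYMS = {"mom": "mother", "small": "little"}


def char_filter(text):
    """Pre-process raw text: strip HTML tags and spell out '&'."""
    text = re.sub(r"<[^>]+>", "", text)
    return text.replace("&", " and ")


def tokenizer(text):
    """Split on anything that is not a letter or digit."""
    return re.findall(r"[A-Za-z0-9]+", text)


def token_filter(tokens):
    """Lowercase, drop stop words, then apply stemming and synonyms."""
    out = []
    for token in tokens:
        token = token.lower()
        if token in STOP_WORDS:
            continue
        token = STEMS.get(token, token)
        token = SYNONYMS.get(token, token)
        out.append(token)
    return out


def analyze(text):
    """Run the three stages in order, as an analyzer does."""
    return token_filter(tokenizer(char_filter(text)))


print(analyze("<span>Tom & the small dogs liked mom</span>"))
# ['tom', 'and', 'little', 'dog', 'like', 'mother']
print(analyze("I&you"))
# ['i', 'and', 'you']
```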

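3.4 The earlier puzzle explained

  • How the query string is analyzed
    The query string must be analyzed with the same analyzer that was used when the index was built.
    The query string is treated differently for exact value fields and full text fields.

    Key point: fields have different types; some are full text and some are exact value.

    post_date, date: exact value
    _all: full text, tokenized and normalized

    GET /_search?q=2017

    This searches the _all field. All fields of a document are concatenated into one long string, which is then tokenized. For doc2:

    2017-01-02 my second article this is my second article in this website 11400

                doc1        doc2        doc3
    2017        *           *           *
    01          *           *           *
    02                      *
    03                                  *

    Searching _all for 2017 therefore matches all 3 documents.

    GET /_search?q=2017-01-01

    On _all, the query string 2017-01-01 is tokenized by the same analyzer that built the inverted index:

    2017
    01
    01

    Any one of these terms is enough to match, and 2017 occurs in every document, so all 3 documents are returned.

    GET /_search?q=post_date:2017-01-01

    A date field is indexed as an exact value:

                doc1        doc2        doc3
    2017-01-01  *
    2017-01-02              *
    2017-01-03                          *

    post_date:2017-01-01 is looked up as the single value 2017-01-01, which matches only doc1.

    GET /_search?q=post_date:2017 is not covered here, because its behavior comes from an optimization introduced after ES 5.2.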
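The difference between the two kinds of index can be imitated with two small Python dictionaries built from the three sample articles. This is a simplified model under stated assumptions: all_index plays the role of the tokenized _all field, date_index stores each post_date as one whole term (real date fields are not stored as strings), and any matching term counts as a hit. It does not model the post_date:2017 case.

```python
import re
from collections import defaultdict

DOCS = {
    1: {"post_date": "2017-01-01", "title": "my first article",
        "content": "this is my first article in this website", "author_id": 11400},
    2: {"post_date": "2017-01-02", "title": "my second article",
        "content": "this is my second article in this website", "author_id": 11400},
    3: {"post_date": "2017-01-03", "title": "my third article",
        "content": "this is my third article in this website", "author_id": 11400},
}


def tokens(text):
    """Split into lowercase alphanumeric tokens, so 2017-01-01 becomes 2017, 01, 01."""
    return re.findall(r"[a-z0-9]+", str(text).lower())


all_index = defaultdict(set)    # full text: every field value, tokenized
date_index = defaultdict(set)   # exact value: the whole date as a single term

for doc_id, doc in DOCS.items():
    for value in doc.values():
        for token in tokens(value):
            all_index[token].add(doc_id)
    date_index[doc["post_date"]].add(doc_id)


def search_all(q):
    """Full text search: tokenize the query, match any token."""
    hits = set()
    for token in tokens(q):
        hits |= all_index.get(token, set())
    return sorted(hits)


def search_post_date(q):
    """Exact value search: the query is looked up as one whole term."""
    return sorted(date_index.get(q, set()))


print(search_all("2017"))                # [1, 2, 3]
print(search_all("2017-01-01"))          # [1, 2, 3]
print(search_post_date("2017-01-01"))    # [1]
```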
  • Testing an analyzer
    GET /_analyze
    {
      "analyzer": "standard",
      "text": "Text to analyze"
    }

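3.5 Mapping principles

  • When you index data into ES directly, ES automatically creates the index, the type, and the corresponding mapping

  • The mapping automatically defines the data type of every field

  • Depending on the data type (for example text versus date), a field is either exact value or full text

  • An exact value is added to the inverted index as a single term containing the whole value; full text goes through a series of processing steps, tokenization and normalization (tense conversion, synonym conversion, case conversion), before it is added to the inverted index

  • Whether a field is exact value or full text also determines how a search against it behaves, and that behavior mirrors how the inverted index was built: a search on an exact value field matches against the whole value, while a query string against a full text field is tokenized and normalized before it is looked up in the inverted index

  • You can rely on dynamic mapping and let ES create the mapping, data types included, or you can create the mapping for an index and type by hand in advance and configure each field yourself: its data type, its indexing behavior, its analyzer, and so on

  • A mapping is the metadata of a type in an index. Every type has its own mapping, which determines the data types, how the inverted index is built, and how searches behave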

3.6 Mapping data types

  • Core data types
    string (text and keyword in ES 5.x and later)
    byte, short, integer, long
    float, double
    boolean
    date
  • Dynamic mapping rules
    true or false    -->    boolean
    123              -->    long
    123.45           -->    double
    2017-01-01       -->    date
    "hello world"    -->    string/text
  • Viewing a mapping
    GET /index/_mapping/type
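The dynamic mapping rules listed above amount to a type inference over JSON values. The Python sketch below applies exactly the rules from that table; infer_type and infer_mapping are hypothetical names, the date check only recognizes the yyyy-MM-dd form used in this article, and real Elasticsearch versions differ in details, so treat it as an illustration of the idea and not as the actual algorithm.

```python
import re

# Only the yyyy-MM-dd form used in this article is recognized (an assumption).
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def infer_type(value):
    """Guess a field mapping from one JSON value, following the table above."""
    # bool must be checked before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "double"}   # per the table above
    if isinstance(value, str):
        if DATE_RE.match(value):
            return {"type": "date"}
        return {"type": "text",
                "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}}
    if isinstance(value, dict):
        # Objects get a nested "properties" section.
        return {"properties": infer_mapping(value)}
    raise TypeError(f"unsupported value: {value!r}")


def infer_mapping(doc):
    """Infer a mapping for every field of a document."""
    return {field: infer_type(value) for field, value in doc.items()}


article = {
    "post_date": "2017-01-01",
    "title": "my first article",
    "author_id": 11400,
}
mapping = infer_mapping(article)
print(mapping["post_date"])   # {'type': 'date'}
print(mapping["author_id"])   # {'type': 'long'}
print(mapping["title"]["type"], mapping["title"]["fields"]["keyword"]["type"])   # text keyword
```

For the sample article this reproduces the types shown in the GET /website/_mapping/article response in section 3.1.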

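3.7 Creating an index with a manual mapping

  • How a field is indexed (values of the index option on legacy string fields; in ES 5.x they correspond to a text field, a keyword field, and "index": false)
    analyzed
    not_analyzed
    no
  • Create an index with an explicit mapping
    PUT /website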
    {
      "mappings": {
        "article": {
          "properties": {
            "author_id": {
              "type": "long"
            },
            "title": {
              "type": "text",
              "analyzer": "english"
            },
            "content": {
              "type": "text"
            },
            "post_date": {
              "type": "date"
            },
            "publisher_id": {
              "type": "keyword"
            }
          }
        }
      }
    }
  • Changing the type of an existing field by re-issuing PUT /website fails, because the index already exists
        PUT /website
        {
          "mappings": {
            "article": {
              "properties": {
                "author_id": {
                  "type": "text"
                }
              }
            }
          }
        }
        
        {
          "error": {
            "root_cause": [
              {
                "type": "index_already_exists_exception",
                "reason": "index [website/co1dgJ-uTYGBEEOOL8GsQQ] already exists",
                "index_uuid": "co1dgJ-uTYGBEEOOL8GsQQ",
                "index": "website"
              }
            ],
            "type": "index_already_exists_exception",
            "reason": "index [website/co1dgJ-uTYGBEEOOL8GsQQ] already exists",
            "index_uuid": "co1dgJ-uTYGBEEOOL8GsQQ",
            "index": "website"
          },
          "status": 400
        }
  • The correct way is to add a new field through the _mapping endpoint
        PUT /website/_mapping/article
        {
          "properties" : {
            "new_field" : {
              "type" :    "keyword"
            }
          }
        }
  • Testing the mapping
        GET /website/_analyze
        {
          "field": "content",
          "text": "my-dogs" 
        }
        
        GET website/_analyze
        {
          "field": "new_field",
          "text": "my dogs"
        }
        
        {
          "error": {
            "root_cause": [
              {
                "type": "remote_transport_exception",
                "reason": "[4onsTYV][127.0.0.1:9300][indices:admin/analyze[s]]"
              }
            ],
            "type": "illegal_argument_exception",
            "reason": "Can't process field [new_field], Analysis requests are only supported on tokenized fields"
          },
          "status": 400
        }
  • multi-value field: indexed the same way as a single string; the values in the array must not mix data types
        { "tags": [ "tag1", "tag2" ]}
  • empty field
        null, [], [null]
  • object field
        PUT /company/employee/1
        {
          "address": {
            "country": "china",
            "province": "guangdong",
            "city": "guangzhou"
          },
          "name": "jack",
          "age": 27,
          "join_date": "2017-01-01"
        }
        
        address: object type
        
        GET /company/_mapping/employee
        
        {
          "company": {
            "mappings": {
              "employee": {
                "properties": {
                  "address": {
                    "properties": {
                      "city": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      },
                      "country": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      },
                      "province": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      }
                    }
                  },
                  "age": {
                    "type": "long"
                  },
                  "join_date": {
                    "type": "date"
                  },
                  "name": {
                    "type": "text",
                    "fields": {
                      "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                      }
                    }
                  }
                }
              }
            }
          }
        }
        
  • Underlying ES data structure, case 1: a single object
        {
          "address": {
            "country": "china",
            "province": "guangdong",
            "city": "guangzhou"
          },
          "name": "jack",
          "age": 27,
          "join_date": "2017-01-01"
        }
        
        {
            "name":            [jack],
            "age":          [27],
            "join_date":      [2017-01-01],
            "address.country":         [china],
            "address.province":   [guangdong],
            "address.city":  [guangzhou]
        }
  • Underlying ES data structure, case 2: an array of objects
        {
            "authors": [
                { "age": 26, "name": "Jack White"},
                { "age": 55, "name": "Tom Jones"},
                { "age": 39, "name": "Kitty Smith"}
            ]
        }
        
        {
            "authors.age":    [26, 55, 39],
            "authors.name":   [jack, white, tom, jones, kitty, smith]
        }
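Both cases above are the same transformation: walk the document, join nested keys with dots, and collect every leaf value under its dotted path. A minimal Python sketch follows; flatten is a hypothetical helper, and unlike the listing above it keeps string values whole and does not tokenize them.

```python
def flatten(value, prefix="", out=None):
    """Collect leaf values under dotted paths, merging arrays of objects."""
    if out is None:
        out = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flatten(child, f"{prefix}.{key}" if prefix else key, out)
    elif isinstance(value, list):
        # Array items share the same path, so their values end up side by side.
        for item in value:
            flatten(item, prefix, out)
    else:
        out.setdefault(prefix, []).append(value)
    return out


employee = {
    "address": {"country": "china", "province": "guangdong", "city": "guangzhou"},
    "name": "jack",
    "age": 27,
    "join_date": "2017-01-01",
}
book = {
    "authors": [
        {"age": 26, "name": "Jack White"},
        {"age": 55, "name": "Tom Jones"},
        {"age": 39, "name": "Kitty Smith"},
    ]
}

print(flatten(employee)["address.city"])   # ['guangzhou']
print(flatten(book))
# {'authors.age': [26, 55, 39], 'authors.name': ['Jack White', 'Tom Jones', 'Kitty Smith']}
```

In the second case the flattened form no longer records which age belongs to which name, which is worth keeping in mind when querying arrays of objects.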

4 Query DSL analysis

  • Using the Query DSL
        GET /_search
        {
            "query": {
                "match_all": {}
            }
        }
  • Basic syntax of the Query DSL
        {
            QUERY_NAME: {
                ARGUMENT: VALUE,
                ARGUMENT: VALUE,...
            }
        }
        
        {
            QUERY_NAME: {
                FIELD_NAME: {
                    ARGUMENT: VALUE,
                    ARGUMENT: VALUE,...
                }
            }
        }
        
        GET /test_index/test_type/_search 
        {
          "query": {
            "match": {
              "test_field": "test"
            }
          }
        }
  • Combining multiple search conditions. Requirement: title must contain elasticsearch, content may or may not contain elasticsearch, and author_id must not be 111
        GET /website/article/_search
        {
          "query": {
            "bool": {
              "must": [
                {
                  "match": {
                    "title": "elasticsearch"
                  }
                }
              ],
              "should": [
                {
                  "match": {
                    "content": "elasticsearch"
                  }
                }
              ],
              "must_not": [
                {
                  "match": {
                    "author_id": 111
                  }
                }
              ]
            }
          }
        }
        
        GET /test_index/_search
        {
            "query": {
                    "bool": {
                        "must": { "match":   { "name": "tom" }},
                        "should": [
                            { "match":       { "hired": true }},
                            { "bool": {
                                "must":      { "match": { "personality": "good" }},
                                "must_not":  { "match": { "rude": true }}
                            }}
                        ],
                        "minimum_should_match": 1
                    }
            }
        }
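To make the must / should / must_not semantics concrete, the sketch below evaluates a tiny subset of the Query DSL (match and bool only) against plain Python dictionaries. It is a toy under stated assumptions: match on a string means "contains the word", relevance scoring is ignored, and the default for minimum_should_match (0 when a must clause is present, otherwise 1 if there are should clauses) is an assumed simplification. The function names are hypothetical.

```python
def as_list(clauses):
    """The DSL accepts either a single clause or a list of clauses."""
    return clauses if isinstance(clauses, list) else [clauses]


def matches(query, doc):
    """Evaluate a match or bool query against one document (a dict)."""
    (name, body), = query.items()
    if name == "match":
        (field, wanted), = body.items()
        value = doc.get(field)
        if isinstance(value, str) and isinstance(wanted, str):
            # Toy full text match: the field contains the word.
            return wanted.lower() in value.lower().split()
        return value == wanted
    if name == "bool":
        must = as_list(body.get("must", []))
        should = as_list(body.get("should", []))
        must_not = as_list(body.get("must_not", []))
        if not all(matches(q, doc) for q in must):
            return False
        if any(matches(q, doc) for q in must_not):
            return False
        # Assumed default: should clauses are optional when a must clause exists.
        default_min = 0 if must else (1 if should else 0)
        minimum = body.get("minimum_should_match", default_min)
        return sum(matches(q, doc) for q in should) >= minimum
    raise ValueError(f"unsupported query type: {name}")


# The second query from the listing above.
QUERY = {
    "bool": {
        "must": {"match": {"name": "tom"}},
        "should": [
            {"match": {"hired": True}},
            {"bool": {
                "must": {"match": {"personality": "good"}},
                "must_not": {"match": {"rude": True}},
            }},
        ],
        "minimum_should_match": 1,
    }
}

print(matches(QUERY, {"name": "tom", "hired": True}))                                         # True
print(matches(QUERY, {"name": "tom", "hired": False, "personality": "good", "rude": True}))   # False
print(matches(QUERY, {"name": "tom", "hired": False, "personality": "good", "rude": False}))  # True
print(matches(QUERY, {"name": "jerry", "hired": True}))                                       # False
```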

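5 Summary

A production deployment still takes a good deal more work; this article starts from the basics and pulls the key issues together.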


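Qin Kaixin (秦凱新)

Copyright belongs to the author. For commercial reprints, contact the author for permission; for non-commercial reprints, credit the source.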
