Elasticsearch, a Big Data Power Tool: Full-Text Queries: The match_phrase Query

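This is day 11 of my participation in the August Writing Challenge; see the August Writing Challenge page for event details.
The Elasticsearch version used in this article is 7.4.2.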

Test data:

POST /match_phrase_test/_doc/1
{
  "my_text": "my favorite dialet is cold porridge"
}

POST /match_phrase_test/_doc/2
{
  "my_text": "when it's cold his favorite food is porridge"
}

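The match_phrase query

The match_phrase query analyzes the query text into tokens and then runs a phrase query on those tokens: a document matches only if it contains all of the tokens, in the same order and at adjacent positions.

Example: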

POST /match_phrase_test/_search
{
  "query": {
    "match_phrase": {
      "my_text": {
        "query": "my favorite"
      }
    }
  }
}

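Analysis:

The query text my favorite is analyzed into ["my", "favorite"];
doc1 contains both tokens, with favorite immediately following my, whereas doc2 contains only favorite and so does not satisfy the phrase requirement;
therefore only doc1 is returned.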

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.6520334,
    "hits" : [
      {
        "_index" : "match_phrase_test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.6520334,
        "_source" : {
          "my_text" : "my favorite dialet is cold porridge"
        }
      }
    ]
  }
}
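
The matching rule used in the analysis above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Elasticsearch implements it: `analyze` and `phrase_match` are hypothetical helper names, and the regex tokenizer is a rough stand-in for the standard analyzer.

```python
import re


def analyze(text):
    """Rough stand-in for the standard analyzer (an assumption):
    lowercase the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def phrase_match(doc_text, query_text):
    """True if the query tokens appear in the document consecutively
    and in the same order (an exact phrase, i.e. slop = 0)."""
    doc = analyze(doc_text)
    query = analyze(query_text)
    n = len(query)
    if n == 0:
        return False
    return any(doc[i:i + n] == query for i in range(len(doc) - n + 1))


docs = {
    "1": "my favorite dialet is cold porridge",
    "2": "when it's cold his favorite food is porridge",
}

hits = [doc_id for doc_id, text in docs.items()
        if phrase_match(text, "my favorite")]
print(hits)  # ['1']
```

Reversing the query to "favorite my" makes `phrase_match` return False for doc1 as well, which is the strictness that the slop parameter relaxes.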

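The slop parameter sets how many position moves are tolerated between the query tokens and the document when matching the phrase; the default is 0, which requires the exact phrase. Reordering tokens consumes slop as well. Suppose the document contains favorite food and the query text is food favorite. Bringing the query into the document's order takes two moves:

move food to the position of favorite;
move favorite to the position of food.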

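Summary: swapping one pair of adjacent tokens costs a slop of 2. As a rule of thumb, budget 2 per swap, that is 2*n for n swaps; for a single swap this can also be written as (number of out-of-order tokens - 1)*2. Treat this as an estimate rather than an exact formula: the slop actually required depends on how far each token is displaced from its expected position.
Example:
Suppose the query text is my dialet favorite is and we want to hit doc1, whose text is my favorite dialet is cold porridge. Only dialet and favorite are out of order, and swapping that one pair restores the phrase, so the minimum slop is 1*2 = 2. Equivalently: (number of out-of-order tokens - 1)*2 = (2-1)*2 = 2.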

POST /match_phrase_test/_search
{
  "query": {
    "match_phrase": {
      "my_text": {
        "query": "my dialet favorite is",
        "slop": 2
      }
    }
  }
}

Query result:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.9197583,
    "hits" : [
      {
        "_index" : "match_phrase_test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.9197583,
        "_source" : {
          "my_text" : "my favorite dialet is cold porridge"
        }
      }
    ]
  }
}
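
To check slop values like the one used above without a cluster, the displacement of each token can be computed by hand. The sketch below uses a simplified model, assumed here for illustration only: each query token's shift is its document position minus its query position, and the required slop is the spread between the largest and smallest shift. `required_slop` is a hypothetical helper, and it assumes every query token occurs exactly once in the document.

```python
def required_slop(doc_tokens, query_tokens):
    """Smallest slop that lets the query match under the simplified model.

    Returns None if a query token is missing from the document.
    Assumption: each query token occurs exactly once in the document.
    """
    if not query_tokens:
        return 0
    shifts = []
    for query_pos, token in enumerate(query_tokens):
        if token not in doc_tokens:
            return None
        doc_pos = doc_tokens.index(token)
        # How far this token sits from where the query expects it.
        shifts.append(doc_pos - query_pos)
    return max(shifts) - min(shifts)


doc1 = "my favorite dialet is cold porridge".split()

print(required_slop(doc1, "my favorite".split()))             # 0
print(required_slop(doc1, "my dialet favorite is".split()))   # 2
print(required_slop("favorite food".split(),
                    "food favorite".split()))                 # 2
```

In the second call, dialet shifts by +1 and favorite by -1 while my and is stay in place, so the spread is 2, which agrees with the slop of 2 that the query above needed to hit doc1.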
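
You can also use the analyzer parameter to specify which analyzer tokenizes the query text. By default, the query uses the search_analyzer explicitly set in the mapping of the queried field (or, if none is set, the field's own analyzer), and otherwise the index's default analyzer.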

POST /match_phrase_test/_search
{
  "query": {
    "match_phrase": {
      "my_text": {
        "query": "favorite Dialet",
        "analyzer": "whitespace"
      }
    }
  }
}
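
Because analyzer is set to whitespace, the query text is split on whitespace only, with no lowercasing, which gives ["favorite", "Dialet"].
doc1's my_text was indexed with the standard analyzer (which splits on word boundaries and then lowercases every token), which gives ["my", "favorite", "dialet", "is", "cold", "porridge"].
Because Dialet, with a capital D, does not exist in doc1's inverted index, doc1 is not hit and the query result is empty.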

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}
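
The analyzer mismatch behind this empty result can be reproduced with two small Python stand-ins. Both functions are approximations written for this illustration (the names are hypothetical, and the real standard analyzer follows Unicode word-boundary rules rather than a simple regex); on a real cluster the _analyze API shows the actual tokens.

```python
import re


def whitespace_analyze(text):
    """Approximates the whitespace analyzer: split on whitespace, keep case."""
    return text.split()


def standard_like_analyze(text):
    """Approximates the standard analyzer (an assumption):
    lowercase, then keep runs of word characters."""
    return re.findall(r"\w+", text.lower())


# Tokens stored in the index for doc1 (index-time analysis).
indexed = standard_like_analyze("my favorite dialet is cold porridge")

# Tokens produced from the query text (search-time analysis).
query = whitespace_analyze("favorite Dialet")

# Any query token absent from the index makes the phrase unmatchable.
missing = [token for token in query if token not in indexed]

print(query)    # ['favorite', 'Dialet']
print(missing)  # ['Dialet']
```

Running the query text through `standard_like_analyze` instead yields ['favorite', 'dialet'], both of which are in the index, adjacent and in order, so the phrase would match.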