Elasticsearch Query DSL Summary (Part 4): Multi Match Query

Date: 2019-11-11


Resolve to perform what you ought; perform without fail what you resolve.

-- Benjamin Franklin

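Introduction

Lately I have been using mind maps to study and summarize topics. If you are curious about mind maps but not familiar with them, see the article linked in the original post. This time the topic is the multi_match query. Before the main text, the original post showed a mind map of this article as a warm-up.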

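Overview

The multi_match query builds on the match query. The key difference is that it can search several fields at once.

Let's build an example first. The multimatchtest index has a multimatch_test type with two text fields, subject and message. Using the fields parameter, we search both fields for multimatch and get two matching documents.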

PUT multimatchtest
{
}

PUT multimatchtest/_mapping/multimatch_test
{
  "properties": {
    "subject": {
      "type": "text"
    },
    "message": {
      "type": "text"
    }
  }
}

PUT multimatchtest/multimatch_test/1
{
  "subject": "this is a multimatch test",
  "message": "blala blalba"
}

PUT multimatchtest/multimatch_test/2
{
  "subject": "blala blalba",
  "message": "this is a multimatch test"
}

GET multimatchtest/multimatch_test/_search
{
  "query": {
    "multi_match": {
      "query": "multimatch",
      "fields": ["subject", "message"]
    }
  }
}

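Next, let's look at how to use the fields parameter.

The fields parameter

Wildcards

Field names in fields may contain the wildcard *. With mess*, matches in the message field are still found.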

GET multimatchtest/multimatch_test/_search
{
  "query": {
    "multi_match": {
      "query": "multimatch",
      "fields": ["subject", "mess*"]
    }
  }
}
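
As a rough mental model of the wildcard expansion, the sketch below uses Python's fnmatch module to expand patterns against a list of mapped field names. The expand_fields function and the hard-coded field list are illustrative assumptions. Elasticsearch performs the real expansion internally against the index mapping.

```python
from fnmatch import fnmatchcase

def expand_fields(patterns, mapped_fields):
    """Return the mapped field names matched by any pattern, in mapping order."""
    return [
        name for name in mapped_fields
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]

# Field names taken from the multimatch_test mapping above.
mapped = ["subject", "message"]
print(expand_fields(["subject", "mess*"], mapped))
```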

Boosting a field

Appending ^ and a number to a field name boosts that field, which raises its contribution to _score. For example, to give the subject field more weight:

GET multimatchtest/multimatch_test/_search
{
  "query": {
    "multi_match": {
      "query": "multimatch",
      "fields": ["subject^3", "mess*"]
    }
  }
}

Documents 1 and 2 contain the same number of multimatch terms, yet the result shows that the document with multimatch in subject scores 3 times as high as the other document.

"hits": {
    "total": 2,
    "max_score": 0.8630463,
    "hits": [
      {
        "_index": "multimatchtest",
        "_type": "multimatch_test",
        "_id": "1",
        "_score": 0.8630463,
        "_source": {
          "subject": "this is a multimatch test",
          "message": "blala blalba"
        }
      },
      {
        "_index": "multimatchtest",
        "_type": "multimatch_test",
        "_id": "2",
        "_score": 0.2876821,
        "_source": {
          "subject": "blala blalba",
          "message": "this is a multimatch test"
        }
      }
    ]
  }
}
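
The factor of 3 can be checked with a little arithmetic. The sketch below splits a field spec such as subject^3 into a name and a boost, then scales the unboosted score 0.2876821 reported for document 2. parse_field is a hypothetical helper, not part of any Elasticsearch client.

```python
def parse_field(spec):
    """Split 'name^boost' into (name, boost); the boost defaults to 1.0."""
    name, sep, boost = spec.partition("^")
    return name, (float(boost) if sep else 1.0)

# _score of document 2, which matches only in the unboosted message field.
unboosted = 0.2876821

name, boost = parse_field("subject^3")
boosted = unboosted * boost

# Compare the result with document 1's reported _score of 0.8630463.
print(name, boost, round(boosted, 7))
```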

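If the fields parameter is omitted, multi_match falls back to the index.query.default_field index setting, which defaults to *, so every eligible field in the document is searched. This is not recommended: it can cause performance problems and is rarely meaningful.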

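Types of multi_match query

How a multi_match query executes internally depends mainly on its type parameter, which accepts the following values:

best_fields (default): returns any document that matches in any field, but uses only the _score of the best field.
most_fields: returns any document that matches in any field and combines the _score of every matching field.
cross_fields: treats fields with the same analyzer as one big field and looks for each term in any of them.
phrase: runs a match_phrase query on each field in fields and uses the _score of the best field.
phrase_prefix: runs a match_phrase_prefix query on each field in fields and uses the _score of the best field.

Let's go through what each type means and how to use it.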

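The best_fields type

To understand best_fields, you first need to know about dis_max.

dis_max: the disjunction max query

dis_max is short for Disjunction Max Query.

Disjunction means "or": the queries on each field of a document are evaluated separately, and each produces its own score.
A Disjunction Max Query returns any document that matches any of its queries, but uses only the score of the best-matching query as the document's score.

Let's look at an example. We overwrite the two documents above: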

PUT multimatchtest/multimatch_test/1
{
  "subject": "food is delicious!",
  "message": "cook food"
}

PUT multimatchtest/multimatch_test/2
{
  "subject": "blabla blala",
  "message": "I like chinese food"
}

Now we search both subject and message for chinese food and see what comes back. (For now we use match queries inside dis_max rather than multi_match.)

GET multimatchtest/multimatch_test/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {
          "match": {
            "subject": "chinese food"
          }
        },
        {
          "match": {
            "message": "chinese food"
          }
        }
      ]
    }
  }
}

The result is:

"hits": {
    "total": 2,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "multimatchtest",
        "_type": "multimatch_test",
        "_id": "2",
        "_score": 0.5753642,
        "_source": {
          "subject": "blabla blala",
          "message": "I like chinese food"
        }
      },
      {
        "_index": "multimatchtest",
        "_type": "multimatch_test",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "subject": "food is delicious!",
          "message": "cook food"
        }
      }
    ]
  }
}

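Both subject and message in document 1 contain food and match, but dis_max scores the two fields separately and keeps only the better score. In document 2 only message matches, but it matches both chinese and food, so that single field scores higher. Document 2 therefore outscores document 1, and this is exactly how the best_fields type computes its score.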
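
In score terms, dis_max is a maximum over the per-field scores. The sketch below uses per-field scores inferred from the results in this article (document 1: 0.2876821 in each field; document 2: 0.5753642 in message and no match in subject). The dictionary layout and the dis_max function are illustrative, not Elasticsearch code.

```python
def dis_max(field_scores):
    """Score of a dis_max query without tie_breaker: the best matching clause wins."""
    matched = [score for score in field_scores.values() if score is not None]
    return max(matched) if matched else None

docs = {
    "1": {"subject": 0.2876821, "message": 0.2876821},
    "2": {"subject": None, "message": 0.5753642},  # None: the field did not match
}

# Rank documents by their dis_max score, highest first.
ranking = sorted(docs, key=lambda doc_id: dis_max(docs[doc_id]), reverse=True)
for doc_id in ranking:
    print(doc_id, dis_max(docs[doc_id]))
```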

best_fields

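best_fields is most useful when you search for several terms that are best found together in the same field, and it is equivalent to the dis_max approach. For example, the dis_max query in the previous section can be written in the form below. best_fields is also the default type of the multi_match query.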

GET multimatchtest/multimatch_test/_search
{
  "query": {
    "multi_match": {
      "query": "chinese food",
      "fields": ["subject", "message"]
    }
  }
}

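With this approach only the best-matching clause counts, and the other clauses contribute nothing to the score. That can be too strict. Is there a compromise in which the other clauses also take part in scoring, but at a discount so that they contribute less? There is: the tie_breaker parameter.

tie_breaker: the advocate for the other clauses

tie_breaker seems to exist to stand up for the other clauses. Here is how it scores:

Take the _score of the best-matching clause, as best_fields does.
Multiply the score of each other matching clause by tie_breaker.
Add the results together to get the final score.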
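
The steps above amount to best + tie_breaker * (sum of the other matching scores). A minimal sketch, with dis_max_score as a hypothetical name:

```python
def dis_max_score(scores, tie_breaker=0.0):
    """Best clause score plus tie_breaker times every other matching clause score."""
    if not scores:
        return 0.0
    best = max(scores)
    return best + tie_breaker * (sum(scores) - best)

# Two clauses that score 0.2876821 each, as for a document whose two fields match equally.
equal_scores = [0.2876821, 0.2876821]
print(dis_max_score(equal_scores))        # tie_breaker 0.0: only the best clause counts
print(dis_max_score(equal_scores, 0.3))   # the other clause adds 30% of its score
print(dis_max_score(equal_scores, 1.0))   # tie_breaker 1.0: plain sum of all clauses
```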

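With tie_breaker, every matching clause is taken into account. It does not upstage the best match, which still dominates, but the other clauses now get a say as well.

Add a tie_breaker parameter to the query from the previous section and look at the result.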

GET multimatchtest/multimatch_test/_search
{
  "query": {
    "multi_match": {
      "query": "chinese food",
      "fields": ["subject", "message"],
      "tie_breaker": 0.3
    }
  }
}

The result:

"hits": {
    "total": 2,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "multimatchtest",
        "_type": "multimatch_test",
        "_id": "2",
        "_score": 0.5753642,
        "_source": {
          "subject": "blabla blala",
          "message": "I like chinese food"
        }
      },
      {
        "_index": "multimatchtest",
        "_type": "multimatch_test",
        "_id": "1",
        "_score": 0.37398672,
        "_source": {
          "subject": "food is delicious!",
          "message": "cook food"
        }
      }
    ]
  }

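Compare this with document 1's score in the previous section. The message and subject fields of document 1 each contain the word "food" exactly once, so the two fields score the same. With tie_breaker set to 0.3, the total is 0.2876821 x 1.3 ≈ 0.373987, which agrees with the returned _score of 0.37398672 up to floating-point rounding.

As mentioned at the start, multi_match is built on the match query, so the parameters of match can also be used with multi_match. See my earlier post on the match query for details.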

most_fields

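most_fields is mainly for cases where several fields contain the same text, for example the same text analyzed in different ways. It combines the scores of all matching fields.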

GET multimatchtest/multimatch_test/_search
{
  "query": {
    "multi_match": {
      "query": "multimatch",
      "fields": ["subject", "message"],
      "type": "most_fields"
    }
  }
}
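
According to the Elasticsearch reference, a most_fields query is executed as a bool query with one match clause per field under should. The sketch below builds that rewritten request body as a Python dict. most_fields_as_bool is a hypothetical helper, and it ignores field boosts and other options.

```python
import json

def most_fields_as_bool(query, fields):
    """Rewrite a most_fields multi_match into a bool query of should clauses."""
    return {
        "query": {
            "bool": {
                "should": [{"match": {field: query}} for field in fields]
            }
        }
    }

body = most_fields_as_bool("multimatch", ["subject", "message"])
print(json.dumps(body, indent=2))
```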

phrase and phrase_prefix

The phrase and phrase_prefix types behave like the best_fields type. The difference is the query run on each field:

phrase uses match_phrase with dis_max
phrase_prefix uses match_phrase_prefix with dis_max
best_fields uses match with dis_max

GET multimatchtest/multimatch_test/_search
{
  "query": {
    "multi_match": {
      "query": "this is",
      "fields": ["subject", "message"],
      "type": "phrase"
    }
  }
}

The query above is equivalent to:

GET multimatchtest/multimatch_test/_search
{
  "query": {
    "dis_max": {
      "queries": [{
        "match_phrase": {
          "subject": "this is"
        }
      },
      {
        "match_phrase": {
          "message": "this is"
        }
      }]
    }
  }
}
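
This rewrite can be expressed generically for the three dis_max-based types. The sketch below maps each type to its per-field query and builds the equivalent dis_max body as a Python dict. as_dis_max and CLAUSE_FOR_TYPE are illustrative names, and boosts and other options are left out.

```python
import json

# Per-field query used by each dis_max-based multi_match type.
CLAUSE_FOR_TYPE = {
    "best_fields": "match",
    "phrase": "match_phrase",
    "phrase_prefix": "match_phrase_prefix",
}

def as_dis_max(query, fields, match_type="best_fields", tie_breaker=None):
    """Build the dis_max body equivalent to a multi_match of the given type."""
    clause = CLAUSE_FOR_TYPE[match_type]
    dis_max = {"queries": [{clause: {field: query}} for field in fields]}
    if tie_breaker is not None:
        dis_max["tie_breaker"] = tie_breaker
    return {"query": {"dis_max": dis_max}}

phrase_body = as_dis_max("this is", ["subject", "message"], match_type="phrase")
print(json.dumps(phrase_body, indent=2))
```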

cross_fields

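Types such as most_fields and best_fields are field-centric. What does that mean? Suppose we search for the string "blabla like" with operator set to and. The whole query string is then searched within each field separately, and a document matches only if a single field contains both terms.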

GET multimatchtest/_search
{
  "query": {
    "multi_match": {
      "query": "blabla like",
      "operator": "and",
      "fields": [ "subject", "message"],
      "type": "best_fields"
    }
  }
}

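The cross_fields type, by contrast, is term-centric. Take the same search for "blabla like" over the subject and message fields. The query string is first analyzed into a list of terms, and each term is then looked up in any of the fields. With operator set to and, every term must appear in at least one field, but different terms may be found in different fields.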

GET multimatchtest/_search
{
  "query": {
    "multi_match": {
      "query": "blabla like",
      "operator": "and",
      "fields": [ "subject", "message"],
      "type": "cross_fields"
    }
  }
}
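
The difference between the two matching strategies can be simulated in a few lines. The sketch below is a simplification under stated assumptions: terms is a crude stand-in for an analyzer (lowercased word tokens), the helper names are hypothetical, and real cross_fields only blends fields that share an analyzer.

```python
import re

def terms(text):
    """Very rough stand-in for an analyzer: the set of lowercased word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def field_centric_and(query, doc, fields):
    """best_fields / most_fields with operator=and: one field must hold every term."""
    wanted = terms(query)
    return any(wanted <= terms(doc[field]) for field in fields)

def term_centric_and(query, doc, fields):
    """cross_fields with operator=and: every term must appear in some field."""
    all_terms = set().union(*(terms(doc[field]) for field in fields))
    return terms(query) <= all_terms

doc2 = {"subject": "blabla blala", "message": "I like chinese food"}
fields = ["subject", "message"]
print(field_centric_and("blabla like", doc2, fields))  # False: no single field has both terms
print(term_centric_and("blabla like", doc2, fields))   # True: blabla in subject, like in message
```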

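Scoring

So how is cross_fields scored?

cross_fields also has a tie_breaker setting, which controls its scoring. The values and their meanings are:

0.0: take the score of the best field as the final score (default)
1.0: add together the scores of all fields
0.0 < n < 1.0: take the best field's score plus n times the score of each other matching field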

GET multimatchtest/_search
{
  "query": {
    "multi_match": {
      "query": "blabla like",
      "fields": [ "subject", "message"],
      "type": "cross_fields",
      "tie_breaker": 0.5
    }
  }
}

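Summary

multi_match is a very commonly used full-text query. It builds on the match query and adds several types to suit multi-field search scenarios. Finally, use the mind map to review what this section covered.

Reference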

https://www.elastic.co/guide/en/elasticsearch/reference/6.3/query-dsl-multi-match-query.html