Elasticsearch Study Notes, Advanced Series (13): Combining match and Proximity Matching to Balance Recall and Precision

Date: 2020-01-02

Recall and precision

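For Elasticsearch, when you use a match query:
Recall is treated here, loosely, as the number of matched documents divided by the total number of documents, so the more documents a query matches, the higher the recall. (Strictly speaking, recall is the share of the relevant documents that the query returns.)
Precision concerns the matched documents: the higher the relevance scores of the documents we actually want, and the nearer the top of the results they appear, the higher the precision.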
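
To make the two terms concrete, here is a minimal sketch that computes them under the standard information-retrieval definitions (recall = relevant documents returned / all relevant documents; precision = relevant documents returned / all documents returned). The function name and the document IDs are hypothetical and do not come from Elasticsearch.

```python
def recall_and_precision(returned_ids, relevant_ids):
    """Return (recall, precision) for one query."""
    returned = set(returned_ids)
    relevant = set(relevant_ids)
    hits = returned & relevant  # relevant documents that were actually returned
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(returned) if returned else 0.0
    return recall, precision


# Hypothetical data: documents 1 and 3 are the ones the user really wants.
relevant = {1, 3}
print(recall_and_precision([1, 2, 3, 4], relevant))  # broad query  -> (1.0, 0.5)
print(recall_and_precision([1], relevant))           # narrow query -> (0.5, 1.0)
```

The broad result set behaves like match (everything relevant is found, along with noise), and the narrow one behaves like match_phrase (little noise, but relevant documents are missed).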

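match and match_phrase

We know that with match (and its default or operator), if the search text is java spark, every document that contains java or spark is returned. Using match alone therefore gives very high recall but low precision.

A match_phrase query requires every term to appear in the document field, and the terms must lie within the distance allowed by slop, before the document matches. If the search text is java spark, documents that contain only java or only spark are not returned; a document that contains both is also not returned if the terms are further apart than slop allows. Precision is high, but recall can be too low: the query may return no documents at all, or too few.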
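
The slop rule can be illustrated with a small sketch. This is a simplified model, not Lucene's actual implementation: each term's document position is shifted back by the term's offset in the phrase, and the phrase is considered a match when the shifted positions lie within slop of each other. Tokenization is naive whitespace splitting, repeated phrase terms are not handled specially, and the function names are hypothetical.

```python
from itertools import product


def min_slop_needed(doc_tokens, phrase_terms):
    """Smallest slop under which the phrase matches, or None if a term is missing."""
    positions = []
    for term in phrase_terms:
        found = [i for i, tok in enumerate(doc_tokens) if tok == term]
        if not found:
            return None  # match_phrase needs every term to be present
        positions.append(found)

    best = None
    # Try every combination of one position per phrase term.
    for combo in product(*positions):
        shifted = [pos - offset for offset, pos in enumerate(combo)]
        spread = max(shifted) - min(shifted)
        if best is None or spread < best:
            best = spread
    return best


def phrase_matches(doc_tokens, phrase_terms, slop=0):
    needed = min_slop_needed(doc_tokens, phrase_terms)
    return needed is not None and needed <= slop


phrase = ["java", "spark"]
print(min_slop_needed("java spark".split(), phrase))                # 0
print(min_slop_needed("spark java".split(), phrase))                # 2
print(min_slop_needed("java is similar to spark".split(), phrase))  # 3
print(min_slop_needed("i think java is the best".split(), phrase))  # None
```

In this model a reversed pair needs a slop of 2, and a document without spark never matches regardless of slop.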

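Balancing recall and precision with match and match_phrase

Sometimes we want a document to be returned when it matches only some of the terms, which raises recall. At the same time we want to use the way match_phrase scores by distance, so that the closer the terms are to each other, the higher the score and the earlier the document is returned. In other words, if the search text is java spark, any document that contains java or spark is returned, but a document that contains both java and spark very close together gets a much higher score and is returned first.

Implementation:

Combine match and match_phrase in a bool query: put match in the must clause so that as many results as possible are matched, and put match_phrase in the should clause to raise the relevance score of the documents we want, so that they are returned first.
Example:
Using match only: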

GET /test_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "test_field": "java spark"
          }
        }
      ]
    }
  }
}

Output:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.031828,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.031828,
        "_source" : {
          "test_field" : "spark is best big data solution based on scala ,an programming language similar to java spark"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.21110919,
        "_source" : {
          "test_field" : "i think java is the best programming language"
        }
      }
    ]
  }
}

Using match_phrase only:

GET /test_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "test_field": {
              "query": "java spark",
              "slop": 10
            }
          }
        }
      ]
    }
  }
}

Output:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.7704125,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.7704125,
        "_source" : {
          "test_field" : "spark is best big data solution based on scala ,an programming language similar to java spark"
        }
      }
    ]
  }
}

Combining match and proximity matching to balance recall and precision:

GET /test_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "test_field": "java spark"
          }
        }
      ],
      "should": [
        {
          "match_phrase": {
            "test_field": {
              "query": "java spark",
              "slop": 10
            }
          }
        }
      ]
    }
  }
}

Output:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.8022406,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.8022406,
        "_source" : {
          "test_field" : "spark is best big data solution based on scala ,an programming language similar to java spark"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.21110919,
        "_source" : {
          "test_field" : "i think java is the best programming language"
        }
      }
    ]
  }
}
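
The combined score of document 1 can be checked by hand. A bool query adds up the scores of its matching must and should clauses, so the sketch below simply sums the scores reported in the two earlier outputs. The dictionaries and the function name are hypothetical; the numbers are copied from the outputs above.

```python
import math

# Scores from the match-only and match_phrase-only outputs above.
match_score = {"1": 1.031828, "2": 0.21110919}
phrase_score = {"1": 0.7704125}  # document 2 does not match the phrase


def bool_score(doc_id):
    # The must clause always contributes; the should clause only when it matches.
    return match_score[doc_id] + phrase_score.get(doc_id, 0.0)


combined = {doc_id: bool_score(doc_id) for doc_id in match_score}
print(combined)
# Document 1 agrees with the reported 1.8022406 up to float rounding.
print(math.isclose(combined["1"], 1.8022406, abs_tol=1e-6))  # True
```

Document 2 keeps its match score of 0.21110919 because the should clause adds nothing for it, which is why it still appears but ranks lower.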

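Using rescoring to improve the performance of proximity-match searches

Differences between match and match_phrase

match: as soon as a single term matches, the documents for that term are returned as results. The query only scans the inverted index and is done once the term is found.
match_phrase: it first has to scan the document lists of all the terms and find the documents that contain every term. Then, for each of those documents, it has to work out the position of each term and check whether the positions fall within the specified range. Deciding whether a document can be matched by moving terms within slop takes more complex computation.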
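
The difference in work can be sketched with a toy positional inverted index. This is only an illustration with naive whitespace tokenization, not how Lucene stores or traverses postings, and all function names are hypothetical. The two documents are the ones used in the examples above.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def match_candidates(index, terms):
    """match (default or): union of the terms' document lists, no positions needed."""
    ids = set()
    for term in terms:
        ids |= set(index.get(term, {}))
    return ids


def phrase_candidates(index, terms):
    """match_phrase, first step: only documents that contain every term."""
    doc_sets = [set(index.get(term, {})) for term in terms]
    return set.intersection(*doc_sets) if doc_sets else set()


docs = {
    "1": "spark is best big data solution based on scala ,an programming "
         "language similar to java spark",
    "2": "i think java is the best programming language",
}
index = build_index(docs)
terms = ["java", "spark"]
print(sorted(match_candidates(index, terms)))   # ['1', '2']
print(sorted(phrase_candidates(index, terms)))  # ['1']
# Second step for match_phrase: the positions of each term must be compared.
print(index["java"]["1"], index["spark"]["1"])  # [14] [0, 15]
```

match is finished after the union, while match_phrase still has to compare the stored positions for every candidate document, which is where the extra cost comes from.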

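Performance comparison

A match query performs much better than match_phrase or a proximity match (match_phrase with slop), because the latter two have to compute the distances between term positions.
As a rough rule of thumb, a match query is about 10 times faster than match_phrase and about 20 times faster than a proximity match (match_phrase with slop).
That said, Elasticsearch performs well and queries generally complete within milliseconds: a match query may take a few milliseconds, while match_phrase and proximity match typically take somewhere between tens and a few hundred milliseconds.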

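Performance optimization

Optimizing match_phrase and proximity match generally comes down to reducing the number of documents on which the proximity match has to run.
The main idea is to use a match query first to filter out the data we need, and then use a proximity match to raise document scores according to term distance. The proximity match is applied only to the top n documents by score on each shard, to adjust their scores; this process is called rescoring. It works because users usually page through results and only look at the first few pages, so there is no need to run the proximity match over every result. In other words, match + proximity match together deliver both recall and precision.

Without rescoring, match might match 1000 documents, and the proximity match has to run over each of them to decide whether it can be matched within slop before contributing its score. In most cases, however, users page through the results and may look at only the first 5 pages of 10 results each, which is 50 documents. The proximity match then only needs to run the slop check over those top 50 documents and contribute its score to them; it does not need to compute and contribute scores for all 1000 documents. The window_size parameter limits the number of documents that are rescored.
Example: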

GET /test_index/_search
{
  "query": {
    "match": {
      "test_field": "java spark"
    }
  },
  "rescore": {
    "query": {
      "rescore_query": {
        "match_phrase": {
          "test_field": {
            "query": "java spark",
            "slop": 10
          }
        }
      }
    },
    "window_size": 50
  }
}

Output:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.8022406,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.8022406,
        "_source" : {
          "test_field" : "spark is best big data solution based on scala ,an programming language similar to java spark"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.21110919,
        "_source" : {
          "test_field" : "i think java is the best programming language"
        }
      }
    ]
  }
}

As you can see, the result is the same as with the bool approach.
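
The effect of window_size can be sketched as follows. This is a simplified single-shard model that assumes the default behaviour of adding the rescore query's score to the original score; documents outside the window keep their first-pass score, and the whole list is then re-sorted. The function name and the a/b data are hypothetical; the first data set reuses the scores from the outputs above.

```python
def rescore(first_pass, rescore_fn, window_size=10):
    """first_pass: list of (doc_id, score) for one shard. Returns the re-sorted list."""
    ranked = sorted(first_pass, key=lambda hit: hit[1], reverse=True)
    out = []
    for rank, (doc_id, score) in enumerate(ranked):
        if rank < window_size:
            # Only the top window_size documents pay for the proximity match.
            out.append((doc_id, score + rescore_fn(doc_id)))
        else:
            out.append((doc_id, score))
    return sorted(out, key=lambda hit: hit[1], reverse=True)


# Scores from the outputs above: the phrase matches only document 1.
phrase = {"1": 0.7704125}
hits = rescore([("1", 1.031828), ("2", 0.21110919)],
               lambda doc_id: phrase.get(doc_id, 0.0), window_size=50)
print([doc_id for doc_id, _ in hits])  # ['1', '2']

# Hypothetical data: a window that is too small misses the phrase match in b.
first_pass = [("a", 1.0), ("b", 0.9)]
bonus = {"b": 0.5}
print([d for d, _ in rescore(first_pass, lambda d: bonus.get(d, 0.0), window_size=1)])  # ['a', 'b']
print([d for d, _ in rescore(first_pass, lambda d: bonus.get(d, 0.0), window_size=2)])  # ['b', 'a']
```

The second data set shows the trade-off: a smaller window does less work, but a document just outside it never receives its proximity bonus, so window_size should cover at least the pages users are expected to view.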