Elasticsearch from Basics to Depth (10), Search Engine: TF/IDF Relevance Scoring, the doc values Forward Index, Demystifying the Query and Fetch Phases, the Bouncing Results Problem, Based on sc

Date: 2019-11-06



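Relevance scoring: the TF/IDF algorithm

Elasticsearch's relevance score is based on the term frequency/inverse document frequency algorithm, TF/IDF for short. Since Elasticsearch 5.0 the default similarity is BM25, a refinement of TF/IDF, which is why the explain output further down lists the BM25 parameters k1 and b.

Algorithm overview: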

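Relevance score: put simply, it measures how closely a text in the index matches the search text.
TF/IDF: the algorithm has two parts, TF and IDF.
Term frequency (TF): how many times each term of the search text occurs in the field text. The more occurrences, the more relevant the document.
For example:
Search request: hello world
doc1: hello you, and world is very good
doc2: hello, how are you
By TF, doc1 is more relevant than doc2: doc1 matches both search terms, doc2 only one.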
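Inverse document frequency (IDF): how often each term of the search text occurs across all documents in the index. The more common the term, the less it contributes to relevance.
Search request: hello world
doc1: hello, today is very good.
doc2: hi world, how are you.
Suppose the index holds 10,000 documents, hello occurs 1,000 times across them and world occurs 100 times. world is the rarer term and therefore carries more weight, so by IDF doc2 is more relevant than doc1.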
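Field-length norm: the longer the field, the weaker the relevance of a match in it.
Search request: hello world
doc1: {"title": "hello article", "content": "10,000 words"}
doc2: {"title": "my article", "content": "10,000 words, hi world"}
Here hello and world occur equally often in the index. By the field-length norm, however, doc1 is more relevant than doc2, because its match is in the much shorter title field.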
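The three factors above can be sketched in a few lines of Python. The formulas are those of Lucene's classic TF/IDF similarity (square root of the frequency, 1 + ln(numDocs / (docFreq + 1)), and 1 / sqrt(numTerms)); reading the article's occurrence counts as document frequencies is an assumption made for illustration, and the function names are hypothetical.

```python
import math


def tf(freq: int) -> float:
    """Term frequency factor: grows with the number of occurrences in the field."""
    return math.sqrt(freq)


def idf(num_docs: int, doc_freq: int) -> float:
    """Inverse document frequency: terms found in fewer documents weigh more."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_norm(num_terms: int) -> float:
    """Field-length norm: a match in a shorter field weighs more."""
    return 1.0 / math.sqrt(num_terms)


# IDF example: 10,000 documents, "hello" in 1,000 of them, "world" in 100
# (assumption: the counts in the text are treated as document frequencies).
idf_hello = idf(10_000, 1_000)
idf_world = idf(10_000, 100)

# Field-length norm example: a 2-term title versus a 10,000-term content field.
norm_title = field_norm(2)
norm_content = field_norm(10_000)

print(f"idf(hello)={idf_hello:.4f}  idf(world)={idf_world:.4f}")
print(f"norm(title)={norm_title:.4f}  norm(content)={norm_content:.4f}")
```

The rarer term (world) gets the larger IDF, and the short title field gets the larger norm, which matches the orderings described in the examples.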

How _score is computed

GET /test_index/test_type/_search?explain
{
  "query": {
    "match": {
      "test_field": "test hello"
    }
  }
}

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.843298,
    "hits": [
      {
        "_shard": "[test_index][2]",
        "_node": "1LdqLFqxQQq4xg2MphI_gw",
        "_index": "test_index",
        "_type": "test_type",
        "_id": "6",
        "_score": 0.843298,
        "_source": {
          "test_field": "test test"
        },
        "_explanation": {
          "value": 0.843298,
          "description": "sum of:",
          "details": [
            {
              "value": 0.843298,
              "description": "sum of:",
              "details": [
                {
                  "value": 0.843298,
                  "description": "weight(test_field:test in 0) [PerFieldSimilarity], result of:",
                  "details": [
                    {
                      "value": 0.843298,
                      "description": "score(doc=0,freq=2.0 = termFreq=2.0\n), product of:",
                      "details": [
                        {
                          "value": 0.6931472,
                          "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                          "details": [
                            {
                              "value": 2,
                              "description": "docFreq",
                              "details": []
                            },
                            {
                              "value": 4,
                              "description": "docCount",
                              "details": []
                            }
                          ]
                        },
                        {
                          "value": 1.2166219,
                          "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                          "details": [
                            {
                              "value": 2,
                              "description": "termFreq=2.0",
                              "details": []
                            },
                            {
                              "value": 1.2,
                              "description": "parameter k1",
                              "details": []
                            },
                            {
                              "value": 0.75,
                              "description": "parameter b",
                              "details": []
                            },
                            {
                              "value": 1.75,
                              "description": "avgFieldLength",
                              "details": []
                            },
                            {
                              "value": 2.56,
                              "description": "fieldLength",
                              "details": []
                            }
                          ]
                        }
                      ]
                    }
                  ]
                }
              ]
            },
            {
              "value": 0,
              "description": "match on required clause, product of:",
              "details": [
                {
                  "value": 0,
                  "description": "# clause",
                  "details": []
                },
                {
                  "value": 1,
                  "description": "_type:test_type, product of:",
                  "details": [
                    {
                      "value": 1,
                      "description": "boost",
                      "details": []
                    },
                    {
                      "value": 1,
                      "description": "queryNorm",
                      "details": []
                    }
                  ]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard": "[test_index][1]",
        "_node": "1LdqLFqxQQq4xg2MphI_gw",
        "_index": "test_index",
        "_type": "test_type",
        "_id": "8",
        "_score": 0.43445712,
        "_source": {
          "test_field": "test client 2"
        },
        "_explanation": {
          "value": 0.43445715,
          "description": "sum of:",
          "details": [
            {
              "value": 0.43445715,
              "description": "sum of:",
              "details": [
                {
                  "value": 0.43445715,
                  "description": "weight(test_field:test in 0) [PerFieldSimilarity], result of:",
                  "details": [
                    {
                      "value": 0.43445715,
                      "description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
                      "details": [
                        {
                          "value": 0.47000363,
                          "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                          "details": [
                            {
                              "value": 2,
                              "description": "docFreq",
                              "details": []
                            },
                            {
                              "value": 3,
                              "description": "docCount",
                              "details": []
                            }
                          ]
                        },
                        {
                          "value": 0.92436975,
                          "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                          "details": [
                            {
                              "value": 1,
                              "description": "termFreq=1.0",
                              "details": []
                            },
                            {
                              "value": 1.2,
                              "description": "parameter k1",
                              "details": []
                            },
                            {
                              "value": 0.75,
                              "description": "parameter b",
                              "details": []
                            },
                            {
                              "value": 3.3333333,
                              "description": "avgFieldLength",
                              "details": []
                            },
                            {
                              "value": 4,
                              "description": "fieldLength",
                              "details": []
                            }
                          ]
                        }
                      ]
                    }
                  ]
                }
              ]
            },
            {
              "value": 0,
              "description": "match on required clause, product of:",
              "details": [
                {
                  "value": 0,
                  "description": "# clause",
                  "details": []
                },
                {
                  "value": 1,
                  "description": "_type:test_type, product of:",
                  "details": [
                    {
                      "value": 1,
                      "description": "boost",
                      "details": []
                    },
                    {
                      "value": 1,
                      "description": "queryNorm",
                      "details": []
                    }
                  ]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard": "[test_index][3]",
        "_node": "1LdqLFqxQQq4xg2MphI_gw",
        "_index": "test_index",
        "_type": "test_type",
        "_id": "7",
        "_score": 0.25316024,
        "_source": {
          "test_field": "test client 1"
        },
        "_explanation": {
          "value": 0.25316024,
          "description": "sum of:",
          "details": [
            {
              "value": 0.25316024,
              "description": "sum of:",
              "details": [
                {
                  "value": 0.25316024,
                  "description": "weight(test_field:test in 0) [PerFieldSimilarity], result of:",
                  "details": [
                    {
                      "value": 0.25316024,
                      "description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
                      "details": [
                        {
                          "value": 0.2876821,
                          "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                          "details": [
                            {
                              "value": 1,
                              "description": "docFreq",
                              "details": []
                            },
                            {
                              "value": 1,
                              "description": "docCount",
                              "details": []
                            }
                          ]
                        },
                        {
                          "value": 0.88,
                          "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                          "details": [
                            {
                              "value": 1,
                              "description": "termFreq=1.0",
                              "details": []
                            },
                            {
                              "value": 1.2,
                              "description": "parameter k1",
                              "details": []
                            },
                            {
                              "value": 0.75,
                              "description": "parameter b",
                              "details": []
                            },
                            {
                              "value": 3,
                              "description": "avgFieldLength",
                              "details": []
                            },
                            {
                              "value": 4,
                              "description": "fieldLength",
                              "details": []
                            }
                          ]
                        }
                      ]
                    }
                  ]
                }
              ]
            },
            {
              "value": 0,
              "description": "match on required clause, product of:",
              "details": [
                {
                  "value": 0,
                  "description": "# clause",
                  "details": []
                },
                {
                  "value": 1,
                  "description": "*:*, product of:",
                  "details": [
                    {
                      "value": 1,
                      "description": "boost",
                      "details": []
                    },
                    {
                      "value": 1,
                      "description": "queryNorm",
                      "details": []
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}
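To make the explain output easier to read, the following sketch recomputes the three scores from the idf and tfNorm formulas printed in the description fields. The input numbers are copied from the response above; the function names are hypothetical.

```python
import math


def bm25_idf(doc_count: float, doc_freq: float) -> float:
    """idf as printed in the explain output: log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_tf_norm(freq, field_length, avg_field_length, k1=1.2, b=0.75) -> float:
    """tfNorm as printed in the explain output."""
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * field_length / avg_field_length))


# (_id, docFreq, docCount, termFreq, fieldLength, avgFieldLength) from the response
hits = [
    ("6", 2, 4, 2, 2.56, 1.75),
    ("8", 2, 3, 1, 4, 3.3333333),
    ("7", 1, 1, 1, 4, 3),
]

scores = {}
for doc_id, doc_freq, doc_count, freq, field_len, avg_len in hits:
    # The score of the single matching term is the product idf * tfNorm.
    scores[doc_id] = bm25_idf(doc_count, doc_freq) * bm25_tf_norm(freq, field_len, avg_len)
    print(doc_id, round(scores[doc_id], 6))
```

Note that docCount and docFreq differ between the three hits. Each hit comes from a different shard (see the _shard field), and by default these statistics are local to the shard, which is the inaccuracy that dfs_query_then_fetch is meant to reduce.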


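doc values: the forward index

Searching relies on the inverted index, but sorting relies on the forward index. Once the inverted index has identified the matching documents, Elasticsearch looks up the field values of each document in order to sort them. In Elasticsearch this forward index is called doc values.

When Elasticsearch indexes a document it builds the inverted index, used for searching, and the forward index (doc values), used for sorting, aggregations, filtering, and similar operations.

doc values are stored on disk. If there is enough free memory, the operating system caches them automatically, so performance stays high. If memory runs short, the operating system drops them from its cache and they have to be read from disk again.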

The following example illustrates the inverted index and the forward index.
Suppose an index holds two documents:

doc1 : hello world you and me
doc2 : hi world, how are you

Building the inverted index:

word    doc1    doc2
hello    *
world    *        *
you      *        *
and      *
me       *
hi                *
how               *
are               *

Suppose an index holds two documents:

doc1: {"name": "jack", "age": 27}
doc2: {"name": "tom", "age": 30}

Building the forward index:

document    name    age
doc1        jack    27
doc2        tom     30
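The two structures above can be mimicked with plain dictionaries. This is only a toy sketch: the helper names are hypothetical, the tokenizer is a simple regular expression, and the same document ids are reused across both examples purely so that a search result can be sorted by a field value.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted ids of the documents that contain it."""
    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"\w+", text.lower()):
            inverted[term].add(doc_id)
    return {term: sorted(ids) for term, ids in inverted.items()}


def build_forward_index(docs, field):
    """Map each document id to its value for one field (one column, like doc values)."""
    return {doc_id: source[field] for doc_id, source in docs.items()}


text_docs = {"doc1": "hello world you and me", "doc2": "hi world, how are you"}
inverted = build_inverted_index(text_docs)

json_docs = {"doc1": {"name": "jack", "age": 27}, "doc2": {"name": "tom", "age": 30}}
age_column = build_forward_index(json_docs, "age")

# Searching uses the inverted index; sorting the matches uses the forward index.
matching = inverted["world"]
by_age_desc = sorted(matching, key=lambda doc_id: age_column[doc_id], reverse=True)
print(matching, by_age_desc)
```

The lookup direction is the point: the inverted index answers "which documents contain this term", while the forward index answers "what value does this document have", which is what a sort or an aggregation needs.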

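Demystifying the query and fetch phases

Query phase

Basic principle:

The search request is sent to some node, which acts as the coordinating node (coordinate node). It builds a priority queue whose length is from + size, as given by the paging parameters; with the defaults (from = 0, size = 10) that is 10.
The coordinating node forwards the request to every shard of the index (the primary or one of its replicas). Each shard runs the search locally and builds its own local priority queue.
Each shard returns its priority queue to the coordinating node, which merges them into a global priority queue.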

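Fetch phase

Basic principle:

Once the coordinating node has built its priority queue, it sends mget requests to the relevant shards to fetch the corresponding documents.
Each shard returns the documents to the coordinating node.
The coordinating node merges the documents and returns the result to the client.

In other words, the query phase uses priority queues to decide which documents make up the search result, and the fetch phase then retrieves them.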
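The two phases can be simulated in memory. This is a simplified sketch, not how Elasticsearch is implemented: shards are plain dictionaries, the function names are hypothetical, and heapq.nlargest stands in for the priority queues.

```python
import heapq


def shard_top_k(shard, k):
    """Local priority queue: the shard's k best (score, doc_id) pairs, without _source."""
    return heapq.nlargest(k, ((doc["score"], doc_id) for doc_id, doc in shard.items()))


def query_phase(shards, start, size):
    """Collect start + size entries from every shard, merge them, keep the requested page."""
    k = start + size
    candidates = []
    for shard_no, shard in enumerate(shards):
        for score, doc_id in shard_top_k(shard, k):
            candidates.append((score, doc_id, shard_no))
    ranked = heapq.nlargest(k, candidates)  # global priority queue, best first
    return ranked[start:start + size]


def fetch_phase(shards, page):
    """Fetch the full documents only for the entries that made it onto the page."""
    return [shards[shard_no][doc_id]["source"] for _, doc_id, shard_no in page]


shards = [
    {"a": {"score": 0.9, "source": {"title": "A"}}, "b": {"score": 0.2, "source": {"title": "B"}}},
    {"c": {"score": 0.7, "source": {"title": "C"}}, "d": {"score": 0.4, "source": {"title": "D"}}},
    {"e": {"score": 0.8, "source": {"title": "E"}}, "f": {"score": 0.1, "source": {"title": "F"}}},
]

first_page = fetch_phase(shards, query_phase(shards, start=0, size=2))
second_page = fetch_phase(shards, query_phase(shards, start=2, size=2))
print(first_page, second_page)
```

Only ids and scores travel in the query phase; the document bodies are read in the fetch phase, and only for the hits that are actually returned.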

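Example

Suppose there are 60,000 documents in total spread over three primary shards, 20,000 per shard, with 10 results per page. To reach page 1001 (from = 10000, size = 10) you need hits 10001~10010, so a priority queue of size 10010 is built.

Do not read this as each shard returning only 10 entries. That is wrong!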
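The arithmetic behind this warning is short enough to write down. The sketch below assumes hits 10001~10010 are requested with from = 10000 and size = 10 on three shards; the function name is hypothetical.

```python
def deep_paging_cost(num_shards: int, start: int, size: int):
    """Return (entries each shard must return, entries the coordinating node receives)."""
    per_shard = start + size          # every shard returns its top from + size entries
    total = num_shards * per_shard    # all of them arrive at the coordinating node
    return per_shard, total


per_shard, total = deep_paging_cost(num_shards=3, start=10_000, size=10)
print(per_shard, total)
```

Only 10 of those entries end up in the response, which is why the cost of from + size paging grows with the page depth rather than with the page size.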

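A more detailed analysis:
The request may first hit a node that holds no shard of this index. That node becomes the coordinating node and forwards the search request to the nodes holding the index's three shards. In the case above, fetching hits 10001~10010 out of 60,000 documents, a shard cannot return just its own entries 10001~10010: it has to return its top 10010 entries out of its 20,000 documents (document IDs and sort values such as _score, not the full documents). With each of the 3 shards returning 10010 entries, the coordinating node receives 30030 entries in total. It puts them into a priority queue, sorts them by the _score relevance score, and takes entries 10001~10010, which are the 10 results of the requested page.
As shown in the following figure: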

The Bouncing Results problem

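Imagine two documents with the same value in their timestamp field, and search results sorted by that field. Because search requests are round-robined between all available shard copies, the two documents may come back in one order when the primary shard serves the request and in the other order when a replica shard serves it.

This is the bouncing results problem: every time the user refreshes the page, the results appear in a different order. It can be avoided by always sending the same user to the same shards, which is done by setting the preference parameter to an arbitrary fixed string such as the user's session ID.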

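The preference parameter controls which shards or nodes handle the search request. It accepts values such as _primary, _primary_first, _local, _only_node:xyz, _prefer_node:xyz, and _shards:2,3, which are explained in detail on the search preference documentation page.
The most generally useful value, however, is an arbitrary string, which avoids the bouncing results problem.
timeout: covered earlier. It caps the time spent on a search and returns whatever partial results have been collected when the limit is reached, so that a query cannot run for too long.
routing: document routing. By default a document is routed by its _id; with routing=user_id, all data belonging to one user goes to the same shard.
search_type: the default is query_then_fetch. dfs_query_then_fetch can improve the accuracy of relevance sorting.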
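As a small illustration of how these parameters are passed, the sketch below builds a search URL that pins a user to the same shard copies through preference. The helper name, the host, and the session id are assumptions made for the example; only the query-string parameter names come from the text above.

```python
from urllib.parse import urlencode


def search_url(base_url, index, session_id, timeout=None, routing=None, search_type=None):
    """Build a _search URL whose preference is the caller's session id."""
    # A custom preference string should not start with "_", since that prefix
    # is used by the built-in values such as _local.
    params = {"preference": session_id}
    if timeout is not None:
        params["timeout"] = timeout
    if routing is not None:
        params["routing"] = routing
    if search_type is not None:
        params["search_type"] = search_type
    return f"{base_url}/{index}/_search?{urlencode(params)}"


url = search_url("http://localhost:9200", "test_index", "session-abc123", timeout="1s")
print(url)
```

Because the same session id always yields the same preference value, repeated requests from one user are served by the same shard copies and ties are resolved in the same order.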

Scrolling through large amounts of data with scroll

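In practice, paging with from + size inevitably runs into the deep-paging bottleneck, and scroll is a good way around it. For example, fetching 100,000 documents in one go with from + size performs very poorly because the priority queue becomes very large. A scroll search instead retrieves the data batch by batch until everything has been read.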

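How scroll works

The first request of a scroll search saves a snapshot of the index as it is at that moment. All later requests are served from that snapshot, so changes made to the data in the meantime are not visible to the user. Sorting by _doc lets Elasticsearch return documents in index order without scoring them, which keeps performance high.
Example: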

# Start a scroll search
POST /test_index/_search?scroll=1m
{
  "query": {
    "match_all": {}
  },
  "sort": [
    "_doc"
  ],
  "size": 3
}

The response contains the scroll_id:

{
  "_scroll_id": "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAI-sFjFMZHFMRnF4UVFxNHhnMk1waElfZ3cAAAAAAACPqxYxTGRxTEZxeFFRcTR4ZzJNcGhJX2d3AAAAAAAAj68WMUxkcUxGcXhRUXE0eGcyTXBoSV9ndwAAAAAAAI-tFjFMZHFMRnF4UVFxNHhnMk1waElfZ3cAAAAAAACPrhYxTGRxTEZxeFFRcTR4ZzJNcGhJX2d3",
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 12,
    "max_score": null,
    "hits": [
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "AWypxxLYFCl_S-ox4wvd",
        "_score": null,
        "_source": {
          "test_content": "my test"
        },
        "sort": [
          0
        ]
      },
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "6",
        "_score": null,
        "_source": {
          "test_field": "test test"
        },
        "sort": [
          0
        ]
      },
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "7",
        "_score": null,
        "_source": {
          "test_field": "test client 1"
        },
        "sort": [
          0
        ]
      }
    ]
  }
}


Scrolling

# Fetch the next batch with the scroll_id
POST _search/scroll
{
  "scroll":"1m",
  "scroll_id":"DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAJDMFjFMZHFMRnF4UVFxNHhnMk1waElfZ3cAAAAAAACQzRYxTGRxTEZxeFFRcTR4ZzJNcGhJX2d3AAAAAAAAkM8WMUxkcUxGcXhRUXE0eGcyTXBoSV9ndwAAAAAAAJDOFjFMZHFMRnF4UVFxNHhnMk1waElfZ3cAAAAAAACQ0BYxTGRxTEZxeFFRcTR4ZzJNcGhJX2d3"
}

Search result

{
  "_scroll_id": "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAJDMFjFMZHFMRnF4UVFxNHhnMk1waElfZ3cAAAAAAACQzRYxTGRxTEZxeFFRcTR4ZzJNcGhJX2d3AAAAAAAAkM8WMUxkcUxGcXhRUXE0eGcyTXBoSV9ndwAAAAAAAJDOFjFMZHFMRnF4UVFxNHhnMk1waElfZ3cAAAAAAACQ0BYxTGRxTEZxeFFRcTR4ZzJNcGhJX2d3",
  "took": 1,
  "timed_out": false,
  "terminated_early": true,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 12,
    "max_score": null,
    "hits": [
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "11",
        "_score": null,
        "_source": {
          "num": 0,
          "tags": []
        },
        "sort": [
          0
        ]
      },
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "8",
        "_score": null,
        "_source": {
          "test_field": "test client 2"
        },
        "sort": [
          1
        ]
      },
      {
        "_index": "test_index",
        "_type": "test_type",
        "_id": "4",
        "_score": null,
        "_source": {
          "test_field": "test4"
        },
        "sort": [
          1
        ]
      }
    ]
  }
}
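The two requests above are normally repeated in a loop until a batch comes back empty. The following is a minimal sketch of that loop, assuming a hypothetical send(method, path, body) callable that performs the HTTP request and returns the parsed JSON; a canned fake_send replaces the cluster here so that the example runs on its own.

```python
from typing import Callable, Iterator

Send = Callable[[str, str, dict], dict]  # hypothetical transport: (method, path, body) -> JSON


def scroll_all(send: Send, index: str, size: int = 3, keep_alive: str = "1m") -> Iterator[dict]:
    """Yield every hit of a match_all scroll search, batch by batch."""
    resp = send("POST", f"/{index}/_search?scroll={keep_alive}",
                {"query": {"match_all": {}}, "sort": ["_doc"], "size": size})
    try:
        while resp["hits"]["hits"]:  # an empty batch means the scroll is exhausted
            yield from resp["hits"]["hits"]
            # Always continue with the most recently returned scroll_id.
            resp = send("POST", "/_search/scroll",
                        {"scroll": keep_alive, "scroll_id": resp["_scroll_id"]})
    finally:
        # Release the search context instead of waiting for keep_alive to expire.
        send("DELETE", "/_search/scroll", {"scroll_id": resp["_scroll_id"]})


def make_fake_send(batches):
    """Stand-in for an HTTP client: replays canned batches, then empty ones."""
    remaining = list(batches)
    calls = []

    def send(method, path, body):
        calls.append((method, path))
        if method == "DELETE":
            return {"succeeded": True}
        hits = remaining.pop(0) if remaining else []
        return {"_scroll_id": "fake-scroll-id", "hits": {"hits": hits}}

    return send, calls


fake_send, calls = make_fake_send([[{"_id": "6"}, {"_id": "7"}], [{"_id": "8"}]])
ids = [hit["_id"] for hit in scroll_all(fake_send, "test_index", size=2)]
print(ids, calls)
```

Against a real cluster, send would wrap whatever HTTP client is in use; the loop itself stays the same.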