A Brief Look at How Elasticsearch Computes Search Scores

Date: 2020-02-22


Key terms in search score computation

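TF: term frequency, the number of times a term from the analyzed search text occurs in the searched field of a document.
IDF: inverse document frequency, an inverse measure of how many documents contain the term; the fewer documents contain it, the higher the value.
TFNORM: term frequency normalized, i.e. the term frequency adjusted for the length of the field.
BM25: the ranking algorithm Elasticsearch uses by default; its tf component is freq / (freq + k1 * (1 - b + b * dl / avgdl)).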
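The definitions above can be sketched in a few lines of Python. This is only an illustration, not Lucene's implementation: the function names bm25_idf and bm25_tf are made up for this example, the idf formula is the one that appears in the explain output later in this article, and k1 = 1.2, b = 0.75 are the parameter values shown there.

```python
import math


def bm25_idf(n: int, N: int) -> float:
    """idf = log(1 + (N - n + 0.5) / (n + 0.5)).

    n: number of documents containing the term
    N: total number of documents that have the field
    """
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq: float, dl: float, avgdl: float,
            k1: float = 1.2, b: float = 0.75) -> float:
    """tf = freq / (freq + k1 * (1 - b + b * dl / avgdl)).

    freq:  occurrences of the term in the field
    dl:    length of the field in this document (in terms)
    avgdl: average length of the field across documents
    """
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


if __name__ == "__main__":
    # A rarer term (smaller n) gets a larger idf.
    print(bm25_idf(n=2, N=1567), bm25_idf(n=200, N=1567))
    # With the same freq, a longer field gets a smaller tf.
    print(bm25_tf(freq=1, dl=2, avgdl=2.0), bm25_tf(freq=1, dl=3, avgdl=2.0))
```

Note that the tf term saturates: as freq grows it approaches 1 rather than growing without bound, which is what the k1 parameter controls.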

Consider the following two documents:

{
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "321697",
        "_score" : 6.6273837,
        "_source" : {
          "title" : "Steve Jobs"
        }
}

{
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "23706",
        "_score" : 6.0948296,
        "_source" : {
          "title" : "All About Steve"
        }
}

Suppose we run a match query on the title field:

GET /movies/_search
{
  "query": {
    "match": {
      "title": "steve"
    }
  }
}

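The scores show that the first document ranks above the second. The specific reasons are:

TF: the term occurs the same number of times (once) in the searched field of both documents.

IDF: the value is the same for both documents, because it depends on how many documents contain the term, not on the individual document.

TFNORM: here the two differ. The term makes up 1/2 of the first document's title and 1/3 of the second's, so the first document scores higher for this query. This is why ES computes tf in normalized form: freq / (freq + k1 * (1 - b + b * dl / avgdl)).

In the end, the TF computation in ES combines term frequency normalization with BM25.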
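To make the length effect concrete, here is a small sketch that evaluates the tf formula for a two-term title ("Steve Jobs") and a three-term title ("All About Steve"). It assumes k1 = 1.2, b = 0.75 and avgdl = 2.1474154, the values shown in the explain output later in this article for the first document; the helper name normalized_tf is hypothetical. It does not try to reproduce the second document's reported score, since the statistics used for that document are not shown here.

```python
def normalized_tf(freq: float, dl: float, avgdl: float,
                  k1: float = 1.2, b: float = 0.75) -> float:
    """tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


# Assumption: both titles are scored against the same average field length.
AVGDL = 2.1474154

tf_steve_jobs = normalized_tf(freq=1, dl=2, avgdl=AVGDL)       # "Steve Jobs"
tf_all_about_steve = normalized_tf(freq=1, dl=3, avgdl=AVGDL)  # "All About Steve"

print(f"dl=2: tf={tf_steve_jobs:.4f}")
print(f"dl=3: tf={tf_all_about_steve:.4f}")
# Same freq, but the shorter title yields the larger tf.
print(tf_steve_jobs > tf_all_about_steve)
```

With the same freq and idf, the only thing that separates the two documents in this sketch is dl, which is the normalization the paragraph above describes.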

To see exactly how Elasticsearch computed a score, add explain to the request:

GET /movies/_search
{
  // similar to an execution plan (EXPLAIN) in MySQL
  "explain": true, 
  "query": {
    "match": {
      "title": "steve"
    }
  }
}

Result (showing one of the hits):

{
    "_shard": "[movies][1]",
    "_node": "pqNhgutvQfqcLqLEzIDnbQ",
    "_index": "movies",
    "_type": "_doc",
    "_id": "321697",
    "_score": 6.6273837,
    "_source": {
        "overview": "Set backstage at three iconic product launches and ending in 1998 with the unveiling of the iMac, Steve Jobs takes us behind the scenes of the digital revolution to paint an intimate portrait of the brilliant man at its epicenter.",
        "voteAverage": 6.8,
        "keywords": [
            {
                "id": 5565,
                "name": "biography"
            },
            {
                "id": 6104,
                "name": "computer"
            },
            {
                "id": 15300,
                "name": "father daughter relationship"
            },
            {
                "id": 157935,
                "name": "apple computer"
            },
            {
                "id": 161160,
                "name": "steve jobs"
            },
            {
                "id": 185722,
                "name": "based on true events"
            }
        ],
        "releaseDate": "2015-01-01T00:00:00.000Z",
        "runtime": 122,
        "originalLanguage": "en",
        "title": "Steve Jobs",
        "productionCountries": [
            {
                "iso_3166_1": "US",
                "name": "United States of America"
            }
        ],
        "revenue": 34441873,
        "genres": [
            {
                "id": 18,
                "name": "Drama"
            },
            {
                "id": 36,
                "name": "History"
            }
        ],
        "originalTitle": "Steve Jobs",
        "popularity": 53.670525,
        "tagline": "Can a great man be a good man?",
        "spokenLanguages": [
            {
                "iso_639_1": "en",
                "name": "English"
            }
        ],
        "id": 321697,
        "voteCount": 1573,
        "productionCompanies": [
            {
                "name": "Universal Pictures",
                "id": 33
            },
            {
                "name": "Scott Rudin Productions",
                "id": 258
            },
            {
                "name": "Legendary Pictures",
                "id": 923
            },
            {
                "name": "The Mark Gordon Company",
                "id": 1557
            },
            {
                "name": "Management 360",
                "id": 4220
            },
            {
                "name": "Cloud Eight Films",
                "id": 6708
            }
        ],
        "budget": 30000000,
        "homepage": "http://www.stevejobsthefilm.com",
        "status": "Released"
    },
    "_explanation": { ... }
}

The result now contains the following additional data (the "execution plan"):

{
    "_explanation": {
        "value": 6.6273837,
        // weight of the term "steve" in the title field for this document (1526 is the internal document ID, not a document count)
        "description": "weight(title:steve in 1526) [PerFieldSimilarity], result of:",
        "details": [
            {
                // value = idf.value * tf.value * 2.2
                // 6.6273837 = 6.4412656 * 0.46767938 * 2.2
                "value": 6.6273837,
                "description": "score(freq=1.0), product of:",
                "details": [
                    {
                        "value": 2.2,
                        // boost factor; 2.2 here, which corresponds to the default query boost of 1 multiplied by (k1 + 1) with k1 = 1.2
                        "description": "boost",
                        "details": []
                    },
                    {
                        "value": 6.4412656,
                        "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                        "details": [
                            {
                                "value": 2,
                                "description": "n, number of documents containing term",
                                "details": []
                            },
                            {
                                "value": 1567,
                                "description": "N, total number of documents with field",
                                "details": []
                            }
                        ]
                    },
                    {
                        "value": 0.46767938,
                        "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                        "details": [
                            {
                                "value": 1,
                                "description": "freq, occurrences of term within document",
                                "details": []
                            },
                            // k1 and b are the BM25 parameters in freq / (freq + k1 * (1 - b + b * dl / avgdl))
                            {
                                "value": 1.2,
                                "description": "k1, term saturation parameter",
                                "details": []
                            },
                            {
                                "value": 0.75,
                                "description": "b, length normalization parameter",
                                "details": []
                            },
                            // dl and avgdl are where the length normalization comes in
                            {
                                "value": 2,
                                "description": "dl, length of field",
                                "details": []
                            },
                            {
                                "value": 2.1474154,
                                "description": "avgdl, average length of field",
                                "details": []
                            }
                        ]
                    }
                ]
            }
        ]
    }
}
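As a sanity check, the score can be recomputed from the leaf values in the explanation above. The sketch below simply copies those numbers into variables (the variable names are chosen for this example) and multiplies boost, idf and tf. Lucene works with single-precision floats, so the result should be expected to agree with the reported 6.6273837 only to about six significant digits.

```python
import math

# Leaf values copied from the explain output above.
boost = 2.2
n, N = 2, 1567                # docs containing the term / docs with the field
freq, k1, b = 1.0, 1.2, 0.75  # term frequency and BM25 parameters
dl, avgdl = 2, 2.1474154      # field length / average field length

# idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))
idf = math.log(1 + (N - n + 0.5) / (n + 0.5))

# tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl))
tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))

# score(freq=1.0) is the product of the three components.
score = boost * idf * tf

print(f"idf={idf:.7f}  tf={tf:.8f}  score={score:.7f}")
```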