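1. The Elasticsearch scoring formula
Elasticsearch's default scoring formula is Lucene's scoring formula. The computation has two main parts: one scores the query, the other scores the field. The formula given on the official ES site is: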
score(q,d) = queryNorm(q) · coord(q,d) · ∑ ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) ) (t in q)
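queryNorm(q):
Normalizes the query. It does not affect ranking, because the value is the same for every document scored against the same query. In ES, however, this holds only when the index has a single shard; with several shards there are small differences, because each shard computes its own queryNorm value.
queryNorm(q) = 1 / √sumOfSquaredWeights
This is the formula from the ES site. It gives the value when the query boost and every term boost are left at the default of 1, where
sumOfSquaredWeights = idf(t1)*idf(t1) + idf(t2)*idf(t2) + ... + idf(tn)*idf(tn)
and n is the number of terms the query is split into. All of the above assumes boosts of 1; when boosts are set, they enter the computation as well.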
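As a rough illustration of the default-boost case above, the following Python sketch computes queryNorm from a list of idf values. The function names and the sample idf values are made up for this example; they are not ES or Lucene APIs.

```python
import math

def sum_of_squared_weights(idfs):
    # Default-boost case: every query and term boost is assumed to be 1.
    return sum(idf * idf for idf in idfs)

def query_norm(idfs):
    # queryNorm(q) = 1 / sqrt(sumOfSquaredWeights)
    return 1.0 / math.sqrt(sum_of_squared_weights(idfs))

# Hypothetical idf values for a three-term query: 4 + 9 + 36 = 49.
example_idfs = [2.0, 3.0, 6.0]
print(query_norm(example_idfs))
```

Because each shard has its own idf values, running this with per-shard idf lists would give one queryNorm per shard, which is the effect described above.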
coord(q,d) is a coordination factor. Its value is:
coord(q,d)=overlap/maxoverlap
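Here overlap is the number of query terms that the document matches, and maxoverlap is the total number of terms in the query. For example, with the query 「無線通訊」 and the default analyzer (which splits Chinese text into single characters), a document containing 「通知他們開會」 matches only 「通」, so the value is 1/4 = 0.25.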
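The example above can be reproduced with a short Python sketch. It assumes one term per character, which mimics how the default analyzer treats Chinese text; the function name coord is only illustrative.

```python
def coord(query_terms, doc_terms):
    # overlap: query terms that also occur in the document
    doc_set = set(doc_terms)
    overlap = sum(1 for term in query_terms if term in doc_set)
    # maxoverlap: total number of terms in the query
    return overlap / len(query_terms)

# Assumption: one term per character.
query_terms = list("無線通訊")
doc_terms = list("通知他們開會")
print(coord(query_terms, doc_terms))  # 0.25
```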
tf(t in d):
The number of times term t occurs in the document. The formula given on the official site is:
tf(t in d) = √frequency
That is, the square root of the number of occurrences. There is little more to say here; actual scoring works the same way.
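A tiny Python sketch of this square-root damping is shown below. The function name tf is chosen to mirror the formula and is not a library call.

```python
import math

def tf(frequency):
    # tf(t in d) = sqrt(frequency)
    return math.sqrt(frequency)

# Repeated occurrences add less and less to the score.
for frequency in (1, 4, 9):
    print(frequency, tf(frequency))
```

Nine occurrences yield only three times the weight of a single occurrence.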
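idf(t):
This is the inverse document frequency, which reflects how many of the indexed documents contain the term. ES computes it slightly differently from Lucene: the two agree only when the index has a single shard. With more than one shard, each shard has its own idf value. It is computed as follows: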
idf(t) = 1 + log ( numDocs / (docFreq + 1))
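Here log is the natural logarithm (base e), not base 10 or base 2, which is worth noting. numDocs is the total number of documents; with multiple shards it is the number of documents in the current shard. docFreq is the number of documents that match the term, likewise counted within the current shard. This per-shard counting is where the computation differs from plain Lucene. To verify this, set the number of shards to 1.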
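To make the per-shard effect concrete, here is a minimal Python sketch of the idf formula above. The shard names and document counts are invented for illustration and do not come from a real index.

```python
import math

def classic_idf(num_docs, doc_freq):
    # idf(t) = 1 + ln(numDocs / (docFreq + 1)); math.log is the natural log
    return 1 + math.log(num_docs / (doc_freq + 1))

# Hypothetical per-shard statistics: (numDocs, docFreq) for the same term.
shard_stats = {"shard_0": (1000, 9), "shard_1": (1000, 99)}
for shard, (num_docs, doc_freq) in shard_stats.items():
    print(shard, classic_idf(num_docs, doc_freq))
```

The same term gets a different idf on each shard, so identical documents can score differently depending on where they are stored.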
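t.getBoost():
The weight of each term. I have not studied this factor closely; my understanding is that if a boost is set on a field and the match occurs in that field, every term takes the field's boost as its own.
norm(t,d):
The field normalization factor. According to the official explanation, the shorter the field, the more weight a match in it carries. Take a search for 「無線通訊」: one document is very long and contains all of these characters, yet is not what we want; another is very short but contains 「無線通訊」 in full. We should not rate the second one lower just because the phrase occurs only once; on the contrary, it should rank higher. It is computed as follows:
norm(t,d) = d.getBoost() · lengthNorm(f) · f.getBoost()
where d.getBoost() means that the higher the document's boost, the more important the document;
f.getBoost() means that the higher the field's boost, the more important the field;
lengthNorm means that the longer the field, the less important it is, and the shorter, the more important. The official documentation sets all boosts to the default of 1 and gives this formula: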
norm(d) = 1 / √numTerms
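Putting the factors together, the sketch below evaluates the classic formula for a single field with all boosts assumed to be 1. It is only an illustration: the function name, the input layout, and the sample numbers are invented, and it ignores the lossy encoding Lucene applies when it stores norms.

```python
import math

def classic_score(query_terms, num_field_terms):
    """query_terms: list of (frequency_in_doc, idf) pairs, one per query term.

    A frequency of 0 means the term does not occur in the document.
    All boosts are assumed to be 1.
    """
    # queryNorm uses the idf of every query term, matched or not.
    query_norm = 1.0 / math.sqrt(sum(idf * idf for _, idf in query_terms))
    matched = [(freq, idf) for freq, idf in query_terms if freq > 0]
    # coord = overlap / maxoverlap
    coord = len(matched) / len(query_terms)
    # norm(d) = 1 / sqrt(numTerms)
    norm = 1.0 / math.sqrt(num_field_terms)
    # Sum of tf * idf^2 * norm over the matched terms.
    term_sum = sum(math.sqrt(freq) * idf * idf * norm for freq, idf in matched)
    return query_norm * coord * term_sum

# Hypothetical two-term query; only the first term occurs in a 4-term field.
print(classic_score([(1, 3.0), (0, 4.0)], num_field_terms=4))
```

With these numbers queryNorm is 1/5, coord is 1/2, norm is 1/2 and the term sum is 4.5, so the score comes out near 0.45.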
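That is the theory; now for a real example. Note that the explain output below reports idf and tfNorm in their BM25 form (with parameters k1 and b) rather than the classic formula above.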
GET act_shop-2018.01.12/shop/_search
{
  "size": 1,
  "query": {
    "term": {
      "name.keyword": "星巴克"
    }
  },
  "explain": true
}
The result is:
{ "took": 25, "timed_out": false, "_shards": { "total": 150, "successful": 150, "failed": 0 }, "hits": { "total": 127667, "max_score": 15.511484, "hits": [ { "_shard": "[act_shop-2018.01.12][80]", "_node": "6vfIeV95QOK1vAcLdx6CEA", "_index": "act_shop-2018.01.12", "_type": "shop", "_id": "187672", "_score": 15.511484, "_routing": "36341", "_parent": "36341", "_source": { "status": 1, "city": { "id": 2084, "name": "虹口區" }, "update_time": "2017-10-23 15:23:00.329000", "tel": [ "021-65200108" ], "name": "星巴克(涼城店)", "tags": [ "餐飲服務", "咖啡廳", "咖啡廳" ], "tags_enrich": { "name": "美食", "id": 10 }, "id": 187672, "label": "have_act", "create_time": "2017-01-11 14:59:43.950000", "city_enrich": { "region": "華東地區", "name": "上海", "level": 1 }, "address": "車站南路330弄2號、6號第1、二層的4839F01059", "coordinate": { "lat": 31.29496, "lon": 121.475442 }, "brand": { "id": 490, "name": "星巴克" } }, "_explanation": { "value": 15.511484, "description": "sum of:", "details": [ { "value": 15.511484, "description": "sum of:", "details": [ { "value": 4.7601295, "description": "weight(name:星 in 6914) [PerFieldSimilarity], result of:", "details": [ { "value": 4.7601295, "description": "score(doc=6914,freq=1.0 = termFreq=1.0\n), product of:", "details": [ { "value": 4.314013, "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:", "details": [ { "value": 159, "description": "docFreq", "details": [] }, { "value": 11920, "description": "docCount", "details": [] } ] }, { "value": 1.103411, "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:", "details": [ { "value": 1, "description": "termFreq=1.0", "details": [] }, { "value": 1.2, "description": "parameter k1", "details": [] }, { "value": 0.75, "description": "parameter b", "details": [] }, { "value": 9.224329, "description": "avgFieldLength", "details": [] }, { "value": 7.111111, "description": "fieldLength", "details": [] } ] } ] } ] }, { "value": 
5.0423846, "description": "weight(name:巴 in 6914) [PerFieldSimilarity], result of:", "details": [ { "value": 5.0423846, "description": "score(doc=6914,freq=1.0 = termFreq=1.0\n), product of:", "details": [ { "value": 4.5698156, "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:", "details": [ { "value": 123, "description": "docFreq", "details": [] }, { "value": 11920, "description": "docCount", "details": [] } ] }, { "value": 1.103411, "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:", "details": [ { "value": 1, "description": "termFreq=1.0", "details": [] }, { "value": 1.2, "description": "parameter k1", "details": [] }, { "value": 0.75, "description": "parameter b", "details": [] }, { "value": 9.224329, "description": "avgFieldLength", "details": [] }, { "value": 7.111111, "description": "fieldLength", "details": [] } ] } ] } ] }, { "value": 5.70897, "description": "weight(name:克 in 6914) [PerFieldSimilarity], result of:", "details": [ { "value": 5.70897, "description": "score(doc=6914,freq=1.0 = termFreq=1.0\n), product of:", "details": [ { "value": 5.173929, "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:", "details": [ { "value": 67, "description": "docFreq", "details": [] }, { "value": 11920, "description": "docCount", "details": [] } ] }, { "value": 1.103411, "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:", "details": [ { "value": 1, "description": "termFreq=1.0", "details": [] }, { "value": 1.2, "description": "parameter k1", "details": [] }, { "value": 0.75, "description": "parameter b", "details": [] }, { "value": 9.224329, "description": "avgFieldLength", "details": [] }, { "value": 7.111111, "description": "fieldLength", "details": [] } ] } ] } ] } ] }, { "value": 0, "description": "match on required clause, product of:", 
"details": [ { "value": 0, "description": "# clause", "details": [] }, { "value": 1, "description": "_type:shop, product of:", "details": [ { "value": 1, "description": "boost", "details": [] }, { "value": 1, "description": "queryNorm", "details": [] } ] } ] } ] } } ] } }
In detail:
1. Within the shard "_shard": "[act_shop-2018.01.12][80]", the ES standard analyzer splits the query text '星巴克' into three terms: '星', '巴' and '克'. Each term scores:
'星':4.7601295
'巴':5.0423846
'克':5.70897
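Total score: 4.7601295 + 5.0423846 + 5.70897 = 15.511484
2. Next, how each term gets its score, taking '星' as the example:
score('星') = idf * tfNorm (that is, inverse document frequency times term frequency)
idf is computed as follows: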
{ "value": 4.7601295, "description": "score(doc=6914,freq=1.0 = termFreq=1.0\n), product of:", "details": [ { "value": 4.314013, "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:", "details": [ { "value": 159, "description": "docFreq", "details": [] }, { "value": 11920, "description": "docCount", "details": [] } ] }
docFreq: the number of documents in this shard that contain '星': 159
docCount: the total number of documents in this shard: 11920
Formula: log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) = 4.314013
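The idf values in the explain output can be checked with a few lines of Python. This is a sketch that copies the formula text from the explain output; bm25_idf is a name chosen here, not an ES function.

```python
import math

def bm25_idf(doc_count, doc_freq):
    # Formula as printed in the explain output (natural logarithm).
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

DOC_COUNT = 11920  # docCount reported for this shard
# docFreq values reported for each term in the explain output
for term, doc_freq in [("星", 159), ("巴", 123), ("克", 67)]:
    print(term, bm25_idf(DOC_COUNT, doc_freq))
```

The printed values should agree with the explain output (4.314013, 4.5698156, 5.173929) up to single-precision rounding. The rarer the term in the shard, the higher its idf.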
tfNorm is computed as follows:
tf can be read as a measure of the share that occurrences of '星' make up in a given document.
{ "value": 1.103411, "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:", "details": [ { "value": 1, "description": "termFreq=1.0", "details": [] }, { "value": 1.2, "description": "parameter k1", "details": [] }, { "value": 0.75, "description": "parameter b", "details": [] }, { "value": 9.224329, "description": "avgFieldLength", "details": [] }, { "value": 7.111111, "description": "fieldLength", "details": [] } ] }
tfNorm=(freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))=1.103411
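The same check works for tfNorm. The sketch below plugs in the parameters listed in the explain output; the function name and the keyword defaults are choices made for this example.

```python
def bm25_tf_norm(freq, field_length, avg_field_length, k1=1.2, b=0.75):
    # Formula as printed in the explain output.
    length_ratio = field_length / avg_field_length
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * length_ratio))

# Values taken from the explain output for the term '星'.
print(bm25_tf_norm(1.0, 7.111111, 9.224329))

# A field of exactly average length gives 1.0 for a single occurrence.
print(bm25_tf_norm(1.0, 10.0, 10.0))
```

Since this field is shorter than average (7.111111 against 9.224329), the result is slightly above 1, which matches the idea that shorter fields carry more weight.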
So score('星') = idf * tfNorm = 4.314013 * 1.103411 = 4.7601295
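As a final cross-check, this sketch recomputes all three term scores and their sum from the statistics in the explain output. The constant and function names are invented for the example; only the numbers come from the output above.

```python
import math

K1, B = 1.2, 0.75            # parameters k1 and b from the explain output
DOC_COUNT = 11920            # docCount in this shard
FIELD_LENGTH = 7.111111      # fieldLength of the matched document
AVG_FIELD_LENGTH = 9.224329  # avgFieldLength in this shard
DOC_FREQ = {"星": 159, "巴": 123, "克": 67}

def bm25_term_score(doc_freq, freq=1.0):
    idf = math.log(1 + (DOC_COUNT - doc_freq + 0.5) / (doc_freq + 0.5))
    tf_norm = (freq * (K1 + 1)) / (
        freq + K1 * (1 - B + B * FIELD_LENGTH / AVG_FIELD_LENGTH)
    )
    return idf * tf_norm

term_scores = {term: bm25_term_score(df) for term, df in DOC_FREQ.items()}
total = sum(term_scores.values())
for term, score in term_scores.items():
    print(term, score)
print("total", total)
```

The total should land close to the reported _score of 15.511484; small differences in the last digits are expected because ES works in single precision.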