Ignoring TF-IDF scoring in ES: using constant_score

Date: 2019-11-20

Ignoring TF/IDF

Sometimes we just don’t care about TF/IDF. All we want to know is that a certain word appears in a field. Perhaps we are searching for a vacation home and we want to find houses that have as many of these features as possible:

WiFi
Garden
Pool

The vacation home documents look something like this:

{ "description": "A delightful four-bedroomed house with ... " }

We could use a simple match query:

GET /_search
{
  "query": {
    "match": {
      "description": "wifi garden pool"
    }
  }
}

However, this isn’t really full-text search. In this case, TF/IDF just gets in the way. We don’t care whether wifi is a common term, or how often it appears in the document. All we care about is that it does appear. In fact, we just want to rank houses by the number of features they have: the more, the better. If a feature is present, it should score 1, and if it isn’t, 0.

constant_score Query

Enter the constant_score query. This query can wrap either a query or a filter, and assigns a score of 1 to any documents that match, regardless of TF/IDF:

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "constant_score": { "query": { "match": { "description": "wifi" }} }},
        { "constant_score": { "query": { "match": { "description": "garden" }} }},
        { "constant_score": { "query": { "match": { "description": "pool" }} }}
      ]
    }
  }
}
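The three clauses above differ only in the feature word, so the request body can be generated instead of typed by hand. The following is a minimal Python sketch that builds the same body as a plain dictionary. The helper name build_feature_query is hypothetical, and nothing here contacts a cluster.

```python
import json


def build_feature_query(field, features):
    """Build a bool/should body with one constant_score clause per feature.

    Hypothetical helper for illustration only. The inner "query" key
    mirrors the request shown above; this is an assumption tied to the
    Elasticsearch version the guide covers, and newer releases may
    expect "filter" there instead.
    """
    should = [
        {"constant_score": {"query": {"match": {field: feature}}}}
        for feature in features
    ]
    return {"query": {"bool": {"should": should}}}


body = build_feature_query("description", ["wifi", "garden", "pool"])
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted after GET /_search. Adding a fourth feature then means adding one string to the list rather than copying a clause.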

Perhaps not all features are equally important; some have more value to the user than others. If the most important feature is the pool, we could boost that clause to make it count for more:

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "constant_score": { "query": { "match": { "description": "wifi" }} }},
        { "constant_score": { "query": { "match": { "description": "garden" }} }},
        { "constant_score": { "boost": 2, "query": { "match": { "description": "pool" }} }}
      ]
    }
  }
}

A matching pool clause would add a score of 2, while the other clauses would add a score of only 1 each.
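To make the arithmetic concrete, here is a small Python sketch that adds up the per-clause contributions for a few made-up descriptions. The sample homes and the feature_score helper are invented for illustration. Matching is simplified to whole-word lookup, and only the raw clause contributions are summed, so these are not the final scores Elasticsearch would return.

```python
def feature_score(description, boosts):
    """Sum the boost of every feature word found in the description.

    Simplification: a feature "matches" when its lowercase word appears
    among the whitespace-separated words of the description. Real
    analysis in Elasticsearch is more involved.
    """
    words = set(description.lower().split())
    return sum(boost for feature, boost in boosts.items() if feature in words)


# Boosts taken from the query above: pool counts double.
boosts = {"wifi": 1, "garden": 1, "pool": 2}

# Hypothetical documents, not from the original article.
homes = {
    "home_a": "Cottage with wifi and a garden",
    "home_b": "Flat with a pool",
    "home_c": "House with wifi garden and pool",
}

scores = {name: feature_score(text, boosts) for name, text in homes.items()}
for name, score in scores.items():
    print(name, score)
```

In this sketch a home with only a pool contributes as much as a home with both wifi and a garden, which is the effect of boosting the pool clause to 2.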

The final score for each result is not simply the sum of the scores of all matching clauses. The coordination factor and query normalization factor are still taken into account.
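A rough Python sketch of how those two factors could change the numbers follows. It assumes the classic Lucene scoring model used by the Elasticsearch versions the guide covers: the coordination factor is taken as matched clauses divided by total clauses, and the query normalization factor as one over the square root of the sum of squared clause boosts. Both formulas are assumptions of this sketch, and the function name is hypothetical. Actual scores depend on the version and the similarity in use.

```python
import math


def classic_bool_score(all_boosts, matched):
    """Approximate the score of a bool/should of constant_score clauses.

    Assumed model (classic Lucene scoring, simplified):
      query_norm = 1 / sqrt(sum of squared clause boosts)
      coord      = matched clauses / total clauses
      score      = coord * query_norm * sum(boosts of matched clauses)
    """
    query_norm = 1.0 / math.sqrt(sum(b * b for b in all_boosts.values()))
    coord = len(matched) / len(all_boosts)
    return coord * query_norm * sum(all_boosts[f] for f in matched)


boosts = {"wifi": 1, "garden": 1, "pool": 2}

# Both documents have a plain clause sum of 2, but they match a
# different number of clauses, so coord separates them.
pool_only = classic_bool_score(boosts, ["pool"])
wifi_garden = classic_bool_score(boosts, ["wifi", "garden"])
print(pool_only, wifi_garden)
```

Under these assumptions the document matching two clauses outranks the one matching only the boosted pool clause, even though their plain sums are equal. The query normalization factor scales every score by the same constant and does not change the ordering.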

We could improve our vacation home documents by adding a not_analyzed features field:

{ "features": [ "wifi", "pool", "garden" ] }

What is the benefit of rewriting it this way? Does it save index space?
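One possible reading of this change: with features stored as exact, unanalyzed values, each clause can become an exact term lookup instead of a full-text match on the description. The Python sketch below builds such a request body. Wrapping a term filter in constant_score is an assumption of this sketch, not something the original text spells out, and build_term_query is a hypothetical helper. It makes no claim about index size.

```python
import json


def build_term_query(features, boosts=None):
    """Build constant_score clauses that use term filters on `features`.

    Hypothetical helper; `boosts` optionally maps a feature to a boost.
    The values must match the stored terms exactly (for example "wifi",
    not "WiFi"), because a not_analyzed field is not lowercased.
    """
    boosts = boosts or {}
    should = []
    for feature in features:
        clause = {"filter": {"term": {"features": feature}}}
        if feature in boosts:
            clause["boost"] = boosts[feature]
        should.append({"constant_score": clause})
    return {"query": {"bool": {"should": should}}}


term_body = build_term_query(["wifi", "garden", "pool"], boosts={"pool": 2})
print(json.dumps(term_body, indent=2))
```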

Reference: https://www.elastic.co/guide/en/elasticsearch/guide/current/ignoring-tfidf.html#ignoring-tfidf
