Elasticsearch Field Options Norms

Date: 2019-11-05

Tags: elasticsearch, field, options, norms. Category: log analysis

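The role of the norms option when defining Elasticsearch fields

This article describes what the norms parameter does for two Elasticsearch field types, text and keyword.

When creating an ES index, you usually specify two kinds of configuration: settings and mappings. Settings relate to data storage (how many shards, how many replicas), while mappings are the data model, similar to a table definition in MySQL. The mapping specifies the type of each field. Elasticsearch supports many field datatypes, such as String, Numeric, and Date, and String is further divided into two types: keyword and text. When creating an index, you define the fields and assign each one a type, as in the following example (it uses the typed-mapping syntax of ES 6.x, with the _doc level; from 7.0 on, mappings are written without the type name):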

PUT my_index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "_doc": {
      "_source": {
        "enabled": true
      },
      "properties": {
        "title": {
          "type": "text",
          "norms": false
        },
        "overview": {
          "type": "text",
          "norms": true
        },
        "body": {
          "type": "text"
        },
        "author": {
          "type": "keyword",
          "norms": true
        },
        "chapters": {
          "type": "keyword",
          "norms": false
        },
        "email": {
          "type": "keyword"
        }
      }
    }
  }
}

In my_index, the title field is of type text and the author field is of type keyword.

For text fields, norms are enabled by default, whereas for keyword fields norms are disabled by default:

Whether field-length should be taken into account when scoring queries. Accepts true (text field datatype) or false (keyword field datatype)

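Why are norms disabled by default for keyword fields? A keyword string can be understood as: do index the field, but don't analyze the string value. In other words, a keyword field is not broken into individual terms by an analyzer; it is a single-token field, so it has no need for normalization factors such as the field length norm (fieldNorm) or tfNorm (term frequency norm). A text field, by contrast, is processed by an analyzer and produces a number of terms. Of two text fields, one may contain many terms (the body of an article) and the other very few (the title), so a multi-field query needs length normalization. That is presumably why norms are enabled by default for text fields.

In addition, of the two scoring algorithms commonly used in Lucene, TF-IDF and BM25, TF-IDF tends to give higher scores to shorter fields. Why? Lucene's similarity score consists of three main parts: the IDF score, the TF score, and fieldNorms. In the TF-IDF formula, the IDF score is roughly 1 + log(numDocs/(docFreq+1)), the TF score is sqrt(tf), and fieldNorms is 1/sqrt(length), where length is the number of terms in the field. So the shorter the field, the larger fieldNorms and the higher the score, which is why TF-IDF is heavily biased toward short text.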
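The short-field bias of TF-IDF can be illustrated with a minimal sketch. It multiplies the three components named above and assumes the classic IDF form 1 + log(numDocs/(docFreq+1)); query normalization, coordination factors, and the one-byte quantization of the norm are deliberately left out, and the function name and sample numbers are hypothetical.

```python
import math

def classic_tfidf_score(tf, doc_freq, num_docs, field_length):
    """Simplified TF-IDF score of one term in one field (illustrative only)."""
    tf_score = math.sqrt(tf)                               # TF score: sqrt(tf)
    idf_score = 1.0 + math.log(num_docs / (doc_freq + 1))  # assumed classic IDF form
    field_norm = 1.0 / math.sqrt(field_length)             # fieldNorms: 1/sqrt(length)
    return tf_score * idf_score * field_norm

# Same term statistics, different field lengths (made-up numbers):
title_score = classic_tfidf_score(tf=1, doc_freq=9, num_docs=1000, field_length=4)
body_score = classic_tfidf_score(tf=1, doc_freq=9, num_docs=1000, field_length=400)

# The field that is 100 times shorter scores sqrt(100) = 10 times higher.
print(title_score / body_score)
```

With norms disabled on a field, this length factor no longer distinguishes short fields from long ones.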

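What do norms do?

norms are an adjustment factor used when computing the score of a document or field. Both TF-IDF and BM25 use norms when scoring documents. For details, see the Lucene document scoring formula.

A document in Elasticsearch contains multiple fields. The query parser (QueryParser) parses the user's query string into terms. In a multi-field search, each term is matched against each field and a score is computed per field; the per-field scores are then combined in some way (term-centric search vs. field-centric search) to produce the final score of the document.

The official ES documentation explains norms as follows:

Norms store various normalization factors that are later used at query time in order to compute the score of a document relatively to a query.

The normalization factors here adjust (boost) the document score at query time. For example, when the score is computed with the BM25 formula (freq*(k1+1))/(freq+k1*(1-b+b*fieldLength/avgFieldLength)), the term fieldLength/avgFieldLength is the normalization factor.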
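The BM25 expression above can be written out as a small function to see how the length ratio moves the score. This is a sketch of the term-frequency part only (the IDF factor is omitted); the defaults k1=1.2 and b=0.75 are the usual Elasticsearch BM25 defaults, and the function name is hypothetical.

```python
def bm25_tf_norm(freq, field_length, avg_field_length, k1=1.2, b=0.75):
    """Term-frequency part of BM25, with length normalization controlled by b."""
    length_ratio = field_length / avg_field_length  # the normalization factor
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * length_ratio))

average = bm25_tf_norm(freq=1, field_length=100, avg_field_length=100)
short = bm25_tf_norm(freq=1, field_length=10, avg_field_length=100)
long_ = bm25_tf_norm(freq=1, field_length=1000, avg_field_length=100)

# A shorter-than-average field scores higher, a longer one lower.
print(short, average, long_)
```

Setting b=0 removes the length ratio from the formula entirely, so field length stops influencing the result.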

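The cost of norms

With norms enabled, each field of each document needs one byte to store its norm. Since norms are enabled by default for text fields, you can disable them on text fields that do not need scoring, which is a worthwhile tuning point.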

Although useful for scoring, norms also require quite a lot of disk (typically in the order of one byte per document per field in your index, even for documents that don’t have this specific field). As a consequence, if you don’t need scoring on a specific field, you should disable norms on that field
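As a back-of-the-envelope sketch, the "one byte per document per field" rule translates into a simple product. The function name and the index size below are hypothetical, and the real on-disk footprint depends on the Lucene version and its encoding.

```python
def estimate_norms_bytes(num_docs, num_fields_with_norms, bytes_per_norm=1):
    """Rough norms storage: one byte per document per norms-enabled field."""
    return num_docs * num_fields_with_norms * bytes_per_norm

# Hypothetical index: 100 million documents, 3 text fields with norms enabled.
total_bytes = estimate_norms_bytes(100_000_000, 3)
print(total_bytes / 1_000_000, "MB")
```

Per the quoted passage, the document count applies even to documents that lack the field, so sparse fields do not reduce the estimate.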

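The norms factor is part of index-time boosting: when a document is indexed (written), all boosting factors are already stored, and at query time they are read from memory and take part in the score computation. See this passage from "Lucene in Action":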

During indexing, all sources of index-time boosts are combined into a single floating point number for each indexed field in the document. The document may have its own boost; each field may have a boost; and Lucene computes an automatic boost based on the number of tokens in the field (shorter fields have a higher boost). These boosts are combined and then compactly encoded (quantized) into a single byte, which is stored per field per document. During searching, norms for any field being searched are loaded into memory, decoded back into a floating-point number, and used when computing the relevance score.

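The other kind of boosting is search-time boosting: the boost factor is specified in the query and the document score is computed dynamically. For details, see "Relevant Search: With Applications for Solr and Elasticsearch"; this article does not go into it further. Note, however, that current ES versions no longer recommend index-time boosting and recommend search-time boosting instead. The official ES documentation gives the following reasons: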

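A boost factor stored at index time (with norms enabled) cannot be changed once stored. The only way to change it is to reindex.
Search-time boosting achieves the same effect as index-time boosting and lets you specify the boost factor dynamically (though probably at the cost of more CPU when scoring), so it is more flexible. Index-time boosting also needs extra storage.
Index-time boost factors are stored in the norms, where they interfere with field length normalization and lead to less accurate similarity scores (lower quality relevance calculations).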
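A search-time boost might look like the following sketch, which builds the body of a multi_match query using the field^boost syntax. The helper name, the query text, and the boost values are hypothetical; the field names come from the my_index example, and the body would be sent to the index's _search endpoint by whatever client you use.

```python
import json

def build_boosted_query(text, field_boosts):
    """Build a multi_match query body with per-field search-time boosts."""
    # "title^2" means: matches in title count twice as much.
    fields = [f"{name}^{boost}" for name, boost in field_boosts.items()]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}

# Boost values are chosen per query and can change without reindexing.
body = build_boosted_query("elasticsearch norms", {"title": 2, "body": 1})
print(json.dumps(body, indent=2))
```

Because the boost lives in the query rather than in the index, trying a different weighting is just a matter of sending a different request.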

Appendix: the mapping of the my_index index:

GET my_index/_mapping

{
  "my_index": {
    "mappings": {
      "_doc": {
        "properties": {
          "author": {
            "type": "keyword",
            "norms": true
          },
          "body": {
            "type": "text"
          },
          "chapters": {
            "type": "keyword"
          },
          "email": {
            "type": "keyword"
          },
          "overview": {
            "type": "text"
          },
          "title": {
            "type": "text",
            "norms": false
          }
        }
      }
    }
  }
}

Original: http://www.javashuo.com/article/p-dnaymfdf-bo.html
