Elasticsearch 2.20 Documents: Term Vectors (Index Term Frequency)

Date: 2019-11-17


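A term vector is a Lucene concept: for a text field of a document, such as title or body, it builds a multidimensional vector space of term frequencies. Each term is one dimension, and the value along that dimension is the term's frequency in that field. In Elasticsearch, _termvectors returns statistics for the fields of a specific document in the index. Term vectors are analyzed in real time; to turn real-time analysis off, set the realtime parameter to false. Term vector statistics are disabled by default and have to be enabled by hand in the mapping when the index is created.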
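As a rough illustration of the idea (not how Lucene stores it internally), the term vector of a single field can be sketched as a mapping from each term to its frequency. The function name term_frequency_vector is hypothetical, and the whitespace-plus-lowercase tokenization is an assumption chosen to mirror the analyzer defined in the mapping below.

```python
from collections import Counter


def term_frequency_vector(text):
    """Map each term in one field's text to how often it occurs there."""
    # Assumption: split on whitespace and lowercase, similar in spirit to
    # a whitespace tokenizer followed by a lowercase filter.
    tokens = text.lower().split()
    return Counter(tokens)


vector = term_frequency_vector("secilog test test test ")
print(dict(vector))  # {'secilog': 1, 'test': 3}
```

Each key is one dimension of the vector and each value is the term's frequency in the field.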

Note: from Elasticsearch 2.0 onward, use _termvectors instead of _termvector.

Next, we create an index with term vectors enabled.

Request: PUT http://localhost:9200/secilog/

Parameters:

{
  "mappings": {
    "log": {
      "properties": {
        "type": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "store" : true,
          "analyzer" : "fulltext_analyzer"
        },
        "message": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "analyzer" : "fulltext_analyzer"
        }
      }
    }
  },
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}

Then we insert two documents:

Request: PUT http://localhost:9200/secilog/log/1/?pretty

Parameters:

{
  "type" : "syslog",
  "message" : "secilog test test test "
}

Request: PUT http://localhost:9200/secilog/log/2/?pretty

Parameters:

{
  "type" : "file",
  "message" : "Another secilog test "
}

Once both log entries have been created, we use _termvectors to query the statistics.

Request: GET http://localhost:9200/secilog/log/1/_termvectors?pretty=true

The response is as follows:

{
  "_index" : "secilog",
  "_type" : "log",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "took" : 2,
  "term_vectors" : {
    "message" : {
      "field_statistics" : {
        "sum_doc_freq" : 5,
        "doc_count" : 2,
        "sum_ttf" : 7
      },
      "terms" : {
        "secilog" : {
          "term_freq" : 1,
          "tokens" : [ {
            "position" : 0,
            "start_offset" : 0,
            "end_offset" : 7,
            "payload" : "d29yZA=="
          } ]
        },
        "test" : {
          "term_freq" : 3,
          "tokens" : [ {
            "position" : 1,
            "start_offset" : 8,
            "end_offset" : 12,
            "payload" : "d29yZA=="
          }, {
            "position" : 2,
            "start_offset" : 13,
            "end_offset" : 17,
            "payload" : "d29yZA=="
          }, {
            "position" : 3,
            "start_offset" : 18,
            "end_offset" : 22,
            "payload" : "d29yZA=="
          } ]
        }
      }
    },
    "type" : {
      "field_statistics" : {
        "sum_doc_freq" : 2,
        "doc_count" : 2,
        "sum_ttf" : 2
      },
      "terms" : {
        "syslog" : {
          "term_freq" : 1,
          "tokens" : [ {
            "position" : 0,
            "start_offset" : 0,
            "end_offset" : 6,
            "payload" : "d29yZA=="
          } ]
        }
      }
    }
  }
}
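The token entries in the response above can be reproduced by hand. The following is a minimal sketch, assuming a whitespace tokenizer and lowercase filter behave like a regular-expression split on non-whitespace runs; the analyze function is a hypothetical helper, not an Elasticsearch API. It also decodes the base64 payload, which here holds the token type written by the type_as_payload filter.

```python
import base64
import re


def analyze(text):
    """Return (term, position, start_offset, end_offset) for each token."""
    tokens = []
    # \S+ matches maximal runs of non-whitespace characters, which is an
    # approximation of what a whitespace tokenizer emits.
    for position, match in enumerate(re.finditer(r"\S+", text)):
        tokens.append((match.group().lower(), position, match.start(), match.end()))
    return tokens


for token in analyze("secilog test test test "):
    print(token)

# Payloads are returned base64-encoded; this one decodes to the token type.
print(base64.b64decode("d29yZA==").decode("utf-8"))  # word
```

The offsets are character positions in the original field value, with end_offset exclusive, so "secilog" spans 0 to 7 and the first "test" spans 8 to 12.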

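The response shows, for each field, how many times each term occurs and where. Note that these field statistics are not fully accurate: deleted documents are not taken into account, and the statistics are gathered only from the shard that holds the requested document unless dfs is set to true. The term statistics are therefore best treated as a relative indication of term frequency. By default, when term vectors are requested, the system picks a shard at random to compute the statistics; with routing you can query the statistics of one specific shard. A term vector query can also take parameters, for example:

Request: POST http://localhost:9200/secilog/log/1/_termvectors?pretty=true

Parameters: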

{
  "fields" : ["message"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

Response:

{
  "_index" : "secilog",
  "_type" : "log",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "took" : 2,
  "term_vectors" : {
    "message" : {
      "field_statistics" : {
        "sum_doc_freq" : 5,
        "doc_count" : 2,
        "sum_ttf" : 7
      },
      "terms" : {
        "secilog" : {
          "doc_freq" : 2,
          "ttf" : 2,
          "term_freq" : 1,
          "tokens" : [ {
            "position" : 0,
            "start_offset" : 0,
            "end_offset" : 7,
            "payload" : "d29yZA=="
          } ]
        },
        "test" : {
          "doc_freq" : 2,
          "ttf" : 4,
          "term_freq" : 3,
          "tokens" : [ {
            "position" : 1,
            "start_offset" : 8,
            "end_offset" : 12,
            "payload" : "d29yZA=="
          }, {
            "position" : 2,
            "start_offset" : 13,
            "end_offset" : 17,
            "payload" : "d29yZA=="
          }, {
            "position" : 3,
            "start_offset" : 18,
            "end_offset" : 22,
            "payload" : "d29yZA=="
          } ]
        }
      }
    }
  }
}
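For readers who prefer to script the call, the POST request above could be assembled with the Python standard library roughly as follows. This is a sketch under assumptions: build_termvectors_request is a hypothetical helper, the node is assumed to listen on localhost:9200 as in the article, and the request is only built here, not sent.

```python
import json
import urllib.request
from urllib.parse import urlencode


def build_termvectors_request(base_url, index, doc_type, doc_id, body, **params):
    """Build (but do not send) a POST request for the _termvectors endpoint."""
    url = f"{base_url}/{index}/{doc_type}/{doc_id}/_termvectors"
    if params:
        # Query-string options such as pretty, realtime, dfs or routing.
        url += "?" + urlencode(params)
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


body = {
    "fields": ["message"],
    "offsets": True,
    "payloads": True,
    "positions": True,
    "term_statistics": True,
    "field_statistics": True,
}
request = build_termvectors_request(
    "http://localhost:9200", "secilog", "log", "1", body, pretty="true"
)
print(request.get_method(), request.full_url)

# To send it against a running node, uncomment:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Passing realtime="false" or dfs="true" as extra keyword arguments would append the parameters discussed in the text to the query string.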

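In this second request the statistics are filtered: only the message field is returned, and with term_statistics enabled each term also carries doc_freq and ttf.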
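The numbers in field_statistics and the per-term doc_freq and ttf can be checked against the two message values indexed earlier. The sketch below is illustrative only: field_and_term_statistics is a hypothetical helper, it assumes whitespace tokenization with lowercasing, and it ignores shards and deleted documents entirely.

```python
from collections import Counter


def field_and_term_statistics(field_values):
    """Compute field-level and term-level statistics for one field."""
    per_doc_counts = [Counter(value.lower().split()) for value in field_values]
    doc_freq = Counter()  # number of documents containing each term
    ttf = Counter()       # total occurrences of each term across documents
    for counts in per_doc_counts:
        for term, freq in counts.items():
            doc_freq[term] += 1
            ttf[term] += freq
    field_statistics = {
        "sum_doc_freq": sum(doc_freq.values()),
        "doc_count": sum(1 for counts in per_doc_counts if counts),
        "sum_ttf": sum(ttf.values()),
    }
    return field_statistics, dict(doc_freq), dict(ttf)


messages = ["secilog test test test ", "Another secilog test "]
field_stats, doc_freq, ttf = field_and_term_statistics(messages)
print(field_stats)                     # {'sum_doc_freq': 5, 'doc_count': 2, 'sum_ttf': 7}
print(doc_freq["test"], ttf["test"])   # 2 4
```

Here sum_doc_freq is the sum of document frequencies over all terms in the field, and sum_ttf is the sum of total term frequencies, which matches the values 5 and 7 in the response.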

Note that enabling term vectors adds load to the system, so turn the statistics on only when they are really needed.

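secisland (賽克藍德) will continue to analyze the features of the latest Elasticsearch releases step by step, so stay tuned. You are also welcome to follow the secisland public account.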