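A term vector is a Lucene concept: for a text field of a document, such as title or body, it builds a multi-dimensional vector space of term frequencies. Each term is one dimension, and the value along that dimension is the number of times the term occurs in the field. In Elasticsearch, termvectors returns statistics about the terms in a particular field of a particular document. These term vectors are computed in real time by default; to turn that off, set the realtime parameter to false. Term vector storage is disabled by default and has to be enabled by hand when the index is created.
Note: from Elasticsearch 2.0 onward, use _termvectors instead of _termvector.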
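To make the idea concrete, here is a minimal Python sketch of a term-frequency vector for a single field. It assumes a simplified analyzer (lowercase, then split on whitespace), and the function name term_vector is made up for illustration; it is not a Lucene or Elasticsearch API.

```python
from collections import Counter

def term_vector(field_text):
    """Return {term: frequency} for one text field.

    Assumption: tokens come from lowercasing and splitting on
    whitespace, which only approximates a real Lucene analyzer.
    """
    tokens = field_text.lower().split()
    return dict(Counter(tokens))

# Each distinct term is one dimension; its value is the term's frequency.
vector = term_vector("secilog test test test ")
print(vector)
```

The sample text is the message of the first log entry used later in this post.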
Let's create an index with term vectors enabled.
Request: PUT http://localhost:9200/secilog/
Body:
{
  "mappings": {
    "log": {
      "properties": {
        "type": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "store": true,
          "analyzer": "fulltext_analyzer"
        },
        "message": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "analyzer": "fulltext_analyzer"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [ "lowercase", "type_as_payload" ]
        }
      }
    }
  }
}
Then we insert two documents:
Request: PUT http://localhost:9200/secilog/log/1/?pretty
Body:
{ "type" : "syslog", "message" : "secilog test test test " }
Request: PUT http://localhost:9200/secilog/log/2/?pretty
Body:
{ "type" : "file", "message" : "Another secilog test " }
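The two PUT requests above could be issued from a script. The following is a rough Python sketch using only the standard library; it builds the requests without sending them. The helper name build_put and the BASE_URL constant are assumptions made for this example, and the Content-Type header is added as a precaution rather than taken from the original requests.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local node, as in the post

def build_put(path, body):
    """Build (but do not send) a PUT request carrying a JSON body."""
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

docs = {
    "1": {"type": "syslog", "message": "secilog test test test "},
    "2": {"type": "file", "message": "Another secilog test "},
}
requests_to_send = [
    build_put("/secilog/log/%s" % doc_id, body) for doc_id, body in docs.items()
]

for req in requests_to_send:
    print(req.get_method(), req.full_url)
    # To really index the document, uncomment the next line
    # (requires a running Elasticsearch node):
    # urllib.request.urlopen(req).read()
```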
Once both log entries have been created, we query the statistics with _termvectors.
Request: GET http://localhost:9200/secilog/log/1/_termvectors?pretty=true
The response is:
{
  "_index" : "secilog",
  "_type" : "log",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "took" : 2,
  "term_vectors" : {
    "message" : {
      "field_statistics" : { "sum_doc_freq" : 5, "doc_count" : 2, "sum_ttf" : 7 },
      "terms" : {
        "secilog" : {
          "term_freq" : 1,
          "tokens" : [
            { "position" : 0, "start_offset" : 0, "end_offset" : 7, "payload" : "d29yZA==" }
          ]
        },
        "test" : {
          "term_freq" : 3,
          "tokens" : [
            { "position" : 1, "start_offset" : 8, "end_offset" : 12, "payload" : "d29yZA==" },
            { "position" : 2, "start_offset" : 13, "end_offset" : 17, "payload" : "d29yZA==" },
            { "position" : 3, "start_offset" : 18, "end_offset" : 22, "payload" : "d29yZA==" }
          ]
        }
      }
    },
    "type" : {
      "field_statistics" : { "sum_doc_freq" : 2, "doc_count" : 2, "sum_ttf" : 2 },
      "terms" : {
        "syslog" : {
          "term_freq" : 1,
          "tokens" : [
            { "position" : 0, "start_offset" : 0, "end_offset" : 6, "payload" : "d29yZA==" }
          ]
        }
      }
    }
  }
}
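A response like this is easy to post-process. The Python sketch below takes a trimmed copy of the message entry, lists each term with its frequency and positions, and decodes the payloads, which are base64-encoded; "d29yZA==" decodes to "word". The function name summarize is invented for this example.

```python
import base64

# Trimmed copy of the "message" entry from the response above.
message_vector = {
    "terms": {
        "secilog": {"term_freq": 1, "tokens": [
            {"position": 0, "start_offset": 0, "end_offset": 7, "payload": "d29yZA=="}]},
        "test": {"term_freq": 3, "tokens": [
            {"position": 1, "start_offset": 8, "end_offset": 12, "payload": "d29yZA=="},
            {"position": 2, "start_offset": 13, "end_offset": 17, "payload": "d29yZA=="},
            {"position": 3, "start_offset": 18, "end_offset": 22, "payload": "d29yZA=="}]},
    }
}

def summarize(field_vector):
    """Map each term to (term_freq, positions, distinct decoded payloads)."""
    summary = {}
    for term, info in field_vector["terms"].items():
        tokens = info.get("tokens", [])
        positions = [t["position"] for t in tokens]
        payloads = sorted({
            base64.b64decode(t["payload"]).decode("utf-8")
            for t in tokens if "payload" in t
        })
        summary[term] = (info["term_freq"], positions, payloads)
    return summary

for term, (freq, positions, payloads) in summarize(message_vector).items():
    print(term, freq, positions, payloads)
```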
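The response above shows, for each field, how often every term occurs and where. Note that these statistics are not exact: deleted documents are not taken into account, and the information is gathered only from the shard that holds the requested document unless dfs is set to true. The numbers are therefore useful as a relative measure of term frequency rather than as absolute values. By default, when term vectors are requested for an artificial document (one passed in the request body rather than stored in the index), the shard used for the statistics is picked at random; use routing to target a specific shard.
The request can also specify which statistics to return, for example:
Request: POST http://localhost:9200/secilog/log/1/_termvectors?pretty=true
Body: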
{ "fields" : ["message"], "offsets" : true, "payloads" : true, "positions" : true, "term_statistics" : true, "field_statistics" : true }
Response:
{
  "_index" : "secilog",
  "_type" : "log",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "took" : 2,
  "term_vectors" : {
    "message" : {
      "field_statistics" : { "sum_doc_freq" : 5, "doc_count" : 2, "sum_ttf" : 7 },
      "terms" : {
        "secilog" : {
          "doc_freq" : 2,
          "ttf" : 2,
          "term_freq" : 1,
          "tokens" : [
            { "position" : 0, "start_offset" : 0, "end_offset" : 7, "payload" : "d29yZA==" }
          ]
        },
        "test" : {
          "doc_freq" : 2,
          "ttf" : 4,
          "term_freq" : 3,
          "tokens" : [
            { "position" : 1, "start_offset" : 8, "end_offset" : 12, "payload" : "d29yZA==" },
            { "position" : 2, "start_offset" : 13, "end_offset" : 17, "payload" : "d29yZA==" },
            { "position" : 3, "start_offset" : 18, "end_offset" : 22, "payload" : "d29yZA==" }
          ]
        }
      }
    }
  }
}
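As the response shows, the output was filtered: only part of the statistics is returned, here the message field alone, with doc_freq and ttf added for each term because term_statistics is true.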
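The numbers in these responses can be reproduced by hand from the two indexed messages. The Python sketch below does so under two assumptions: lowercasing plus whitespace splitting stands in for fulltext_analyzer, and all documents live on one shard with no deletions (the index above uses a single shard). The variable names are chosen for this example only.

```python
from collections import Counter

# The message field of the two documents indexed above.
messages = ["secilog test test test ", "Another secilog test "]

# Assumption: lowercase + whitespace split mimics fulltext_analyzer.
docs = [m.lower().split() for m in messages]

doc_freq = Counter()  # number of documents containing each term
ttf = Counter()       # total occurrences of each term across all documents
for tokens in docs:
    ttf.update(tokens)
    doc_freq.update(set(tokens))

field_statistics = {
    "doc_count": sum(1 for tokens in docs if tokens),  # docs that have the field
    "sum_doc_freq": sum(doc_freq.values()),
    "sum_ttf": sum(ttf.values()),
}
print(field_statistics)
print("test:", doc_freq["test"], ttf["test"])
print("secilog:", doc_freq["secilog"], ttf["secilog"])
```

The computed values match the field_statistics block and the per-term doc_freq and ttf entries returned for the message field.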
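Keep in mind that enabling term vectors puts extra load on the system, so turn them on only when you really need them.
Secisland (賽克藍德) will continue to analyze the features of the latest Elasticsearch releases, so stay tuned. You are also welcome to follow the secisland official account.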