1. Introduction to term vectors

The term vectors API returns statistics about the individual terms inside a given field of a document.

term information: term frequency in the field, term positions, start and end offsets, term payloads

term statistics (returned only when term_statistics=true): total term frequency, i.e. how many times a term occurs across all documents; document frequency, i.e. how many documents contain the term

field statistics: document count, i.e. how many documents contain the field; sum of document frequency, i.e. the sum of the df of every term in the field; sum of total term frequency, i.e. the sum of the ttf of every term in the field

GET /twitter/tweet/1/_termvectors
GET /twitter/tweet/1/_termvectors?fields=text

Term statistics and field statistics are not exact: documents that have been deleted are not taken into account.

To be honest, you will not use this very often. When you do, it is usually because you need to explore some data. For example, you want to see in how many documents a particular term, say 大話西遊, appears. Or you want to know how many docs contain a particular field, say film_desc, the description of a film.

2. Index-time term vector experiment

Term vectors involve quite a lot of term-level and field-level statistics, and there are two ways to collect them:

(1) index-time: you configure it in the mapping, and the term and field statistics are generated for you as documents are indexed
(2) query-time: you have not generated any term vector information beforehand; when you request the term vectors, the statistics are computed on the fly and returned to you

In this lesson I will not type any commands by hand; I will just copy the ones I prepared. The point of this lesson is not search or aggregation syntax. The point is how to collect term vector information, how to read it, and how to use it for data exploration.

PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "text",
          "term_vector": "with_positions_offsets_payloads",
          "store" : true,
          "analyzer" : "fulltext_analyzer"
        },
        "fullname": {
          "type": "text",
          "analyzer" : "fulltext_analyzer"
        }
      }
    }
  },
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}

PUT /my_index/my_type/1
{
  "fullname" : "Leo Li",
  "text" : "hello test test test "
}

PUT /my_index/my_type/2
{
  "fullname" : "Leo Li",
  "text" : "other hello test ..."
}

GET /my_index/my_type/1/_termvectors
{
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

Response:

{
  "_index": "my_index",
  "_type": "my_type",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 10,
  "term_vectors": {
    "text": {
      "field_statistics": {
        "sum_doc_freq": 6,
        "doc_count": 2,
        "sum_ttf": 8
      },
      "terms": {
        "hello": {
          "doc_freq": 2,
          "ttf": 2,
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 5,
              "payload": "d29yZA=="
            }
          ]
        },
        "test": {
          "doc_freq": 2,
          "ttf": 4,
          "term_freq": 3,
          "tokens": [
            {
              "position": 1,
              "start_offset": 6,
              "end_offset": 10,
              "payload": "d29yZA=="
            },
            {
              "position": 2,
              "start_offset": 11,
              "end_offset": 15,
              "payload": "d29yZA=="
            },
            {
              "position": 3,
              "start_offset": 16,
              "end_offset": 20,
              "payload": "d29yZA=="
            }
          ]
        }
      }
    }
  }
}

3. Query-time term vector experiment

The fullname field has no term_vector setting in the mapping, so its term vectors are computed on the fly when requested.

GET /my_index/my_type/1/_termvectors
{
  "fields" : ["fullname"],
  "offsets" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

Generally speaking, if conditions allow, query-time term vectors are enough: whatever data you want to explore, just explore it on the spot.

4. Term vectors for a manually supplied doc

GET /my_index/my_type/_termvectors
{
  "doc" : {
    "fullname" : "Leo Li",
    "text" : "hello test test test"
  },
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

Supplying a doc by hand is not really about the doc. What you are supplying is the terms you want to probe, for example hello test, which you simply put into a field.

The text is analyzed into terms, and for each term the statistics are computed against all of the docs that already exist.

This is quite useful: it lets you choose which terms to probe, so you could look up the statistics for the term 大話西遊, for instance.

5. Generating term vectors with a manually specified analyzer

GET /my_index/my_type/_termvectors
{
  "doc" : {
    "fullname" : "Leo Li",
    "text" : "hello test test test"
  },
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true,
  "per_field_analyzer" : {
    "text": "standard"
  }
}

6. Terms filter

GET /my_index/my_type/_termvectors
{
  "doc" : {
    "fullname" : "Leo Li",
    "text" : "hello test test test"
  },
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
"positions" : true, "term_statistics" : true, "field_statistics" : true, "filter" : { "max_num_terms" : 3, "min_term_freq" : 1, "min_doc_freq" : 1 } } 這個就是說,根據term統計信息,過濾出你想要看到的term vector統計結果 也挺有用的,好比你探查數據把,能夠過濾掉一些出現頻率太低的term,就不考慮了 7、multi term vector GET _mtermvectors { "docs": [ { "_index": "my_index", "_type": "my_type", "_id": "2", "term_statistics": true }, { "_index": "my_index", "_type": "my_type", "_id": "1", "fields": [ "text" ] } ] } GET /my_index/_mtermvectors { "docs": [ { "_type": "test", "_id": "2", "fields": [ "text" ], "term_statistics": true }, { "_type": "test", "_id": "1" } ] } GET /my_index/my_type/_mtermvectors { "docs": [ { "_id": "2", "fields": [ "text" ], "term_statistics": true }, { "_id": "1" } ] } GET /_mtermvectors { "docs": [ { "_index": "my_index", "_type": "my_type", "doc" : { "fullname" : "Leo Li", "text" : "hello test test test" } }, { "_index": "my_index", "_type": "my_type", "doc" : { "fullname" : "Leo Li", "text" : "other hello test ..." } } ] }