Exploring Data in Depth with Term Vectors

Date: 2019-11-06
1. Introduction to term vectors

Term vectors return statistics about each term inside a given field of a document.

term information: term frequency in the field, term positions, start and end offsets, term payloads

term statistics: set term_statistics=true; total term frequency (how many times a term occurs across all documents); document frequency (how many documents contain the term)

field statistics: document count (how many documents contain this field); sum of document frequency (the sum of df over all terms in the field); sum of total term frequency (the sum of total term frequencies over all terms in the field)
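To make these definitions concrete, here is a minimal Python sketch that computes the same counters over a toy corpus. It assumes plain whitespace tokenization with lowercasing, and the function name field_statistics is made up for illustration; it is not how Elasticsearch computes them internally.

```python
from collections import Counter


def field_statistics(field_values):
    """field_values: one string per document, the value of a single field."""
    # Per-document term frequencies (assumed analysis: lowercase + whitespace split).
    per_doc = [Counter(text.lower().split()) for text in field_values]

    doc_freq = Counter()  # document frequency: how many documents contain the term
    ttf = Counter()       # total term frequency: occurrences across all documents
    for counts in per_doc:
        for term, freq in counts.items():
            doc_freq[term] += 1
            ttf[term] += freq

    return {
        "doc_count": sum(1 for counts in per_doc if counts),
        "sum_doc_freq": sum(doc_freq.values()),
        "sum_ttf": sum(ttf.values()),
        "doc_freq": dict(doc_freq),
        "ttf": dict(ttf),
    }


stats = field_statistics(["a b b", "b c"])
print(stats["doc_count"], stats["sum_doc_freq"], stats["sum_ttf"])  # 2 4 5
```

In the toy corpus, b occurs in both documents (df = 2) and three times in total (ttf = 3), which is the difference between the two term-level counters.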

GET /twitter/tweet/1/_termvectors
GET /twitter/tweet/1/_termvectors?fields=text

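Term statistics and field statistics are not exact: they do not account for documents that have been deleted, so a deleted doc may still be counted.

To be honest, this is not used very often. You typically reach for it when you need to explore some data. For example, you want to see how many documents a particular term, say 大話西遊, appears in. Or you want to know how many docs contain a particular field, say film_desc, the description of a film.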

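2. Index-time term vector experiment

Term vectors involve quite a lot of term and field statistics. There are two ways to collect them:

(1) index-time: configure it in the mapping, and the term and field statistics are generated directly when the documents are indexed
(2) query-time: no term vector information was generated beforehand; when you request the term vector, the statistics are computed on the fly and returned to you

In this lesson there is no need to type any commands by hand; just copy the prepared commands. The point is not to learn search or aggregation syntax, but to learn how to collect term vector information, how to read it, and how to use it to explore data.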

PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "text",
          "term_vector": "with_positions_offsets_payloads",
          "store": true,
          "analyzer": "fulltext_analyzer"
        },
        "fullname": {
          "type": "text",
          "analyzer": "fulltext_analyzer"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}


PUT /my_index/my_type/1
{
  "fullname": "Leo Li",
  "text": "hello test test test "
}

PUT /my_index/my_type/2
{
  "fullname": "Leo Li",
  "text": "other hello test ..."
}

GET /my_index/my_type/1/_termvectors
{
  "fields": ["text"],
  "offsets": true,
  "payloads": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true
}

{
  "_index": "my_index",
  "_type": "my_type",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 10,
  "term_vectors": {
    "text": {
      "field_statistics": {
        "sum_doc_freq": 6,
        "doc_count": 2,
        "sum_ttf": 8
      },
      "terms": {
        "hello": {
          "doc_freq": 2,
          "ttf": 2,
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 5,
              "payload": "d29yZA=="
            }
          ]
        },
        "test": {
          "doc_freq": 2,
          "ttf": 4,
          "term_freq": 3,
          "tokens": [
            {
              "position": 1,
              "start_offset": 6,
              "end_offset": 10,
              "payload": "d29yZA=="
            },
            {
              "position": 2,
              "start_offset": 11,
              "end_offset": 15,
              "payload": "d29yZA=="
            },
            {
              "position": 3,
              "start_offset": 16,
              "end_offset": 20,
              "payload": "d29yZA=="
            }
          ]
        }
      }
    }
  }
}
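To read the tokens section of this response, it helps to recompute it by hand. The Python sketch below is only an approximation of the fulltext_analyzer from the mapping (whitespace tokenizer plus lowercase); the function name tokens is made up for illustration. It also decodes the Base64 payload, which is the token type written by the type_as_payload filter.

```python
import base64
import re


def tokens(text):
    """Approximate whitespace + lowercase analysis, reporting position and offsets."""
    result = []
    for position, match in enumerate(re.finditer(r"\S+", text)):
        result.append({
            "term": match.group().lower(),
            "position": position,
            "start_offset": match.start(),  # index of the first character
            "end_offset": match.end(),      # index just past the last character
        })
    return result


toks = tokens("hello test test test ")
for tok in toks:
    print(tok)

# The payload in the response is Base64-encoded bytes.
payload = base64.b64decode("d29yZA==").decode("utf-8")
print(payload)  # word
```

The positions 0 to 3 and the offsets 0-5, 6-10, 11-15 and 16-20 match the response above, and every payload decodes to the token type word.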

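3. Query-time term vector experiment

The fullname field was mapped without term_vector, so its term vector is computed on the fly when requested.

GET /my_index/my_type/1/_termvectors
{
  "fields": ["fullname"],
  "offsets": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true
}

Generally speaking, if circumstances allow, query-time term vectors are all you need: whatever data you want to explore, just compute it on the spot.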

4. Term vectors for a manually supplied doc

GET /my_index/my_type/_termvectors
{
  "doc": {
    "fullname": "Leo Li",
    "text": "hello test test test"
  },
  "fields": ["text"],
  "offsets": true,
  "payloads": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true
}

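Supplying a doc by hand is not really about the doc itself; it is a way to pass in the terms you want to probe, such as hello test, by putting them into a field.

The text is analyzed into terms, and for each term the statistics are computed against all the docs that already exist.

This is quite useful: it lets you pick exactly which terms to explore, so you could, for instance, look up the statistics for the term 大話西遊.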
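The idea can be sketched in a few lines of Python: analyze the ad hoc text, then look up each resulting term in the docs that are already indexed. This is a simplified model, assuming whitespace plus lowercase analysis and an in-memory list standing in for the index; the names INDEXED_TEXTS, analyze and explore are hypothetical.

```python
from collections import Counter

# Stand-in for the "text" field of the two docs indexed earlier.
INDEXED_TEXTS = ["hello test test test ", "other hello test ..."]


def analyze(text):
    """Assumed analysis: lowercase, then split on whitespace."""
    return text.lower().split()


def explore(ad_hoc_text, indexed_texts=INDEXED_TEXTS):
    """Return per-term statistics for an ad hoc text against the indexed texts."""
    per_doc = [Counter(analyze(text)) for text in indexed_texts]
    report = {}
    for term, term_freq in Counter(analyze(ad_hoc_text)).items():
        report[term] = {
            "term_freq": term_freq,                                  # within the ad hoc text
            "doc_freq": sum(1 for counts in per_doc if term in counts),
            "ttf": sum(counts[term] for counts in per_doc),
        }
    return report


report = explore("hello test test test")
print(report)
```

For hello this gives doc_freq 2 and ttf 2, and for test doc_freq 2 and ttf 4, the same values that the indexed doc 1 reported earlier.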

5. Generating term vectors with a manually specified analyzer

GET /my_index/my_type/_termvectors
{
  "doc": {
    "fullname": "Leo Li",
    "text": "hello test test test"
  },
  "fields": ["text"],
  "offsets": true,
  "payloads": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true,
  "per_field_analyzer": {
    "text": "standard"
  }
}
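Why the analyzer matters is easiest to see on the text of doc 2. The Python sketch below compares two rough stand-ins: one for the custom fulltext_analyzer (whitespace plus lowercase) and one for the standard analyzer, approximated here with a \w+ regular expression. Both functions are hypothetical simplifications; the real standard analyzer follows Unicode text segmentation rules.

```python
import re


def fulltext_like_analyzer(text):
    """Rough stand-in for whitespace tokenizer + lowercase filter."""
    return text.lower().split()


def standard_like_analyzer(text):
    """Rough stand-in for the standard analyzer: word characters only, lowercased."""
    return [token.lower() for token in re.findall(r"\w+", text)]


text = "other hello test ..."
whitespace_terms = fulltext_like_analyzer(text)
standard_terms = standard_like_analyzer(text)

print(whitespace_terms)  # ['other', 'hello', 'test', '...']
print(standard_terms)    # ['other', 'hello', 'test']
```

The whitespace-based analysis keeps ... as a term of its own, while the standard-like analysis drops the punctuation, so the two produce different terms and different statistics for the same text.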

6. Terms filter

GET /my_index/my_type/_termvectors
{
  "doc": {
    "fullname": "Leo Li",
    "text": "hello test test test"
  },
  "fields": ["text"],
  "offsets": true,
  "payloads": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true,
  "filter": {
    "max_num_terms": 3,
    "min_term_freq": 1,
    "min_doc_freq": 1
  }
}

This filters the term vector results by term statistics, so that you only see the terms you care about.
It is quite useful too: when exploring data, you can for example drop terms that occur too rarely and ignore them.
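A rough Python sketch of what such a filter does: drop terms below the frequency thresholds, rank the rest, and keep at most max_num_terms of them. The scoring formula used here is a simple tf-idf variant chosen for illustration and is an assumption, not the exact formula Elasticsearch applies; the function name filter_terms is also made up.

```python
import math


def filter_terms(terms, doc_count, max_num_terms, min_term_freq, min_doc_freq):
    """terms: {term: {"term_freq": int, "doc_freq": int}} -> list of (term, score)."""
    scored = []
    for term, stats in terms.items():
        if stats["term_freq"] < min_term_freq or stats["doc_freq"] < min_doc_freq:
            continue  # too rare, skip it
        # Assumed scoring: term frequency weighted by a smoothed inverse document frequency.
        idf = 1.0 + math.log((doc_count + 1) / (stats["doc_freq"] + 1))
        scored.append((term, stats["term_freq"] * idf))
    # Highest score first; ties broken alphabetically so the output is stable.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:max_num_terms]


terms = {
    "hello": {"term_freq": 1, "doc_freq": 2},
    "test": {"term_freq": 3, "doc_freq": 2},
}
top = filter_terms(terms, doc_count=2, max_num_terms=3, min_term_freq=1, min_doc_freq=1)
frequent_only = filter_terms(terms, doc_count=2, max_num_terms=3, min_term_freq=2, min_doc_freq=1)
print(top)
print(frequent_only)
```

With the thresholds from the request above, both terms survive and test ranks first; raising min_term_freq to 2 leaves only test.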

7. Multi term vectors

GET _mtermvectors
{
  "docs": [
    {
      "_index": "my_index",
      "_type": "my_type",
      "_id": "2",
      "term_statistics": true
    },
    {
      "_index": "my_index",
      "_type": "my_type",
      "_id": "1",
      "fields": [
        "text"
      ]
    }
  ]
}

GET /my_index/_mtermvectors
{
  "docs": [
    {
      "_type": "my_type",
      "_id": "2",
      "fields": [
        "text"
      ],
      "term_statistics": true
    },
    {
      "_type": "my_type",
      "_id": "1"
    }
  ]
}

GET /my_index/my_type/_mtermvectors
{
  "docs": [
    {
      "_id": "2",
      "fields": [
        "text"
      ],
      "term_statistics": true
    },
    {
      "_id": "1"
    }
  ]
}

GET /_mtermvectors
{
  "docs": [
    {
      "_index": "my_index",
      "_type": "my_type",
      "doc": {
        "fullname": "Leo Li",
        "text": "hello test test test"
      }
    },
    {
      "_index": "my_index",
      "_type": "my_type",
      "doc": {
        "fullname": "Leo Li",
        "text": "other hello test ..."
      }
    }
  ]
}
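When many docs need to be explored, the request body is easier to generate than to write by hand. The Python sketch below only builds the JSON body in the shape used by the /my_index/my_type/_mtermvectors request above; it does not talk to Elasticsearch, and the helper name mtermvectors_body is hypothetical.

```python
import json


def mtermvectors_body(doc_ids, fields=None, term_statistics=False):
    """Build a multi term vectors body with one entry per document id."""
    docs = []
    for doc_id in doc_ids:
        entry = {"_id": doc_id}
        if fields:
            entry["fields"] = list(fields)
        if term_statistics:
            entry["term_statistics"] = True
        docs.append(entry)
    return {"docs": docs}


body = mtermvectors_body(["2", "1"], fields=["text"], term_statistics=True)
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted after GET /my_index/my_type/_mtermvectors; unlike the hand-written request, this sketch applies the same fields and term_statistics options to every doc.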