Elasticsearch In-Depth Search: Structured Search and Using the Java API

Date: 2020-07-07


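1、Creating an index in ES

1. Create the index:

The earlier post on installing and using ES plugins covered creating an index with a custom analyzer and creating a type as separate steps. In fact, the type can be created together with the index, with its analyzer specified at the same time.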

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik_smart_pinyin": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "filter": ["my_pinyin", "word_delimiter"]
        },
        "ik_max_word_pinyin": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["my_pinyin", "word_delimiter"]
        }
      },
      "filter": {
        "my_pinyin": {
          "type": "pinyin",
          "keep_separate_first_letter": true,
          "keep_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "lowercase": true,
          "remove_duplicated_term": true
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "id": { "type": "integer" },
        "name": { "type": "text", "analyzer": "ik_max_word_pinyin" },
        "age": { "type": "integer" }
      }
    }
  }
}

2. Add data

POST /my_index/my_type/_bulk
{ "index": { "_id":1}}
{ "id":1,"name": "張三","age":20}
{ "index": { "_id": 2}}
{ "id":2,"name": "張四","age":22}
{ "index": { "_id": 3}}
{ "id":3,"name": "張三李四王五","age":20}

3. View the field types

GET /my_index/my_type/_mapping

Result:

{
  "my_index": {
    "mappings": {
      "my_type": {
        "properties": {
          "age": { "type": "integer" },
          "id": { "type": "integer" },
          "name": { "type": "text", "analyzer": "ik_max_word_pinyin" }
        }
      }
    }
  }
}

2、Integrating with Java (ES must already be configured in the project; plenty of examples are available online)

1. Create the ES entity class

package com.example.es_query_list.entity.es;

import lombok.Getter;
import lombok.Setter;
import org.springframework.data.annotation.Id;
import org.springframework.data.elasticsearch.annotations.Document;

@Setter
@Getter
@Document(indexName = "my_index",type = "my_type")
public class User {
    @Id
    private Integer id;
    private String name;
    private Integer age;
}

2. Create the DAO layer

package com.example.es_query_list.repository.es;

import com.example.es_query_list.entity.es.User;
import org.springframework.data.elasticsearch.repository.ElasticsearchRepository;

public interface EsUserRepository extends ElasticsearchRepository<User,Integer> {
}

3、With the groundwork done, start querying

1. Exact-value queries

Querying non-text fields

GET /my_index/my_type/_search
{
  "query": {
    "term": {
      "age": { "value": "20" }
    }
  }
}

Result:

{
  "took": 0,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      { "_index": "my_index", "_type": "my_type", "_id": "1", "_score": 1, "_source": { "name": "張三", "age": 20 } },
      { "_index": "my_index", "_type": "my_type", "_id": "3", "_score": 1, "_source": { "name": "李四", "age": 20 } }
    ]
  }
}

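2. Querying text fields

Running a term query on the name field with a full value such as 張三李四王五 returns no hits:

{
  "took": 0,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}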

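At this point you may notice that the result is empty. Why can't an exact match find the exact value that was entered? As mentioned earlier, an analyzer was specified for the field when the type was created. A term query looks up its input as-is, so if the input is not one of the terms the analyzer produced, nothing is found. Let's verify this.

First, check how the search text is broken up by the analyzer: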

GET my_index/_analyze
{
  "text": "張三李四王五",
  "analyzer": "ik_max_word"
}

Result:

{
  "tokens": [
    { "token": "張三李四", "start_offset": 0, "end_offset": 4, "type": "CN_WORD", "position": 0 },
    { "token": "張三", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 1 },
    { "token": "三", "start_offset": 1, "end_offset": 2, "type": "TYPE_CNUM", "position": 2 },
    { "token": "李四", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 3 },
    { "token": "四", "start_offset": 3, "end_offset": 4, "type": "TYPE_CNUM", "position": 4 },
    { "token": "王", "start_offset": 4, "end_offset": 5, "type": "CN_CHAR", "position": 5 },
    { "token": "五", "start_offset": 5, "end_offset": 6, "type": "TYPE_CNUM", "position": 6 }
  ]
}

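The output shows that 張三李四王五 is never emitted as a single token, so no term in the index equals the full string and the query returns nothing.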
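The lookup behaviour can be modeled in a few lines of plain Java. This is only a toy sketch, not how Elasticsearch is implemented: the token set is copied from the _analyze output above (produced by ik_max_word, so the extra pinyin tokens of ik_max_word_pinyin are left out), and the class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy model (hypothetical names): a term query matches only if the exact
// search term is one of the tokens stored in the index for that field.
class TermLookupSketch {

    // Tokens copied from the _analyze output for "張三李四王五".
    static final Set<String> ANALYZED_TOKENS = new HashSet<>(Arrays.asList(
            "張三李四", "張三", "三", "李四", "四", "王", "五"));

    // A term query does not analyze its input; the term is looked up as-is.
    static boolean termMatches(Set<String> indexedTokens, String term) {
        return indexedTokens.contains(term);
    }

    public static void main(String[] args) {
        // The full string was never stored as one token, so there is no match.
        System.out.println(termMatches(ANALYZED_TOKENS, "張三李四王五")); // false
        // A token that the analyzer did emit is found.
        System.out.println(termMatches(ANALYZED_TOKENS, "張三"));         // true
    }
}
```

A keyword field behaves as if its token set held exactly one entry, the original string, which is why an exact lookup succeeds there.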

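Solution: update the field's mapping by adding a custom sub-field of type keyword. In ES a keyword field is not run through an analyzer, so the value is indexed exactly as given, which is what an exact match needs.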

POST /my_index/_mapping/my_type
{
  "properties": {
    "name": {
      "type": "text",
      "analyzer": "ik_max_word_pinyin",
      "fields": {
        "keyword": {    // custom sub-field name
          "type": "keyword"
        }
      }
    }
  }
}

Once the mapping is updated, delete the existing data and index it again. Querying name.keyword then finds the document.

public List<User> termQuery() {
    QueryBuilder queryBuilder = QueryBuilders.termQuery("age", 20);
    // QueryBuilder queryBuilder = QueryBuilders.termQuery("name.keyword", "張三李四王五");
    SearchQuery searchQuery = new NativeSearchQueryBuilder()
            .withIndices("my_index")
            .withTypes("my_type")
            .withQuery(queryBuilder)
            .build();
    // template is assumed to be an injected ElasticsearchTemplate
    List<User> list = template.queryForList(searchQuery, User.class);
    return list;
}

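4、Combining filters

The bool filter

Note: the official documentation is slightly out of date here. From 5.x on, filtered has been replaced by bool: "The filtered query is replaced by the bool query."

A bool filter is made up of three parts: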

{
  "bool" : {
    "must" : [],
    "should" : [],
    "must_not" : []
  }
}

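must: all of these clauses must match. Equivalent to AND.

must_not: none of these clauses may match. Equivalent to NOT.

should: at least one of these clauses must match. Equivalent to OR. Note that in a query context, once the same bool also contains a must clause, should clauses are optional by default and only influence the score; set minimum_should_match to 1 to require one of them.

It is that simple. When several filters are needed, just place them in the appropriate sections of the bool filter.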
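The three sections can be modeled with plain Java predicates. This is a minimal sketch under stated assumptions: it follows the AND / OR / NOT description above literally (a non-empty should list requires at least one match), documents are simple maps, and every name in it (BoolFilterSketch, bool, term, doc, sampleFilter) is hypothetical rather than part of any Elasticsearch API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Toy model of bool filter semantics (hypothetical names, not the ES API).
class BoolFilterSketch {

    // must = AND, should = OR (at least one when present), must_not = NOT.
    static <T> Predicate<T> bool(List<Predicate<T>> must,
                                 List<Predicate<T>> should,
                                 List<Predicate<T>> mustNot) {
        return doc -> must.stream().allMatch(p -> p.test(doc))
                && (should.isEmpty() || should.stream().anyMatch(p -> p.test(doc)))
                && mustNot.stream().noneMatch(p -> p.test(doc));
    }

    // Exact-value check on a single field.
    static Predicate<Map<String, Object>> term(String field, Object value) {
        return doc -> value.equals(doc.get(field));
    }

    // Builds a small document with a name and an age.
    static Map<String, Object> doc(String name, int age) {
        Map<String, Object> d = new HashMap<>();
        d.put("name", name);
        d.put("age", age);
        return d;
    }

    // (age == 20 OR age == 30) AND name == "張三"
    static Predicate<Map<String, Object>> sampleFilter() {
        return bool(
                Arrays.asList(term("name", "張三")),
                Arrays.asList(term("age", 20), term("age", 30)),
                Collections.emptyList());
    }

    public static void main(String[] args) {
        Predicate<Map<String, Object>> filter = sampleFilter();
        System.out.println(filter.test(doc("張三", 20))); // true
        System.out.println(filter.test(doc("張四", 22))); // false
    }
}
```

Because bool itself returns a predicate, one bool can be passed as a clause of another, which is the same composability that nested bool filters rely on.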

GET /my_index/my_type/_search
{
  "query" : {
    "bool" : {
      "should" : [
        { "term" : {"age" : 20}},
        { "term" : {"age" : 30}}
      ],
      "must" : {
        "term" : {"name.keyword" : "張三"}
      }
    }
  }
}

public List<User> boolQuery() {
    BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();
    boolQueryBuilder.should(QueryBuilders.termQuery("age", 20));
    boolQueryBuilder.should(QueryBuilders.termQuery("age", 30));
    boolQueryBuilder.must(QueryBuilders.termQuery("name.keyword", "張三"));
    SearchQuery searchQuery = new NativeSearchQueryBuilder()
            .withIndices("my_index")
            .withTypes("my_type")
            .withQuery(boolQueryBuilder)
            .build();
    List<User> list = template.queryForList(searchQuery, User.class);
    return list;
}

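Nested bool filters

Although bool is a compound filter that accepts multiple sub-filters, a bool filter is itself still just a filter. This means one bool filter can be placed inside another, which makes it possible to express arbitrarily complex Boolean logic.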

GET /my_index/my_type/_search
{
  "query" : {
    "bool" : {
      "should" : [
        { "term" : {"age" : 20}},
        { "bool" : {
            "must": [
              { "term": { "name.keyword": { "value": "李四" } }}
            ]
        }}
      ]
    }
  }
}

Result:

{
  "took": 0,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      { "_index": "my_index", "_type": "my_type", "_id": "1", "_score": 1, "_source": { "id": 1, "name": "張三", "age": 20 } },
      { "_index": "my_index", "_type": "my_type", "_id": "3", "_score": 1, "_source": { "id": 3, "name": "張三李四王五", "age": 20 } }
    ]
  }
}

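Because the term and bool filters are siblings, both sitting inside the outer should, a returned document needs to match at least one of them.

Clauses placed side by side inside a must are siblings too, so a hit has to match all of them at once. Here the inner must holds a single term clause on name.keyword, which no document satisfies, so both hits come from the age clause.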

5、Finding multiple exact values

GET my_index/my_type/_search
{
  "query": {
    "terms": {
      "age": [ 20, 22 ]
    }
  }
}

Result:

{
  "took": 0,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      { "_index": "my_index", "_type": "my_type", "_id": "2", "_score": 1, "_source": { "id": 2, "name": "張四", "age": 22 } },
      { "_index": "my_index", "_type": "my_type", "_id": "1", "_score": 1, "_source": { "id": 1, "name": "張三", "age": 20 } },
      { "_index": "my_index", "_type": "my_type", "_id": "3", "_score": 1, "_source": { "id": 3, "name": "張三李四王五", "age": 20 } }
    ]
  }
}

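It is essential to understand that term and terms are contains operations, not equals checks. For example, a term filter for search on a tags field also matches a document whose tags are ["search", "open_source"].

// list holds the values to match, for example Arrays.asList(20, 22)
TermsQueryBuilder termsQueryBuilder = QueryBuilders.termsQuery("age", list);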
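The difference between contains and equals is easy to see in a small stand-alone sketch. It treats a field as the collection of values stored for it; the class and method names are hypothetical and nothing here calls Elasticsearch.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Toy model (hypothetical names): term and terms test membership, not equality.
class TermsContainsSketch {

    // term: the field's values contain the given value.
    static boolean term(Collection<?> fieldValues, Object value) {
        return fieldValues.contains(value);
    }

    // terms: the field's values contain at least one of the candidates.
    static boolean terms(Collection<?> fieldValues, Collection<?> candidates) {
        return candidates.stream().anyMatch(fieldValues::contains);
    }

    public static void main(String[] args) {
        List<String> tags = Arrays.asList("search", "open_source");
        // Matches although the field holds more than just "search".
        System.out.println(term(tags, "search"));                  // true
        // An equality check against the whole field would fail here.
        System.out.println(tags.equals(Arrays.asList("search")));  // false
        // age 20 is one of the candidates 20 and 22.
        System.out.println(terms(Arrays.asList(20), Arrays.asList(20, 22))); // true
    }
}
```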

6、Range queries

1. Numeric range queries

GET my_index/my_type/_search
{
  "query": {
    "range": {
      "age": { "gte": 10, "lte": 20 }
    }
  }
}

Result:

{
  "took": 0,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      { "_index": "my_index", "_type": "my_type", "_id": "1", "_score": 1, "_source": { "id": 1, "name": "張三", "age": 20 } },
      { "_index": "my_index", "_type": "my_type", "_id": "3", "_score": 1, "_source": { "id": 3, "name": "張三李四王五", "age": 20 } }
    ]
  }
}

Note: gt (greater than), gte (greater than or equal to), lt (less than), lte (less than or equal to)

RangeQueryBuilder rangeQueryBuilder = QueryBuilders.rangeQuery("age").gte(10).lte(20);

2. Date range queries

Update the type to add a date field

POST /my_index/_mapping/my_type
{
"properties": {
"date":{
"type":"date",
"format":"yyyy-MM-dd"
}
}
}

Add data:

POST /my_index/my_type/_bulk
{ "index": { "_id":4}}
{ "id":4,"name": "趙六","age":20,"date":"2018-10-1"}
{ "index": { "_id": 5}}
{ "id":5,"name": "對七","age":22,"date":"2018-11-20"}
{ "index": { "_id": 6}}
{ "id":6,"name": "王八","age":20,"date":"2018-7-28"}

Query:

GET my_index/my_type/_search
{
  "query": {
    "range": {
      "date": { "gte": "2018-10-20", "lte": "2018-11-29" }
    }
  }
}

Result:

{
  "took": 0,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      { "_index": "my_index", "_type": "my_type", "_id": "5", "_score": 1, "_source": { "id": 5, "name": "對七", "age": 22, "date": "2018-11-20" } }
    ]
  }
}

RangeQueryBuilder rangeQueryBuilder = QueryBuilders.rangeQuery("date").gte("2018-10-20").lte("2018-11-29");

7、Handling null values

1. Add data

POST /my_index/posts/_bulk
{ "index": { "_id": "1" }}
{ "tags" : ["search"] }
{ "index": { "_id": "2" }}
{ "tags" : ["search", "open_source"] }
{ "index": { "_id": "3" }}
{ "other_field" : "some data" }
{ "index": { "_id": "4" }}
{ "tags" : null }
{ "index": { "_id": "5" }}
{ "tags" : ["search", null] }

2. Find documents in which a given field exists

GET /my_index/posts/_search
{
  "query" : {
    "constant_score" : {    // no score is computed; every hit gets 1 by default
      "filter" : {
        "exists" : { "field" : "tags" }
      }
    }
  }
}

Result:

{
  "took": 3,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      { "_index": "my_index", "_type": "posts", "_id": "5", "_score": 1, "_source": { "tags": [ "search", null ] } },
      { "_index": "my_index", "_type": "posts", "_id": "2", "_score": 1, "_source": { "tags": [ "search", "open_source" ] } },
      { "_index": "my_index", "_type": "posts", "_id": "1", "_score": 1, "_source": { "tags": [ "search" ] } }
    ]
  }
}

BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();
boolQueryBuilder.filter(QueryBuilders.constantScoreQuery(QueryBuilders.existsQuery("tags")));

3. Find documents in which a given field is missing

Note: the missing query has been removed as of ES 5; wrap exists in must_not instead.

GET /my_index/posts/_search
{
  "query" : {
    "bool": {
      "must_not": [
        { "constant_score": {
            "filter": {
              "exists": { "field": "tags" }
            }
        }}
      ]
    }
  }
}

Result:

{
  "took": 1,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      { "_index": "my_index", "_type": "posts", "_id": "4", "_score": 1, "_source": { "tags": null } },
      { "_index": "my_index", "_type": "posts", "_id": "3", "_score": 1, "_source": { "other_field": "some data" } }
    ]
  }
}

Note on null handling: a field whose value is null, an empty array, or an array containing only null is treated as if the field were absent.

boolQueryBuilder.mustNot(QueryBuilders.boolQuery().filter(QueryBuilders.constantScoreQuery(QueryBuilders.existsQuery("tags"))));
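The rule that decides whether a field counts as present can be written down as a small stand-alone check. This is a sketch of the behaviour seen in the two result sets above (documents 1, 2 and 5 exist; 3 and 4 are missing), not Elasticsearch code; the class and method names are hypothetical, and a field absent from the document is represented here by passing null.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Objects;

// Toy model (hypothetical names) of when an exists check counts a field as present.
class ExistsSketch {

    // null, an empty list, and a list holding only nulls all count as missing;
    // a list with at least one non-null element counts as present.
    static boolean exists(Object fieldValue) {
        if (fieldValue == null) {
            return false;
        }
        if (fieldValue instanceof List) {
            return ((List<?>) fieldValue).stream().anyMatch(Objects::nonNull);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(exists(Arrays.asList("search")));        // true  (doc 1)
        System.out.println(exists(Arrays.asList("search", null)));  // true  (doc 5)
        System.out.println(exists(null));                           // false (docs 3 and 4)
        System.out.println(exists(Collections.emptyList()));        // false
    }
}
```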

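8、About caching

1. Core idea

At its core, a bitset records which documents match a filter. Elasticsearch aggressively caches these bitsets for later use. Once cached, a bitset can be reused wherever the same filter appears again, without re-evaluating the whole filter.

These cached bitsets are "smart": they are updated incrementally. As new documents are indexed, only those documents need to be added to the existing bitsets, rather than recomputing the entire cache over and over. Like the rest of the system, filters are real-time, so there is no need to worry about cache expiry.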

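2. Independent filter caching

The bitsets belonging to a query component are independent of the rest of the search request it appears in. This means that, once cached, a query can be reused across multiple search requests; the bitsets do not depend on the surrounding query context. Caching can therefore speed up the most frequently used parts of a query without wasting overhead on the less frequent, more volatile parts.

Likewise, if a single request reuses the same non-scoring query, its cached bitset can be shared by every instance inside that one search.

Consider the following example query, which looks for emails that satisfy either of these conditions:

Query conditions (example): (1) in the inbox and not yet read; (2) not in the inbox but marked important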

GET /inbox/emails/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "should": [
            { "bool": {    // (1)
                "must": [
                  { "term": { "folder": "inbox" }},
                  { "term": { "read": false }}
                ]
            }},
            { "bool": {    // (2)
                "must_not": { "term": { "folder": "inbox" } },
                "must": { "term": { "important": true } }
            }}
          ]
        }
      }
    }
  }
}

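Clauses (1) and (2) contain the same folder: inbox term filter, so they use the same bitset.

Although one inbox clause sits in a must and the other in a must_not, the two filters themselves are identical. Once the first one has executed, its bitset is computed and cached for the other to use. When the query runs again, the inbox filter is already cached, so both clauses use the cached bitset.

This fits well with the composability of the query DSL. A filter can easily be moved anywhere in the expression, or reused in several places within the same query. That is not only convenient for developers but also directly benefits performance.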
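The reuse of the inbox bitset can be illustrated with java.util.BitSet. This is a simplified sketch, assuming an in-memory array of emails and a cache keyed by a filter string; all names (BitsetCacheSketch, bitsFor, unreadInboxOrImportantElsewhere) are hypothetical, and incremental updates and eviction are deliberately left out.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntPredicate;

// Toy filter cache (hypothetical names): one bitset per distinct filter.
class BitsetCacheSketch {
    private final Map<String, BitSet> cache = new HashMap<>();
    int computations = 0; // how many times a filter was actually evaluated

    // Returns the bitset for a filter key, evaluating the filter only on a miss.
    BitSet bitsFor(String key, int docCount, IntPredicate matches) {
        BitSet cached = cache.get(key);
        if (cached == null) {
            computations++;
            cached = new BitSet(docCount);
            for (int doc = 0; doc < docCount; doc++) {
                if (matches.test(doc)) {
                    cached.set(doc);
                }
            }
            cache.put(key, cached);
        }
        return (BitSet) cached.clone(); // callers combine their own copy
    }

    // (inbox AND unread) OR (important AND NOT inbox)
    static BitSet unreadInboxOrImportantElsewhere(BitsetCacheSketch c, String[] folder,
                                                  boolean[] read, boolean[] important) {
        int n = folder.length;
        BitSet first = c.bitsFor("folder:inbox", n, d -> folder[d].equals("inbox"));
        first.and(c.bitsFor("read:false", n, d -> !read[d]));

        BitSet second = c.bitsFor("important:true", n, d -> important[d]);
        // Same key as above: this call is served from the cache.
        second.andNot(c.bitsFor("folder:inbox", n, d -> folder[d].equals("inbox")));

        first.or(second);
        return first;
    }

    public static void main(String[] args) {
        String[] folder = {"inbox", "inbox", "archive", "archive"};
        boolean[] read = {false, true, true, false};
        boolean[] important = {false, true, true, false};
        BitsetCacheSketch cache = new BitsetCacheSketch();
        System.out.println(unreadInboxOrImportantElsewhere(cache, folder, read, important)); // {0, 2}
        System.out.println(cache.computations); // 3 evaluations for 4 filter uses
    }
}
```

The inbox filter is used twice, once positively and once negated, yet it is evaluated only once; negation is applied when the bitsets are combined, not when they are cached.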

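3. Automatic caching behavior

In earlier versions of Elasticsearch, the default was to cache everything that could be cached. This often meant that bitsets were cached too aggressively, and performance suffered from cache churn. In addition, many filters are very fast to evaluate but substantially slower to cache (and to reuse from the cache). Caching such filters makes little sense, since it is cheaper simply to execute them again.

Checking the inverted index is very fast, and most query components are used only rarely. Take a term filter on a "user_id" field: with millions of users, any particular user ID occurs only rarely. Caching the bitset for this filter does not pay off, because the cached result will probably be evicted before it is reused.

This kind of cache churn can seriously hurt performance. Worse, it makes it hard for developers to tell which caches are behaving well and which are useless.

To address this, Elasticsearch caches queries automatically based on how often they are used. If a non-scoring query has been used a few times within the last 256 queries (how many depends on the query type), it becomes a candidate for caching. Even then, not every segment is guaranteed to cache the bitset. Only segments holding more than 10,000 documents (or 3% of the total documents, whichever is larger) cache it, because small segments are quick to search and are soon merged away, so caching them gains little.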
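The two conditions described above can be sketched as a small admission policy. This is only an illustration: the numbers 256, 10,000 and 3% come from the text, while the class, its methods, the minUses parameter and the reading of the segment rule as "larger than the bigger of the two thresholds" are assumptions made for the example, not the actual Elasticsearch implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy caching policy (hypothetical names and structure).
class CacheAdmissionSketch {
    static final int HISTORY_SIZE = 256;          // recent queries remembered (from the text)
    static final int MIN_SEGMENT_DOCS = 10_000;   // from the text
    static final double MIN_SEGMENT_RATIO = 0.03; // 3% of all documents (from the text)

    private final Deque<String> history = new ArrayDeque<>();

    // Records one use of a query, keeping only the most recent HISTORY_SIZE uses.
    void recordUse(String queryKey) {
        if (history.size() == HISTORY_SIZE) {
            history.removeFirst();
        }
        history.addLast(queryKey);
    }

    // A query becomes a caching candidate once it appears at least minUses times
    // in the recent history. minUses is an assumed parameter; the text only says
    // the required count depends on the query type.
    boolean isCandidate(String queryKey, int minUses) {
        return history.stream().filter(queryKey::equals).count() >= minUses;
    }

    // Assumed interpretation: a segment caches bitsets only if it holds more
    // documents than the larger of 10,000 and 3% of the total.
    static boolean segmentEligible(long segmentDocs, long totalDocs) {
        double threshold = Math.max(MIN_SEGMENT_DOCS, MIN_SEGMENT_RATIO * totalDocs);
        return segmentDocs > threshold;
    }

    public static void main(String[] args) {
        CacheAdmissionSketch policy = new CacheAdmissionSketch();
        policy.recordUse("age:20");
        policy.recordUse("age:20");
        System.out.println(policy.isCandidate("age:20", 2));    // true
        System.out.println(segmentEligible(20_000, 100_000));   // true
        System.out.println(segmentEligible(20_000, 1_000_000)); // false
    }
}
```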

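Once cached, a non-scoring bitset stays in the cache until it is evicted. Eviction follows an LRU policy: when the cache is full, the least recently used filter is evicted.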
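LRU eviction itself is easy to demonstrate with the JDK's LinkedHashMap in access order. This is a generic sketch of the policy, not the cache Elasticsearch uses; the class name and the entry-count capacity are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache (hypothetical name) built on LinkedHashMap.
class LruFilterCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruFilterCacheSketch(int capacity) {
        // accessOrder = true: entries are ordered from least to most recently used.
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    // Called by LinkedHashMap after each insertion; returning true drops the
    // least recently used entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }

    public static void main(String[] args) {
        LruFilterCacheSketch<String, String> cache = new LruFilterCacheSketch<>(2);
        cache.put("folder:inbox", "bitset A");
        cache.put("read:false", "bitset B");
        cache.get("folder:inbox");               // touch: inbox is now most recently used
        cache.put("important:true", "bitset C"); // cache is full: read:false is evicted
        System.out.println(cache.keySet());      // [folder:inbox, important:true]
    }
}
```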