From Stack Overflow:
https://stackoverflow.com/questions/6467067/how-to-search-for-a-part-of-a-word-with-elasticsearch
```
// Seed the data
POST /my_idx/my_type/_bulk
{"index": {"_id": "1"}}
{"name": "John Doeman", "function": "Janitor"}
{"index": {"_id": "2"}}
{"name": "Jane Doewoman", "function": "Teacher"}
{"index": {"_id": "3"}}
{"name": "Jimmy Jackal", "function": "Student"}
```
Elasticsearch now contains the following data:
```json
{ "_id": "1", "name": "John Doeman",   "function": "Janitor" }
{ "_id": "2", "name": "Jane Doewoman", "function": "Teacher" }
{ "_id": "3", "name": "Jimmy Jackal",  "function": "Student" }
```
The goal is to find every document whose name contains Doe:
```
// Returns no documents
GET /my_idx/my_type/_search?q=Doe

// Returns one document
GET /my_idx/my_type/_search?q=Doeman
```
The asker also swapped analyzers and switched to a request-body query, but that did not work either:
```
GET /my_idx/my_type/_search
{
  "query": {
    "term": { "name": "Doe" }
  }
}
```
They then tried an nGram tokenizer and filter:
```json
{
  "index": {
    "index": "my_idx",
    "type": "my_type",
    "bulk_size": "100",
    "bulk_timeout": "10ms",
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer",
          "filter": ["my_ngram_filter"]
        }
      },
      "filter": {
        "my_ngram_filter": {
          "type": "nGram",
          "min_gram": 1,
          "max_gram": 1
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": 1,
          "max_gram": 1
        }
      }
    }
  }
}
```
But this introduced a different problem: now any query returned all documents.
This is fundamentally an analysis problem. By default the index uses the standard analyzer. For the documents:
```json
{ "_id": "1", "name": "John Doeman",   "function": "Janitor" }
{ "_id": "2", "name": "Jane Doewoman", "function": "Teacher" }
{ "_id": "3", "name": "Jimmy Jackal",  "function": "Student" }
```
indexing produces a mapping like the following (considering only the tokens of the name field):
| term     | document IDs |
|----------|--------------|
| john     | 1 |
| doeman   | 1 |
| jane     | 2 |
| doewoman | 2 |
| jimmy    | 3 |
| jackal   | 3 |
Now consider the searches:
```
GET /my_idx/my_type/_search?q=Doe
```
The standard analyzer turns Doe into doe and looks it up in the inverted index. No term doe exists there, so nothing is returned.
```
GET /my_idx/my_type/_search?q=Doeman
```
The standard analyzer turns Doeman into doeman and finds that term in the index. Only doc ID 1 contains it, so exactly one document is returned.
```
GET /my_idx/my_type/_search
{
  "query": {
    "term": { "name": "Doe" }
  }
}
```
This is a term query, so Doe stays Doe; it is not run through the analyzer. But the capitalized Doe does not exist in the index either, so this approach also returns no documents.
As an aside, the asker never tried this variant:
```
GET /my_idx/my_type/_search
{
  "query": {
    "term": { "name": "Doeman" }
  }
}
```
Do not assume this one would work either: because term skips analysis, it looks up Doeman verbatim in the index, and no document matches, unless Doeman is changed to doeman.
Summarizing the Stack Overflow answers, there are several workable approaches.
Use a regexp query:

```
GET my_idx/my_type/_search
{
  "query": {
    "regexp": { "name": "doe.*" }
  }
}
```
Use query_string with a wildcard. Be aware that wildcard queries can use large amounts of memory and perform poorly. Suffix matching with a leading wildcard (e.g. "*ing") is a particularly heavy operation, since every term in the index has to be examined; leading wildcards can be disabled with allow_leading_wildcard.
```
GET my_idx/my_type/_search
{
  "query": {
    "query_string": {
      "default_field": "name",
      "query": "Doe*"
    }
  }
}
```
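Why a leading wildcard is so much heavier than a trailing one can be sketched against a sorted term dictionary. This is a toy model (Lucene's term dictionary is far more sophisticated), but the asymmetry it shows is the real one:

```python
from bisect import bisect_left

# Sorted term dictionary, as in our example index.
terms = ["doeman", "doewoman", "jackal", "jane", "jimmy", "john"]

# Trailing wildcard "doe*": binary search jumps straight to the first
# candidate, then walks forward only while the prefix still holds.
def prefix_match(prefix):
    i = bisect_left(terms, prefix)
    out = []
    while i < len(terms) and terms[i].startswith(prefix):
        out.append(terms[i])
        i += 1
    return out

# Leading wildcard "*man": the sort order gives no shortcut, so EVERY
# term in the dictionary must be checked. This is the expensive case.
def suffix_match(suffix):
    return [t for t in terms if t.endswith(suffix)]

print(prefix_match("doe"))  # -> ['doeman', 'doewoman']
print(suffix_match("man"))  # -> ['doeman', 'doewoman']
```

Both calls return the same documents here, but prefix_match touched two dictionary entries while suffix_match scanned all six; on a real index with millions of terms that difference is what makes leading wildcards dangerous.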
The original answer suggested the prefix query, but prefix does not analyze its input, so match_phrase_prefix is used here instead:
```
GET my_idx/my_type/_search
{
  "query": {
    "match_phrase_prefix": {
      "name": {
        "query": "Doe",
        "max_expansions": 10
      }
    }
  }
}
```
Or create the index with an n-gram analyzer:
```
PUT my_idx
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  }
}
```
Test the analyzer:
```
POST my_idx/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Doeman"
}

// response
{
  "tokens": [
    { "token": "Doe", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 },
    { "token": "oem", "start_offset": 1, "end_offset": 4, "type": "word", "position": 1 },
    { "token": "ema", "start_offset": 2, "end_offset": 5, "type": "word", "position": 2 },
    { "token": "man", "start_offset": 3, "end_offset": 6, "type": "word", "position": 3 }
  ]
}
```
Now a search for Doe finds the documents. The asker did in fact use ngram, but with min_gram and max_gram both set to 1. With single-character grams, nearly any query shares a gram with nearly every document, which is exactly why every query matched everything. The shorter the grams, the more documents match but the worse the match quality; the longer the grams, the more relevant the matches. A gram length of 3 (a tri-gram) is the recommended choice; the official documentation covers this in detail.
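The effect of the gram length can be sketched with a toy n-gram tokenizer. This is simple character slicing over a single string; the real ngram tokenizer also honors token_chars, Unicode rules, and per-field settings:

```python
# Toy n-gram tokenizer: emit every substring whose length is
# between min_gram and max_gram, after lowercasing.
def ngrams(text, min_gram, max_gram):
    text = text.lower()
    out = []
    for n in range(min_gram, max_gram + 1):
        for i in range(len(text) - n + 1):
            out.append(text[i:i + n])
    return out

# Trigrams of "Doeman": a query for "Doe" is analyzed to the gram
# "doe", which the document also produced, so it matches.
print(ngrams("Doeman", 3, 3))  # -> ['doe', 'oem', 'ema', 'man']

# With min_gram = max_gram = 1 (the asker's config), each name is
# reduced to single letters, so almost any query term overlaps
# almost every document -- hence "every query returns everything".
print(ngrams("Doeman", 1, 1))  # -> ['d', 'o', 'e', 'm', 'a', 'n']
```

Doubling the gram length from 1 to 3 shrinks the posting lists from one-per-letter to one-per-trigram, which is what restores selectivity.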