Elasticsearch ships with a lot of built-in token filters, and most of them never come up in day-to-day work. While preparing for the Elastic Certified Engineer exam, I had to get familiar with these less common filters. The official docs cover some of them only in passing, so I decided to organize my exam notes into a blog post for future reference, hoping it also helps others with the same need.
The official description:
A token filter of type length that removes words that are too long or too short for the stream.
The length filter removes tokens that are too long or too short for the stream. It has two configurable parameters: min, the minimum token length (default 0), and max, the maximum token length (default Integer.MAX_VALUE).
Let's run a quick test to see what it does:
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "length", "min": 1, "max": 3 }
  ],
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone"
}
Output:
{ "tokens" : [ { "token" : "The", "start_offset" : 0, "end_offset" : 3, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "2", "start_offset" : 4, "end_offset" : 5, "type" : "<NUM>", "position" : 1 }, { "token" : "the", "start_offset" : 36, "end_offset" : 39, "type" : "<ALPHANUM>", "position" : 7 } ] }
As you can see, all tokens longer than 3 characters have been filtered out.
To apply a length filter to a specific index, refer to the following example:
PUT /length_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "standard",
          "filter": ["my_length"]
        }
      },
      "filter": {
        "my_length": {
          "type": "length",
          "min": 1,
          "max": 3
        }
      }
    }
  }
}

GET length_example/_analyze
{
  "analyzer": "default",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bonet"
}
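Running the analyze request against this index should again leave only the short tokens The, 2 and the; bonet, at 5 characters, gets filtered out just like the other long words.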
The meaning of the ngram filter can be understood by reference to the ngram tokenizer; the latter is equivalent to the keyword tokenizer combined with the ngram filter, and produces the same result.
What it does: the text is split using the N-gram algorithm. N-grams work like a sliding window moving across a word; each gram is a contiguous character sequence of a given length.
That sounds rather abstract, so here's an example:
GET _analyze
{
  "tokenizer": "ngram",
  "text": "北京大學"
}

GET _analyze
{
  "tokenizer": "keyword",
  "filter": [
    { "type": "ngram", "min_gram": 1, "max_gram": 2 }
  ],
  "text": "北京大學"
}
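Both requests should return the same grams. With min_gram 1 and max_gram 2 (the ngram tokenizer's defaults, which the second request sets explicitly on the filter), 北京大學 is split into 北, 北京, 京, 京大, 大, 大學 and 學.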
As you can see, there are two properties, min_gram and max_gram. By default the gap between max and min, that is, the step size, can be at most 1; you can change this via the index setting max_ngram_diff, as in the following example:
PUT /ngram_example
{
  "settings": {
    "index": {
      "max_ngram_diff": 10
    },
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "keyword",
          "filter": ["my_ngram"]
        }
      },
      "filter": {
        "my_ngram": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 4
        }
      }
    }
  }
}
Test it with the index's analyzer:
GET ngram_example/_analyze
{
  "analyzer": "default",
  "text": "北京大學"
}
Output:
{ "tokens" : [ { "token" : "北京", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 0 }, { "token" : "北京大", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 0 }, { "token" : "北京大學", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 0 }, { "token" : "京大", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 0 }, { "token" : "京大學", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 0 }, { "token" : "大學", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 0 } ] }
By now you should have a basic grasp of how the ngram filter works, but you may still wonder: where is this filter actually used? It is well suited to prefix and infix search, for example search suggestions: when you have typed only part of a phrase, the search engine shows matches containing that fragment. See the sketch below.
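Here is a minimal sketch of that scenario. The index name suggest_example, the field title and the gram sizes are my own choices for illustration: the field is indexed with an ngram analyzer, while the query string is analyzed as a single keyword, so any fragment of the title will match.

PUT /suggest_example
{
  "settings": {
    "index": { "max_ngram_diff": 10 },
    "analysis": {
      "analyzer": {
        "ngram_analyzer": {
          "tokenizer": "keyword",
          "filter": ["my_ngram"]
        }
      },
      "filter": {
        "my_ngram": { "type": "ngram", "min_gram": 1, "max_gram": 10 }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ngram_analyzer",
        "search_analyzer": "keyword"
      }
    }
  }
}

PUT /suggest_example/_doc/1
{ "title": "北京大學" }

GET /suggest_example/_search
{ "query": { "match": { "title": "京大" } } }

The search for the fragment 京大 should hit the document, because that fragment was indexed as one of the grams.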
The trim filter's purpose is clear from its name: it removes leading and trailing whitespace. Here's an example:
GET _analyze
{
  "tokenizer": "keyword",
  "filter": [
    { "type": "trim" }
  ],
  "text": " 北京大學"
}
Output:
{ "tokens" : [ { "token" : " 北京大學", "start_offset" : 0, "end_offset" : 5, "type" : "word", "position" : 0 } ] }
The truncate filter has a length property that truncates tokens after tokenization, ensuring no term grows longer than length characters. Here's an example:
GET _analyze
{
  "tokenizer": "keyword",
  "filter": [
    { "type": "truncate", "length": 3 }
  ],
  "text": "北京大學"
}
Output:
{ "tokens" : [ { "token" : "北京大", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 0 } ] }
One more example:
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "truncate", "length": 3 }
  ],
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Output:
{ "tokens" : [ { "token" : "The", "start_offset" : 0, "end_offset" : 3, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "2", "start_offset" : 4, "end_offset" : 5, "type" : "<NUM>", "position" : 1 }, { "token" : "QUI", "start_offset" : 6, "end_offset" : 11, "type" : "<ALPHANUM>", "position" : 2 }, ...
When keywords can get very long, this filter can be used to avoid problems such as OOM errors.
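As a rough sketch of that usage (the index name and the 100-character cap are my own picks, not an official recommendation), you could cap term length on an index like this:

PUT /truncate_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "keyword",
          "filter": ["my_truncate"]
        }
      },
      "filter": {
        "my_truncate": { "type": "truncate", "length": 100 }
      }
    }
  }
}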
The unique token filter guarantees that identical tokens are emitted only once. Here's an example:
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["unique"],
  "text": "this is a test test test"
}
Output:
{ "tokens" : [ { "token" : "this", "start_offset" : 0, "end_offset" : 4, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "is", "start_offset" : 5, "end_offset" : 7, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "a", "start_offset" : 8, "end_offset" : 9, "type" : "<ALPHANUM>", "position" : 2 }, { "token" : "test", "start_offset" : 10, "end_offset" : 14, "type" : "<ALPHANUM>", "position" : 3 } ] }
The synonym filter. Its use case is this: say a document contains the word 番茄 (tomato), and we want a search for 番茄, 西紅柿, or 聖女果 to find that document. For example:
PUT /synonym_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "synonym": {
          "tokenizer": "whitespace",
          "filter": ["my_synonym"]
        }
      },
      "filter": {
        "my_synonym": {
          "type": "synonym",
          "synonyms_path": "analysis/synonym.txt"
        }
      }
    }
  }
}
We need to create a file analysis/synonym.txt under the config directory of the Elasticsearch instance, with the following content (one group of comma-separated synonyms per line):
番茄,西紅柿,聖女果
Remember to restart the node afterwards.
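If you'd rather not manage a separate file, the synonyms can also be declared inline in the filter definition via the synonyms parameter; a sketch with a hypothetical index name:

PUT /synonym_inline_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "synonym": {
          "tokenizer": "whitespace",
          "filter": ["my_synonym"]
        }
      },
      "filter": {
        "my_synonym": {
          "type": "synonym",
          "synonyms": ["番茄,西紅柿,聖女果"]
        }
      }
    }
  }
}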
Then test it:
GET /synonym_example/_analyze
{
  "analyzer": "synonym",
  "text": "番茄"
}
Output:
{ "tokens" : [ { "token" : "番茄", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 0 }, { "token" : "西紅柿", "start_offset" : 0, "end_offset" : 2, "type" : "SYNONYM", "position" : 0 }, { "token" : "聖女果", "start_offset" : 0, "end_offset" : 2, "type" : "SYNONYM", "position" : 0 } ] }
We know that an analyzer can contain multiple filters, so how is that done? Look at the following example:
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "length", "min": 1, "max": 4 },
    { "type": "truncate", "length": 3 }
  ],
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
In this example we chain the length filter and the truncate filter. The text is first split by the standard tokenizer, tokens longer than 4 characters are then dropped, and the remaining tokens are truncated to 3 characters. The output is:
{ "tokens" : [ { "token" : "The", "start_offset" : 0, "end_offset" : 3, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "2", "start_offset" : 4, "end_offset" : 5, "type" : "<NUM>", "position" : 1 }, { "token" : "ove", "start_offset" : 31, "end_offset" : 35, "type" : "<ALPHANUM>", "position" : 6 }, { "token" : "the", "start_offset" : 36, "end_offset" : 39, "type" : "<ALPHANUM>", "position" : 7 }, { "token" : "laz", "start_offset" : 40, "end_offset" : 44, "type" : "<ALPHANUM>", "position" : 8 }, { "token" : "bon", "start_offset" : 51, "end_offset" : 55, "type" : "<ALPHANUM>", "position" : 10 } ] }
To use this combination in an index, refer to the following example:
PUT /length_truncate_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "standard",
          "filter": ["my_length", "my_truncate"]
        }
      },
      "filter": {
        "my_length": {
          "type": "length",
          "min": 1,
          "max": 4
        },
        "my_truncate": {
          "type": "truncate",
          "length": 3
        }
      }
    }
  }
}

GET length_truncate_example/_analyze
{
  "analyzer": "default",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bonet"
}