When Elasticsearch builds its inverted index, it uses an analyzer to analyze each document before indexing it. The algorithm that extracts tokens from the document is the tokenizer; the pre-processing applied before tokenization is done by character filters; further processing of the tokens is done by token filters; the final output is the terms. This whole analysis pipeline is called an analyzer.
Its workflow:
- Character filter: filters unwanted characters out of the document (for example, HTML markup such as <br/>).
- Tokenizer: splits large chunks of text into tokens (for example, splitting a sentence on whitespace).
- Token filter: filters and transforms the resulting tokens (for example, removing common English words such as "a" and "the", or stripping English plural endings).

_analyze
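The _analyze API lets us watch this pipeline in action. Besides a named analyzer, it also accepts an ad-hoc combination of character filter, tokenizer, and token filters, which is handy for inspecting each stage on its own. A minimal sketch (the request body below is made up purely for illustration):

GET /_analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "standard",
  "filter": [ "lowercase" ],
  "text": "<p>The QUICK Brown-Foxes</p>"
}

Here html_strip removes the <p> tags, the standard tokenizer splits the remaining text into words, and the lowercase token filter normalizes their case.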
Let's check whether ES tokenization matches our expectations by analyzing the sentence below with the default analyzer. The result contains each token, its start and end offsets, its type, and its position; for now we only care about the tokens.

GET /jindouwin_search_group/_analyze
{
  "text": "Her(5) a Black-cats"
}
Result:
"tokens": [ { "token": "her", "start_offset": 0, "end_offset": 3, "type": "<ALPHANUM>", "position": 0 }, { "token": "5", "start_offset": 4, "end_offset": 5, "type": "<NUM>", "position": 1 }, { "token": "a", "start_offset": 7, "end_offset": 8, "type": "<ALPHANUM>", "position": 2 }, { "token": "black", "start_offset": 9, "end_offset": 14, "type": "<ALPHANUM>", "position": 3 }, { "token": "cats", "start_offset": 15, "end_offset": 19, "type": "<ALPHANUM>", "position": 4 } ] }
From the result we can see that the tokenizer first discards the useless punctuation and splits the sentence into Her, 5, a, Black, cats, and that a token filter then lowercases the tokens.
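To confirm that the lowercasing happens in the token filter stage rather than in the tokenizer itself, we can run the standard tokenizer on its own through _analyze (a quick check using the same sentence):

GET /_analyze
{
  "tokenizer": "standard",
  "text": "Her(5) a Black-cats"
}

Without the lowercase filter, the tokens should come back as Her, 5, a, Black, cats, still with their original capitalization.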
Besides the standard analyzer, ES ships with other built-in analyzers such as english and stop, as well as components such as the lowercase token filter. Let's look at the result of analyzing the same sentence with the english analyzer.
GET /jindouwin_search_group/_analyze
{
  "text": "Her(5) a Black-cats",
  "analyzer": "english"
}

Result:

{
  "tokens": [
    {
      "token": "her",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "5",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "black",
      "start_offset": 9,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "cat",
      "start_offset": 15,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 4
    }
  ]
}
We can clearly see that the english analyzer removed a common stop word (a) and stemmed cats to its singular form.
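The english analyzer is itself just a packaged combination of a tokenizer and token filters. As a rough approximation (this sketch is not its exact internal definition, which also handles possessives and uses English-specific stop-word and stemmer settings), we can get similar behaviour by combining built-in filters ourselves:

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [ "lowercase", "stop", "porter_stem" ],
  "text": "Her(5) a Black-cats"
}

Here the stop filter drops common English stop words and porter_stem reduces plural forms such as cats to cat.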
Of course, part of ES's power is that besides the built-in analyzers, we can define our own by assembling the three components (character filters, a tokenizer, and token filters), or we can use an analyzer built by someone else; the best-known example is the ik Chinese analysis plugin.
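For example, assuming the ik plugin has been installed, it provides analyzers named ik_smart and ik_max_word that plug into the same _analyze API (the sample text is only for illustration):

GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "中文分詞測試"
}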
Beyond that, we can also write custom character filters, tokenizers, and token filters of our own.
We won't go through every built-in analyzer type here; you can look them up on the official website.
Example from the official documentation:
As a demonstration, let's create a custom analyzer together. It will strip out HTML, replace & with and, tokenize with the standard tokenizer, lowercase the tokens, and remove a custom list of stop words (the and a). First, the mapping character filter that turns & into and:
"char_filter": { "&_to_and": { "type": "mapping", "mappings": [ "&=> and "] } }
"filter": { "my_stopwords": { "type": "stop", "stopwords": [ "the", "a" ] } }
Our analyzer definition combines these custom filters with the predefined tokenizer and filters:
"analyzer": { "my_analyzer": { "type": "custom", "char_filter": [ "html_strip", "&_to_and" ], "tokenizer": "standard", "filter": [ "lowercase", "my_stopwords" ] } }
Putting it all together, the complete create-index request looks like this:
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": [ "&=> and "]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": [ "the", "a" ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip", "&_to_and" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "my_stopwords" ]
        }
      }
    }
  }
}
After the index has been created, use the analyze API to test the new analyzer:
GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The quick & brown fox"
}
The abbreviated result below shows that our analyzer is working correctly:
{ "tokens": [ { "token": "quick", "start_offset": 4, "end_offset": 9, "type": "<ALPHANUM>", "position": 1 }, { "token": "and", "start_offset": 10, "end_offset": 11, "type": "<ALPHANUM>", "position": 2 }, { "token": "brown", "start_offset": 12, "end_offset": 17, "type": "<ALPHANUM>", "position": 3 }, { "token": "fox", "start_offset": 18, "end_offset": 21, "type": "<ALPHANUM>", "position": 4 } ] }
This analyzer isn't of much use until we tell Elasticsearch where to apply it. We can apply it to a string field like this:
PUT /my_index/_mapping/my_type
{
  "properties": {
    "title": {
      "type": "string",
      "analyzer": "my_analyzer"
    }
  }
}
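Note that the request above uses the older mapping-type syntax (my_type) and the string field type. On newer Elasticsearch versions (7.x and later), where mapping types were removed and string was replaced by text, the equivalent request would look roughly like this:

PUT /my_index/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}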