Concepts involved
An Analyzer is typically composed of three parts: character filters, tokenizers, and token filters. Once you understand how an Analyzer works, you can configure one to fit your own use case.
Elasticsearch ships with 10 tokenizers, 31 token filters, 3 character filters, and a pile of configuration options. On top of that, plugins can be installed to extend its capabilities. These are the raw materials for building an analyzer.
Internally, an Analyzer is simply a pipeline.
Elasticsearch comes with 8 built-in Analyzers. If none of them meets our needs, we can build a custom Analyzer through the Settings API.
PUT /my-index/_settings
{
  "index": {
    "analysis": {
      "analyzer": {
        "customHTMLSnowball": {
          "type": "custom",
          "char_filter": [
            "html_strip"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "stop",
            "snowball"
          ]
        }
      }
    }
  }
}
The custom Analyzer above is named customHTMLSnowball. Here is what each piece does:
Remove HTML tags (html_strip character filter), e.g. <p>, <a>, <div>.
Split the text into tokens and drop punctuation (standard tokenizer).
Convert uppercase words to lowercase (lowercase token filter).
Filter out stop words (stop token filter), such as "the", "they", "i", "a", "an", "and".
Extract word stems (snowball token filter; the Snowball algorithm is one of the most widely used algorithms for stemming English).
cats -> cat
catty -> cat
stemmer -> stem
stemming -> stem
stemmed -> stem
The two <em>lazy</em> dogs, were slower than the less lazy <em>dog</em>
A picture is worth a thousand words: when this text is handed to customHTMLSnowball, it is processed as follows.
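To see the pipeline in action without the diagram, you can also push the same text through the _analyze API. This is a hedged sketch: it assumes the customHTMLSnowball analyzer defined above has been added to my-index, and uses the request-body form of _analyze available in newer Elasticsearch versions.

GET /my-index/_analyze
{
  "analyzer": "customHTMLSnowball",
  "text": "The two <em>lazy</em> dogs, were slower than the less lazy <em>dog</em>"
}

The HTML tags are stripped, the text is tokenized, stop words such as "the" are dropped, and the remaining tokens are lowercased and stemmed (e.g. lazy -> lazi, dogs -> dog).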
When the search scenario involves long English text such as blog posts, news articles, or forum threads, it is best to use an analyzer that includes a stemming token filter.
Common stemming token filters include: stemmer, snowball, and porter_stem.
Take the snowball analyzer as an example: its snowball token filter reduces sing / sings / singing to the stem sing, and its stop filter discards the stop words "they" and "are". Whether the user searches for sing, sings, or singing, the search is based on the same term "sing", so the result set is identical.
GET http://localhost:9200/_analyze?text=I%20sing%20he%20sings%20they%20are%20singing&analyzer=snowball

// Output (abbreviated)
{
  "tokens": [
    {"token": "i",    "position": 1, ...},
    {"token": "sing", "position": 2, ...},
    {"token": "he",   "position": 3, ...},
    {"token": "sing", "position": 4, ...},
    {"token": "sing", "position": 7, ...},
  ]
}
Stemming is widely used in English search, but it has its limitations:
Stemming is of little (if any) value for Chinese.
When searching for technical terms or people's names, stemming can actually make the results worse.
e.g. "flying fish" and "fly fishing" mean completely different things, but after snowball processing they share the same terms: fli and fish.
So when a user searches for fly-fishing information, they get results about flying fish instead, which is far from ideal.
For scenarios like this, exact-style search is recommended: a simple analysis strategy (no stemming, just lowercasing) combined with a fuzzy query is probably the better choice.
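A minimal sketch of that suggestion (the index name, field name, and the lowercase_only analyzer name below are made up for illustration, and mapping the title field to that analyzer is omitted for brevity): define a custom analyzer that only lowercases, then query with fuzziness enabled.

PUT /my-index2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_only": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}

GET /my-index2/_search
{
  "query": {
    "match": {
      "title": {
        "query": "fly fishing",
        "fuzziness": "AUTO"
      }
    }
  }
}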
Tokenizing English is relatively easy: splitting on whitespace and punctuation gets you most of the way there. But Chinese has no spaces between words, and German sometimes glues two words together, so the default standard analyzer no longer does the job.
> curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty=true' -d '耶穌登山寶訓'
{
  "tokens" : [
    { "token" : "耶", "start_offset" : 0, "end_offset" : 1, "type" : "<IDEOGRAPHIC>", "position" : 1 },
    { "token" : "穌", "start_offset" : 1, "end_offset" : 2, "type" : "<IDEOGRAPHIC>", "position" : 2 },
    { "token" : "登", "start_offset" : 2, "end_offset" : 3, "type" : "<IDEOGRAPHIC>", "position" : 3 },
    { "token" : "山", "start_offset" : 3, "end_offset" : 4, "type" : "<IDEOGRAPHIC>", "position" : 4 },
    { "token" : "寶", "start_offset" : 4, "end_offset" : 5, "type" : "<IDEOGRAPHIC>", "position" : 5 },
    { "token" : "訓", "start_offset" : 5, "end_offset" : 6, "type" : "<IDEOGRAPHIC>", "position" : 6 }
  ]
}
The standard analyzer breaks 「耶穌登山寶訓」 into six individual characters, which is not very useful. A more reasonable result would be ["耶穌", "登山寶訓"].
At this point we need to turn to plugins to handle Chinese word segmentation. mmseg is a fairly reliable plugin for Chinese; after installing it, the mmseg analyzer does a decent job on Chinese text.
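As a quick check (assuming the elasticsearch-analysis-mmseg plugin is installed; mmseg_maxword is the analyzer name the plugin registers in recent releases, but it may differ depending on your version), you could rerun the example above against it:

curl -XGET 'localhost:9200/_analyze?analyzer=mmseg_maxword&pretty=true' -d '耶穌登山寶訓'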
When searching usernames, product categories, or tags, we want exact matches. When indexing these fields it is best not to tokenize or stem at all; we can skip the analyzer step entirely.
We can specify "index": "not_analyzed" in the field's mapping, so the raw text is stored directly as a single term.
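A minimal mapping sketch (the index and field names are illustrative; the "index": "not_analyzed" syntax applies to string fields up to Elasticsearch 2.x, while 5.x and later would use "type": "keyword" instead):

PUT /my-index3
{
  "mappings": {
    "product": {
      "properties": {
        "tag": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}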
First, test the basic behavior of the IK analyzer.
POST _analyze?pretty
{
  "analyzer": "ik_smart",
  "text": "中華人民共和國國歌"
}
Result:
{ "tokens": [ { "token": "中華人民共和國", "start_offset": 0, "end_offset": 7, "type": "CN_WORD", "position": 0 }, { "token": "國歌", "start_offset": 7, "end_offset": 9, "type": "CN_WORD", "position": 1 } ] }
As you can see, ik_smart is smart enough to segment "中華人民共和國國歌" correctly.
Another example:
POST _analyze?pretty
{
  "analyzer": "ik_smart",
  "text": "王者榮耀是最好玩的遊戲"
}
Result:
{ "tokens": [ { "token": "王者", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0 }, { "token": "榮耀", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 1 }, { "token": "最", "start_offset": 5, "end_offset": 6, "type": "CN_CHAR", "position": 2 }, { "token": "好玩", "start_offset": 6, "end_offset": 8, "type": "CN_WORD", "position": 3 }, { "token": "遊戲", "start_offset": 9, "end_offset": 11, "type": "CN_WORD", "position": 4 } ] }
If your result differs from mine, that is expected: the built-in IK dictionary treats 「王者榮耀」 as separate words, but we don't want it split apart. Following the instructions on GitHub, we can configure this.
IKAnalyzer.cfg.xml is located at: elasticsearch-5.4.0/plugins/ik/config
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Users can configure their own extension dictionaries here -->
    <entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
    <!-- Users can configure their own extension stop-word dictionaries here -->
    <entry key="ext_stopwords">custom/ext_stopword.dic</entry>
    <!-- Users can configure remote extension dictionaries here -->
    <!-- <entry key="remote_ext_dict">words_location</entry> -->
    <!-- Users can configure remote extension stop-word dictionaries here -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
//TODO
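A rough sketch of what that step usually looks like (assuming the custom/mydict.dic path referenced by ext_dict above): the extension dictionary is a plain-text file with one word per line; after adding the word and restarting Elasticsearch so IK reloads the dictionary, 「王者榮耀」 is kept as a single token.

# plugins/ik/config/custom/mydict.dic  (UTF-8, one word per line)
王者榮耀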
Once this is configured, you will get the result described above.
While we're at it, let's test ik_max_word.
POST _analyze?pretty
{
  "analyzer": "ik_max_word",
  "text": "中華人民共和國國歌"
}

The result (shown just for reference):

{
  "tokens": [
    { "token": "中華人民共和國", "start_offset": 0, "end_offset": 7, "type": "CN_WORD", "position": 0 },
    { "token": "中華人民", "start_offset": 0, "end_offset": 4, "type": "CN_WORD", "position": 1 },
    { "token": "中華", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 2 },
    { "token": "華人", "start_offset": 1, "end_offset": 3, "type": "CN_WORD", "position": 3 },
    { "token": "人民共和國", "start_offset": 2, "end_offset": 7, "type": "CN_WORD", "position": 4 },
    { "token": "人民", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 5 },
    { "token": "共和國", "start_offset": 4, "end_offset": 7, "type": "CN_WORD", "position": 6 },
    { "token": "共和", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 7 },
    { "token": "國", "start_offset": 6, "end_offset": 7, "type": "CN_CHAR", "position": 8 },
    { "token": "國歌", "start_offset": 7, "end_offset": 9, "type": "CN_WORD", "position": 9 }
  ]
}
Now let's look at an example from the GitHub page.
POST /index/fulltext/_mapping
{
  "fulltext": {
    "_all": {
      "analyzer": "ik_smart"
    },
    "properties": {
      "content": {
        "type": "text"
      }
    }
  }
}
Index some documents:
POST /index/fulltext/1
{ "content": "美國留給伊拉克的是個爛攤子嗎" }

POST /index/fulltext/2
{ "content": "公安部:各地校車將享最高路權" }

POST /index/fulltext/3
{ "content": "中韓漁警衝突調查:韓警平均天天扣1艘中國漁船" }

POST /index/fulltext/4
{ "content": "中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首" }
Query:
POST /index/fulltext/_search
{
  "query": {
    "match": {
      "content": "中國"
    }
  }
}
Result:
{ "took": 3, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 3, "max_score": 1.0869478, "hits": [ { "_index": "index", "_type": "fulltext", "_id": "4", "_score": 1.0869478, "_source": { "content": "中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首" } }, { "_index": "index", "_type": "fulltext", "_id": "3", "_score": 0.61094594, "_source": { "content": "中韓漁警衝突調查:韓警平均天天扣1艘中國漁船" } }, { "_index": "index", "_type": "fulltext", "_id": "1", "_score": 0.27179778, "_source": { "content": "美國留給伊拉克的是個爛攤子嗎" } } ] } }
Elasticsearch indexes the analyzed terms and then returns results ranked by score according to your query conditions.
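If you want to see how those scores were computed, Elasticsearch's standard "explain": true option returns the scoring breakdown for each hit (shown here on the same query as above):

POST /index/fulltext/_search
{
  "explain": true,
  "query": {
    "match": {
      "content": "中國"
    }
  }
}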
There is an example on the project page worth studying: https://github.com/medcl/elasticsearch-analysis-ik
Let's look at another interesting example.
PUT /index1
{
  "settings": {
    "refresh_interval": "5s",
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "_default_": {
      "_all": { "enabled": false }
    },
    "resource": {
      "dynamic": false,
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "cn": {
              "type": "text",
              "analyzer": "ik_smart"
            },
            "en": {
              "type": "text",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}
Multi-fields (fields) serve two purposes:
1. A single string value can be mapped as a text field for full-text search and as a keyword field for sorting and aggregations (see the sketch after this list);
2. It effectively gives the same value several aliases, each analyzed with a different analyzer.
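As a sketch of the first point (the index name and the raw sub-field name below are illustrative, not part of the example above): a text field with a keyword sub-field lets you run full-text queries against the analyzed value while sorting or aggregating on the raw value.

PUT /my-index4
{
  "mappings": {
    "resource": {
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "raw": { "type": "keyword" }
          }
        }
      }
    }
  }
}

GET /my-index4/resource/_search
{
  "query": { "match": { "title": "新電影" } },
  "sort": [ { "title.raw": "asc" } ]
}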
Bulk-insert some documents:
POST /_bulk
{ "create": { "_index": "index1", "_type": "resource", "_id": 1 } }
{ "title": "周星馳最新電影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 2 } }
{ "title": "周星馳最好看的新電影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 3 } }
{ "title": "周星馳最新電影,最好,新電影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 4 } }
{ "title": "最最最最好的新新新新電影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 5 } }
{ "title": "I'm not happy about the foxes" }
Query:
POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "fox",
      "fields": "title"
    }
  }
}
Result:
{ "took": 1, "timed_out": false, "_shards": { "total": 1, "successful": 1, "failed": 0 }, "hits": { "total": 0, "max_score": null, "hits": [] } }
The reason: we searched for fox in the title field, and title uses the standard analyzer, so the indexed term is foxes and nothing matches. The query below, however, does return results:
POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "fox",
      "fields": "title.en"
    }
  }
}
I won't list the result here; it matches because title.en uses the english analyzer, which stems foxes down to fox.
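A quick way to convince yourself of the difference (a small aside, not part of the original example) is to run the same word through both analyzers with _analyze:

POST _analyze?pretty
{
  "analyzer": "standard",
  "text": "foxes"
}
// returns the single token "foxes"

POST _analyze?pretty
{
  "analyzer": "english",
  "text": "foxes"
}
// returns the single token "fox"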
Compare the outputs of the following queries to get a feel for how multi-fields are used:
GET /index1/resource/_search
{
  "query": {
    "match": {
      "title.cn": "the最好遊戲"
    }
  }
}

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "the最新遊戲",
      "fields": [ "title", "title.cn", "title.en" ]
    }
  }
}

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "the最新",
      "fields": "title.cn"
    }
  }
}
Study the results to get a sense of the behavior.
Next, let's test with 「王者榮耀」. Here you can see that the HotWords.php configured earlier is a double-edged sword: once 「王者榮耀」 is added there, it is treated as a single word and will not be split into 「王者」 and 「榮耀」. But what if we still want to search for just 王者? This is where multi-fields shine, as shown below.
First, index the data:
POST /_bulk
{ "create": { "_index": "index1", "_type": "resource", "_id": 6 } }
{ "title": "王者榮耀最好玩的遊戲" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 7 } }
{ "title": "王者榮耀最好玩的新遊戲" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 8 } }
{ "title": "王者榮耀最新遊戲,最好玩,新遊戲" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 9 } }
{ "title": "最最最最好的新新新新遊戲" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 10 } }
{ "title": "I'm not happy about the foxes" }
POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "王者榮耀",
      "fields": "title.cn"
    }
  }
}

# The following query returns no results
POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "王者",
      "fields": "title.cn"
    }
  }
}

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "王者",
      "fields": "title"
    }
  }
}
Comparing the results makes the point obvious; the output is omitted here.
So you need a solid understanding of the business requirements up front in order to design a good mapping, which also saves a lot of trouble at search time.
Commands for inspecting analysis: after configuring ES, you should test the analysis to check that it behaves as expected.
Using curl:
1. Inspect analysis with a custom analyzer (ansj_index_synonym is the custom analyzer's name; pretty prints the JSON nicely):
curl -XGET 'http://localhost:8200/zh/_analyze?analyzer=ansj_index_synonym&pretty' -d '童裝童鞋'
2. Inspect analysis with a specific tokenizer and filters:
curl -XGET 'http://localhost:8200/zh/_analyze?tokenizer=ansj_index&filters=synonym&pretty' -d '童裝童鞋'
3. Inspect analysis for a specific field:
curl -XGET 'http://localhost:8200/zh/_analyze?field=brand_name&pretty' -d '童裝童鞋'
"brand_name" is the field name; if the field is of nested or object type, you can also write it as "brand_name.name".
Besides defining your own analyzers, ES also has built-in analyzers such as:
standard
simple
whitespace
stop
keyword
pattern
language
snowball
custom
You'll need decent English to work through the docs for these, folks.
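As a small illustration, here is roughly how a few of these built-in analyzers treat the same sentence (the tokens in the comments are approximate):

# whitespace: splits on spaces only, keeps case      -> [It's, a, lazy, DOG.]
curl -XGET 'localhost:9200/_analyze?analyzer=whitespace&pretty' -d "It's a lazy DOG."

# simple: splits on non-letters and lowercases        -> [it, s, a, lazy, dog]
curl -XGET 'localhost:9200/_analyze?analyzer=simple&pretty' -d "It's a lazy DOG."

# stop: like simple, but drops English stop words     -> [s, lazy, dog]
curl -XGET 'localhost:9200/_analyze?analyzer=stop&pretty' -d "It's a lazy DOG."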
ES also has built-in tokenizers and token filters:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-tokenizers.html
standard
edge_ngram
keyword
letter
lowercase
ngram
whitespace
pattern
uax_url_email
path_hierarchy
asciifolding
length
lowercase
uppercase
nGram
edge_ngram
porter_stem
shingle
stop
word_delimiter
stemmer
stemmer_override
keyword_marker
keyword_repeat
kstem
snowball
phonetic
synonym
reverse
elision
truncate
unique
pattern_capture
pattern_replace
trim
limit
hunspell
common_grams
normalization
delimited_payload
keep_words
References:
https://github.com/medcl/elasticsearch-analysis-ik
http://keenwon.com/1404.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html#_example_output