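Elasticsearch ships with a number of built-in analyzers, but none of the defaults handles Chinese well, so Chinese support requires installing a separate plugin. The two most commonly used options are smartcn, which is based on ICTCLAS from the Chinese Academy of Sciences, and IK Analyzer; both give good results. At the time of writing, IK Analyzer does not yet support the latest Elasticsearch release, 2.2.0. smartcn, by contrast, is an officially supported plugin that provides an analyzer for Chinese or mixed Chinese-English text, and it works with 2.2.0. However, smartcn does not support custom dictionaries, so it is best suited to initial testing. The later part of this article explains how to make IK Analyzer work with the latest version.
Install the analyzer: plugin install analysis-smartcn
Uninstall: plugin remove analysis-smartcn
Test: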
Request: POST http://127.0.0.1:9200/_analyze/
{ "analyzer": "smartcn", "text": "聯想是全球最大的筆記本廠商" }
Response:
{ "tokens": [ { "token": "聯想", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0 }, { "token": "是", "start_offset": 2, "end_offset": 3, "type": "word", "position": 1 }, { "token": "全球", "start_offset": 3, "end_offset": 5, "type": "word", "position": 2 }, { "token": "最", "start_offset": 5, "end_offset": 6, "type": "word", "position": 3 }, { "token": "大", "start_offset": 6, "end_offset": 7, "type": "word", "position": 4 }, { "token": "的", "start_offset": 7, "end_offset": 8, "type": "word", "position": 5 }, { "token": "筆記本", "start_offset": 8, "end_offset": 11, "type": "word", "position": 6 }, { "token": "廠商", "start_offset": 11, "end_offset": 13, "type": "word", "position": 7 } ] }
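The same request can also be sent from a script. The following is a minimal Python sketch, assuming a node is listening on http://127.0.0.1:9200 with the smartcn plugin installed; the helper names `build_analyze_request`, `token_strings` and `analyze` are illustrative and not part of any Elasticsearch client.

```python
import json
import urllib.request


def build_analyze_request(base_url, analyzer, text):
    """Build a POST request for the _analyze endpoint (nothing is sent yet)."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/_analyze/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_strings(response):
    """Return just the token text from a parsed _analyze response."""
    return [t["token"] for t in response["tokens"]]


def analyze(base_url, analyzer, text):
    """Send the request and return the token list (needs a running node)."""
    request = build_analyze_request(base_url, analyzer, text)
    with urllib.request.urlopen(request) as reply:
        return token_strings(json.loads(reply.read().decode("utf-8")))
```

With a node running, calling `analyze("http://127.0.0.1:9200", "smartcn", "聯想是全球最大的筆記本廠商")` should return the eight tokens listed in the response above.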
For comparison, let's look at the output of the standard analyzer: in the request, replace smartcn with standard.
Then look at the response:
{ "tokens": [ { "token": "聯", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0 }, { "token": "想", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1 }, { "token": "是", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2 }, { "token": "全", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3 }, { "token": "球", "start_offset": 4, "end_offset": 5, "type": "<IDEOGRAPHIC>", "position": 4 }, { "token": "最", "start_offset": 5, "end_offset": 6, "type": "<IDEOGRAPHIC>", "position": 5 }, { "token": "大", "start_offset": 6, "end_offset": 7, "type": "<IDEOGRAPHIC>", "position": 6 }, { "token": "的", "start_offset": 7, "end_offset": 8, "type": "<IDEOGRAPHIC>", "position": 7 }, { "token": "筆", "start_offset": 8, "end_offset": 9, "type": "<IDEOGRAPHIC>", "position": 8 }, { "token": "記", "start_offset": 9, "end_offset": 10, "type": "<IDEOGRAPHIC>", "position": 9 }, { "token": "本", "start_offset": 10, "end_offset": 11, "type": "<IDEOGRAPHIC>", "position": 10 }, { "token": "廠", "start_offset": 11, "end_offset": 12, "type": "<IDEOGRAPHIC>", "position": 11 }, { "token": "商", "start_offset": 12, "end_offset": 13, "type": "<IDEOGRAPHIC>", "position": 12 } ] }
As you can see, the result is basically unusable: every single Chinese character becomes its own token.
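To make the effect concrete, here is a rough Python sketch that mimics this one-character-per-token behaviour for pure-Chinese input. It is not the standard analyzer itself, which follows Unicode text segmentation rules and treats Latin words and digits differently; `split_per_character` is a hypothetical helper used only for illustration.

```python
def split_per_character(text):
    """Emit one token per character, using the same field names as _analyze."""
    return [
        {"token": ch, "start_offset": i, "end_offset": i + 1, "position": i}
        for i, ch in enumerate(text)
    ]


tokens = split_per_character("聯想是全球最大的筆記本廠商")
# 13 characters give 13 tokens, so a search for 筆記本 has to match
# three independent single-character terms instead of one word.
print(len(tokens))
```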
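This article is an original work by 賽克藍德 (secisland); when reposting, please credit the author and the source.
The latest version on GitHub currently supports only Elasticsearch 2.1.1; the repository is at https://github.com/medcl/elasticsearch-analysis-ik . Since the latest Elasticsearch release is now 2.2.0, a few changes are needed before the plugin will work with it.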
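1. Download the source code and extract it to any directory, then edit the pom.xml file in the elasticsearch-analysis-ik-master directory: find the <elasticsearch.version> line and change the version number to 2.2.0.
2. Build the code with mvn package.
3. When the build finishes, the file elasticsearch-analysis-ik-1.7.0.zip is generated under target\releases.
4. Extract that file into the Elasticsearch/plugins directory.
5. Add one line to the Elasticsearch configuration file: index.analysis.analyzer.ik.type : "ik"
6. Restart Elasticsearch.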
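Step 1 above can be scripted if the plugin has to be rebuilt often. The following is a minimal Python sketch, assuming the property appears in pom.xml as a single `<elasticsearch.version>…</elasticsearch.version>` element; the function name `set_elasticsearch_version` is hypothetical.

```python
import re


def set_elasticsearch_version(pom_text, version):
    """Replace the value of the <elasticsearch.version> property in pom.xml text."""
    pattern = r"(<elasticsearch\.version>)[^<]*(</elasticsearch\.version>)"
    # A function replacement avoids ambiguity between group references
    # and a version string that starts with a digit.
    new_text, count = re.subn(
        pattern, lambda m: m.group(1) + version + m.group(2), pom_text
    )
    if count == 0:
        raise ValueError("no <elasticsearch.version> property found")
    return new_text


sample = "<properties><elasticsearch.version>2.1.1</elasticsearch.version></properties>"
print(set_elasticsearch_version(sample, "2.2.0"))
```

To apply it to the real file, read pom.xml as UTF-8 text, pass the contents through the function, and write the result back before running mvn package.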
Test: use the same request as above, but replace the analyzer with ik.
Response:
{ "tokens": [ { "token": "聯想", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0 }, { "token": "全球", "start_offset": 3, "end_offset": 5, "type": "CN_WORD", "position": 1 }, { "token": "最大", "start_offset": 5, "end_offset": 7, "type": "CN_WORD", "position": 2 }, { "token": "筆記本", "start_offset": 8, "end_offset": 11, "type": "CN_WORD", "position": 3 }, { "token": "筆記", "start_offset": 8, "end_offset": 10, "type": "CN_WORD", "position": 4 }, { "token": "筆", "start_offset": 8, "end_offset": 9, "type": "CN_WORD", "position": 5 }, { "token": "記", "start_offset": 9, "end_offset": 10, "type": "CN_CHAR", "position": 6 }, { "token": "本廠", "start_offset": 10, "end_offset": 12, "type": "CN_WORD", "position": 7 }, { "token": "廠商", "start_offset": 11, "end_offset": 13, "type": "CN_WORD", "position": 8 } ] }
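As you can see, the two analyzers produce noticeably different results: ik drops 是 and 的, merges 最大 into one token, and emits overlapping candidates such as 筆記本, 筆記 and 本廠, whereas smartcn returns a single non-overlapping segmentation.
To extend the dictionary, add the words you need to mydict.dic under config\ik\custom, then restart Elasticsearch. Note that the file must be encoded as UTF-8 without a BOM.
For example, after adding the word 賽克藍德, run the query again:
Request: POST http://127.0.0.1:9200/_analyze/
Body: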
{ "analyzer": "ik", "text": "賽克藍德是一家數據安全公司" }
Response:
{ "tokens": [ { "token": "賽克藍德", "start_offset": 0, "end_offset": 4, "type": "CN_WORD", "position": 0 }, { "token": "克", "start_offset": 1, "end_offset": 2, "type": "CN_WORD", "position": 1 }, { "token": "藍", "start_offset": 2, "end_offset": 3, "type": "CN_WORD", "position": 2 }, { "token": "德", "start_offset": 3, "end_offset": 4, "type": "CN_CHAR", "position": 3 }, { "token": "一家", "start_offset": 5, "end_offset": 7, "type": "CN_WORD", "position": 4 }, { "token": "一", "start_offset": 5, "end_offset": 6, "type": "TYPE_CNUM", "position": 5 }, { "token": "家", "start_offset": 6, "end_offset": 7, "type": "COUNT", "position": 6 }, { "token": "數據", "start_offset": 7, "end_offset": 9, "type": "CN_WORD", "position": 7 }, { "token": "安全", "start_offset": 9, "end_offset": 11, "type": "CN_WORD", "position": 8 }, { "token": "公司", "start_offset": 11, "end_offset": 13, "type": "CN_WORD", "position": 9 } ] }
The result above shows that 賽克藍德 is now recognized as a word.
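The encoding requirement for the custom dictionary is easy to get wrong with some text editors. The following Python sketch appends words to a dictionary file as UTF-8 without a BOM and checks for a stray BOM; it assumes the file holds one word per line, and the helper names `add_words` and `starts_with_bom` are illustrative.

```python
import codecs
import os


def add_words(dict_path, words):
    """Append one word per line, encoded as UTF-8 without a BOM."""
    needs_newline = False
    if os.path.exists(dict_path) and os.path.getsize(dict_path) > 0:
        with open(dict_path, "rb") as f:
            f.seek(-1, os.SEEK_END)
            needs_newline = f.read(1) != b"\n"
    # The "utf-8" codec never writes a BOM; "utf-8-sig" would, so it is avoided.
    with open(dict_path, "a", encoding="utf-8", newline="\n") as f:
        if needs_newline:
            f.write("\n")  # keep the new word off the existing last line
        for word in words:
            f.write(word + "\n")


def starts_with_bom(dict_path):
    """Return True if the file begins with the UTF-8 byte order mark."""
    with open(dict_path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8
```

For example, `add_words(r"config\ik\custom\mydict.dic", ["賽克藍德"])` followed by a restart of Elasticsearch corresponds to the manual step described above.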
賽克藍德 (secisland) will continue to analyze the features of the latest Elasticsearch releases, so stay tuned. You are also welcome to follow the secisland official account.