Elasticsearch version: 7.3
Installing the Chinese word-segmentation plugin (IK)
The plugin version must match your Elasticsearch version.
Downloads for each plugin version:
https://github.com/medcl/elasticsearch-analysis-ik/releases
Install it with Elasticsearch's bundled plugin script:
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.3.0/elasticsearch-analysis-ik-7.3.0.zip
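You can verify the installation by listing the installed plugins; the output should include analysis-ik:

./bin/elasticsearch-plugin list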
The plugin's JAR files are installed under elasticsearch-7.3.0/plugins/analysis-ik.
The plugin's configuration files live in elasticsearch-7.3.0/config/analysis-ik, which also holds the bundled dictionaries. If you want to extend them with custom dictionaries for your own business domain, edit the IKAnalyzer.cfg.xml file in that directory.
For example:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extended configuration</comment>
    <!-- Users can configure their own extension dictionaries here -->
    <entry key="ext_dict">custom/mydict.dic</entry>
    <!-- Users can configure their own extension stop-word dictionaries here -->
    <entry key="ext_stopwords">custom/ext_stopword.dic</entry>
    <!-- Users can configure a remote extension dictionary here -->
    <entry key="remote_ext_dict">http://10.0.11.1:10002/elasticsearch/myDict</entry>
    <!-- Users can configure a remote extension stop-word dictionary here -->
    <entry key="remote_ext_stopwords">http://10.0.11.1:10002/elasticsearch/stopWordDict</entry>
</properties>
Extension dictionaries can be kept locally or served from a remote server.
The custom directory sits next to IKAnalyzer.cfg.xml; note that extension dictionary files must be plain text in UTF-8 encoding.
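A dictionary file is simply one term per line. For instance, a custom/mydict.dic that teaches IK to treat these phrases as single terms might look like this (illustrative content only):

洛杉磯領事館
亞裔男子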
When the dictionary is hosted remotely, updates take effect without restarting Elasticsearch, but the HTTP response must be set up correctly:
The HTTP endpoint must return two response headers, Last-Modified and ETag, both strings. Whenever either value changes, the plugin fetches the word list again and updates its dictionary. The response body must contain one word per line, using \n as the line separator.
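A quick way to check that the endpoint behaves as required is a HEAD request (the URL below is the example address from the configuration above):

curl -I http://10.0.11.1:10002/elasticsearch/myDict

The response should carry Last-Modified and ETag headers whose values change whenever the dictionary does (a change in either one is enough to trigger a refresh).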
After modifying IKAnalyzer.cfg.xml itself, the Elasticsearch service must be restarted.
// Create the index
PUT /full_text_test

// Add the mapping
POST /full_text_test/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_smart"
    }
  }
}

// Index a document
POST /full_text_test/_doc/1
{
  "content": "中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"
}
Testing the segmentation
ik_max_word: splits the text at the finest granularity, emitting every plausible word.
ik_smart: splits the text at the coarsest granularity, emitting each word only once.
A common pattern, used in the mapping above, is to index with ik_max_word for recall and search with ik_smart for precision.
POST /full_text_test/_analyze
{
  "text": ["中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"],
  "tokenizer": "ik_max_word"
}

Result:

{
  "tokens" : [
    { "token" : "中國", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0 },
    { "token" : "駐", "start_offset" : 2, "end_offset" : 3, "type" : "CN_CHAR", "position" : 1 },
    { "token" : "洛杉磯", "start_offset" : 3, "end_offset" : 6, "type" : "CN_WORD", "position" : 2 },
    { "token" : "領事館", "start_offset" : 6, "end_offset" : 9, "type" : "CN_WORD", "position" : 3 },
    { "token" : "領事", "start_offset" : 6, "end_offset" : 8, "type" : "CN_WORD", "position" : 4 },
    { "token" : "館", "start_offset" : 8, "end_offset" : 9, "type" : "CN_CHAR", "position" : 5 },
    { "token" : "遭", "start_offset" : 9, "end_offset" : 10, "type" : "CN_CHAR", "position" : 6 },
    { "token" : "亞裔", "start_offset" : 10, "end_offset" : 12, "type" : "CN_WORD", "position" : 7 },
    { "token" : "男子", "start_offset" : 12, "end_offset" : 14, "type" : "CN_WORD", "position" : 8 },
    { "token" : "子槍", "start_offset" : 13, "end_offset" : 15, "type" : "CN_WORD", "position" : 9 },
    { "token" : "槍擊", "start_offset" : 14, "end_offset" : 16, "type" : "CN_WORD", "position" : 10 },
    { "token" : "嫌犯", "start_offset" : 17, "end_offset" : 19, "type" : "CN_WORD", "position" : 11 },
    { "token" : "已", "start_offset" : 19, "end_offset" : 20, "type" : "CN_CHAR", "position" : 12 },
    { "token" : "自首", "start_offset" : 20, "end_offset" : 22, "type" : "CN_WORD", "position" : 13 }
  ]
}
POST /full_text_test/_analyze
{
  "text": ["中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"],
  "tokenizer": "ik_smart"
}

Result:

{
  "tokens" : [
    { "token" : "中國", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0 },
    { "token" : "駐", "start_offset" : 2, "end_offset" : 3, "type" : "CN_CHAR", "position" : 1 },
    { "token" : "洛杉磯", "start_offset" : 3, "end_offset" : 6, "type" : "CN_WORD", "position" : 2 },
    { "token" : "領事館", "start_offset" : 6, "end_offset" : 9, "type" : "CN_WORD", "position" : 3 },
    { "token" : "遭", "start_offset" : 9, "end_offset" : 10, "type" : "CN_CHAR", "position" : 4 },
    { "token" : "亞裔", "start_offset" : 10, "end_offset" : 12, "type" : "CN_WORD", "position" : 5 },
    { "token" : "男子", "start_offset" : 12, "end_offset" : 14, "type" : "CN_WORD", "position" : 6 },
    { "token" : "槍擊", "start_offset" : 14, "end_offset" : 16, "type" : "CN_WORD", "position" : 7 },
    { "token" : "嫌犯", "start_offset" : 17, "end_offset" : 19, "type" : "CN_WORD", "position" : 8 },
    { "token" : "已", "start_offset" : 19, "end_offset" : 20, "type" : "CN_CHAR", "position" : 9 },
    { "token" : "自首", "start_offset" : 20, "end_offset" : 22, "type" : "CN_WORD", "position" : 10 }
  ]
}
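With the mapping above, queries against content are analyzed with ik_smart at search time. A minimal full-text query to confirm the indexed document can be found (the query string is illustrative):

POST /full_text_test/_search
{
  "query": {
    "match": {
      "content": "洛杉磯槍擊"
    }
  }
}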
Managing the word list in a database table makes it easy to extend the dictionary at any time. The endpoint below serves such a table through the remote-dictionary protocol.
/**
 * Remote dictionary endpoint for the elasticsearch ik-analysis plugin.
 * 1. The response must carry two headers, Last-Modified and ETag (both strings);
 *    when either one changes, the plugin fetches the words again and updates its dictionary.
 * 2. The response body is one word per line, separated by \n.
 */
@RequestMapping("myDict")
public String myDict(HttpServletResponse response) {
    // Look up the current dictionary version in the database
    String version = esDictVersionMapper.selectById(1).getVersion();
    // Return the version as the Last-Modified header so the plugin can detect changes
    response.setHeader("Last-Modified", version);
    StringBuilder sb = new StringBuilder();
    // Dump every word in the MySQL extension dictionary table, one per line
    esDictMapper.selectList(null).forEach(item -> sb.append(item.getWord()).append("\n"));
    return sb.toString();
}
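To add words, insert rows into the dictionary table and update the version row: the next time the plugin polls the remote URL (roughly once a minute by default), it sees a changed Last-Modified value and reloads the word list without a restart. Returning the version through ETag instead would work equally well, since a change in either header triggers a refresh.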
Common problems
Problem 1: "analyzer [ik_max_word] not found for field [content]"
Solution: install the IK plugin on every Elasticsearch node in the cluster; the error goes away once all nodes have it.