Elasticsearch Chinese Analysis Plugin: IK Analyzer

Elasticsearch version: 7.3

Installing the Chinese analysis plugin

The plugin version must match your Elasticsearch version.

Downloads for every plugin release:

https://github.com/medcl/elasticsearch-analysis-ik/releases

Install it with Elasticsearch's bundled plugin script:

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.3.0/elasticsearch-analysis-ik-7.3.0.zip
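Once the command finishes, the same script can confirm that the plugin was registered:

./bin/elasticsearch-plugin list

The output should include analysis-ik.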

The plugin JARs are installed under elasticsearch-7.3.0/plugins/analysis-ik.

The plugin's configuration files live in elasticsearch-7.3.0/config/analysis-ik, alongside the bundled dictionaries. To extend them with custom dictionaries for your own domain, edit the IKAnalyzer.cfg.xml file in that directory.

For example:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
        <comment>IK Analyzer 擴展配置</comment>
        <!--用戶能夠在這裏配置本身的擴展字典 -->
        <entry key="ext_dict">custom/mydict.dic;</entry>
         <!--用戶能夠在這裏配置本身的擴展中止詞字典-->
        <entry key="ext_stopwords">custom/ext_stopword.dic</entry>
        <!--用戶能夠在這裏配置遠程擴展字典 -->
        <entry key="remote_ext_dict">http://10.0.11.1:10002/elasticsearch/myDict</entry>
        <!--用戶能夠在這裏配置遠程擴展中止詞字典-->
        <entry key="remote_ext_stopwords">http://10.0.11.1:10002/elasticsearch/stopWordDict</entry>
</properties>

Extension dictionaries can be configured locally or hosted on a remote server.

The custom directory sits in the same directory as IKAnalyzer.cfg.xml; note that extension dictionary files must be plain text in UTF-8 encoding.
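For illustration, a local dictionary such as custom/mydict.dic (the path configured under ext_dict above) is nothing more than one word per line; the entries below are made-up samples:

大數據
雲計算
人工智能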

With a remote dictionary, updates take effect without a restart, but the HTTP endpoint that serves it must follow two rules (a quick check with curl follows the list):

  1. The HTTP response must return two headers, Last-Modified and ETag, both plain strings. Whenever either value changes, the plugin fetches the new word list and reloads its dictionary.

  2. The response body contains one word per line, using \n as the line separator.
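A quick way to sanity-check a remote endpoint is to request it with curl (the URL here is the remote_ext_dict value from the sample configuration above):

curl -i http://10.0.11.1:10002/elasticsearch/myDict

With -i, curl prints the response headers; Last-Modified and ETag should both be present, and the body should list one word per line.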

After modifying IKAnalyzer.cfg.xml itself, the service must be restarted.

// Create the index
PUT /full_text_test

// Add the mapping
POST /full_text_test/_mapping
{
  "properties":{
    "content":{
      "type":"text",
      "analyzer":"ik_max_word",
      "search_analyzer":"ik_smart"
    }
  }
}

// Index a document
POST /full_text_test/_doc/1
{
  "content":"中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"
}
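As a quick sanity check of the mapping, a match query should find this document: the query string is analyzed with ik_smart at search time, and 槍擊 appears in the output of both analyzers below, so this minimal search is expected to hit:

// Full-text search; the query text goes through ik_smart
POST /full_text_test/_search
{
  "query": {
    "match": {
      "content": "槍擊"
    }
  }
}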

Testing the analyzers

ik_max_word: splits the text at the finest granularity

ik_smart: splits the text at the coarsest granularity

POST /full_text_test/_analyze
{
  "text": ["中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"],
  "tokenizer": "ik_max_word"
}

Result:

{
  "tokens" : [
    {
      "token" : "中國",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "駐",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "洛杉磯",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "領事館",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "領事",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "館",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "CN_CHAR",
      "position" : 5
    },
    {
      "token" : "遭",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "CN_CHAR",
      "position" : 6
    },
    {
      "token" : "亞裔",
      "start_offset" : 10,
      "end_offset" : 12,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "男子",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "CN_WORD",
      "position" : 8
    },
    {
      "token" : "子槍",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "CN_WORD",
      "position" : 9
    },
    {
      "token" : "槍擊",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "CN_WORD",
      "position" : 10
    },
    {
      "token" : "嫌犯",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "CN_WORD",
      "position" : 11
    },
    {
      "token" : "已",
      "start_offset" : 19,
      "end_offset" : 20,
      "type" : "CN_CHAR",
      "position" : 12
    },
    {
      "token" : "自首",
      "start_offset" : 20,
      "end_offset" : 22,
      "type" : "CN_WORD",
      "position" : 13
    }
  ]
}

POST /full_text_test/_analyze
{
  "text": ["中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"],
  "tokenizer": "ik_smart"
}

Result:

{
  "tokens" : [
    {
      "token" : "中國",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "駐",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "洛杉磯",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "領事館",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "遭",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "亞裔",
      "start_offset" : 10,
      "end_offset" : 12,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "男子",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "槍擊",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "嫌犯",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "CN_WORD",
      "position" : 8
    },
    {
      "token" : "已",
      "start_offset" : 19,
      "end_offset" : 20,
      "type" : "CN_CHAR",
      "position" : 9
    },
    {
      "token" : "自首",
      "start_offset" : 20,
      "end_offset" : 22,
      "type" : "CN_WORD",
      "position" : 10
    }
  ]
}

Finally, let's implement a dictionary backed by a database table, so the word list can be extended at any time:

/**
 * Remote dictionary endpoint for elasticsearch-analysis-ik.
 * 1. The response must carry Last-Modified and ETag headers (both strings);
 *    when either value changes, the plugin re-fetches the word list.
 * 2. The response body is one word per line, separated by \n.
 */
@RequestMapping("myDict")
public String myDict(HttpServletResponse response) {
    // Look up the current dictionary version in the database
    String version = esDictVersionMapper.selectById(1).getVersion();
    // Return the version as the Last-Modified header so the plugin can detect changes
    response.setHeader("Last-Modified", version);
    StringBuilder sb = new StringBuilder();
    // Read every word in the extension dictionary table, one per line
    esDictMapper.selectList(null).forEach(item -> sb.append(item.getWord()).append("\n"));
    return sb.toString();
}
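Note that the plugin only re-fetches when the Last-Modified (or ETag) value changes, so every write to the dictionary table must also bump the version row. A minimal sketch, reusing the mappers above; the EsDict and EsDictVersion entities and their accessors are hypothetical:

/**
 * Add a word to the extension dictionary and bump the version
 * so the IK plugin picks it up on its next poll.
 */
@Transactional
@RequestMapping("addWord")
public String addWord(String word) {
    // Hypothetical entity: one row per dictionary word
    esDictMapper.insert(new EsDict(word));
    // Bump the version so the Last-Modified header changes
    EsDictVersion v = esDictVersionMapper.selectById(1);
    v.setVersion(String.valueOf(Long.parseLong(v.getVersion()) + 1));
    esDictVersionMapper.updateById(v);
    return "ok";
}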

Common problems
Problem 1: "analyzer [ik_max_word] not found for field [content]"
Fix: install the IK plugin on every Elasticsearch node; once all nodes have it, the error goes away.
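To see which nodes actually have the plugin, the _cat API lists installed plugins per node:

GET /_cat/plugins?v

Every node should report an analysis-ik row; a node without it cannot resolve the ik_max_word analyzer.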

References

https://github.com/medcl/elasticsearch-analysis-ik
