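Installing the IK Chinese Analysis Plugin on Elasticsearch 2.3.3

The first thing a search engine does with a search request is tokenize the query text. As with other search engines, the default standard analyzer in Elasticsearch 2.3.3 is not well suited to Chinese. The Chinese analysis plugin we commonly use is the IK Analysis analyzer, and this article walks through installing it.

Before installing IK, let's look at what the standard analyzer produces.

Start the ES instance installed earlier and enter the following URL in the browser's address bar: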

http://192.168.133.134:9200/hotel/_analyze?analyzer=standard&text=58碼農,我幫碼農,我們爲程序員的匠心精神服務!

The tokenization result looks like this:

{
    "tokens": [
        {
            "token": "58",
            "start_offset": 0,
            "end_offset": 2,
            "type": "<NUM>",
            "position": 0
        },
        {
            "token": "碼",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "農",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "我",
            "start_offset": 5,
            "end_offset": 6,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "幫",
            "start_offset": 6,
            "end_offset": 7,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        },
        {
            "token": "碼",
            "start_offset": 7,
            "end_offset": 8,
            "type": "<IDEOGRAPHIC>",
            "position": 5
        },
        {
            "token": "農",
            "start_offset": 8,
            "end_offset": 9,
            "type": "<IDEOGRAPHIC>",
            "position": 6
        },
        {
            "token": "我",
            "start_offset": 10,
            "end_offset": 11,
            "type": "<IDEOGRAPHIC>",
            "position": 7
        },
        {
            "token": "們",
            "start_offset": 11,
            "end_offset": 12,
            "type": "<IDEOGRAPHIC>",
            "position": 8
        },
        {
            "token": "爲",
            "start_offset": 12,
            "end_offset": 13,
            "type": "<IDEOGRAPHIC>",
            "position": 9
        },
        {
            "token": "程",
            "start_offset": 13,
            "end_offset": 14,
            "type": "<IDEOGRAPHIC>",
            "position": 10
        },
        {
            "token": "序",
            "start_offset": 14,
            "end_offset": 15,
            "type": "<IDEOGRAPHIC>",
            "position": 11
        },
        {
            "token": "員",
            "start_offset": 15,
            "end_offset": 16,
            "type": "<IDEOGRAPHIC>",
            "position": 12
        },
        {
            "token": "的",
            "start_offset": 16,
            "end_offset": 17,
            "type": "<IDEOGRAPHIC>",
            "position": 13
        },
        {
            "token": "匠",
            "start_offset": 17,
            "end_offset": 18,
            "type": "<IDEOGRAPHIC>",
            "position": 14
        },
        {
            "token": "心",
            "start_offset": 18,
            "end_offset": 19,
            "type": "<IDEOGRAPHIC>",
            "position": 15
        },
        {
            "token": "精",
            "start_offset": 19,
            "end_offset": 20,
            "type": "<IDEOGRAPHIC>",
            "position": 16
        },
        {
            "token": "神",
            "start_offset": 20,
            "end_offset": 21,
            "type": "<IDEOGRAPHIC>",
            "position": 17
        },
        {
            "token": "服",
            "start_offset": 21,
            "end_offset": 22,
            "type": "<IDEOGRAPHIC>",
            "position": 18
        },
        {
            "token": "務",
            "start_offset": 22,
            "end_offset": 23,
            "type": "<IDEOGRAPHIC>",
            "position": 19
        }
    ]
}

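As we can see, the text is split essentially one character at a time, and no characters are grouped into words. We will look at the result again after installing IK.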
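The same request can also be sent from a shell, which avoids relying on the browser to percent-encode the Chinese text. The following is only a sketch: it assumes curl is installed and reuses the host and index from the URL above, and the function names es_analyze and list_tokens are made up for illustration.

```shell
# Assumed defaults, taken from the URL used in this article.
ES_HOST="${ES_HOST:-http://192.168.133.134:9200}"
ES_INDEX="${ES_INDEX:-hotel}"

# es_analyze ANALYZER TEXT
# Sends a GET request to <index>/_analyze. With -G, curl appends the
# --data-urlencode values to the URL as a percent-encoded query string.
es_analyze() {
    curl -s -G "$ES_HOST/$ES_INDEX/_analyze" \
        --data-urlencode "analyzer=$1" \
        --data-urlencode "text=$2"
}

# list_tokens
# Reads an _analyze JSON response on stdin and prints one token per line.
# This is a plain text match on the "token" fields, not a real JSON parser.
list_tokens() {
    grep -o '"token" *: *"[^"]*"' | sed 's/^"token" *: *"//; s/"$//'
}

# Example (needs a running ES node):
# es_analyze standard '58碼農,我幫碼農,我們爲程序員的匠心精神服務!' | list_tokens
```

Printing one token per line makes it easier to compare the output of two analyzers side by side than reading the full JSON.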

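Installing the IK Analysis plugin is actually simple, but because in most cases it has to be built from source, many people fail to get it installed. Below I describe the from-source installation.

I. Installing Maven

IK Analysis is written in Java, so building it from source requires a Maven environment. Let's start with installing Maven.

1. Get the Maven package.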

wget http://mirror.bit.edu.cn/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz

This downloads the package. I put Maven under /usr/local.

2. Extract

tar -xvf apache-maven-3.3.9-bin.tar.gz

3. Set the environment variables

vi /etc/profile

Then paste the following three lines at the end of the file.

MAVEN_HOME=/usr/local/apache-maven-3.3.9
export MAVEN_HOME
export PATH=${PATH}:${MAVEN_HOME}/bin

After saving, reload the environment variables:

source /etc/profile
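Instead of editing /etc/profile by hand, the same three lines can be written to a separate file. This is a minimal sketch, assuming a distribution whose /etc/profile sources /etc/profile.d/*.sh (CentOS does). The function name write_maven_profile and the file name maven.sh are illustrative only.

```shell
# write_maven_profile FILE [MAVEN_HOME]
# Writes the three lines from this article to FILE.
# MAVEN_HOME defaults to the install location used above.
write_maven_profile() {
    target="$1"
    home="${2:-/usr/local/apache-maven-3.3.9}"
    # \${...} is escaped so it is written literally and only expanded
    # later, when the profile file is sourced.
    cat > "$target" <<EOF
MAVEN_HOME=$home
export MAVEN_HOME
export PATH=\${PATH}:\${MAVEN_HOME}/bin
EOF
}

# Example (as root):
# write_maven_profile /etc/profile.d/maven.sh && . /etc/profile.d/maven.sh
```

Keeping the settings in their own file makes them easy to remove or change when Maven is upgraded.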

4. Verify

mvn -version

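If the command prints the Maven version information, Maven has been installed successfully.

II. Installing git

Installing it with yum is enough: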

yum install git
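Before moving on to the build, it can help to confirm that every tool the remaining steps rely on is on the PATH. A small sketch follows. The helper name require_cmd is made up, and the list of commands in the example is my reading of what the later steps use (Maven also needs a JDK).

```shell
# require_cmd NAME...
# Prints each command that cannot be found on PATH to stderr and
# returns non-zero if any of them is missing.
require_cmd() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example:
# require_cmd java mvn git unzip && mvn -version
```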

III. Downloading the IK source

git clone https://github.com/medcl/elasticsearch-analysis-ik

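I put it under /usr/es/ik.

IV. Compiling and packaging

This step downloads many dependency packages, so it takes some time. The commands are as follows.

Change into the elasticsearch-analysis-ik directory and run the following command: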

mvn clean

After the clean, run the compile command. This one takes longer.

mvn compile

Finally, run the package command:

mvn package

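Once packaging finishes, the built package can be found under the target directory (in target/releases, as used in the next step).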
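The three Maven invocations can also be combined into one, because Maven's package phase already runs the compile phase. The following is a sketch that assumes the clone location used above. The function names build_ik and find_release_zip are illustrative.

```shell
# Assumed clone location, taken from the path used in this article.
IK_SRC="${IK_SRC:-/usr/es/ik/elasticsearch-analysis-ik}"

# If the master branch has moved on to a newer Elasticsearch version, a
# release tag matching ES 2.3.3 may need to be checked out before building.
# `git tag` lists the available tags; the tag name below is an assumption.
# (cd "$IK_SRC" && git checkout v1.9.3)

# build_ik
# Cleans and packages in a single Maven run, inside a subshell so the
# caller's working directory is left unchanged.
build_ik() {
    (cd "$IK_SRC" && mvn clean package)
}

# find_release_zip DIR
# Prints the plugin zip(s) found under DIR/target/releases and returns
# non-zero if there are none.
find_release_zip() {
    ls "$1"/target/releases/elasticsearch-analysis-ik-*.zip 2>/dev/null
}

# Example:
# build_ik && find_release_zip "$IK_SRC"
```

Locating the zip with a glob avoids hard-coding the version number, which changes with the checked-out source.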

V. Copying and extracting elasticsearch-analysis-ik-1.9.3.zip

Run the following command:

unzip  /usr/es/ik/elasticsearch-analysis-ik/target/releases/elasticsearch-analysis-ik-1.9.3.zip -d  /usr/es/plugins/ik

After extraction, the unpacked files are visible in /usr/es/plugins/ik.
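A quick sanity check of the unpacked directory can catch a wrong unzip target before ES is restarted. Elasticsearch 2.x expects a plugin-descriptor.properties file in each plugin directory. The sketch below additionally assumes the jar files sit at the top level of the archive, and the function name check_plugin_dir is made up.

```shell
# check_plugin_dir DIR
# Succeeds only if DIR contains a plugin descriptor and at least one jar.
check_plugin_dir() {
    dir="$1"
    if [ ! -f "$dir/plugin-descriptor.properties" ]; then
        echo "no plugin-descriptor.properties in $dir" >&2
        return 1
    fi
    # Assumption: the plugin's jar files are at the top level of DIR.
    ls "$dir"/*.jar >/dev/null 2>&1 || {
        echo "no jar files in $dir" >&2
        return 1
    }
}

# Example:
# check_plugin_dir /usr/es/plugins/ik && echo "plugin directory looks fine"
```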

VI. Restarting

Restart ES. The startup log shows that the IK analyzer has been loaded. Let's try out the tokenization.
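Before testing, the restart and the plugin check can also be done from a shell instead of reading the log. This is only a sketch: it assumes ES lives in /usr/es (implied by the plugin path above) and that the running node has already been stopped. The plugin name analysis-ik in the example and all function names are assumptions.

```shell
# Assumed locations, based on the paths and URLs used in this article.
ES_HOME="${ES_HOME:-/usr/es}"
ES_HOST="${ES_HOST:-http://192.168.133.134:9200}"

# start_es
# Starts Elasticsearch in the background (-d daemonizes in ES 2.x).
# Run it as a non-root user; ES 2.x refuses to start as root by default.
start_es() {
    "$ES_HOME/bin/elasticsearch" -d
}

# list_plugins
# Asks the cat API which plugins each node has loaded.
list_plugins() {
    curl -s "$ES_HOST/_cat/plugins?v"
}

# has_plugin NAME
# Reads `_cat/plugins` output on stdin and succeeds if NAME appears in it.
has_plugin() {
    grep -q -w -- "$1"
}

# Example (needs a running ES node):
# start_es && sleep 15 && list_plugins | has_plugin analysis-ik && echo "IK loaded"
```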

Enter the following in the browser's address bar:

http://192.168.133.134:9200/hotel/_analyze?analyzer=ik&text=58碼農,我幫碼農,我們爲程序員的匠心精神服務!

Compared with the earlier request, the only difference is the analyzer: earlier we used standard, and this time we use ik. The result looks like this:

{
    "tokens": [
        {
            "token": "58",
            "start_offset": 0,
            "end_offset": 2,
            "type": "ARABIC",
            "position": 0
        },
        {
            "token": "碼",
            "start_offset": 2,
            "end_offset": 3,
            "type": "COUNT",
            "position": 1
        },
        {
            "token": "農",
            "start_offset": 3,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "我",
            "start_offset": 5,
            "end_offset": 6,
            "type": "CN_CHAR",
            "position": 3
        },
        {
            "token": "幫",
            "start_offset": 6,
            "end_offset": 7,
            "type": "CN_CHAR",
            "position": 4
        },
        {
            "token": "碼",
            "start_offset": 7,
            "end_offset": 8,
            "type": "CN_CHAR",
            "position": 5
        },
        {
            "token": "農",
            "start_offset": 8,
            "end_offset": 9,
            "type": "CN_WORD",
            "position": 6
        },
        {
            "token": "我們",
            "start_offset": 10,
            "end_offset": 12,
            "type": "CN_WORD",
            "position": 7
        },
        {
            "token": "爲",
            "start_offset": 12,
            "end_offset": 13,
            "type": "CN_CHAR",
            "position": 8
        },
        {
            "token": "程序員",
            "start_offset": 13,
            "end_offset": 16,
            "type": "CN_WORD",
            "position": 9
        },
        {
            "token": "程序",
            "start_offset": 13,
            "end_offset": 15,
            "type": "CN_WORD",
            "position": 10
        },
        {
            "token": "序",
            "start_offset": 14,
            "end_offset": 15,
            "type": "CN_WORD",
            "position": 11
        },
        {
            "token": "員",
            "start_offset": 15,
            "end_offset": 16,
            "type": "CN_CHAR",
            "position": 12
        },
        {
            "token": "匠心",
            "start_offset": 17,
            "end_offset": 19,
            "type": "CN_WORD",
            "position": 13
        },
        {
            "token": "匠",
            "start_offset": 17,
            "end_offset": 18,
            "type": "CN_WORD",
            "position": 14
        },
        {
            "token": "心",
            "start_offset": 18,
            "end_offset": 19,
            "type": "CN_CHAR",
            "position": 15
        },
        {
            "token": "精神",
            "start_offset": 19,
            "end_offset": 21,
            "type": "CN_WORD",
            "position": 16
        },
        {
            "token": "服務",
            "start_offset": 21,
            "end_offset": 23,
            "type": "CN_WORD",
            "position": 17
        }
    ]
}

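In the result, 程序員, 程序, 精神, and 服務 now come out as words. That completes the IK installation covered in this article.

So how could we get "58碼農" tokenized as a single word? The free online videos from 數航學院 cover this, and ES-related questions can be asked in its chat group.

For more about the IK analyzer, see also the project page (in English): https://github.com/medcl/elasticsearch-analysis-ik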
