Setting Up an Elasticsearch Chinese Word-Segmentation Search Environment

Date: 2019-11-06

Elasticsearch is a powerful search engine and a key component of the ELK stack.

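Even a good memory is no match for written notes, so this post records how I set up a test environment for Chinese word-segmentation search with Elasticsearch (ES) on Windows. The steps are as follows.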

1. Install JDK 1.8 and configure the environment variables.

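2. Download Elasticsearch 7.1.1. Releases move quickly (the latest version was already 7.2.0 when I last checked), but this environment is built on 7.1.1. Download it from https://www.elastic.co/cn/downloads/elasticsearch to get a zip archive. After extracting it, run the following command in cmd from the extracted directory to start ES: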

.\bin\elasticsearch.bat

If it starts normally, the console prints a series of log entries.

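Open http://localhost:9200/ in a browser to check that the service is reachable. Normally it returns a short JSON summary of the node, which means ES has been set up successfully.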
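The same check can be scripted. The sketch below assumes ES is listening on http://localhost:9200 without authentication; it fetches the root endpoint with Python's standard library and reduces the summary to one line. The helper names fetch_info and summarize are illustrative, not part of any ES client.

```python
import json
from urllib.request import urlopen

ES_URL = "http://localhost:9200"  # assumed default address from this walkthrough


def fetch_info(base_url=ES_URL, timeout=5):
    """GET the root endpoint and return the parsed JSON summary."""
    with urlopen(base_url + "/", timeout=timeout) as resp:
        return json.load(resp)


def summarize(info):
    """Reduce the root response to a one-line description."""
    version = info.get("version", {}).get("number", "unknown")
    return "node={} cluster={} version={}".format(
        info.get("name", "unknown"),
        info.get("cluster_name", "unknown"),
        version,
    )
```

With ES running, print(summarize(fetch_info())) should report the node name, cluster name and version; a connection error from fetch_info means the service is not reachable.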

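3. Elasticsearch offers a powerful RESTful API, but without a UI it is not very intuitive to work with. elasticsearch-head solves this problem nicely: it is a Node-based tool that connects to the ES service and provides a visual interface. For details, see https://github.com/mobz/elasticsearch-head. Installation is simple: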

git clone https://github.com/mobz/elasticsearch-head.git
cd elasticsearch-head
npm install
npm run start

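Once the service is running, open http://localhost:9100/ in a browser to see the UI. If the page cannot connect to ES, CORS probably needs to be enabled: the elasticsearch-head README says to add http.cors.enabled: true and http.cors.allow-origin: "*" to config/elasticsearch.yml and then restart ES.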

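4. The Chinese word-segmentation plugin (IK Analysis) is described in detail at https://github.com/medcl/elasticsearch-analysis-ik. Be careful to pick the right version, otherwise the installation will fail: for ES 7.1.1, choose the matching 7.1.1 release. Install it as follows: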

.\bin\elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.1.1/elasticsearch-analysis-ik-7.1.1.zip

Restart ES after the installation so that the plugin is loaded.
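To confirm that the plugin was picked up, the _cat/plugins endpoint can be queried. This is a minimal sketch under two assumptions: ES runs on http://localhost:9200, and the IK plugin registers under the component name analysis-ik. The helper names are hypothetical.

```python
import json
from urllib.request import urlopen


def list_plugins(base_url="http://localhost:9200", timeout=5):
    """Return the installed plugins as a list of dicts (name, component, version)."""
    with urlopen(base_url + "/_cat/plugins?format=json", timeout=timeout) as resp:
        return json.load(resp)


def has_plugin(plugins, component="analysis-ik"):
    """True if any node reports a plugin with the given component name."""
    # "analysis-ik" is assumed to be the name the IK plugin registers under.
    return any(p.get("component") == component for p in plugins)
```

Calling has_plugin(list_plugins()) after the restart should return True if the installation succeeded.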

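5. Test Chinese word-segmented search. First create an index by sending the following requests from Postman or elasticsearch-head (shown here as curl commands):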

# Create the index
curl -XPUT http://localhost:9200/news

Add the index mapping. Do this before indexing any documents, because the analyzer of a field cannot be changed once the field already exists in the mapping:

curl -XPOST http://localhost:9200/news/_mapping -H 'Content-Type:application/json' -d'
{
    "properties": {
        "content": {
            "type": "text",
            "analyzer": "ik_max_word",
            "search_analyzer": "ik_smart"
        }
    }
}'

# Add a document to the index
curl -XPOST http://localhost:9200/news/_create/1 -H 'Content-Type:application/json' -d'
{"content":"美國留給伊拉克的是個爛攤子嗎"}
'
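For anyone who prefers a script to Postman, here is a rough Python equivalent of the setup requests using only the standard library. It sends them in the order index, mapping, document, since ES rejects an analyzer change on a field that already exists. The address http://localhost:9200 and the names build_request, run_setup and SETUP_STEPS are assumptions made for this sketch.

```python
import json
from urllib.request import Request, urlopen

ES_URL = "http://localhost:9200"  # assumed default address

MAPPING = {
    "properties": {
        "content": {
            "type": "text",
            "analyzer": "ik_max_word",
            "search_analyzer": "ik_smart",
        }
    }
}

# (method, path, JSON body or None), in the order they should be sent
SETUP_STEPS = [
    ("PUT", "/news", None),
    ("POST", "/news/_mapping", MAPPING),
    ("POST", "/news/_create/1", {"content": "美國留給伊拉克的是個爛攤子嗎"}),
]


def build_request(method, path, body=None, base_url=ES_URL):
    """Create (but do not send) an HTTP request with an optional JSON body."""
    data = None
    headers = {}
    if body is not None:
        # Keep the Chinese text readable and encode it explicitly as UTF-8.
        data = json.dumps(body, ensure_ascii=False).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return Request(base_url + path, data=data, headers=headers, method=method)


def run_setup(steps=SETUP_STEPS):
    """Send each step in order and return the parsed JSON responses."""
    results = []
    for method, path, body in steps:
        with urlopen(build_request(method, path, body), timeout=10) as resp:
            results.append(json.load(resp))
    return results
```

run_setup() raises urllib.error.HTTPError if any step fails, for example when the news index already exists.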

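The difference between ik_max_word and ik_smart

ik_max_word: splits the text at the finest granularity. For example, 「中華人民共和國國歌」 is split into 「中華人民共和國,中華人民,中華,華人,人民共和國,人民,人,民,共和國,共和,和,國國,國歌」, exhausting every possible combination. It is suited to Term Query.

ik_smart: splits the text at the coarsest granularity. For example, 「中華人民共和國國歌」 is split into 「中華人民共和國,國歌」. It is suited to Phrase queries.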
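The contrast between fine-grained and coarse-grained splitting can be illustrated with a toy dictionary matcher. This is not IK's actual algorithm, and TOY_DICT is a hypothetical mini-dictionary invented for this example: all_matches emits every dictionary word found at any position, while longest_match does simple forward maximum matching with no overlaps.

```python
# Hypothetical mini-dictionary, only for illustration
TOY_DICT = {
    "中華人民共和國", "中華人民", "中華", "華人",
    "人民共和國", "人民", "共和國", "共和", "國歌",
}


def all_matches(text, dictionary):
    """Fine-grained: every dictionary word found at any start position."""
    tokens = []
    for start in range(len(text)):
        for end in range(len(text), start, -1):  # longest candidate first
            if text[start:end] in dictionary:
                tokens.append(text[start:end])
    return tokens


def longest_match(text, dictionary):
    """Coarse-grained: forward maximum matching, tokens never overlap."""
    tokens, start = [], 0
    while start < len(text):
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary:
                tokens.append(text[start:end])
                start = end
                break
        else:
            start += 1  # skip a character the dictionary does not cover
    return tokens
```

On 「中華人民共和國國歌」, longest_match yields two tokens while all_matches yields nine overlapping ones, which mirrors why a fine-grained index analyzer paired with a coarse search analyzer is a common combination.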

Test example:

Send a request to http://localhost:9200/_analyze using the ik_max_word analyzer. The result is as follows.

Input

{"text":"中華人民共和國人民大會堂","analyzer":"ik_max_word" }

Output

{
    "tokens": [
        {
            "token": "中華人民共和國",
            "start_offset": 0,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "中華人民",
            "start_offset": 0,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "中華",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "華人",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "人民共和國",
            "start_offset": 2,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "人民",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 5
        },
        {
            "token": "共和國",
            "start_offset": 4,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 6
        },
        {
            "token": "共和",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 7
        },
        {
            "token": "國人",
            "start_offset": 6,
            "end_offset": 8,
            "type": "CN_WORD",
            "position": 8
        },
        {
            "token": "人民大會堂",
            "start_offset": 7,
            "end_offset": 12,
            "type": "CN_WORD",
            "position": 9
        },
        {
            "token": "人民大會",
            "start_offset": 7,
            "end_offset": 11,
            "type": "CN_WORD",
            "position": 10
        },
        {
            "token": "人民",
            "start_offset": 7,
            "end_offset": 9,
            "type": "CN_WORD",
            "position": 11
        },
        {
            "token": "大會堂",
            "start_offset": 9,
            "end_offset": 12,
            "type": "CN_WORD",
            "position": 12
        },
        {
            "token": "大會",
            "start_offset": 9,
            "end_offset": 11,
            "type": "CN_WORD",
            "position": 13
        },
        {
            "token": "會堂",
            "start_offset": 10,
            "end_offset": 12,
            "type": "CN_WORD",
            "position": 14
        }
    ]
}

If the input is

{"text":"中華人民共和國人民大會堂","analyzer":"ik_smart" }

Output

{
    "tokens": [
        {
            "token": "中華人民共和國",
            "start_offset": 0,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "人民大會堂",
            "start_offset": 7,
            "end_offset": 12,
            "type": "CN_WORD",
            "position": 1
        }
    ]
}
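When comparing analyzers it helps to reduce an _analyze response to its token strings and to sanity-check the offsets. The helpers below are a small sketch with hypothetical names; they work on an already parsed response, and the request body matches the inputs shown above. Python indexes strings by code point, so the offset check assumes text without characters outside the Basic Multilingual Plane.

```python
def analyze_body(text, analyzer):
    """Request body for POST /_analyze."""
    return {"text": text, "analyzer": analyzer}


def token_texts(response):
    """Token strings in the order the analyzer emitted them."""
    return [t["token"] for t in response["tokens"]]


def offsets_match(text, response):
    """True if every token equals text[start_offset:end_offset]."""
    return all(
        text[t["start_offset"]:t["end_offset"]] == t["token"]
        for t in response["tokens"]
    )
```

For the ik_smart output above, token_texts returns the two tokens 中華人民共和國 and 人民大會堂, and offsets_match confirms that their offsets line up with the input text.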

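To search using the segmented terms, send a request to http://localhost:9200/news/_search. The hits below show that the test index also held other documents (IDs 5 and 6) besides the one added above.

Input: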

{
    "query" : { "match" : { "content" : "中華人民共和國國歌" }},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "content" : {}
        }
    }
}

Output:

{
    "took": 11,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 2,
            "relation": "eq"
        },
        "max_score": 1.6810182,
        "hits": [
            {
                "_index": "news",
                "_type": "_doc",
                "_id": "6",
                "_score": 1.6810182,
                "_source": {
                    "content": "中華民族國歌"
                },
                "highlight": {
                    "content": [
                        "<tag1>中華</tag1>民族<tag1>國歌</tag1>"
                    ]
                }
            },
            {
                "_index": "news",
                "_type": "_doc",
                "_id": "5",
                "_score": 0.9426802,
                "_source": {
                    "content": "人民公社"
                },
                "highlight": {
                    "content": [
                        "<tag1>人民</tag1>公社"
                    ]
                }
            }
        ]
    }
}
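A response like this is easier to scan once it is reduced to one row per hit. The following sketch builds a match query with highlighting and extracts the ID, score and highlighted fragments from a parsed response; the function names are illustrative, and only a single highlight tag pair is used for simplicity.

```python
def search_body(field, text, pre_tag="<tag1>", post_tag="</tag1>"):
    """Match query on one field with highlighting enabled for that field."""
    return {
        "query": {"match": {field: text}},
        "highlight": {
            "pre_tags": [pre_tag],
            "post_tags": [post_tag],
            "fields": {field: {}},
        },
    }


def summarize_hits(response, field):
    """Return (id, score, highlighted fragments) for every hit, in rank order."""
    rows = []
    for hit in response["hits"]["hits"]:
        fragments = hit.get("highlight", {}).get(field, [])
        rows.append((hit["_id"], hit["_score"], fragments))
    return rows
```

Applied to the output above, summarize_hits(response, "content") gives two rows, with document 6 ranked ahead of document 5.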