Kafka: Setting Up a ZK + Kafka + Spark Streaming Cluster (Part 19): Installing the IK Chinese Analyzer on ES 6.2.2

Note: the Elasticsearch version used here is 6.2.2.

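  • 1) In cluster mode, the IK analyzer must be installed on every node, and the service must be restarted after the plugin is installed. If any machine is missing the analyzer when a mapping is created, the index may go RED and will have to be deleted and rebuilt.
    •   
      Hostname       IP
      master         192.168.0.120
      slave1         192.168.0.121
      slave2         192.168.0.122
  • 2) Elasticsearch's built-in analyzers handle Chinese poorly: they split Chinese text into single characters for full-text search, which does not give the desired results. With full-text search so widespread and new words appearing so quickly on the internet, IK provides proper word segmentation and supports custom dictionaries.
  • 3) IK Analyzer is an open-source, lightweight Chinese word segmentation toolkit written in Java. Version 1.0 was released in December 2006, and it currently supports ES 6.x.
  • 4) IK ships with two analyzers:
    •   ik_max_word: splits the text at the finest granularity, producing as many words as possible
    •   ik_smart: splits the text at the coarsest granularity; a word that has already been split out is not reused by other words
  • 5) This post downloads the IK source code and builds it in Eclipse to obtain the installation package, so that if the analyzer needs changes later, you can modify the source and package it yourself.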
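Once the plugin is installed, one way to see the difference between the two analyzers is Elasticsearch's _analyze API. The following is a minimal sketch, assuming the master node from the table above is reachable on port 9200; the ES_URL variable and the helper names analyze_body and ik_analyze are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: compare ik_max_word and ik_smart through the _analyze API.
# ES_URL and both helper names are assumptions, not part of IK or Elasticsearch.
ES_URL="${ES_URL:-http://192.168.0.120:9200}"

# Build the JSON request body for a given analyzer name and text.
# No JSON escaping is done, so keep the text free of quotes and backslashes.
analyze_body() {
  printf '{"analyzer":"%s","text":"%s"}' "$1" "$2"
}

# Send the text to _analyze and print the raw JSON token list.
ik_analyze() {
  curl -s -XPOST "$ES_URL/_analyze" \
    -H 'Content-Type: application/json' \
    -d "$(analyze_body "$1" "$2")"
}
```

Running ik_analyze ik_max_word "中華人民共和國國歌" and then ik_analyze ik_smart "中華人民共和國國歌" should show the finer-grained analyzer returning more tokens than the coarse one for the same sentence.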

Step 1: Download the IK Chinese analyzer source code

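Search GitHub for ik and find "medcl/elasticsearch-analysis-ik", then open https://github.com/medcl/elasticsearch-analysis-ik/releases and choose the version you need:

Or:


As shown in the screenshot above, choose the version that matches your Elasticsearch and download the zip package.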

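Step 2: Unzip the downloaded zip package and open it in Eclipse

① Unzip elasticsearch-analysis-ik-6.2.2.zip

② Open Eclipse and import it as a Maven project

Next:

③ After the import, use "Maven build..." to build the package

A dialog pops up:

Click "Run" to start the build

When it finishes, right-click the target folder and choose Refresh, as shown: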

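Step 3: Upload the package to each ES server and unzip it to install

Upload the built package (elasticsearch-analysis-ik-6.2.2.zip) to the master server,

then run the following commands to install it: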

[spark@master ~]$ cd /opt/
[spark@master opt]$ unzip elasticsearch-analysis-ik-6.2.2.zip -d /opt/elasticsearch-6.2.2/plugins/
Archive:  elasticsearch-analysis-ik-6.2.2.zip
   creating: /opt/elasticsearch-6.2.2/plugins/elasticsearch/
  inflating: /opt/elasticsearch-6.2.2/plugins/elasticsearch/plugin-descriptor.properties  
   creating: /opt/elasticsearch-6.2.2/plugins/elasticsearch/config/
  inflating: /opt/elasticsearch-6.2.2/plugins/elasticsearch/config/extra_main.dic  
  inflating: /opt/elasticsearch-6.2.2/plugins/elasticsearch/config/extra_single_word.dic  
  inflating: /opt/elasticsearch-6.2.2/plugins/elasticsearch/config/extra_single_word_full.dic  
  inflating: /opt/elasticsearch-6.2.2/plugins/elasticsearch/config/extra_single_word_low_freq.dic  
  inflating: /opt/elasticsearch-6.2.2/plugins/elasticsearch/config/extra_stopword.dic  
  inflating: /opt/elasticsearch-6.2.2/plugins/elasticsearch/config/IKAnalyzer.cfg.xml  
  inflating: /opt/elasticsearch-6.2.2/plugins/elasticsearch/config/main.dic  
  inflating: /opt/elasticsearch-6.2.2/plugins/elasticsearch/config/preposition.dic  
  inflating: /opt/elasticsearch-6.2.2/plugins/elasticsearch/config/quantifier.dic  
  inflating: /opt/elasticsearch-6.2.2/plugins/elasticsearch/config/stopword.dic  
  inflating: /opt/elasticsearch-6.2.2/plugins/elasticsearch/config/suffix.dic  
  inflating: /opt/elasticsearch-6.2.2/plugins/elasticsearch/config/surname.dic  
  inflating: /opt/elasticsearch-6.2.2/plugins/elasticsearch/elasticsearch-analysis-ik-6.2.2.jar  
  inflating: /opt/elasticsearch-6.2.2/plugins/elasticsearch/httpclient-4.5.2.jar  
  inflating: /opt/elasticsearch-6.2.2/plugins/elasticsearch/httpcore-4.4.4.jar  
  inflating: /opt/elasticsearch-6.2.2/plugins/elasticsearch/commons-logging-1.2.jar  
  inflating: /opt/elasticsearch-6.2.2/plugins/elasticsearch/commons-codec-1.9.jar  
[spark@master opt]$ cd /opt/elasticsearch-6.2.2/plugins/
[spark@master plugins]$ mv elasticsearch/ ik/

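Install it on slave1 and slave2 in the same way; the steps are omitted here.

Once the installation is complete on all three servers (master, slave1, slave2), restart Elasticsearch to load the IK analyzer.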
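After the restart, it can be worth confirming that every node actually loaded the plugin before creating any mapping. The sketch below uses the _cat/plugins API, which lists the running plugins per node; ES_URL and the helper names plugins_url and list_ik_plugins are assumptions for illustration, and the grep pattern assumes the plugin reports itself as analysis-ik.

```shell
#!/usr/bin/env bash
# Sketch: list the nodes that report the IK plugin.
# ES_URL and the helper names are illustrative assumptions.
ES_URL="${ES_URL:-http://192.168.0.120:9200}"

# Build the _cat/plugins URL (?v adds a header row to the output).
plugins_url() {
  printf '%s/_cat/plugins?v' "$ES_URL"
}

# Print only the rows that mention the IK plugin (one row per node).
list_ik_plugins() {
  curl -s "$(plugins_url)" | grep 'analysis-ik'
}
```

With the three-node cluster described at the top, list_ik_plugins would be expected to print one row each for master, slave1 and slave2; a missing row points to a node where the plugin still needs to be installed.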

Step 4: Test

1) Delete and re-create the index:

curl -XDELETE "http://192.168.0.120:9200/index"

curl -XPUT "http://192.168.0.120:9200/index"

2) Create a mapping on the index named index (the 'content' field is analyzed with the Chinese analyzer):

curl -XPOST "http://192.168.0.120:9200/index/fulltext/_mapping" -H 'Content-Type: application/json' -d'
{
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_max_word"
            }
        }
    
}'

3) First add 4 documents:

curl -XPOST "http://192.168.0.120:9200/index/fulltext/1" -H 'Content-Type: application/json' -d'
{"content":"美國留給伊拉克的是個爛攤子嗎"}'

curl -XPOST "http://192.168.0.120:9200/index/fulltext/2" -H 'Content-Type: application/json' -d'
{"content":"公安部:各地校車將享最高路權"}'

curl -XPOST "http://192.168.0.120:9200/index/fulltext/3" -H 'Content-Type: application/json' -d'
{"content":"中韓漁警衝突調查:韓警平均天天扣1艘中國漁船"}'

curl -XPOST "http://192.168.0.120:9200/index/fulltext/4" -H 'Content-Type: application/json' -d'
{"content":"中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"}'

4) Run a search:

curl -XPOST "http://192.168.0.120:9200/index/fulltext/_search" -H 'Content-Type: application/json' -d'
{
    "query" : { "match" : { "content" : "中國" }},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "content" : {}
        }
    }
}'

Response:

{
  "took": 133,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.6489038,
    "hits": [
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "4",
        "_score": 0.6489038,
        "_source": {
          "content": "中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"
        },
        "highlight": {
          "content": [
            "<tag1>中國</tag1>駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"
          ]
        }
      },
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "content": "中韓漁警衝突調查:韓警平均天天扣1艘中國漁船"
        },
        "highlight": {
          "content": [
            "中韓漁警衝突調查:韓警平均天天扣1艘<tag1>中國</tag1>漁船"
          ]
        }
      }
    ]
  }
}

5) Add 3 more documents:

curl -XPOST "http://192.168.0.120:9200/index/fulltext/5" -H 'Content-Type: application/json' -d'
{"content":"俄偵委:俄一輛卡車渡河時翻車 致2名中國遊客遇難"}'
 
curl -XPOST "http://192.168.0.120:9200/index/fulltext/6" -H 'Content-Type: application/json' -d'
{"content":"韓國銀行面向中國留學生推出微信支付服務"}'

curl -XPOST "http://192.168.0.120:9200/index/fulltext/7" -H 'Content-Type: application/json' -d'
{"content":"印媒:中國東北「鏽帶」在困境中反擊"}'

6) Run the search again:

curl -XPOST "http://192.168.0.120:9200/index/fulltext/_search" -H 'Content-Type: application/json' -d'
{
    "query" : { "match" : { "content" : "中國" }},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "content" : {}
        }
    }
}'

Response:

{
  "took": 41,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0.6785375,
    "hits": [
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "7",
        "_score": 0.6785375,
        "_source": {
          "content": "印媒:中國東北「鏽帶」在困境中反擊"
        },
        "highlight": {
          "content": [
            "印媒:<tag1>中國</tag1>東北「鏽帶」在困境中反擊"
          ]
        }
      },
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "6",
        "_score": 0.47000363,
        "_source": {
          "content": "韓國銀行面向中國留學生推出微信支付服務"
        },
        "highlight": {
          "content": [
            "韓國銀行面向<tag1>中國</tag1>留學生推出微信支付服務"
          ]
        }
      },
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "4",
        "_score": 0.44000342,
        "_source": {
          "content": "中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"
        },
        "highlight": {
          "content": [
            "<tag1>中國</tag1>駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"
          ]
        }
      },
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "5",
        "_score": 0.2876821,
        "_source": {
          "content": "俄偵委:俄一輛卡車渡河時翻車 致2名中國遊客遇難"
        },
        "highlight": {
          "content": [
            "俄偵委:俄一輛卡車渡河時翻車 致2名<tag1>中國</tag1>遊客遇難"
          ]
        }
      },
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "content": "中韓漁警衝突調查:韓警平均天天扣1艘中國漁船"
        },
        "highlight": {
          "content": [
            "中韓漁警衝突調查:韓警平均天天扣1艘<tag1>中國</tag1>漁船"
          ]
        }
      }
    ]
  }
}

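IK supports custom dictionaries. The configuration file is IKAnalyzer.cfg.xml, found under analysis-ik in the config folder (in this installation it is in the plugin's own config directory, /opt/elasticsearch-6.2.2/plugins/ik/config), and the dictionary files sit in the same directory. Several options can be configured, including ext_dict for custom dictionaries and ext_stopwords for stop-word dictionaries.

Hot updates are also supported: set remote_ext_dict to an HTTP address that returns one word per line. Note that the document must be UTF-8 without a BOM. When the dictionary changes, it is enough to change either of the response headers Last-Modified or ETag.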

[spark@master config]$ pwd
/opt/elasticsearch-6.2.2/plugins/ik/config
[spark@master config]$ more IKAnalyzer.cfg.xml 
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
        <comment>IK Analyzer 擴展配置</comment>
        <!--用戶能夠在這裏配置本身的擴展字典 -->
        <entry key="ext_dict"></entry>
         <!--用戶能夠在這裏配置本身的擴展中止詞字典-->
        <entry key="ext_stopwords"></entry>
        <!--用戶能夠在這裏配置遠程擴展字典 -->
        <!-- <entry key="remote_ext_dict">words_location</entry> -->
        <!--用戶能夠在這裏配置遠程擴展中止詞字典-->
        <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
[spark@master config]$ 
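The two hot-update requirements mentioned above (UTF-8 without a BOM, and a changing Last-Modified or ETag header) can be checked from the shell before pointing remote_ext_dict at a server. This is a minimal sketch; the helper names has_bom and dict_headers are made up for illustration, and any dictionary URL passed to dict_headers is your own.

```shell
#!/usr/bin/env bash
# Sketch: sanity checks for a remote IK dictionary file.
# has_bom and dict_headers are illustrative helper names.

# has_bom FILE: exit 0 if FILE starts with the UTF-8 BOM bytes EF BB BF.
has_bom() {
  [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# dict_headers URL: print the Last-Modified and ETag headers
# returned by the server for the dictionary at URL.
dict_headers() {
  curl -sI "$1" | grep -iE '^(last-modified|etag):'
}
```

For example, has_bom my_dict.txt && echo "remove the BOM first" flags a file saved with a BOM, and calling dict_headers before and after editing the dictionary shows whether the server changes at least one of the two headers.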

 

References:

https://blog.csdn.net/moxiong3212/article/details/79338586

https://www.cnblogs.com/gaoxu387/p/7889626.html
