ElasticSearch Series, Part 2: Chinese Word Segmentation

Date: 2019-12-04

Tags: elasticsearch, series, Chinese word segmentation; category: log analysis


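The previous chapter, ElasticSearch Series, Part 1: Getting Started, covered Elastic's concepts, installation, and basic operations.

Some readers may now have the following questions:

What is a Chinese analyzer?
How is the analyzer installed?
How is the Chinese analyzer used?

The sections below answer them one by one.

What Is a Chinese Analyzer

The core of a search engine is the inverted index, and the inverted index is built on word segmentation. Segmentation can be understood simply as the process of cutting a complete sentence into individual words. In ES, such a word is called a term. Consider the following example: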

我愛北京天安門

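ES builds the inverted index from the terms produced by segmentation, here the four terms 我, 愛, 北京, and 天安門. This also means that a search hits this document only if it looks for one of these four terms.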
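
As a rough illustration of the idea, the following Python sketch builds a toy inverted index from terms that were segmented by hand. The function name build_inverted_index and the document id 1 are made up for this example, and the structures ES actually uses are far more elaborate.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Terms are segmented by hand here; in ES the analyzer produces them.
docs = {1: ["我", "愛", "北京", "天安門"]}
index = build_inverted_index(docs)

print(index.get("北京", []))  # "北京" is a term, so document 1 is found
print(index.get("天安", []))  # "天安" is not a term, so nothing is found
```

The second lookup comes back empty because "天安" was never emitted as a term, even though the characters appear in the sentence.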

Installing the Analyzer

First, install a Chinese analysis plugin. This article uses ik.

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.5.1/elasticsearch-analysis-ik-5.5.1.zip

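The command above installs version 5.5.1 of the plugin, which pairs with Elastic 5.5.1; the plugin version has to match the Elastic version.

After installation, the directory /elasticsearch-5.5.1/plugins contains a new analysis-ik directory.

Next, restart Elastic, and it loads the newly installed plugin automatically.

A Quick Test

Test the ik analyzer with the following command: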

curl -X GET 'http://localhost:9200/_analyze?pretty&analyzer=ik_smart' -d '我愛北京天安門'

Response:

{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "愛",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "北京",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "天安門",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 3
    }
  ]
}

If you get this result, congratulations: the ik analyzer is installed.
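
To make the start_offset and end_offset fields in the response concrete, here is a small Python sketch that checks each token against the slice of the input text it points to. The helper name offsets_match_text is hypothetical, and the token list is the response above trimmed to the three fields used.

```python
def offsets_match_text(text, tokens):
    """Return True if every token equals text[start_offset:end_offset]."""
    return all(
        text[t["start_offset"]:t["end_offset"]] == t["token"] for t in tokens
    )


text = "我愛北京天安門"
# The tokens from the response above, keeping only the fields needed here.
tokens = [
    {"token": "我", "start_offset": 0, "end_offset": 1},
    {"token": "愛", "start_offset": 1, "end_offset": 2},
    {"token": "北京", "start_offset": 2, "end_offset": 4},
    {"token": "天安門", "start_offset": 4, "end_offset": 7},
]

print(offsets_match_text(text, tokens))
```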

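How to Use the Chinese Analyzer

Concepts

First, a word on the _all field. The _all field exists for searches where you do not know which field to query. ES copies the content of every field into _all (unless you explicitly set include_in_all to false for a field) and then runs it through the analyzer. A query_string query searches _all by default, so it does not have to go through the fields one by one, which saves time.

properties defines how each specific field is analyzed:

type: the field type. Only text fields (string before 5.x) go through segmentation; numbers and the like do not need it.
store: whether the field is stored separately. false means it is not stored on its own and is extracted from _source at query time. If you query a particular field frequently, consider setting it to true.
term_vector: how term information is stored. with_positions_offsets stores the position and offsets of each term, which is useful for highlighting results.
analyzer: the analyzer used at index time.
search_analyzer: the analyzer used at search time.
include_in_all: whether the field is included in the _all field.
boost: a weight that feeds into score calculation.

Adding an Index

Next, create an Index and specify the fields that need segmentation. This step depends on your data structure; the command below applies only to this article. As a rule, every Chinese field that needs to be searched has to be configured individually.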

curl -X PUT 'localhost:9200/school' -d '
{
  "mappings": {
    "student": {
        "_all": {
            "analyzer": "ik_max_word",
            "search_analyzer": "ik_max_word",
            "term_vector": "no",
            "store": "false"
        },
      "properties": {
        "user": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word",
          "include_in_all": "true",
          "boost": 8
        },
        "desc": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word",
          "include_in_all": "true",
          "boost": 10
        }
      }
    }
  }
}'

The code above first creates an Index named school containing a Type named student. student has two fields.

user

desc

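Both fields hold Chinese and both are of type text, so a Chinese analyzer has to be specified; the default standard analyzer, which splits Chinese text into single characters, is not suitable.

In the code above, analyzer is the analyzer for the field text, and search_analyzer is the analyzer for the search terms. The ik_max_word analyzer is provided by the ik plugin and splits the text into as many terms as possible.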
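
When many Chinese fields need the same settings, the request body can be generated instead of written by hand. The Python sketch below rebuilds the body of the PUT request above; the helper names ik_text_field and school_mapping are hypothetical, and the sketch only prints the JSON rather than sending it to Elastic.

```python
import json


def ik_text_field(boost):
    """Mapping for one Chinese text field analyzed with ik_max_word."""
    return {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_max_word",
        "include_in_all": "true",
        "boost": boost,
    }


def school_mapping():
    """Request body for creating the school Index with the student Type."""
    return {
        "mappings": {
            "student": {
                "_all": {
                    "analyzer": "ik_max_word",
                    "search_analyzer": "ik_max_word",
                    "term_vector": "no",
                    "store": "false",
                },
                "properties": {
                    "user": ik_text_field(boost=8),
                    "desc": ik_text_field(boost=10),
                },
            }
        }
    }


body = school_mapping()
print(json.dumps(body, indent=2, ensure_ascii=False))
```

The printed JSON can be passed to curl with -d in place of the inline body.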

Data Operations

With the Index created, let's see it in action:

Adding Records

curl -X PUT 'localhost:9200/school/student/1' -d '
{
  "user": "許星星",
  "desc": "這是一個不可描述的姓名"
}'

curl -X PUT 'localhost:9200/school/student/2' -d '
{
  "user": "天上的星星",
  "desc": "一閃一閃亮晶晶，爸比會跳舞"
}'

curl -X PUT 'localhost:9200/school/student/3' -d '
{
  "user": "比克大魔王",
  "desc": "拿着水晶棒，亮晶晶的棒棒。"
}'

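Response (this sample shows "result": "updated" and "_version": 2 because record 3 had already been submitted once):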

{
    "_index": "school",
    "_type": "student",
    "_id": "3",
    "_version": 2,
    "result": "updated",
    "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
    },
    "created": false
}

Full-Text Search

Elastic queries are unusual: they use their own query syntax and require the GET request to carry a body.

curl 'localhost:9200/school/student/_search'  -d '
{
  "query" : { "match" : { "desc" : "晶晶" }}
}'

The code above uses a match query whose condition is that the desc field contains the term "晶晶". The response is as follows.

{
    "took": 7,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 2,
        "max_score": 2.5811603,
        "hits": [
            {
                "_index": "school",
                "_type": "student",
                "_id": "3",
                "_score": 2.5811603,
                "_source": {
                    "user": "比克大魔王",
                    "desc": "拿着水晶棒，亮晶晶的棒棒。"
                }
            },
            {
                "_index": "school",
                "_type": "student",
                "_id": "2",
                "_score": 2.5316024,
                "_source": {
                    "user": "天上的星星",
                    "desc": "一閃一閃亮晶晶，爸比會跳舞"
                }
            }
        ]
    }
}
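
For readers consuming this response in code, the interesting part is the inner hits.hits array, which is ordered by descending _score. A minimal Python sketch, assuming the response has already been parsed into a dict (trimmed here to the fields used) and using a hypothetical helper name summarize_hits:

```python
def summarize_hits(response):
    """Return (id, score, desc) tuples in the order ES returned them."""
    return [
        (hit["_id"], hit["_score"], hit["_source"]["desc"])
        for hit in response["hits"]["hits"]
    ]


# The response above, trimmed to the fields this sketch reads.
response = {
    "hits": {
        "total": 2,
        "hits": [
            {"_id": "3", "_score": 2.5811603,
             "_source": {"desc": "拿着水晶棒，亮晶晶的棒棒。"}},
            {"_id": "2", "_score": 2.5316024,
             "_source": {"desc": "一閃一閃亮晶晶，爸比會跳舞"}},
        ],
    }
}

for doc_id, score, desc in summarize_hits(response):
    print(doc_id, score, desc)
```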

Elastic returns 10 results at a time by default; the size field changes this.

curl 'localhost:9200/school/student/_search'  -d '
{
  "query" : { "match" : { "desc" : "晶晶" }},
  "size" : 1
}'

The code above returns only one result per request.

You can also use the from field to specify an offset.

curl 'localhost:9200/school/student/_search'  -d '
{
  "query" : { "match" : { "desc" : "晶晶" }},
  "size" : 1,
  "from" : 1
}'

The code above starts from position 1 (the default is position 0) and returns only one result.
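
Together, size and from give simple paging: from is the page number multiplied by the page size. The Python sketch below builds such a query body; page_query is a hypothetical helper, and pages are assumed to be numbered from 0.

```python
def page_query(field, keyword, page, page_size=10):
    """Build a match query body for a 0-based page number."""
    return {
        "query": {"match": {field: keyword}},
        "size": page_size,
        "from": page * page_size,
    }


# Page 1 with one result per page gives the same body as the request above.
body = page_query("desc", "晶晶", page=1, page_size=1)
print(body)
```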

Logical Operations

If there are multiple search keywords, Elastic treats them as an or relation.

curl 'localhost:9200/school/student/_search'  -d '
{
  "query" : { "match" : { "desc" : "水晶棒 這是" }}
}'

Response:

{
    "took": 8,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 2,
        "max_score": 5.1623206,
        "hits": [
            {
                "_index": "school",
                "_type": "student",
                "_id": "3",
                "_score": 5.1623206,
                "_source": {
                    "user": "比克大魔王",
                    "desc": "拿着水晶棒，亮晶晶的棒棒。"
                }
            },
            {
                "_index": "school",
                "_type": "student",
                "_id": "1",
                "_score": 2.5811603,
                "_source": {
                    "user": "許星星",
                    "desc": "這是一個不可描述的姓名"
                }
            }
        ]
    }
}

To require every keyword (an and search), use a bool query; the match query's operator parameter is another option.

curl 'localhost:9200/school/student/_search'  -d '
{
  "query": {
    "bool": {
      "must": [
        { "match": { "desc": "水晶棒" } },
        { "match": { "desc": "亮晶晶" } }
      ]
    }
  }
}'

Response:

{
    "took": 24,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 10.324641,
        "hits": [
            {
                "_index": "school",
                "_type": "student",
                "_id": "3",
                "_score": 10.324641,
                "_source": {
                    "user": "比克大魔王",
                    "desc": "拿着水晶棒，亮晶晶的棒棒。"
                }
            }
        ]
    }
}
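
The difference between the or and and behaviour can be pictured on a toy inverted index: or is a union of the document sets for each term, and is an intersection. The Python sketch below is only an illustration under stated assumptions. The term lists are picked by hand rather than taken from ik_max_word output, each keyword is treated as a single term, scoring is ignored, and the function names are made up.

```python
def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, terms in docs.items():
        for term in terms:
            index.setdefault(term, set()).add(doc_id)
    return index


def search_or(index, terms):
    """Documents containing at least one of the terms."""
    hits = set()
    for term in terms:
        hits |= index.get(term, set())
    return sorted(hits)


def search_and(index, terms):
    """Documents containing every one of the terms."""
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


# Hand-picked terms for the desc field of the three records (an assumption).
docs = {
    1: ["這是", "一個", "不可", "描述", "姓名"],
    2: ["一閃", "亮晶晶", "晶晶", "爸比", "跳舞"],
    3: ["拿着", "水晶棒", "水晶", "亮晶晶", "晶晶", "棒棒"],
}
index = build_index(docs)

print(search_or(index, ["水晶棒", "這是"]))     # records matching either term
print(search_and(index, ["水晶棒", "亮晶晶"]))  # records matching both terms
```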