Elasticsearch Tokenization: A Few Things Worth Knowing

Date: 2019-12-10

Preparing for today's exercise

Delete the index left over from the previous experiment

curl -XDELETE http://127.0.0.1:9200/synctest/article

output:
{"acknowledged":true}

Create a new mapping

curl -XPUT 'http://127.0.0.1:9200/servcie/_mapping/massage' -d '
{
  "massage": {
    "properties": {
      "location": { "type": "geo_point" },
      "name":     { "type": "string" },
      "age":      { "type": "integer" },
      "address":  { "type": "string" },
      "price":    { "type": "double", "index": "not_analyzed" },
      "is_open":  { "type": "boolean" }
    }
  }
}'
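Quoting a JSON body inside a shell string is easy to get wrong, so here is a hedged alternative that prepares the same request with Python's standard library. It is only a sketch: the host, the index name `servcie` and the type name `massage` are copied from the command above, `build_mapping_request` is a made-up helper name, and the request is built but never sent.

```python
import json
import urllib.request

# Field definitions copied from the curl command above (ES 2.x style types).
MAPPING = {
    "massage": {
        "properties": {
            "location": {"type": "geo_point"},
            "name": {"type": "string"},
            "age": {"type": "integer"},
            "address": {"type": "string"},
            "price": {"type": "double", "index": "not_analyzed"},
            "is_open": {"type": "boolean"},
        }
    }
}


def build_mapping_request(base_url="http://127.0.0.1:9200"):
    """Build (but do not send) the PUT request that creates the mapping."""
    url = base_url + "/servcie/_mapping/massage"
    body = json.dumps(MAPPING).encode("utf-8")
    return urllib.request.Request(url, data=body, method="PUT")


request = build_mapping_request()
print(request.get_method(), request.full_url)
# Sending is left to the caller, e.g. urllib.request.urlopen(request).
```

Keeping the mapping as a Python dict means the JSON is always syntactically valid by the time it reaches the server.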

View the newly created mapping

curl -XGET http://127.0.0.1:9200/servcie/massage/_mapping?pretty

{
  "servcie" : {
    "mappings" : {
      "massage" : {
        "properties" : {
          "address" : {
            "type" : "string"
          },
          "age" : {
            "type" : "integer"
          },
          "is_open" : {
            "type" : "boolean"
          },
          "location" : {
            "type" : "geo_point"
          },
          "name" : {
            "type" : "string"
          },
          "price" : {
            "type" : "double"
          }
        }
      }
    }
  }
}

Now for our tokenization tests

curl -XPOST 'http://127.0.0.1:9200/_analyze?pretty' -d '{"text":"波多菠蘿蜜"}'

{
  "tokens" : [ {
    "token" : "波",
    "start_offset" : 0,
    "end_offset" : 1,
    "type" : "<IDEOGRAPHIC>",
    "position" : 0
  }, {
    "token" : "多",
    "start_offset" : 1,
    "end_offset" : 2,
    "type" : "<IDEOGRAPHIC>",
    "position" : 1
  }, {
    "token" : "菠",
    "start_offset" : 2,
    "end_offset" : 3,
    "type" : "<IDEOGRAPHIC>",
    "position" : 2
  }, {
    "token" : "蘿",
    "start_offset" : 3,
    "end_offset" : 4,
    "type" : "<IDEOGRAPHIC>",
    "position" : 3
  }, {
    "token" : "蜜",
    "start_offset" : 4,
    "end_offset" : 5,
    "type" : "<IDEOGRAPHIC>",
    "position" : 4
  } ]
}

An analyzer is made up of exactly one tokenizer and zero or more token filters (optionally preceded by character filters that preprocess the raw text).

curl -XPOST 'http://127.0.0.1:9200/_analyze?pretty' -d '{"text":"abc dsf,sdsf"}'
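As a rough illustration of how an analyzer is composed (this is not Elasticsearch's actual implementation), the sketch below chains one tokenizer with a list of token filters. Every function name here is made up for the example, and the tokenizer simply splits on non-alphanumeric ASCII characters.

```python
import re


def simple_tokenizer(text):
    """Split text into tokens on any run of non-alphanumeric characters."""
    return [token for token in re.split(r"[^0-9A-Za-z]+", text) if token]


def lowercase_filter(tokens):
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


def make_stop_filter(stopwords):
    """Return a token filter that drops the given stopwords."""
    def stop_filter(tokens):
        return [token for token in tokens if token not in stopwords]
    return stop_filter


def analyze(text, tokenizer, token_filters=()):
    """Run exactly one tokenizer, then each token filter in order."""
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("abc dsf,sdsf", simple_tokenizer, [lowercase_filter]))
# ['abc', 'dsf', 'sdsf']
```

Swapping the tokenizer or adding a filter changes the resulting tokens without touching the rest of the pipeline, which is the point of the design.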

Chinese search

For Chinese search you also need a Chinese tokenizer. The one used most often is probably the IK analyzer.

Installing the IK analyzer

./bin/plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v1.9.3/elasticsearch-analysis-ik-1.9.3.zip

After restarting, list the plugins to check that it loaded successfully

curl -XGET http://localhost:9200/_cat/plugins

Marrow analysis-ik 1.9.3 j  
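If this check needs to be scripted, a small helper along these lines might do. It is a sketch that only looks for the plugin name as a whitespace-separated column in the text returned by `_cat/plugins`; `has_plugin` is a hypothetical name, and the sample string is the line shown above.

```python
def has_plugin(cat_plugins_output, plugin_name):
    """Return True if any line of the _cat/plugins text lists plugin_name."""
    for line in cat_plugins_output.splitlines():
        # Node names may contain spaces, so search all columns of the line.
        if plugin_name in line.split():
            return True
    return False


sample = "Marrow analysis-ik 1.9.3 j  \n"
print(has_plugin(sample, "analysis-ik"))  # True
```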

Testing analysis with the ik analyzer

curl -XPOST 'http://127.0.0.1:9200/_analyze?pretty' -d '{"analyzer":"ik","text":"波多菠蘿蜜"}'

{
  "tokens" : [ {
    "token" : "波",
    "start_offset" : 0,
    "end_offset" : 1,
    "type" : "CN_WORD",
    "position" : 0
  }, {
    "token" : "多",
    "start_offset" : 1,
    "end_offset" : 2,
    "type" : "CN_CHAR",
    "position" : 1
  }, {
    "token" : "菠蘿蜜",
    "start_offset" : 2,
    "end_offset" : 5,
    "type" : "CN_WORD",
    "position" : 2
  }, {
    "token" : "菠蘿",
    "start_offset" : 2,
    "end_offset" : 4,
    "type" : "CN_WORD",
    "position" : 3
  }, {
    "token" : "菠",
    "start_offset" : 2,
    "end_offset" : 3,
    "type" : "CN_WORD",
    "position" : 4
  }, {
    "token" : "蘿",
    "start_offset" : 3,
    "end_offset" : 4,
    "type" : "CN_WORD",
    "position" : 5
  }, {
    "token" : "蜜",
    "start_offset" : 4,
    "end_offset" : 5,
    "type" : "CN_WORD",
    "position" : 6
  } ]
}

You can see that 菠蘿 and 菠蘿蜜 now come out as tokens of their own.
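The overlapping tokens are easier to spot when grouped by where they start. The sketch below does that for an abbreviated copy of the response above (only the three fields it needs are kept); `tokens_by_start` is a made-up helper name.

```python
import json
from collections import defaultdict

# Abbreviated copy of the _analyze response shown above.
response_text = """
{"tokens": [
  {"token": "波", "start_offset": 0, "end_offset": 1},
  {"token": "多", "start_offset": 1, "end_offset": 2},
  {"token": "菠蘿蜜", "start_offset": 2, "end_offset": 5},
  {"token": "菠蘿", "start_offset": 2, "end_offset": 4},
  {"token": "菠", "start_offset": 2, "end_offset": 3},
  {"token": "蘿", "start_offset": 3, "end_offset": 4},
  {"token": "蜜", "start_offset": 4, "end_offset": 5}
]}
"""


def tokens_by_start(response):
    """Group token strings by their start_offset, keeping response order."""
    grouped = defaultdict(list)
    for token in response["tokens"]:
        grouped[token["start_offset"]].append(token["token"])
    return dict(grouped)


grouped = tokens_by_start(json.loads(response_text))
for start in sorted(grouped):
    print(start, grouped[start])
```

Offset 2 carries three tokens (菠蘿蜜, 菠蘿 and 菠), so a search for either the longer or the shorter word can find this text.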

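As language evolves and businesses coin their own terms, some newer words are missing from the IK analyzer's dictionary, so even queries such as match_phrase can fail to find the data. What can we do about it?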

For example, suppose we want searches for the word 「吊炸天」 to hit (it is not included in IK 1.9.3).

curl -XPOST 'http://127.0.0.1:9200/_analyze?pretty' -d '{"analyzer":"ik","text":"吊炸天天不容"}'

{
  "tokens" : [ {
    "token" : "吊",
    "start_offset" : 0,
    "end_offset" : 1,
    "type" : "CN_WORD",
    "position" : 0
  }, {
    "token" : "炸",
    "start_offset" : 1,
    "end_offset" : 2,
    "type" : "CN_CHAR",
    "position" : 1
  }, {
    "token" : "天天",
    "start_offset" : 2,
    "end_offset" : 4,
    "type" : "CN_WORD",
    "position" : 2
  }, {
    "token" : "不容",
    "start_offset" : 4,
    "end_offset" : 6,
    "type" : "CN_WORD",
    "position" : 3
  } ]
}
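To see why a phrase query can miss here, recall that a phrase query needs its terms to sit at consecutive positions in the index. The toy check below is not Elasticsearch code. It takes the indexed text to be 吊炸天天不容 with the token positions from the output above, and it assumes (the article does not show this) that the query string 吊炸天 is analyzed into the three single characters 吊, 炸, 天.

```python
def phrase_matches(indexed_tokens, query_tokens):
    """Return True if query_tokens occur at consecutive positions.

    indexed_tokens is a list of (position, token) pairs for one field.
    """
    positions = {}
    for position, token in indexed_tokens:
        positions.setdefault(token, set()).add(position)
    if not query_tokens:
        return False
    for start in positions.get(query_tokens[0], ()):
        if all(start + offset in positions.get(token, ())
               for offset, token in enumerate(query_tokens)):
            return True
    return False


# Positions and tokens for the indexed text 吊炸天天不容.
indexed = [(0, "吊"), (1, "炸"), (2, "天天"), (3, "不容")]
# ASSUMPTION: how the query text is tokenized without a dictionary entry.
query = ["吊", "炸", "天"]

print(phrase_matches(indexed, query))         # False: "天" alone was never indexed
print(phrase_matches(indexed, ["吊", "炸"]))   # True
```

The third query term has no counterpart in the index because the analyzer merged the two 天 characters into one word, so the phrase cannot line up.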

If that really is required, we have to modify IK's dictionary.

Edit the mydict.dic file under analysis-ik/config/ik/custom. This file is provided specifically for extending the vocabulary: add the new word at the end, save the file, and restart ES.
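Editing the file by hand works fine. If the step has to be repeated, a small script could append words without creating duplicates. The sketch below assumes the dictionary is a UTF-8 text file with one word per line, `add_word` is a hypothetical helper, and the demonstration writes to a temporary file rather than to the real mydict.dic.

```python
import os
import tempfile


def add_word(dict_path, word):
    """Append word to a one-word-per-line dictionary unless already present.

    Returns True if the file was changed.
    """
    text = ""
    if os.path.exists(dict_path):
        with open(dict_path, encoding="utf-8") as handle:
            text = handle.read()
    if word in (line.strip() for line in text.splitlines()):
        return False
    with open(dict_path, "a", encoding="utf-8") as handle:
        # Make sure the new entry starts on its own line.
        if text and not text.endswith("\n"):
            handle.write("\n")
        handle.write(word + "\n")
    return True


# Demonstration on a temporary file instead of the real dictionary.
demo_path = os.path.join(tempfile.mkdtemp(), "mydict.dic")
print(add_word(demo_path, "吊炸天"))  # True: the word was appended
print(add_word(demo_path, "吊炸天"))  # False: it is already there
```

As the text above says, ES still has to be restarted afterwards for the new entry to take effect.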

curl -XPOST 'http://127.0.0.1:9200/_analyze?pretty' -d '{"analyzer":"ik","text":"吊炸天天不容"}'


{
  "tokens" : [ {
    "token" : "吊炸天",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "CN_WORD",
    "position" : 0
  }, {
    "token" : "吊",
    "start_offset" : 0,
    "end_offset" : 1,
    "type" : "CN_WORD",
    "position" : 1
  }, {
    "token" : "炸",
    "start_offset" : 1,
    "end_offset" : 2,
    "type" : "CN_CHAR",
    "position" : 2
  }, {
    "token" : "天天",
    "start_offset" : 2,
    "end_offset" : 4,
    "type" : "CN_WORD",
    "position" : 3
  }, {
    "token" : "不容",
    "start_offset" : 4,
    "end_offset" : 6,
    "type" : "CN_WORD",
    "position" : 4
  } ]
}

We can see that 「吊炸天」 now comes out as a token of its own.
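To build some intuition for why one dictionary entry changes the result, here is a toy dictionary-based segmenter using forward maximum matching. This is a deliberately simpler technique than whatever IK does internally: it emits a single non-overlapping segmentation, whereas the IK output above also keeps the shorter overlapping tokens. The two-word base dictionary is a hypothetical stand-in for IK's stock dictionary.

```python
def forward_max_match(text, dictionary):
    """Greedy longest-match segmentation.

    Characters that start no dictionary word become one-character tokens.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


text = "吊炸天天不容"
base_dictionary = {"天天", "不容"}  # hypothetical stock entries

print(forward_max_match(text, base_dictionary))
# ['吊', '炸', '天天', '不容']
print(forward_max_match(text, base_dictionary | {"吊炸天"}))
# ['吊炸天', '天', '不容']
```

Without the entry, the segmenter can only fall back to single characters for 吊 and 炸. With it, the whole word is recognized as one token.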
