1. Testing Elasticsearch analyzers
Elasticsearch ships with several analyzers (see: https://www.jianshu.com/p/d57935ba514b).
Sample sentence: Set the shape to semi-transparent by calling set_trans(5)
(1) standard analyzer (the default):
set, the, shape, to, semi, transparent, by, calling, set_trans, 5
(2) simple analyzer:
set, the, shape, to, semi, transparent, by, calling, set, trans
(3) whitespace analyzer: splits on whitespace only; case, underscores, etc. are left unchanged:
Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
(4) language analyzer (language-specific, e.g. the English analyzer):
set, shape, semi, transpar, call, set_tran, 5
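The four analyzers above can be compared directly through the _analyze API. A minimal sketch, assuming an Elasticsearch 5.x node is running at localhost:9200:

```shell
# Run the same sample sentence through each analyzer;
# only the analyzer name changes between requests.
for analyzer in standard simple whitespace english; do
  echo "--- $analyzer ---"
  curl -s -H 'Content-Type: application/json' \
       "http://localhost:9200/_analyze?pretty=true" \
       -d "{\"analyzer\":\"$analyzer\",\"text\":\"Set the shape to semi-transparent by calling set_trans(5)\"}"
done
```

Each response lists the tokens that analyzer produced, matching the four result lines above.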
2. Setting the analyzer for an Elasticsearch index
The following sets the analyzer for all types in the index to simple:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": { "type": "simple" }
      }
    }
  }
}
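The same request as a curl call, plus a check that the setting took effect; a sketch assuming a local node and that my_index does not exist yet:

```shell
# Create the index with `simple` as its default analyzer.
curl -s -X PUT "http://localhost:9200/my_index" \
     -H 'Content-Type: application/json' \
     -d '{"settings":{"analysis":{"analyzer":{"default":{"type":"simple"}}}}}'

# Verify: the default analyzer should appear under
# index.analysis.analyzer in the settings output.
curl -s "http://localhost:9200/my_index/_settings?pretty=true"
```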
Standard analyzer
http://localhost:9200/_analyze?analyzer=standard&pretty=true&text=test測試
Tokenization result:
{
  "tokens" : [
    { "token" : "test", "start_offset" : 0, "end_offset" : 4, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "測", "start_offset" : 4, "end_offset" : 5, "type" : "<IDEOGRAPHIC>", "position" : 1 },
    { "token" : "試", "start_offset" : 5, "end_offset" : 6, "type" : "<IDEOGRAPHIC>", "position" : 2 }
  ]
}
Simple analyzer
http://localhost:9200/_analyze?analyzer=simple&pretty=true&text=test_測試
Result:
{
  "tokens" : [
    { "token" : "test", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 0 },
    { "token" : "測試", "start_offset" : 5, "end_offset" : 7, "type" : "word", "position" : 1 }
  ]
}
IK analyzers: ik_max_word and ik_smart
First it needs to be installed:
https://github.com/medcl/elasticsearch-analysis-ik
Download the zip package, then install it with the plugin installer. The Elasticsearch version on my machine is 5.6.10, so I installed the matching 5.6.10 release:
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.6.10/elasticsearch-analysis-ik-5.6.10.zip
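Before restarting, you can check that the plugin was picked up; `elasticsearch-plugin list` ships with Elasticsearch 5.x (run it from the Elasticsearch home directory):

```shell
# List installed plugins; `analysis-ik` should appear in the output.
./bin/elasticsearch-plugin list
```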
Then restart Elasticsearch.
Test it:
http://localhost:9200/_analyze?analyzer=ik_max_word&pretty=true&text=test_tes_te測試
Result:
{
  "tokens" : [
    { "token" : "test_tes_te", "start_offset" : 0, "end_offset" : 11, "type" : "LETTER", "position" : 0 },
    { "token" : "test", "start_offset" : 0, "end_offset" : 4, "type" : "ENGLISH", "position" : 1 },
    { "token" : "tes", "start_offset" : 5, "end_offset" : 8, "type" : "ENGLISH", "position" : 2 },
    { "token" : "te", "start_offset" : 9, "end_offset" : 11, "type" : "ENGLISH", "position" : 3 },
    { "token" : "測試", "start_offset" : 11, "end_offset" : 13, "type" : "CN_WORD", "position" : 4 }
  ]
}
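Once installed, the IK analyzers can be used like any built-in analyzer, for example as an index-wide default the same way simple was set in section 2. A sketch, where the index name ik_index is just an example:

```shell
# Create an index whose default analyzer is ik_max_word
# (the index name `ik_index` is hypothetical).
curl -s -X PUT "http://localhost:9200/ik_index" \
     -H 'Content-Type: application/json' \
     -d '{"settings":{"analysis":{"analyzer":{"default":{"type":"ik_max_word"}}}}}'
```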