Elasticsearch Tokenizers

Date: 2019-11-06

Tags: elasticsearch, tokenizer; category: log analysis

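Every analyzer, whether built-in or custom, is made up of three kinds of building blocks: character filters, tokenizers, and token filters.

The built-in analyzers pre-package these building blocks into analyzers suited to different languages and types of text.

Character filters

A character filter receives the original text as a stream of characters and can transform that stream by adding, removing, or changing characters.

For example, a character filter could convert Arabic-Indic digits (٠١٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789).

An analyzer may have zero or more character filters, which are applied in order.

(PS: this is similar to filters or interceptors in Servlets; picture a filter chain.)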
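The digit conversion described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Elasticsearch implements character filters; the names digit_char_filter and apply_char_filters are made up for this example.

```python
# Illustrative sketch of a character filter chain (not Elasticsearch code).

# Arabic-Indic digits occupy U+0660..U+0669.
ARABIC_INDIC_DIGITS = "".join(chr(0x0660 + i) for i in range(10))
DIGIT_MAP = str.maketrans(ARABIC_INDIC_DIGITS, "0123456789")


def digit_char_filter(text):
    """Replace each Arabic-Indic digit with its Latin equivalent."""
    return text.translate(DIGIT_MAP)


def apply_char_filters(text, char_filters):
    """Apply zero or more character filters in order, like a filter chain."""
    for char_filter in char_filters:
        text = char_filter(text)
    return text


# "\u0664\u0662" is the Arabic-Indic spelling of 42.
print(apply_char_filters("Room \u0664\u0662", [digit_char_filter]))
```

With an empty filter list the text passes through unchanged, which matches the "zero or more" rule.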

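Tokenizer

A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens. For example, a whitespace tokenizer breaks text into tokens whenever it sees whitespace. It would convert the text "Quick brown fox!" into [Quick, brown, fox!].

(PS: the tokenizer splits text into individual tokens, where a token is essentially a single word. A piece of text is cut into several parts, much like String.split in Java.)

The tokenizer is also responsible for recording the order or position of each term, and the start and end character offsets of the original word that the term represents. (PS: the output of tokenizing a text is an array of terms.)

An analyzer must have exactly one tokenizer.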
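To make the position and offset bookkeeping concrete, here is a minimal Python sketch of a whitespace-style tokenizer. The Token type and the function name whitespace_tokenize are assumptions for illustration; Elasticsearch's own tokenizer is implemented differently.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    term: str
    position: int      # order of the term in the stream
    start_offset: int  # index of the first character in the original text
    end_offset: int    # index one past the last character


def whitespace_tokenize(text):
    """Split on whitespace and record position and character offsets."""
    return [
        Token(match.group(), position, match.start(), match.end())
        for position, match in enumerate(re.finditer(r"\S+", text))
    ]


for token in whitespace_tokenize("Quick brown fox!"):
    print(token)
```

Slicing the original text with start_offset and end_offset gives back the term, which is what makes features such as highlighting possible.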

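Token filters

A token filter receives the token stream and may add, remove, or change tokens.

For example, a lowercase token filter converts all tokens to lowercase, a stop token filter removes common words such as "the", and a synonym token filter introduces synonyms into the token stream.

Token filters are not allowed to change the position or character offsets of each token.

An analyzer may have zero or more token filters, which are applied in order.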
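The three filters mentioned above (lowercase, stop, synonym) can be imitated with plain Python functions over a list of strings. This sketch ignores positions and offsets entirely, and the stop word and synonym data are invented for the example.

```python
# Simplified token filters working on plain strings (positions are ignored).

def lowercase_filter(tokens):
    """Change tokens: convert every token to lowercase."""
    return [token.lower() for token in tokens]


def stop_filter(tokens, stopwords):
    """Remove tokens: drop every token that is a stop word."""
    return [token for token in tokens if token not in stopwords]


def synonym_filter(tokens, synonyms):
    """Add tokens: emit each token followed by its synonyms, if any."""
    result = []
    for token in tokens:
        result.append(token)
        result.extend(synonyms.get(token, []))
    return result


tokens = ["The", "Quick", "fox"]
tokens = lowercase_filter(tokens)
tokens = stop_filter(tokens, stopwords={"the"})
tokens = synonym_filter(tokens, synonyms={"quick": ["fast"]})
print(tokens)
```

Order matters: if the stop filter ran before the lowercase filter here, "The" would not match the stop word "the" and would survive.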

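Summary and recap

An analyzer is a package made up of three parts: character filters, a tokenizer, and token filters.

An analyzer can have zero or more character filters.

An analyzer has exactly one tokenizer.

An analyzer can have zero or more token filters.

A character filter transforms characters: it receives a character stream and outputs a character stream.

A tokenizer does the tokenizing: it receives a character stream and outputs a token stream (the text is split into individual words, and those words are called tokens).

A token filter filters tokens: it receives a token stream and outputs a token stream.

So the whole job of an analyzer is to break text into individual words: text ----> characters ----> tokens

This works much like a chain of interceptors.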
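The text ----> characters ----> tokens flow can be written as one small function that takes the three kinds of building blocks as arguments. This is a conceptual sketch only; the helper names (analyze, strip_tags, whitespace_tokenizer, lowercase) are assumptions and do not correspond to Elasticsearch internals.

```python
import re


def analyze(text, char_filters, tokenizer, token_filters):
    """Run zero or more char filters, exactly one tokenizer, zero or more token filters."""
    for char_filter in char_filters:      # character stream in, character stream out
        text = char_filter(text)
    tokens = tokenizer(text)              # character stream in, token stream out
    for token_filter in token_filters:    # token stream in, token stream out
        tokens = token_filter(tokens)
    return tokens


def strip_tags(text):
    """Very rough character filter: delete anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)


def whitespace_tokenizer(text):
    return text.split()


def lowercase(tokens):
    return [token.lower() for token in tokens]


print(analyze("<b>Quick</b> brown FOX", [strip_tags], whitespace_tokenizer, [lowercase]))
```

Passing empty lists for both filter arguments leaves only the tokenizer, the one mandatory part.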

1. Testing an analyzer

The analyze API is a tool for inspecting how a piece of text is analyzed. (PS: it is similar to a query execution plan.)

curl -X POST "192.168.1.134:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "whitespace",
  "text":     "The quick brown fox."
}
'

curl -X POST "192.168.1.134:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "standard",
  "filter":  [ "lowercase", "asciifolding" ],
  "text":      "Is this déja vu?"
}
'

Output of the first request:

{
    "tokens":[
        {
            "token":"The",
            "start_offset":0,
            "end_offset":3,
            "type":"word",
            "position":0
        },
        {
            "token":"quick",
            "start_offset":4,
            "end_offset":9,
            "type":"word",
            "position":1
        },
        {
            "token":"brown",
            "start_offset":10,
            "end_offset":15,
            "type":"word",
            "position":2
        },
        {
            "token":"fox.",
            "start_offset":16,
            "end_offset":20,
            "type":"word",
            "position":3
        }
    ]
}

As you can see, the position and offsets are recorded for each term.
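For readers who prefer a script to curl, the same call could be made from Python with only the standard library, roughly as sketched below. The base URL http://localhost:9200 and the helper names build_analyze_request, summarize, and analyze are assumptions for this example; analyze() needs a reachable cluster and is therefore defined but not called.

```python
import json
import urllib.request


def build_analyze_request(base_url, body):
    """Build (but do not send) a POST request for the _analyze endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize(response_json):
    """Reduce an _analyze response to (token, position, start_offset, end_offset)."""
    return [
        (t["token"], t["position"], t["start_offset"], t["end_offset"])
        for t in response_json["tokens"]
    ]


def analyze(base_url, body):
    """Send the request and summarize the response (requires a running cluster)."""
    with urllib.request.urlopen(build_analyze_request(base_url, body)) as response:
        return summarize(json.load(response))


request = build_analyze_request(
    "http://localhost:9200",  # assumed address; adjust to your cluster
    {"analyzer": "whitespace", "text": "The quick brown fox."},
)
print(request.get_method(), request.full_url)
```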

2. Analyzer

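2.1. Configuring built-in analyzers

Built-in analyzers can be used directly without any configuration. Their defaults can be changed, though. For example, the standard analyzer can be configured to support a list of stop words: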

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_english": { 
          "type":      "standard",
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "my_text": {
          "type":     "text",
          "analyzer": "standard", 
          "fields": {
            "english": {
              "type":     "text",
              "analyzer": "std_english" 
            }
          }
        }
      }
    }
  }
}
'

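In this example, we define a std_english analyzer based on the standard analyzer and configure it to remove the predefined list of English stop words. The mapping then sets the my_text field to use the standard analyzer and the my_text.english sub-field to use the std_english analyzer. The two requests below therefore produce different results: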

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "field": "my_text", 
  "text": "The old brown cow"
}
'
curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "field": "my_text.english", 
  "text": "The old brown cow"
}
'

The first request uses the standard analyzer, so the result is: [ the, old, brown, cow ]

The second request uses std_english, so the result is: [ old, brown, cow ]

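2.2. Standard Analyzer (default)

If no analyzer is specified, standard is the default. It provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm) and works well for most languages.

For example: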

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

In the example above, the text produces the following terms:

[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]

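2.2.1. Configuration

The standard analyzer accepts the following parameters:

max_token_length: the maximum token length; a token longer than this is split at max_token_length intervals. Defaults to 255
stopwords: a predefined stop word list such as _english_, or an array containing a list of stop words. Defaults to _none_
stopwords_path: the path to a file containing stop words

2.2.2. Example configuration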

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}
'
curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_english_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

The request above outputs the following terms:

[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
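The "jumpe, d" pair in that output comes from max_token_length being 5: a token longer than the limit is cut into chunks of at most that length. A small Python sketch of the effect follows. It starts from the terms listed in section 2.2, and the one-word stop list is a stand-in for _english_, which is much longer in reality.

```python
def split_long_token(token, max_token_length=255):
    """Cut a token into chunks of at most max_token_length characters."""
    return [
        token[i:i + max_token_length]
        for i in range(0, len(token), max_token_length)
    ]


# Terms the default standard analyzer produces for the sample sentence (section 2.2).
terms = ["the", "2", "quick", "brown", "foxes", "jumped",
         "over", "the", "lazy", "dog's", "bone"]

# Hypothetical stand-in for the _english_ stop word list.
stopwords = {"the"}

result = [
    chunk
    for term in terms
    for chunk in split_long_token(term, max_token_length=5)
    if chunk not in stopwords
]
print(result)
```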

2.2.3. Definition

The standard analyzer consists of the following two parts:

Tokenizer

Standard Tokenizer

Token Filters

Standard Token Filter
Lower Case Token Filter
Stop Token Filter (disabled by default)

You can also rebuild it as a custom analyzer:

curl -X PUT "localhost:9200/standard_example" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_standard": {
          "tokenizer": "standard",
          "filter": [
            "lowercase"       
          ]
        }
      }
    }
  }
}
'

2.3. Simple Analyzer

The simple analyzer breaks text into terms whenever it encounters a character that is not a letter, and it lowercases every term. For example:

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

The output is as follows:

[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
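The "split at anything that is not a letter, then lowercase" rule is easy to approximate with a regular expression. The sketch below is an approximation in Python, not the Lucene implementation, and simple_analyze is a made-up name.

```python
import re

# [^\W\d_] means "a word character that is neither a digit nor an underscore",
# which is a compact way to say "a letter".
LETTERS = re.compile(r"[^\W\d_]+")


def simple_analyze(text):
    """Keep maximal runs of letters and lowercase them."""
    return [run.lower() for run in LETTERS.findall(text)]


print(simple_analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
```

The digit 2 disappears and dog's becomes dog and s, because neither digits nor apostrophes are letters.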

2.3.1. Customization

curl -X PUT "localhost:9200/simple_example" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_simple": {
          "tokenizer": "lowercase",
          "filter": [         
          ]
        }
      }
    }
  }
}
'

2.4. Whitespace Analyzer

The whitespace analyzer breaks text into terms whenever it encounters a whitespace character.

Example:

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

The output is as follows:

[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]

2.5. Stop Analyzer

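The stop analyzer is very similar to the simple analyzer. The only difference is that it adds support for removing stop words. By default it uses the _english_ stop word list.

(PS: suppose the sentence is "this is an apple" and that "this" and "is" are stop words. The simple analyzer would output [ this, is, an, apple ], while the stop analyzer would output [ an, apple ]. That is the difference between the two: the stop analyzer does not emit stop words, that is, it does not treat a stop word as a term.)

(PS: stop words are very common words, such as "the", that are dropped from the token stream. The text is still split at non-letter characters, not at the stop words.)

2.5.1. Example output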

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
    "analyzer": "stop",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

Output:

[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]

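2.5.2. Configuration

The stop analyzer accepts the following parameters:

stopwords: a predefined stop word list (such as _english_) or an array containing a list of stop words. Defaults to _english_
stopwords_path: the path to a file containing stop words. The path is relative to the Elasticsearch config directory

2.5.3. Example configuration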

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["the", "over"]
        }
      }
    }
  }
}
'

The configuration above defines a stop analyzer with two stop words: the and over.

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_stop_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

With this configuration, the request outputs:

[ quick, brown, foxes, jumped, lazy, dog, s, bone ]
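The behaviour of my_stop_analyzer can be reproduced offline with a short Python approximation: split into lowercase runs of letters, then drop the configured stop words. The function name stop_analyze is an assumption, and the regular expression only approximates Elasticsearch's letter-based tokenization.

```python
import re


def stop_analyze(text, stopwords):
    """Lowercased runs of letters, minus the given stop words."""
    terms = [run.lower() for run in re.findall(r"[^\W\d_]+", text)]
    return [term for term in terms if term not in stopwords]


sentence = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
print(stop_analyze(sentence, stopwords={"the", "over"}))
```

Because the terms are lowercased before the stop word check, "The" at the start of the sentence is removed as well.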

2.6. Pattern Analyzer

The pattern analyzer uses a Java regular expression to split text into terms. The default pattern is \W+ (all non-word characters).

2.6.1. Example output

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

Because the default pattern splits on non-word characters, the output is:

[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

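2.6.2. Configuration

The pattern analyzer accepts the following parameters:

pattern: a Java regular expression. Defaults to \W+
flags: Java regular expression flags, pipe-separated, for example CASE_INSENSITIVE|COMMENTS
lowercase: whether terms should be lowercased. Defaults to true
stopwords: a predefined stop word list, or an array containing a list of stop words. Defaults to _none_
stopwords_path: the path to a file containing stop words

2.6.3. Example configuration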

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer": {
          "type":      "pattern",
          "pattern":   "\\W|_", 
          "lowercase": true
        }
      }
    }
  }
}
'

The example above splits on non-word characters or underscores and lowercases every term.

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_email_analyzer",
  "text": "John_Smith@foo-bar.com"
}
'

With this configuration, the example outputs:

[ john, smith, foo, bar, com ]
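The same splitting can be tried locally with Python's re module. Python and Java regular expressions differ in some details, so treat this only as an approximation for simple patterns such as \W+ and \W|_; pattern_analyze is a made-up helper name.

```python
import re


def pattern_analyze(text, pattern=r"\W+", lowercase=True):
    """Split text wherever the pattern matches, dropping empty pieces."""
    terms = [piece for piece in re.split(pattern, text) if piece]
    return [term.lower() for term in terms] if lowercase else terms


# Same pattern as my_email_analyzer: a non-word character or an underscore.
print(pattern_analyze("John_Smith@foo-bar.com", pattern=r"\W|_"))
```

With the default \W+ the underscore would not be a split point, since _ counts as a word character; that is why the email example adds |_ to the pattern.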

2.7. Language Analyzers

Language analyzers support text analysis for specific languages. The built-in (predefined) languages are: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai

2.8. Custom Analyzer

As mentioned earlier, an analyzer consists of three parts:

zero or more character filters
a tokenizer
zero or more token filters

2.8.1. Example configuration

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type":      "custom", 
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}
'
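To get a feel for what my_custom_analyzer does to a piece of text, the chain html_strip, standard, lowercase, asciifolding can be imitated in Python. Every step below is a rough stand-in: the tag stripping is a naive regular expression, \w+ is far simpler than the standard tokenizer, and the folding only removes combining accents. All function names are assumptions.

```python
import html
import re
import unicodedata


def html_strip(text):
    """Rough stand-in for html_strip: replace tags with spaces, decode entities."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def standard_like_tokenize(text):
    """Rough stand-in for the standard tokenizer: runs of word characters."""
    return re.findall(r"\w+", text)


def ascii_fold(token):
    """Rough stand-in for asciifolding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def my_custom_analyzer(text):
    tokens = standard_like_tokenize(html_strip(text))        # char filter, tokenizer
    return [ascii_fold(token.lower()) for token in tokens]   # token filters


# "d\u00e9j\u00e0" is "deja" with an acute e and a grave a.
print(my_custom_analyzer("<p>Is this <b>d\u00e9j\u00e0 vu</b>?</p>"))
```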

3. Tokenizer

3.1. Standard Tokenizer

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

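4. Chinese analyzers

4.1. smartCN

A simple analyzer for Chinese or mixed Chinese-English text.

The plugin provides the smartcn analyzer and the smartcn_tokenizer tokenizer, and neither needs any configuration.

# Install
bin/elasticsearch-plugin install analysis-smartcn
# Remove
bin/elasticsearch-plugin remove analysis-smartcn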

Let's test it.

With the smartcn analyzer, "今天天氣真好" is analyzed as:

[ 今天, 天氣, 真, 好 ]

With the standard analyzer, the result would be:

[ 今, 天, 天, 氣, 真, 好 ]

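4.2. IK analyzer

Download the release that matches your Elasticsearch version; here I downloaded 6.5.3.

Then create an ik directory under the Elasticsearch plugins directory and unzip the downloaded file into it.

Finally, restart Elasticsearch.

Next, test it with the same sentence.

The output is as follows: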

{
    "tokens": [
        {
            "token": "今天天氣",
            "start_offset": 0,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "今天",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "天天",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "天氣",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "真好",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 4
        }
    ]
}
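The offsets in this response overlap, which suggests a mode that emits every dictionary word it can find rather than one non-overlapping segmentation. A short Python check makes that visible. It assumes the input sentence was 今天天氣真好 and uses only the start_offset and end_offset pairs from the response above; the function name overlapping_pairs is made up for this example.

```python
text = "今天天氣真好"  # assumed input sentence

# (start_offset, end_offset) pairs copied from the response above.
offsets = [(0, 4), (0, 2), (1, 3), (2, 4), (4, 6)]

# Slicing the input by each offset pair recovers the token text.
tokens = [text[start:end] for start, end in offsets]
print(tokens)


def overlapping_pairs(spans):
    """Return index pairs (i, j), i < j, whose character ranges intersect."""
    return [
        (i, j)
        for i in range(len(spans))
        for j in range(i + 1, len(spans))
        if spans[i][0] < spans[j][1] and spans[j][0] < spans[i][1]
    ]


print(overlapping_pairs(offsets))
```

All characters here are in the Basic Multilingual Plane, so Python string indices line up with the offsets Elasticsearch reports.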