Elasticsearch Analyzer Internals

Date: 2019-11-17


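1 This article introduces the different analyzers and the scenarios each one suits.

Concepts involved

An analyzer usually consists of three parts: character filters, a tokenizer, and token filters. Once we understand how an analyzer works, we can configure one to fit our own use case.

Elasticsearch ships with 10 tokenizers, 31 token filters, 3 character filters, and a large number of configuration options. Plugins can be installed to extend this further. These are the raw materials for building an analyzer.

2 Components of an Analyzer

Internally, an analyzer is a pipeline:

Step 1 Character filtering (character filter)
Step 2 Tokenization
Step 3 Token filtering (token filter)

Elasticsearch provides 8 built-in analyzers by default. If none of them meets our needs, we can build a custom analyzer through the Settings API.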

PUT /my-index/_settings
{
    "index": {
        "analysis": {
            "analyzer": {
                "customHTMLSnowball": {
                    "type": "custom",
                    "char_filter": [
                        "html_strip"
                    ],
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "stop",
                        "snowball"
                    ]
                }
            }
        }
    }
}

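The custom analyzer above is named customHTMLSnowball. It does the following:

Removes HTML tags such as <p>, <a>, and <div> (html_strip character filter).
Tokenizes the text and drops punctuation (standard tokenizer).
Converts uppercase words to lowercase (lowercase token filter).
Filters out stop words such as "the", "they", "i", "a", "an", and "and" (stop token filter).
Extracts word stems (snowball token filter; Snowball is one of the most commonly used stemming algorithms for English).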

cats -> cat

catty -> cat

stemmer -> stem

stemming -> stem

stemmed -> stem

The two <em>lazy</em> dogs, were slower than the less lazy <em>dog</em>

A picture is worth a thousand words: the original post used a figure to show how customHTMLSnowball processes the text above.
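The three stages can also be imitated in a few lines of Python. This is only a rough sketch of the order in which the stages run: the regular expressions, the short stop-word set, and `toy_stem` are stand-ins assumed for illustration, not the real html_strip, standard, stop, or snowball implementations, so the real analyzer would produce different stems.

```python
import re

# Illustrative stop words only; the real stop filter uses its own list.
STOP_WORDS = {"a", "an", "and", "the", "they", "i"}

def strip_html(text):
    """Character filter: replace anything that looks like a tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    """Tokenizer: keep runs of letters and digits, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)

def toy_stem(token):
    """Token filter: strip a plural 's' (a crude stand-in for Snowball)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def analyze(text):
    tokens = tokenize(strip_html(text))                   # steps 1 and 2
    tokens = [t.lower() for t in tokens]                  # lowercase filter
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop filter
    return [toy_stem(t) for t in tokens]                  # stemming filter

text = "The two <em>lazy</em> dogs, were slower than the less lazy <em>dog</em>"
print(analyze(text))
```

Each stage only sees the output of the previous one, which is why the order of the filters in the analyzer definition matters.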

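3 How to choose the right analyzer?

3.1 Which analyzer should be used for long English text?

When the search targets long passages such as English blog posts, news articles, or forum threads, it is best to use an analyzer that includes a stemming token filter.

Common stemming token filters include stemmer, snowball, and porter_stem.

Take snowball as an example: it reduces sing / sings / singing to the stem sing, and it also drops the two stop words "they" and "are". Whether the user searches for sing, sings, or singing, the search runs against the term "sing", so the result set is the same.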

GET http://localhost:9200/_analyze?text=I%20sing%20he%20sings%20they%20are%20singing&analyzer=snowball

// Output (abbreviated)
{
  "tokens": [
    {"token": "i", "position": 1, ...},
    {"token": "sing", "position": 2, ...},
    {"token": "he", "position": 3, ...},
    {"token": "sing", "position": 4, ...},
    {"token": "sing", "position": 7, ...}
  ]
}
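To see why every form of the word returns the same result set, here is a minimal sketch of an inverted index built on stemmed terms. The `toy_stem` function and the three sample documents are assumptions made for illustration; it is not the Snowball algorithm and it does no stop-word removal.

```python
from collections import defaultdict

def toy_stem(word):
    """Crude suffix stripping, standing in for a real stemming filter."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def analyze(text):
    return [toy_stem(w) for w in text.lower().split()]

# Hypothetical documents, analyzed at index time.
docs = {1: "he sings", 2: "they are singing", 3: "I sing"}

index = defaultdict(set)  # term -> ids of documents containing it
for doc_id, text in docs.items():
    for term in analyze(text):
        index[term].add(doc_id)

def search(query):
    """Analyze the query the same way, then look up each term."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits

for q in ("sing", "sings", "singing"):
    print(q, sorted(search(q)))
```

Because the query goes through the same analysis as the documents, all three queries look up the single term "sing".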

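Stemming is widely used in English search, but it has limitations:

Stemming makes little (arguably no) sense for Chinese.
When searching for technical terms or personal names, stemming can make the results worse.

e.g. "flying fish" and "fly fishing" mean completely different things, yet after snowball processing both reduce to the same terms: fli fish.

A user searching for fly fishing gets results about flying fish instead, which is far from ideal.

For such cases, exact matching is recommended: a simple analysis strategy (no stemming, lowercase only) combined with a fuzzy query may be a better choice.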
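Fuzzy matching is based on edit distance between terms. The sketch below computes plain Levenshtein distance and uses it to accept near-misses; the `fuzzy_match` helper and its `max_edits` parameter are hypothetical names, and Elasticsearch's own fuzzy query differs in its details.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]

def fuzzy_match(query_term, indexed_term, max_edits=1):
    """Hypothetical helper: accept terms within max_edits of the query."""
    return edit_distance(query_term, indexed_term) <= max_edits

# Lowercase-only terms keep the two phrases apart...
print(fuzzy_match("fishing", "fish"))    # three edits apart
# ...while a small typo still finds the intended term.
print(fuzzy_match("fishng", "fishing"))  # one edit apart
```

With lowercase-only analysis, "flying fish" and "fly fishing" share no terms, and a small edit budget tolerates typos without merging them the way stemming does.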

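3.2 Which analyzer should be used for Chinese?

English is relatively easy to tokenize: splitting on whitespace and punctuation gets most of the way there. Chinese, however, has no spaces between words, and German sometimes joins words into compounds, so the default standard analyzer falls short.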

> curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty=true' -d '耶穌登山寶訓'
{
  "tokens" : [
    { "token" : "耶", "start_offset" : 0, "end_offset" : 1, "type" : "", "position" : 1 },
    { "token" : "穌", "start_offset" : 1, "end_offset" : 2, "type" : "", "position" : 2 },
    { "token" : "登", "start_offset" : 2, "end_offset" : 3, "type" : "", "position" : 3 },
    { "token" : "山", "start_offset" : 3, "end_offset" : 4, "type" : "", "position" : 4 },
    { "token" : "寶", "start_offset" : 4, "end_offset" : 5, "type" : "", "position" : 5 },
    { "token" : "訓", "start_offset" : 5, "end_offset" : 6, "type" : "", "position" : 6 }
  ]
}

The standard analyzer breaks 「耶穌登山寶訓」 into six separate characters, which is not very useful. A better result would be ["耶穌", "登山寶訓"].

Here we need a plugin to handle Chinese word segmentation. mmseg is a fairly reliable plugin for Chinese; once installed, it provides mmseg-analyzer, which handles Chinese reasonably well.
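Dictionary-based segmenters work by matching the text against a word list. The sketch below shows the simplest variant, forward maximum matching, with a two-word toy dictionary that is assumed for this example; real plugins such as mmseg use larger dictionaries and more elaborate disambiguation rules.

```python
def forward_max_match(text, dictionary):
    """Greedy segmentation: at each position take the longest dictionary word,
    falling back to a single character when nothing matches."""
    max_len = max(len(word) for word in dictionary)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens

toy_dictionary = {"耶穌", "登山寶訓"}  # hypothetical word list
print(forward_max_match("耶穌登山寶訓", toy_dictionary))
```

When a word is missing from the dictionary, the fallback yields single characters, which is the same output the standard analyzer gives.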

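3.3 Searching Tokens Exactly

When we search usernames, product categories, or tags, we want exact matches. At index time it is best not to tokenize or stem; the analyzer step can be skipped entirely.

Set "index": "not_analyzed" in the field's mapping so the raw text is indexed directly as a single term. (From Elasticsearch 5.0 onward, the keyword field type plays this role.)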
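The difference between an analyzed field and a non-analyzed one comes down to which terms end up in the index. A small sketch, using a made-up tag value and two hypothetical helper functions:

```python
def analyzed_terms(value):
    """Roughly what a simple analyzer does: lowercase and split into words."""
    return value.lower().split()

def exact_terms(value):
    """Non-analyzed indexing: the whole value is stored as one term."""
    return [value]

tag = "Fly Fishing"  # hypothetical tag value

print("fishing" in analyzed_terms(tag))   # True: partial words match
print("fishing" in exact_terms(tag))      # False: only the full value matches
print("Fly Fishing" in exact_terms(tag))  # True
```

Exact matching is case-sensitive and all-or-nothing, which is usually what we want for identifiers and tags.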

4 Configuring the IK Chinese Analyzer

First, test IK's basic functionality:

POST _analyze?pretty
{
  "analyzer": "ik_smart",
  "text": "中華人民共和國國歌"
}

Result:

{
    "tokens": [
        {
            "token": "中華人民共和國",
            "start_offset": 0,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "國歌",
            "start_offset": 7,
            "end_offset": 9,
            "type": "CN_WORD",
            "position": 1
        }
    ]
}

As we can see, ik_smart segments "中華人民共和國國歌" sensibly.

Another example:

POST _analyze?pretty
{
  "analyzer": "ik_smart",
  "text": "王者榮耀是最好玩的遊戲"
}

Result:

{
    "tokens": [
        {
            "token": "王者",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "榮耀",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "最",
            "start_offset": 5,
            "end_offset": 6,
            "type": "CN_CHAR",
            "position": 2
        },
        {
            "token": "好玩",
            "start_offset": 6,
            "end_offset": 8,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "遊戲",
            "start_offset": 9,
            "end_offset": 11,
            "type": "CN_WORD",
            "position": 4
        }
    ]
}

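This result is expected: IK's default dictionary does not treat 「王者榮耀」 as one word, so it is split into 「王者」 and 「榮耀」. If we do not want it split, we can configure a custom dictionary by following the instructions on the plugin's GitHub page.

IKAnalyzer.cfg.xml is located in: elasticsearch-5.4.0/plugins/ik/config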

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer extension configuration</comment>
	<!-- Users can configure their own extension dictionaries here -->
	<entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
	 <!-- Users can configure their own extension stop-word dictionaries here -->
	<entry key="ext_stopwords">custom/ext_stopword.dic</entry>
	<!-- Users can configure a remote extension dictionary here -->
	<!-- <entry key="remote_ext_dict">words_location</entry> -->
	<!-- Users can configure a remote extension stop-word dictionary here -->
	<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

//TODO

Once the configuration is in place, re-running the earlier request shows the effect.

While we are at it, let's test ik_max_word:

POST _analyze?pretty
{
  "analyzer": "ik_max_word",
  "text": "中華人民共和國國歌"
}

Result (for reference):

{
  "tokens": [
    {
      "token": "中華人民共和國",
      "start_offset": 0,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "中華人民",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "中華",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "華人",
      "start_offset": 1,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "人民共和國",
      "start_offset": 2,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "人民",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 5
    },
    {
      "token": "共和國",
      "start_offset": 4,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 6
    },
    {
      "token": "共和",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 7
    },
    {
      "token": "國",
      "start_offset": 6,
      "end_offset": 7,
      "type": "CN_CHAR",
      "position": 8
    },
    {
      "token": "國歌",
      "start_offset": 7,
      "end_offset": 9,
      "type": "CN_WORD",
      "position": 9
    }
  ]
}
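The contrast with ik_smart is that ik_max_word emits every word it can find, including overlapping ones. The sketch below imitates that idea by listing every substring that appears in a word list. The dictionary is simply the set of multi-character tokens from the output above, and the enumeration is an assumption for illustration, not IK's actual algorithm (for instance, it does not emit the single character 「國」).

```python
def all_dictionary_matches(text, dictionary):
    """Return (word, start, end) for every substring found in the dictionary,
    ordered by start offset and, within one offset, longest first."""
    matches = []
    for start in range(len(text)):
        for end in range(len(text), start, -1):
            word = text[start:end]
            if word in dictionary:
                matches.append((word, start, end))
    return matches

# Word list taken from the ik_max_word output above.
toy_dictionary = {
    "中華人民共和國", "中華人民", "中華", "華人",
    "人民共和國", "人民", "共和國", "共和", "國歌",
}

for word, start, end in all_dictionary_matches("中華人民共和國國歌", toy_dictionary):
    print(word, start, end)
```

Fine-grained output like this improves recall, since a query for 「人民」 or 「共和國」 can match, at the cost of a larger index.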

Now let's look at an example from the GitHub page:

POST /index/fulltext/_mapping
{
  "fulltext": {
    "_all": {
      "analyzer": "ik_smart"
    },
    "properties": {
      "content": {
        "type": "text"
      }
    }
  }
}

Index some documents:

POST /index/fulltext/1
{
  "content": "美國留給伊拉克的是個爛攤子嗎"
}

POST /index/fulltext/2
{
  "content": "公安部：各地校車將享最高路權"
}

POST /index/fulltext/3
{
  "content": "中韓漁警衝突調查：韓警平均天天扣1艘中國漁船"
}

POST /index/fulltext/4
{
  "content": "中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"
}

Search:

POST /index/fulltext/_search
{
  "query": {
    "match": {
      "content": "中國"
    }
  }
}

Result:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1.0869478,
    "hits": [
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "4",
        "_score": 1.0869478,
        "_source": {
          "content": "中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"
        }
      },
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "3",
        "_score": 0.61094594,
        "_source": {
          "content": "中韓漁警衝突調查：韓警平均天天扣1艘中國漁船"
        }
      },
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "1",
        "_score": 0.27179778,
        "_source": {
          "content": "美國留給伊拉克的是個爛攤子嗎"
        }
      }
    ]
  }
}

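ES indexes documents by their analyzed tokens and then returns the hits for a query ordered by score. Note that the content field in the mapping above has no analyzer of its own, so it most likely falls back to standard and is indexed character by character; that would explain why document 1 matches through the single character 「國」.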
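The three hits can be reproduced with a small sketch that assumes character-by-character indexing, as the standard analyzer does for Chinese. The function name and the "match if any term is shared" rule are simplifications made for illustration; scoring is left out entirely.

```python
def standard_like_terms(text):
    """Assumed behavior: every CJK character becomes its own term."""
    return {ch for ch in text if "\u4e00" <= ch <= "\u9fff"}

docs = {
    1: "美國留給伊拉克的是個爛攤子嗎",
    2: "公安部：各地校車將享最高路權",
    3: "中韓漁警衝突調查：韓警平均天天扣1艘中國漁船",
    4: "中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首",
}

query_terms = standard_like_terms("中國")

# A match query is an OR over the query terms by default.
matched = {doc_id for doc_id, text in docs.items()
           if standard_like_terms(text) & query_terms}
print(sorted(matched))
```

Document 1 shares only 「國」 with the query, which fits its low score in the response above.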

The plugin's GitHub page has an example worth studying: https://github.com/medcl/elasticsearch-analysis-ik

Here is another interesting example:

PUT /index1
{
  "settings": {
     "refresh_interval": "5s",
     "number_of_shards" :   1, 
     "number_of_replicas" : 0 
  },
  "mappings": {
    "_default_":{
      "_all": { "enabled":  false } 
    },
    "resource": {
      "dynamic": false, 
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "cn": {
              "type": "text",
              "analyzer": "ik_smart"
            },
            "en": {
              "type": "text",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}

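Multi-fields (fields) serve two purposes:

1. A single string can be mapped as text for full-text search and as keyword for sorting and aggregations.
2. They act like an alias for the same value, analyzed with a different analyzer.

Bulk-insert some documents: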

POST /_bulk
{ "create": { "_index": "index1", "_type": "resource", "_id": 1 } }
{ "title": "周星馳最新電影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 2 } }
{ "title": "周星馳最好看的新電影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 3 } }
{ "title": "周星馳最新電影，最好，新電影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 4 } }
{ "title": "最最最最好的新新新新電影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 5 } }
{ "title": "I'm not happy about the foxes" }

Search:

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields", 
      "query":    "fox",
      "fields": "title"
    }
  }
}

Result:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

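The reason: the query searches title for fox, but title uses the standard analyzer, so the indexed term is foxes and nothing matches. The following query does return a result: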

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields", 
      "query":    "fox",
      "fields": "title.en"
    }
  }
}

The result is omitted here; the query matches because title.en uses the english analyzer.
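The two queries can be imitated by indexing the same title through two different analysis functions, one per sub-field. Both functions are simplified stand-ins assumed for illustration: `standard_like` only lowercases and splits, and `english_like` adds a crude plural-stripping rule in place of the real english analyzer.

```python
import re

def standard_like(text):
    """Lowercase and split into runs of letters; no stemming."""
    return re.findall(r"[a-z]+", text.lower())

def english_like(text):
    """Same tokens plus a toy plural-stripping step."""
    stemmed = []
    for token in standard_like(text):
        if token.endswith("es") and len(token) > 4:
            token = token[:-2]
        elif token.endswith("s") and len(token) > 3:
            token = token[:-1]
        stemmed.append(token)
    return stemmed

title = "I'm not happy about the foxes"

# One source value, two sets of indexed terms.
fields = {
    "title": (standard_like, set(standard_like(title))),
    "title.en": (english_like, set(english_like(title))),
}

def matches(field, query):
    """Analyze the query with the field's own analyzer, then look it up."""
    analyzer, terms = fields[field]
    return any(term in terms for term in analyzer(query))

print(matches("title", "fox"))     # indexed term is "foxes"
print(matches("title.en", "fox"))  # indexed term is "fox"
```

The document is stored once; only the indexed terms differ per sub-field, so the choice of field in the query decides which analysis applies.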

Compare the output of the following queries to get a feel for how fields work:

GET /index1/resource/_search
{
  "query": {
    "match": {
      "title.cn": "the最好遊戲"
    }
  }
}

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields", 
      "query":    "the最新遊戲",
      "fields": [ "title", "title.cn", "title.en" ]
    }
  }
}

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields", 
      "query":    "the最新",
      "fields": "title.cn"
    }
  }
}

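Study the results to get a feel for the usage.

Next we test with 「王者榮耀」. The HotWords.php configured earlier turns out to be a double-edged sword: once 「王者榮耀」 is added to it, the phrase is treated as a whole and is no longer split into 「王者」 and 「榮耀」. But what if we do want to search for 「王者」? This is where fields show their strength, as the queries below demonstrate.

First index the data: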

POST /_bulk
{ "create": { "_index": "index1", "_type": "resource", "_id": 6 } }
{ "title": "王者榮耀最好玩的遊戲" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 7 } }
{ "title": "王者榮耀最好玩的新遊戲" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 8 } }
{ "title": "王者榮耀最新遊戲，最好玩，新遊戲" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 9 } }
{ "title": "最最最最好的新新新新遊戲" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 10 } }
{ "title": "I'm not happy about the foxes" }

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields", 
      "query":    "王者榮耀",
      "fields": "title.cn"
    }
  }
}

# The following returns no results
POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields", 
      "query":    "王者",
      "fields": "title.cn"
    }
  }
}

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields", 
      "query":    "王者",
      "fields": "title"
    }
  }
}

Comparing the results makes the difference clear; the output is omitted.
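Since the output is omitted, here is a rough sketch of why the second query finds nothing on title.cn while the third succeeds on title. It assumes a greedy longest-match segmenter with a small hypothetical dictionary for title.cn and character-by-character terms for title; neither is the real IK or standard implementation.

```python
def forward_max_match(text, dictionary):
    """Greedy longest-match segmentation with single-character fallback."""
    max_len = max(len(word) for word in dictionary)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens

def per_character(text):
    """Stand-in for the standard analyzer on Chinese text."""
    return [ch for ch in text if not ch.isspace()]

# Hypothetical dictionary after adding 「王者榮耀」 as a custom word.
toy_dictionary = {"王者榮耀", "王者", "榮耀", "好玩", "遊戲"}

title = "王者榮耀最好玩的遊戲"
title_cn = set(forward_max_match(title, toy_dictionary))
title_std = set(per_character(title))

def hit(query, analyze, indexed_terms):
    return any(term in indexed_terms for term in analyze(query))

cn = lambda q: forward_max_match(q, toy_dictionary)
print(hit("王者榮耀", cn, title_cn))       # whole phrase is one indexed term
print(hit("王者", cn, title_cn))           # 「王者」 was never indexed on its own
print(hit("王者", per_character, title_std))  # single characters still match
```

Keeping the plain title field alongside title.cn gives a fallback for queries that the dictionary-based field can no longer serve.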

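So the business requirements need to be well understood up front in order to design a good mapping, which also saves a lot of effort at search time.

Commands for inspecting tokenization: after configuring ES, test the analysis to check that it produces the expected tokens.

Using curl:

1. Inspect tokenization with a custom analyzer. ansj_index_synonym is the name of the custom analyzer; pretty formats the JSON output.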

curl -XGET 'http://localhost:8200/zh/_analyze?analyzer=ansj_index_synonym&pretty' -d '童裝童鞋'

2. Inspect tokenization with a custom tokenizer and filters:

curl -XGET 'http://localhost:8200/zh/_analyze?tokenizer=ansj_index&filters=synonym&pretty' -d '童裝童鞋'

3. Inspect how a specific field is analyzed:

curl -XGET 'http://localhost:8200/zh/_analyze?field=brand_name&pretty' -d '童裝童鞋'

"brand_name" is the field name; for a nested or object field it can also be written as "brand_name.name".
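Checking the output by eye gets tedious, so the comparison can be scripted. The sketch below parses an _analyze response and compares its tokens with what we expect; `extract_tokens` and `check_analysis` are hypothetical helpers, and the sample response is an abbreviated copy of the ik_smart output shown earlier rather than a live call.

```python
import json

def extract_tokens(response_text):
    """Pull the token strings, in order, out of an _analyze response body."""
    return [entry["token"] for entry in json.loads(response_text)["tokens"]]

def check_analysis(response_text, expected):
    """Return (ok, actual) so a failing check can show what came back."""
    actual = extract_tokens(response_text)
    return actual == expected, actual

# Abbreviated copy of the ik_smart response from section 4.
sample_response = """
{
  "tokens": [
    {"token": "中華人民共和國", "start_offset": 0, "end_offset": 7, "position": 0},
    {"token": "國歌", "start_offset": 7, "end_offset": 9, "position": 1}
  ]
}
"""

ok, actual = check_analysis(sample_response, ["中華人民共和國", "國歌"])
print(ok, actual)
```

A handful of such checks, run after every dictionary or analyzer change, catches segmentation regressions before they reach search results.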

Besides custom analyzers, ES also has built-in analyzers such as:

standard
simple
whitespace
stop
keyword
pattern
language
snowball
custom

Details: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html

The documentation is in English.

ES also has built-in tokenizers and token filters:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-tokenizers.html

Tokenizers:

standard
edge_ngram
keyword
letter
lowercase
ngram
whitespace
pattern
uax_email_url
path_hierarchy

Token filters:

asciifolding
length
lowercase
uppercase
nGram
edge_ngram
porter_stem
shingle
stop
word_delimiter
stemmer
stemmer_override
keyword_marker
keyword_repeat
kstem
snowball
phonetic
synonym
reverse
elision
truncate
unique
pattern_capture
pattern_replace
trim
limit
hunspell
common_grams
normalization
delimited_payload
keep_words