1: An Analyzer usually consists of three parts:
character filters, tokenizers, token filters
2 The components of an Analyzer:
Internally, an Analyzer is simply a pipeline
Step 1 Character filtering (Character filter)
Step 2 Tokenization
Step 3 Token filtering
3:Analyzer pipeline:
(input)
-----String----->> (CharacterFilters)
-----String----->> (Tokenizer)
-----Tokens----->> (TokensFilters)
-----Tokens----->>
(output)
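The three pipeline stages above can be illustrated with a toy sketch in plain Python. This is not how Elasticsearch is implemented; it only mimics the flow: a character filter strips HTML (like html_strip), a tokenizer splits on non-word characters (roughly like the standard tokenizer), and token filters lowercase and drop stop words.

```python
import re

def char_filter(text):
    # Stage 1 - character filter: strip HTML tags (analogous to html_strip)
    return re.sub(r"<[^>]+>", "", text)

def tokenizer(text):
    # Stage 2 - tokenizer: split on non-word characters (roughly "standard")
    return [t for t in re.split(r"\W+", text) if t]

def token_filters(tokens):
    # Stage 3 - token filters: lowercase, then remove stop words
    stop_words = {"the", "a", "an", "and"}
    return [t.lower() for t in tokens if t.lower() not in stop_words]

def analyze(text):
    # String -> String -> Tokens -> Tokens, as in the diagram above
    return token_filters(tokenizer(char_filter(text)))

print(analyze("<p>The Quick Brown Foxes</p>"))
# → ['quick', 'brown', 'foxes']
```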
========================Example 1==========================
{
  "index": {
    "analysis": {
      "analyzer": {
        "customHTMLSnowball": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "snowball"]
        }
      }
    }
  }
}
The custom Analyzer above is named customHTMLSnowball. It does the following:
Remove HTML tags (html_strip character filter), such as <p> <a> <div>.
Tokenize and strip punctuation (standard tokenizer)
Convert uppercase words to lowercase (lowercase token filter)
Filter out stop words (stop token filter), such as "the" "they" "i" "a" "an" "and".
Extract word stems (snowball token filter; the Snowball stemming algorithm is one of the most commonly used algorithms for stemming English words.)
cats -> cat
catty -> cat
stemmer -> stem
stemming -> stem
stemmed -> stem
========================Example 1==========================
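The behavior of customHTMLSnowball can be checked with the _analyze API. A minimal sketch, assuming the settings above were applied to an index named my_index (the index name is an assumption, not from the original):

```shell
# Ask Elasticsearch to run our custom analyzer on a sample string;
# the response should contain the stemmed, lowercased, stop-word-free tokens
curl -XGET "http://localhost:9200/my_index/_analyze" \
  -H 'Content-Type: application/json' -d'
{
  "analyzer": "customHTMLSnowball",
  "text": "<p>The cats are stemming</p>"
}'
```

With the filter chain above, a response would be expected to contain tokens like "cat" and "stem", with the HTML tags and the stop words "the"/"are" removed.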
========================Example 2==========================
裸心 ES search: pinyin search
curl -XPUT "http://localhost:9200/yyyy" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "char_filter": ["html_strip"],
          "filter": ["pinyin_filter", "lowercase", "stop", "ngram_1_20"]
        },
        "default_search": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "char_filter": ["html_strip"]
        }
      },
      "filter": {
        "ngram_1_20": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 20,
          "token_chars": ["letter", "digit"]
        },
        "pinyin_filter": {
          "type": "pinyin",
          "keep_original": true,
          "keep_joined_full_pinyin": true
        }
      }
    }
  }
}'
========================Example 2==========================
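The pinyin setup above can be verified by indexing a Chinese document and searching it by pinyin. A minimal sketch; the sample document and query text are illustrative, and it assumes the IK and elasticsearch-analysis-pinyin plugins are installed:

```shell
# Index a document; the default analyzer stores pinyin tokens
# alongside the original Chinese (keep_original, keep_joined_full_pinyin)
curl -XPOST "http://localhost:9200/yyyy/_doc/1?refresh" \
  -H 'Content-Type: application/json' -d'
{"title": "刘德华"}'

# A full-pinyin query should match, because the joined full pinyin
# of the title was indexed as a token at write time
curl -XGET "http://localhost:9200/yyyy/_search" \
  -H 'Content-Type: application/json' -d'
{"query": {"match": {"title": "liudehua"}}}'
```

Note the asymmetry in the settings: pinyin_filter runs only in the index-time default analyzer, not in default_search, so the query string is analyzed as-is while the stored document carries both Chinese and pinyin tokens.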