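Splitting text into terms and normalization (improves recall)
Given a sentence, the analyzer splits it into individual words and normalizes each one (for example, converting tense and singular/plural forms).
Recall: the number of relevant results a search is able to find; normalization increases it.
character filter: preprocesses the text before tokenization. The most common cases are stripping HTML tags (<span>hello</span> --> hello) and replacing & with and (I&you --> I and you).
tokenizer: splits the text into tokens: hello you and me --> hello, you, and, me
token filter: lowercase, stop word, synonym. For example: dogs --> dog, liked --> like, Tom --> tom, a/the/an --> removed, mother --> mom, small --> little
The analyzer matters: it runs a piece of text through all of these steps, and only the final result is used to build the inverted index.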
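The three stages above can be sketched in plain Python. This is only a toy model, not how Elasticsearch implements analysis: the function names and the small lookup tables (STOP_WORDS, SYNONYMS, STEMS) are hypothetical, and a real analyzer uses stemmers and full dictionaries instead of hand-written maps.

```python
import re

# Toy lookup tables (assumptions for illustration only).
STOP_WORDS = {"a", "an", "the"}
SYNONYMS = {"mother": "mom", "small": "little"}
STEMS = {"dogs": "dog", "liked": "like"}


def char_filter(text: str) -> str:
    """Preprocess raw text: drop HTML tags, then spell out '&'."""
    text = re.sub(r"<[^>]+>", " ", text)
    return text.replace("&", " and ")


def tokenizer(text: str) -> list[str]:
    """Split the filtered text into word tokens."""
    return re.findall(r"\w+", text)


def token_filter(tokens: list[str]) -> list[str]:
    """Lowercase, drop stop words, then apply toy stemming and synonyms."""
    out = []
    for tok in tokens:
        tok = tok.lower()
        if tok in STOP_WORDS:
            continue
        tok = STEMS.get(tok, tok)
        out.append(SYNONYMS.get(tok, tok))
    return out


def analyze(text: str) -> list[str]:
    """Run the stages in order: char filter, tokenizer, token filter."""
    return token_filter(tokenizer(char_filter(text)))


print(analyze("<span>Tom & mother liked the small dogs</span>"))
# ['tom', 'and', 'mom', 'like', 'little', 'dog']
```

The order matters: character filters see the raw string, the tokenizer sees the cleaned string, and token filters only ever see individual tokens.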
Set the shape to semi-transparent by calling set_trans(5)
standard analyzer: set, the, shape, to, semi, transparent, by, calling, set_trans, 5 (standard is the default)
simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans
whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
language analyzer (an analyzer for a specific language, for example english, the English analyzer): set, shape, semi, transpar, call, set_tran, 5
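To make the difference between two of these analyzers concrete, here is a rough Python approximation of the simple and whitespace analyzers. It assumes ASCII input and is not the Lucene implementation; the function names are made up for this sketch. The standard and english analyzers are left out because their word-boundary and stemming rules do not reduce to a one-line regular expression.

```python
import re

TEXT = "Set the shape to semi-transparent by calling set_trans(5)"


def simple_analyze(text: str) -> list[str]:
    """Approximate the simple analyzer: split on non-letters, then lowercase."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]


def whitespace_analyze(text: str) -> list[str]:
    """Approximate the whitespace analyzer: split on whitespace only."""
    return text.split()


print(simple_analyze(TEXT))
# ['set', 'the', 'shape', 'to', 'semi', 'transparent', 'by', 'calling', 'set', 'trans']
print(whitespace_analyze(TEXT))
# ['Set', 'the', 'shape', 'to', 'semi-transparent', 'by', 'calling', 'set_trans(5)']
```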
GET /_analyze
{
  "analyzer": "standard",
  "text": "Text to analyze"
}
GET /_analyze
{
  "analyzer": "english",
  "text": "Text to analyze"
}
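When calling _analyze from a script, the request body is plain JSON and the response carries a "tokens" array. The helpers below are a minimal sketch with hypothetical names. The sample response is hand-written in the shape the API returns, not captured from a live cluster, and no HTTP call is made here.

```python
import json


def build_analyze_request(analyzer: str, text: str) -> str:
    """Serialize the JSON body for GET /_analyze."""
    return json.dumps({"analyzer": analyzer, "text": text})


def tokens_from_response(response: dict) -> list[str]:
    """Pull the token strings out of an _analyze response."""
    return [entry["token"] for entry in response.get("tokens", [])]


# Hand-written sample (an assumption for illustration, not real cluster output).
sample_response = {
    "tokens": [
        {"token": "text", "start_offset": 0, "end_offset": 4, "type": "<ALPHANUM>", "position": 0},
        {"token": "to", "start_offset": 5, "end_offset": 7, "type": "<ALPHANUM>", "position": 1},
        {"token": "analyze", "start_offset": 8, "end_offset": 15, "type": "<ALPHANUM>", "position": 2},
    ]
}

print(build_analyze_request("standard", "Text to analyze"))
print(tokens_from_response(sample_response))
```

Each entry also carries offsets and a position, which is useful when checking how an analyzer affects phrase queries or highlighting.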
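1. The default analyzer
standard
standard tokenizer: splits on word boundaries
standard token filter: does nothing (removed in Elasticsearch 7.0)
lowercase token filter: converts all letters to lowercase
stop token filter (disabled by default): removes stop words such as a, the, it
2. Changing the analyzer settings
Enable the english stop word token filter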
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std": { // a name of your own choosing
          "type": "standard",
          "stopwords": "_english_" // enable removal of English stop words
        }
      }
    }
  }
}
Test
GET /my_index/_analyze
{
  "analyzer": "standard",
  "text": "a dog is in the house"
}
GET /my_index/_analyze
{
  "analyzer": "es_std",
  "text": "a dog is in the house"
}
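The expected difference between the two requests can be previewed offline. The sketch below imitates a standard-style analyzer with an optional stop word set. The tokenizer is a simplified regular expression rather than the real standard tokenizer, and the function name is hypothetical. The stop word list is the default English set that `_english_` refers to, as listed in the Elasticsearch documentation.

```python
import re

# Default English stop word set referenced by "_english_".
ENGLISH_STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
})


def standard_like(text: str, stopwords: frozenset = frozenset()) -> list[str]:
    """Tokenize on word characters, lowercase, then drop any stop words."""
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    return [t for t in tokens if t not in stopwords]


print(standard_like("a dog is in the house"))
# ['a', 'dog', 'is', 'in', 'the', 'house']
print(standard_like("a dog is in the house", ENGLISH_STOP_WORDS))
# ['dog', 'house']
```

With stop words enabled, only dog and house should remain, which is what the es_std request above is meant to demonstrate.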
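3. Building your own custom analyzer
If my_index still exists from step 2, delete it first (DELETE /my_index), because creating an index that already exists fails.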
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": { // define a named character mapping
        "&_to_and": {
          "type": "mapping", // mapping char filter
          "mappings": ["&=> and"]
        }
      },
      "filter": {
        "my_stopwords": { // define a named stop word filter
          "type": "stop", // stop word token filter
          "stopwords": ["the", "a"]
        }
      },
      "analyzer": {
        "my_analyzer": { // name of the custom analyzer
          "type": "custom", // custom analyzer
          "char_filter": ["html_strip", "&_to_and"], // html_strip is a built-in char filter that removes HTML; &_to_and replaces & with and
          "tokenizer": "standard", // the default tokenizer
          "filter": ["lowercase", "my_stopwords"] // lowercase is the built-in lowercasing filter; my_stopwords is the custom stop word filter defined above
        }
      }
    }
  }
}
// Analyze text with the custom analyzer
GET /my_index/_analyze
{
  "text": "tom&jerry are a friend in the house, <a>, HAHA!!",
  "analyzer": "my_analyzer"
}
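// Use the custom analyzer in a mapping (on Elasticsearch 7 and later, where mapping types are deprecated or removed, use PUT /my_index/_mapping instead)
PUT /my_index/_mapping/my_type
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}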