Elasticsearch: Full-Text Search

Date: 2019-12-13

Tags: elasticsearch, full-text search. Category: log analysis


1. Exact Match vs. Full-Text Search

1.1 Exact Match

exact value

Take 2017-01-01 as an exact value: you must search for the full 2017-01-01 to find it.
Searching for just 01 returns nothing.

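1.2 Full-Text Search

full text

Abbreviation vs. full name: cn vs. china
Word forms: like, liked, likes
Case: Tom vs. tom
Synonyms: like vs. love

For example:

2017-01-01 is split into 2017 01 01, so searching for 2017 or 01 finds it
china: searching for cn also finds china
likes: searching for like also finds likes
Tom: searching for tom also finds Tom
like: searching for love, a synonym, also finds like

In other words, full-text search does not just match one complete value. It splits the value into terms (tokenization) and matches on those, and it can also match across abbreviations, tenses, case, and synonyms.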
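The difference between the two modes can be sketched in a few lines of Python. This is only an illustration of the idea, not how Elasticsearch implements it; the function names `exact_match` and `full_text_match` are made up for this example, and the tokenization rule (split on anything that is not a letter or digit) is an assumption.

```python
import re


def exact_match(stored: str, query: str) -> bool:
    """Exact value: the query must equal the whole stored value."""
    return stored == query


def full_text_match(stored: str, query: str) -> bool:
    """Simplified full text: split the stored value into lowercase terms
    and report a hit if the lowercase query is one of those terms."""
    terms = re.findall(r"[A-Za-z0-9]+", stored.lower())
    return query.lower() in terms


date = "2017-01-01"
print(exact_match(date, "2017-01-01"))  # True
print(exact_match(date, "01"))          # False
print(full_text_match(date, "2017"))    # True
print(full_text_match(date, "01"))      # True
print(full_text_match("Tom", "tom"))    # True
```

Abbreviations and synonyms (cn vs. china, like vs. love) would need an extra lookup table, which this sketch leaves out.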

2. Inverted Index

doc1: I know my mom likes small dogs.

doc2: His mom likes dogs, so do I.

After tokenization, a first version of the inverted index looks like this:

Word	doc1	doc2
I	√	√
know	√	
my	√	
mom	√	√
likes	√	√
small	√	
dogs	√	√
His		√
so		√
do		√

If we search for mother like little dog, we get no results at all.
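A minimal Python sketch of this first inverted index, assuming a naive tokenizer that splits on whitespace and strips periods and commas. The helper names (`tokenize`, `build_index`, `search`) are hypothetical, and no normalization is applied yet.

```python
from collections import defaultdict

docs = {
    "doc1": "I know my mom likes small dogs.",
    "doc2": "His mom likes dogs, so do I.",
}


def tokenize(text):
    """Split on whitespace and strip surrounding periods and commas."""
    stripped = (word.strip(".,") for word in text.split())
    return [word for word in stripped if word]


def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return every document that contains at least one query term."""
    hits = set()
    for term in tokenize(query):
        hits |= index.get(term, set())
    return hits


index = build_index(docs)
print(sorted(index["mom"]))                     # ['doc1', 'doc2']
print(search(index, "mother like little dog"))  # set()
```

None of the four query terms appears verbatim in the index, so the search comes back empty.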

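That is not the result we want, because to a human reader these terms are equivalent. Is mother different from mom? They are synonyms; both mean mother. Is like different from likes? No, both mean to like; one is just the third-person singular form. Is little different from small? They are synonyms; both mean small. Is dog different from dogs? Both mean dog; one is singular and the other is plural.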

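In fact, when Elasticsearch builds the inverted index it applies normalization: each term produced by tokenization is processed so that later searches are more likely to find related documents.
Examples include converting tenses, singular and plural forms, synonyms, and case.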
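To illustrate the effect of normalization, the sketch below applies the same normalization step at index time and at query time. The `REWRITES` table is a hand-written stand-in for real stemming and synonym handling (an assumption for this example, not what Elasticsearch uses internally), and all function names are hypothetical.

```python
from collections import defaultdict

docs = {
    "doc1": "I know my mom likes small dogs.",
    "doc2": "His mom likes dogs, so do I.",
}

# Hand-written stand-in for stemming and synonym rules (illustrative only).
REWRITES = {
    "likes": "like",
    "liked": "like",
    "dogs": "dog",
    "mother": "mom",
    "small": "little",
}


def normalize(term):
    """Lowercase the term, then apply the rewrite table if it has an entry."""
    term = term.lower()
    return REWRITES.get(term, term)


def analyze(text):
    """Tokenize on whitespace, strip punctuation, and normalize each term."""
    stripped = (word.strip(".,") for word in text.split())
    return [normalize(word) for word in stripped if word]


def build_index(documents):
    """Map each normalized term to the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Analyze the query the same way as the documents, then look up terms."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits


index = build_index(docs)
print(sorted(search(index, "mother like little dog")))  # ['doc1', 'doc2']
```

Because the query passes through the same normalization as the documents, mother becomes mom, little and small both become little, and both documents are found.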

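3. Analyzers

3.1 What an Analyzer Does

Splits text into terms
Applies normalization (to improve recall)
Given a sentence, the analyzer splits it into individual words and normalizes each one (tense conversion, singular/plural conversion).

Improving recall means increasing the number of results a search is able to find.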

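An analyzer consists of three parts:

character filter: preprocesses the text before it is tokenized. The most common cases are stripping HTML tags (hello wrapped in tags --> hello) and replacing & with and (I&you --> I and you)
tokenizer: splits the text into terms, hello you and me --> hello, you, and, me
token filter: lowercase, stop word, synonym, and so on: dogs --> dog, liked --> like, Tom --> tom, a/the/an --> removed, mother --> mom, small --> little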
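The three stages can be sketched as three small Python functions chained together. This is a toy model of the pipeline, not Elasticsearch code: the regular expressions, the stop-word set, and the `REWRITES` table are all assumptions chosen to reproduce the examples above.

```python
import re

STOP_WORDS = {"a", "an", "the"}
# Illustrative stand-in for stemming and synonym token filters.
REWRITES = {"dogs": "dog", "liked": "like", "mother": "mom", "small": "little"}


def char_filter(text):
    """Stage 1: strip HTML tags and spell out '&' before tokenizing."""
    text = re.sub(r"<[^>]+>", "", text)
    return text.replace("&", " and ")


def tokenizer(text):
    """Stage 2: split the filtered text into raw terms."""
    return re.findall(r"[A-Za-z0-9]+", text)


def token_filter(tokens):
    """Stage 3: lowercase, drop stop words, apply stem/synonym rewrites."""
    result = []
    for token in tokens:
        token = token.lower()
        if token in STOP_WORDS:
            continue
        result.append(REWRITES.get(token, token))
    return result


def analyze(text):
    """Run the three stages in order."""
    return token_filter(tokenizer(char_filter(text)))


print(analyze("<b>Tom</b> liked the small dogs & mother"))
# ['tom', 'like', 'little', 'dog', 'and', 'mom']
```

The order matters: the character filter sees the raw string, the tokenizer sees cleaned text, and the token filters only ever see individual terms.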

3.2 Built-in Analyzers

Set the shape to semi-transparent by calling set_trans(5)

standard analyzer (the default)

set, the, shape, to, semi, transparent, by, calling, set_trans, 5

simple analyzer

set, the, shape, to, semi, transparent, by, calling, set, trans

whitespace analyzer

Set, the, shape, to, semi-transparent, by, calling, set_trans(5)

language analyzer (an analyzer for a specific language, for example english)

set, shape, semi, transpar, call, set_tran, 5
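The behavior of the first three analyzers on this sentence can be roughly imitated in Python. These are approximations that happen to give the same tokens for this particular input; the real standard analyzer follows Unicode text segmentation rules, which the `\w+` pattern below only loosely mimics. The function names are hypothetical.

```python
import re

TEXT = "Set the shape to semi-transparent by calling set_trans(5)"


def whitespace_like(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()


def simple_like(text):
    """Split on anything that is not a letter, then lowercase."""
    return re.findall(r"[a-z]+", text.lower())


def standard_like(text):
    """Rough approximation: lowercase runs of letters, digits, and '_'."""
    return re.findall(r"\w+", text.lower())


print(whitespace_like(TEXT))
# ['Set', 'the', 'shape', 'to', 'semi-transparent', 'by', 'calling', 'set_trans(5)']
print(simple_like(TEXT))
# ['set', 'the', 'shape', 'to', 'semi', 'transparent', 'by', 'calling', 'set', 'trans']
print(standard_like(TEXT))
# ['set', 'the', 'shape', 'to', 'semi', 'transparent', 'by', 'calling', 'set_trans', '5']
```

The english analyzer additionally removes stop words and stems each term, which needs a real stemmer and is not reproduced here.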

3.3 Testing an Analyzer

Syntax:

GET /_analyze
{
  "analyzer": "standard",
  "text": "Text to analyze"
}

Response:

{
  "tokens": [
    {
      "token": "text",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "to",
      "start_offset": 5,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "analyze",
      "start_offset": 8,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
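For scripting the same check, a small Python sketch can build the request body and pull the token strings out of the response. The helper names are hypothetical, and `send_analyze` assumes an unsecured node reachable at a base URL you supply (for example http://localhost:9200, which is an assumption, not something stated in this article). It is defined here but not called.

```python
import json
import urllib.request


def build_analyze_body(analyzer, text):
    """Serialize the JSON body for the _analyze endpoint."""
    return json.dumps({"analyzer": analyzer, "text": text})


def tokens_from_response(body):
    """Extract just the token strings from an _analyze response body."""
    return [entry["token"] for entry in json.loads(body)["tokens"]]


def send_analyze(base_url, analyzer, text):
    """POST to <base_url>/_analyze; base_url is an assumed, reachable node."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/_analyze",
        data=build_analyze_body(analyzer, text).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return tokens_from_response(response.read().decode("utf-8"))


# The response shown above, trimmed to the fields this sketch reads.
sample_response = json.dumps({
    "tokens": [
        {"token": "text", "start_offset": 0, "end_offset": 4, "position": 0},
        {"token": "to", "start_offset": 5, "end_offset": 7, "position": 1},
        {"token": "analyze", "start_offset": 8, "end_offset": 15, "position": 2},
    ]
})

print(tokens_from_response(sample_response))  # ['text', 'to', 'analyze']
```

Parsing is kept separate from sending so the token extraction can be tried against a saved response without a running cluster.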
