Elasticsearch in Depth (8), Search Engine: mapping, exact match vs. full-text search, analyzers, and a mapping summary

First, a brief description of what a mapping is.

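A mapping is the data structure and related configuration that is created, automatically or manually, for a type in an index.
With dynamic mapping, ES automatically creates the index, the type, and the mapping for that type. The mapping records the data type of each field and settings such as how the field is analyzed.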

Let's insert a few documents and let ES create the index for us automatically.

PUT /website/article/1
{
  "post_date": "2019-08-21",
  "title": "my first article",
  "content": "this is my first article in this website",
  "author_id": 11400
}

PUT /website/article/2
{
  "post_date": "2019-08-22",
  "title": "my second article",
  "content": "this is my second article in this website",
  "author_id": 11400
}

PUT /website/article/3
{
  "post_date": "2019-08-23",
  "title": "my third article",
  "content": "this is my third article in this website",
  "author_id": 11400
}

View the mapping structure

GET /website/_mapping

{
  "website": {
    "mappings": {
      "article": {
        "properties": {
          "author_id": {
            "type": "long"
          },
          "content": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "post_date": {
            "type": "date"
          },
          "title": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}

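The mapping above was generated automatically when the data was inserted; a mapping can also be created manually. Either way, the data structure and related configuration created for a type in an index is called a mapping.

Try various searches

GET /website/article/_search?q=2019                       // 3 results
GET /website/article/_search?q=2019-08-21                 // 3 results
GET /website/article/_search?q=post_date:2019-08-21       // 1 result
GET /website/article/_search?q=post_date:2019             // 0 results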

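Why are the results inconsistent? When ES created the mapping automatically, it assigned different data types to different fields, and different data types behave differently when they are analyzed and searched. That is why a search against the _all field and a search against the post_date field behave so differently.
Below is a manually created mapping.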

PUT /test_mapping
{
  "mappings" : {
    "properties" : {
      "author_id" : {
        "type" : "long"
      },
      "content" : {
        "type" : "text",
        "fields" : {
          "keyword" : {
            "type" : "keyword",
            "ignore_above" : 256
          }
        }
      },
      "post_date" : {
        "type" : "date"
      },
      "title" : {
        "type" : "text",
        "fields" : {
          "keyword" : {
            "type" : "keyword",
            "ignore_above" : 256
          }
        }
      }
    }
  }
}

Exact match vs. full-text search: a comparison

exact value

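For a document to be returned, the field must match the entire value.
Example:

GET /website/article/_search?q=post_date:2019-08-21       // 1 result
GET /website/article/_search?q=post_date:2019             // 0 results

Because post_date is an exact value, you must enter the full 2019-08-21 to find the document.
If you enter only 21, nothing is found.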

full text

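Full text differs from exact value: rather than matching only the complete value, the value is split into words (tokenized) and matched on those, and matching can also account for abbreviations, tense, case, synonyms, and so on.
Example:

GET /website/article/_search?q=2019                  // 3 results
GET /website/article/_search?q=2019-08-21            // 3 results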

 

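Core principle of the inverted index

The following walks through a simplified construction of an inverted index; in practice the process is far more complex.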
doc1: I really liked my small dogs, and I think my mom also liked them.
doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.

Tokenize the text and build a first version of the inverted index

word    doc1    doc2
I        *        *
really   *
liked    *        *
my       *        *
small    *
dogs     *        *
and      *
think    *
mom      *        *
also     *        
them     *
He                *
never             *
any               *
so                *
hope              *
that              *
will              *
not               *
expect            *
me                *
to                *
him               *

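Searching for mother like little dog returns no results, because none of the query terms is in the index:
mother
like
little
dog
This is certainly not the result we want. For example, mother and mom mean the same thing, yet the documents cannot be found. A quick test, however, shows that ES can find them. That is because, when ES builds the inverted index, it performs one more step: it processes each token to raise the chance that later searches find related documents, for example by converting tense, singular and plural forms, synonyms, and case. This process is called normalization:
mother -> mom
liked -> like
small -> little
dogs -> dog
The inverted index is then rebuilt as follows: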

word    doc1    doc2
I        *        *
really   *
like     *        *
my       *        *
little   *
dog      *        *
and      *
think    *
mom      *        *
also     *        
them     *
He                *
never             *
any               *
so                *
hope              *
that              *
will              *
not               *
expect            *
me                *
to                *
him               *

Query: mother like little dog, after tokenization and normalization:
mother -> mom
like -> like
little -> little
dog -> dog
Both doc1 and doc2 are returned:
doc1: I really liked my small dogs, and I think my mom also liked them.
doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.
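The index construction and query normalization described above can be sketched in a few lines of Python. This is a toy model, not how ES implements it: the `NORMALIZE` table simply hard-codes the four conversions listed in this article, the regex tokenizer is an assumption, and every token is lowercased (so `I` and `He` become `i` and `he`, unlike the table above).

```python
import re
from collections import defaultdict

# Hypothetical normalization table, copied from the conversions listed in the article.
NORMALIZE = {"mother": "mom", "liked": "like", "small": "little", "dogs": "dog"}

DOCS = {
    "doc1": "I really liked my small dogs, and I think my mom also liked them.",
    "doc2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

def analyze(text):
    """Lowercase, split on non-letters, then apply the normalization table."""
    words = re.findall(r"[a-z]+", text.lower())
    return [NORMALIZE.get(word, word) for word in words]

def build_index(docs):
    """Map each normalized term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents matching at least one query term (OR semantics)."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits

index = build_index(DOCS)
print(sorted(search(index, "mother like little dog")))
```

The key point is that the same `analyze` function runs at index time and at query time, which is why `mother` in the query can meet `mom` in the index.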

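Analyzers

Tokenization plus normalization (improves recall)

Given a sentence, an analyzer splits it into individual words and normalizes each word (tense conversion, singular/plural conversion).
Recall: when searching, increasing the number of results that can be found.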

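  • character filter: pre-processes the text before it is tokenized, most commonly stripping HTML tags (<span>hello</span> --> hello) or converting & --> and (I&you --> I and you)
  • tokenizer: splits the text into tokens, hello you and me --> hello, you, and, me
  • token filter: lowercase, stop words, synonyms, for example dogs --> dog, liked --> like, Tom --> tom, a/the/an --> removed, mother --> mom, small --> little

The analyzer is important: it applies all of this processing to a piece of text, and only the final result is used to build the inverted index.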
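The three stages listed above can be sketched as three small functions chained together. This is only an illustration under stated assumptions: the function names, the `STOP_WORDS` set, and the `SYNONYMS` table are made up for this example and reuse the conversions mentioned in the list; real ES analyzers are configured, not hand-written like this.

```python
import re

# Hypothetical tables built from the examples in the list above.
STOP_WORDS = {"a", "an", "the"}
SYNONYMS = {"mother": "mom", "small": "little", "dogs": "dog", "liked": "like"}

def char_filter(text):
    """Pre-process raw text: strip HTML tags and spell out '&' as 'and'."""
    text = re.sub(r"<[^>]+>", "", text)
    return text.replace("&", " and ")

def tokenizer(text):
    """Split the filtered text into word tokens."""
    return re.findall(r"[A-Za-z0-9_]+", text)

def token_filter(tokens):
    """Lowercase each token, drop stop words, and apply the synonym table."""
    result = []
    for token in tokens:
        token = token.lower()
        if token in STOP_WORDS:
            continue
        result.append(SYNONYMS.get(token, token))
    return result

def analyze(text):
    """Run the full pipeline: character filter, then tokenizer, then token filter."""
    return token_filter(tokenizer(char_filter(text)))

print(analyze("<span>Tom</span> liked the small dogs & his mother"))
```

The order matters: the character filter sees the raw string, the tokenizer sees cleaned text, and the token filter only ever sees individual tokens.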

Built-in analyzers:

Text to analyze: Set the shape to semi-transparent by calling set_trans(5)

standard analyzer: set, the, shape, to, semi, transparent, by, calling, set_trans, 5 (standard is the default)
simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans
whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
language analyzer (language-specific, for example english): set, shape, semi, transpar, call, set_tran, 5
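To get a feel for why the outputs differ, here is a rough Python approximation of three of these analyzers. The function names are made up, and the regexes are simplifications that happen to reproduce the token lists above for this one sentence; the real standard analyzer uses Unicode text segmentation rather than `\w+`, so other inputs may differ.

```python
import re

TEXT = "Set the shape to semi-transparent by calling set_trans(5)"

def whitespace_like(text):
    """Split on whitespace only; case and punctuation are left untouched."""
    return text.split()

def simple_like(text):
    """Lowercase, then split on every character that is not a letter."""
    return re.findall(r"[a-z]+", text.lower())

def standard_like(text):
    """Rough stand-in: lowercase, keep runs of letters, digits, and underscores."""
    return re.findall(r"\w+", text.lower())

print(whitespace_like(TEXT))
print(simple_like(TEXT))
print(standard_like(TEXT))
```

The english analyzer is left out because stemming (transparent to transpar) and stop-word removal need more than a regex.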

Resolving the open question from the mapping intro example

GET /_search?q=2019

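This searches the _all field: all fields of a document are concatenated into one long string, which is then tokenized. For the second document the string is:

2019-08-22 my second article this is my second article in this website 11400

        doc1        doc2        doc3
2019      *          *           *
08        *          *           *
21        *
22                   *
23                               *

Searching _all for 2019 therefore finds all 3 documents, and so does 2019-08-21, because its tokens 2019 and 08 appear in every document.

GET /_search?q=post_date:2019-08-21

A date is indexed as an exact value, so only the complete date matches:

             doc1        doc2        doc3
2019-08-21    *
2019-08-22                 *
2019-08-23                             *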
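The four search results from the start of the article can be reproduced with a small model of the two behaviors. This is a sketch under assumptions: `search_all` imitates a search over the concatenated fields with OR semantics and a `\w+` tokenizer, and `search_post_date` imitates exact-value matching by plain string equality; neither is the actual ES query logic.

```python
import re

# The three sample documents inserted at the start of the article.
DOCS = {
    1: {"post_date": "2019-08-21", "title": "my first article",
        "content": "this is my first article in this website", "author_id": 11400},
    2: {"post_date": "2019-08-22", "title": "my second article",
        "content": "this is my second article in this website", "author_id": 11400},
    3: {"post_date": "2019-08-23", "title": "my third article",
        "content": "this is my third article in this website", "author_id": 11400},
}

def tokens(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def search_all(query):
    """Full text: match if any query token occurs in the concatenated fields."""
    wanted = tokens(query)
    return [doc_id for doc_id, doc in DOCS.items()
            if wanted & tokens(" ".join(str(value) for value in doc.values()))]

def search_post_date(query):
    """Exact value: the whole query must equal the stored date string."""
    return [doc_id for doc_id, doc in DOCS.items() if doc["post_date"] == query]

print(len(search_all("2019")))                # q=2019
print(len(search_all("2019-08-21")))          # q=2019-08-21
print(len(search_post_date("2019-08-21")))    # q=post_date:2019-08-21
print(len(search_post_date("2019")))          # q=post_date:2019
```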

Testing an analyzer

Syntax:

GET /_analyze
{
  "analyzer": "standard",
  "text": "Text to analyze"
}

{
  "tokens": [
    {
      "token": "text",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "to",
      "start_offset": 5,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "analyze",
      "start_offset": 8,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
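To make the `start_offset`, `end_offset`, and `position` fields in the response concrete, the following Python sketch computes the same values for the sample text. It is an approximation only: the `analyze` function and its `\w+` tokenizer are assumptions standing in for the standard analyzer, and the `type` field is omitted.

```python
import re

def analyze(text):
    """Return lowercase tokens with character offsets and token positions."""
    return [
        {
            "token": match.group().lower(),
            "start_offset": match.start(),   # index of the token's first character
            "end_offset": match.end(),       # index just past the token's last character
            "position": position,            # ordinal of the token in the text
        }
        for position, match in enumerate(re.finditer(r"\w+", text))
    ]

for token in analyze("Text to analyze"):
    print(token)
```

Offsets point back into the original string (useful for highlighting), while positions count tokens (used for phrase matching).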

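A further summary of mapping

  1. When data is inserted directly into ES, ES automatically creates the index, along with the type and its mapping
  2. The mapping automatically defines the data type of each field
  3. Depending on the data type (for example text vs. date), a field may be an exact value or full text
  4. An exact value is added to the inverted index as a whole, with the entire value as a single term; full text first goes through analysis, that is tokenization and normalization (tense, synonym, and case conversion), before it is added to the inverted index
  5. At search time, whether a field is exact value or full text also determines how it is searched, consistent with how its inverted index was built: an exact value is matched against the whole value, while a full-text query is itself tokenized and normalized before being looked up in the inverted index
  6. You can let ES dynamic mapping create the mapping automatically, including the data types; or you can create the mapping for the index and type manually in advance and configure each field yourself, including its data type, indexing behavior, analyzer, and so on

In essence, a mapping is the metadata of a type in an index; it determines the data types, how the inverted index is built, and how searches behave.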

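Core mapping data types and dynamic mapping

  • Core data types
    text (string before ES 5.x): string type
    byte: byte type
    short: short integer
    integer: integer
    long: long integer
    float: floating point
    boolean: boolean
    date: date/time

    There are also complex types such as arrays and objects; underneath, their values are still indexed as the core types above.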

  • dynamic mapping
    true or false -> boolean
    123 -> long
    123.45 -> float
    "2017-01-01" -> date
    "hello world" -> text (with a keyword sub-field)

     

  • Viewing a mapping

    Syntax:
    GET /{index}/_mapping
    GET /{index}/_mapping/{type}
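The dynamic mapping rules listed above amount to a type-inference step over the JSON values of the first document. A minimal Python sketch, assuming a single hard-coded `yyyy-MM-dd` date pattern (ES checks strings against its configured date formats) and ignoring the keyword sub-field; `infer_type` and `infer_mapping` are hypothetical names:

```python
import re

# Assumption: only yyyy-MM-dd strings are treated as dates in this sketch.
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def infer_type(value):
    """Guess a field type from a JSON value, following the rules listed above."""
    if isinstance(value, bool):      # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "date" if DATE_PATTERN.match(value) else "text"
    raise TypeError(f"unsupported value: {value!r}")

def infer_mapping(doc):
    """Build a properties-style dict with one inferred type per field."""
    return {field: {"type": infer_type(value)} for field, value in doc.items()}

doc = {
    "post_date": "2019-08-21",
    "title": "my first article",
    "author_id": 11400,
}
print(infer_mapping(doc))
```

Because the type is guessed from the first value seen, a field whose first value looks like a date is mapped as date, which is exactly what happened to post_date in the intro example.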

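Manually creating and modifying mappings, and controlling whether a string field is analyzed

Note: you can define a mapping manually only when creating the index, or add a mapping for a new field; you cannot update the mapping of an existing field.

  • "analyzer": "standard": analyzed with the standard analyzer
  • date: date
  • keyword: not analyzed
# Create the index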
PUT /website
{
  "mappings": {
    "properties": {
      "author_id": {
        "type": "long"
      },
      "title": {
        "type": "text",
        "analyzer": "standard"
      },
      "content": {
        "type": "text"
      },
      "post_date": {
        "type": "date"
      },
      "publisher_id": {
        "type": "keyword"
      }
    }
  }
}


# Try to modify a field's mapping by sending the index definition again; the request is rejected
PUT /website
{
  "mappings": {
    "properties": {
      "author_id": {
        "type": "text"
      }
    }
  }
}

{
  "error": {
    "root_cause": [
      {
        "type": "resource_already_exists_exception",
        "reason": "index [website/5xLohnJITHqCwRYInmBFmA] already exists",
        "index_uuid": "5xLohnJITHqCwRYInmBFmA",
        "index": "website"
      }
    ],
    "type": "resource_already_exists_exception",
    "reason": "index [website/5xLohnJITHqCwRYInmBFmA] already exists",
    "index_uuid": "5xLohnJITHqCwRYInmBFmA",
    "index": "website"
  },
  "status": 400
}


# Add a new field to the mapping
PUT /website/_mapping
{
  "properties": {
    "new_field": {
      "type": "text"
    }
  }
}

{
  "acknowledged" : true
}
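The rule stated in the note above (new fields may be added, existing fields may not be changed) can be modeled as a merge that rejects type conflicts. This is a toy model only: `merge_mapping` is a hypothetical helper, it compares nothing but the `type` value, and the error text is invented for the example rather than copied from ES.

```python
def merge_mapping(existing, update):
    """Return a new mapping with the fields from `update` added.

    Raises ValueError if `update` tries to change the type of an existing field.
    """
    merged = dict(existing)
    for field, spec in update.items():
        if field in merged and merged[field]["type"] != spec["type"]:
            raise ValueError(
                f"cannot change field [{field}] from "
                f"[{merged[field]['type']}] to [{spec['type']}]"
            )
        merged[field] = spec
    return merged

mapping = {"author_id": {"type": "long"}, "title": {"type": "text"}}

# Adding a new field is accepted.
mapping = merge_mapping(mapping, {"new_field": {"type": "text"}})
print(sorted(mapping))

# Changing the type of an existing field is rejected.
try:
    merge_mapping(mapping, {"author_id": {"type": "text"}})
except ValueError as err:
    print("rejected:", err)
```

The restriction makes sense because documents already indexed under the old type would no longer agree with the new definition.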

Complex mapping types and the underlying structure of object fields

  1. multivalue field
    {
        "tags": ["tag1", "tag2"]
    }

    Indexed the same way as a single string value; the values in one array must not mix data types

  2. empty field
    null, [], [null]
  3. object field
    Seed data:
    PUT /company/employee/1
    {
      "address": {
        "country": "china",
        "province": "guangdong",
        "city": "guangzhou"
      },
      "name": "jack",
      "age": 27,
      "join_date": "2017-01-01"
    }

    View the mapping

    GET /company/_mapping/employee
    {
      "company": {
        "mappings": {
          "employee": {
            "properties": {
              "address": {
                "properties": {
                  "city": {
                    "type": "text",
                    "fields": {
                      "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                      }
                    }
                  },
                  "country": {
                    "type": "text",
                    "fields": {
                      "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                      }
                    }
                  },
                  "province": {
                    "type": "text",
                    "fields": {
                      "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                      }
                    }
                  }
                }
              },
              "age": {
                "type": "long"
              },
              "join_date": {
                "type": "date"
              },
              "name": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          }
        }
      }
    }

    How an object field is stored underneath

    {
      "address": {
        "country": "china",
        "province": "guangdong",
        "city": "guangzhou"
      },
      "name": "jack",
      "age": 27,
      "join_date": "2017-01-01"
    }

    ↓↓↓↓

    {
        "name":            [jack],
        "age":          [27],
        "join_date":      [2017-01-01],
        "address.country":         [china],
        "address.province":   [guangdong],
        "address.city":  [guangzhou]
    }

    An array of objects is flattened in the same way:

    {
        "authors": [
            { "age": 26, "name": "Jack White"},
            { "age": 55, "name": "Tom Jones"},
            { "age": 39, "name": "Kitty Smith"}
        ]
    }

    ↓↓↓↓

    {
        "authors.age":    [26, 55, 39],
        "authors.name":   [jack, white, tom, jones, kitty, smith]
    }
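The flattening shown above can be sketched as a short recursive function. This is an illustration, not ES internals: `flatten` is a hypothetical helper, and it keeps the raw values, so `authors.name` holds the full names rather than the analyzed tokens (jack, white, ...) shown above.

```python
def flatten(doc, prefix=""):
    """Flatten nested objects into dotted field names mapped to lists of values."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        # Treat a single value as a one-element array.
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                # Recurse into inner objects and merge their values per field path.
                for sub_path, sub_values in flatten(item, path + ".").items():
                    flat.setdefault(sub_path, []).extend(sub_values)
            else:
                flat.setdefault(path, []).append(item)
    return flat

employee = {
    "address": {"country": "china", "province": "guangdong", "city": "guangzhou"},
    "name": "jack",
    "age": 27,
    "join_date": "2017-01-01",
}
authors = {
    "authors": [
        {"age": 26, "name": "Jack White"},
        {"age": 55, "name": "Tom Jones"},
        {"age": 39, "name": "Kitty Smith"},
    ]
}

print(flatten(employee))
print(flatten(authors))
```

Note what the flattened form loses: once the values are merged per field, nothing records that age 26 belonged to Jack White rather than to Tom Jones.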