ElasticSearch - How to search for a part of a word with ElasticSearch

Source: Stack Overflow

https://stackoverflow.com/questions/6467067/how-to-search-for-a-part-of-a-word-with-elasticsearch

Reproducing the scenario

// initialize the test data

POST /my_idx/my_type/_bulk
{"index": {"_id": "1"}}
{"name": "John Doeman", "function": "Janitor"}
{"index": {"_id": "2"}}
{"name": "Jane Doewoman", "function": "Teacher"}
{"index": {"_id": "3"}}
{"name": "Jimmy Jackal", "function": "Student"}

Question

ElasticSearch contains the following documents:

{
  "_id" : "1",
  "name" : "John Doeman",
  "function" : "Janitor"
}
{
  "_id" : "2",
  "name" : "Jane Doewoman",
  "function" : "Teacher"
}
{
  "_id" : "3",
  "name" : "Jimmy Jackal",
  "function" : "Student"
}

We now want to find all documents containing Doe:

// returns no documents

GET /my_idx/my_type/_search?q=Doe
// returns one document

GET /my_idx/my_type/_search?q=Doeman

The asker also tried changing the analyzer and switching to a request-body query, but that did not work either:

GET /my_idx/my_type/_search
{
  "query": {
    "term": {
      "name": "Doe"
    }
  }
}

Later the asker tried an nGram tokenizer and filter:

{
  "index": {
    "index": "my_idx",
    "type": "my_type",
    "bulk_size": "100",
    "bulk_timeout": "10ms",
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer",
          "filter": [
            "my_ngram_filter"
          ]
        }
      },
      "filter": {
        "my_ngram_filter": {
          "type": "nGram",
          "min_gram": 1,
          "max_gram": 1
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": 1,
          "max_gram": 1
        }
      }
    }
  }
}

This introduced a different problem: every query now returned all documents.

Answers

First of all, this is an analysis problem. By default the index uses the standard analyzer. For the three documents shown above, indexing yields the following term table (considering only the name field):

term       doc IDs
john       1
doeman     1
jane       2
doewoman   2
jimmy      3
jackal     3
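
You can verify this with the _analyze API: the standard analyzer splits on word boundaries and lowercases each token, so Doeman is indexed as the single term doeman:

POST my_idx/_analyze
{
  "analyzer": "standard",
  "text": "John Doeman"
}

// returns two tokens: "john" and "doeman"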

Now consider each of the searches:

Search 1

GET /my_idx/my_type/_search?q=Doe

The standard analyzer turns Doe into the term doe, which is then looked up in the term table. No such term exists, so nothing is returned.

Search 2

GET /my_idx/my_type/_search?q=Doeman

The standard analyzer turns Doeman into doeman, which does exist in the term table. Only doc ID 1 contains it, so exactly one document is returned.

Search 3

GET /my_idx/my_type/_search
{
    "query": {
        "term": {
            "name": "Doe"
        }
    }
}

A term query is not analyzed, so Doe stays Doe. But Doe still does not exist in the term table, so this approach also returns no documents.

Search 4

As an additional note, the asker did not actually try the following:

GET /my_idx/my_type/_search
{
    "query": {
        "term": {
            "name": "Doeman"
        }
    }
}

Don't assume this one matches: because term queries are not analyzed, Doeman is looked up in the term table as-is and matches nothing either, unless you change Doeman to doeman.
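
For illustration, the lowercase variant does match, because doeman is exactly the term the standard analyzer put into the index:

GET /my_idx/my_type/_search
{
  "query": {
    "term": {
      "name": "doeman"
    }
  }
}

// returns the John Doeman document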

Solutions

Summarizing the answers on Stack Overflow, there are several workable approaches:

  • Regular expression queries
  • Wildcard queries
  • Prefix matching
  • The nGram tokenizer

Regular expression queries

GET my_idx/my_type/_search
{
  "query": {
    "regexp": {
      "name": "doe.*"
    }
  }
}
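
Note that regexp queries are not analyzed either; the pattern is matched against the terms stored in the index, which the standard analyzer has already lowercased. That is why the pattern is written doe.* rather than Doe.*.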

Wildcard queries

Use query_string together with wildcards. Be aware that wildcard queries can use large amounts of memory and perform poorly.

Suffix matching with a leading wildcard (e.g. "*ing") is an especially heavy operation, since every term in the index has to be examined. Leading wildcards can be disallowed via the allow_leading_wildcard setting.

GET my_idx/my_type/_search
{
  "query": {
    "query_string": {
      "default_field": "name",
      "query": "Doe*"
    }
  }
}
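
For example, leading wildcards can be rejected outright (the setting defaults to true) by passing allow_leading_wildcard with the query:

GET my_idx/my_type/_search
{
  "query": {
    "query_string": {
      "default_field": "name",
      "query": "Doe*",
      "allow_leading_wildcard": false
    }
  }
}

// a query such as "*man" is now rejected instead of scanning every term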

Prefix matching

The original answer suggested prefix, but prefix queries are not analyzed, so here we use match_phrase_prefix instead:

GET my_idx/my_type/_search
{
  "query": {
    "match_phrase_prefix": {
      "name": {
        "query": "Doe",
        "max_expansions": 10
      }
    }
  }
}
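
max_expansions caps how many index terms the prefix is expanded to, keeping the query cheap. For comparison, a plain prefix query also works here, but only if you lowercase the prefix yourself, since prefix (like term) skips analysis:

GET my_idx/my_type/_search
{
  "query": {
    "prefix": {
      "name": "doe"
    }
  }
}

// matches both doeman and doewoman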

The nGram tokenizer

Create the index:

PUT my_idx
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}
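
One step the settings above leave implicit: the analyzer only takes effect on a field once its mapping references it. A minimal sketch, assuming the typed-mapping API that matches the my_idx/my_type URLs used in this post (the exact path varies by ES version):

PUT my_idx/_mapping/my_type
{
  "properties": {
    "name": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}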

Test the analyzer:

POST my_idx/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Doeman"
}

// response

{
  "tokens": [
    {
      "token": "Doe",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "oem",
      "start_offset": 1,
      "end_offset": 4,
      "type": "word",
      "position": 1
    },
    {
      "token": "ema",
      "start_offset": 2,
      "end_offset": 5,
      "type": "word",
      "position": 2
    },
    {
      "token": "man",
      "start_offset": 3,
      "end_offset": 6,
      "type": "word",
      "position": 3
    }
  ]
}

With this in place the search finds the documents. The asker did use ngram, but configured both min_gram and max_gram as 1.

The shorter the grams, the more documents match but the lower the quality of each match; the longer the grams, the more relevant the matches. A length of 3 (tri-grams) is the recommended starting point; the official documentation discusses this in detail.
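
Incidentally, min_gram and max_gram of 1 also explain the asker's other symptom: every document sharing even a single letter with the query matched, so every query returned everything. A common refinement (an assumption here, not part of the original answers) is to apply the ngram analyzer only at index time and keep standard at search time, so the query itself is not shredded into grams:

PUT my_idx/_mapping/my_type
{
  "properties": {
    "name": {
      "type": "text",
      "analyzer": "my_analyzer",
      "search_analyzer": "standard"
    }
  }
}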
