Elasticsearch Study Notes, Advanced Series (11): Multi-Field Search (Part 2)

Date: 2019-12-04

This post continues from the previous one: https://segmentfault.com/a/11...

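4. The most_fields query

most_fields is field-centric: it rewards documents in which the largest number of fields match the query.
Suppose we let users search for addresses, and the index holds the following two documents: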

PUT /test_index/_create/1
{
    "street":   "5 Poland Street",
    "city":     "Poland",
    "country":  "United W1V",
    "postcode": "W1V 3DG"
}

PUT /test_index/_create/2
{
    "street":   "5 Poland Street W1V",
    "city":     "London",
    "country":  "United Kingdom",
    "postcode": "3DG"
}

Query in the most_fields style, written out by hand as a bool query with one match clause per field:

GET /test_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "street": "Poland Street W1V"
          }
        },
        {
          "match": {
            "city": "Poland Street W1V"
          }
        },
        {
          "match": {
            "country": "Poland Street W1V"
          }
        },
        {
          "match": {
            "postcode": "Poland Street W1V"
          }
        }
      ]
    }
  }
}

Repeating the query string for every field quickly becomes verbose. multi_match simplifies it as follows:

GET /test_index/_search
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "most_fields", 
      "fields": ["street", "city", "country", "postcode"]
    }
  }
}

Result:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 2.3835402,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 2.3835402,
        "_source" : {
          "street" : "5 Poland Street",
          "city" : "Poland",
          "country" : "United W1V",
          "postcode" : "W1V 3DG"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.99938464,
        "_source" : {
          "street" : "5 Poland Street W1V",
          "city" : "London",
          "country" : "United Kingdom",
          "postcode" : "3DG"
        }
      }
    ]
  }
}

With best_fields, doc 2 ranks ahead of doc 1:

GET /test_index/_search
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "best_fields", 
      "fields": ["street", "city", "country", "postcode"]
    }
  }
}

Result:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.99938464,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.99938464,
        "_source" : {
          "street" : "5 Poland Street W1V",
          "city" : "London",
          "country" : "United Kingdom",
          "postcode" : "3DG"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.6931472,
        "_source" : {
          "street" : "5 Poland Street",
          "city" : "Poland",
          "country" : "United W1V",
          "postcode" : "W1V 3DG"
        }
      }
    ]
  }
}
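
The two result sets above can be read as the two ends of a scale. As the multi_match documentation describes it, best_fields by default keeps only the best field's score (a tie_breaker of 0.0), while a tie_breaker of 1.0 adds the scores of all matching fields together, much like most_fields. The request below is a sketch of a setting in between; the value 0.3 is an arbitrary choice for illustration and does not come from the original post.

```console
# Sketch only: the tie_breaker value 0.3 is illustrative, not a recommendation
GET /test_index/_search
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "best_fields",
      "tie_breaker": 0.3,
      "fields": ["street", "city", "country", "postcode"]
    }
  }
}
```

With a non-zero tie_breaker, doc 1 receives partial credit for its other matching fields. Whether that is enough to change the order depends on the value chosen.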

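Problems with most_fields

(1) It is designed to find the largest number of fields matching any of the words, rather than the best-matching words across all fields.
(2) It cannot use the operator or minimum_should_match parameters to trim the long tail of low-relevance results, because both are applied to each field separately rather than across fields.
(3) Term frequencies differ from field to field and can interfere with one another, producing a poor final ranking.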
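
Limitation (2) can be checked with the validate API. The request below is a sketch that adds operator to a most_fields query. Based on how the field-centric types are documented, the expectation is that the and requirement shows up inside each field's clause rather than across fields, so a document whose terms are spread over several fields (such as doc 1) would not satisfy it.

```console
# Sketch: inspect how most_fields rewrites the query when operator is "and"
GET /test_index/_validate/query?explain
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "most_fields",
      "operator": "and",
      "fields": ["street", "city", "country", "postcode"]
    }
  }
}
```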

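5. All-field search with the copy_to parameter

Having described the problems with most_fields, let's fix them. The first approach is the copy_to parameter.
copy_to lets us combine several fields into a single field.
Create the following index: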

DELETE /test_index
PUT /test_index
{
  "mappings": {
    "properties": {
      "street": {
        "type": "text",
        "copy_to": "full_address"
      },
      "city": {
        "type": "text",
        "copy_to": "full_address"
      },
      "country": {
        "type": "text",
        "copy_to": "full_address"
      },
      "postcode": {
        "type": "text",
        "copy_to": "full_address"
      },
      "full_address": {
        "type": "text"
      }
    }
  }
}

Insert the same data as before:

PUT /test_index/_create/1
{
    "street":   "5 Poland Street",
    "city":     "Poland",
    "country":  "United W1V",
    "postcode": "W1V 3DG"
}

PUT /test_index/_create/2
{
    "street":   "5 Poland Street W1V",
    "city":     "London",
    "country":  "United Kingdom",
    "postcode": "3DG"
}

Query:

GET /test_index/_search
{
  "query": {
    "match": {
      "full_address": "Poland Street W1V"
    }
  }
}

Result:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.68370587,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.68370587,
        "_source" : {
          "street" : "5 Poland Street",
          "city" : "Poland",
          "country" : "United W1V",
          "postcode" : "W1V 3DG"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.5469647,
        "_source" : {
          "street" : "5 Poland Street W1V",
          "city" : "London",
          "country" : "United Kingdom",
          "postcode" : "3DG"
        }
      }
    ]
  }
}

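Once the values are combined into the single field full_address, the match query sees one field with one set of term statistics, so the problems with most_fields no longer apply.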
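
Because full_address is an ordinary text field, the usual match options apply to the combined address as a whole. As a sketch, the request below requires every term to be present. Both sample documents should still match, since each contains poland, street and w1v somewhere in its address, even though in doc 1 those terms come from different source fields.

```console
# Sketch: operator now applies across the combined address
GET /test_index/_search
{
  "query": {
    "match": {
      "full_address": {
        "query": "Poland Street W1V",
        "operator": "and"
      }
    }
  }
}
```

Note that full_address is only indexed for searching. It does not appear in _source, as the result above shows.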

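6. The cross_fields query

The second way to solve the problems with most_fields is the cross_fields query.
If we can rely on _all (available only on older Elasticsearch versions) or define copy_to before indexing the documents, there is nothing more to do. However, Elasticsearch also provides a search-time solution: the cross_fields query. cross_fields takes a term-centric approach, which is very different from the field-centric approach of best_fields and most_fields. It treats all of the fields as one big field and looks for each term in any of them.
The difference between field-centric and term-centric is explained below.

Field-centric

Running this query: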

GET /test_index/_validate/query?explain
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "best_fields",
      "fields": ["street", "city", "country", "postcode"]
    }
  }
}

returns:

{
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "valid" : true,
  "explanations" : [
    {
      "index" : "test_index",
      "valid" : true,
      "explanation" : "((postcode:poland postcode:street postcode:w1v) | (country:poland country:street country:w1v) | (city:poland city:street city:w1v) | (street:poland street:street street:w1v))"
    }
  ]
}

((postcode:poland postcode:street postcode:w1v) |
(country:poland country:street country:w1v) |
(city:poland city:street city:w1v) |
(street:poland street:street street:w1v))
This is the rewritten query: each field is searched for the whole set of terms, and the best-scoring field wins.
Setting operator to and turns it into:
((+postcode:poland +postcode:street +postcode:w1v) |
(+country:poland +country:street +country:w1v) |
(+city:poland +city:street +city:w1v) |
(+street:poland +street:street +street:w1v))
This means all three terms must appear in the same field.
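
For reference, the and form above is what one would expect from adding operator to the earlier validate request, roughly as follows:

```console
# Sketch: the earlier best_fields validate request with operator set to "and"
GET /test_index/_validate/query?explain
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "best_fields",
      "operator": "and",
      "fields": ["street", "city", "country", "postcode"]
    }
  }
}
```

Against the sample data, only doc 2 has all three terms in one field (street), so a search with this query would be expected to leave doc 1 out.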

Term-centric

Running this query:

GET /test_index/_validate/query?explain
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "cross_fields", 
      "operator": "and", 
      "fields": ["street", "city", "country", "postcode"]
    }
  }
}

returns:

{
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "valid" : true,
  "explanations" : [
    {
      "index" : "test_index",
      "valid" : true,
      "explanation" : "+blended(terms:[postcode:poland, country:poland, city:poland, street:poland]) +blended(terms:[postcode:street, country:street, city:street, street:street]) +blended(terms:[postcode:w1v, country:w1v, city:w1v, street:w1v])"
    }
  ]
}

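+blended(terms:[postcode:poland, country:poland, city:poland, street:poland]) +blended(terms:[postcode:street, country:street, city:street, street:street]) +blended(terms:[postcode:w1v, country:w1v, city:w1v, street:w1v])
This is the rewritten query. In other words, every term must appear in at least one of the fields.
The cross_fields type first analyzes the query string into a list of terms, then searches for each term in any field. It addresses the term-frequency problem by blending the inverse document frequencies of the fields, which resolves the problems with most_fields.
Compared with copy_to, cross_fields has the advantage that individual fields can be boosted at query time.
Example: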

GET /test_index/_search
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "cross_fields", 
      "fields": ["street^2", "city", "country", "postcode"]
    }
  }
}

Here the street field has a boost of 2, and all other fields have a boost of 1.
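
The boost can be combined with the operator setting used in the earlier cross_fields validate request. The following sketch requires every term to appear in at least one field while still weighting street more heavily; the boost value 2 is simply carried over from the example above.

```console
# Sketch: term-centric matching with "and" plus a per-field boost
GET /test_index/_search
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "cross_fields",
      "operator": "and",
      "fields": ["street^2", "city", "country", "postcode"]
    }
  }
}
```

Both sample documents contain all three terms somewhere in their fields, so both should match; the boost only influences how they are scored.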