Elasticsearch in Practice (2): Search

Date: 2019-12-10

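This article uses Elasticsearch 6.2.4 as the example version.

The previous introductory article covered the basic operations of ES. Now we turn to the most powerful part of ES: full-text search.

Preparation

Bulk-importing data

First we need some sample data to import: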

wget https://raw.githubusercontent.com/elastic/elasticsearch/master/docs/src/test/resources/accounts.json

curl -H "Content-Type: application/json" -XPOST "localhost:9200/bank/account/_bulk" --data-binary "@accounts.json"

This imports 1,000 documents into ES.

Note: every line in accounts.json must end with \n. If you see the error The bulk request must be terminated by a newline [\n], check that the last line also ends with \n.
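
The trailing-newline requirement is easy to check before uploading. Below is a minimal Python sketch, assuming the bulk payload is held in memory as bytes. The helper name ensure_trailing_newline is made up for illustration and is not part of any Elasticsearch client.

```python
def ensure_trailing_newline(payload: bytes) -> bytes:
    """Return the payload unchanged if it already ends with a newline, else append one."""
    return payload if payload.endswith(b"\n") else payload + b"\n"


# Hypothetical two-line bulk body (action line + document line), missing the final "\n".
body = b'{"index":{"_id":"1"}}\n{"account_number":1,"balance":39225}'
fixed = ensure_trailing_newline(body)
print(fixed.endswith(b"\n"))
```

The same check can be applied to the contents of accounts.json before running the curl command above.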

The index is bank. We can list the indices that currently exist:

curl "localhost:9200/_cat/indices?format=json&pretty"

Result:

[
  {
    "health" : "yellow",
    "status" : "open",
    "index" : "bank",
    "uuid" : "MDxR02uESgKSynX6k8B-og",
    "pri" : "5",
    "rep" : "1",
    "docs.count" : "1000",
    "docs.deleted" : "0",
    "store.size" : "474.6kb",
    "pri.store.size" : "474.6kb"
  }
]

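Visualizing the data with Kibana

This section is optional; skip it if you are not interested.

It assumes you have already set up Elasticsearch + Kibana.

Open the Kibana web UI at http://127.0.0.1:5601, go to Management -> Kibana -> Index Patterns, and choose Create Index Pattern:

a. For Index pattern, enter: bank;

b. Click Create.

Then open Discover and select bank to see the data we just imported.

We can now search the data directly in the visual interface.

Pretty cool, isn't it?

Next, we use the API to run searches.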

Queries

URI search

A URI search executes a search request purely through the URI, by passing request parameters.

GET /bank/_search?q=Virginia&pretty
GET /bank/_search?q=firstname:Virginia

curl:

curl -XGET "localhost:9200/bank/_search?q=Virginia&pretty"
curl -XGET "localhost:9200/bank/_search?q=firstname:Virginia&pretty"

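Explanation: the first request searches all fields for the keyword "Virginia"; the second restricts the search to the firstname field. Sample result of the first request: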

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 4.631368,
    "hits": [
      {
        "_index": "bank",
        "_type": "account",
        "_id": "298",
        "_score": 4.631368,
        "_source": {
          "account_number": 298,
          "balance": 34334,
          "firstname": "Bullock",
          "lastname": "Marsh",
          "age": 20,
          "gender": "M",
          "address": "589 Virginia Place",
          "employer": "Renovize",
          "email": "bullockmarsh@renovize.com",
          "city": "Coinjock",
          "state": "UT"
        }
      },
      {
        "_index": "bank",
        "_type": "account",
        "_id": "25",
        "_score": 4.6146765,
        "_source": {
          "account_number": 25,
          "balance": 40540,
          "firstname": "Virginia",
          "lastname": "Ayala",
          "age": 39,
          "gender": "F",
          "address": "171 Putnam Avenue",
          "employer": "Filodyne",
          "email": "virginiaayala@filodyne.com",
          "city": "Nicholson",
          "state": "PA"
        }
      }
    ]
  }
}

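Meaning of the returned fields:

took – time Elasticsearch took to execute the search, in milliseconds
timed_out – whether the search timed out
_shards – how many shards were searched, with counts of successful and failed shards
hits – the search results, an object
hits.total – total number of documents matching the search criteria
hits.hits – the actual array of search results (the first 10 documents by default)
hits.sort - the sort key of each result (absent when sorting by score)
hits._score, max_score - ignore these fields for now

Parameters:

q The query string (maps to the query_string query).
df The default field to use when the query does not specify a field prefix.
analyzer The name of the analyzer to use when analyzing the query string.
sort Sorting. Can be fieldName, fieldName:asc or fieldName:desc. fieldName can be an actual field in the document or the special name _score, meaning sort by score. Several sort parameters may be given (order matters).
timeout Search timeout. Defaults to no timeout.
from The offset of the first hit to return. Defaults to 0.
size The number of hits to return. Defaults to 10.
default_operator The default operator to use, either AND or OR. Defaults to OR.

For details, see: https://www.elastic.co/guide/en/elasticsearch/reference/6.2/search-uri-request.html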
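
When URI searches are assembled in code, the parameter values need to be URL-encoded. Here is a small Python sketch using only the standard library. The function name build_uri_search and the host http://localhost:9200 are assumptions for illustration; the parameter names (q, sort, from, size) are the ones listed above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_uri_search(index, params, host="http://localhost:9200"):
    """Build a URI search URL; params maps parameter names (q, sort, from, size, ...) to values."""
    base = f"{host}/{index}/_search"
    return f"{base}?{urlencode(params)}" if params else base


url = build_uri_search(
    "bank",
    {"q": "firstname:Virginia", "sort": "account_number:asc", "from": 0, "size": 5},
)
print(url)
# Decode the query string again to confirm the parameters survived the encoding.
print(parse_qs(urlsplit(url).query))
```

Passing the parameters as a dict rather than keyword arguments sidesteps the fact that from is a reserved word in Python.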

Example:

GET /bank/_search?q=*&sort=account_number:asc&pretty

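Explanation: all results are sorted by the account_number field in ascending order. Only the first 10 are returned by default.

The following request bodies express the same searches as the URI requests above: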

GET /bank/_search
{
  "query": {
        "multi_match" : {
            "query" : "Virginia",
            "fields" : ["_all"]
        }
    }
}

GET /bank/_search
{
  "query": { "match_all": {} },
  "sort": [
    { "account_number": "asc" }
  ]
}

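We usually send queries as JSON. Elasticsearch provides a JSON-style domain-specific language for executing queries, known as the Query DSL.

Note: the queries above specify only the index, not the type, so ES does not distinguish between types. To restrict the search to one type, append the type to the URI, for example: GET /bank/account/_search.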
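
To send a Query DSL body from a program rather than from Kibana or curl, the body is serialized to JSON and sent with a Content-Type: application/json header, as in the curl examples. The sketch below only builds the request object and does not contact a server. The helper name build_search_request and the host are assumptions, and it uses POST, which the _search endpoint accepts in addition to GET.

```python
import json
from urllib.request import Request


def build_search_request(index, body, host="http://localhost:9200"):
    """Wrap a Query DSL body (a dict) in an HTTP request for the _search endpoint."""
    data = json.dumps(body).encode("utf-8")
    return Request(
        f"{host}/{index}/_search?pretty",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


body = {"query": {"match_all": {}}, "sort": [{"account_number": "asc"}]}
req = build_search_request("bank", body)
print(req.get_method(), req.full_url)
# With a node running locally, urllib.request.urlopen(req) would execute the search.
```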

match query

GET /bank/_search
{
    "query" : {
        "match" : { "address" : "Avenue" }
    }
}

curl:

curl -XGET -H "Content-Type: application/json" "localhost:9200/bank/_search?pretty" -d '{"query":{"match":{"address":"Avenue"}}}'

The query above returns documents whose address contains Avenue.

term query

GET /bank/_search
{
    "query" : {
        "term" : { "address" : "Avenue" }
    }
}

curl:

curl -XGET -H "Content-Type: application/json" "localhost:9200/bank/_search?pretty" -d '{"query":{"term":{"address":"Avenue"}}}'

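A term query looks up the exact term Avenue in the inverted index without analyzing it first. Because address is an analyzed text field whose terms are stored in lowercase, this particular query finds no documents; the last section of this article walks through why.

Note: if a field needs both analyzed (full-text) search and exact matching, it is best to set up the mapping correctly from the start. For example, add a .keyword sub-field to support exact matching: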

{
    "type": "text",
    "fields": {
        "keyword": {
            "type": "keyword",
            "ignore_above": 256
        }
    }
}

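This effectively gives us two fields, address and address.keyword. The mapping chapter later in this series covers this in detail.

Pagination (from/size)

Pagination uses the from and size keywords, which are the offset and the page size respectively.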

GET /bank/_search
{
  "query": { "match_all": {} },
  "from": 0,
  "size": 2
}

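from defaults to 0 and size defaults to 10.

Note: from/size paging in ES is known as shallow paging and is not suited to paging deep into a result set. from + size cannot exceed the index.max_result_window index setting, which defaults to 10,000. For more efficient deep scrolling, see the Scroll or Search After API.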
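
The relation between a page number and from/size, together with the 10,000 limit, can be sketched in a few lines of Python. The function name page_params is hypothetical, and the constant mirrors the default index.max_result_window mentioned above; a real index may be configured differently.

```python
MAX_RESULT_WINDOW = 10_000  # default value of index.max_result_window


def page_params(page, size=10, max_result_window=MAX_RESULT_WINDOW):
    """Translate a 1-based page number into from/size, rejecting windows that are too deep."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    offset = (page - 1) * size
    if offset + size > max_result_window:
        raise ValueError(
            f"from + size = {offset + size} exceeds max_result_window = {max_result_window}"
        )
    return {"from": offset, "size": size}


print(page_params(1, size=2))
print(page_params(3))
```

A request for page 1001 with size 10 would need from + size = 10,010, so the sketch raises an error, which is the point where Scroll or Search After becomes the better tool.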

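Sorting (sort)

The keyword for sorting by field is sort. It supports ascending (asc) and descending (desc) order. By default, results are sorted by _score in descending order.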

GET /bank/_search
{
  "query": { "match_all": {} },
  "sort": [
    { "account_number": "asc" }
  ],
  "from":0,
  "size":10
}

Sorting by multiple fields:

GET /bank/_search
{
  "query": { "match_all": {} },
  "sort": [
    { "account_number": "asc" },
    { "_score": "asc" }
  ],
  "from":0,
  "size":10
}

Results are sorted by account_number first, then by _score.

Sorting by script

Sorting based on a custom script is also possible. Here is an example:

GET bank/account/_search
{
    "query": { "range": { "age":  {"gt": 20} }},
    "sort" : {
        "_script" : {
            "type" : "number",
            "script" : {
                "lang": "painless",
                "source": "doc['account_number'].value * params.factor",
                "params" : {
                    "factor" : 1.1
                }
            },
            "order" : "asc"
        }
    }
}

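The query above sorts with a script: ascending by the value of account_number*1.1. lang specifies the scripting language, here painless. painless supports functions such as Math.log.

The example above only demonstrates usage; it has no practical meaning.

Filtering fields

By default, ES returns all fields of each document. This is referred to as the source (the _source field in the search hits). If we don't want the whole source returned, we can request only a few of its fields.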

GET /bank/_search
{
  "query": { "match_all": {} },
  "_source": ["account_number", "balance"]
}

Field filtering is done with the _source keyword.
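
The effect of source filtering on a single hit can be pictured with a small Python sketch. It is a simplified stand-in that handles only plain top-level field names (no wildcards or nested paths), and the function name filter_source is made up for illustration. The sample document is a shortened copy of account 25 from the results above.

```python
def filter_source(source, includes):
    """Keep only the listed top-level fields of a document's _source."""
    wanted = set(includes)
    return {field: value for field, value in source.items() if field in wanted}


source = {
    "account_number": 25,
    "balance": 40540,
    "firstname": "Virginia",
    "lastname": "Ayala",
    "age": 39,
}
print(filter_source(source, ["account_number", "balance"]))
```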

Returning script fields

Scripts can return newly defined fields that are computed on the fly. Example:

GET bank/account/_search
{
    "query" : {
        "match_all": {}
    },
    "size":2,
    "script_fields" : {
        "age2" : {
            "script" : {
                "lang": "painless",
                "source": "doc['age'].value * 2"
            }
        },
        "age3" : {
            "script" : {
                "lang": "painless",
                "source": "params['_source']['age'] * params.factor",
                "params" : {
                    "factor"  : 2.0
                }
            }
        }
    }
}

Result:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1000,
    "max_score": 1,
    "hits": [
      {
        "_index": "bank",
        "_type": "account",
        "_id": "25",
        "_score": 1,
        "fields": {
          "age3": [
            78
          ],
          "age2": [
            78
          ]
        }
      },
      {
        "_index": "bank",
        "_type": "account",
        "_id": "44",
        "_score": 1,
        "fields": {
          "age3": [
            74
          ],
          "age2": [
            74
          ]
        }
      }
    ]
  }
}

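Note: doc['my_field_name'].value is faster and more efficient than params['_source']['my_field_name'], and is the recommended form.

AND queries

How do we find results that satisfy both condition A and condition B? Combine the conditions with the must keyword.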

GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}


GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "account_number":136 } },
        { "match": { "address": "lane" } },
        { "match": { "city": "Urie" } }
      ]
    }
  }
}

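When both conditions target the same field, the first query above (address contains both mill and lane) can also be written as a single match with the and operator, which finds the same documents:

GET /bank/_search
{
  "query": {
    "match": {
      "address": { "query": "mill lane", "operator": "and" }
    }
  }
}

Do not add conditions by repeating the must key inside one bool object. JSON object keys must be unique and Elasticsearch 6.x rejects request bodies with duplicate keys, so all AND conditions belong in a single must array.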

OR queries

ES implements OR queries with the should keyword.

GET /bank/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "account_number":136 } },
        { "match": { "address": "lane" } },
        { "match": { "city": "Urie" } }
      ]
    }
  }
}

NOT queries

The must_not keyword implements queries whose results contain neither A nor B.

GET /bank/_search
{
  "query": {
    "bool": {
      "must_not": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}

This means the address field must contain neither mill nor lane.

Combining boolean queries

We can combine must, should and must_not to build complex queries.

A AND NOT B

GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "age": 40 } }
      ],
      "must_not": [
        { "match": { "state": "ID" } }
      ]
    }
  }
}

Equivalent SQL:

select * from bank where age=40 and state!= "ID";

A AND (B OR C)

GET /bank/_search
{
    "query":{
        "bool":{
            "must":[
                {"match":{"age":39}},
                {"bool":{"should":[
                            {"match":{"city":"Nicholson"}},
                            {"match":{"city":"Yardville"}}
                        ]}
                }
            ]
        }
    }
}

Equivalent SQL:

select * from bank where age=39 and (city="Nicholson" or city="Yardville");
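
Nested bool queries quickly become hard to read when written by hand. One possible approach is to compose them from small helper functions, as in this Python sketch. The helpers match and bool_query are hypothetical names, not part of an official client; they only build the same JSON structure as the A AND (B OR C) request above.

```python
import json


def match(field, value):
    """Build a single match clause."""
    return {"match": {field: value}}


def bool_query(must=None, should=None, must_not=None):
    """Build a bool query, including only the clause lists that were given."""
    clauses = {}
    if must:
        clauses["must"] = list(must)
    if should:
        clauses["should"] = list(should)
    if must_not:
        clauses["must_not"] = list(must_not)
    return {"bool": clauses}


# age = 39 AND (city = "Nicholson" OR city = "Yardville")
query = bool_query(
    must=[
        match("age", 39),
        bool_query(should=[match("city", "Nicholson"), match("city", "Yardville")]),
    ]
)
print(json.dumps({"query": query}, indent=2))
```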

Range queries

GET /bank/_search
{
  "query": {
    "bool": {
      "must": { "match_all": {} },
      "filter": {
        "range": {
          "balance": {
            "gte": 20000,
            "lte": 30000
          }
        }
      }
    }
  }
}

For a range query on a single field, the must and filter keywords can simply be omitted:

GET /bank/_search
{
    "query":{
        "range":{
            "balance":{
                "gte":20000,
                "lte":30000
            }
        }
    }
}

Equivalent SQL:

select * from bank where balance between 20000 and 30000;

Range query on multiple fields:

GET /bank/_search
{
  "query": {
    "bool": {
      "must": { "match_all": {} },
      "filter": {
        "bool":{
          "must":[
            {"range": {"balance": {"gte": 20000,"lte": 30000}}},
            {"range": {"age": {"gte": 30}}}
            ]
        }
      }
    }
  }
}
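
The semantics of the gte/lte bounds in the multi-field filter can be checked locally with a short Python sketch. The function in_range is an illustrative helper, not an Elasticsearch API. The first two sample accounts reuse values from the results in this article, and the third is invented.

```python
def in_range(value, gte=None, lte=None, gt=None, lt=None):
    """Return True when value satisfies every bound that was given."""
    if gte is not None and value < gte:
        return False
    if lte is not None and value > lte:
        return False
    if gt is not None and value <= gt:
        return False
    if lt is not None and value >= lt:
        return False
    return True


accounts = [
    {"account_number": 102, "balance": 29712, "age": 27},
    {"account_number": 25, "balance": 40540, "age": 39},
    {"account_number": 9001, "balance": 25000, "age": 35},  # invented sample
]

# balance between 20000 and 30000 (inclusive) AND age >= 30
selected = [
    a["account_number"]
    for a in accounts
    if in_range(a["balance"], gte=20000, lte=30000) and in_range(a["age"], gte=30)
]
print(selected)
```

Account 102 is excluded by the age bound and account 25 by the balance bound, so only the invented account remains.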

Highlighting results

ES can highlight the matched keywords in the returned results by wrapping them in HTML tags.

GET bank/account/_search
{
    "query" : {
        "match": { "address": "Avenue" }
    },
    "from": 0,
    "size": 1,
    "highlight" : {
        "require_field_match": false,
        "fields": {
                "*" : { }
        }
    }
}

Output:

{
  "took": 10,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 214,
    "max_score": 1.5814995,
    "hits": [
      {
        "_index": "bank",
        "_type": "account",
        "_id": "102",
        "_score": 1.5814995,
        "_source": {
          "account_number": 102,
          "balance": 29712,
          "firstname": "Dena",
          "lastname": "Olson",
          "age": 27,
          "gender": "F",
          "address": "759 Newkirk Avenue",
          "employer": "Hinway",
          "email": "denaolson@hinway.com",
          "city": "Choctaw",
          "state": "NJ"
        },
        "highlight": {
          "address": [
            "759 Newkirk <em>Avenue</em>"
          ]
        }
      }
    ]
  }
}

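The highlight section of the response holds the highlighted fragments, wrapped in <em> by default. To change the tags, use the pre_tags and post_tags settings: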

"fields": {
    "*" : { "pre_tags" : ["<strong>"], "post_tags" : ["</strong>"] }
}

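* means that all fields are highlighted. To highlight only a specific field, replace * with the field name.

require_field_match: by default, only fields that contain a query match are highlighted. Set require_field_match to false to highlight all fields. Defaults to true. For details, see: https://www.elastic.co/guide/en/elasticsearch/reference/6.2/search-request-highlighting.html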
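
To get a feel for what pre_tags and post_tags do, the following Python sketch wraps a matched word in configurable tags. It is a toy imitation based on a case-insensitive regular expression, not how the Elasticsearch highlighters actually work, and the function name highlight is chosen only for this example.

```python
import re


def highlight(text, term, pre_tag="<em>", post_tag="</em>"):
    """Wrap whole-word, case-insensitive occurrences of term in the given tags."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return pattern.sub(lambda m: pre_tag + m.group(0) + post_tag, text)


print(highlight("759 Newkirk Avenue", "avenue"))
print(highlight("759 Newkirk Avenue", "avenue", "<strong>", "</strong>"))
```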

Aggregations

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword"
      }
    }
  }
}

Result:

{
  "took": 29,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped" : 0,
    "failed": 0
  },
  "hits" : {
    "total" : 1000,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_state" : {
      "doc_count_error_upper_bound": 20,
      "sum_other_doc_count": 770,
      "buckets" : [ {
        "key" : "ID",
        "doc_count" : 27
      }, {
        "key" : "TX",
        "doc_count" : 27
      }, {
        "key" : "AL",
        "doc_count" : 25
      }, {
        "key" : "MD",
        "doc_count" : 25
      }, {
        "key" : "TN",
        "doc_count" : 23
      }, {
        "key" : "MA",
        "doc_count" : 21
      }, {
        "key" : "NC",
        "doc_count" : 21
      }, {
        "key" : "ND",
        "doc_count" : 21
      }, {
        "key" : "ME",
        "doc_count" : 20
      }, {
        "key" : "MO",
        "doc_count" : 20
      } ]
    }
  }
}

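The result shows that ID (Idaho) has 27 accounts and TX (Texas) has 27 accounts.

Equivalent SQL:

SELECT state, COUNT(*) FROM bank GROUP BY state ORDER BY COUNT(*) DESC LIMIT 10

The query groups documents by the state field and returns the top 10 buckets.

Setting size to 0 means no documents are returned, only the aggregation results. state.keyword is the exact-value (keyword) sub-field of state. Aggregating on the analyzed text field itself would require fielddata, which is expensive and disabled by default, so it is not supported out of the box.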
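
Conceptually, a terms aggregation counts documents per distinct value and keeps the largest buckets. The Python sketch below mimics that on an in-memory list. It ignores sharding (and therefore the doc_count_error_upper_bound seen in the real response), and both the function name terms_agg and the sample documents are made up for illustration.

```python
from collections import Counter


def terms_agg(docs, field, size=10):
    """Count documents per value of field; largest buckets first, ties broken by key."""
    counts = Counter(doc[field] for doc in docs)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered[:size]]


docs = [{"state": s} for s in ["ID", "TX", "ID", "AL", "TX", "ID"]]
print(terms_agg(docs, "state"))
print(terms_agg(docs, "state", size=2))
```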

Nested aggregations

We can run further aggregations on top of an aggregation, for example to compute sums, averages and so on.

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword"
      },
      "aggs": {
        "average_balance": {
          "avg": {
            "field": "balance"
          }
        }
      }
    }
  }
}

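The query above builds on the previous aggregation and computes the average account balance per state (again only for the top 10 states, ordered by document count descending).

We can nest aggregations arbitrarily to extract the statistics we need from the data.

Building on the previous aggregation, we now order the buckets by average balance, descending: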

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "order": {
          "average_balance": "desc"
        }
      },
      "aggs": {
        "average_balance": {
          "avg": {
            "field": "balance"
          }
        }
      }
    }
  }
}

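Here the buckets are ordered by the result of the nested aggregation, descending. The previous example relied on the implicit default order, which is by _count (the document count of each bucket), descending:

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "order": {
          "_count": "desc"
        }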
        }
      },
      "aggs": {
        "average_balance": {
          "avg": {
            "field": "balance"
          }
        }
      }
    }
  }
}
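
The difference between ordering buckets by document count and ordering them by a nested metric can be reproduced on a handful of documents. The following Python sketch is a simplified, single-node picture of the two requests above. The function name avg_by_group, its order argument, and the sample balances are all invented for the example.

```python
from collections import defaultdict


def avg_by_group(docs, group_field, value_field, order="count", size=10):
    """Group docs, average value_field per group, and order buckets by count or by average."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[group_field]].append(doc[value_field])
    buckets = [
        {"key": key, "doc_count": len(values), "average_balance": sum(values) / len(values)}
        for key, values in groups.items()
    ]
    if order == "avg":
        buckets.sort(key=lambda b: (-b["average_balance"], b["key"]))
    else:
        buckets.sort(key=lambda b: (-b["doc_count"], b["key"]))
    return buckets[:size]


docs = [
    {"state": "ID", "balance": 100},
    {"state": "ID", "balance": 300},
    {"state": "TX", "balance": 500},
]
print([b["key"] for b in avg_by_group(docs, "state", "balance", order="count")])
print([b["key"] for b in avg_by_group(docs, "state", "balance", order="avg")])
```

With this data, ID comes first when ordering by count (two documents), while TX comes first when ordering by average balance (500 versus 200).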

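This example shows how to group by age bracket (20-29, 30-39 and 40-49), then by gender, and finally get the average account balance for each gender within each age bracket: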

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_age": {
      "range": {
        "field": "age",
        "ranges": [
          {
            "from": 20,
            "to": 30
          },
          {
            "from": 30,
            "to": 40
          },
          {
            "from": 40,
            "to": 50
          }
        ]
      },
      "aggs": {
        "group_by_gender": {
          "terms": {
            "field": "gender.keyword"
          },
          "aggs": {
            "average_balance": {
              "avg": {
                "field": "balance"
              }
            }
          }
        }
      }
    }
  }
}

The result is more complex: the grouping is nested, so the result is nested too:

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1000,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "group_by_age": {
      "buckets": [
        {
          "key": "20.0-30.0",
          "from": 20,
          "to": 30,
          "doc_count": 451,
          "group_by_gender": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "M",
                "doc_count": 232,
                "average_balance": {
                  "value": 27374.05172413793
                }
              },
              {
                "key": "F",
                "doc_count": 219,
                "average_balance": {
                  "value": 25341.260273972603
                }
              }
            ]
          }
        },
        {
          "key": "30.0-40.0",
          "from": 30,
          "to": 40,
          "doc_count": 504,
          "group_by_gender": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "F",
                "doc_count": 253,
                "average_balance": {
                  "value": 25670.869565217392
                }
              },
              {
                "key": "M",
                "doc_count": 251,
                "average_balance": {
                  "value": 24288.239043824702
                }
              }
            ]
          }
        },
        {
          "key": "40.0-50.0",
          "from": 40,
          "to": 50,
          "doc_count": 45,
          "group_by_gender": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "M",
                "doc_count": 24,
                "average_balance": {
                  "value": 26474.958333333332
                }
              },
              {
                "key": "F",
                "doc_count": 21,
                "average_balance": {
                  "value": 27992.571428571428
                }
              }
            ]
          }
        }
      ]
    }
  }
}
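
The nesting in this response follows directly from the nesting of the request: range buckets first, gender buckets inside them, and an average inside each gender bucket. A rough Python equivalent is sketched below. Each range includes its from value and excludes its to value, which is why 20 to 30 covers ages 20-29. The function name range_then_gender_avg and the four sample accounts are invented for illustration.

```python
def range_then_gender_avg(docs, ranges):
    """Bucket docs by age range [low, high), then by gender, averaging balance per gender."""
    result = []
    for low, high in ranges:
        in_bucket = [d for d in docs if low <= d["age"] < high]
        balances = {}
        for d in in_bucket:
            balances.setdefault(d["gender"], []).append(d["balance"])
        result.append({
            "key": f"{float(low)}-{float(high)}",
            "doc_count": len(in_bucket),
            "group_by_gender": {g: sum(v) / len(v) for g, v in balances.items()},
        })
    return result


docs = [
    {"age": 20, "gender": "M", "balance": 1000},
    {"age": 29, "gender": "M", "balance": 3000},
    {"age": 30, "gender": "F", "balance": 5000},
    {"age": 40, "gender": "F", "balance": 7000},
]
for bucket in range_then_gender_avg(docs, [(20, 30), (30, 40), (40, 50)]):
    print(bucket)
```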

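term vs. match queries

First, consider how the following examples differ:

Given: ES contains 1 document whose address is 171 Putnam Avenue and 0 documents whose address is Putnam. The index is bank, the type is account, and the document ID is 25.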

GET /bank/_search
{
  "query": {
        "match" : {
            "address" : "Putnam"
        }
    }
}

GET /bank/_search
{
  "query": {
        "match" : {
            "address.keyword" : "Putnam"
        }
    }
}

GET /bank/_search
{
  "query": {
        "term" : {
            "address" : "Putnam"
        }
    }
}

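Results:
1. The first query matches the document, because the query text is analyzed.
2. The second query matches nothing, because without analysis no document has exactly this value.
3. The result of the third query is not obvious; it depends on how the field was actually tokenized.

The following request shows how the address field of this document was tokenized: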

GET /bank/account/25/_termvectors?fields=address

Result:

{
  "_index": "bank",
  "_type": "account",
  "_id": "25",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "address": {
      "field_statistics": {
        "sum_doc_freq": 591,
        "doc_count": 197,
        "sum_ttf": 591
      },
      "terms": {
        "171": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 3
            }
          ]
        },
        "avenue": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 2,
              "start_offset": 11,
              "end_offset": 17
            }
          ]
        },
        "putnam": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 4,
              "end_offset": 10
            }
          ]
        }
      }
    }
  }
}

We can see that the address field of this document was split into 3 terms:

171
avenue
putnam

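Now we can answer the third query: it matches nothing! But change the value to lowercase putnam and it matches.

The reason:

term query looks up the exact term in the inverted index
match query analyzes the query text first, then searches

Because Putnam is not among the indexed terms (terms are case-sensitive), the term query finds nothing. The match query first analyzes the query text, which yields putnam, and then matches that against the terms in the inverted index, so it finds the document.

By default, the standard analyzer converts all uppercase letters to lowercase.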
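
The whole term versus match behaviour can be reproduced with a toy inverted index. In the Python sketch below, analyze is a very rough stand-in for the standard analyzer (split on non-alphanumeric characters, then lowercase), and term_query and match_query are illustrative functions, not Elasticsearch APIs.

```python
import re


def analyze(text):
    """Rough stand-in for the standard analyzer: split into words and lowercase them."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def build_index(docs):
    """Build an inverted index: term -> set of document IDs containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def term_query(index, value):
    """Look up the value as-is, with no analysis."""
    return set(index.get(value, set()))


def match_query(index, text):
    """Analyze the query text, then return documents containing any resulting term."""
    hits = set()
    for term in analyze(text):
        hits |= index.get(term, set())
    return hits


index = build_index({"25": "171 Putnam Avenue"})
print(sorted(index))
print(term_query(index, "Putnam"), term_query(index, "putnam"), match_query(index, "Putnam"))
```

The term lookup for Putnam comes back empty because only the lowercase putnam exists in the index, while the match lookup lowercases the query text first and therefore finds document 25.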

References

1. Getting Started | Elasticsearch Reference [6.2] | Elastic https://www.elastic.co/guide/en/elasticsearch/reference/6.2/getting-started.html
2. Elasticsearch 5.x: understanding term query and match query - wangchuanfu - cnblogs https://www.cnblogs.com/wangchuanfu/p/7444253.html
