Elasticsearch如何實現SQL語句中 Group By 和 Limit 的功能

時間 2019-11-07

標籤 elasticsearch 如何實現 sql 語句 group limit 功能欄目日誌分析简体版

原文原文鏈接

有 SQL 背景的同窗在學習 Elasticsearch 時，面對一個查詢需求，不禁自主地會先思考如何用 SQL 來實現，而後再去想 Elasticsearch 的 Query DSL 如何實現。那麼本篇就給你們講一條常見的 SQL 語句如何用 Elasticsearch 的查詢語言實現。mysql

1. SQL語句

假設咱們有一個汽車的數據集，每一個汽車都有車型、顏色等字段，我但願獲取顏色種類大於1個的前2車型。假設汽車的數據模型以下：sql

{
    "model":"modelA",
    "color":"red"
}

假設咱們有一個 cars 表，經過以下語句建立測試數據。json

INSERT INTO cars (model,color) VALUES ('A','red'); 
INSERT INTO cars (model,color) VALUES ('A','white'); 
INSERT INTO cars (model,color) VALUES ('A','black'); 
INSERT INTO cars (model,color) VALUES ('A','yellow'); 
INSERT INTO cars (model,color) VALUES ('B','red'); 
INSERT INTO cars (model,color) VALUES ('B','white'); 
INSERT INTO cars (model,color) VALUES ('C','black'); 
INSERT INTO cars (model,color) VALUES ('C','red'); 
INSERT INTO cars (model,color) VALUES ('C','white'); 
INSERT INTO cars (model,color) VALUES ('C','yellow'); 
INSERT INTO cars (model,color) VALUES ('C','blue'); 
INSERT INTO cars (model,color) VALUES ('D','red');
INSERT INTO cars (model,color) VALUES ('A','red');

那麼實現咱們需求的 SQL 語句也比較簡單，實現以下：elasticsearch

SELECT model,COUNT(DISTINCT color) color_count FROM cars GROUP BY model HAVING color_count > 1 ORDER BY color_count desc LIMIT 2;

這條查詢語句中 Group By 是按照 model 作分組， Having color_count>1 限定了車型顏色種類大於1，ORDER BY color_count desc 限定結果按照顏色種類倒序排列，而 LIMIT 2 限定只返回前3條數據。學習

那麼在 Elasticsearch 中如何實現這個需求呢？測試

2. 在 Elasticsearch 模擬測試數據

首先咱們須要先在 elasticsearch 中插入測試的數據，這裏咱們使用 bulk 接口 ，以下所示：code

POST _bulk
{"index":{"_index":"cars","_type":"doc","_id":"1"}}
{"model":"A","color":"red"}
{"index":{"_index":"cars","_type":"doc","_id":"2"}}
{"model":"A","color":"white"}
{"index":{"_index":"cars","_type":"doc","_id":"3"}}
{"model":"A","color":"black"}
{"index":{"_index":"cars","_type":"doc","_id":"4"}}
{"model":"A","color":"yellow"}
{"index":{"_index":"cars","_type":"doc","_id":"5"}}
{"model":"B","color":"red"}
{"index":{"_index":"cars","_type":"doc","_id":"6"}}
{"model":"B","color":"white"}
{"index":{"_index":"cars","_type":"doc","_id":"7"}}
{"model":"C","color":"black"}
{"index":{"_index":"cars","_type":"doc","_id":"8"}}
{"model":"C","color":"red"}
{"index":{"_index":"cars","_type":"doc","_id":"9"}}
{"model":"C","color":"white"}
{"index":{"_index":"cars","_type":"doc","_id":"10"}}
{"model":"C","color":"yellow"}
{"index":{"_index":"cars","_type":"doc","_id":"11"}}
{"model":"C","color":"blue"}
{"index":{"_index":"cars","_type":"doc","_id":"12"}}
{"model":"D","color":"red"}
{"index":{"_index":"cars","_type":"doc","_id":"13"}}
{"model":"A","color":"red"}

其中 index 爲 cars，type 爲 doc，全部數據與mysql 數據保持一致。你們能夠在 Kibana 的 Dev Tools 中執行上面的命令，而後執行下面的查詢語句驗證數據是否已經成功存入。排序

GET cars/_search

3. Group By VS Terms/Metric Aggregation

SQL 中 Group By 語句在 Elasticsearch 中對應的是 Terms Aggregation，即分桶聚合，對應 Group By color 的語句以下所示：接口

GET cars/_search
{
  "size":0,
  "aggs":{
    "models":{
      "terms":{
        "field":"model.keyword"
      }
    }
  }
}

結果以下：ip

{
  "took": 161,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 13,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "models": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "A",
          "doc_count": 5
        },
        {
          "key": "C",
          "doc_count": 5
        },
        {
          "key": "B",
          "doc_count": 2
        },
        {
          "key": "D",
          "doc_count": 1
        }
      ]
    }
  }
}

咱們看 aggregations 這個 key 下面的即爲返回結果。

SQL 語句中還有一項是 COUNT(DISTINCT color) color_count 用於計算每一個 model 的顏色數，在 Elasticsearch 中咱們須要使用一個指標類聚合 Cardinality ，進行不一樣值計數。語句以下：

GET cars/_search
{
  "size": 0,
  "aggs": {
    "models": {
      "terms": {
        "field": "model.keyword"
      },
      "aggs": {
        "color_count": {
          "cardinality": {
            "field": "color.keyword"
          }
        }
      }
    }
  }
}

其返回結果以下：

{
  "took": 74,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 13,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "models": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "A",
          "doc_count": 5,
          "color_count": {
            "value": 4
          }
        },
        {
          "key": "C",
          "doc_count": 5,
          "color_count": {
            "value": 5
          }
        },
        {
          "key": "B",
          "doc_count": 2,
          "color_count": {
            "value": 2
          }
        },
        {
          "key": "D",
          "doc_count": 1,
          "color_count": {
            "value": 1
          }
        }
      ]
    }
  }
}

結果中 color_count 即爲每一個 model 的顏色數，但這裏全部的模型都返回了，咱們只想要顏色數大於1的模型，所以這裏還要加一個過濾條件。

4. Having Condition VS Bucket Filter Aggregation

Having color_count > 1 在 Elasticsearch 中對應的是 Bucket Filter 聚合，語句以下所示：

GET cars/_search
{
  "size": 0,
  "aggs": {
    "models": {
      "terms": {
        "field": "model.keyword"
      },
      "aggs": {
        "color_count": {
          "cardinality": {
            "field": "color.keyword"
          }
        },
        "color_count_filter": {
          "bucket_selector": {
            "buckets_path": {
              "colorCount": "color_count"
            },
            "script": "params.colorCount>1"
          }
        }
      }
    }
  }
}

返回結果以下：

{
  "took": 39,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 13,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "models": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "A",
          "doc_count": 5,
          "color_count": {
            "value": 4
          }
        },
        {
          "key": "C",
          "doc_count": 5,
          "color_count": {
            "value": 5
          }
        },
        {
          "key": "B",
          "doc_count": 2,
          "color_count": {
            "value": 2
          }
        }
      ]
    }
  }
}

此時返回結果只包含顏色數大於1的模型，但你們會發現顏色數多的 C 不是在第一個位置，咱們還須要作排序處理。

5. Order By Limit VS Bucket Sort Aggregation

ORDER BY color_count desc LIMIT 3 在 Elasticsearch 中可使用 Bucket Sort 聚合實現，語句以下所示：

GET cars/_search
{
  "size": 0,
  "aggs": {
    "models": {
      "terms": {
        "field": "model.keyword"
      },
      "aggs": {
        "color_count": {
          "cardinality": {
            "field": "color.keyword"
          }
        },
        "color_count_filter": {
          "bucket_selector": {
            "buckets_path": {
              "colorCount": "color_count"
            },
            "script": "params.colorCount>1"
          }
        },
        "color_count_sort": {
          "bucket_sort": {
            "sort": {
              "color_count": "desc"
            },
            "size": 2
          }
        }
      }
    }
  }
}

返回結果以下：

{
  "took": 32,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 13,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "models": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "C",
          "doc_count": 5,
          "color_count": {
            "value": 5
          }
        },
        {
          "key": "A",
          "doc_count": 5,
          "color_count": {
            "value": 4
          }
        }
      ]
    }
  }
}

至此咱們便將 SQL 語句實現的功能用 Elasticsearch 查詢語句實現了。對比 SQL 語句與 Elasticsearch 的查詢語句，你們會發現後者複雜了不少，但並不是無章可循，隨着你們對常見語法愈來愈熟悉，相信必定會越寫越駕輕就熟！