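Aggregations compute statistics over the result set of a query. Using watch-log analysis as an example, this article walks through the most commonly used Elasticsearch aggregations.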
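First, here is the structure of the documents we will analyze:
{
  "video_id": 1289643545120062253,  // video id
  "video_uid": 3931482202390368051, // id of the user who published the video
  "uid": 47381776787453866,         // id of the user who watched the video
  "time": 1533891263224,            // time the event occurred
  "watch_duration": 30              // watch duration
}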
Each document records one watch event; we use aggregations to analyze users' viewing behavior.
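Elasticsearch introduces two related concepts: buckets, which are sets of documents that satisfy some criterion, and metrics, which are statistics computed over the documents in a bucket.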
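As a rough in-memory analogy (not how Elasticsearch executes aggregations), buckets resemble groups of documents that share a key, and metrics resemble statistics computed per group. The Python sketch below uses made-up documents and standard-library code only; nothing in it is an Elasticsearch API.

```python
from collections import defaultdict
from statistics import mean

# Made-up watch events, for illustration only.
docs = [
    {"uid": 1, "watch_duration": 30},
    {"uid": 1, "watch_duration": 50},
    {"uid": 2, "watch_duration": 10},
]

# Buckets: group the documents by uid.
buckets = defaultdict(list)
for doc in docs:
    buckets[doc["uid"]].append(doc)

# Metrics: statistics computed over each bucket.
metrics = {
    uid: {
        "doc_count": len(group),
        "avg_duration": mean(d["watch_duration"] for d in group),
    }
    for uid, group in buckets.items()
}
print(metrics)
```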
We start by counting how many videos each user watched within a time range. In SQL:
SELECT uid, count(*) as view_count FROM view_log WHERE time >= #{since} AND time <= #{to} GROUP BY uid;
The equivalent Elasticsearch query:
GET /view_log/_search
{
  "size": 0,
  "query": {
    "range": {
      "time": {
        "gte": 0, // since
        "lte": 0  // to
      }
    }
  },
  "aggs": {
    "agg": {       // "agg" is the name of the aggregation
      "terms": {   // group documents that share the same uid
        "field": "uid"
      }
    }
  }
}
response:
{
  "took": 10,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
  "hits": { "total": 100000, "max_score": 0, "hits": [] },
  "aggregations": {
    "agg": {
      "buckets": [
        { "key": 21836334489858688, "doc_count": 4026 },
        { "key": 31489302390368051, "doc_count": 2717 }
      ]
    }
  }
}
The result.aggregations.agg.buckets list contains the query results.
Because we aggregate with terms on uid, each bucket is the set of documents that share a uid, and the key field holds that uid.
The doc_count field is the number of documents in the bucket, i.e. count(*) as view_count in the SQL statement.
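On the client side, the buckets can be flattened into a uid-to-count mapping. The following is a minimal Python sketch. It assumes the response has already been decoded into a dict (the `response` literal copies only the relevant part of the sample above), and `view_counts` is a hypothetical helper name, not part of any Elasticsearch client.

```python
from typing import Any, Dict


def view_counts(response: Dict[str, Any], agg_name: str = "agg") -> Dict[int, int]:
    """Map each bucket key (uid) to its doc_count (view_count)."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Relevant part of the sample response shown above.
response = {
    "aggregations": {
        "agg": {
            "buckets": [
                {"key": 21836334489858688, "doc_count": 4026},
                {"key": 31489302390368051, "doc_count": 2717},
            ]
        }
    }
}

counts = view_counts(response)
print(counts)
```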
We can add further metrics to the query. In SQL:
SELECT uid, count(*) as view_count, avg(watch_duration) as avg_duration FROM view_log WHERE time >= #{since} AND time <= #{to} GROUP BY uid;
The equivalent Elasticsearch query:
GET /view_log/_search
{
  "size": 0,
  "query": {
    "range": {
      "time": {
        "gte": 0, // since
        "lte": 0  // to
      }
    }
  },
  "aggs": {
    "agg": {       // "agg" is the name of the aggregation
      "terms": {   // group documents that share the same uid
        "field": "uid"
      },
      "aggs": {    // add metrics
        "avg_duration": {
          "avg": { // average of watch_duration
            "field": "watch_duration"
          }
        }
      }
    }
  }
}
response:
{
  "took": 10,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
  "hits": { "total": 100000, "max_score": 0, "hits": [] },
  "aggregations": {
    "agg": {
      "buckets": [
        {
          "key": 21836334489858688,
          "doc_count": 4026,
          "avg_duration": { "value": 12778.882352941177 }
        },
        {
          "key": 31489302390368051,
          "doc_count": 2717,
          "avg_duration": { "value": 2652.5714285714284 }
        }
      ]
    }
  }
}
avg_duration.value is the average of watch_duration, i.e. the user's average watch duration.
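Each bucket now carries both the document count and the nested metric. A small Python sketch of reading them follows; `UserStats` and `user_stats` are hypothetical names introduced for illustration, and the `response` literal copies the relevant part of the sample above.

```python
from typing import Any, Dict, List, NamedTuple


class UserStats(NamedTuple):
    uid: int
    view_count: int
    avg_duration: float


def user_stats(
    response: Dict[str, Any],
    agg_name: str = "agg",
    metric_name: str = "avg_duration",
) -> List[UserStats]:
    """Turn each bucket into a (uid, view_count, avg_duration) row."""
    rows = []
    for bucket in response["aggregations"][agg_name]["buckets"]:
        rows.append(
            UserStats(
                uid=bucket["key"],
                view_count=bucket["doc_count"],
                # The metric is nested under the name given in the query.
                avg_duration=bucket[metric_name]["value"],
            )
        )
    return rows


response = {
    "aggregations": {
        "agg": {
            "buckets": [
                {
                    "key": 21836334489858688,
                    "doc_count": 4026,
                    "avg_duration": {"value": 12778.882352941177},
                },
                {
                    "key": 31489302390368051,
                    "doc_count": 2717,
                    "avg_duration": {"value": 2652.5714285714284},
                },
            ]
        }
    }
}

stats = user_stats(response)
for row in stats:
    print(row.uid, row.view_count, row.avg_duration)
```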
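In practice the number of users can be very large, so a single query cannot return every bucket. Instead we split the terms into partitions and fetch them one partition at a time:
GET /view_log/_search
{
  "size": 0,
  "query": {
    "range": {
      "time": {
        "gte": 0, // since
        "lte": 0  // to
      }
    }
  },
  "aggs": {
    "agg": {
      "terms": {
        "field": "uid",
        "size": 10000,  // maximum number of buckets
        "include": {    // split the results into 10 pages numbered [0, 9] and fetch the first
          "partition": 0,
          "num_partitions": 10
        }
      },
      "aggs": {
        "avg_duration": {
          "avg": {
            "field": "watch_duration"
          }
        }
      }
    }
  }
}
This query is almost identical to the previous one; the only change is the include field added under aggs.agg.terms, which selects the page.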
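To walk through every page, the same request body can be generated once per partition. The sketch below only builds the bodies; sending them (for example with an HTTP client) is left out. The function name `partitioned_bodies` and its parameters are assumptions made for this example.

```python
from typing import Any, Dict, Iterator


def partitioned_bodies(
    since: int,
    to: int,
    num_partitions: int = 10,
    bucket_size: int = 10000,
) -> Iterator[Dict[str, Any]]:
    """Yield one search body per partition, numbered 0 .. num_partitions - 1."""
    for partition in range(num_partitions):
        yield {
            "size": 0,
            "query": {"range": {"time": {"gte": since, "lte": to}}},
            "aggs": {
                "agg": {
                    "terms": {
                        "field": "uid",
                        "size": bucket_size,
                        "include": {
                            "partition": partition,
                            "num_partitions": num_partitions,
                        },
                    },
                    "aggs": {
                        "avg_duration": {"avg": {"field": "watch_duration"}}
                    },
                }
            },
        }


bodies = list(partitioned_bodies(since=0, to=1533891263224))
print(len(bodies))
```

Each body would be posted to /view_log/_search as a separate request, and the buckets from all responses combined on the client.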
UV (unique visitors) is the number of distinct users who watched a video. PV (page views), by contrast, is the number of views without de-duplicating by user.
In SQL:
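The difference between the two counts is easy to see on a handful of events. The following Python sketch uses made-up events and a hypothetical helper `pv_uv`; it runs entirely in memory and does not involve Elasticsearch.

```python
from typing import Iterable, Mapping, Tuple


def pv_uv(events: Iterable[Mapping[str, int]]) -> Tuple[int, int]:
    """Return (pv, uv): total events and number of distinct uids."""
    pv = 0
    users = set()
    for event in events:
        pv += 1                  # every event counts towards pv
        users.add(event["uid"])  # each uid counts once towards uv
    return pv, len(users)


# Made-up events: user 100 watched the video twice, user 200 once.
events = [
    {"video_id": 1, "uid": 100},
    {"video_id": 1, "uid": 100},
    {"video_id": 1, "uid": 200},
]

pv, uv = pv_uv(events)
print(pv, uv)  # prints: 3 2
```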
SELECT video_id, count(*) as pv, count(distinct uid) as uv FROM view_log WHERE video_id = #{video_id};
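Elasticsearch handles count(distinct) queries with the cardinality aggregation, which returns an approximate distinct count:
GET /view_log/_search
{
  "size": 0,
  "query": {
    "term": {
      "video_id": 0 // video_id
    }
  },
  "aggs": {
    "uv": {
      "cardinality": {
        "field": "uid"
      }
    }
  }
}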
response:
{
  "took": 255,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
  "hits": { "total": 17579, "max_score": 0, "hits": [] },
  "aggregations": {
    "uv": { "value": 11 }
  }
}
Elasticsearch can also compute count(distinct) for many groups in one query. First, in SQL:
SELECT video_id, count(*) as pv, count(distinct uid) as uv FROM view_log GROUP BY video_id;
Query:
GET /view_log/_search
{
  "size": 0,
  "aggs": {
    "video": {
      "terms": {
        "field": "video_id"
      },
      "aggs": {
        "uv": {
          "cardinality": {
            "field": "uid"
          }
        }
      }
    }
  }
}
response:
{
  "took": 313,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
  "hits": { "total": 16940, "max_score": 0, "hits": [] },
  "aggregations": {
    "video": {
      "buckets": [
        {
          "key": 25417499722062, // video id
          "doc_count": 427,      // number of views (pv)
          "uv": {
            "value": 124         // number of users who watched the video (uv)
          }
        },
        {
          "key": 72446898144,
          "doc_count": 744,
          "uv": {
            "value": 233
          }
        }
      ]
    }
  }
}
SQL can filter on aggregated results with a HAVING clause. Elasticsearch achieves the same effect with pipeline aggregations, although the syntax is more verbose.
The SQL query for videos watched more than 200 times:
SELECT video_id, count(*) as view_count FROM view_log GROUP BY video_id HAVING count(*) > 200;
GET /view_log/_search
{
  "size": 0,
  "aggs": {
    "view_count": {
      "terms": {
        "field": "video_id"
      },
      "aggs": {
        "having": {
          "bucket_selector": {
            "buckets_path": { // filter on the doc_count of the view_count aggregation
              "view_count": "_count"
            },
            "script": {
              "source": "params.view_count > 200"
            }
          }
        }
      }
    }
  }
}
response:
{
  "took": 83,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
  "hits": { "total": 775, "max_score": 0, "hits": [] },
  "aggregations": {
    "view_count": {
      "buckets": [
        { "key": 35025417499764062, "doc_count": 529 },
        { "key": 19913672446898144, "doc_count": 759 }
      ]
    }
  }
}
The key to HAVING-style queries in Elasticsearch is bucket_selector, which keeps or drops each bucket based on its aggregated values.
Next, let us find the videos whose average watch duration exceeds 5 minutes. In SQL:
SELECT video_id FROM view_log GROUP BY video_id HAVING avg(watch_duration) > 300;
GET /view_log/_search
{
  "size": 0,
  "aggs": {
    "video": {
      "terms": {
        "field": "video_id"
      },
      "aggs": {
        "avg_duration": {
          "avg": {
            "field": "watch_duration"
          }
        },
        "avg_duration_filter": {
          "bucket_selector": {
            "buckets_path": {
              "avg_duration": "avg_duration"
            },
            "script": {
              "source": "params.avg_duration > 300"
            }
          }
        }
      }
    }
  }
}
response:
{
  "took": 137,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
  "hits": { "total": 255, "max_score": 0, "hits": [] },
  "aggregations": {
    "video": {
      "buckets": [
        {
          "key": 5417499764062,
          "doc_count": 91576,
          "avg_duration": { "value": 103 }
        },
        {
          "key": 19913672446898144,
          "doc_count": 15771,
          "avg_duration": { "value": 197 }
        }
      ]
    }
  }
}
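Both HAVING-style queries share one shape: a terms aggregation, optional metrics, and a bucket_selector that references them through buckets_path. The Python sketch below captures that shape in a hypothetical helper, `having_body`. The helper name, its parameters, and the default aggregation name "grouped" are assumptions made for illustration, and the 300-second threshold is taken from the SQL statement above.

```python
from typing import Any, Dict, Optional


def having_body(
    group_field: str,
    buckets_path: Dict[str, str],
    script: str,
    metrics: Optional[Dict[str, Any]] = None,
    agg_name: str = "grouped",
) -> Dict[str, Any]:
    """Build a terms aggregation whose buckets are filtered by bucket_selector."""
    # Copy the metrics so the caller's dict is not modified.
    sub_aggs: Dict[str, Any] = dict(metrics or {})
    sub_aggs["having"] = {
        "bucket_selector": {
            "buckets_path": buckets_path,
            "script": {"source": script},
        }
    }
    return {
        "size": 0,
        "aggs": {agg_name: {"terms": {"field": group_field}, "aggs": sub_aggs}},
    }


# HAVING count(*) > 200
by_count = having_body(
    "video_id", {"view_count": "_count"}, "params.view_count > 200"
)

# HAVING avg(watch_duration) > 300
by_avg = having_body(
    "video_id",
    {"avg_duration": "avg_duration"},
    "params.avg_duration > 300",
    metrics={"avg_duration": {"avg": {"field": "watch_duration"}}},
)
print(by_count)
print(by_avg)
```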