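With data ingestion covered in the previous section, this section explains how to query and aggregate the ingested data in Druid.
Druid queries are issued as REST-style HTTP requests to the queryable nodes (Broker, Historical, Realtime). These nodes expose a REST query endpoint, and the client POSTs a JSON object to it. Queries are normally sent to the Broker node. A POST request from a Linux shell looks like this: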
- curl -X POST '<queryable_host>:<port>/druid/v2/?pretty' -H 'Content-Type:application/json' -d @<query_json_file>
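The same request can be sketched in Python with only the standard library. This is a minimal illustration, not an official client: the host, the port 8082, and the sample query are assumptions made for the example, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_query_request(host, port, query):
    """Build (but do not send) the POST request for a Druid JSON query."""
    url = f"http://{host}:{port}/druid/v2/?pretty"
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical Broker address and query, for illustration only.
request = build_query_request(
    "localhost", 8082,
    {"queryType": "timeBoundary", "dataSource": "sample_datasource"},
)

# To run it against a live Broker:
# with urllib.request.urlopen(request) as response:
#     results = json.load(response)
```

Building the request separately from sending it makes the URL, headers, and body easy to inspect before anything goes over the network.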
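2. Druid query types
Druid offers a number of query types for different scenarios. Each query is a JSON object, and each query type is configured through its JSON properties. The query types fall into three broad categories:
1. Aggregation queries - Timeseries, TopN, GroupBy
2. Metadata queries - Time Boundary, Segment Metadata, Datasource Metadata
3. Search queries - Search
This section focuses on aggregation queries; the other query types are simpler and less commonly used, so they are not covered here. A brief note on choosing among the three aggregation query types:
Where possible, prefer Timeseries and TopN queries over GroupBy. GroupBy is the most flexible Druid query but also the slowest. Timeseries queries are significantly faster than GroupBy because the aggregation does not need to group over dimensions. For grouping and sorting over a single dimension, TopN is more efficient than GroupBy.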
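The rule of thumb above can be written down as a small helper. This is only a sketch of the guidance in this section; the function name and its parameters are hypothetical and not part of Druid.

```python
def suggest_query_type(dimensions, ordered_by_metric=False):
    """Suggest an aggregation query type from the grouping requirements."""
    if not dimensions:
        # No grouping dimensions: a plain time series is enough.
        return "timeseries"
    if len(dimensions) == 1 and ordered_by_metric:
        # One dimension, ranked by a metric: the TopN case.
        return "topN"
    # Everything else needs the flexible (and slowest) query type.
    return "groupBy"
```

For example, `suggest_query_type(["page"], ordered_by_metric=True)` suggests "topN", while two or more dimensions fall back to "groupBy".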
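2.1 JSON query properties
Before looking at the three aggregation query types, it helps to be familiar with the important JSON properties they share, such as queryType, dataSource, granularity, filter, and aggregations.
2.1.1 Query type (queryType)
One of the three aggregation query types: timeseries, topN, groupBy
2.1.2 Data source (dataSource)
The data source is analogous to a table in a relational database. It corresponds to the dataSource value in the JSON ingestion spec.
2.1.3 Granularity (granularity)
Granularity determines how data is bucketed across the time dimension, that is, how it is rolled up by hour, day, minute, and so on. A query granularity can be specified in three ways:
1. Simple granularity - one of the string values all, none, second, minute, fifteen_minute, thirty_minute, hour, day, week, month, quarter, year
(1) all - buckets everything into a single bucket
(2) none - does not bucket the data (it actually uses the granularity of the index; the finest is none, which means millisecond granularity). Using none in a Timeseries query is not recommended, because the system will try to generate 0 values for every millisecond that has no data, and there are usually a great many of those.
2. Duration granularity - specified as an exact duration in milliseconds; timestamps are returned in UTC.
3. Period granularity - similar to duration granularity, but specified as calendar periods such as years, months, weeks, or hours.
The following examples illustrate these granularity settings:
Simple granularity
A query granularity finer than the granularity configured at ingestion time is not meaningful, because no data is indexed at the finer granularity. In that case Druid returns the same results as a query at the ingestion-time query granularity.
Suppose the data stored in Druid was ingested at millisecond granularity: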
- {"timestamp": "2013-08-31T01:02:33Z", "page": "AAA", "language" : "en"}
- {"timestamp": "2013-09-01T01:02:33Z", "page": "BBB", "language" : "en"}
- {"timestamp": "2013-09-02T23:32:45Z", "page": "CCC", "language" : "en"}
- {"timestamp": "2013-09-03T03:32:45Z", "page": "DDD", "language" : "en"}
Submitting a groupBy query with "hour" granularity:
- {
- "queryType":"groupBy",
- "dataSource":"dataSource",
- "granularity":"hour",
- "dimensions":[
- "language"
- ],
- "aggregations":[
- {
- "type":"count",
- "name":"count"
- }
- ],
- "intervals":[
- "2000-01-01T00:00Z/3000-01-01T00:00Z"
- ]
- }
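In the result of the hour-granularity groupBy query, each timestamp is truncated to the hour, and all finer-grained fields are zero-filled. Likewise, a query by day zeroes the hour and all finer fields. Timestamps are in UTC.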
- [ {
- "version" : "v1",
- "timestamp" : "2013-08-31T01:00:00.000Z",
- "event" : {
- "count" : 1,
- "language" : "en"
- }
- }, {
- "version" : "v1",
- "timestamp" : "2013-09-01T01:00:00.000Z",
- "event" : {
- "count" : 1,
- "language" : "en"
- }
- }, {
- "version" : "v1",
- "timestamp" : "2013-09-02T23:00:00.000Z",
- "event" : {
- "count" : 1,
- "language" : "en"
- }
- }, {
- "version" : "v1",
- "timestamp" : "2013-09-03T03:00:00.000Z",
- "event" : {
- "count" : 1,
- "language" : "en"
- }
- } ]
With a query granularity of none, the result matches the granularity set at ingestion time (the queryGranularity property), which here is milliseconds:
- [ {
- "version" : "v1",
- "timestamp" : "2013-08-31T01:02:33.000Z",
- "event" : {
- "count" : 1,
- "language" : "en"
- }
- }, {
- "version" : "v1",
- "timestamp" : "2013-09-01T01:02:33.000Z",
- "event" : {
- "count" : 1,
- "language" : "en"
- }
- }, {
- "version" : "v1",
- "timestamp" : "2013-09-02T23:32:45.000Z",
- "event" : {
- "count" : 1,
- "language" : "en"
- }
- }, {
- "version" : "v1",
- "timestamp" : "2013-09-03T03:32:45.000Z",
- "event" : {
- "count" : 1,
- "language" : "en"
- }
- } ]
With a query granularity of all, the result array has length 1:
- [ {
- "version" : "v1",
- "timestamp" : "2000-01-01T00:00:00.000Z",
- "event" : {
- "count" : 4,
- "language" : "en"
- }
- } ]
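The three results above follow from truncating each event timestamp to the start of its bucket and counting events per bucket. The sketch below reproduces that arithmetic in plain Python on the four sample events; it is an illustration of the idea, not Druid's implementation, and it labels the single "all" bucket with the key "all" rather than the interval start that Druid reports.

```python
from collections import Counter
from datetime import datetime

EVENTS = [
    {"timestamp": "2013-08-31T01:02:33Z", "page": "AAA", "language": "en"},
    {"timestamp": "2013-09-01T01:02:33Z", "page": "BBB", "language": "en"},
    {"timestamp": "2013-09-02T23:32:45Z", "page": "CCC", "language": "en"},
    {"timestamp": "2013-09-03T03:32:45Z", "page": "DDD", "language": "en"},
]


def bucket_start(ts, granularity):
    """Truncate a timestamp to the start of its bucket."""
    if granularity == "none":
        return ts
    if granularity == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    if granularity == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported granularity: {granularity}")


def count_by_bucket(events, granularity):
    """Count events per time bucket, keyed by the bucket start time."""
    if granularity == "all":
        return {"all": len(events)}
    counts = Counter()
    for event in events:
        ts = datetime.strptime(event["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
        start = bucket_start(ts, granularity)
        counts[start.strftime("%Y-%m-%dT%H:%M:%S.000Z")] += 1
    return dict(counts)
```

Calling `count_by_bucket(EVENTS, "hour")` yields four buckets with one event each, and `count_by_bucket(EVENTS, "all")` yields a single bucket with a count of 4.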
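Duration granularity
Specifies an exact duration in milliseconds; timestamps are returned in UTC. The optional origin property sets the time from which buckets are counted. When omitted, it defaults to 1970-01-01T00:00:00Z.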
- {"type": "duration", "duration": 7200000}
- {"type": "duration", "duration": 3600000, "origin": "2012-01-01T00:30:00Z"}
Using the sample data from the simple granularity example, submit a groupBy query with a duration of 24 hours:
- {
- "queryType":"groupBy",
- "dataSource":"dataSource",
- "granularity":{"type": "duration", "duration": "86400000"},
- "dimensions":[
- "language"
- ],
- "aggregations":[
- {
- "type":"count",
- "name":"count"
- }
- ],
- "intervals":[
- "2000-01-01T00:00Z/3000-01-01T00:00Z"
- ]
- }
Query result:
- [ {
- "version" : "v1",
- "timestamp" : "2013-08-31T00:00:00.000Z",
- "event" : {
- "count" : 1,
- "language" : "en"
- }
- }, {
- "version" : "v1",
- "timestamp" : "2013-09-01T00:00:00.000Z",
- "event" : {
- "count" : 1,
- "language" : "en"
- }
- }, {
- "version" : "v1",
- "timestamp" : "2013-09-02T00:00:00.000Z",
- "event" : {
- "count" : 1,
- "language" : "en"
- }
- }, {
- "version" : "v1",
- "timestamp" : "2013-09-03T00:00:00.000Z",
- "event" : {
- "count" : 1,
- "language" : "en"
- }
- } ]
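Duration bucketing amounts to flooring each timestamp to a multiple of the duration, counted from the origin. The following sketch shows that calculation under the assumption that buckets are aligned exactly this way; the helper names are hypothetical.

```python
from datetime import datetime, timezone


def to_millis(ts):
    """Parse a UTC timestamp such as 2013-08-31T01:02:33Z into epoch millis."""
    dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return int(dt.timestamp()) * 1000


def to_iso(millis):
    """Format epoch millis (whole seconds) as a UTC timestamp string."""
    dt = datetime.fromtimestamp(millis // 1000, tz=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%S.000Z")


def duration_bucket_start(ts_millis, duration_millis, origin_millis=0):
    """Floor a timestamp to the start of its duration bucket."""
    index = (ts_millis - origin_millis) // duration_millis
    return origin_millis + index * duration_millis


# 24-hour buckets counted from the default origin (the epoch).
day_bucket = to_iso(
    duration_bucket_start(to_millis("2013-09-02T23:32:45Z"), 86400000)
)

# One-hour buckets counted from an origin at half past the hour.
shifted_bucket = to_iso(
    duration_bucket_start(
        to_millis("2013-08-31T01:02:33Z"),
        3600000,
        to_millis("2012-01-01T00:30:00Z"),
    )
)
```

With the default origin, the 24-hour bucket starts at midnight UTC, which matches the result above. With the shifted origin, hourly buckets start at 30 minutes past each hour.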
Period granularity
Omitted...
2.1.4 Filters
A filter is a JSON object that selects which rows of data are included in a query, similar to the WHERE clause in SQL. The filter types are: Selector filter, Regular expression filter, Logical expression filters (AND, OR, NOT), In filter, Bound filter, Search filter, JavaScript filter, Extraction filter
Brief usage examples:
Selector filter
Equivalent to: WHERE <dimension_string> = '<dimension_value_string>'
- "filter": { "type": "selector", "dimension": <dimension_string>, "value": <dimension_value_string> }
Regular expression filter
Similar to the selector filter, except that the dimension is matched against a regular expression. The pattern can be any standard Java regular expression.
- "filter": { "type": "regex", "dimension": <dimension_string>, "pattern": <pattern_string> }
Logical expression filters
AND
- "filter": { "type": "and", "fields": [<filter>, <filter>, ...] }
OR
- "filter": { "type": "or", "fields": [<filter>, <filter>, ...] }
NOT
- "filter": { "type": "not", "field": <filter> }
In filter
SQL query
- SELECT COUNT(*) AS 'Count' FROM `table` WHERE `outlaw` IN ('Good', 'Bad', 'Ugly')
The equivalent Druid In filter:
- {
- "type": "in",
- "dimension": "outlaw",
- "values": ["Good", "Bad", "Ugly"]
- }
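Bound filter
The bound filter compares dimension values against an upper bound, a lower bound, or both. Comparison is lexicographic (string-based) by default; to compare numerically, set alphaNumeric to true. Bounds are non-strict (inclusive) by default, i.e. inputString <= upper && inputString >= lower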
- {
- "type": "bound",
- "dimension": "age",
- "lower": "21",
- "upper": "31" ,
- "alphaNumeric": true
- }
The above is equivalent to: 21 <= age <= 31
For strict bounds, set lowerStrict and/or upperStrict to true:
- {
- "type": "bound",
- "dimension": "age",
- "lower": "21",
- "lowerStrict": true,
- "upper": "31" ,
- "upperStrict": true,
- "alphaNumeric": true
- }
Equivalent to: 21 < age < 31
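To make the filter semantics concrete, here is a small evaluator that applies a filter object to a single row held as a Python dict. It is a sketch under several assumptions: the regex filter uses Python's `re.search` rather than Java's regex engine, and alphaNumeric is simplified to a plain numeric comparison. The function names are hypothetical.

```python
import re


def in_bounds(spec, value):
    """Check a value against the lower/upper bounds of a bound filter."""
    if value is None:
        return False
    # Simplification: alphaNumeric is treated as plain numeric comparison.
    convert = float if spec.get("alphaNumeric") else str
    v = convert(value)
    if "lower" in spec:
        lower = convert(spec["lower"])
        if v < lower or (spec.get("lowerStrict") and v == lower):
            return False
    if "upper" in spec:
        upper = convert(spec["upper"])
        if v > upper or (spec.get("upperStrict") and v == upper):
            return False
    return True


def matches(spec, row):
    """Return True if the row passes the filter."""
    kind = spec["type"]
    if kind == "selector":
        return row.get(spec["dimension"]) == spec["value"]
    if kind == "regex":
        value = row.get(spec["dimension"])
        return value is not None and re.search(spec["pattern"], value) is not None
    if kind == "in":
        return row.get(spec["dimension"]) in spec["values"]
    if kind == "bound":
        return in_bounds(spec, row.get(spec["dimension"]))
    if kind == "and":
        return all(matches(field, row) for field in spec["fields"])
    if kind == "or":
        return any(matches(field, row) for field in spec["fields"])
    if kind == "not":
        return not matches(spec["field"], row)
    raise ValueError(f"unsupported filter type: {kind}")
```

The difference between string and numeric comparison shows up with a value such as "3": as a string it sorts between "21" and "31" and passes the bound filter, while with alphaNumeric set it is compared as the number 3 and is rejected.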
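2.1.5 Aggregations
Aggregations can be provided at ingestion time as part of the ingestion spec, as a way of summarizing data before it enters Druid. They can also be specified as part of many queries at query time. The aggregator types are: Count aggregator, Sum aggregators, Min / Max aggregators, Approximate Aggregations, Miscellaneous Aggregations
Count aggregator
Returns the number of rows that match the filters. Note that the number of rows Druid counts is not necessarily the number of raw events ingested, because Druid rolls up data at ingestion time. To count the ingested events, include a count aggregator at ingestion time and apply a longSum aggregator to it at query time.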
- { "type" : "count", "name" : <output_name> }
Sum aggregator
longSum aggregator: computes the sum of values as a 64-bit signed integer
- { "type" : "longSum", "name" : <output_name>, "fieldName" : <metric_name> }
doubleSum aggregator: like longSum, but computes the sum as a 64-bit floating-point value
- { "type" : "doubleSum", "name" : <output_name>, "fieldName" : <metric_name> }
Min / Max aggregators
doubleMin aggregator
- { "type" : "doubleMin", "name" : <output_name>, "fieldName" : <metric_name> }
doubleMax aggregator
- { "type" : "doubleMax", "name" : <output_name>, "fieldName" : <metric_name> }
longMin aggregator
- { "type" : "longMin", "name" : <output_name>, "fieldName" : <metric_name> }
longMax aggregator
- { "type" : "longMax", "name" : <output_name>, "fieldName" : <metric_name> }
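The aggregators above can be illustrated on a handful of rows in plain Python. The rows below are made-up, already rolled-up data in which the "count" column is assumed to hold the number of raw events behind each stored row; the function is a sketch, not Druid's implementation.

```python
# Hypothetical rolled-up rows; "count" holds the raw event count per row.
ROLLED_UP_ROWS = [
    {"page": "AAA", "count": 3, "bytes": 120, "latency": 1.5},
    {"page": "BBB", "count": 2, "bytes": 80, "latency": 0.5},
]


def aggregate(rows, spec):
    """Apply one aggregator spec to a list of rows."""
    kind = spec["type"]
    if kind == "count":
        # Counts stored rows, not the raw events behind them.
        return len(rows)
    values = [row[spec["fieldName"]] for row in rows]
    if kind == "longSum":
        return int(sum(values))
    if kind == "doubleSum":
        return float(sum(values))
    if kind in ("longMin", "doubleMin"):
        return min(values)
    if kind in ("longMax", "doubleMax"):
        return max(values)
    raise ValueError(f"unsupported aggregator type: {kind}")


stored_rows = aggregate(ROLLED_UP_ROWS, {"type": "count", "name": "rows"})
raw_events = aggregate(
    ROLLED_UP_ROWS, {"type": "longSum", "name": "events", "fieldName": "count"}
)
```

Here `stored_rows` is 2 while `raw_events` is 5, which is the gap between the count aggregator and the number of ingested events described above.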
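Approximate aggregations
Cardinality aggregator
Computes the cardinality of a set of Druid dimensions, using HyperLogLog to estimate it. This aggregator is slower than indexing a column with the hyperUnique aggregator. It also runs over a dimension column, which means the string dimension cannot be removed from the dataset to improve rollup. In general, the hyperUnique aggregator is strongly recommended over the cardinality aggregator. The format is: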
- {
- "type": "cardinality",
- "name": "<output_name>",
- "fieldNames": [ <dimension1>, <dimension2>, ... ],
- "byRow": <false | true> # (optional, defaults to false)
- }
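Cardinality by value - when byRow is false (the default), the cardinality is computed over the set of all values of all the given dimensions.
For a single dimension, this is equivalent to: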
- SELECT COUNT(DISTINCT(dimension)) FROM <datasource>
For multiple dimensions, this is equivalent to:
- SELECT COUNT(DISTINCT(value)) FROM (
- SELECT dim_1 as value FROM <datasource>
- UNION
- SELECT dim_2 as value FROM <datasource>
- UNION
- SELECT dim_3 as value FROM <datasource>
- )
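Cardinality by row - when byRow is true, the cardinality is computed over the distinct combinations of dimension values across rows, which is equivalent to: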
- SELECT COUNT(*) FROM ( SELECT DIM1, DIM2, DIM3 FROM <datasource> GROUP BY DIM1, DIM2, DIM3 )
For example, to count the distinct countries that people live in or come from:
- {
- "type": "cardinality",
- "name": "distinct_countries",
- "fieldNames": [ "country_of_origin", "country_of_residence" ]
- }
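The difference between the two byRow modes is easiest to see with exact sets. The sketch below counts exactly, whereas Druid returns a HyperLogLog estimate; the sample rows and the country codes are invented for illustration.

```python
# Made-up sample rows for illustration.
PEOPLE = [
    {"country_of_origin": "FR", "country_of_residence": "DE"},
    {"country_of_origin": "FR", "country_of_residence": "FR"},
    {"country_of_origin": "DE", "country_of_residence": "FR"},
    {"country_of_origin": "FR", "country_of_residence": "DE"},
]


def cardinality(rows, field_names, by_row=False):
    """Exact counterpart of the cardinality aggregator's two modes."""
    if by_row:
        # Distinct combinations of the dimension values within each row.
        return len({tuple(row[name] for name in field_names) for row in rows})
    # Distinct values across all the given dimensions, merged into one set.
    return len({row[name] for row in rows for name in field_names})


FIELDS = ["country_of_origin", "country_of_residence"]
distinct_countries = cardinality(PEOPLE, FIELDS)
distinct_pairs = cardinality(PEOPLE, FIELDS, by_row=True)
```

With these rows there are 2 distinct countries overall but 3 distinct (origin, residence) pairs, so the choice of byRow changes the answer.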
HyperUnique aggregator
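Uses HyperLogLog to estimate the cardinality of a dimension that was aggregated as a "hyperUnique" metric at indexing time. See the official Druid documentation for details.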
- { "type" : "hyperUnique", "name" : <output_name>, "fieldName" : <metric_name> }
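Post-aggregators
Post-aggregators perform further computation on the aggregated values that come out of Druid. If a query includes a post-aggregator, make sure it also includes all the aggregators that the post-aggregator requires. The post-aggregator types are: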
1. Arithmetic post-aggregators
2. Field accessor post-aggregator
3. Constant post-aggregator
4. JavaScript post-aggregator
5. HyperUnique Cardinality post-aggregator
Arithmetic post-aggregators
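The arithmetic post-aggregator applies the given function to the given fields from left to right. The fields can be aggregators or other post-aggregators. Supported functions are +, -, *, /, and quotient.
An arithmetic post-aggregator may also specify an ordering property, which defines the order of the resulting values when sorting (useful for topN queries):
(1) If no ordering (or null) is specified, the default floating-point ordering is used.
(2) numericFirst returns finite values first, followed by NaN, and infinite values last.
The syntax of the arithmetic post-aggregator is: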
- postAggregation : {
- "type" : "arithmetic",
- "name" : <output_name>,
- "fn" : <arithmetic_function>,
- "fields": [<post_aggregator>, <post_aggregator>, ...],
- "ordering" : <null (default), or "numericFirst">
- }
Field accessor post-aggregator - fieldName refers to the output name of an aggregator defined in the query
- { "type" : "fieldAccess", "name": <output_name>, "fieldName" : <aggregator_name> }
Constant post-aggregator - always returns the specified value
- { "type" : "constant", "name" : <output_name>, "value" : <numerical_value> }
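How the three post-aggregator types combine can be sketched with a small recursive evaluator over a dict of aggregated values. This is an illustration only: the quotient function is left out, and returning 0 for division by zero reflects my understanding of Druid's "/" behaviour and should be read as an assumption.

```python
import operator

FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def apply_fn(fn, left, right):
    """Apply one arithmetic function to two operands."""
    if fn == "/":
        # Assumption: division by zero yields 0 instead of raising.
        return 0 if right == 0 else left / right
    return FUNCTIONS[fn](left, right)


def evaluate(post_agg, aggregated):
    """Evaluate a post-aggregator against a dict of aggregated values."""
    kind = post_agg["type"]
    if kind == "fieldAccess":
        return aggregated[post_agg["fieldName"]]
    if kind == "constant":
        return post_agg["value"]
    if kind == "arithmetic":
        values = [evaluate(field, aggregated) for field in post_agg["fields"]]
        result = values[0]
        for value in values[1:]:
            # Fields are combined from left to right.
            result = apply_fn(post_agg["fn"], result, value)
        return result
    raise ValueError(f"unsupported post-aggregator type: {kind}")


divide = {
    "type": "arithmetic",
    "name": "sample_divide",
    "fn": "/",
    "fields": [
        {"type": "fieldAccess", "name": "a", "fieldName": "sample_name1"},
        {"type": "fieldAccess", "name": "b", "fieldName": "sample_name2"},
    ],
}
ratio = evaluate(divide, {"sample_name1": 10, "sample_name2": 4.0})
```

Because arithmetic fields may themselves be post-aggregators, expressions such as (a + b) * 2 are written by nesting one arithmetic post-aggregator inside another.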
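2.2 Timeseries queries
These queries take a timeseries query object and return an array of JSON objects, where each object represents a value asked for by the timeseries query. The main properties of a timeseries query request are:
Property | Description | Required
queryType | String; must be "timeseries" | Yes
dataSource | String; the data source (analogous to a database table) | Yes
descending | Whether to sort the results in descending order; defaults to "false" (ascending) | No
intervals | The time ranges to query, as a JSON object of ISO-8601 intervals | Yes
granularity | Defines the granularity of the result buckets | Yes
filter | Filter conditions | No
aggregations | Aggregators | Yes
postAggregations | Post-aggregators | No
context | Query context | No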
- {
- "queryType": "timeseries",
- "dataSource": "sample_datasource",
- "granularity": "day",
- "descending": "true",
- "filter": {
- "type": "and",
- "fields": [
- { "type": "selector", "dimension": "sample_dimension1", "value": "sample_value1" },
- { "type": "or",
- "fields": [
- { "type": "selector", "dimension": "sample_dimension2", "value": "sample_value2" },
- { "type": "selector", "dimension": "sample_dimension3", "value": "sample_value3" }
- ]
- }
- ]
- },
- "aggregations": [
- { "type": "longSum", "name": "sample_name1", "fieldName": "sample_fieldName1" },
- { "type": "doubleSum", "name": "sample_name2", "fieldName": "sample_fieldName2" }
- ],
- "postAggregations": [
- { "type": "arithmetic",
- "name": "sample_divide",
- "fn": "/",
- "fields": [
- { "type": "fieldAccess", "name": "postAgg__sample_name1", "fieldName": "sample_name1" },
- { "type": "fieldAccess", "name": "postAgg__sample_name2", "fieldName": "sample_name2" }
- ]
- }
- ],
- "intervals": [ "2012-01-01T00:00:00.000/2012-01-03T00:00:00.000" ]
- }
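The query above applies a filter and two aggregators, and the post-aggregator divides one aggregated value by the other. The results are returned in the result property as key-value pairs: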
- [
- {
- "timestamp": "2012-01-01T00:00:00.000Z",
- "result": { "sample_name1": <some_value>, "sample_name2": <some_value>, "sample_divide": <some_value> }
- },
- {
- "timestamp": "2012-01-02T00:00:00.000Z",
- "result": { "sample_name1": <some_value>, "sample_name2": <some_value>, "sample_divide": <some_value> }
- }
- ]
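When queries are assembled in code, it can help to build the JSON from a function that makes the required properties explicit and to pull one column out of the response. The helpers below are a sketch; their names are hypothetical, and the numeric values in the sample response are invented.

```python
OPTIONAL_PROPERTIES = {"descending", "filter", "postAggregations", "context"}


def timeseries_query(data_source, intervals, granularity, aggregations, **optional):
    """Build a timeseries query dict; required properties are positional."""
    unknown = set(optional) - OPTIONAL_PROPERTIES
    if unknown:
        raise ValueError(f"unknown properties: {sorted(unknown)}")
    query = {
        "queryType": "timeseries",
        "dataSource": data_source,
        "intervals": intervals,
        "granularity": granularity,
        "aggregations": aggregations,
    }
    query.update(optional)
    return query


def series(results, column):
    """Extract (timestamp, value) pairs for one result column."""
    return [(entry["timestamp"], entry["result"][column]) for entry in results]


query = timeseries_query(
    "sample_datasource",
    ["2012-01-01T00:00:00.000/2012-01-03T00:00:00.000"],
    "day",
    [{"type": "longSum", "name": "sample_name1", "fieldName": "sample_fieldName1"}],
    descending=True,
)

# Invented response values, shaped like the result shown above.
sample_results = [
    {"timestamp": "2012-01-01T00:00:00.000Z", "result": {"sample_name1": 6}},
    {"timestamp": "2012-01-02T00:00:00.000Z", "result": {"sample_name1": 9}},
]
points = series(sample_results, "sample_name1")
```

The resulting dict can be serialized with `json.dumps` and sent as the body of the POST request.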
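2.3 TopN queries
A TopN query returns a sorted set of results for the values of a given dimension according to some criteria. Conceptually, it can be thought of as an approximate GroupBy query over a single dimension with an ordering. In some cases a TopN query is faster than a GroupBy query. TopN results are returned as an array of JSON objects.
TopN is approximate: each node ranks its own top K results and returns only those to the Broker. By default K is max(1000, threshold). In practice, if you ask for the top 1000 items in order, roughly the first 900 are guaranteed to be correct, and the ordering of the rest is not guaranteed. Increasing threshold makes the results more accurate.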
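Why per-node truncation makes the result approximate can be shown with a toy simulation. The sketch below is not Druid's algorithm in detail; it only assumes that each node returns its local top K and that the Broker merges those partial lists. The node data is invented.

```python
from collections import Counter


def node_top_k(local_counts, k):
    """Each node returns only its own top K values."""
    return dict(Counter(local_counts).most_common(k))


def broker_top_n(per_node_counts, k, threshold):
    """Merge the per-node top K lists and keep the overall top results."""
    merged = Counter()
    for local_counts in per_node_counts:
        merged.update(node_top_k(local_counts, k))
    return merged.most_common(threshold)


# Invented data: "c" is only third on each node but first overall (8 + 8 = 16).
NODES = [
    {"a": 11, "b": 9, "c": 8},
    {"d": 10, "e": 9, "c": 8},
]

too_small_k = broker_top_n(NODES, k=2, threshold=1)
large_enough_k = broker_top_n(NODES, k=3, threshold=1)
```

With K = 2 the value "c" never reaches the Broker, so "a" is reported as the top value. With K = 3 the Broker sees every value and correctly ranks "c" first. A larger K (and hence a larger threshold) reduces this kind of error.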
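Property | Description | Required
queryType | String; must be "topN" | Yes
dataSource | String; the data source (analogous to a database table) | Yes
intervals | The time ranges to query, as a JSON object of ISO-8601 intervals | Yes
granularity | Defines the granularity of the result buckets | Yes
filter | Filter conditions | No
aggregations | Aggregators | Yes
postAggregations | Post-aggregators | No
dimension | The dimension (column) to query | Yes
threshold | The number N of top results to return | Yes
metric | String or JSON object specifying the metric by which the top N results are sorted | Yes
context | Query context | No
Metric
Property | Description | Required
type | Ordering type; "numeric" for numeric ordering | Yes
metric | The field to sort by | Yes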
Numeric ordering (Numeric TopNMetricSpec) - the simplest specification is a string naming the metric by which the TopN results are sorted
- "metric": "<metric_name>"
The metric property can also be written as a JSON object; the above is equivalent to:
- "metric": {
- "type": "numeric",
- "metric": "<metric_name>"
- }
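Client code that accepts either form can normalize the shorthand into the object form before building the query. A minimal sketch, with a hypothetical function name:

```python
def normalize_metric(metric):
    """Expand the string shorthand into the equivalent numeric metric spec."""
    if isinstance(metric, str):
        return {"type": "numeric", "metric": metric}
    return metric
```

For example, `normalize_metric("count")` returns the numeric spec for "count", while a spec that is already a JSON object is passed through unchanged.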
A sample topN query:
- {
- "queryType": "topN",
- "dataSource": "sample_data",
- "dimension": "sample_dim",
- "threshold": 5,
- "metric": "count",
- "granularity": "all",
- "filter": {
- "type": "and",
- "fields": [
- {
- "type": "selector",
- "dimension": "dim1",
- "value": "some_value"
- },
- {
- "type": "selector",
- "dimension": "dim2",
- "value": "some_other_val"
- }
- ]
- },
- "aggregations": [
- {
- "type": "longSum",
- "name": "count",
- "fieldName": "count"
- },
- {
- "type": "doubleSum",
- "name": "some_metric",
- "fieldName": "some_metric"
- }
- ],
- "postAggregations": [
- {
- "type": "arithmetic",
- "name": "sample_divide",
- "fn": "/",
- "fields": [
- {
- "type": "fieldAccess",
- "name": "some_metric",
- "fieldName": "some_metric"
- },
- {
- "type": "fieldAccess",
- "name": "count",
- "fieldName": "count"
- }
- ]
- }
- ],
- "intervals": [
- "2013-08-31T00:00:00.000/2013-09-03T00:00:00.000"
- ]
- }
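The query returns the top 5 results, ordered by count. Note that the key names in this sample output (dim1, some_metrics, average) are illustrative and do not match the query above, whose results would be keyed by sample_dim, some_metric, and sample_divide: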
- [
- {
- "timestamp": "2013-08-31T00:00:00.000Z",
- "result": [
- {
- "dim1": "dim1_val",
- "count": 111,
- "some_metrics": 10669,
- "average": 96.11711711711712
- },
- {
- "dim1": "another_dim1_val",
- "count": 88,
- "some_metrics": 28344,
- "average": 322.09090909090907
- },
- {
- "dim1": "dim1_val3",
- "count": 70,
- "some_metrics": 871,
- "average": 12.442857142857143
- },
- {
- "dim1": "dim1_val4",
- "count": 62,
- "some_metrics": 815,
- "average": 13.14516129032258
- },
- {
- "dim1": "dim1_val5",
- "count": 60,
- "some_metrics": 2787,
- "average": 46.45
- }
- ]
- }
- ]
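A topN response nests the ranked entries under each timestamp, so client code usually flattens it into rows. The sketch below does that for the sample output above and checks that each "average" equals some_metrics divided by count, which is what a division post-aggregator would produce. The function name is hypothetical.

```python
import math

TOP_N_RESULT = [
    {
        "timestamp": "2013-08-31T00:00:00.000Z",
        "result": [
            {"dim1": "dim1_val", "count": 111, "some_metrics": 10669, "average": 96.11711711711712},
            {"dim1": "another_dim1_val", "count": 88, "some_metrics": 28344, "average": 322.09090909090907},
            {"dim1": "dim1_val3", "count": 70, "some_metrics": 871, "average": 12.442857142857143},
            {"dim1": "dim1_val4", "count": 62, "some_metrics": 815, "average": 13.14516129032258},
            {"dim1": "dim1_val5", "count": 60, "some_metrics": 2787, "average": 46.45},
        ],
    }
]


def flatten(results):
    """Turn the nested topN response into one flat row per ranked entry."""
    return [
        dict(entry, timestamp=bucket["timestamp"])
        for bucket in results
        for entry in bucket["result"]
    ]


rows = flatten(TOP_N_RESULT)
averages_consistent = all(
    math.isclose(row["average"], row["some_metrics"] / row["count"], rel_tol=1e-12)
    for row in rows
)
```

Each flattened row carries its bucket timestamp, which makes the rows easy to load into a table or data frame.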