Introduction to Kylin, an OLAP Tool for Star-Schema Data Warehouses
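Data warehouses are an important platform for enterprise BI analysis today, especially at internet companies, which generate hundreds of gigabytes of logs every day; finding the patterns hidden in those logs matters a great deal. A data warehouse is a key tool for that analysis, and large companies spend millions every year operating and maintaining one.
This article introduces a Hadoop-based data warehouse. It builds on the horizontal scalability of Hadoop (Hive, HBase) to overcome the data-capacity limits that traditional OLAP inherits from relational databases. Kylin is an open-source implementation of a star-schema OLAP data warehouse released by eBay.
First, install Kylin and its runtime environment (Hadoop, YARN, Hive, HBase). If the installation succeeds, log in at http://<KYLIN_HOST>:7070/ with username ADMIN and password KYLIN. For the installation steps, see http://kylin.incubator.apache.org/download/ (download the prebuilt binary package, which avoids a lot of compilation trouble).
Before building the data warehouse, let's first talk about what a data warehouse is.
From the point of view of business processes, information systems fall into two main categories. One supports the execution of business processes, with MySQL as a representative. The other supports the analysis of business processes, with Hive as a representative, along with today's protagonist, Kylin.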
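The figure below shows a simple star schema of facts and dimensions based on an order process.
This is a typical star schema. The order fact table has 3 measures (quantity ordered, order dollars, and cost dollars) and 4 dimensions: time, product, salesperson, and customer. Time is at day granularity. Note that day_key must be in YYYY-MM-DD format (a Kylin requirement).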
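Because day_key has to follow the YYYY-MM-DD format, it can help to check the keys before loading them into Hive. The following is a minimal sketch of such a check; the function name is_valid_day_key is hypothetical and is not part of Kylin or Hive.

```python
from datetime import datetime


def is_valid_day_key(value: str) -> bool:
    """Return True if value is a real calendar date written as YYYY-MM-DD."""
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        # Wrong layout (e.g. "20150501") or impossible date (e.g. "2015-02-30").
        return False
    # strptime also accepts unpadded input such as "2015-5-1", so compare
    # against the canonical zero-padded rendering.
    return parsed.strftime("%Y-%m-%d") == value


for key in ["2015-05-01", "20150501", "2015-5-1", "2015-02-30"]:
    print(key, is_valid_day_key(key))
```

Only the first key in the loop passes; the other three are rejected for layout, padding, and an impossible date respectively.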
1. Create the fact table and insert data
DROP TABLE IF EXISTS DEFAULT.fact_order;
CREATE TABLE DEFAULT.fact_order (
  time_key string,
  product_key string,
  salesperson_key string,
  custom_key string,
  quantity_ordered bigint,
  order_dollars bigint,
  cost_dollars bigint
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH 'fact_order.csv' OVERWRITE INTO TABLE DEFAULT.fact_order;
fact_order.csv
2015-05-01,pd001,sp001,ct001,100,101,51
2015-05-01,pd001,sp002,ct002,100,101,51
2015-05-01,pd001,sp003,ct002,100,101,51
2015-05-01,pd002,sp001,ct001,100,101,51
2015-05-01,pd003,sp001,ct001,100,101,51
2015-05-01,pd004,sp001,ct001,100,101,51
2015-05-02,pd001,sp001,ct001,100,101,51
2015-05-02,pd001,sp002,ct002,100,101,51
2015-05-02,pd001,sp003,ct002,100,101,51
2015-05-02,pd002,sp001,ct001,100,101,51
2015-05-02,pd003,sp001,ct001,100,101,51
2015-05-02,pd004,sp001,ct001,100,101,51
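The sample file follows a regular pattern: the same six key combinations repeated for each day, with identical measure values. Assuming that pattern is all you need, the file can be regenerated with a short script instead of being typed by hand. This is only a sketch; the function names fact_order_rows and write_fact_order are hypothetical.

```python
import csv
from itertools import product

DAYS = ["2015-05-01", "2015-05-02"]
# (product_key, salesperson_key, custom_key) combinations from the sample data.
ORDER_KEYS = [
    ("pd001", "sp001", "ct001"),
    ("pd001", "sp002", "ct002"),
    ("pd001", "sp003", "ct002"),
    ("pd002", "sp001", "ct001"),
    ("pd003", "sp001", "ct001"),
    ("pd004", "sp001", "ct001"),
]
MEASURES = (100, 101, 51)  # quantity_ordered, order_dollars, cost_dollars


def fact_order_rows():
    """Build one row per (day, key combination), in day-major order."""
    return [(day, *keys, *MEASURES) for day, keys in product(DAYS, ORDER_KEYS)]


def write_fact_order(path):
    """Write the rows as comma-separated lines, one order per line."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle, lineterminator="\n").writerows(fact_order_rows())


if __name__ == "__main__":
    write_fact_order("fact_order.csv")
```

Writing one row per line matters here, because the Hive table is declared with comma-delimited fields and the default newline row delimiter.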
2. Create the day dimension table dim_day
DROP TABLE IF EXISTS DEFAULT.dim_day;
CREATE TABLE DEFAULT.dim_day (
  day_key string,
  full_day string,
  month_name string,
  quarter string,
  year string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH 'dim_day.csv' OVERWRITE INTO TABLE DEFAULT.dim_day;
dim_day.csv
2015-05-01,2015-05-01,201505,2015q2,2015
2015-05-02,2015-05-02,201505,2015q2,2015
2015-05-03,2015-05-03,201505,2015q2,2015
2015-05-04,2015-05-04,201505,2015q2,2015
2015-05-05,2015-05-05,201505,2015q2,2015
3. Create the salesperson dimension table dim_salesperson
DROP TABLE IF EXISTS DEFAULT.dim_salesperson;
CREATE TABLE DEFAULT.dim_salesperson (
  salesperson_key string,
  salesperson string,
  salesperson_id string,
  region string,
  region_code string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH 'dim_salesperson.csv' OVERWRITE INTO TABLE DEFAULT.dim_salesperson;
dim_salesperson.csv
sp001,hongbin,sp001,beijing,10086
sp002,hongming,sp002,beijing,10086
sp003,hongmei,sp003,beijing,10086
4. Create the customer dimension table dim_custom
DROP TABLE IF EXISTS DEFAULT.dim_custom;
CREATE TABLE DEFAULT.dim_custom (
  custom_key string,
  custom_name string,
  custom_id string,
  headquarter_states string,
  billing_address string,
  billing_city string,
  billing_state string,
  industry_name string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH 'dim_custom.csv' OVERWRITE INTO TABLE DEFAULT.dim_custom;
dim_custom.csv
ct001,custom_john,ct001,beijing,zgx-beijing,beijing,beijing,internet
ct002,custom_herry,ct002,henan,shlinjie,shangdang,henan,internet
5. Create the product dimension table and insert data
DROP TABLE IF EXISTS DEFAULT.dim_product;
CREATE TABLE DEFAULT.dim_product (
  product_key string,
  product_name string,
  product_id string,
  product_desc string,
  sku string,
  brand string,
  brand_code string,
  brand_manager string,
  category string,
  category_code string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH 'dim_product.csv' OVERWRITE INTO TABLE DEFAULT.dim_product;
dim_product.csv
pd001,Box-Large,pd001,Box-Large-des,large1.0,brand001,brandcode001,brandmanager001,Packing,cate001
pd002,Box-Medium,pd001,Box-Medium-des,medium1.0,brand001,brandcode001,brandmanager001,Packing,cate001
pd003,Box-small,pd001,Box-small-des,small1.0,brand001,brandcode001,brandmanager001,Packing,cate001
pd004,Evelope,pd001,Evelope_des,large3.0,brand001,brandcode001,brandmanager001,Pens,cate002
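With that, the star-schema tables are in place in Hive. In effect, an offline data warehouse is now complete. It contains a single subject: product orders.
Statistics on product orders can be produced with Hive queries. For example:
1. Total quantity ordered per day from 2015-05-01 to 2015-05-02.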
select dday.full_day, sum(quantity_ordered) from fact_order as fact
inner join dim_day as dday on fact.time_key = dday.day_key
where dday.full_day >= "2015-05-01" and dday.full_day <= "2015-05-02"
group by dday.full_day
order by dday.full_day;
2015-05-01 600
2015-05-02 600
2. Total quantity ordered per salesperson per day from 2015-05-01 to 2015-05-02
select dday.full_day, dsp.salesperson_key, sum(quantity_ordered) from fact_order as fact
inner join dim_day as dday on fact.time_key = dday.day_key
inner join dim_salesperson as dsp on fact.salesperson_key = dsp.salesperson_key
where dday.full_day >= "2015-05-01" and dday.full_day <= "2015-05-02"
group by dday.full_day, dsp.salesperson_key
order by dday.full_day;
2015-05-01 sp003 100
2015-05-01 sp002 100
2015-05-01 sp001 400
2015-05-02 sp003 100
2015-05-02 sp002 100
2015-05-02 sp001 400
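The two results above can be cross-checked without a Hadoop cluster. The sketch below loads the same sample rows into an in-memory SQLite database and runs equivalent queries. SQLite is only a stand-in for Hive here, so the column types and the BETWEEN filter are adaptations, and the second query also sorts by salesperson_key so that the row order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_order (time_key TEXT, product_key TEXT, salesperson_key TEXT,
                         custom_key TEXT, quantity_ordered INTEGER,
                         order_dollars INTEGER, cost_dollars INTEGER);
CREATE TABLE dim_day (day_key TEXT, full_day TEXT, month_name TEXT,
                      quarter TEXT, year TEXT);
CREATE TABLE dim_salesperson (salesperson_key TEXT, salesperson TEXT,
                              salesperson_id TEXT, region TEXT, region_code TEXT);
""")

# Same rows as fact_order.csv: six key combinations on each of two days.
keys = [("pd001", "sp001", "ct001"), ("pd001", "sp002", "ct002"),
        ("pd001", "sp003", "ct002"), ("pd002", "sp001", "ct001"),
        ("pd003", "sp001", "ct001"), ("pd004", "sp001", "ct001")]
facts = [(day, p, s, c, 100, 101, 51)
         for day in ("2015-05-01", "2015-05-02") for p, s, c in keys]
conn.executemany("INSERT INTO fact_order VALUES (?,?,?,?,?,?,?)", facts)
conn.executemany(
    "INSERT INTO dim_day VALUES (?,?,?,?,?)",
    [(f"2015-05-0{d}", f"2015-05-0{d}", "201505", "2015q2", "2015")
     for d in range(1, 6)])
conn.executemany(
    "INSERT INTO dim_salesperson VALUES (?,?,?,?,?)",
    [("sp001", "hongbin", "sp001", "beijing", "10086"),
     ("sp002", "hongming", "sp002", "beijing", "10086"),
     ("sp003", "hongmei", "sp003", "beijing", "10086")])

# Query 1: quantity ordered per day.
per_day = conn.execute("""
SELECT dday.full_day, SUM(fact.quantity_ordered)
FROM fact_order AS fact
JOIN dim_day AS dday ON fact.time_key = dday.day_key
WHERE dday.full_day BETWEEN '2015-05-01' AND '2015-05-02'
GROUP BY dday.full_day
ORDER BY dday.full_day
""").fetchall()

# Query 2: quantity ordered per day and salesperson.
per_salesperson = conn.execute("""
SELECT dday.full_day, dsp.salesperson_key, SUM(fact.quantity_ordered)
FROM fact_order AS fact
JOIN dim_day AS dday ON fact.time_key = dday.day_key
JOIN dim_salesperson AS dsp ON fact.salesperson_key = dsp.salesperson_key
WHERE dday.full_day BETWEEN '2015-05-01' AND '2015-05-02'
GROUP BY dday.full_day, dsp.salesperson_key
ORDER BY dday.full_day, dsp.salesperson_key
""").fetchall()

print(per_day)
print(per_salesperson)
```

The totals match the Hive output: 600 per day, split as 400 for sp001 and 100 each for sp002 and sp003.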
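Kylin builds OLAP data cubes on top of Hive and so takes on the job of serving a real-time data warehouse. On top of Hive, Kylin:
1. Deploys the star-schema data in HBase to provide real-time query service
2. Provides a RESTful query interface
3. Integrates with BI tools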
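Once a cube has been built, the RESTful interface in item 2 could be called roughly as follows. This is a sketch under assumptions: the /kylin/api/query path, the JSON field names, and HTTP Basic authentication are taken from Kylin's REST API documentation as I understand it and should be checked against your Kylin version. The host kylin.example.com is a placeholder, and build_query_request is a hypothetical helper. The code only builds the request and does not send it.

```python
import base64
import json
import urllib.request


def build_query_request(host, project, sql, user="ADMIN", password="KYLIN"):
    """Build (but do not send) a POST request for Kylin's SQL query endpoint."""
    # Assumed endpoint path; verify it against your Kylin version.
    url = f"http://{host}:7070/kylin/api/query"
    body = json.dumps({
        "sql": sql,
        "project": project,
        "offset": 0,
        "limit": 100,
        "acceptPartial": False,
    }).encode("utf-8")
    # HTTP Basic authentication: base64 of "user:password".
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )


request = build_query_request(
    "kylin.example.com",  # placeholder host
    "kylin_test_project",
    "select time_key, sum(quantity_ordered) from fact_order group by time_key",
)
print(request.get_method(), request.full_url)
```

Against a running server, the request would be sent with urllib.request.urlopen(request) and the response body parsed as JSON.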
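First, create a data warehouse project (kylin_test_project).
Next, click the Tables tab, then the "Load Hive Table" button, and sync all of the Hive tables above.
This completes the sync between the Hive tables and Kylin.
Then, create the Kylin data cube.
Click Cubes, then the new cube button.
1. Name the cube order_cube
2. Add the fact and dim tables
3. Add the dimensions
4. Add the measures
5. No filter condition is needed
6. Choose the start time
7. Finish
Then, build the cube.
You can check the build status under Jobs. The build process actually stores the cube in HBase so that it can be retrieved quickly.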
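To see why a prebuilt cube makes queries fast, it helps to look at what pre-aggregation means. The sketch below computes the sum of quantity_ordered for every combination of three dimensions over the sample fact rows, so that a later query becomes a dictionary lookup. It is a toy illustration of the idea only, not Kylin's actual build algorithm or its HBase storage layout, and build_cuboids is a hypothetical name.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("time_key", "product_key", "salesperson_key")

# Sample fact rows: six (product, salesperson) pairs on each of two days.
pairs = [("pd001", "sp001"), ("pd001", "sp002"), ("pd001", "sp003"),
         ("pd002", "sp001"), ("pd003", "sp001"), ("pd004", "sp001")]
rows = [{"time_key": day, "product_key": p, "salesperson_key": s,
         "quantity_ordered": 100}
        for day in ("2015-05-01", "2015-05-02") for p, s in pairs]


def build_cuboids(rows, dimensions, measure):
    """Precompute SUM(measure) for every subset of the dimensions."""
    cuboids = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            totals = defaultdict(int)
            for row in rows:
                totals[tuple(row[d] for d in dims)] += row[measure]
            cuboids[dims] = dict(totals)
    return cuboids


cube = build_cuboids(rows, DIMENSIONS, "quantity_ordered")

# A "group by time_key, salesperson_key" query is now a lookup, not a scan.
print(cube[("time_key", "salesperson_key")][("2015-05-01", "sp001")])
print(len(cube))
```

With three dimensions there are 2^3 = 8 precomputed groupings, which shows the trade-off: more dimensions mean faster lookups for more query shapes, at the cost of a larger build.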