An Introduction to Kylin, an OLAP Tool for Star-Schema Data Warehouses


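The data warehouse is a key platform for enterprise BI analysis today. This is especially true at internet companies, which generate hundreds of gigabytes of logs every day, and finding the patterns in those logs matters a great deal. The data warehouse is an essential tool for data analysis, and large companies commonly spend millions a year operating one.

This article introduces a Hadoop-based data warehouse. It builds on the horizontal scalability of Hadoop (Hive, HBase) to overcome the data-capacity limits that traditional OLAP inherits from relational databases. Kylin is an open-source implementation of a star-schema OLAP data warehouse released by eBay.

First, install Kylin and its runtime environment (Hadoop, YARN, Hive, HBase). Once installation succeeds, log in at http://<KYLIN_HOST>:7070/ with user name ADMIN and password KYLIN. For the installation steps, see http://kylin.incubator.apache.org/download/ (download the precompiled binary package to save yourself a lot of build trouble).

Before building the data warehouse, let's first talk about what a data warehouse is.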

 

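From the perspective of business processes, information systems fall into two main categories. One supports the execution of business processes, and MySQL is a representative example. The other supports the analysis of business processes, and Hive is a representative example, as is today's protagonist, Kylin.

First, design the data warehouse

The figure below shows a simple star schema built from the facts and dimensions of an order process.

This is a typical star schema. The order fact table has three measures (order quantity, order dollars, and order cost) and four dimensions: time, product, salesperson, and customer. Time is at day granularity. Note that day_key must be in YYYY-MM-DD format (a Kylin requirement).

Next, create the Hive tables according to the data warehouse design

1. Create the fact table and load the data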

DROP TABLE IF EXISTS DEFAULT.fact_order ;

create table DEFAULT.fact_order (
    time_key string,
    product_key string,
    salesperson_key string,
    custom_key string,
    quantity_ordered bigint,
    order_dollars bigint,
    cost_dollars bigint

)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
load data local inpath 'fact_order.csv' overwrite into table DEFAULT.fact_order; 

 

fact_order.csv

2015-05-01,pd001,sp001,ct001,100,101,51
2015-05-01,pd001,sp002,ct002,100,101,51
2015-05-01,pd001,sp003,ct002,100,101,51
2015-05-01,pd002,sp001,ct001,100,101,51
2015-05-01,pd003,sp001,ct001,100,101,51
2015-05-01,pd004,sp001,ct001,100,101,51
2015-05-02,pd001,sp001,ct001,100,101,51
2015-05-02,pd001,sp002,ct002,100,101,51
2015-05-02,pd001,sp003,ct002,100,101,51
2015-05-02,pd002,sp001,ct001,100,101,51
2015-05-02,pd003,sp001,ct001,100,101,51
2015-05-02,pd004,sp001,ct001,100,101,51
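
Since Kylin expects the day key in YYYY-MM-DD form, it can be worth checking the first column of fact_order.csv before loading it into Hive. The following is a minimal sketch of such a check. The helper names and the two malformed sample rows are made up for illustration and are not part of the original data set.

```python
import csv
import io
from datetime import datetime


def is_kylin_day_key(value: str) -> bool:
    """Return True only for a zero-padded YYYY-MM-DD calendar date."""
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    # strptime also accepts "2015-5-1", so require an exact round trip.
    return parsed.strftime("%Y-%m-%d") == value


def bad_time_keys(csv_text: str) -> list[tuple[int, str]]:
    """Return (line_number, time_key) for rows whose first column is malformed."""
    reader = csv.reader(io.StringIO(csv_text))
    return [
        (number, row[0])
        for number, row in enumerate(reader, start=1)
        if row and not is_kylin_day_key(row[0])
    ]


# First row comes from fact_order.csv; rows 2 and 3 are hypothetical bad input.
SAMPLE = """2015-05-01,pd001,sp001,ct001,100,101,51
2015-5-02,pd001,sp002,ct002,100,101,51
20150503,pd001,sp003,ct002,100,101,51
"""

print(bad_time_keys(SAMPLE))
```

An empty list means every time_key in the file is already in the expected format.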

2. Create the day dimension table dim_day

DROP TABLE IF EXISTS DEFAULT.dim_day ;

create table DEFAULT.dim_day (
    day_key string,
    full_day string,
    month_name string,
    quarter string, 
    year string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

load data local inpath 'dim_day.csv' overwrite into table DEFAULT.dim_day; 

dim_day.csv

2015-05-01,2015-05-01,201505,2015q2,2015
2015-05-02,2015-05-02,201505,2015q2,2015
2015-05-03,2015-05-03,201505,2015q2,2015
2015-05-04,2015-05-04,201505,2015q2,2015
2015-05-05,2015-05-05,201505,2015q2,2015
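
Typing the day dimension by hand gets tedious for longer date ranges. As a rough sketch, rows in the same layout as dim_day.csv (day_key, full_day, month_name, quarter, year) can be generated like this. The function name is hypothetical, and the month and quarter formats are inferred from the five sample rows above.

```python
from datetime import date, timedelta


def dim_day_rows(start: date, end: date) -> list[str]:
    """Build one CSV line per day, both endpoints included."""
    rows = []
    day = start
    while day <= end:
        key = day.isoformat()  # YYYY-MM-DD, the format Kylin expects
        quarter = (day.month - 1) // 3 + 1
        rows.append(f"{key},{key},{day:%Y%m},{day.year}q{quarter},{day.year}")
        day += timedelta(days=1)
    return rows


for line in dim_day_rows(date(2015, 5, 1), date(2015, 5, 5)):
    print(line)
```

Redirecting the output to dim_day.csv gives a file that the load statement above can pick up.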

3. Create the salesperson dimension table dim_salesperson

DROP TABLE IF EXISTS DEFAULT.dim_salesperson ;

create table DEFAULT.dim_salesperson (
    salesperson_key string,
    salesperson string,
    salesperson_id string, 
    region string,
    region_code string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

load data local inpath 'dim_salesperson.csv' overwrite into table DEFAULT.dim_salesperson; 

dim_salesperson.csv

sp001,hongbin,sp001,beijing,10086
sp002,hongming,sp002,beijing,10086
sp003,hongmei,sp003,beijing,10086

4. Create the customer dimension table dim_custom

DROP TABLE IF EXISTS DEFAULT.dim_custom ;

create table DEFAULT.dim_custom (
    custom_key string,
    custom_name string,
    custom_id string,
    headquarter_states string,
    billing_address string,
    billing_city string,
    billing_state string,
    industry_name string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

load data local inpath 'dim_custom.csv' overwrite into table DEFAULT.dim_custom; 

  

dim_custom.csv

ct001,custom_john,ct001,beijing,zgx-beijing,beijing,beijing,internet
ct002,custom_herry,ct002,henan,shlinjie,shangdang,henan,internet

5. Create the product dimension table and load the data

DROP TABLE IF EXISTS DEFAULT.dim_product ;

create table DEFAULT.dim_product (
    product_key string,
    product_name string,
    product_id string,
    product_desc string,
    sku string,
    brand string,
    brand_code string,
    brand_manager string,
    category string,
    category_code string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

load data local inpath 'dim_product.csv' overwrite into table DEFAULT.dim_product;

dim_product.csv

pd001,Box-Large,pd001,Box-Large-des,large1.0,brand001,brandcode001,brandmanager001,Packing,cate001
pd002,Box-Medium,pd001,Box-Medium-des,medium1.0,brand001,brandcode001,brandmanager001,Packing,cate001
pd003,Box-small,pd001,Box-small-des,small1.0,brand001,brandcode001,brandmanager001,Packing,cate001
pd004,Evelope,pd001,Evelope_des,large3.0,brand001,brandcode001,brandmanager001,Pens,cate002

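With that, the star-schema tables are created in Hive. In effect, an offline data warehouse is now complete. It covers a single subject, namely product orders.

Statistics on product orders can be produced with Hive queries. For example:

1. Total quantity ordered per day from 2015-05-01 to 2015-05-02.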

hive> select dday.full_day, sum(quantity_ordered)
    from fact_order as fact
    inner join dim_day as dday
        on fact.time_key == dday.day_key
        and dday.full_day >= "2015-05-01"
        and dday.full_day <= "2015-05-02"
    group by dday.full_day
    order by dday.full_day;

2015-05-01      600

2015-05-02      600

 

2. Quantity ordered per salesperson per day from 2015-05-01 to 2015-05-02

select dday.full_day, dsp.salesperson_key, sum(quantity_ordered) from fact_order as fact
    inner join dim_day as dday on fact.time_key == dday.day_key
    inner join dim_salesperson as dsp on fact.salesperson_key == dsp.salesperson_key
    where dday.full_day >= "2015-05-01" and dday.full_day <= "2015-05-02"
    group by dday.full_day, dsp.salesperson_key
    order by dday.full_day;

2015-05-01      sp003   100

2015-05-01      sp002   100

2015-05-01      sp001   400

2015-05-02      sp003   100

2015-05-02      sp002   100

2015-05-02      sp001   400
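
If no Hive cluster is at hand, the second query can be sanity-checked locally. The sketch below rebuilds a trimmed copy of the star schema in an in-memory SQLite database and runs the same join. This is only an approximation for checking the logic: SQLite stands in for Hive, the tables keep only the columns the query touches, and the query uses `=`, single-quoted strings, and an extra sort key so the row order is stable.

```python
import sqlite3

# Key combinations from fact_order.csv; every row there has quantity 100.
ORDER_KEYS = [
    ("pd001", "sp001", "ct001"),
    ("pd001", "sp002", "ct002"),
    ("pd001", "sp003", "ct002"),
    ("pd002", "sp001", "ct001"),
    ("pd003", "sp001", "ct001"),
    ("pd004", "sp001", "ct001"),
]
DAYS = ["2015-05-01", "2015-05-02"]
SALESPEOPLE = [("sp001", "hongbin"), ("sp002", "hongming"), ("sp003", "hongmei")]

QUERY = """
    select dday.full_day, dsp.salesperson_key, sum(fact.quantity_ordered)
    from fact_order as fact
    inner join dim_day as dday on fact.time_key = dday.day_key
    inner join dim_salesperson as dsp
        on fact.salesperson_key = dsp.salesperson_key
    where dday.full_day >= '2015-05-01' and dday.full_day <= '2015-05-02'
    group by dday.full_day, dsp.salesperson_key
    order by dday.full_day, dsp.salesperson_key
"""


def quantity_by_day_and_salesperson():
    """Load the sample rows into SQLite and run the star join."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table fact_order (time_key text, product_key text,
            salesperson_key text, custom_key text, quantity_ordered integer);
        create table dim_day (day_key text, full_day text);
        create table dim_salesperson (salesperson_key text, salesperson text);
    """)
    conn.executemany(
        "insert into fact_order values (?, ?, ?, ?, 100)",
        [(day, *keys) for day in DAYS for keys in ORDER_KEYS],
    )
    conn.executemany("insert into dim_day values (?, ?)", [(d, d) for d in DAYS])
    conn.executemany("insert into dim_salesperson values (?, ?)", SALESPEOPLE)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


for row in quantity_by_day_and_salesperson():
    print(*row, sep="\t")
```

The totals should agree with the Hive output above: 400 for sp001 and 100 each for sp002 and sp003 on both days.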

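Then, import into the Kylin data warehouse

Kylin builds OLAP data cubes on top of Hive to provide a real-time data warehouse service. On top of Hive, Kylin:

1. Deploys the star-schema data on HBase to provide a real-time query service

2. Provides a RESTful query interface

3. Integrates with BI tools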
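
To give a feel for the RESTful interface, here is a minimal sketch that prepares a query request without sending it. It assumes the query endpoint is POST /kylin/api/query with HTTP Basic authentication and a JSON body containing sql, project, offset, limit, and acceptPartial, which is how I understand Kylin's query API. Please check the field names against the documentation for your version. The host kylin.example.com is a placeholder, and the project name is the one used in this walkthrough.

```python
import base64
import json
import urllib.request


def build_kylin_query(host, sql, project, user="ADMIN", password="KYLIN", limit=100):
    """Prepare (but do not send) a query request for a Kylin server."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    body = json.dumps({
        "sql": sql,
        "project": project,
        "offset": 0,
        "limit": limit,
        "acceptPartial": False,
    }).encode()
    return urllib.request.Request(
        url=f"http://{host}:7070/kylin/api/query",  # assumed endpoint path
        data=body,
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_kylin_query(
    host="kylin.example.com",  # placeholder host
    sql="select time_key, sum(quantity_ordered) from fact_order group by time_key",
    project="kylin_test_project",
)
print(request.get_method(), request.full_url)
```

Once a cube has been built, passing the request to `urllib.request.urlopen(request)` should return a JSON response, assuming the server is reachable and the credentials are valid.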

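First, create a data warehouse project (kylin_test_project)

Next, click the Tables tab, then click the "load hive table" button and sync all of the Hive tables above

 

The Hive tables are now synced with Kylin.

Then, create the Kylin data cube

Click Cube, then the new-cube button.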

1. Name the cube order_cube

2. Add the fact and dim tables

3. Add the dimensions

4. Add the measures

5. No filter condition is needed

6. Select the start time

7. Finish

Then, build the cube

You can check the build status under Jobs. The build process actually stores the cube in HBase so that it can be retrieved quickly.
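
To see why a prebuilt cube makes retrieval fast, consider this toy sketch. It precomputes sum(quantity_ordered) for every combination of the four dimensions and keeps the results in a plain dictionary, so a query becomes a single key lookup instead of a scan and join. This is a simplification for intuition only: the function names are hypothetical, and it does not reflect how Kylin actually encodes or lays out cube data in HBase.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("time_key", "product_key", "salesperson_key", "custom_key")

# Same key combinations as fact_order.csv, quantity 100 per row.
ORDER_KEYS = [
    ("pd001", "sp001", "ct001"), ("pd001", "sp002", "ct002"),
    ("pd001", "sp003", "ct002"), ("pd002", "sp001", "ct001"),
    ("pd003", "sp001", "ct001"), ("pd004", "sp001", "ct001"),
]
ROWS = [
    dict(zip(DIMENSIONS, (day, *keys)), quantity_ordered=100)
    for day in ("2015-05-01", "2015-05-02")
    for keys in ORDER_KEYS
]


def build_cube(rows, dimensions=DIMENSIONS, measure="quantity_ordered"):
    """Pre-aggregate the measure for every subset of the dimensions."""
    cube = defaultdict(int)
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            for row in rows:
                key = (dims, tuple(row[d] for d in dims))
                cube[key] += row[measure]
    return dict(cube)


def lookup(cube, **filters):
    """Answer a group-by query with one dictionary lookup."""
    dims = tuple(d for d in DIMENSIONS if d in filters)
    return cube[(dims, tuple(filters[d] for d in dims))]


cube = build_cube(ROWS)
print(len({dims for dims, _ in cube}))  # number of dimension combinations
print(lookup(cube, time_key="2015-05-01", salesperson_key="sp001"))
```

With four dimensions there are 2^4 = 16 combinations to precompute, which is also why the number of dimensions in a cube is worth keeping small.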
