Open-Source OLAP + Data Visualization Tool for Apache Kylin

 

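After a period of modification and debugging by my friend Rocky and me, the first version of Caravel for Kylin is now on GitHub for everyone to use. If you find any problems, contact us through this blog or GitHub.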

GitHub repositories:

https://github.com/rocky1001/pykylin/tree/caravel-kylin

https://github.com/rocky1001/caravel/tree/caravel-kylin

Based on Caravel 0.8.9

About Caravel: http://airbnb.io/caravel/ and http://lxw1234.com/archives/2016/06/681.htm

About PyKylin: https://github.com/wxiang7/pykylin (thanks to the author, @Wu Xiang)

Changes

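PyKylin:
Fixed problems with Chinese characters in WHERE and HAVING conditions.
Added support for multi-table join queries in Kylin (the implementation is crude, but the feature basically works).
Optimized the SQL generated for several time-series charts when querying multiple Kylin tables.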

Caravel:
Fixed some syntax that Kylin does not support, such as using the timestamp keyword as the alias of the time column.

Installation and Startup

Caravel:
Follow the official installation guide: http://airbnb.io/caravel/installation.html
PyKylin:

https://github.com/rocky1001/pykylin/tree/caravel-kylin

Python 3 is strongly recommended, to avoid problems with Chinese characters.

Start Caravel:
nohup gunicorn -w 16 --timeout 60 -b 0.0.0.0:8080 caravel:app >> /tmp/caravel.log 2>&1 &

Note: the startup command given in the official docs, caravel runserver -d, is not recommended,
because the Caravel server can die when a query page is closed.
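
As an optional convenience, the gunicorn options from the command above can be kept in a config file instead of on the command line. This is only a sketch: the file name gunicorn.conf.py is an assumption, while bind, workers, and timeout are standard gunicorn settings that correspond to -b, -w, and --timeout.

```python
# gunicorn.conf.py (assumed file name; pass it to gunicorn with -c)
bind = "0.0.0.0:8080"   # same as -b 0.0.0.0:8080
workers = 16            # same as -w 16
timeout = 60            # same as --timeout 60 (seconds)
```

With this file in place, the startup command would become: nohup gunicorn -c gunicorn.conf.py caravel:app >> /tmp/caravel.log 2>&1 &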

Usage

For ordinary single-table usage, see:
http://lxw1234.com/archives/2016/06/681.htm

Configuring Kylin Multi-Table Queries

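To support multi-table queries in Kylin (typically one fact table joined to several dimension tables to obtain derived dimensions), we add a custom field to the Caravel table and give it an expression that follows a fixed pattern (a string constant). When the query reaches PyKylin, PyKylin parses the string constant, converts it into a SQL query that joins the dimension table, and returns the result.
This approach is somewhat crude, but it basically works; you can use the idea as a starting point for further optimization and changes.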

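Take the fact table AD_REPORT2 as an example. It has a dimension ID column, AD_ID; when the Cube is built in Kylin, the dimension name AD_NAME is obtained by an INNER JOIN with the dimension table AD_DIM.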

In Caravel, add a field named ad_name to the AD_REPORT2 table.

The expression for this field is a string constant:

'$|INNER JOIN (select ad_id as __ad_id,ad_name as __ad_name from LIUXIAOWEN.AD_DIM) as b ON (ad_id = __ad_id)|b.__ad_name|$'

The string starts and ends with $, and its parts are separated by |.

INNER JOIN (select ad_id as __ad_id,ad_name as __ad_name from LIUXIAOWEN.AD_DIM) as b ON (ad_id = __ad_id)

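This part defines the dimension table, the JOIN type, and the ON condition. The string is added directly to the original SQL as the JOIN clause.
The dimension table's columns are prefixed with __ so that they can be told apart from the fact table's columns without having to think about table aliases.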

b.__ad_name

This part names the column whose value is used as the field's final value.
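
To make the format concrete, here is a minimal Python sketch of how such an expression could be split into its two parts. It is not PyKylin's actual code: the names JoinExpr and parse_join_expr are hypothetical, and the sketch assumes that neither part contains a | character.

```python
from typing import NamedTuple


class JoinExpr(NamedTuple):
    """Hypothetical container for the two parts of the expression."""
    join_clause: str  # added to the SQL as the JOIN clause
    field: str        # column used as the field's final value


def parse_join_expr(expr: str) -> JoinExpr:
    """Split '$|<join clause>|<field>|$' (without the SQL quotes)."""
    parts = expr.split("|")
    # A valid expression yields exactly: "$", join clause, field, "$".
    if len(parts) != 4 or parts[0] != "$" or parts[3] != "$":
        raise ValueError(f"not a join expression: {expr!r}")
    return JoinExpr(parts[1].strip(), parts[2].strip())


expr = (
    "$|INNER JOIN (select ad_id as __ad_id,ad_name as __ad_name "
    "from LIUXIAOWEN.AD_DIM) as b ON (ad_id = __ad_id)|b.__ad_name|$"
)
parsed = parse_join_expr(expr)
print(parsed.join_clause)
print(parsed.field)
```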

When ad_name is selected as a dimension, the SQL that Caravel submits to PyKylin is:

SELECT '$|INNER JOIN (select ad_id as __ad_id,ad_name as __ad_name from LIUXIAOWEN.AD_DIM) as b ON (ad_id = __ad_id)|b.__ad_name|$' AS ad_name,
SUM(imp_pv) AS sum__imp_pv
FROM liuxiaowen.AD_REPORT2
WHERE pt >= '2015-06-13'
AND pt <= '2016-06-13'
GROUP BY '$|INNER JOIN (select ad_id as __ad_id,ad_name as __ad_name from LIUXIAOWEN.AD_DIM) as b ON (ad_id = __ad_id)|b.__ad_name|$'
ORDER BY SUM(imp_pv) DESC
LIMIT 50

After conversion, the SQL that PyKylin submits to Kylin is:

SELECT b.__ad_name as ad_name,
SUM(imp_pv) AS sum__imp_pv
FROM liuxiaowen.ad_report2
inner join (SELECT ad_id AS __ad_id,ad_name AS __ad_name FROM liuxiaowen.ad_dim) AS b
ON (ad_id = __ad_id)
WHERE pt >= '2015-06-13' AND pt <= '2016-06-13'
GROUP BY b.__ad_name
ORDER BY SUM(imp_pv) DESC
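
The conversion above can be approximated with a short Python sketch. It is an illustration under assumptions, not the code in the PyKylin branch: the function name rewrite_for_kylin is hypothetical, the JOIN clause is assumed to go right after the first FROM table, and the sketch neither changes letter case nor touches the LIMIT clause.

```python
import re

# Matches one quoted marker: '$|<join clause>|<field>|$'
MARKER = re.compile(r"'\$\|(?P<join>[^|]*)\|(?P<field>[^|]*)\|\$'")


def rewrite_for_kylin(sql: str) -> str:
    """Replace each marker with its field and add its JOIN clause once."""
    joins = []  # distinct JOIN clauses, in order of first appearance

    def use_field(match):
        clause = match.group("join").strip()
        if clause not in joins:
            joins.append(clause)
        return match.group("field").strip()

    rewritten = MARKER.sub(use_field, sql)
    if not joins:
        return sql
    # Append the JOIN clauses right after the first "FROM <table>".
    return re.sub(
        r"\bFROM\s+[\w.]+",
        lambda m: m.group(0) + " " + " ".join(joins),
        rewritten,
        count=1,
        flags=re.IGNORECASE,
    )


marker = (
    "'$|INNER JOIN (select ad_id as __ad_id,ad_name as __ad_name "
    "from LIUXIAOWEN.AD_DIM) as b ON (ad_id = __ad_id)|b.__ad_name|$'"
)
caravel_sql = (
    f"SELECT {marker} AS ad_name, SUM(imp_pv) AS sum__imp_pv "
    "FROM liuxiaowen.AD_REPORT2 "
    "WHERE pt >= '2015-06-13' AND pt <= '2016-06-13' "
    f"GROUP BY {marker} ORDER BY SUM(imp_pv) DESC LIMIT 50"
)
kylin_sql = rewrite_for_kylin(caravel_sql)
print(kylin_sql)
```

Collecting the JOIN clauses in a list means a marker that appears in both SELECT and GROUP BY adds its join only once.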

For time-series charts, the SQL that Caravel submits to PyKylin is:

 SELECT '$|INNER JOIN (select ad_id as __ad_id,ad_name as __ad_name from LIUXIAOWEN.AD_DIM) as b ON (ad_id = __ad_id)|b.__ad_name|$' AS ad_name,
pt AS _timestamp,
SUM(imp_pv) AS sum__imp_pv
FROM liuxiaowen.AD_REPORT2
JOIN (
SELECT '$|INNER JOIN (select ad_id as __ad_id,ad_name as __ad_name from LIUXIAOWEN.AD_DIM) as b ON (ad_id = __ad_id)|b.__ad_name|$' AS __ad_name
FROM liuxiaowen.AD_REPORT2
WHERE pt >= '2015-06-13' AND pt <= '2016-06-13'
GROUP BY '$|INNER JOIN (select ad_id as __ad_id,ad_name as __ad_name from LIUXIAOWEN.AD_DIM) as b ON (ad_id = __ad_id)|b.__ad_name|$'
ORDER BY SUM(imp_pv) DESC
LIMIT 50
) AS anon_1

ON '$|INNER JOIN (select ad_id as __ad_id,ad_name as __ad_name from LIUXIAOWEN.AD_DIM) as b ON (ad_id = __ad_id)|b.__ad_name|$' = __ad_name
WHERE pt >= '2015-06-13' AND pt <= '2016-06-13'
GROUP BY '$|INNER JOIN (select ad_id as __ad_id,ad_name as __ad_name from LIUXIAOWEN.AD_DIM) as b ON (ad_id = __ad_id)|b.__ad_name|$', pt
ORDER BY SUM(imp_pv) DESC
LIMIT 50000

After PyKylin's optimization, the SQL is:

 SELECT b.__ad_name AS ad_name,
pt AS _timestamp,
SUM(imp_pv) AS sum__imp_pv
FROM liuxiaowen.ad_report2
inner join (SELECT ad_id AS __ad_id,ad_name AS __ad_name FROM liuxiaowen.ad_dim) AS b
ON (ad_id = __ad_id)
WHERE pt >= '2015-06-13' AND pt <= '2016-06-13'
GROUP by b.__ad_name, pt
ORDER by sum(imp_pv) desc
limit 50
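
The main difference between the two time-series statements is that the JOIN against the top-N subquery (anon_1) is gone. One way this flattening could be done is sketched below. The function name drop_subquery_join is hypothetical, and the sketch assumes the markers have already been replaced by their fields and that no string literal contains parentheses. Unlike the output above, it leaves the LIMIT value unchanged.

```python
import re


def drop_subquery_join(sql: str) -> str:
    """Remove the first 'JOIN ( ... ) AS alias ON ...' block before WHERE."""
    start = re.search(r"\bJOIN\s*\(", sql, flags=re.IGNORECASE)
    if start is None:
        return sql
    # Walk forward from the opening parenthesis to its matching close.
    depth = 0
    close = None
    for i in range(start.end() - 1, len(sql)):
        if sql[i] == "(":
            depth += 1
        elif sql[i] == ")":
            depth -= 1
            if depth == 0:
                close = i
                break
    if close is None:
        return sql
    # Skip the alias and ON condition, up to the outer WHERE.
    where = re.search(r"\bWHERE\b", sql[close:], flags=re.IGNORECASE)
    if where is None:
        return sql
    return sql[:start.start()] + sql[close + where.start():]


sql = (
    "SELECT b.__ad_name AS ad_name, pt AS _timestamp, "
    "SUM(imp_pv) AS sum__imp_pv "
    "FROM liuxiaowen.AD_REPORT2 "
    "JOIN (SELECT b.__ad_name AS __ad_name FROM liuxiaowen.AD_REPORT2 "
    "WHERE pt >= '2015-06-13' AND pt <= '2016-06-13' "
    "GROUP BY b.__ad_name ORDER BY SUM(imp_pv) DESC LIMIT 50) AS anon_1 "
    "ON b.__ad_name = __ad_name "
    "WHERE pt >= '2015-06-13' AND pt <= '2016-06-13' "
    "GROUP BY b.__ad_name, pt ORDER BY SUM(imp_pv) DESC LIMIT 50000"
)
flat = drop_subquery_join(sql)
print(flat)
```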

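The same configuration works for joining one dimension table to get several fields, and for joining several dimension tables to get several dimension fields: just add more fields to the Caravel table, writing each expression in the same way.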

Also, this change applies only to data sources of type kylin; other data sources used with Caravel are not affected.

Please credit the source when reposting: lxw的大數據田地 » Open-Source OLAP + Data Visualization Tool for Apache Kylin
