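After a period of modification and debugging by my friend Rocky and me, the first version of Caravel for Kylin is now on GitHub for you to use. If you find any problems, contact us through this blog or GitHub.
GitHub repositories: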
https://github.com/rocky1001/pykylin/tree/caravel-kylin
https://github.com/rocky1001/caravel/tree/caravel-kylin
Based on Caravel 0.8.9.
About Caravel: http://airbnb.io/caravel/ http://lxw1234.com/archives/2016/06/681.htm
About PyKylin: https://github.com/wxiang7/pykylin (thanks to the author, @Wu Xiang)
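Changes in PyKylin:
Fixed the handling of Chinese text in WHERE and HAVING conditions.
Added support for multi-table join queries in Kylin (the implementation is crude, but it basically works).
Optimized the SQL generated for several time-series charts when querying multiple tables in Kylin.
Changes in Caravel:
Fixed some syntax that Kylin does not support, for example using the keyword timestamp as the alias of the time column.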
Installing Caravel:
Follow the official installation guide: http://airbnb.io/caravel/installation.html
Installing PyKylin:
https://github.com/rocky1001/pykylin/tree/caravel-kylin
Python 3 is strongly recommended, to avoid problems with Chinese text.
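Starting Caravel:
nohup gunicorn -w 16 --timeout 60 -b 0.0.0.0:8080 caravel:app >> /tmp/caravel.log 2>&1 &
Note: we do not recommend the startup command given on the official site, caravel runserver -d.
Using gunicorn keeps the Caravel server from dying when a query page is closed.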
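The same options can also be kept in a gunicorn configuration file, which is plain Python. A minimal sketch, assuming a file named gunicorn_conf.py (the file name is hypothetical; the values mirror the command above):

```python
# gunicorn_conf.py (hypothetical file name)
# Same values as the command-line flags -b, -w and --timeout used above.
bind = "0.0.0.0:8080"   # address and port to listen on
workers = 16            # number of worker processes
timeout = 60            # seconds before an unresponsive worker is restarted
```

With such a file, the server could be started with gunicorn -c gunicorn_conf.py caravel:app, keeping nohup and the log redirection as before.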
For ordinary single-table usage, see:
http://lxw1234.com/archives/2016/06/681.htm
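To support multi-table queries in Kylin, typically a fact table joined with several dimension tables to obtain derived dimensions, we add a custom column to the Caravel table and give it an expression that follows a fixed convention (a string constant). When the query reaches PyKylin, PyKylin parses the string constant, rewrites the query into SQL that joins the dimension table, and returns the result.
The approach is somewhat crude, but it basically works; you can use the idea as a starting point for further optimization and changes.
Take the fact table AD_REPORT2 as an example. It has a dimension ID column AD_ID; when the Cube was built in Kylin, the dimension table AD_DIM was INNER JOINed to obtain the dimension name AD_NAME.
In Caravel, add a column ad_name to AD_REPORT2,
whose expression is the string constant:
'$|INNER JOIN (select ad_id as __ad_id,ad_name as __ad_name from LIUXIAOWEN.AD_DIM) as b ON (ad_id = __ad_id)|b.__ad_name|$'
The string starts and ends with $, and its parts are separated by |.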
INNER JOIN (select ad_id as __ad_id,ad_name as __ad_name from LIUXIAOWEN.AD_DIM) as b ON (ad_id = __ad_id)
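defines the dimension table, the JOIN type and the ON condition. This string is added to the original SQL verbatim as the JOIN clause.
The dimension table's columns are prefixed with __ to distinguish them from the fact table's columns without having to deal with table aliases.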
b.__ad_name
defines the column whose value is used as the final value of the field.
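The layout of the expression can be illustrated with a small parser. This is only a sketch of the convention described above, not the code in PyKylin; the names JoinExpr and parse_join_expr are made up for illustration, and it assumes that neither part contains a | character.

```python
from typing import NamedTuple


class JoinExpr(NamedTuple):
    join_clause: str  # text to be added to the SQL as the JOIN clause
    field: str        # column used as the final value of the field


def parse_join_expr(expr: str) -> JoinExpr:
    """Split a '$|<join clause>|<field>|$' constant into its two parts."""
    text = expr.strip().strip("'")
    parts = text.split("|")
    # Expected layout after splitting: ['$', join clause, field, '$']
    if len(parts) != 4 or parts[0] != "$" or parts[3] != "$":
        raise ValueError("not a join expression: %r" % expr)
    return JoinExpr(parts[1].strip(), parts[2].strip())


expr = ("'$|INNER JOIN (select ad_id as __ad_id,ad_name as __ad_name "
        "from LIUXIAOWEN.AD_DIM) as b ON (ad_id = __ad_id)|b.__ad_name|$'")
parsed = parse_join_expr(expr)
print(parsed.join_clause)
print(parsed.field)
```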
When ad_name is selected as a dimension in a query, the SQL that Caravel submits to PyKylin is:
SELECT '$|INNER JOIN (select ad_id as __ad_id,ad_name as __ad_name from LIUXIAOWEN.AD_DIM) as b ON (ad_id = __ad_id)|b.__ad_name|$' AS ad_name, SUM(imp_pv) AS sum__imp_pv FROM liuxiaowen.AD_REPORT2 WHERE pt >= '2015-06-13' AND pt <= '2016-06-13' GROUP BY '$|INNER JOIN (select ad_id as __ad_id,ad_name as __ad_name from LIUXIAOWEN.AD_DIM) as b ON (ad_id = __ad_id)|b.__ad_name|$' ORDER BY SUM(imp_pv) DESC LIMIT 50
After conversion, the SQL that PyKylin submits to Kylin is:
SELECT b.__ad_name as ad_name, SUM(imp_pv) AS sum__imp_pv FROM liuxiaowen.ad_report2 inner join (SELECT ad_id AS __ad_id,ad_name AS __ad_name FROM liuxiaowen.ad_dim) AS b ON (ad_id = __ad_id) WHERE pt >= '2015-06-13' AND pt <= '2016-06-13' GROUP BY b.__ad_name ORDER BY SUM(imp_pv) DESC
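The conversion for this simple case can be sketched in a few lines of Python. This is an illustration of the idea, not PyKylin's actual implementation: the function name rewrite and both regular expressions are assumptions, the fact table is assumed to be the first plain table name after FROM, and the sketch leaves letter case and the LIMIT clause untouched.

```python
import re

# Matches a quoted '$|<join clause>|<field>|$' constant.
JOIN_EXPR = re.compile(r"'\$\|(?P<join>[^|']+)\|(?P<field>[^|']+)\|\$'")
# Matches the first "FROM <table>" (assumes a plain, unquoted table name).
FROM_TABLE = re.compile(r"\bFROM\s+[\w.]+", re.IGNORECASE)


def rewrite(sql: str) -> str:
    joins = []

    def to_field(match):
        join = match.group("join").strip()
        if join not in joins:  # the same constant appears several times
            joins.append(join)
        return match.group("field").strip()

    # 1. Replace every constant with the column it names.
    sql = JOIN_EXPR.sub(to_field, sql)
    if not joins:
        return sql
    # 2. Add the collected JOIN clauses right after the fact table.
    return FROM_TABLE.sub(lambda m: m.group(0) + " " + " ".join(joins),
                          sql, count=1)


expr = ("'$|INNER JOIN (select ad_id as __ad_id,ad_name as __ad_name "
        "from LIUXIAOWEN.AD_DIM) as b ON (ad_id = __ad_id)|b.__ad_name|$'")
sql = ("SELECT " + expr + " AS ad_name, SUM(imp_pv) AS sum__imp_pv "
       "FROM liuxiaowen.AD_REPORT2 "
       "WHERE pt >= '2015-06-13' AND pt <= '2016-06-13' "
       "GROUP BY " + expr + " ORDER BY SUM(imp_pv) DESC LIMIT 50")
result = rewrite(sql)
print(result)
```

The time-series query needs more work than this, because the sub-query that Caravel generates also has to be dealt with; the sketch does not attempt that.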
For time-series charts, the SQL that Caravel submits to PyKylin is:
SELECT '$|INNER JOIN (select ad_id as __ad_id,ad_name as __ad_name from LIUXIAOWEN.AD_DIM) as b ON (ad_id = __ad_id)|b.__ad_name|$' AS ad_name, pt AS _timestamp, SUM(imp_pv) AS sum__imp_pv FROM liuxiaowen.AD_REPORT2 JOIN ( SELECT '$|INNER JOIN (select ad_id as __ad_id,ad_name as __ad_name from LIUXIAOWEN.AD_DIM) as b ON (ad_id = __ad_id)|b.__ad_name|$' AS __ad_name FROM liuxiaowen.AD_REPORT2 WHERE pt >= '2015-06-13' AND pt <= '2016-06-13' GROUP BY '$|INNER JOIN (select ad_id as __ad_id,ad_name as __ad_name from LIUXIAOWEN.AD_DIM) as b ON (ad_id = __ad_id)|b.__ad_name|$' ORDER BY SUM(imp_pv) DESC LIMIT 50 ) AS anon_1 ON '$|INNER JOIN (select ad_id as __ad_id,ad_name as __ad_name from LIUXIAOWEN.AD_DIM) as b ON (ad_id = __ad_id)|b.__ad_name|$' = __ad_name WHERE pt >= '2015-06-13' AND pt <= '2016-06-13' GROUP BY '$|INNER JOIN (select ad_id as __ad_id,ad_name as __ad_name from LIUXIAOWEN.AD_DIM) as b ON (ad_id = __ad_id)|b.__ad_name|$', pt ORDER BY SUM(imp_pv) DESC LIMIT 50000
After PyKylin's optimization, the SQL is:
SELECT b.__ad_name AS ad_name, pt AS _timestamp, SUM(imp_pv) AS sum__imp_pv FROM liuxiaowen.ad_report2 inner join (SELECT ad_id AS __ad_id,ad_name AS __ad_name FROM liuxiaowen.ad_dim) AS b ON (ad_id = __ad_id) WHERE pt >= '2015-06-13' AND pt <= '2016-06-13' GROUP by b.__ad_name, pt ORDER by sum(imp_pv) desc limit 50
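The same configuration works when joining one dimension table to obtain several columns, and when joining several dimension tables to obtain several dimension columns: just add more columns to the Caravel table and write each expression the same way.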
Also, these changes apply only to data sources of type kylin; other data sources used by Caravel are unaffected.