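Apache Drill is a low-latency, distributed engine for interactive queries over very large datasets, including structured, semi-structured, and nested data. It uses ANSI SQL compatible syntax, supports backends such as local files, HDFS, HBase, and MongoDB, and reads data formats such as Parquet, JSON, CSV, TSV, and PSV. Inspired by Google's Dremel, Drill is designed for interactive business-intelligence analysis over petabytes of data on clusters of thousands of nodes.
Drill can be installed on a single machine or on a cluster, and it supports Linux, Windows, and Mac OS X. To keep things simple, this post sets it up on a single Linux machine (CentOS 6.3) for a trial run.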
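Prepare the installation packages:
Install under the $WORK directory (/path/to/work): unpack the JDK and Drill into the java and drill directories respectively, and create symlinks so that later upgrades are easy: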
.
├── drill
│   ├── apache-drill -> apache-drill-0.8.0
│   └── apache-drill-0.8.0
├── init.sh
└── java
    ├── jdk -> jdk1.7.0_75
    └── jdk1.7.0_75
Then add an init.sh script that sets the Java-related environment variables:
export WORK="/path/to/work"
export JAVA="$WORK/java/jdk/bin/java"
export JAVA_HOME="$WORK/java/jdk"
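The symlinks in the layout above are what make upgrades cheap: a new release is unpacked next to the old one and a single link is re-pointed. Here is a minimal Python sketch of that idea, run in a throwaway temporary directory rather than the real $WORK. The point_link helper and the apache-drill-0.9.0 directory name are hypothetical and exist only for illustration, and the sketch assumes a POSIX system where symlinks can be created freely.

```python
import os
import tempfile


def point_link(link_path, target_name):
    """Point link_path at target_name, replacing any existing link.

    The new link is created under a temporary name and then renamed over
    the old one, so the link never disappears in between.
    """
    tmp = link_path + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target_name, tmp)   # relative target, like the layout above
    os.replace(tmp, link_path)     # rename replaces the old symlink itself


# Build a scratch copy of the drill/ part of the layout.
work = tempfile.mkdtemp()
drill_dir = os.path.join(work, "drill")
os.makedirs(os.path.join(drill_dir, "apache-drill-0.8.0"))
os.makedirs(os.path.join(drill_dir, "apache-drill-0.9.0"))  # hypothetical newer release

link = os.path.join(drill_dir, "apache-drill")
point_link(link, "apache-drill-0.8.0")
before = os.readlink(link)

point_link(link, "apache-drill-0.9.0")  # the "upgrade"
after = os.readlink(link)

print(before, "->", after)
```

Because init.sh and the sqlline command only ever refer to the unversioned java/jdk and drill/apache-drill paths, neither needs to change when the link moves.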
On a single machine, starting bin/sqlline is all that is needed:
$ cd $WORK
$ . ./init.sh
$ ./drill/apache-drill/bin/sqlline -u jdbc:drill:zk=local
Drill log directory /var/log/drill does not exist or is not writable, defaulting to ...
Apr 06, 2015 12:47:30 AM org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
sqlline version 1.1.6
0: jdbc:drill:zk=local>
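The option -u jdbc:drill:zk=local tells sqlline to use the Drill instance on the local machine, so ZooKeeper does not have to be started. In a cluster you need to configure and start ZooKeeper and supply its address here instead. Once sqlline is up, commands are typed after the 0: jdbc:drill:zk=local> prompt.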
Drill's sample-data directory contains demo data in Parquet format that can be queried right away:
0: jdbc:drill:zk=local> select * from dfs.`/path/to/work/drill/apache-drill/sample-data/nation.parquet` limit 5;
+-------------+------------+-------------+------------+
| N_NATIONKEY | N_NAME | N_REGIONKEY | N_COMMENT |
+-------------+------------+-------------+------------+
| 0 | ALGERIA | 0 | haggle. carefully f |
| 1 | ARGENTINA | 1 | al foxes promise sly |
| 2 | BRAZIL | 1 | y alongside of the p |
| 3 | CANADA | 1 | eas hang ironic, sil |
| 4 | EGYPT | 4 | y above the carefull |
+-------------+------------+-------------+------------+
5 rows selected (0.741 seconds)
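The table name here takes the form dfs.`absolute path of a local file (Parquet, JSON, CSV, and so on)`. As you can see, anyone familiar with SQL faces almost no learning curve. Parquet files, however, need dedicated tools to view and edit, which is inconvenient, so they will be covered separately later. The rest of this post uses the more common CSV and JSON formats.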
Create the following test.csv file in $WORK/drill/data (the directory that the queries in this post read from):
1101,SteveEurich,Steve,Eurich,16,StoreT
1102,MaryPierson,Mary,Pierson,16,StoreT
1103,LeoJones,Leo,Jones,16,StoreTem
1104,NancyBeatty,Nancy,Beatty,16,StoreT
1105,ClaraMcNight,Clara,McNight,16,Store
Then query it:
0: jdbc:drill:zk=local> select * from dfs.`/path/to/work/drill/data/test.csv`;
+------------+
| columns |
+------------+
| ["1101","SteveEurich","Steve","Eurich","16","StoreT"] |
| ["1102","MaryPierson","Mary","Pierson","16","StoreT"] |
| ["1103","LeoJones","Leo","Jones","16","StoreTem"] |
| ["1104","NancyBeatty","Nancy","Beatty","16","StoreT"] |
| ["1105","ClaraMcNight","Clara","McNight","16","Store"] |
+------------+
5 rows selected (0.082 seconds)
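The result looks slightly different from the previous one. A CSV file has nowhere to store column names, so Drill exposes each row as a single array called columns. To select specific fields, index into it with columns[n], for example: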
0: jdbc:drill:zk=local> select columns[0], columns[3] from dfs.`/path/to/work/drill/data/test.csv`;
+------------+------------+
| EXPR$0 | EXPR$1 |
+------------+------------+
| 1101 | Eurich |
| 1102 | Pierson |
| 1103 | Jones |
| 1104 | Beatty |
| 1105 | McNight |
+------------+------------+
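As a rough mental model of the columns array, here is a small sketch in plain Python (not Drill code) that treats each CSV line as one row holding a single list and then picks positions 0 and 3, as the query above does. The read_columns helper is hypothetical and only mimics the shape of the result; it says nothing about how Drill reads the file internally.

```python
import csv
import io

# Same content as the test.csv file shown earlier.
CSV_TEXT = """1101,SteveEurich,Steve,Eurich,16,StoreT
1102,MaryPierson,Mary,Pierson,16,StoreT
1103,LeoJones,Leo,Jones,16,StoreTem
1104,NancyBeatty,Nancy,Beatty,16,StoreT
1105,ClaraMcNight,Clara,McNight,16,Store
"""


def read_columns(text):
    """Return one row per CSV line, each with a single 'columns' list."""
    return [{"columns": fields} for fields in csv.reader(io.StringIO(text))]


rows = read_columns(CSV_TEXT)

# Equivalent of: select columns[0], columns[3] from ...
projected = [(row["columns"][0], row["columns"][3]) for row in rows]

for emp_id, last_name in projected:
    print(emp_id, last_name)
```

Every value in the list is text, including the ones that look like numbers, which suggests that numeric comparisons or arithmetic on CSV fields need an explicit cast in the query.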
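The CSV format is too simple to show off what Drill is good at, so the more advanced features below are demonstrated with JSON files, which are closer to Parquet in structure.
Create the following test.json file in $WORK/drill/data: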
{
  "ka1": 1,
  "kb1": 1.1,
  "kc1": "vc11",
  "kd1": [
    { "ka2": 10, "kb2": 10.1, "kc2": "vc1010" }
  ]
}
{
  "ka1": 2,
  "kb1": 2.2,
  "kc1": "vc22",
  "kd1": [
    { "ka2": 20, "kb2": 20.2, "kc2": "vc2020" }
  ]
}
{
  "ka1": 3,
  "kb1": 3.3,
  "kc1": "vc33",
  "kd1": [
    { "ka2": 30, "kb2": 30.3, "kc2": "vc3030" }
  ]
}
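This JSON file contains several levels of nesting, so its structure is considerably more complex than the CSV file above. Querying nested data is exactly where Drill shines.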
0: jdbc:drill:zk=local> select * from dfs.`/path/to/work/drill/data/test.json`;
+------------+------------+------------+------------+
| ka1 | kb1 | kc1 | kd1 |
+------------+------------+------------+------------+
| 1 | 1.1 | vc11 | [{"ka2":10,"kb2":10.1,"kc2":"vc1010"}] |
| 2 | 2.2 | vc22 | [{"ka2":20,"kb2":20.2,"kc2":"vc2020"}] |
| 3 | 3.3 | vc33 | [{"ka2":30,"kb2":30.3,"kc2":"vc3030"}] |
+------------+------------+------------+------------+
3 rows selected (0.098 seconds)
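select * only expands the first level; anything deeper is shown as the original JSON. We obviously care about more than the first level, and nested fields can be queried in whatever way is needed: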
0: jdbc:drill:zk=local> select sum(ka1), avg(kd1[0].kb2) from dfs.`/path/to/work/drill/data/test.json`;
+------------+------------+
| EXPR$0 | EXPR$1 |
+------------+------------+
| 6 | 20.2 |
+------------+------------+
1 row selected (0.136 seconds)
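To spell out what this aggregate computes, the sketch below redoes it in plain Python over the same three records. It is only a model of the arithmetic, not of Drill's execution; the read_records helper is hypothetical and is there because the file is a sequence of JSON objects placed one after another rather than a single JSON array.

```python
import json

# Same records as test.json, written compactly.
JSON_TEXT = """
{"ka1": 1, "kb1": 1.1, "kc1": "vc11", "kd1": [{"ka2": 10, "kb2": 10.1, "kc2": "vc1010"}]}
{"ka1": 2, "kb1": 2.2, "kc1": "vc22", "kd1": [{"ka2": 20, "kb2": 20.2, "kc2": "vc2020"}]}
{"ka1": 3, "kb1": 3.3, "kc1": "vc33", "kd1": [{"ka2": 30, "kb2": 30.3, "kc2": "vc3030"}]}
"""


def read_records(text):
    """Parse JSON objects that simply follow one another in the text."""
    decoder = json.JSONDecoder()
    records, pos = [], 0
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            return records
        obj, pos = decoder.raw_decode(text, pos)
        records.append(obj)


records = read_records(JSON_TEXT)

# sum(ka1): add up a first-level field.
sum_ka1 = sum(rec["ka1"] for rec in records)

# avg(kd1[0].kb2): take the first element of the nested list in each
# record, read its kb2 field, then average across records.
kb2_values = [rec["kd1"][0]["kb2"] for rec in records]
avg_kb2 = sum(kb2_values) / len(kb2_values)

print(sum_ka1, round(avg_kb2, 1))
```

The rounding in the last line only tidies floating-point noise for display; the values being averaged are 10.1, 20.2, and 30.3.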
kd1[0] gives access to the table nested at the second level, and its fields can be used in a where clause as well:
0: jdbc:drill:zk=local> select kc1, kd1[0].kc2 from dfs.`/path/to/work/drill/data/test.json` where kd1[0].kb2 = 10.1 and ka1 = 1;
+------------+------------+
| kc1 | EXPR$1 |
+------------+------------+
| vc11 | vc1010 |
+------------+------------+
1 row selected (0.181 seconds)
Create a view:
0: jdbc:drill:zk=local> create view dfs.tmp.tmpview as select kd1[0].kb2 from dfs.`/path/to/work/drill/data/test.json`;
+------------+------------+
| ok | summary |
+------------+------------+
| true | View 'tmpview' created successfully in 'dfs.tmp' schema |
+------------+------------+
1 row selected (0.055 seconds)
0: jdbc:drill:zk=local> select * from dfs.tmp.tmpview;
+------------+
| EXPR$0 |
+------------+
| 10.1 |
| 20.2 |
| 30.3 |
+------------+
3 rows selected (0.193 seconds)
The nested second-level table can also be flattened, which merges kd1[0] through kd1[n] into ordinary rows:
0: jdbc:drill:zk=local> select kddb.kdtable.kc2 from (select flatten(kd1) kdtable from dfs.`/path/to/work/drill/data/test.json`) kddb;
+------------+
| EXPR$0 |
+------------+
| vc1010 |
| vc2020 |
| vc3030 |
+------------+
3 rows selected (0.083 seconds)
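In test.json every kd1 list has exactly one element, so the flattened result looks the same as querying kd1[0]. The difference only shows up with longer lists. The plain-Python sketch below models that with made-up data: the second record carries an extra, hypothetical element (vc2021) that is not in the original file, and the flatten function here is an illustrative stand-in, not Drill's implementation.

```python
# Trimmed records; the second element of the second list is hypothetical.
records = [
    {"ka1": 1, "kd1": [{"kc2": "vc1010"}]},
    {"ka1": 2, "kd1": [{"kc2": "vc2020"}, {"kc2": "vc2021"}]},
]


def flatten(rows, field):
    """Emit one output row per element of rows[i][field].

    The other fields of the row are repeated alongside each element.
    """
    out = []
    for row in rows:
        for element in row[field]:
            new_row = dict(row)       # copy the first-level fields
            new_row[field] = element  # replace the list with one element
            out.append(new_row)
    return out


# kd1[0].kc2 looks at the first element only: one value per record.
first_only = [row["kd1"][0]["kc2"] for row in records]

# Flattening first reaches every element: one value per list element.
flattened = [row["kd1"]["kc2"] for row in flatten(records, "kd1")]

print(first_only)
print(flattened)
```

Indexing yields one value per record, while flattening yields one value per nested element, so flattening is the form to reach for when the nested lists vary in length.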
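The details still differ from MySQL in places, and complex logic over multi-level tables takes some care, so becoming fluent requires reading the official documentation closely and practicing a lot. This post only scratches the surface; later posts will look more deeply at the syntax-level features.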