Apache Drill

Standalone mode

[qihuang.zheng@dp0653 ~]$ cd apache-drill-1.0.0
[qihuang.zheng@dp0653 apache-drill-1.0.0]$ bin/drill-embedded
apache drill 1.0.0
"json ain't no thang"
0: jdbc:drill:zk> select * from cp.`employee.json` limit 2;
+--------------+------------------+-------------+------------+--------------+---------------------+-----------+----------------+-------------+------------------------+----------+----------------+------------------+-----------------+---------+--------------------+
| employee_id  |    full_name     | first_name  | last_name  | position_id  |   position_title    | store_id  | department_id  | birth_date  |       hire_date        |  salary  | supervisor_id  | education_level  | marital_status  | gender  |  management_role   |
+--------------+------------------+-------------+------------+--------------+---------------------+-----------+----------------+-------------+------------------------+----------+----------------+------------------+-----------------+---------+--------------------+
| 1            | Sheri Nowmer     | Sheri       | Nowmer     | 1            | President           | 0         | 1              | 1961-08-26  | 1994-12-01 00:00:00.0  | 80000.0  | 0              | Graduate Degree  | S               | F       | Senior Management  |
| 2            | Derrick Whelply  | Derrick     | Whelply    | 2            | VP Country Manager  | 0         | 1              | 1915-07-03  | 1994-12-01 00:00:00.0  | 40000.0  | 1              | Graduate Degree  | M               | M       | Senior Management  |
+--------------+------------------+-------------+------------+--------------+---------------------+-----------+----------------+-------------+------------------------+----------+----------------+------------------+-----------------+---------+--------------------+
2 rows selected (1.247 seconds)

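Drill uses ZooKeeper for cluster coordination. Here zk=local means Drill runs on the local machine (embedded mode) instead of connecting to an external ZooKeeper ensemble.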

You can also start it with sqlline:

[qihuang.zheng@dp0653 apache-drill-1.0.0]$ bin/sqlline -u jdbc:drill:zk=local
log4j:WARN No appenders could be found for logger (DataNucleus.General).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
六月 15, 2015 11:11:18 上午 org.glassfish.jersey.server.ApplicationHandler initialize
信息: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
apache drill 1.0.0
"a drill in the hand is better than two in the bush"
0: jdbc:drill:zk=local>

To exit Drill:

0: jdbc:drill:zk=local> !quit
Closing: org.apache.drill.jdbc.DrillJdbc41Factory$DrillJdbc41Connection

To start Drill as a background process:

[qihuang.zheng@dp0653 apache-drill-1.0.0]$ bin/drillbit.sh start
starting drillbit, logging to /home/qihuang.zheng/apache-drill-1.0.0/log/drillbit.out

Check the Drill processes:

[qihuang.zheng@dp0653 apache-drill-1.0.0]$ jps -lm
2788 org.apache.drill.exec.server.Drillbit
3045 sqlline.SqlLine -d org.apache.drill.jdbc.Driver --maxWidth=10000 --color=true -u jdbc:drill:zk=local

The first is the drillbit background process; the second is the client process started by sqlline or drill-embedded.

Storage Plugin

cp is the classpath storage plugin. Storage plugins are managed in Drill's web UI: http://192.168.6.53:8047/storage
Drill supports different storage systems and can query data in each of them with SQL.

[Figure: Storage Plugin]

By default only cp and dfs are enabled. Under Disabled Storage Plugins, click Enable on a plugin to make it available.

Adding an HDFS plugin

There is no hdfs plugin by default. Under New Storage Plugin, enter hdfs and click Create,
then enter the storage plugin configuration under Configuration:

{
  "type": "file",
  "enabled": true,
  "connection": "hdfs://192.168.6.53:9000/",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": [
        "tsv"
      ],
      "delimiter": "\t"
    },
    "parquet": {
      "type": "parquet"
    }
  }
}
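
The configuration above can also be generated programmatically, which is handy when the same plugin has to be recreated later. A minimal Python sketch, assuming only the fields shown above; the function name build_file_plugin is hypothetical, and nothing here talks to Drill:

```python
import json

def build_file_plugin(connection, workspaces):
    """Return a Drill "file" storage plugin config as a dict.

    `workspaces` maps a workspace name to a (location, writable) pair.
    The formats section mirrors the csv/tsv/parquet entries shown above.
    """
    return {
        "type": "file",
        "enabled": True,
        "connection": connection,
        "workspaces": {
            name: {
                "location": location,
                "writable": writable,
                "defaultInputFormat": None,
            }
            for name, (location, writable) in workspaces.items()
        },
        "formats": {
            "csv": {"type": "text", "extensions": ["csv"], "delimiter": ","},
            "tsv": {"type": "text", "extensions": ["tsv"], "delimiter": "\t"},
            "parquet": {"type": "parquet"},
        },
    }

hdfs_plugin = build_file_plugin("hdfs://192.168.6.53:9000/", {"root": ("/", True)})
hdfs_plugin_json = json.dumps(hdfs_plugin, indent=2)
print(hdfs_plugin_json)
```

The printed JSON can be pasted into the Configuration box of the web UI.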

Paths (dfs and hdfs)

Set a working directory for the dfs plugin: click Update on the dfs plugin, add a work workspace, and click Update:

"connection": "file:///",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    },
    "work": {
      "location": "/home/qihuang.zheng/apache-drill-1.0.0/",
      "writable": true,
      "defaultInputFormat": null
    }
  },

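For hdfs you can likewise define your own workspace, for example work=/user/zhengqh. Paths are then resolved under /user/zhengqh, and you can query through hdfs.work directly.

DFS files

The tests below query the parquet files under sample-data in the Drill installation directory using different paths:

  • Without the work workspace defined above, the file cannot be found

  • With the custom work workspace (note the syntax: dfs.work.``), a relative path finds the file

  • Absolute path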

0: jdbc:drill:zk> select * from dfs.`sample-data/region.parquet` limit 2;
Error: PARSE ERROR: From line 1, column 15 to line 1, column 17: Table 'dfs.sample-data/region.parquet' not found
[Error Id: a1e53ed6-cc07-4799-9e9f-a7b112bb4e36 on dp0657:31010] (state=,code=0)

0: jdbc:drill:zk> select * from dfs.work.`sample-data/region.parquet` limit 2;
+--------------+----------+-----------------------+
| R_REGIONKEY  |  R_NAME  |       R_COMMENT       |
+--------------+----------+-----------------------+
| 0            | AFRICA   | lar deposits. blithe  |
| 1            | AMERICA  | hs use ironic, even   |
+--------------+----------+-----------------------+
2 rows selected (0.338 seconds)

0: jdbc:drill:zk> select * from dfs.`/home/qihuang.zheng/apache-drill-1.0.0/sample-data/region.parquet` limit 2;
+--------------+----------+-----------------------+
| R_REGIONKEY  |  R_NAME  |       R_COMMENT       |
+--------------+----------+-----------------------+
| 0            | AFRICA   | lar deposits. blithe  |
| 1            | AMERICA  | hs use ironic, even   |
+--------------+----------+-----------------------+
2 rows selected (0.235 seconds)
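
The three results above are consistent with a simple rule: the path in backticks is joined onto the location of the workspace in use. A minimal Python sketch of that rule, assuming that an unqualified dfs path is resolved against the root workspace (location /); the function resolve is hypothetical and only models the observed behaviour, it is not Drill's own resolution code:

```python
import posixpath

# Workspace locations copied from the dfs plugin configuration above.
DFS_WORKSPACES = {
    "root": "/",
    "tmp": "/tmp",
    "work": "/home/qihuang.zheng/apache-drill-1.0.0/",
}

def resolve(table_path, workspace="root", workspaces=DFS_WORKSPACES):
    """Model how a backtick-quoted path maps to a filesystem path."""
    # posixpath.join returns an absolute second argument unchanged,
    # which matches the absolute-path query.
    return posixpath.join(workspaces[workspace], table_path)

# Query 1: relative path, no workspace -> a path that does not exist.
print(resolve("sample-data/region.parquet"))
# Query 2: relative path under dfs.work -> the real file.
print(resolve("sample-data/region.parquet", "work"))
# Query 3: absolute path -> used as is.
print(resolve("/home/qihuang.zheng/apache-drill-1.0.0/sample-data/region.parquet"))
```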

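HDFS files

  • The first query uses a relative path directly, because the root workspace of the hdfs plugin points to / and its connection is configured as hdfs://192.168.6.53:9000/.

  • The second query uses an absolute path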

0: jdbc:drill:zk> select count(*) from hdfs.`user/admin/evidence`;

1 row selected (0.586 seconds)
0: jdbc:drill:zk> select count(*) from hdfs.`hdfs://tdhdfs/user/admin/evidence`;

1 row selected (0.278 seconds)

One more point: you can query all the files in a directory at once, and files in its subdirectories are included as well.

[qihuang.zheng@dp0653 ~]$ /usr/install/hadoop/bin/hadoop fs -ls /user/admin/evidence
15/06/17 08:25:37 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 4 items
-rw-r--r--   3 shuoyi.zhao supergroup   67561800 2015-05-21 10:49 /user/admin/evidence/4b75f114-7f64-40df-9ff6-11a1e75637a7.parquet
-rw-r--r--   3 shuoyi.zhao supergroup   96528887 2015-05-21 10:49 /user/admin/evidence/bdb0bdb4-fa04-402f-af05-b2aea02728ed.parquet
-rw-r--r--   3 shuoyi.zhao supergroup   80968799 2015-05-21 10:49 /user/admin/evidence/da1439fc-0c32-4cd8-90f2-67d24dbaa6cc.parquet
-rw-r--r--   3 shuoyi.zhao supergroup  136852232 2015-05-21 10:50 /user/admin/evidence/f0954a8f-583b-4173-9b89-55ed3107daf1.parquet
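
To see how much data a directory query like the one above covers, the listing can be summed. A small Python sketch that parses the four file lines copied from the hadoop fs -ls output; the helper name total_bytes is hypothetical:

```python
LS_OUTPUT = """\
-rw-r--r--   3 shuoyi.zhao supergroup   67561800 2015-05-21 10:49 /user/admin/evidence/4b75f114-7f64-40df-9ff6-11a1e75637a7.parquet
-rw-r--r--   3 shuoyi.zhao supergroup   96528887 2015-05-21 10:49 /user/admin/evidence/bdb0bdb4-fa04-402f-af05-b2aea02728ed.parquet
-rw-r--r--   3 shuoyi.zhao supergroup   80968799 2015-05-21 10:49 /user/admin/evidence/da1439fc-0c32-4cd8-90f2-67d24dbaa6cc.parquet
-rw-r--r--   3 shuoyi.zhao supergroup  136852232 2015-05-21 10:50 /user/admin/evidence/f0954a8f-583b-4173-9b89-55ed3107daf1.parquet
"""

def total_bytes(ls_output):
    """Sum the size column (fifth field) of every regular-file line."""
    total = 0
    for line in ls_output.splitlines():
        fields = line.split()
        # Regular files start with "-"; directories start with "d".
        if len(fields) >= 8 and fields[0].startswith("-"):
            total += int(fields[4])
    return total

size = total_bytes(LS_OUTPUT)
print(size, "bytes in the evidence directory")
```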

Complex SQL queries

  1. Join of two tables

SELECT nations.name, regions.name FROM (
  SELECT N_REGIONKEY as regionKey, N_NAME as name
  FROM dfs.work.`sample-data/nation.parquet`
) nations join (
  SELECT R_REGIONKEY as regionKey, R_NAME as name
  FROM dfs.work.`sample-data/region.parquet`
) regions 
  on nations.regionKey = regions.regionKey
  order by nations.name;

+-----------------+--------------+
|      name       |    name0     |
+-----------------+--------------+
| ALGERIA         | AFRICA       |
| ARGENTINA       | AMERICA      |
| BRAZIL          | AMERICA      |
| CANADA          | AMERICA      |
| CHINA           | ASIA         |
| EGYPT           | MIDDLE EAST  |
| ETHIOPIA        | AFRICA       |
| FRANCE          | EUROPE       |
| GERMANY         | EUROPE       |
| INDIA           | ASIA         |
| INDONESIA       | ASIA         |
| IRAN            | MIDDLE EAST  |
| IRAQ            | MIDDLE EAST  |
| JAPAN           | ASIA         |
| JORDAN          | MIDDLE EAST  |
| KENYA           | AFRICA       |
| MOROCCO         | AFRICA       |
| MOZAMBIQUE      | AFRICA       |
| PERU            | AMERICA      |
| ROMANIA         | EUROPE       |
| RUSSIA          | EUROPE       |
| SAUDI ARABIA    | MIDDLE EAST  |
| UNITED KINGDOM  | EUROPE       |
| UNITED STATES   | AMERICA      |
| VIETNAM         | ASIA         |
+-----------------+--------------+
25 rows selected (1.038 seconds)
  2. Filtering with IN

SELECT N_REGIONKEY as regionKey, N_NAME as name  
FROM dfs.work.`sample-data/nation.parquet` 
WHERE cast(N_NAME as varchar(10)) IN ('INDIA', 'CHINA');

+------------+--------+
| regionKey  |  name  |
+------------+--------+
| 2          | INDIA  |
| 2          | CHINA  |
+------------+--------+

Connecting Drill to Hive

Using Hive (local environment: cdh542)

Drill ships with a default hive configuration:

{
  "type": "hive",
  "enabled": false,
  "configProps": {
    "hive.metastore.uris": "",
    "javax.jdo.option.ConnectionURL": "jdbc:derby:;databaseName=../sample-data/drill_hive_db;create=true",
    "hive.metastore.warehouse.dir": "/tmp/drill_hive_wh",
    "fs.default.name": "file:///",
    "hive.metastore.sasl.enabled": "false"
  }
}

We change it to use the settings from hive-site.xml:

{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://localhost:9083",
    "hive.metastore.sasl.enabled": "false"
  }
}
  • Start Hadoop: start-all.sh

  • Start Hive: hive --service metastore, and hiveserver2

  • Start Drill: bin/drill-embedded

  • In the Drill shell, some of the syntax resembles Hive's. For example, show databases lists the existing databases, and use xxx switches to one of them.

0: jdbc:drill:zk=local> show databases;
+---------------------+
|     SCHEMA_NAME     |
+---------------------+
| INFORMATION_SCHEMA  |
| cp.default          |
| dfs.default         |
| dfs.root            |
| dfs.tmp             |
| hive.default        |
| hive.wiki           |
| sys                 |
+---------------------+
8 rows selected (0.627 seconds)
0: jdbc:drill:zk=local> use hive.wiki;
+-------+----------------------------------------+
|  ok   |                summary                 |
+-------+----------------------------------------+
| true  | Default schema changed to [hive.wiki]  |
+-------+----------------------------------------+
1 row selected (0.156 seconds)
0: jdbc:drill:zk=local> show tables;
+---------------+-------------+
| TABLE_SCHEMA  | TABLE_NAME  |
+---------------+-------------+
| hive.wiki     | invites     |
| hive.wiki     | pokes       |
| hive.wiki     | u_data      |
| hive.wiki     | u_data_new  |
+---------------+-------------+
4 rows selected (1.194 seconds)
0: jdbc:drill:zk=local> select count(*) from invites;
+---------+
| EXPR$0  |
+---------+
| 525     |
+---------+
1 row selected (4.204 seconds)

Hive test environment (hive-1.2.0)

Modify the hive configuration:

{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://192.168.6.53:9083",
    "javax.jdo.option.ConnectionURL": "jdbc:mysql://192.168.6.53:3306/hive?characterEncoding=UTF-8",
    "hive.metastore.warehouse.dir": "/user/hive/warehouse",
    "fs.default.name": "hdfs://tdhdfs",
    "hive.metastore.sasl.enabled": "false"
  }
}

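Note: of the optional settings above (highlighted in green in the original page), only fs.default.name is actually needed.

Problem: cannot query Hive table data
In the test environment, queries against Hive tables never returned, although show databases, show tables, and describe xx all worked.
For example, select count(*) from employee; simply hung, and the web UI showed the query in pending state.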

[Figure: pending job]

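After cancelling with Control+C, commands that had worked before no longer ran, and !quit did not exit either. The only way out was to kill the sqlline process with kill -9 pid.

Tracing the problem
Set the log level in conf/logback.xml to debug so that every command prints log output.
The earlier statements were fine, but when querying Hive table data the log showed an error connecting to a different address: tdhdfs/220.250.64.20:8020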

[Figure: strange address]

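Thinking it through
Searching for things like "hadoop /etc/hosts dns" and "hdfs ha dns" turned up nothing.
Then I noticed that in 220.250.64.20:8020, port 8020 is simply the default port.
Our test cluster uses HDFS HA, and it uses port 9000.
Without the HA settings, tdhdfs is presumably treated as an ordinary hostname and resolved to an unrelated address.

So Drill had not found the Hadoop configuration at all, even though fs.default.name was set to hdfs://tdhdfs on the hive plugin page.
Since Drill could not find the Hadoop configuration files, we have to tell it where they are ourselves.

Other issues
Starting drill-embedded also logs an error: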
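
The reasoning can be illustrated with a small check: an HDFS client can only treat tdhdfs as an HA nameservice if dfs.nameservices declares it in a configuration file the client can see. A minimal Python sketch, assuming the standard Hadoop HA property names; the XML fragment and its host names are hypothetical placeholders, not the real cluster settings:

```python
import xml.etree.ElementTree as ET

def read_hadoop_props(xml_text):
    """Parse a Hadoop *-site.xml document into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.findall("property")
    }

def nameservice_defined(props, authority):
    """True if `authority` is declared as an HA nameservice."""
    declared = (props.get("dfs.nameservices") or "").split(",")
    return authority in [name.strip() for name in declared]

# Hypothetical hdfs-site.xml fragment; hosts are placeholders.
SAMPLE_HDFS_SITE = """
<configuration>
  <property><name>dfs.nameservices</name><value>tdhdfs</value></property>
  <property><name>dfs.ha.namenodes.tdhdfs</name><value>nn1,nn2</value></property>
  <property><name>dfs.namenode.rpc-address.tdhdfs.nn1</name><value>namenode1.example:9000</value></property>
  <property><name>dfs.namenode.rpc-address.tdhdfs.nn2</name><value>namenode2.example:9000</value></property>
</configuration>
"""

with_config = read_hadoop_props(SAMPLE_HDFS_SITE)
without_config = {}  # what a client sees when no hdfs-site.xml is available

# With the file: tdhdfs is a known nameservice.
print(nameservice_defined(with_config, "tdhdfs"))
# Without it: tdhdfs can only be interpreted as a plain hostname.
print(nameservice_defined(without_config, "tdhdfs"))
```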

10:43:54.789 [main] DEBUG org.apache.hadoop.util.Shell - Failed to detect a valid hadoop home directory
java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.

It does not prevent Drill from starting, but fix it anyway:
vi ~/.bashrc

export HADOOP_HOME="/usr/install/hadoop"
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HDFS_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export YARN_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export PATH="$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"

And add the following to drill-env.sh:

export HADOOP_HOME="/usr/install/hadoop"

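With these settings the startup error goes away, but they do not solve the earlier problem.

Solution
Copy core-site.xml, mapred-site.xml, hdfs-site.xml, and yarn-site.xml from the Hadoop installation directory to DRILL/conf.
Restart bin/drill-embedded. (After the restart the earlier hive and hdfs plugin configurations were gone, so they have to be updated again.)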

[qihuang.zheng@dp0653 conf]$ ll -rt
-rw-r--r--. 1 qihuang.zheng users 3835 5月  16 10:35 drill-override-example.conf
-rwxr-xr-x. 1 qihuang.zheng users 1276 6月  16 10:51 drill-env.sh
-rw-r--r--. 1 qihuang.zheng users 2354 6月  16 15:12 core-site.xml
-rw-r--r--. 1 qihuang.zheng users 3257 6月  16 15:12 hdfs-site.xml
-rw-r--r--. 1 qihuang.zheng users 2111 6月  16 15:12 mapred-site.xml
-rw-r--r--. 1 qihuang.zheng users 8382 6月  16 15:12 yarn-site.xml
-rw-r--r--. 1 qihuang.zheng users 3119 6月  16 15:25 logback.xml
-rw-r--r--. 1 qihuang.zheng users 1237 6月  16 15:35 drill-override.conf

Hive data types

Drill does not currently support some of Hive's complex types, such as LIST, MAP, STRUCT, and UNION.

0: jdbc:drill:zk> describe sadf;
+-------------------+---------------------------------------+--------------+
|    COLUMN_NAME    |               DATA_TYPE               | IS_NULLABLE  |
+-------------------+---------------------------------------+--------------+
| id                | VARCHAR                               | YES          |
| time              | BIGINT                                | YES          |
| a_Map             | (VARCHAR(65535), VARCHAR(65535)) MAP  | NO           |
| indice            | VARCHAR                               | YES          |
+-------------------+---------------------------------------+--------------+
8 rows selected (0.504 seconds)
0: jdbc:drill:zk> select count(*) from sadf;
Error: SYSTEM ERROR: java.lang.RuntimeException: Unsupported Hive data type MAP.
Following Hive data types are supported in Drill for querying: BOOLEAN, BYTE, SHORT, INT, LONG, FLOAT, DOUBLE, DATE, TIMESTAMP, BINARY, DECIMAL, STRING, and VARCHAR

Fragment 1:0

[Error Id: 7c5dd0d4-1e18-4dbc-b470-1eb6ca6a3b36 on dp0653:31010] (state=,code=0)
0: jdbc:drill:zk> describe int_table;
+--------------+------------+--------------+
| COLUMN_NAME  | DATA_TYPE  | IS_NULLABLE  |
+--------------+------------+--------------+
| id           | INTEGER    | YES          |
+--------------+------------+--------------+
1 row selected (0.334 seconds)
0: jdbc:drill:zk> select count(*) from int_table;
+---------+
| EXPR$0  |
+---------+
| 90      |
+---------+
1 row selected (0.654 seconds)

So what can we do when we want to query a Hive table that has a map-typed column?

I found this reply on the Drill mailing list:
http://mail-archives.apache.org/mod_mbox/drill-dev/201504.mbox/browser

We haven't yet added support for Hive's Map type. Can we work together on
adding this? Drill doesn't distinguish between maps and structs given its
support for schemaless data. If you could post a small example piece of
data, maybe we could figure out the best way to work together to add this
functionality. As I said, it is mostly just a metadata mapping exercise
since Drill already has complex type support in the execution engine. You
can see how it works by looking at the JSONReader complex Parquet reader.

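In short: Hive's map type is not supported yet, and if you are able to, you are welcome to work with them on adding it.
Why? Because Drill supports schemaless data, a map type and a struct type are the same thing to Drill.
Drill's execution engine already supports complex types; see how the JSON reader or the complex Parquet reader handles them.

It then occurred to me that although Drill does not support a Hive table schema containing a map type, Drill can still use Hive's data.
Since a Hive table has a definite schema, its data files must have a definite format too.
So although Drill can share table definitions with Hive, reading Hive's table data directly is effectively the same as using the hdfs plugin.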
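
One way to act on this idea is to point the hdfs plugin at the directory that holds the table's files. A minimal Python sketch, assuming Hive's default warehouse layout (tables of the default database directly under the warehouse directory, other databases under <db>.db) and the warehouse path /user/hive/warehouse from the configuration above. Both helper names are hypothetical, external tables may live elsewhere, and whether Drill can read the files depends on their storage format:

```python
import posixpath

def hive_table_dir(table, database="default", warehouse="/user/hive/warehouse"):
    """Directory of a managed Hive table under the default warehouse layout."""
    if database == "default":
        return posixpath.join(warehouse, table)
    return posixpath.join(warehouse, database + ".db", table)

def drill_count_query(table, database="default", plugin="hdfs"):
    """Build a Drill query that reads the table's files through a file plugin."""
    return "select count(*) from {}.`{}`".format(plugin, hive_table_dir(table, database))

print(drill_count_query("koudai"))
print(drill_count_query("invites", database="wiki"))
```

The generated statements bypass the hive plugin entirely, so the Hive column types never come into play; column names and types then come from the files themselves.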

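Drill distributed mode

  • Take the single-machine setup above and copy the drill directory to every node in the cluster. Remember to edit drill-override.conf under Drill's conf directory: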

drill.exec: {
  cluster-id: "drillbits1",
  zk.connect: "192.168.6.55:2181,192.168.6.56:2181,192.168.6.57:2181"
}

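Every node uses the same cluster-id, which is what makes all the nodes form one cluster.
This is similar to installing an ElasticSearch cluster. The benefit is that nodes can be added at any time without changing any existing configuration.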
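
Because the file is identical on every node, it can be rendered once and copied everywhere. A minimal Python sketch that produces the same two settings shown above; the function name drill_override is hypothetical, and the default ZooKeeper port 2181 is taken from the example:

```python
def drill_override(cluster_id, zk_hosts, zk_port=2181):
    """Render a minimal drill-override.conf for one cluster."""
    connect = ",".join("{}:{}".format(host, zk_port) for host in zk_hosts)
    return (
        "drill.exec: {\n"
        '  cluster-id: "' + cluster_id + '",\n'
        '  zk.connect: "' + connect + '"\n'
        "}\n"
    )

conf_text = drill_override(
    "drillbits1",
    ["192.168.6.55", "192.168.6.56", "192.168.6.57"],
)
print(conf_text)
```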

  • Then run bin/drillbit.sh start on every machine

  • Open port 8047 on any of the machines to see all the Drill services in the cluster

http://192.168.6.52:8047/
http://192.168.6.53:8047/
http://192.168.6.54:8047/
http://192.168.6.56:8047/
http://192.168.6.57:8047/

[Figure: distributed mode]

Client connections

Drill provides some tools of its own, and third-party tools also offer ways to access Drill data.
The main point is that, like other SQL databases, Drill provides an ODBC driver. See: https://drill.apache.org/docs/interfaces-introduction/

sqlline

Connecting with zk=local

bin/sqlline -u jdbc:drill:zk=local

Connecting to a ZK cluster

  • Specify ZK manually

[qihuang.zheng@dp0653 apache-drill-1.0.0]$ bin/sqlline -u jdbc:drill:zk=192.168.6.55,192.168.6.56,192.168.6.57:2181
  • In cluster mode you can also leave out the ZK address: bin/sqlline -u jdbc:drill:zk reads it from drill-override.conf automatically
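
The connection strings above follow one pattern, which a small helper can make explicit. A minimal Python sketch; the function name drill_jdbc_url is hypothetical. The optional /<directory>/<cluster ID> suffix follows the URL form described in Drill's JDBC documentation, and the directory name drill used in the example is an assumed default that should be checked against your installation:

```python
def drill_jdbc_url(zk_hosts=None, zk_port=2181, directory=None, cluster_id=None):
    """Build a Drill JDBC URL for sqlline's -u option."""
    if not zk_hosts:
        # No ZooKeeper hosts given: same URL as the local example above.
        return "jdbc:drill:zk=local"
    quorum = ",".join("{}:{}".format(host, zk_port) for host in zk_hosts)
    url = "jdbc:drill:zk=" + quorum
    if directory and cluster_id:
        url += "/{}/{}".format(directory, cluster_id)
    return url

ZK_HOSTS = ["192.168.6.55", "192.168.6.56", "192.168.6.57"]
print(drill_jdbc_url())
print(drill_jdbc_url(ZK_HOSTS))
print(drill_jdbc_url(ZK_HOSTS, directory="drill", cluster_id="drillbits1"))
```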

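Caveats

When the ZK you specify is brand new, the Storage Plugin settings made earlier under zk=local are not available in the new ZK session.

The ZooKeeper cluster we specified is new, so Drill has not written any data to it yet.
This happens because every update or create of a Storage Plugin in the web UI is written to the corresponding ZooKeeper node.
After we update hive in the web UI and enable it, show databases lists the Hive databases again.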

iodbc

iodbc data source manager

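Pick an existing driver and change the connection type. For ZooKeeper, specify the ZK cluster and the clusterId.
For Direct, specify the host and port of the Drill node to connect to.
[Figure: iodbc]

After the connection test succeeds, create a new SQL query.
[Figure: iodbc query]

Click OK to get the query result.
[Figure: iodbc result]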

iodbc terminal

$ iodbctest
iODBC Demonstration program
This program shows an interactive SQL processor
Driver Manager: 03.52.0607.1008

Enter ODBC connect string (? shows list): ?

DSN                              | Driver
------------------------------------------------------------------------------
Sample MapR Drill DSN            | MapR Drill ODBC Driver

Enter ODBC connect string (? shows list): DRIVER=MapR Drill ODBC Driver;AdvancedProperties= {HandshakeTimeout=0;QueryTimeout=0; TimestampTZDisplayTimezone=utc;ExcludedSchemas=sys, INFORMATION_SCHEMA;};Catalog=DRILL;Schema=; ConnectionType=Direct;Host=192.168.6.53;Port=31010
1: SQLDriverConnect = [iODBC][Driver Manager]dlopen(MapR Drill ODBC Driver, 6): image not found (0) SQLSTATE=00000
2: SQLDriverConnect = [iODBC][Driver Manager]Specified driver could not be loaded (0) SQLSTATE=IM003

Enter ODBC connect string (? shows list): DSN=Sample MapR Drill DSN;ConnectionType=Direct;Host=192.168.6.53;Port=31010
Driver: 1.0.0.1001 (MapR Drill ODBC Driver)

SQL>select count(*) from cp.`employee.json`

EXPR$0
--------------------
1155

 result set 1 returned 1 rows.


SQL>

Note the ODBC connect string entered above. Some places in the official documentation write it as:

DRIVER=MapR Drill ODBC Driver;AdvancedProperties= {HandshakeTimeout=0;QueryTimeout=0; TimestampTZDisplayTimezone=utc;ExcludedSchemas=sys, INFORMATION_SCHEMA;};Catalog=DRILL;Schema=; ConnectionType=Direct;Host=192.168.6.53;Port=31010

This fails with image not found. The format that works is:

DSN=Sample MapR Drill DSN;ConnectionType=Direct;Host=192.168.6.53;Port=31010
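
The working string is just a DSN name followed by key=value pairs joined with semicolons. A minimal Python sketch that assembles it; the helper odbc_connect_string is hypothetical and performs no connection:

```python
def odbc_connect_string(dsn, **options):
    """Join a DSN name and key=value options into an ODBC connect string."""
    parts = ["DSN=" + dsn]
    # Keyword arguments keep their call order, so the output is stable.
    parts += ["{}={}".format(key, value) for key, value in options.items()]
    return ";".join(parts)

conn = odbc_connect_string(
    "Sample MapR Drill DSN",
    ConnectionType="Direct",
    Host="192.168.6.53",
    Port=31010,
)
print(conn)
```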

Drill Explorer

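Drill Explorer uses the same connection string format as above; enter it under Advanced.

[Figure: explorer connect]

In Drill Explorer you can browse data and create views.
[Figure: explorer]

Performance tests

Querying Parquet files on HDFS (standalone vs. distributed mode)

Standalone mode: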

0: jdbc:drill:zk=local> select count(*) from hdfs.`/user/admin/evi`;
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
+----------+
|  EXPR$0  |
+----------+
| 4003278  |
+----------+
1 row selected (0.854 seconds)

0: jdbc:drill:zk=local> select fraud_type,count(*) from hdfs.`/user/admin/evi` group by fraud_type order by count(*) desc;

+--------------------+----------+
14 rows selected (4.451 seconds)

Distributed mode on five machines:

0: jdbc:drill:zk=192.168.6.55,192.168.6.56,19> select count(*) from hdfs.`/user/admin/evidence`;
+----------+
|  EXPR$0  |
+----------+
| 4003278  |
+----------+
1 row selected (0.394 seconds)

0: jdbc:drill:zk=192.168.6.55,192.168.6.56,19> select fraud_type,count(*) from hdfs.`/user/admin/evidence` group by fraud_type order by count(*) desc;
+--------------------+----------+
|     fraud_type     |  EXPR$1  |
+--------------------+----------+
14 rows selected (1.744 seconds)
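
The two modes can be compared by dividing the elapsed times reported by sqlline. A small Python sketch using only the four timings shown above. These are single runs, the two sessions query different paths (/user/admin/evi and /user/admin/evidence) with the same row count, and first-run effects are not separated out, so the ratios are only indicative:

```python
# (standalone seconds, distributed seconds), copied from the sqlline output.
TIMINGS = {
    "count(*)": (0.854, 0.394),
    "group by + order by": (4.451, 1.744),
}

def speedup(standalone, distributed):
    """How many times faster the distributed run was."""
    return standalone / distributed

speedups = {name: speedup(*pair) for name, pair in TIMINGS.items()}
for name, ratio in speedups.items():
    print("{}: {:.2f}x".format(name, ratio))
```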

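Observation 1: the Foreman is not fixed
On the Profiles page at port 8047 of each machine, the query list does not necessarily show up on every machine.
For example, when a query was run on dp0655, only dp0656 showed it; the other machines did not.
Even from the same client, different sessions may run on different Foremen.

[Figure: foreman]

A: The Foreman acts like a facade: it is the node that returns the final result to the client, so one is enough.
In distributed execution the Foreman decides how to dispatch work to the other Drillbit nodes.

To verify: open a query under Profiles. The first Major Fragment usually runs on the Foreman node,
and the remaining Major Fragments are distributed to different nodes.

[Figure: profile]

Drill's architecture diagrams also show that there is only one Foreman.

[Figure: query-flow-client]

[Figure: leaf-frag]

Observation 2: the first query is slow
Some queries are slow the first time they run; the same statement later runs more than twice as fast,
and the timing eventually stabilizes, as in the group by / order by test above (4 million rows, grouped and then sorted).
Columns are the machines on which the query was run, in order; rows are repeated runs of the same SQL statement on that machine. Times are in seconds.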

Round dp0653 dp0652 dp0655 dp0657 dp0656 dp0653
Round1 7.871 5.079 4.8299 1.764 1.557 4.305
Round2 2.549 2.103 2.106 1.66 1.418 1.854
Round3 1.888 1.893 1.779 1.534 1.512 1.955
Round4 1.841 1.641 1.703
Round5 1.744 1.714 1.9
Round6 1.763 1.572 1.53

Drill vs. Hive query comparison

TODO

Q&A

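  • [ ] Q: Why is the first run slower?

  • [x] Q: Since Drill is distributed, why does each query go to only one Foreman?

    A: The Foreman is only the node that returns the final result to the client, so one is enough.
      The Foreman distributes the actual query work to multiple nodes.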

TODO

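  • The test environment's Hive has few tables, and the koudai table has a map-typed column, which Drill cannot query.
    Plan to import some test datasets and test with those.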

  • Drill + Tableau

