1. Impala Architecture
Impala is a real-time, interactive SQL big-data query tool developed by Cloudera, inspired by Google's Dremel. Instead of going through the slow Hive + MapReduce batch-processing path, Impala uses a distributed query engine similar to those in commercial parallel (MPP) relational databases, made up of three parts: the Query Planner, the Query Coordinator, and the Query Exec Engine. It can query data directly from HDFS or HBase using SELECT, JOIN, and aggregate functions, which greatly reduces latency.
Impala architecture diagram: (image omitted)
Impala consists of three services: impalad, statestored, and catalogd. impalad runs on each data node and executes query fragments; statestored tracks the membership and health of the impalad daemons; catalogd propagates metadata changes (databases, tables, and so on) from the Hive metastore to every impalad.
Execution plan:
Impala generates an execution plan through parsing and analysis. The plan takes the shape of a complete execution-plan tree, which lends itself naturally to being distributed to the individual impalad instances for execution. After the plan is distributed, Impala fetches results in a pull-based fashion, streaming result data up the execution tree for aggregation; this avoids the step of writing intermediate results to disk and the cost of reading them back.
Impala's frontend (written in Java) is responsible for turning SQL into an execution plan, in two phases: single-node plan generation, then parallelization and fragmentation. The first phase parses, analyzes, and optimizes the SQL (both RBO and CBO; the only statistics currently available are table sizes and per-column NDVs, with no histograms). The second phase produces the distributed plan: deciding whether exchange nodes are needed (i.e., whether there is a partitioned join or hash aggregation), choosing the join strategy (partitioned join or broadcast join), and finally cutting the plan into fragments at the exchange boundaries. A fragment is Impala's basic unit of execution. You can inspect the result with EXPLAIN, as sketched below.
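To see the fragmentation for yourself, run EXPLAIN on a query from impala-shell. A minimal sketch (it uses the page_view table created later in this post; the exact plan depends on your data and statistics):

[cdh4:21000] > EXPLAIN SELECT userid, count(*) FROM page_view GROUP BY userid;

The returned plan tree shows the scan and pre-aggregation fragments running on the data nodes, with EXCHANGE operators marking the fragment boundaries where rows are redistributed for the final aggregation.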
Advantages and disadvantages of Impala compared to Hive:
Advantages: queries do not go through slow MapReduce batch jobs, and intermediate results are streamed through memory rather than written to disk, so latency is far lower. (The original comparison diagram is omitted.)
2. Installing Impala
Install through CDH; see the environment set up earlier (https://blog.csdn.net/liaomin416100569/article/details/80045833). Make sure Hadoop and Hive are installed before you begin.
In the CDH cluster, add the Impala service.
Since Hadoop was previously installed on cdh4 (single node) and Hive on cdh3 (single node):
Catalog Server and StateStore are installed on cdh3.
The Impala Daemon must be installed on the data node, cdh4.
After installation completes, the install path is /opt/cloudera/parcels/CDH/lib/impala.
Note: if anything goes wrong, you can usually find the cause in the logs under the corresponding directory in /var/log.
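For example (a sketch assuming CDH's default glog-style log locations; adjust the paths if your roles log elsewhere):

tail -n 50 /var/log/impalad/impalad.INFO
tail -n 50 /var/log/statestore/statestored.INFO
tail -n 50 /var/log/catalogd/catalogd.INFO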
Then start the Impala service.
3. impala-shell and SQL
Run the impala-shell command on any CDH node to work with Impala. Only cdh4 runs an impalad process, so it can connect directly; the other machines must point at it with -i:
[root@cdh2 impala]# impala-shell -i cdh4
Starting Impala Shell without Kerberos authentication
Connected to cdh4:21000
Server version: impalad version 2.5.0-cdh5.7.6 RELEASE (build ecbba4f4e6d5eec6c33c1e02412621b8b9c71b6a)
***********************************************************************************
Welcome to the Impala shell. Copyright (c) 2015 Cloudera, Inc. All rights reserved.

(Impala Shell v2.5.0-cdh5.7.6 (ecbba4f) built on Tue Feb 21 14:54:50 PST 2017)

After running a query, type SUMMARY to see a summary of where time was spent.
***********************************************************************************
[cdh4:21000] >

Some of this command's options, explained:
[root@cdh4 ~]# impala-shell --help
Usage: impala_shell.py [options]

Options:
  -i IMPALAD, --impalad=IMPALAD
                        <host:port> of the impalad server to connect to [default: cdh4:21000]
  -q QUERY, --query=QUERY
                        run a single SQL query directly from the command line
  -f QUERY_FILE, --query_file=QUERY_FILE
                        run the SQL queries in a file, separated by ; [default: none]
  -o OUTPUT_FILE, --output_file=OUTPUT_FILE
                        write query results to the given file
  --print_header        print column headers with query results [default: False]
  --output_delimiter=OUTPUT_DELIMITER
                        column delimiter for output rows [default: \t]
  -r, --refresh_after_connect
                        refresh the Impala catalog after connecting, syncing databases, table schemas, and other metadata from the Hive metastore [default: False]
  -d DEFAULT_DB, --database=DEFAULT_DB
                        default database to use, equivalent to USE <database> [default: none]
  -u USER, --user=USER  user to authenticate as [default: root]

Once inside the shell, the commonly used commands are as follows:
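Putting a few of these options together, a typical one-shot invocation might look like this (a sketch; result.txt is an arbitrary output file name, and myimpala/page_view are created below):

impala-shell -i cdh4 -r -d myimpala --print_header --output_delimiter=',' \
  -q 'select * from page_view' -o result.txt

This connects to cdh4, refreshes the catalog from the Hive metastore, switches to the myimpala database, runs the query non-interactively, and writes the rows as comma-separated lines with a header to result.txt.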
1. Creating a database (http://impala.apache.org/docs/build/html/topics/impala_create_database.html#create_database)
[cdh4:21000] > create database myimpala;
Query: create database myimpala

By default the database is created under Hive's warehouse directory, /user/hive/warehouse. Check it:
[root@cdh4 ~]# hdfs dfs -ls /user/hive/warehouse
Found 1 items
drwxrwxrwt   - impala hive          0 2018-04-24 16:54 /user/hive/warehouse/myimpala.db

2. Table operations (http://impala.apache.org/docs/build/html/topics/impala_tables.html)
An internal (managed) table means that both the metadata and the data files are managed by Hive itself; dropping an internal table deletes all of its data. Tables are internal by default.
An external table means the data files are managed outside of Hive; dropping an external table does not delete the data. External tables are the right choice when, for example, several tables reference the same underlying files; see the sketch below.
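A minimal external-table sketch (hypothetical: it maps the same columns as the page_view table created below onto the files under /im, so dropping it leaves /im untouched):

CREATE EXTERNAL TABLE page_view_ext(
  viewTime STRING,
  userid BIGINT,
  page_url STRING,
  referrer_url STRING,
  ip STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/im';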
Here is a simple example, just as in Hive.
Create the table:
CREATE TABLE page_view(
  viewTime INT,
  userid BIGINT,
  page_url STRING,
  referrer_url STRING,
  ip STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Contents of a.txt in the /soft directory:
[root@cdh4 soft]# more a.txt
2015-12-13 11:56:20,1,www.baidu.com,www.qq.com,192.168.99.0
2015-12-13 10:56:20,1,www.baidu.com,www.qq.com,192.168.99.1
2015-12-13 9:56:20,1,www.baidu.com,www.qq.com,192.168.99.2
2015-12-13 11:56:20,1,www.baidu.com,www.qq.com,192.168.99.3
2015-12-13 44:56:20,1,www.baidu.com,www.qq.com,192.168.99.4

Upload the file to HDFS:
hdfs dfs -mkdir /im
hdfs dfs -put -f a.txt /im

Running LOAD DATA then fails with an error:
[cdh4:21000] > LOAD DATA INPATH '/im/a.txt' INTO TABLE page_view;
Query: load DATA INPATH '/a.txt' INTO TABLE page_view
ERROR: AnalysisException: Unable to LOAD DATA from hdfs://cdh4:8020/a.txt because Impala does not have WRITE permissions on its parent directory hdfs://cdh4:8020/

This happens because Impala accesses HDFS as the impala user, which does not have permission on the directory (Hadoop operations run as the hdfs user; check ownership with hdfs dfs -ls /).
[root@cdh4 soft]# hadoop fs -chown -R impala:supergroup /im
chown: changing ownership of '/im': Non-super user cannot change owner
[root@cdh4 soft]# su - hdfs
[hdfs@cdh4 ~]$ hadoop fs -chown -R impala:supergroup /im

Retry the import in the shell, then look at the data:
[cdh4:21000] > select * from page_view;
Query: select * from page_view
+----------+--------+---------------+--------------+--------------+
| viewtime | userid | page_url      | referrer_url | ip           |
+----------+--------+---------------+--------------+--------------+
| NULL     | 1      | www.baidu.com | www.qq.com   | 192.168.99.0 |
| NULL     | 1      | www.baidu.com | www.qq.com   | 192.168.99.1 |
| NULL     | 1      | www.baidu.com | www.qq.com   | 192.168.99.2 |
| NULL     | 1      | www.baidu.com | www.qq.com   | 192.168.99.3 |
| NULL     | 1      | www.baidu.com | www.qq.com   | 192.168.99.4 |
+----------+--------+---------------+--------------+--------------+
WARNINGS: Error converting column: 0 TO INT (Data is: 2015-12-13 11:56:20)
file: hdfs://cdh4:8020/user/hive/warehouse/myimpala.db/page_view/a.txt
record: 2015-12-13 11:56:20,1,www.baidu.com,www.qq.com,192.168.99.0
Error converting column: 0 TO INT (Data is: 2015-12-13 10:56:20)
file: hdfs://cdh4:8020/user/hive/warehouse/myimpala.db/page_view/a.txt
record: 2015-12-13 10:56:20,1,www.baidu.com,www.qq.com,192.168.99.1
Error converting column: 0 TO INT (Data is: 2015-12-13 9:56:20)
file: hdfs://cdh4:8020/user/hive/warehouse/myimpala.db/page_view/a.txt
record: 2015-12-13 9:56:20,1,www.baidu.com,www.qq.com,192.168.99.2
Error converting column: 0 TO INT (Data is: 2015-12-13 11:56:20)
file: hdfs://cdh4:8020/user/hive/warehouse/myimpala.db/page_view/a.txt
record: 2015-12-13 11:56:20,1,www.baidu.com,www.qq.com,192.168.99.3
Error converting column: 0 TO INT (Data is: 2015-12-13 44:56:20)
file: hdfs://cdh4:8020/user/hive/warehouse/myimpala.db/page_view/a.txt
record: 2015-12-13 44:56:20,1,www.baidu.com,www.qq.com,192.168.99.4
Fetched 5 row(s) in 1.37s

viewTime was defined as INT, so the values could not be converted. Alter the table:
alter table page_view change viewTime viewTime STRING;
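Why this works: LOAD DATA only moves the file into the table directory, and the text is re-parsed on every read, so changing the column type to STRING makes the existing rows readable without reloading anything. If real timestamps are needed later they can be derived at query time; a minimal sketch (the malformed 44:56:20 row would simply come back as NULL):

select cast(viewtime as timestamp) from page_view;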
Query again:
[cdh4:21000] > select * from page_view;
Query: select * from page_view
+---------------------+--------+---------------+--------------+--------------+
| viewtime            | userid | page_url      | referrer_url | ip           |
+---------------------+--------+---------------+--------------+--------------+
| 2015-12-13 11:56:20 | 1      | www.baidu.com | www.qq.com   | 192.168.99.0 |
| 2015-12-13 10:56:20 | 1      | www.baidu.com | www.qq.com   | 192.168.99.1 |
| 2015-12-13 9:56:20  | 1      | www.baidu.com | www.qq.com   | 192.168.99.2 |
| 2015-12-13 11:56:20 | 1      | www.baidu.com | www.qq.com   | 192.168.99.3 |
| 2015-12-13 44:56:20 | 1      | www.baidu.com | www.qq.com   | 192.168.99.4 |
+---------------------+--------+---------------+--------------+--------------+
Fetched 5 row(s) in 1.40s

3. Table partitioning (http://impala.apache.org/docs/build/html/topics/impala_tables.html)
Using the same data as before, create a Parquet-format, partitioned table:
CREATE TABLE page_view_parquet(
  viewTime STRING,
  userid BIGINT,
  page_url STRING,
  referrer_url STRING,
  ip STRING
)
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS PARQUET;

Because the file we uploaded, /im/a.txt, is plain text, we cannot LOAD DATA it into this table directly (the format would not match); instead we insert the data. First add the partition:
alter table page_view_parquet add partition (dt='2015-12-13', country='CHINA');

Insert a row to test:
[cdh4:21000] > insert into page_view_parquet partition (dt='2015-12-13', country='CHINA') values('2015-12-13 11:56:20',1,'www.baidu.com','www.baidu.com','192.168.7.7');
Query: insert into page_view_parquet partition (dt='2015-12-13', country='CHINA') values('2015-12-13 11:56:20',1,'www.baidu.com','www.baidu.com','192.168.7.7')
Inserted 1 row(s) in 0.80s
[cdh4:21000] > select * from page_view_parquet;
Query: select * from page_view_parquet
+---------------------+--------+---------------+---------------+-------------+------------+---------+
| viewtime            | userid | page_url      | referrer_url  | ip          | dt         | country |
+---------------------+--------+---------------+---------------+-------------+------------+---------+
| 2015-12-13 11:56:20 | 1      | www.baidu.com | www.baidu.com | 192.168.7.7 | 2015-12-13 | CHINA   |
+---------------------+--------+---------------+---------------+-------------+------------+---------+
Fetched 1 row(s) in 0.15s

Now convert the earlier page_view data into page_view_parquet, i.e. from TEXTFILE format to PARQUET:
[cdh4:21000] > insert into table page_view_parquet select * from page_view;
Query: insert into table page_view_parquet select * from page_view
ERROR: AnalysisException: Not enough partition columns mentioned in query. Missing columns are: dt, country
[cdh4:21000] > insert into table page_view_parquet partition (dt='2015-12-13', country='CHINA') select * from page_view;
Query: insert into table page_view_parquet partition (dt='2015-12-13', country='CHINA') select * from page_view
Inserted 5 row(s) in 1.49s
[cdh4:21000] > select * from page_view_parquet;
Query: select * from page_view_parquet
+---------------------+--------+---------------+---------------+--------------+------------+---------+
| viewtime            | userid | page_url      | referrer_url  | ip           | dt         | country |
+---------------------+--------+---------------+---------------+--------------+------------+---------+
| 2015-12-13 11:56:20 | 1      | www.baidu.com | www.baidu.com | 192.168.7.7  | 2015-12-13 | CHINA   |
| 2015-12-13 11:56:20 | 1      | www.baidu.com | www.qq.com    | 192.168.99.0 | 2015-12-13 | CHINA   |
| 2015-12-13 10:56:20 | 1      | www.baidu.com | www.qq.com    | 192.168.99.1 | 2015-12-13 | CHINA   |
| 2015-12-13 9:56:20  | 1      | www.baidu.com | www.qq.com    | 192.168.99.2 | 2015-12-13 | CHINA   |
| 2015-12-13 11:56:20 | 1      | www.baidu.com | www.qq.com    | 192.168.99.3 | 2015-12-13 | CHINA   |
| 2015-12-13 44:56:20 | 1      | www.baidu.com | www.qq.com    | 192.168.99.4 | 2015-12-13 | CHINA   |
+---------------------+--------+---------------+---------------+--------------+------------+---------+
Fetched 6 row(s) in 0.32s

Through the CDH console, click into HDFS and open the NameNode Web UI to see the files that were actually created on HDFS.
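You can also check from the command line; a sketch (the actual Parquet file names are generated by Impala):

hdfs dfs -ls -R /user/hive/warehouse/myimpala.db/page_view_parquet

Expect one subdirectory per partition, e.g. .../page_view_parquet/dt=2015-12-13/country=CHINA/, containing the Parquet data files. Running show partitions page_view_parquet; inside impala-shell lists the same partitions along with their row counts and sizes.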