[Original] Big Data Basics: Impala (1) Introduction, Installation, Usage

Date: 2019-11-21

Impala 2.12

Official site: http://impala.apache.org/

I. Introduction

Apache Impala is the open source, native analytic database for Apache Hadoop. Impala is shipped by Cloudera, MapR, Oracle, and Amazon.

Impala is an open source analytic database on Hadoop, developed in C++ and Java.

Do BI-style Queries on Hadoop
- Impala provides low latency and high concurrency for BI/analytic queries on Hadoop (not delivered by batch frameworks such as Apache Hive). Impala also scales linearly, even in multitenant environments.

Impala supports low-latency, high-concurrency queries on Hadoop.

Unify Your Infrastructure
- Utilize the same file and data formats and metadata, security, and resource management frameworks as your Hadoop deployment—no redundant infrastructure or data conversion/duplication.

It uses the same files, formats, and metadata as the rest of the Hadoop deployment.

Implement Quickly
- For Apache Hive users, Impala utilizes the same metadata and ODBC driver. Like Hive, Impala supports SQL, so you don't have to worry about re-inventing the implementation wheel.

For Hive users, Impala uses the same metadata and driver, and supports SQL.

Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS, HBase, or the Amazon Simple Storage Service (S3). In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Impala query UI in Hue) as Apache Hive. This provides a familiar and unified platform for real-time or batch-oriented queries.

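Impala runs fast, interactive SQL queries directly on Hadoop data (HDFS, HBase, and so on). It uses the same storage platform, metadata, SQL syntax, driver, and UI as Hive, which unifies real-time and batch-oriented queries.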

Impala is an addition to tools available for querying big data. Impala does not replace the batch processing frameworks built on MapReduce such as Hive. Hive and other frameworks built on MapReduce are best suited for long running batch jobs, such as those involving batch processing of Extract, Transform, and Load (ETL) type jobs.

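Impala is a strong addition to the big data query toolset. It does not replace existing batch processing frameworks such as Hive (Hive is typically used to run ETL jobs).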

To avoid latency, Impala circumvents MapReduce to directly access the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs. The result is order-of-magnitude faster performance than Hive, depending on the type of query and configuration.

Impala provides:

- Familiar SQL interface that data scientists and analysts already know.
- Ability to query high volumes of data ("big data") in Apache Hadoop.
- Distributed queries in a cluster environment, for convenient scaling and to make use of cost-effective commodity hardware.
- Ability to share data files between different components with no copy or export/import step; for example, to write with Pig, transform with Hive and query with Impala. Impala can read from and write to Hive tables, enabling simple data interchange using Impala for analytics on Hive-produced data.
- Single system for big data processing and analytics, so customers can avoid costly modeling and ETL just for analytics.

Impala architecture

The Impala server is a distributed, massively parallel processing (MPP) database engine.

1 Impala Daemon

The core Impala component is a daemon process that runs on each DataNode of the cluster, physically represented by the impalad process. It reads and writes to data files; accepts queries transmitted from the impala-shell command, Hue, JDBC, or ODBC; parallelizes the queries and distributes work across the cluster; and transmits intermediate query results back to the central coordinator node.

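The Impala daemon (impalad) is deployed alongside the DataNodes. It reads and writes data, serves requests from impala-shell, Hue, and JDBC, distributes query work across the cluster, and returns query results. Multiple instances are deployed.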

2 Impala Statestore

The Impala component known as the statestore checks on the health of Impala daemons on all the DataNodes in a cluster, and continuously relays its findings to each of those daemons. It is physically represented by a daemon process named statestored; you only need such a process on one host in the cluster. If an Impala daemon goes offline due to hardware failure, network error, software issue, or other reason, the statestore informs all the other Impala daemons so that future queries can avoid making requests to the unreachable node.

The Impala statestore checks and records the health of the Impala daemon servers, so that unhealthy nodes can be excluded at query time. Only one instance is needed.

3 Impala Catalog Service

The Impala component known as the catalog service relays the metadata changes from Impala SQL statements to all the Impala daemons in a cluster. It is physically represented by a daemon process named catalogd; you only need such a process on one host in the cluster. Because the requests are passed through the statestore daemon, it makes sense to run the statestored and catalogd services on the same host.

The Impala catalog service is responsible for metadata. Only one instance is needed.

Clients

- The impala-shell interactive command interpreter.
- The Hue web-based user interface.
- JDBC.

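II. Installation

Three installation methods are supported:

1 Installation with Cloudera Manager

Done through the web UI.

2 Installation with Ambari

See http://www.javashuo.com/article/p-obeykrvh-kr.html for details.

3 Manual installation

1 Add the repo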

# cat /etc/yum.repos.d/cdh.repo

[cloudera-cdh5]
# Packages for Cloudera's Distribution for Hadoop, Version 5, on RedHat or CentOS 7 x86_64
name=Cloudera's Distribution for Hadoop, Version 5
baseurl=https://archive.cloudera.com/cdh5/redhat/7/x86_64/cdh/5/
gpgkey=https://archive.cloudera.com/cdh5/redhat/7/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck=1

2 Install

# yum install impala impala-catalog impala-server impala-state-store impala-shell

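The components can also be installed separately.

Catalog installation
# yum install impala impala-catalog

Server installation
# yum install impala impala-server

Statestore installation
# yum install impala impala-state-store

Client installation
# yum install impala-shell

Edit the configuration file to set the addresses of catalogd and statestored: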

# vi /etc/default/impala
IMPALA_CATALOG_SERVICE_HOST=$catalog_server
IMPALA_STATE_STORE_HOST=$state_store_server

MEM_LIMIT=20gb

MEM_LIMIT accepts values of the form *gb, *g, *m, *mb, or a percentage such as 70%.
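A value can be checked against these forms before it is written to the file. The following is a minimal sketch, not something shipped with Impala: `is_valid_mem_limit` is a hypothetical helper name, and it assumes only integer amounts with the lowercase suffixes listed in this article.

```shell
# Hypothetical helper (not part of Impala): succeeds when the value looks
# like one of the MEM_LIMIT forms listed in this article.
# Assumption: only integer amounts and lowercase suffixes are checked.
is_valid_mem_limit() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]+(gb|g|mb|m|%)$'
}

# Example: validate a candidate value before editing /etc/default/impala.
candidate="20gb"
if is_valid_mem_limit "$candidate"; then
  echo "MEM_LIMIT=$candidate"
else
  echo "unexpected MEM_LIMIT format: $candidate" >&2
fi
```

The check is deliberately strict; loosen the pattern if your Impala version accepts other spellings.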

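Note that catalogd and statestored can only be deployed as single instances; there is no built-in failover mechanism. The official recommendation is to switch over via DNS when necessary.

Place the other configuration files for Hadoop, Hive, HBase, and so on (core-site.xml, hdfs-site.xml, hive-site.xml, hbase-site.xml) in: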

/etc/impala/conf/
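Copying these files can be scripted. The sketch below is one possible approach under stated assumptions: `copy_client_configs` is a hypothetical helper, and the source directories in the usage comment (/etc/hadoop/conf, /etc/hive/conf, /etc/hbase/conf) are typical locations that may differ on your cluster.

```shell
# Hypothetical helper: copy the given files into a destination directory,
# reporting (but not failing on) any file that does not exist.
copy_client_configs() {
  dest="$1"
  shift
  mkdir -p "$dest" || return 1
  for src in "$@"; do
    if [ -f "$src" ]; then
      cp "$src" "$dest/" || return 1
    else
      echo "skipping missing file: $src" >&2
    fi
  done
}

# Usage (source paths are assumptions; adjust them to your cluster):
# copy_client_configs /etc/impala/conf \
#   /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml \
#   /etc/hive/conf/hive-site.xml /etc/hbase/conf/hbase-site.xml
```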

Start commands

service impala-state-store start
service impala-catalog start
service impala-server start

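Note: Impala relies on Hive metadata. Version 2.12 supports Hive 2 and earlier, but not Hive 3.

Impala on YARN deployment is possible through Llama.

P.S. The RPMs can also be downloaded and installed manually: https://archive.cloudera.com/cdh5/redhat/7/x86_64/cdh/5/RPMS/x86_64/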

impala-2.12.0+cdh5.16.1+0-1.cdh5.16.1.p0.3.el7.x86_64.rpm
impala-catalog-2.12.0+cdh5.16.1+0-1.cdh5.16.1.p0.3.el7.x86_64.rpm
impala-server-2.12.0+cdh5.16.1+0-1.cdh5.16.1.p0.3.el7.x86_64.rpm
impala-shell-2.12.0+cdh5.16.1+0-1.cdh5.16.1.p0.3.el7.x86_64.rpm
impala-state-store-2.12.0+cdh5.16.1+0-1.cdh5.16.1.p0.3.el7.x86_64.rpm
impala-udf-devel-2.12.0+cdh5.16.1+0-1.cdh5.16.1.p0.3.el7.x86_64.rpm

However, installing from RPM requires many dependencies:

# rpm -ivh impala-2.12.0+cdh5.16.1+0-1.cdh5.16.1.p0.3.el6.x86_64.rpm
warning: impala-2.12.0+cdh5.16.1+0-1.cdh5.16.1.p0.3.el6.x86_64.rpm: Header V4 DSA/SHA1 Signature, key ID e8f86acd: NOKEY
error: Failed dependencies:
hadoop is needed by impala-2.12.0+cdh5.16.1+0-1.cdh5.16.1.p0.3.el6.x86_64
hadoop-hdfs is needed by impala-2.12.0+cdh5.16.1+0-1.cdh5.16.1.p0.3.el6.x86_64
hadoop-yarn is needed by impala-2.12.0+cdh5.16.1+0-1.cdh5.16.1.p0.3.el6.x86_64
hadoop-mapreduce is needed by impala-2.12.0+cdh5.16.1+0-1.cdh5.16.1.p0.3.el6.x86_64
hbase is needed by impala-2.12.0+cdh5.16.1+0-1.cdh5.16.1.p0.3.el6.x86_64
hive >= 0.12.0+cdh5.1.0 is needed by impala-2.12.0+cdh5.16.1+0-1.cdh5.16.1.p0.3.el6.x86_64
zookeeper is needed by impala-2.12.0+cdh5.16.1+0-1.cdh5.16.1.p0.3.el6.x86_64
hadoop-libhdfs is needed by impala-2.12.0+cdh5.16.1+0-1.cdh5.16.1.p0.3.el6.x86_64
avro-libs is needed by impala-2.12.0+cdh5.16.1+0-1.cdh5.16.1.p0.3.el6.x86_64
parquet is needed by impala-2.12.0+cdh5.16.1+0-1.cdh5.16.1.p0.3.el6.x86_64
sentry >= 1.3.0+cdh5.1.0 is needed by impala-2.12.0+cdh5.16.1+0-1.cdh5.16.1.p0.3.el6.x86_64
sentry is needed by impala-2.12.0+cdh5.16.1+0-1.cdh5.16.1.p0.3.el6.x86_64
libhdfs.so.0.0.0()(64bit) is needed by impala-2.12.0+cdh5.16.1+0-1.cdh5.16.1.p0.3.el6.x86_64

Impala server web page

III. Usage

impala-server listens on two ports:

port:21000, for impala-shell and ODBC driver 1.2.
port:21050, for JDBC and for ODBC driver 2.

1 impala-shell

Connect with impala-shell:

$ impala-shell -i $impala_server:21000
Starting Impala Shell without Kerberos authentication
Connected to $impala_server:21000
Server version: impalad version 2.12.0-cdh5.16.1 RELEASE (build 4a3775ef6781301af81b23bca45a9faeca5e761d)
***********************************************************************************
Welcome to the Impala shell.
(Impala Shell v2.12.0-cdh5.16.1 (4a3775e) built on Wed Nov 21 21:02:28 PST 2018)

When you set a query option it lasts for the duration of the Impala shell session.
***********************************************************************************
[$impala_server:21000] >

Once connected, use it the same way as Hive.
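impala-shell can also run a single statement and exit, which is convenient in scripts. The sketch below relies on the `-q` (query), `-B` (delimited output), and `--output_delimiter` options of impala-shell. `run_impala_query` and the `IMPALA_SHELL` override variable are hypothetical names introduced here so that the command can be swapped out for a dry run.

```shell
# Hypothetical wrapper: run one statement against an impalad and print
# comma-delimited rows. IMPALA_SHELL is an assumed override variable.
run_impala_query() {
  # $1 = impalad host, $2 = SQL statement
  "${IMPALA_SHELL:-impala-shell}" -i "$1:21000" -B --output_delimiter=',' -q "$2"
}

# Usage (requires a reachable impalad; the host name is a placeholder):
# run_impala_query impala-host.example.com 'show databases'
```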

2 beeline (JDBC)

The Impala JDBC driver must be downloaded first.

Download

# wget https://downloads.cloudera.com/connectors/impala_jdbc_2.6.4.1005.zip
# unzip impala_jdbc_2.6.4.1005.zip
# cd ClouderaImpalaJDBC-2.6.4.1005
# unzip ClouderaImpalaJDBC4-2.6.4.1005.zip

Connect with beeline

# beeline -u jdbc:hive2://$impala_server:21050

# export HIVE_AUX_JARS_PATH=/path/to/ClouderaImpalaJDBC-2.6.4.1005/ImpalaJDBC4.jar
# beeline -d com.cloudera.impala.jdbc4.Driver -u jdbc:impala://$impala_server:21050
Connecting to jdbc:impala://$impala_server:21050
Connected to: Impala (version 2.12.0-cdh5.16.1)
Driver: ImpalaJDBC (version 02.06.04.1005)
Error: [Cloudera][JDBC](11975) Unsupported transaction isolation level: 4. (state=HY000,code=11975)
Beeline version 3.1.0.3.1.0.0-78 by Apache Hive
0: jdbc:impala://$impala_server:21050> show databases;

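Note that an Error is reported here, but it does not affect usage.

After running a SQL query, use summary to view the statistics of the query just executed: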

[localhost:21000] > summary;
+--------------+--------+----------+----------+---------+------------+----------+---------------+---------------+
| Operator     | #Hosts | Avg Time | Max Time | #Rows   | Est. #Rows | Peak Mem | Est. Peak Mem | Detail        |
+--------------+--------+----------+----------+---------+------------+----------+---------------+---------------+
| 06:AGGREGATE | 1      | 230.00ms | 230.00ms | 1       | 1          | 16.00 KB | -1 B          | FINALIZE      |
| 05:EXCHANGE  | 1      | 43.44us  | 43.44us  | 1       | 1          | 0 B      | -1 B          | UNPARTITIONED |
| 02:AGGREGATE | 1      | 227.14ms | 227.14ms | 1       | 1          | 12.00 KB | 10.00 MB      |               |
| 04:AGGREGATE | 1      | 126.27ms | 126.27ms | 150.00K | 150.00K    | 15.17 MB | 10.00 MB      |               |
| 03:EXCHANGE  | 1      | 44.07ms  | 44.07ms  | 150.00K | 150.00K    | 0 B      | 0 B           | HASH(c_name)  |
| 01:AGGREGATE | 1      | 361.94ms | 361.94ms | 150.00K | 150.00K    | 23.04 MB | 10.00 MB      |               |
| 00:SCAN HDFS | 1      | 43.64ms  | 43.64ms  | 150.00K | 150.00K    | 24.19 MB | 64.00 MB      | tpch.customer |
+--------------+--------+----------+----------+---------+------------+----------+---------------+---------------+
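When a summary has many operators, it helps to pick out the one with the largest Avg Time. The following is a rough sketch using awk, not an Impala feature: `slowest_operator` is a hypothetical helper that assumes the pipe-delimited layout shown above and only understands durations with a single unit (ns, us, ms, s). Compound values such as 1m30s are not handled.

```shell
# Hypothetical helper: read a summary table on stdin and print the operator
# with the largest Avg Time, followed by that time.
slowest_operator() {
  awk -F'|' '
    # Convert a simple duration such as 230.00ms or 43.44us to milliseconds.
    function to_ms(t) {
      if (t ~ /ns$/) return (t + 0) / 1000000
      if (t ~ /us$/) return (t + 0) / 1000
      if (t ~ /ms$/) return (t + 0)
      if (t ~ /s$/)  return (t + 0) * 1000
      return -1
    }
    # Data rows have at least 10 pipe-separated fields; skip the header row.
    NF >= 10 && $2 !~ /Operator/ {
      op = $2; gsub(/^ +| +$/, "", op)
      t = $4;  gsub(/ /, "", t)
      ms = to_ms(t)
      if (ms > max) { max = ms; name = op; best = t }
    }
    END { if (name != "") print name, best }
  '
}

# Usage: save the summary output to a file (the file name is a placeholder):
# slowest_operator < summary.txt
```

For the table above this selects the 01:AGGREGATE row, whose Avg Time of 361.94ms is the largest.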

Use profile to view the detailed execution of the query:

[localhost:21000] > profile;

Force a refresh of one table's metadata:

> REFRESH [db_name.]table_name [PARTITION (key_col1=val1 [, key_col2=val2...])]

Force a refresh of all metadata:

> invalidate metadata
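The two statements differ in scope: REFRESH reloads the metadata of one table (optionally a single partition), while INVALIDATE METADATA without a table name marks the cached metadata of all tables as stale. The sketch below composes a REFRESH statement from its parts. `refresh_stmt`, the table name sales.orders, and the partition columns are hypothetical and serve only as an illustration.

```shell
# Hypothetical helper: print a REFRESH statement for a table, with an
# optional partition spec such as "year=2019, month=11".
refresh_stmt() {
  if [ -n "${2:-}" ]; then
    printf 'REFRESH %s PARTITION (%s);\n' "$1" "$2"
  else
    printf 'REFRESH %s;\n' "$1"
  fi
}

# Example with a placeholder table and partition:
refresh_stmt sales.orders
refresh_stmt sales.orders "year=2019, month=11"
```

The printed statement can then be pasted into impala-shell or passed to it with the -q option.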

References:

Impala: A Modern, Open-Source SQL Engine for Hadoop: http://cidrdb.org/cidr2015/Papers/CIDR15_Paper28.pdf

Apache Impala Guide: http://impala.apache.org/docs/build/impala-2.12.pdf