Source code: https://github.com/hiszm/hadoop-train
http://hive.apache.org/
> The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.
Hive is a data warehouse built on top of Hadoop. It maps structured data files to tables and provides SQL-like query capabilities; the SQL statements used in queries are translated into MapReduce jobs, which are then submitted to run on Hadoop.
Why Hive:

- MapReduce programming is inconvenient.
- RDBMS users already know the concept of a schema (a schema is the collection of database objects: tables, indexes, views, stored procedures, and so on).

What Hive offers:

- An SQL-like query language (HQL), so people who know SQL well but do not know Java programming can still do big-data analysis effectively.
- Metadata management, so data can be shared with Presto / Impala / Spark SQL and other engines.
- Clients: shell, JDBC, web UI (Zeppelin).
- metastore: the metadata describing the databases and tables.

Hive vs. RDBMS:

| | Hive | RDBMS |
|---|---|---|
| Query language | Hive SQL (HQL) | SQL |
| Data storage | HDFS | Raw device or local FS |
| Indexes | None (only weak support) | Yes |
| Execution | MapReduce, Tez | Executor |
| Latency | High, offline | Low, online |
| Data scale | Very large | Small |
Download and unpack:

```shell
wget hive-1.1.0-cdh5.15.1.tar.gz(url)
tar -zxvf hive-1.1.0-cdh5.15.1.tar.gz -C ~/app/
```

Add Hive to the environment in `~/.bash_profile`:

```shell
export HIVE_HOME=/home/hadoop/app/hive-1.1.0-cdh5.15.1
export PATH=$HIVE_HOME/bin:$PATH
```

Reload and verify:

```shell
[hadoop@hadoop000 app]$ source ~/.bash_profile
[hadoop@hadoop000 app]$ echo $HIVE_HOME
/home/hadoop/app/hive-1.1.0-cdh5.15.1
```
In `$HIVE_HOME/conf/hive-env.sh`, add one line pointing Hive at the Hadoop installation, e.g. `HADOOP_HOME=/home/hadoop/app/hadoop-2.6.0-cdh5.15.1` (matching the Hadoop path used elsewhere in this setup).
Create `$HIVE_HOME/conf/hive-site.xml` with the metastore connection settings:

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop000:3306/hadoop_hive?createDatabaseIfNotExist=true</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>root</value>
    </property>
</configuration>
```
Copy the MySQL JDBC driver `mysql-connector-java-5.1.27-bin.jar` into `/home/hadoop/app/hive-1.1.0-cdh5.15.1/lib/`.
Install MySQL with `yum` and check that you can log in:

```shell
[hadoop@hadoop000 lib]$ mysql -uroot -proot
Warning: Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.6.42 MySQL Community Server (GPL)

Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>
```
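If the metastore account cannot connect from the Hive host, grant it access to the `hadoop_hive` database configured in `hive-site.xml` above. A sketch for MySQL 5.6; the user, host pattern, and password are assumptions that should match your own setup:

```sql
-- Let the configured metastore account reach hadoop_hive from any host ('%');
-- tighten the host pattern in a real deployment.
GRANT ALL PRIVILEGES ON hadoop_hive.* TO 'root'@'%' IDENTIFIED BY 'root';
FLUSH PRIVILEGES;
```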
Make sure HDFS and YARN are running before starting Hive:

```shell
[hadoop@hadoop000 sbin]$ jps
3218 SecondaryNameNode
3048 DataNode
3560 NodeManager
3451 ResourceManager
2940 NameNode
3599 Jps
```
Start the `hive` CLI and create a database:

```sql
hive> create database test;
OK
```
The new database is registered in the MySQL metastore; the `DBS` table lists every Hive database:

```sql
mysql> select * from DBS\G
*************************** 1. row ***************************
          DB_ID: 1
           DESC: Default Hive database
DB_LOCATION_URI: hdfs://hadoop000:8020/user/hive/warehouse
           NAME: default
     OWNER_NAME: public
     OWNER_TYPE: ROLE
*************************** 2. row ***************************
          DB_ID: 3
           DESC: NULL
DB_LOCATION_URI: hdfs://hadoop000:8020/user/hive/warehouse/hive.db
           NAME: hive
     OWNER_NAME: hadoop
     OWNER_TYPE: USER
*************************** 3. row ***************************
          DB_ID: 4
           DESC: NULL
DB_LOCATION_URI: hdfs://hadoop000:8020/test/location
           NAME: hive2
     OWNER_NAME: hadoop
     OWNER_TYPE: USER
*************************** 4. row ***************************
          DB_ID: 6
           DESC: NULL
DB_LOCATION_URI: hdfs://hadoop000:8020/user/hive/warehouse/test.db
           NAME: test
     OWNER_NAME: hadoop
     OWNER_TYPE: USER
4 rows in set (0.00 sec)
```
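Other metastore tables hold the rest of the catalog; for instance, table-level metadata lives in `TBLS`, which is part of the standard Hive metastore schema (the column subset below is just a convenient pick):

```sql
-- Each Hive table is one row in TBLS, linked to its database through DB_ID.
SELECT TBL_ID, DB_ID, TBL_NAME, TBL_TYPE FROM TBLS;
```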
Hive DDL = Hive Data Definition Language.

Create Database:
```sql
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
  [COMMENT database_comment]
  [LOCATION hdfs_path]
  [MANAGEDLOCATION hdfs_path]
  [WITH DBPROPERTIES (property_name=property_value, ...)];
```
Drop Database:

```sql
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
```
```sql
hive> create DATABASE hive_test;
OK
Time taken: 0.154 seconds
```
On HDFS the new database gets the default path `/user/hive/warehouse/hive_test.db`. Note that the `default` database has no `default.db` directory of its own; its location is `/user/hive/warehouse/` itself.

Create a database at a custom location:
```sql
hive> create DATABASE hive_test2 LOCATION '/test/hive';
OK
Time taken: 0.119 seconds
```

```shell
[hadoop@hadoop000 network-scripts]$ hadoop fs -ls /test/
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2020-09-09 06:29 /test/hive
```
Custom database properties:

```sql
DESC DATABASE [EXTENDED] db_name;  -- EXTENDED also shows the extra properties
```
```sql
hive> create DATABASE hive_test3 LOCATION '/test/hive'
    > with DBPROPERTIES('creator'='jack');
OK
Time taken: 0.078 seconds
hive> desc database hive_test3;
OK
hive_test3      hdfs://hadoop000:8020/test/hive    hadoop    USER
Time taken: 0.048 seconds, Fetched: 1 row(s)
hive> desc database extended hive_test3;
OK
hive_test3      hdfs://hadoop000:8020/test/hive    hadoop    USER    {creator=jack}
Time taken: 0.018 seconds, Fetched: 1 row(s)
```
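Properties can also be changed after creation with `ALTER DATABASE` (a small sketch; the value here is illustrative):

```sql
-- Adds or overwrites a key in the database's DBPROPERTIES:
ALTER DATABASE hive_test3 SET DBPROPERTIES ('creator'='tom');
```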
Display the current database in the CLI prompt:

```sql
hive> set hive.cli.print.current.db;
hive.cli.print.current.db=false
hive> set hive.cli.print.current.db=true;
hive (default)>
```
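To make this permanent, the setting can go into `~/.hiverc`, which the Hive CLI runs at startup:

```sql
-- ~/.hiverc: executed automatically when the Hive CLI starts
set hive.cli.print.current.db=true;
```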
Drop a database:

```sql
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
```
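RESTRICT is the default and refuses to drop a database that still contains tables; CASCADE drops the tables along with it. A quick sketch (`some_db` is a placeholder name):

```sql
-- Fails under RESTRICT (the default) if the database is not empty:
DROP DATABASE IF EXISTS some_db;
-- Drops the database together with all of its tables:
DROP DATABASE IF EXISTS some_db CASCADE;
```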
```sql
hive (default)> show databases;
OK
default
hive
hive2
hive_test
hive_test3
test
Time taken: 0.02 seconds, Fetched: 6 row(s)
hive (default)> drop database hive_test3;
OK
Time taken: 0.099 seconds
hive (default)> show databases;
OK
default
hive
hive2
hive_test
test
Time taken: 0.019 seconds, Fetched: 5 row(s)
```
Filter the list with a pattern:

```sql
hive (default)> show databases like 'hive*';
OK
hive
hive2
hive_test
Time taken: 0.024 seconds, Fetched: 3 row(s)
```
Switch to a database:

```sql
USE database_name;
```
Create Table syntax:

```sql
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name    -- table name
  [(col_name data_type [COMMENT col_comment], ... [constraint_specification])]  -- columns and their data types
  [COMMENT table_comment]                                                   -- table comment
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]          -- partitioning rules
  [CLUSTERED BY (col_name, col_name, ...)
    [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]        -- bucketing rules
  [SKEWED BY (col_name, col_name, ...)
    ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
    [STORED AS DIRECTORIES]]                                                -- skewed columns and values
  [[ROW FORMAT row_format]
   [STORED AS file_format]
   | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]]   -- row delimiter, file format, or custom storage handler
  [LOCATION hdfs_path]                                                      -- storage location of the table
  [TBLPROPERTIES (property_name=property_value, ...)]                       -- table properties
  [AS select_statement];                                                    -- create the table from a query result
```
Create an `emp` table:

```sql
CREATE TABLE emp(
  empno int,
  ename string,
  job string,
  mgr int,
  hiredate string,
  sal double,
  comm double,
  deptno int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
```
```sql
hive> CREATE TABLE emp(
    >   empno int,
    >   ename string,
    >   job string,
    >   mgr int,
    >   hiredate string,
    >   sal double,
    >   comm double,
    >   deptno int
    > ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.115 seconds
hive> desc formatted emp;
OK
# col_name              data_type               comment

empno                   int
ename                   string
job                     string
mgr                     int
hiredate                string
sal                     double
comm                    double
deptno                  int

# Detailed Table Information
Database:               hive
Owner:                  hadoop
CreateTime:             Wed Sep 09 09:34:57 CST 2020
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://hadoop000:8020/user/hive/warehouse/hive.db/emp
Table Type:             MANAGED_TABLE
Table Parameters:
        transient_lastDdlTime   1599615297

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        field.delim             \t
        serialization.format    \t
Time taken: 0.131 seconds, Fetched: 34 row(s)
```
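`desc formatted` shows `Table Type: MANAGED_TABLE`, meaning a drop deletes the data too. For contrast, a minimal sketch of an EXTERNAL table (the name and location here are made up): dropping it removes only the metadata, and the files under `LOCATION` stay on HDFS:

```sql
CREATE EXTERNAL TABLE emp_external(
  empno int,
  ename string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/test/emp_external';
```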
Load data into the table (a DML statement; see the DML section below):

```sql
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
  [PARTITION (partcol1=val1, partcol2=val2 ...)]

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
  [PARTITION (partcol1=val1, partcol2=val2 ...)]
  [INPUTFORMAT 'inputformat' SERDE 'serde']   -- Hive 3.0 or later
```

```sql
LOAD DATA LOCAL INPATH '/home/hadoop/data/emp.txt' OVERWRITE INTO TABLE emp;
```
```sql
hive> LOAD DATA LOCAL INPATH '/home/hadoop/data/emp.txt' OVERWRITE INTO TABLE emp;
Loading data to table hive.emp
Table hive.emp stats: [numFiles=1, totalSize=700]
OK
Time taken: 2.482 seconds
hive> select * from emp;
OK
7369    SMITH    CLERK        7902    1980-12-17    800.0      NULL    20
7499    ALLEN    SALESMAN     7698    1981-2-20     1600.0     300.0   30
7521    WARD     SALESMAN     7698    1981-2-22     1250.0     500.0   30
7566    JONES    MANAGER      7839    1981-4-2      2975.0     NULL    20
7654    MARTIN   SALESMAN     7698    1981-9-28     1250.0     1400.0  30
7698    BLAKE    MANAGER      7839    1981-5-1      2850.0     NULL    30
7782    CLARK    MANAGER      7839    1981-6-9      2450.0     NULL    10
7788    SCOTT    ANALYST      7566    1987-4-19     3000.0     NULL    20
7839    KING     PRESIDENT    NULL    1981-11-17    5000.0     NULL    10
7844    TURNER   SALESMAN     7698    1981-9-8      1500.0     0.0     30
7876    ADAMS    CLERK        7788    1987-5-23     1100.0     NULL    20
7900    JAMES    CLERK        7698    1981-12-3     950.0      NULL    30
7902    FORD     ANALYST      7566    1981-12-3     3000.0     NULL    20
7934    MILLER   CLERK        7782    1982-1-23     1300.0     NULL    10
8888    HIVE     PROGRAM      7839    1988-1-23     10300.0    NULL    NULL
Time taken: 0.363 seconds, Fetched: 15 row(s)
```
Rename a table:

```sql
ALTER TABLE table_name RENAME TO new_table_name;
```
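A quick sketch (both table names here are illustrative); for a managed table the rename also moves its HDFS directory under the warehouse path:

```sql
-- Renames the table and, for managed tables, its HDFS directory as well:
ALTER TABLE emp_1 RENAME TO emp_backup;
```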
Hive Data Manipulation Language
```sql
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
  [PARTITION (partcol1=val1, partcol2=val2 ...)]

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
  [PARTITION (partcol1=val1, partcol2=val2 ...)]
  [INPUTFORMAT 'inputformat' SERDE 'serde']   -- Hive 3.0 or later
```
- LOCAL: if present, the file is taken from the local filesystem of the server; if absent, it is taken from HDFS.
- OVERWRITE: if present, existing data in the table is replaced; if absent, the new data is appended.
- INPATH accepts a relative path such as `project/data1`, an absolute path such as `/user/hive/project/data1`, or a full URI such as `hdfs://namenode:9000/user/hive/project/data1` (see the sketch below for an HDFS load).
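One detail worth knowing when loading from HDFS (no LOCAL): the source file is moved into the table's directory, not copied. A sketch with a made-up path:

```sql
-- Moves /data/emp.txt on HDFS into emp's warehouse directory,
-- appending to the existing data (no OVERWRITE):
LOAD DATA INPATH '/data/emp.txt' INTO TABLE emp;
```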
Create a table from a query (CTAS):

```sql
create table emp_1 as select * from emp;
```
Export query results to a local directory:

```sql
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
select empno, ename, sal, deptno from emp;
```

```sql
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive'
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    > select empno, ename, sal, deptno from emp;
Query ID = hadoop_20200909102020_aeb2ef7d-cf18-4bcb-b903-8c6ea1719626
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1599583423179_0001, Tracking URL = http://hadoop000:8088/proxy/application_1599583423179_0001/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.15.1/bin/hadoop job -kill job_1599583423179_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-09-09 10:21:18,074 Stage-1 map = 0%, reduce = 0%
2020-09-09 10:21:29,109 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.64 sec
MapReduce Total cumulative CPU time: 5 seconds 640 msec
Ended Job = job_1599583423179_0001
Copying data to local directory /tmp/hive
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 5.64 sec   HDFS Read: 4483 HDFS Write: 313 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 640 msec
OK
Time taken: 35.958 seconds
```

```shell
[hadoop@hadoop000 hive]$ cat 000000_0
7369    SMITH     800.0      20
7499    ALLEN     1600.0     30
7521    WARD      1250.0     30
7566    JONES     2975.0     20
7654    MARTIN    1250.0     30
7698    BLAKE     2850.0     30
7782    CLARK     2450.0     10
7788    SCOTT     3000.0     20
7839    KING      5000.0     10
7844    TURNER    1500.0     30
7876    ADAMS     1100.0     20
7900    JAMES     950.0      30
7902    FORD      3000.0     20
7934    MILLER    1300.0     10
8888    HIVE      10300.0    \N
```
Queries are no different from ordinary SQL:

```sql
select * from emp where deptno=10;
```
Aggregate functions (max, min, avg, sum) are the kind of query that has to run MapReduce:
```sql
hive> select count(1) from emp where deptno=10;
Query ID = hadoop_20200909104949_1ce185de-2025-4633-9324-3e47f30fb157
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1599583423179_0002, Tracking URL = http://hadoop000:8088/proxy/application_1599583423179_0002/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.15.1/bin/hadoop job -kill job_1599583423179_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-09-09 10:50:00,361 Stage-1 map = 0%, reduce = 0%
2020-09-09 10:50:10,092 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 6.52 sec
2020-09-09 10:50:25,233 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.72 sec
MapReduce Total cumulative CPU time: 11 seconds 720 msec
Ended Job = job_1599583423179_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 11.72 sec   HDFS Read: 9708 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 720 msec
OK
3
Time taken: 38.666 seconds, Fetched: 1 row(s)
hive> select * from emp where deptno=10;
OK
7782    CLARK    MANAGER      7839    1981-6-9      2450.0    NULL    10
7839    KING     PRESIDENT    NULL    1981-11-17    5000.0    NULL    10
7934    MILLER   CLERK        7782    1982-1-23     1300.0    NULL    10
Time taken: 0.209 seconds, Fetched: 3 row(s)
```
```sql
select deptno, avg(sal) from emp group by deptno;
```

Note: any column in the `select` list that is not inside an aggregate function must appear in the `group by` clause, as the sketch below illustrates.
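A quick sketch of the rule:

```sql
-- Valid: deptno is grouped; sal appears only inside aggregates.
SELECT deptno, AVG(sal), MAX(sal) FROM emp GROUP BY deptno;

-- Invalid: ename is neither aggregated nor listed in GROUP BY.
-- SELECT deptno, ename, AVG(sal) FROM emp GROUP BY deptno;
```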
Joins come in when more than one table is involved. Create and load a `dept` table:
```sql
CREATE TABLE dept(
  deptno int,
  dname string,
  loc string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
```
```sql
LOAD DATA LOCAL INPATH '/home/hadoop/data/dept.txt' OVERWRITE INTO TABLE dept;
```
```sql
select e.empno, e.ename, e.sal, e.deptno, d.dname
from emp e join dept d
on e.deptno = d.deptno;
```
```sql
hive> select e.empno,e.ename,e.sal,e.deptno,d.dname
    > from emp e join dept d
    > on e.deptno=d.deptno;
Query ID = hadoop_20200909140808_8635204d-8e8a-4267-8503-ef242f022ebc
Total jobs = 1
2020-09-09 02:08:51     Starting to launch local task to process map join;      maximum memory = 477626368
2020-09-09 02:08:54     End of local task; Time Taken: 3.023 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1599583423179_0004, Tracking URL = http://hadoop000:8088/proxy/application_1599583423179_0004/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.15.1/bin/hadoop job -kill job_1599583423179_0004
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2020-09-09 14:09:06,852 Stage-3 map = 0%, reduce = 0%
2020-09-09 14:09:18,823 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 6.7 sec
MapReduce Total cumulative CPU time: 6 seconds 700 msec
Ended Job = job_1599583423179_0004
MapReduce Jobs Launched:
Stage-Stage-3: Map: 1   Cumulative CPU: 6.7 sec   HDFS Read: 7649 HDFS Write: 406 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 700 msec
OK
7369    SMITH     800.0     20    RESEARCH
7499    ALLEN     1600.0    30    SALES
7521    WARD      1250.0    30    SALES
7566    JONES     2975.0    20    RESEARCH
7654    MARTIN    1250.0    30    SALES
7698    BLAKE     2850.0    30    SALES
7782    CLARK     2450.0    10    ACCOUNTING
7788    SCOTT     3000.0    20    RESEARCH
7839    KING      5000.0    10    ACCOUNTING
7844    TURNER    1500.0    30    SALES
7876    ADAMS     1100.0    20    RESEARCH
7900    JAMES     950.0     30    SALES
7902    FORD      3000.0    20    RESEARCH
7934    MILLER    1300.0    10    ACCOUNTING
Time taken: 46.765 seconds, Fetched: 14 row(s)
```
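The inner join returns 14 rows while `emp` has 15: empno 8888 has a NULL `deptno`, so it is filtered out. A LEFT JOIN (a quick sketch) would keep it:

```sql
-- Keeps every emp row; dname is NULL where no dept matches (e.g. empno 8888):
select e.empno, e.ename, d.dname
from emp e left join dept d
on e.deptno = d.deptno;
```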
Use `explain` to inspect the execution plan:

```sql
explain
select e.empno, e.ename, e.sal, e.deptno, d.dname
from emp e join dept d
on e.deptno = d.deptno;
```
```sql
hive> explain
    > select e.empno,e.ename,e.sal,e.deptno,d.dname
    > from emp e join dept d
    > on e.deptno=d.deptno;
OK
STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d
          TableScan
            alias: d
            Statistics: Num rows: 1 Data size: 79 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: deptno is not null (type: boolean)
              Statistics: Num rows: 1 Data size: 79 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 deptno (type: int)
                  1 deptno (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: e
            Statistics: Num rows: 6 Data size: 700 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: deptno is not null (type: boolean)
              Statistics: Num rows: 3 Data size: 350 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 deptno (type: int)
                  1 deptno (type: int)
                outputColumnNames: _col0, _col1, _col5, _col7, _col12
                Statistics: Num rows: 3 Data size: 385 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: string), _col5 (type: double), _col7 (type: int), _col12 (type: string)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4
                  Statistics: Num rows: 3 Data size: 385 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 3 Data size: 385 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
```
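The plan shows a Map Join: the small `dept` table is hash-loaded locally in Stage-4, so the join needs no reduce phase. This conversion is controlled by `hive.auto.convert.join`; turning it off (a sketch) makes the same `explain` show the shuffle-based common join instead:

```sql
-- With automatic map-join conversion disabled, the plan falls back
-- to a reduce-side (common) join:
set hive.auto.convert.join=false;
explain
select e.empno, e.ename, e.sal, e.deptno, d.dname
from emp e join dept d
on e.deptno = d.deptno;
```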