Source code: https://github.com/hiszm/hadoop-train
http://hive.apache.org/
> The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.
Hive is a data warehouse built on top of Hadoop. It maps structured data files to tables and provides SQL-like query capabilities; the SQL statements used in queries are translated into MapReduce jobs, which are then submitted to run on Hadoop.
Why Hive:

- MapReduce programming is inconvenient.
- RDBMS users already know the concept of a schema (a schema is the collection of database objects: tables, indexes, views, stored procedures, and so on).

What Hive offers:

- An SQL-like query language (HQL), so people who know SQL well but do not know Java programming can still do big-data analysis effectively.
- Metadata management, so data can be shared with Presto / Impala / Spark SQL and other engines.
- Clients: shell, JDBC, web UI (Zeppelin).
- metastore: the metadata describing the databases and tables.

Hive vs. RDBMS:

| | Hive | RDBMS |
|---|---|---|
| Query language | Hive SQL (HQL) | SQL |
| Data storage | HDFS | Raw device or local FS |
| Indexes | None (only weak support) | Yes |
| Execution | MapReduce, Tez | Executor |
| Latency | High, offline | Low, online |
| Data scale | Very large | Small |
Download and unpack:

```shell
wget hive-1.1.0-cdh5.15.1.tar.gz(url)
tar -zxvf hive-1.1.0-cdh5.15.1.tar.gz -C ~/app/
```

Add Hive to the environment in `~/.bash_profile`:

```shell
export HIVE_HOME=/home/hadoop/app/hive-1.1.0-cdh5.15.1
export PATH=$HIVE_HOME/bin:$PATH
```

Reload and verify:

```shell
[hadoop@hadoop000 app]$ source ~/.bash_profile
[hadoop@hadoop000 app]$ echo $HIVE_HOME
/home/hadoop/app/hive-1.1.0-cdh5.15.1
```
In `$HIVE_HOME/conf/hive-env.sh`, add one line pointing Hive at the Hadoop installation, e.g. `HADOOP_HOME=/home/hadoop/app/hadoop-2.6.0-cdh5.15.1` (matching the Hadoop path used elsewhere in this setup).
Create `$HIVE_HOME/conf/hive-site.xml` with the metastore connection settings:

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop000:3306/hadoop_hive?createDatabaseIfNotExist=true</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>root</value>
    </property>
</configuration>
```
Copy the MySQL JDBC driver `mysql-connector-java-5.1.27-bin.jar` into `/home/hadoop/app/hive-1.1.0-cdh5.15.1/lib/`.
Install MySQL with `yum` and check that you can log in:

```shell
[hadoop@hadoop000 lib]$ mysql -uroot -proot
Warning: Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.6.42 MySQL Community Server (GPL)

Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>
```
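If the metastore account cannot connect from the Hive host, grant it access to the `hadoop_hive` database configured in `hive-site.xml` above. A sketch for MySQL 5.6; the user, host pattern, and password are assumptions that should match your own setup:

```sql
-- Let the configured metastore account reach hadoop_hive from any host ('%');
-- tighten the host pattern in a real deployment.
GRANT ALL PRIVILEGES ON hadoop_hive.* TO 'root'@'%' IDENTIFIED BY 'root';
FLUSH PRIVILEGES;
```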
Make sure HDFS and YARN are running before starting Hive:

```shell
[hadoop@hadoop000 sbin]$ jps
3218 SecondaryNameNode
3048 DataNode
3560 NodeManager
3451 ResourceManager
2940 NameNode
3599 Jps
```
Start the `hive` CLI and create a database:

```sql
hive> create database test;
OK
```
The new database is registered in the MySQL metastore; the `DBS` table lists every Hive database:

```sql
mysql> select * from DBS\G
*************************** 1. row ***************************
          DB_ID: 1
           DESC: Default Hive database
DB_LOCATION_URI: hdfs://hadoop000:8020/user/hive/warehouse
           NAME: default
     OWNER_NAME: public
     OWNER_TYPE: ROLE
*************************** 2. row ***************************
          DB_ID: 3
           DESC: NULL
DB_LOCATION_URI: hdfs://hadoop000:8020/user/hive/warehouse/hive.db
           NAME: hive
     OWNER_NAME: hadoop
     OWNER_TYPE: USER
*************************** 3. row ***************************
          DB_ID: 4
           DESC: NULL
DB_LOCATION_URI: hdfs://hadoop000:8020/test/location
           NAME: hive2
     OWNER_NAME: hadoop
     OWNER_TYPE: USER
*************************** 4. row ***************************
          DB_ID: 6
           DESC: NULL
DB_LOCATION_URI: hdfs://hadoop000:8020/user/hive/warehouse/test.db
           NAME: test
     OWNER_NAME: hadoop
     OWNER_TYPE: USER
4 rows in set (0.00 sec)
```
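Other metastore tables hold the rest of the catalog; for instance, table-level metadata lives in `TBLS`, which is part of the standard Hive metastore schema (the column subset below is just a convenient pick):

```sql
-- Each Hive table is one row in TBLS, linked to its database through DB_ID.
SELECT TBL_ID, DB_ID, TBL_NAME, TBL_TYPE FROM TBLS;
```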
Hive DDL = Hive Data Definition Language.

Create Database:
```sql
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
  [COMMENT database_comment]
  [LOCATION hdfs_path]
  [MANAGEDLOCATION hdfs_path]
  [WITH DBPROPERTIES (property_name=property_value, ...)];
```
Drop Database:

```sql
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
```
```sql
hive> create DATABASE hive_test;
OK
Time taken: 0.154 seconds
```
On HDFS the new database gets the default path `/user/hive/warehouse/hive_test.db`. Note that the `default` database has no `default.db` directory of its own; its location is `/user/hive/warehouse/` itself.

Create a database at a custom location:
```sql
hive> create DATABASE hive_test2 LOCATION '/test/hive';
OK
Time taken: 0.119 seconds
```

```shell
[hadoop@hadoop000 network-scripts]$ hadoop fs -ls /test/
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2020-09-09 06:29 /test/hive
```
Custom database properties:

```sql
DESC DATABASE [EXTENDED] db_name;  -- EXTENDED also shows the extra properties
```
```sql
hive> create DATABASE hive_test3 LOCATION '/test/hive'
    > with DBPROPERTIES('creator'='jack');
OK
Time taken: 0.078 seconds
hive> desc database hive_test3;
OK
hive_test3      hdfs://hadoop000:8020/test/hive    hadoop    USER
Time taken: 0.048 seconds, Fetched: 1 row(s)
hive> desc database extended hive_test3;
OK
hive_test3      hdfs://hadoop000:8020/test/hive    hadoop    USER    {creator=jack}
Time taken: 0.018 seconds, Fetched: 1 row(s)
```
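Properties can also be changed after creation with `ALTER DATABASE` (a small sketch; the value here is illustrative):

```sql
-- Adds or overwrites a key in the database's DBPROPERTIES:
ALTER DATABASE hive_test3 SET DBPROPERTIES ('creator'='tom');
```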
Display the current database in the CLI prompt:

```sql
hive> set hive.cli.print.current.db;
hive.cli.print.current.db=false
hive> set hive.cli.print.current.db=true;
hive (default)>
```
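To make this permanent, the setting can go into `~/.hiverc`, which the Hive CLI runs at startup:

```sql
-- ~/.hiverc: executed automatically when the Hive CLI starts
set hive.cli.print.current.db=true;
```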
Drop a database:

```sql
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
```
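RESTRICT is the default and refuses to drop a database that still contains tables; CASCADE drops the tables along with it. A quick sketch (`some_db` is a placeholder name):

```sql
-- Fails under RESTRICT (the default) if the database is not empty:
DROP DATABASE IF EXISTS some_db;
-- Drops the database together with all of its tables:
DROP DATABASE IF EXISTS some_db CASCADE;
```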
```sql
hive (default)> show databases;
OK
default
hive
hive2
hive_test
hive_test3
test
Time taken: 0.02 seconds, Fetched: 6 row(s)
hive (default)> drop database hive_test3;
OK
Time taken: 0.099 seconds
hive (default)> show databases;
OK
default
hive
hive2
hive_test
test
Time taken: 0.019 seconds, Fetched: 5 row(s)
```
Filter the list with a pattern:

```sql
hive (default)> show databases like 'hive*';
OK
hive
hive2
hive_test
Time taken: 0.024 seconds, Fetched: 3 row(s)
```
Switch to a database:

```sql
USE database_name;
```
Create Table syntax:

```sql
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name    -- table name
  [(col_name data_type [COMMENT col_comment], ... [constraint_specification])]  -- columns and their data types
  [COMMENT table_comment]                                                   -- table comment
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]          -- partitioning rules
  [CLUSTERED BY (col_name, col_name, ...)
    [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]        -- bucketing rules
  [SKEWED BY (col_name, col_name, ...)
    ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
    [STORED AS DIRECTORIES]]                                                -- skewed columns and values
  [[ROW FORMAT row_format]
   [STORED AS file_format]
   | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]]   -- row delimiter, file format, or custom storage handler
  [LOCATION hdfs_path]                                                      -- storage location of the table
  [TBLPROPERTIES (property_name=property_value, ...)]                       -- table properties
  [AS select_statement];                                                    -- create the table from a query result
```
Create an `emp` table:

```sql
CREATE TABLE emp(
  empno int,
  ename string,
  job string,
  mgr int,
  hiredate string,
  sal double,
  comm double,
  deptno int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
```
```sql
hive> CREATE TABLE emp(
    >   empno int,
    >   ename string,
    >   job string,
    >   mgr int,
    >   hiredate string,
    >   sal double,
    >   comm double,
    >   deptno int
    > ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.115 seconds
hive> desc formatted emp;
OK
# col_name              data_type               comment

empno                   int
ename                   string
job                     string
mgr                     int
hiredate                string
sal                     double
comm                    double
deptno                  int

# Detailed Table Information
Database:               hive
Owner:                  hadoop
CreateTime:             Wed Sep 09 09:34:57 CST 2020
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://hadoop000:8020/user/hive/warehouse/hive.db/emp
Table Type:             MANAGED_TABLE
Table Parameters:
        transient_lastDdlTime   1599615297

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        field.delim             \t
        serialization.format    \t
Time taken: 0.131 seconds, Fetched: 34 row(s)
```
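`desc formatted` shows `Table Type: MANAGED_TABLE`, meaning a drop deletes the data too. For contrast, a minimal sketch of an EXTERNAL table (the name and location here are made up): dropping it removes only the metadata, and the files under `LOCATION` stay on HDFS:

```sql
CREATE EXTERNAL TABLE emp_external(
  empno int,
  ename string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/test/emp_external';
```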
Load data into the table (a DML statement; see the DML section below):

```sql
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
  [PARTITION (partcol1=val1, partcol2=val2 ...)]

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
  [PARTITION (partcol1=val1, partcol2=val2 ...)]
  [INPUTFORMAT 'inputformat' SERDE 'serde']   -- Hive 3.0 or later
```

```sql
LOAD DATA LOCAL INPATH '/home/hadoop/data/emp.txt' OVERWRITE INTO TABLE emp;
```
```sql
hive> LOAD DATA LOCAL INPATH '/home/hadoop/data/emp.txt' OVERWRITE INTO TABLE emp;
Loading data to table hive.emp
Table hive.emp stats: [numFiles=1, totalSize=700]
OK
Time taken: 2.482 seconds
hive> select * from emp;
OK
7369    SMITH    CLERK        7902    1980-12-17    800.0      NULL    20
7499    ALLEN    SALESMAN     7698    1981-2-20     1600.0     300.0   30
7521    WARD     SALESMAN     7698    1981-2-22     1250.0     500.0   30
7566    JONES    MANAGER      7839    1981-4-2      2975.0     NULL    20
7654    MARTIN   SALESMAN     7698    1981-9-28     1250.0     1400.0  30
7698    BLAKE    MANAGER      7839    1981-5-1      2850.0     NULL    30
7782    CLARK    MANAGER      7839    1981-6-9      2450.0     NULL    10
7788    SCOTT    ANALYST      7566    1987-4-19     3000.0     NULL    20
7839    KING     PRESIDENT    NULL    1981-11-17    5000.0     NULL    10
7844    TURNER   SALESMAN     7698    1981-9-8      1500.0     0.0     30
7876    ADAMS    CLERK        7788    1987-5-23     1100.0     NULL    20
7900    JAMES    CLERK        7698    1981-12-3     950.0      NULL    30
7902    FORD     ANALYST      7566    1981-12-3     3000.0     NULL    20
7934    MILLER   CLERK        7782    1982-1-23     1300.0     NULL    10
8888    HIVE     PROGRAM      7839    1988-1-23     10300.0    NULL    NULL
Time taken: 0.363 seconds, Fetched: 15 row(s)
```
Rename a table:

```sql
ALTER TABLE table_name RENAME TO new_table_name;
```
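A quick sketch (both table names here are illustrative); for a managed table the rename also moves its HDFS directory under the warehouse path:

```sql
-- Renames the table and, for managed tables, its HDFS directory as well:
ALTER TABLE emp_1 RENAME TO emp_backup;
```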
Hive Data Manipulation Language
```sql
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
  [PARTITION (partcol1=val1, partcol2=val2 ...)]

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
  [PARTITION (partcol1=val1, partcol2=val2 ...)]
  [INPUTFORMAT 'inputformat' SERDE 'serde']   -- Hive 3.0 or later
```
- LOCAL: if present, the file is taken from the local filesystem of the server; if absent, it is taken from HDFS.
- OVERWRITE: if present, existing data in the table is replaced; if absent, the new data is appended.
- INPATH accepts a relative path such as `project/data1`, an absolute path such as `/user/hive/project/data1`, or a full URI such as `hdfs://namenode:9000/user/hive/project/data1` (see the sketch below for an HDFS load).
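One detail worth knowing when loading from HDFS (no LOCAL): the source file is moved into the table's directory, not copied. A sketch with a made-up path:

```sql
-- Moves /data/emp.txt on HDFS into emp's warehouse directory,
-- appending to the existing data (no OVERWRITE):
LOAD DATA INPATH '/data/emp.txt' INTO TABLE emp;
```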
Create a table from a query (CTAS):

```sql
create table emp_1 as select * from emp;
```
Export query results to a local directory:

```sql
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
select empno, ename, sal, deptno from emp;
```

```sql
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive'
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    > select empno, ename, sal, deptno from emp;
Query ID = hadoop_20200909102020_aeb2ef7d-cf18-4bcb-b903-8c6ea1719626
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1599583423179_0001, Tracking URL = http://hadoop000:8088/proxy/application_1599583423179_0001/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.15.1/bin/hadoop job -kill job_1599583423179_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-09-09 10:21:18,074 Stage-1 map = 0%, reduce = 0%
2020-09-09 10:21:29,109 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.64 sec
MapReduce Total cumulative CPU time: 5 seconds 640 msec
Ended Job = job_1599583423179_0001
Copying data to local directory /tmp/hive
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 5.64 sec   HDFS Read: 4483 HDFS Write: 313 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 640 msec
OK
Time taken: 35.958 seconds
```

```shell
[hadoop@hadoop000 hive]$ cat 000000_0
7369    SMITH     800.0      20
7499    ALLEN     1600.0     30
7521    WARD      1250.0     30
7566    JONES     2975.0     20
7654    MARTIN    1250.0     30
7698    BLAKE     2850.0     30
7782    CLARK     2450.0     10
7788    SCOTT     3000.0     20
7839    KING      5000.0     10
7844    TURNER    1500.0     30
7876    ADAMS     1100.0     20
7900    JAMES     950.0      30
7902    FORD      3000.0     20
7934    MILLER    1300.0     10
8888    HIVE      10300.0    \N
```
Queries are no different from ordinary SQL:

```sql
select * from emp where deptno=10;
```
Aggregate functions (max, min, avg, sum) are the kind of query that has to run MapReduce:
```sql
hive> select count(1) from emp where deptno=10;
Query ID = hadoop_20200909104949_1ce185de-2025-4633-9324-3e47f30fb157
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1599583423179_0002, Tracking URL = http://hadoop000:8088/proxy/application_1599583423179_0002/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.15.1/bin/hadoop job -kill job_1599583423179_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-09-09 10:50:00,361 Stage-1 map = 0%, reduce = 0%
2020-09-09 10:50:10,092 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 6.52 sec
2020-09-09 10:50:25,233 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.72 sec
MapReduce Total cumulative CPU time: 11 seconds 720 msec
Ended Job = job_1599583423179_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 11.72 sec   HDFS Read: 9708 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 720 msec
OK
3
Time taken: 38.666 seconds, Fetched: 1 row(s)
hive> select * from emp where deptno=10;
OK
7782    CLARK    MANAGER      7839    1981-6-9      2450.0    NULL    10
7839    KING     PRESIDENT    NULL    1981-11-17    5000.0    NULL    10
7934    MILLER   CLERK        7782    1982-1-23     1300.0    NULL    10
Time taken: 0.209 seconds, Fetched: 3 row(s)
```
```sql
select deptno, avg(sal) from emp group by deptno;
```

Note: any column in the `select` list that is not inside an aggregate function must appear in the `group by` clause, as the sketch below illustrates.
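A quick sketch of the rule:

```sql
-- Valid: deptno is grouped; sal appears only inside aggregates.
SELECT deptno, AVG(sal), MAX(sal) FROM emp GROUP BY deptno;

-- Invalid: ename is neither aggregated nor listed in GROUP BY.
-- SELECT deptno, ename, AVG(sal) FROM emp GROUP BY deptno;
```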
Joins come in when more than one table is involved. Create and load a `dept` table:
```sql
CREATE TABLE dept(
  deptno int,
  dname string,
  loc string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
```
```sql
LOAD DATA LOCAL INPATH '/home/hadoop/data/dept.txt' OVERWRITE INTO TABLE dept;
```
```sql
select e.empno, e.ename, e.sal, e.deptno, d.dname
from emp e join dept d
on e.deptno = d.deptno;
```
```sql
hive> select e.empno,e.ename,e.sal,e.deptno,d.dname
    > from emp e join dept d
    > on e.deptno=d.deptno;
Query ID = hadoop_20200909140808_8635204d-8e8a-4267-8503-ef242f022ebc
Total jobs = 1
2020-09-09 02:08:51     Starting to launch local task to process map join;      maximum memory = 477626368
2020-09-09 02:08:54     End of local task; Time Taken: 3.023 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1599583423179_0004, Tracking URL = http://hadoop000:8088/proxy/application_1599583423179_0004/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.15.1/bin/hadoop job -kill job_1599583423179_0004
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2020-09-09 14:09:06,852 Stage-3 map = 0%, reduce = 0%
2020-09-09 14:09:18,823 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 6.7 sec
MapReduce Total cumulative CPU time: 6 seconds 700 msec
Ended Job = job_1599583423179_0004
MapReduce Jobs Launched:
Stage-Stage-3: Map: 1   Cumulative CPU: 6.7 sec   HDFS Read: 7649 HDFS Write: 406 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 700 msec
OK
7369    SMITH     800.0     20    RESEARCH
7499    ALLEN     1600.0    30    SALES
7521    WARD      1250.0    30    SALES
7566    JONES     2975.0    20    RESEARCH
7654    MARTIN    1250.0    30    SALES
7698    BLAKE     2850.0    30    SALES
7782    CLARK     2450.0    10    ACCOUNTING
7788    SCOTT     3000.0    20    RESEARCH
7839    KING      5000.0    10    ACCOUNTING
7844    TURNER    1500.0    30    SALES
7876    ADAMS     1100.0    20    RESEARCH
7900    JAMES     950.0     30    SALES
7902    FORD      3000.0    20    RESEARCH
7934    MILLER    1300.0    10    ACCOUNTING
Time taken: 46.765 seconds, Fetched: 14 row(s)
```
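The inner join returns 14 rows while `emp` has 15: empno 8888 has a NULL `deptno`, so it is filtered out. A LEFT JOIN (a quick sketch) would keep it:

```sql
-- Keeps every emp row; dname is NULL where no dept matches (e.g. empno 8888):
select e.empno, e.ename, d.dname
from emp e left join dept d
on e.deptno = d.deptno;
```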
Use `explain` to inspect the execution plan:

```sql
explain
select e.empno, e.ename, e.sal, e.deptno, d.dname
from emp e join dept d
on e.deptno = d.deptno;
```
```sql
hive> explain
    > select e.empno,e.ename,e.sal,e.deptno,d.dname
    > from emp e join dept d
    > on e.deptno=d.deptno;
OK
STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d
          TableScan
            alias: d
            Statistics: Num rows: 1 Data size: 79 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: deptno is not null (type: boolean)
              Statistics: Num rows: 1 Data size: 79 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 deptno (type: int)
                  1 deptno (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: e
            Statistics: Num rows: 6 Data size: 700 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: deptno is not null (type: boolean)
              Statistics: Num rows: 3 Data size: 350 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 deptno (type: int)
                  1 deptno (type: int)
                outputColumnNames: _col0, _col1, _col5, _col7, _col12
                Statistics: Num rows: 3 Data size: 385 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: string), _col5 (type: double), _col7 (type: int), _col12 (type: string)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4
                  Statistics: Num rows: 3 Data size: 385 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 3 Data size: 385 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
```
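The plan shows a Map Join: the small `dept` table is hash-loaded locally in Stage-4, so the join needs no reduce phase. This conversion is controlled by `hive.auto.convert.join`; turning it off (a sketch) makes the same `explain` show the shuffle-based common join instead:

```sql
-- With automatic map-join conversion disabled, the plan falls back
-- to a reduce-side (common) join:
set hive.auto.convert.join=false;
explain
select e.empno, e.ename, e.sal, e.deptno, d.dname
from emp e join dept d
on e.deptno = d.deptno;
```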