Big Data Technology: Hive

Date: 2020-05-08


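Chapter 1 Getting Started with Hive

1.1 What Is Hive

Hive was open-sourced by Facebook to compute statistics over massive volumes of structured log data.

Hive is a data warehouse tool built on Hadoop. It maps structured data files to tables and provides SQL-like query capability.

In essence, Hive translates HQL into MapReduce programs:

1) The data that Hive processes is stored in HDFS.

2) The underlying implementation that Hive uses to analyze data is MapReduce.

3) The resulting programs run on YARN.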
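To make the idea of turning a query into MapReduce concrete, here is a minimal Python sketch of how a query such as select deptno, count(*) from emp group by deptno could be split into map, shuffle, and reduce steps. It is a conceptual model only, not the code Hive generates; the sample rows and the function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical rows of an emp table: (ename, deptno).
rows = [("SMITH", 20), ("ALLEN", 30), ("WARD", 30), ("CLARK", 10), ("SCOTT", 20)]

def map_phase(row):
    """Emit (group key, 1) for one input row, as a mapper would."""
    _ename, deptno = row
    yield deptno, 1

def shuffle(pairs):
    """Group the emitted values by key and sort by key, like the shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_phase(key, values):
    """Aggregate all values of one key, as a reducer would for count(*)."""
    return key, sum(values)

def run_group_by_count(rows):
    pairs = [pair for row in rows for pair in map_phase(row)]
    return [reduce_phase(key, values) for key, values in shuffle(pairs).items()]

result = run_group_by_count(rows)
print(result)
```

In a real cluster the map and reduce functions run in parallel on many nodes; the sketch runs them sequentially to keep the data flow visible.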

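1.2 Advantages and Disadvantages of Hive

1.2.1 Advantages

The interface uses SQL-like syntax, which enables rapid development (simple and easy to pick up).
It avoids hand-writing MapReduce, which lowers the learning cost for developers.
Hive's execution latency is relatively high, so it is typically used for data analysis where real-time response is not required.
Hive's strength is processing large data sets; it has no advantage on small data sets because of its relatively high execution latency.
Hive supports user-defined functions, so users can implement functions that fit their own needs.

1.2.2 Disadvantages

1. HQL has limited expressive power

(1) Iterative algorithms cannot be expressed.

(2) It is not well suited to data mining.

2. Hive's efficiency is relatively low

(1) The MapReduce jobs that Hive generates automatically are usually not very intelligent.

(2) Hive is hard to tune, and the tuning is coarse-grained.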

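1.3 Hive Architecture

Figure 6-1 Hive architecture

1. User interface: Client

CLI (hive shell), JDBC/ODBC (access to Hive from Java), and WEBUI (access to Hive from a browser).

2. Metadata: Metastore

Metadata includes the table name, the database the table belongs to (default by default), the table owner, the column and partition fields, the table type (whether it is an external table), the directory that holds the table's data, and so on.

By default the metadata is stored in the bundled Derby database; using MySQL to store the Metastore is recommended.

3. Hadoop

HDFS is used for storage and MapReduce for computation.

4. Driver

(1) Parser (SQL Parser): converts the SQL string into an abstract syntax tree (AST), usually with a third-party library such as ANTLR, and then analyzes the AST, for example checking whether the tables and columns exist and whether the SQL is semantically valid.

(2) Compiler (Physical Plan): compiles the AST into a logical execution plan.

(3) Optimizer (Query Optimizer): optimizes the logical execution plan.

(4) Executor (Execution): converts the logical execution plan into a runnable physical plan. For Hive, that means MR/Spark.

Through a set of interactive interfaces, Hive receives the user's instructions (SQL), uses its own Driver together with the metadata (MetaStore) to translate them into MapReduce jobs, submits the jobs to Hadoop for execution, and finally returns the results to the user interface.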

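1.4 Hive Compared with Databases

Because Hive uses HQL (Hive Query Language), a query language similar to SQL, it is easy to mistake Hive for a database. Structurally, however, Hive and a database have nothing in common beyond a similar query language. This section describes the differences from several angles. A database can serve online applications, whereas Hive is designed for data warehousing; keeping this in mind helps in understanding Hive's characteristics from an application point of view.

1.4.1 Query Language

Because SQL is widely used in data warehousing, the SQL-like query language HQL was designed specifically around Hive's characteristics. Developers who are familiar with SQL can use Hive with little effort.

1.4.2 Data Storage Location

Hive is built on Hadoop, and all Hive data is stored in HDFS. A database can keep its data on block devices or in the local file system.

1.4.3 Data Updates

Hive is designed for data warehouse workloads, which are read-heavy and write-light. Rewriting data in Hive is therefore discouraged; all data is fixed at load time. Data in a database usually needs frequent modification, so INSERT INTO … VALUES is used to add rows and UPDATE … SET to modify them.

1.4.4 Indexes

Hive does not process data while loading it and does not even scan it, so it builds no index on any key in the data. To find the values that satisfy a condition, Hive has to brute-force scan the entire data set, so access latency is high. Thanks to MapReduce, Hive can read the data in parallel, so even without indexes it still has an advantage when accessing large volumes of data. A database usually builds indexes on one or a few columns, so it can retrieve a small amount of data matching specific conditions efficiently and with low latency. The high access latency makes Hive unsuitable for online queries.

1.4.5 Execution

Most Hive queries are executed through MapReduce, which Hadoop provides. A database usually has its own execution engine.

1.4.6 Execution Latency

When Hive queries data it has no index and must scan the whole table, so latency is high. The other cause of high latency is the MapReduce framework: MapReduce itself has high latency, so Hive queries executed through it inherit that latency. A database, by comparison, has low execution latency. That holds only while the data volume is small; once the data grows beyond what the database can handle, Hive's parallel computation clearly shows its advantage.

1.4.7 Scalability

Because Hive is built on Hadoop, its scalability matches that of Hadoop (the world's largest Hadoop cluster was at Yahoo!, with about 4,000 nodes in 2009). A database, constrained by strict ACID semantics, scales only to a very limited extent. At the time of writing, Oracle, then the most advanced parallel database, could in theory scale to only about 100 nodes.

1.4.8 Data Volume

Because Hive runs on a cluster and can use MapReduce for parallel computation, it supports very large data volumes; a database supports comparatively small ones.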

第2章 Hive安裝

2.1 Hive安裝地址

1．Hive官網地址

http://hive.apache.org/

2．文檔查看地址

https://cwiki.apache.org/confluence/display/Hive/GettingStarted

3．下載地址

http://archive.apache.org/dist/hive/

4．github地址

https://github.com/apache/hive

2.2 Hive安裝部署

1．Hive安裝及配置

（1）把apache-hive-1.2.1-bin.tar.gz上傳到linux的/opt/software目錄下

（2）解壓apache-hive-1.2.1-bin.tar.gz到/opt/module/目錄下面

[atguigu@hadoop102 software]$ tar -zxvf apache-hive-1.2.1-bin.tar.gz -C /opt/module/

(3) Rename the extracted directory apache-hive-1.2.1-bin to hive

[atguigu@hadoop102 module]$ mv apache-hive-1.2.1-bin/ hive

（4）修改/opt/module/hive/conf目錄下的hive-env.sh.template名稱爲hive-env.sh

[atguigu@hadoop102 conf]$ mv hive-env.sh.template hive-env.sh

（5）配置hive-env.sh文件

（a）配置HADOOP_HOME路徑

export HADOOP_HOME=/opt/module/hadoop-2.7.2

（b）配置HIVE_CONF_DIR路徑

export HIVE_CONF_DIR=/opt/module/hive/conf

2．Hadoop集羣配置

（1）必須啓動hdfs和yarn

[atguigu@hadoop102 hadoop-2.7.2]$ sbin/start-dfs.sh

[atguigu@hadoop103 hadoop-2.7.2]$ sbin/start-yarn.sh

(2) Create the /tmp and /user/hive/warehouse directories on HDFS and make both group-writable

[atguigu@hadoop102 hadoop-2.7.2]$ bin/hadoop fs -mkdir /tmp

[atguigu@hadoop102 hadoop-2.7.2]$ bin/hadoop fs -mkdir -p /user/hive/warehouse

[atguigu@hadoop102 hadoop-2.7.2]$ bin/hadoop fs -chmod g+w /tmp

[atguigu@hadoop102 hadoop-2.7.2]$ bin/hadoop fs -chmod g+w /user/hive/warehouse

3．Hive基本操做

（1）啓動hive

[atguigu@hadoop102 hive]$ bin/hive

（2）查看數據庫

hive> show databases;

（3）打開默認數據庫

hive> use default;

（4）顯示default數據庫中的表

hive> show tables;

（5）建立一張表

hive> create table student(id int, name string);

（6）顯示數據庫中有幾張表

hive> show tables;

（7）查看錶的結構

hive> desc student;

（8）向表中插入數據

hive> insert into student values(1000,"ss");

（9）查詢表中數據

hive> select * from student;

（10）退出hive

hive> quit;

2.3 將本地文件導入Hive案例

需求

Load the data in the local file /opt/module/datas/student.txt into the Hive table student(id int, name string).

1．數據準備

在/opt/module/datas這個目錄下準備數據

（1）在/opt/module/目錄下建立datas

[atguigu@hadoop102 module]$ mkdir datas

（2）在/opt/module/datas/目錄下建立student.txt文件並添加數據

[atguigu@hadoop102 datas]$ touch student.txt

[atguigu@hadoop102 datas]$ vi student.txt

1001 zhangshan

1002 lishi

1003 zhaoliu

Note: the two columns must be separated by a tab character, not spaces.
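Because editors sometimes replace tabs with spaces, it can help to generate the file programmatically. The following Python sketch writes the three sample rows with a real tab between the columns and checks the result. The temporary directory is an assumption made so the sketch runs anywhere; in the tutorial the file lives at /opt/module/datas/student.txt.

```python
import csv
import os
import tempfile

# Sample rows taken from the tutorial's student.txt.
rows = [(1001, "zhangshan"), (1002, "lishi"), (1003, "zhaoliu")]

def write_tsv(path, rows):
    """Write rows with a real tab between columns and one row per line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t", lineterminator="\n").writerows(rows)

# A temporary directory stands in for /opt/module/datas (assumption for portability).
path = os.path.join(tempfile.mkdtemp(), "student.txt")
write_tsv(path, rows)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()

# Every line must split into exactly two fields on the tab character.
assert all(len(line.split("\t")) == 2 for line in lines)
print(lines)
```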

2．Hive實際操做

（1）啓動hive

[atguigu@hadoop102 hive]$ bin/hive

（2）顯示數據庫

hive> show databases;

（3）使用default數據庫

hive> use default;

（4）顯示default數據庫中的表

hive> show tables;

（5）刪除已建立的student表

hive> drop table student;

(6) Create the student table and declare '\t' as the field delimiter

hive> create table student(id int, name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

（7）加載/opt/module/datas/student.txt 文件到student數據庫表中。

hive> load data local inpath '/opt/module/datas/student.txt' into table student;

（8）Hive查詢結果

hive> select * from student;

1001 zhangshan

1002 lishi

1003 zhaoliu

Time taken: 0.266 seconds, Fetched: 3 row(s)

3．遇到的問題

Opening a second client window and starting hive there raises a java.sql.SQLException.

Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException:

Unable to instantiate

org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)

at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:677)

at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:621)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at org.apache.hadoop.util.RunJar.run(RunJar.java:221)

at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1523)

at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)

at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)

at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)

at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)

at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)

at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)

... 8 more

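The cause is that the Metastore is stored in the bundled Derby database by default, and embedded Derby allows only one client at a time, so a second Hive client cannot open it. Using MySQL to store the Metastore is recommended.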

2.4 MySql安裝

2.4.1 安裝包準備

1．查看mysql是否安裝，若是安裝了，卸載mysql

（1）查看

[root@hadoop102 桌面]# rpm -qa|grep mysql

mysql-libs-5.1.73-7.el6.x86_64

（2）卸載

[root@hadoop102 桌面]# rpm -e --nodeps mysql-libs-5.1.73-7.el6.x86_64

2．解壓mysql-libs.zip文件到當前目錄

[root@hadoop102 software]# unzip mysql-libs.zip

[root@hadoop102 software]# ls

mysql-libs.zip

mysql-libs

3．進入到mysql-libs文件夾下

[root@hadoop102 mysql-libs]# ll

總用量 76048

-rw-r--r--. 1 root root 18509960 3月 26 2015 MySQL-client-5.6.24-1.el6.x86_64.rpm

-rw-r--r--. 1 root root 3575135 12月 1 2013 mysql-connector-java-5.1.27.tar.gz

-rw-r--r--. 1 root root 55782196 3月 26 2015 MySQL-server-5.6.24-1.el6.x86_64.rpm

2.4.2 安裝MySql服務器

1．安裝mysql服務端

[root@hadoop102 mysql-libs]# rpm -ivh MySQL-server-5.6.24-1.el6.x86_64.rpm

2．查看產生的隨機密碼

[root@hadoop102 mysql-libs]# cat /root/.mysql_secret

OEXaQuS8IWkG19Xs

3．查看mysql狀態

[root@hadoop102 mysql-libs]# service mysql status

4．啓動mysql

[root@hadoop102 mysql-libs]# service mysql start

2.4.3 安裝MySql客戶端

1．安裝mysql客戶端

[root@hadoop102 mysql-libs]# rpm -ivh MySQL-client-5.6.24-1.el6.x86_64.rpm

2．連接mysql

[root@hadoop102 mysql-libs]# mysql -uroot -pOEXaQuS8IWkG19Xs

3．修改密碼

mysql>SET PASSWORD=PASSWORD('000000');

4．退出mysql

mysql>exit

2.4.4 MySql中user表中主機配置

配置只要是root用戶+密碼，在任何主機上都能登陸MySQL數據庫。

1．進入mysql

[root@hadoop102 mysql-libs]# mysql -uroot -p000000

2．顯示數據庫

mysql>show databases;

3．使用mysql數據庫

mysql>use mysql;

4．展現mysql數據庫中的全部表

mysql>show tables;

5．展現user表的結構

mysql>desc user;

6．查詢user表

mysql>select User, Host, Password from user;

7. Update the user table, changing the Host column to %

mysql>update user set host='%' where host='localhost';

8．刪除root用戶的其餘host

mysql>delete from user where Host='hadoop102';

mysql>delete from user where Host='127.0.0.1';

mysql>delete from user where Host='::1';

9．刷新

mysql>flush privileges;

10．退出

mysql>quit;

2.5 Hive元數據配置到MySql

2.5.1 驅動拷貝

1．在/opt/software/mysql-libs目錄下解壓mysql-connector-java-5.1.27.tar.gz驅動包

[root@hadoop102 mysql-libs]# tar -zxvf mysql-connector-java-5.1.27.tar.gz

2. Copy mysql-connector-java-5.1.27-bin.jar from /opt/software/mysql-libs/mysql-connector-java-5.1.27 to /opt/module/hive/lib/

[root@hadoop102 mysql-connector-java-5.1.27]# cp mysql-connector-java-5.1.27-bin.jar /opt/module/hive/lib/

2.5.2 配置Metastore到MySql

1．在/opt/module/hive/conf目錄下建立一個hive-site.xml

[atguigu@hadoop102 conf]$ touch hive-site.xml

[atguigu@hadoop102 conf]$ vi hive-site.xml

2. Following the official documentation, copy the configuration below into hive-site.xml (the user name and password are the MySQL root credentials set in section 2.4)

https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://hadoop102:3306/metastore?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description>username to use against metastore database</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>000000</value>
    <description>password to use against metastore database</description>
  </property>
</configuration>

3. If hive fails to start after this configuration, try rebooting the virtual machine. (After the reboot, remember to start the Hadoop cluster.)

2.5.3 多窗口啓動Hive測試

1．先啓動MySQL

[atguigu@hadoop102 mysql-libs]$ mysql -uroot -p000000

查看有幾個數據庫

mysql> show databases;

+--------------------+

| Database |

+--------------------+

| information_schema |

| mysql |

| performance_schema |

| test |

+--------------------+

2．再次打開多個窗口，分別啓動hive

[atguigu@hadoop102 hive]$ bin/hive

3．啓動hive後，回到MySQL窗口查看數據庫，顯示增長了metastore數據庫

mysql> show databases;

+--------------------+

| Database |

+--------------------+

| information_schema |

| metastore |

| mysql |

| performance_schema |

| test |

+--------------------+

2.6 HiveJDBC訪問

2.6.1 啓動hiveserver2服務

[atguigu@hadoop102 hive]$ bin/hiveserver2

2.6.2 啓動beeline

[atguigu@hadoop102 hive]$ bin/beeline

Beeline version 1.2.1 by Apache Hive

beeline>

2.6.3 鏈接hiveserver2

beeline> !connect jdbc:hive2://hadoop102:10000（回車）

Connecting to jdbc:hive2://hadoop102:10000

Enter username for jdbc:hive2://hadoop102:10000: atguigu（回車）

Enter password for jdbc:hive2://hadoop102:10000: （直接回車）

Connected to: Apache Hive (version 1.2.1)

Driver: Hive JDBC (version 1.2.1)

Transaction isolation: TRANSACTION_REPEATABLE_READ

0: jdbc:hive2://hadoop102:10000> show databases;

+----------------+--+

| database_name |

+----------------+--+

| default |

| hive_db2 |

+----------------+--+

2.7 Hive經常使用交互命令

[atguigu@hadoop102 hive]$ bin/hive -help

usage: hive

-d,--define <key=value> Variable subsitution to apply to hive

commands. e.g. -d A=B or --define A=B

--database <databasename> Specify the database to use

-e <quoted-query-string> SQL from command line

-f <filename> SQL from files

-H,--help Print help information

--hiveconf <property=value> Use value for given property

--hivevar <key=value> Variable subsitution to apply to hive

commands. e.g. --hivevar A=B

-i <filename> Initialization SQL file

-S,--silent Silent mode in interactive shell

-v,--verbose Verbose mode (echo executed SQL to the console)

1．"-e"不進入hive的交互窗口執行sql語句

[atguigu@hadoop102 hive]$ bin/hive -e "select id from student;"

2．"-f"執行腳本中sql語句

（1）在/opt/module/datas目錄下建立hivef.sql文件

[atguigu@hadoop102 datas]$ touch hivef.sql

文件中寫入正確的sql語句

select *from student;

（2）執行文件中的sql語句

[atguigu@hadoop102 hive]$ bin/hive -f /opt/module/datas/hivef.sql

（3）執行文件中的sql語句並將結果寫入文件中

[atguigu@hadoop102 hive]$ bin/hive -f /opt/module/datas/hivef.sql > /opt/module/datas/hive_result.txt

2.8 Hive其餘命令操做

1．退出hive窗口：

hive(default)>exit;

hive(default)>quit;

在新版的hive中沒區別了，在之前的版本是有的：

exit:先隱性提交數據，再退出；

quit:不提交數據，退出；

2．在hive cli命令窗口中如何查看hdfs文件系統

hive(default)>dfs -ls /;

3．在hive cli命令窗口中如何查看本地文件系統

hive(default)>! ls /opt/module/datas;

4．查看在hive中輸入的全部歷史命令

（1）進入到當前用戶的根目錄/root或/home/atguigu

(2) View the .hivehistory file

[atguigu@hadoop102 ~]$ cat .hivehistory

2.9 Hive常見屬性配置

2.9.1 Hive數據倉庫位置配置

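1) The original location of the Default data warehouse is /user/hive/warehouse on HDFS.

2) No folder is created under the warehouse directory for the default database. A table that belongs to the default database gets its folder directly under the warehouse directory.

3) To change the location of the default warehouse, copy the following configuration from hive-default.xml.template into hive-site.xml.

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
  <description>location of default database for the warehouse</description>
</property>

Grant write permission to users in the same group:

bin/hdfs dfs -chmod g+w /user/hive/warehouse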

2.9.2 查詢後信息顯示配置

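1) Add the following configuration to hive-site.xml to show the current database in the prompt and the column headers in query results.

<property>
  <name>hive.cli.print.header</name>
  <value>true</value>
</property>

<property>
  <name>hive.cli.print.current.db</name>
  <value>true</value>
</property>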

2）從新啓動hive，對比配置先後差別。

（1）配置前，如圖6-2所示

圖6-2 配置前

（2）配置後，如圖6-3所示

圖6-3 配置後

2.9.3 Hive運行日誌信息配置

1. By default Hive writes its log to /tmp/atguigu/hive.log (in a directory named after the current user)

2．修改hive的log存放日誌到/opt/module/hive/logs

（1）修改/opt/module/hive/conf/hive-log4j.properties.template文件名稱爲

hive-log4j.properties

[atguigu@hadoop102 conf]$ pwd

/opt/module/hive/conf

[atguigu@hadoop102 conf]$ mv hive-log4j.properties.template hive-log4j.properties

（2）在hive-log4j.properties文件中修改log存放位置

hive.log.dir=/opt/module/hive/logs

2.9.4 參數配置方式

1．查看當前全部的配置信息

hive>set;

2．參數的配置三種方式

（1）配置文件方式

默認配置文件：hive-default.xml

用戶自定義配置文件：hive-site.xml

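Note: user-defined configuration overrides the default configuration. Hive also reads the Hadoop configuration, because Hive starts as a Hadoop client, and Hive's configuration overrides Hadoop's. Settings in the configuration files apply to every Hive process started on the machine.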

（2）命令行參數方式

啓動Hive時，能夠在命令行添加-hiveconf param=value來設定參數。

例如：

[atguigu@hadoop103 hive]$ bin/hive -hiveconf mapred.reduce.tasks=10;

注意：僅對本次hive啓動有效

查看參數設置：

hive (default)> set mapred.reduce.tasks;

（3）參數聲明方式

能夠在HQL中使用SET關鍵字設定參數

例如：

hive (default)> set mapred.reduce.tasks=100;

注意：僅對本次hive啓動有效。

查看參數設置

hive (default)> set mapred.reduce.tasks;

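The precedence of these three methods increases in order: configuration file < command-line parameter < SET statement. Note that some system-level parameters, such as the log4j settings, must be set with one of the first two methods, because they are read before the session is established.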
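The precedence order can be pictured as a stack of dictionaries that is searched from the highest-priority layer down. The Python sketch below models this with collections.ChainMap. It illustrates the lookup order only and is not how Hive implements its configuration; the layer names and the default values shown are assumptions for the example.

```python
from collections import ChainMap

# Each layer is a plain dict; the values are illustrative assumptions.
hive_default = {"mapred.reduce.tasks": "-1", "hive.cli.print.header": "false"}
hive_site = {"hive.cli.print.header": "true"}   # hive-site.xml
command_line = {"mapred.reduce.tasks": "10"}    # bin/hive -hiveconf ...
session_set = {"mapred.reduce.tasks": "100"}    # set ...; inside the session

# ChainMap searches its maps left to right, so the leftmost layer wins.
before_set = ChainMap(command_line, hive_site, hive_default)
after_set = ChainMap(session_set, command_line, hive_site, hive_default)

print(before_set["mapred.reduce.tasks"])    # value from the command line
print(after_set["mapred.reduce.tasks"])     # value from the SET statement
print(after_set["hive.cli.print.header"])   # value from hive-site.xml
```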

第3章 Hive數據類型

3.1 基本數據類型

表6-1

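Hive data type	Java data type	Length	Example
TINYINT	byte	1-byte signed integer	20
SMALLINT	short	2-byte signed integer	20
INT	int	4-byte signed integer	20
BIGINT	long	8-byte signed integer	20
BOOLEAN	boolean	Boolean, true or false	TRUE FALSE
FLOAT	float	Single-precision floating point	3.14159
DOUBLE	double	Double-precision floating point	3.14159
STRING	String	Character sequence. A character set can be specified. Single or double quotes can be used.	'now is the time' "for all good men"
TIMESTAMP		Timestamp type
BINARY		Byte array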

Hive's STRING type corresponds to VARCHAR in a database: it is a variable-length string, but its maximum length cannot be declared. In theory it can hold up to 2 GB of characters.

3.2 集合數據類型

表6-2

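Data type	Description	Syntax example
STRUCT	Similar to a struct in C; elements are accessed with dot notation. For example, if a column has type STRUCT<first:STRING, last:STRING>, the first element is referenced as column.first.	struct()
MAP	A MAP is a collection of key-value pairs whose values are accessed with array notation. For example, if a column has type MAP with the pairs 'first'->'John' and 'last'->'Doe', the last element is retrieved as column['last'].	map()
ARRAY	An array is a collection of variables with the same type and name. These variables are the elements of the array, and each element has an index starting at zero. For example, for the array value ['John', 'Doe'], the second element is referenced as array_name[1].	Array()

Hive has three complex data types: ARRAY, MAP, and STRUCT. ARRAY and MAP are similar to Array and Map in Java, and STRUCT is similar to a struct in C in that it wraps a set of named fields. Complex data types can be nested to any depth.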

案例實操

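1) Suppose a table contains the following row. JSON is used here to show its data structure as accessed in Hive:

{
    "name": "songsong",
    "friends": ["bingbing" , "lili"] , //Array (list)
    "children": { //Map (key-value pairs)
        "xiao song": 18 ,
        "xiaoxiao song": 19
    },
    "address": { //Struct
        "street": "hui long guan" ,
        "city": "beijing"
    }
}

2) Based on this data structure, create the corresponding table in Hive and load data into it.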

建立本地測試文件test.txt

songsong,bingbing_lili,xiao song:18_xiaoxiao song:19,hui long guan_beijing

yangyang,caicai_susu,xiao yang:18_xiaoxiao yang:19,chao yang_beijing

Note: the elements inside MAP, STRUCT, and ARRAY values can all be separated by the same character, here "_".

3）Hive上建立測試表test

create table test(

name string,

friends array<string>,

children map<string, int>,

address struct<street:string, city:string>

)

row format delimited fields terminated by ','

collection items terminated by '_'

map keys terminated by ':'

lines terminated by '\n';

字段解釋：

row format delimited fields terminated by ',' -- 列分隔符

collection items terminated by '_' --MAP STRUCT 和 ARRAY 的分隔符(數據分割符號)

map keys terminated by ':' -- MAP中的key與value的分隔符

lines terminated by '\n'; -- 行分隔符

4）導入文本數據到測試表

hive (default)> load data local inpath '/opt/module/datas/test.txt' into table test;

5）訪問三種集合列裏的數據，如下分別是ARRAY，MAP，STRUCT的訪問方式

hive (default)> select friends[1],children['xiao song'],address.city from test

where name="songsong";

_c0 _c1 city

lili 18 beijing

Time taken: 0.076 seconds, Fetched: 1 row(s)
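The way the three delimiters carve a text line into a string, an array, a map, and a struct can be sketched in Python. This is a simplified model of what the delimited row format does, not Hive's SerDe code; parse_row and its parameter names are hypothetical.

```python
def parse_row(line, field_sep=",", item_sep="_", key_sep=":"):
    """Split one line of test.txt into name, friends, children, and address."""
    name, friends, children, address = line.rstrip("\n").split(field_sep)
    street, city = address.split(item_sep)
    return {
        "name": name,
        # array<string>: items separated by the collection delimiter
        "friends": friends.split(item_sep),
        # map<string,int>: entries separated by item_sep, key and value by key_sep
        "children": {
            key: int(value)
            for key, value in (entry.split(key_sep) for entry in children.split(item_sep))
        },
        # struct<street:string, city:string>: fields separated by item_sep
        "address": {"street": street, "city": city},
    }

row = parse_row("songsong,bingbing_lili,xiao song:18_xiaoxiao song:19,hui long guan_beijing")

# Equivalent of: select friends[1], children['xiao song'], address.city
selected = (row["friends"][1], row["children"]["xiao song"], row["address"]["city"])
print(selected)
```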

3.3 類型轉化

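Hive's primitive data types can be converted implicitly, much like type conversion in Java. For example, if an expression expects INT, a TINYINT is converted to INT automatically. Hive does not convert in the opposite direction: if an expression expects TINYINT, an INT is not converted automatically and an error is returned unless CAST is used.

1. The implicit conversion rules are as follows

(1) Any integer type can be implicitly converted to a wider type; for example, TINYINT can be converted to INT, and INT to BIGINT.

(2) All integer types, FLOAT, and STRING can be implicitly converted to DOUBLE.

(3) TINYINT, SMALLINT, and INT can all be converted to FLOAT.

(4) BOOLEAN cannot be converted to any other type.

2. CAST can be used to convert data types explicitly

For example, CAST('1' AS INT) converts the string '1' to the integer 1. If the cast fails, as in CAST('X' AS INT), the expression returns NULL.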
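The four rules and the CAST behaviour can be written down as a small Python model. It encodes only the rules exactly as listed in this section (Hive's full conversion matrix has more entries), and both function names are hypothetical. Python's int() is also slightly more lenient than Hive, for example about surrounding whitespace, so treat cast_to_int as an approximation.

```python
INTEGERS = ["TINYINT", "SMALLINT", "INT", "BIGINT"]  # ordered narrow to wide

def can_implicitly_convert(src, dst):
    """Return True if the rules listed above allow src to become dst implicitly."""
    if src == dst:
        return True
    if "BOOLEAN" in (src, dst):
        return False                                   # rule (4)
    if src in INTEGERS and dst in INTEGERS:
        return INTEGERS.index(src) < INTEGERS.index(dst)  # rule (1)
    if dst == "DOUBLE":
        return src in INTEGERS or src in ("FLOAT", "STRING")  # rule (2)
    if dst == "FLOAT":
        return src in ("TINYINT", "SMALLINT", "INT")   # rule (3)
    return False

def cast_to_int(value):
    """Model of CAST(value AS INT): None (NULL) when the conversion fails."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

print(can_implicitly_convert("TINYINT", "INT"))  # widening is allowed
print(can_implicitly_convert("INT", "TINYINT"))  # narrowing needs CAST
print(cast_to_int("1"), cast_to_int("X"))
```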

第4章 DDL數據定義

4.1 建立數據庫

1）建立一個數據庫，數據庫在HDFS上的默認存儲路徑是/user/hive/warehouse/*.db。

hive (default)> create database db_hive;

2）避免要建立的數據庫已經存在錯誤，增長if not exists判斷。（標準寫法）

hive (default)> create database db_hive;

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Database db_hive already exists

hive (default)> create database if not exists db_hive;

3）建立一個數據庫，指定數據庫在HDFS上存放的位置

hive (default)> create database db_hive2 location '/db_hive2.db';

圖6-4 數據庫存放位置

4.2 查詢數據庫

4.2.1 顯示數據庫

1．顯示數據庫

hive> show databases;

2．過濾顯示查詢的數據庫

hive> show databases like 'db_hive*';

db_hive

db_hive_1

4.2.2 View Database Details

1. Show database information

hive> desc database db_hive;

db_hive hdfs://hadoop102:9000/user/hive/warehouse/db_hive.db atguigu USER

2. Show detailed database information with extended

hive> desc database extended db_hive;

db_hive hdfs://hadoop102:9000/user/hive/warehouse/db_hive.db atguigu USER

4.2.3 Switch the Current Database

hive (default)> use db_hive;

4.3 修改數據庫

用戶可使用ALTER DATABASE命令爲某個數據庫的DBPROPERTIES設置鍵-值對屬性值，來描述這個數據庫的屬性信息。數據庫的其餘元數據信息都是不可更改的，包括數據庫名和數據庫所在的目錄位置。

hive (default)> alter database db_hive set dbproperties('createtime'='20170830');

在hive中查看修改結果

hive> desc database extended db_hive;

db_name comment location owner_name owner_type parameters

db_hive hdfs://hadoop102:8020/user/hive/warehouse/db_hive.db atguigu USER {createtime=20170830}

4.4 刪除數據庫

1．刪除空數據庫

hive>drop database db_hive2;

2. If the database to be dropped may not exist, it is best to add if exists

hive> drop database db_hive2;

FAILED: SemanticException [Error 10072]: Database does not exist: db_hive2

hive> drop database if exists db_hive2;

3．若是數據庫不爲空，能夠採用cascade命令，強制刪除

hive> drop database db_hive;

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidOperationException(message:Database db_hive is not empty. One or more tables exist.)

hive> drop database db_hive cascade;

4.5 建立表

1．建表語法

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name

[(col_name data_type [COMMENT col_comment], ...)]

[COMMENT table_comment]

[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]

[CLUSTERED BY (col_name, col_name, ...)

[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]

[ROW FORMAT row_format]

[STORED AS file_format]

[LOCATION hdfs_path]

2．字段解釋說明

（1）CREATE TABLE 建立一個指定名字的表。若是相同名字的表已經存在，則拋出異常；用戶能夠用 IF NOT EXISTS 選項來忽略這個異常。

（2）EXTERNAL關鍵字可讓用戶建立一個外部表，在建表的同時指定一個指向實際數據的路徑（LOCATION），Hive建立內部表時，會將數據移動到數據倉庫指向的路徑；若建立外部表，僅記錄數據所在的路徑，不對數據的位置作任何改變。在刪除表的時候，內部表的元數據和數據會被一塊兒刪除，而外部表只刪除元數據，不刪除數據。

（3）COMMENT：爲表和列添加註釋。

（4）PARTITIONED BY建立分區表

（5）CLUSTERED BY建立分桶表

（6）SORTED BY不經常使用

（7）ROW FORMAT

DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char]

[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]

| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]

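When creating a table, users can supply a custom SerDe or use a built-in one. If ROW FORMAT is not specified, or ROW FORMAT DELIMITED is specified, the built-in SerDe is used. The table's columns are declared in the same statement, and Hive uses the SerDe to determine the data of each column.

SerDe is short for Serialize/Deserialize; it is used for serialization and deserialization.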

（8）STORED AS指定存儲文件類型

經常使用的存儲文件類型：SEQUENCEFILE（二進制序列文件）、TEXTFILE（文本）、RCFILE（列式存儲格式文件）

若是文件數據是純文本，可使用STORED AS TEXTFILE。若是數據須要壓縮，使用 STORED AS SEQUENCEFILE。

（9）LOCATION ：指定表在HDFS上的存儲位置。

（10）LIKE容許用戶複製現有的表結構，可是不復制數據。

4.5.1 管理表

1．理論

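Tables are created as managed tables by default; they are sometimes called internal tables. For such tables Hive controls (more or less) the lifecycle of the data. By default Hive stores the data of these tables in a subdirectory of the directory defined by hive.metastore.warehouse.dir (for example, /user/hive/warehouse). When a managed table is dropped, Hive also deletes the table's data. Managed tables are not suitable for sharing data with other tools.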

2．案例實操

（1）普通建立表

create table if not exists student2(

id int, name string

)

row format delimited fields terminated by '\t'

stored as textfile

location '/user/hive/warehouse/student2';

（2）根據查詢結果建立表（查詢的結果會添加到新建立的表中）

create table if not exists student3 as select id, name from student;

（3）根據已經存在的表結構建立表

create table if not exists student4 like student;

（4）查詢表的類型

hive (default)> desc formatted student2;

Table Type: MANAGED_TABLE

4.5.2 外部表

1．理論

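Because the table is external, Hive does not assume that it fully owns the data. Dropping the table does not delete the data, but the metadata describing the table is removed.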

2．管理表和外部表的使用場景

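Website logs collected every day flow into HDFS text files on a regular schedule. Extensive statistical analysis is done on top of an external table (the raw log table); the intermediate and result tables it produces are stored as internal tables, and data enters them through SELECT + INSERT.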

3．案例實操

分別建立部門和員工外部表，並向表中導入數據。

（1）原始數據

（2）建表語句

建立部門表

create external table if not exists default.dept(

deptno int,

dname string,

loc int

)

row format delimited fields terminated by '\t';

建立員工表

create external table if not exists default.emp(

empno int,

ename string,

job string,

mgr int,

hiredate string,

sal double,

comm double,

deptno int)

row format delimited fields terminated by '\t';

（3）查看建立的表

hive (default)> show tables;

tab_name

dept

emp

（4）向外部表中導入數據

導入數據

hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table default.dept;

hive (default)> load data local inpath '/opt/module/datas/emp.txt' into table default.emp;

查詢結果

hive (default)> select * from emp;

hive (default)> select * from dept;

（5）查看錶格式化數據

hive (default)> desc formatted dept;

Table Type: EXTERNAL_TABLE

4.5.3 管理表與外部表的互相轉換

（1）查詢表的類型

hive (default)> desc formatted student2;

Table Type: MANAGED_TABLE

（2）修改內部表student2爲外部表

alter table student2 set tblproperties('EXTERNAL'='TRUE');

（3）查詢表的類型

hive (default)> desc formatted student2;

Table Type: EXTERNAL_TABLE

（4）修改外部表student2爲內部表

alter table student2 set tblproperties('EXTERNAL'='FALSE');

（5）查詢表的類型

hive (default)> desc formatted student2;

Table Type: MANAGED_TABLE

Note: ('EXTERNAL'='TRUE') and ('EXTERNAL'='FALSE') are fixed forms and are case-sensitive!
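The difference between dropping a managed table and dropping an external table can be modelled in a few lines of Python. ToyWarehouse is a hypothetical class written for this illustration: one dict stands in for the metastore and another for HDFS directories. It captures only the drop semantics described above, not Hive's implementation.

```python
class ToyWarehouse:
    """Toy model: a metastore dict plus a dict standing in for HDFS directories."""

    def __init__(self):
        self.metastore = {}  # table name -> {"external": bool, "location": str}
        self.hdfs = {}       # location -> list of data rows

    def create_table(self, name, location, rows, external=False):
        self.metastore[name] = {"external": external, "location": location}
        self.hdfs[location] = list(rows)

    def set_external(self, name, flag):
        # Mirrors: alter table <name> set tblproperties('EXTERNAL'='TRUE'|'FALSE')
        self.metastore[name]["external"] = flag

    def drop_table(self, name):
        meta = self.metastore.pop(name)      # metadata is always removed
        if not meta["external"]:
            self.hdfs.pop(meta["location"])  # data is removed only for managed tables

wh = ToyWarehouse()
wh.create_table("student2", "/user/hive/warehouse/student2", [(1001, "zhangshan")])
wh.create_table("dept", "/user/hive/warehouse/dept", [(10, "ACCOUNTING")], external=True)

wh.drop_table("student2")  # managed: metadata and data both disappear
wh.drop_table("dept")      # external: only the metadata disappears

print(sorted(wh.hdfs))
```

Calling set_external("student2", True) before the drop would have kept the student2 data as well, which is the practical effect of the conversion shown in this section.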

4.6 分區表

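A partition corresponds to a separate directory on HDFS that holds all the data files of that partition. Partitioning in Hive simply means splitting data into directories, dividing a large data set into smaller ones according to business needs. A query selects only the partitions it needs through expressions in the WHERE clause, which makes it much more efficient.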
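The mapping from partition values to directories, and the effect of a WHERE clause on which directories are read, can be sketched in Python. The key=value directory naming follows the paths used later in this chapter; partition_dir and prune are hypothetical helpers, and real partition pruning happens inside Hive's query planner.

```python
from pathlib import PurePosixPath

WAREHOUSE = PurePosixPath("/user/hive/warehouse")

def partition_dir(table, spec):
    """Build the directory of one partition, one key=value level per partition column."""
    path = WAREHOUSE / table
    for key, value in spec.items():
        path = path / f"{key}={value}"
    return path

def prune(partitions, **wanted):
    """Keep only the partitions whose values match the WHERE-style filter."""
    return [p for p in partitions if all(p.get(k) == v for k, v in wanted.items())]

# Three monthly partitions, as in the dept_partition example.
partitions = [{"month": m} for m in ("201707", "201708", "201709")]

# Equivalent of: select * from dept_partition where month='201709'
selected = prune(partitions, month="201709")
dirs = [str(partition_dir("dept_partition", p)) for p in selected]
print(dirs)
```

Only one of the three directories has to be read for this query, which is where the efficiency gain comes from.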

4.6.1 分區表基本操做

1．引入分區表（須要根據日期對日誌進行管理）

/user/hive/warehouse/log_partition/20170702/20170702.log

/user/hive/warehouse/log_partition/20170703/20170703.log

/user/hive/warehouse/log_partition/20170704/20170704.log

2．建立分區表語法

hive (default)> create table dept_partition(

deptno int, dname string, loc string

)

partitioned by (month string)

row format delimited fields terminated by '\t';

3．加載數據到分區表中

hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201709');

hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201708');

hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201707');

圖6-5 加載數據到分區表

圖6-6 分區表

4．查詢分區表中數據

單分區查詢

hive (default)> select * from dept_partition where month='201709';

多分區聯合查詢

hive (default)> select * from dept_partition where month='201709'

union

select * from dept_partition where month='201708'

union

select * from dept_partition where month='201707';

_u3.deptno _u3.dname _u3.loc _u3.month

10 ACCOUNTING NEW YORK 201707

10 ACCOUNTING NEW YORK 201708

10 ACCOUNTING NEW YORK 201709

20 RESEARCH DALLAS 201707

20 RESEARCH DALLAS 201708

20 RESEARCH DALLAS 201709

30 SALES CHICAGO 201707

30 SALES CHICAGO 201708

30 SALES CHICAGO 201709

40 OPERATIONS BOSTON 201707

40 OPERATIONS BOSTON 201708

40 OPERATIONS BOSTON 201709

5．增長分區

建立單個分區

hive (default)> alter table dept_partition add partition(month='201706') ;

同時建立多個分區

hive (default)> alter table dept_partition add partition(month='201705') partition(month='201704');

6．刪除分區

刪除單個分區

hive (default)> alter table dept_partition drop partition (month='201704');

同時刪除多個分區

hive (default)> alter table dept_partition drop partition (month='201705'), partition (month='201706');

7．查看分區表有多少分區

hive> show partitions dept_partition;

8．查看分區表結構

hive> desc formatted dept_partition;

# Partition Information

# col_name data_type comment

month string

4.6.2 分區表注意事項

1．建立二級分區表

hive (default)> create table dept_partition2(

deptno int, dname string, loc string

)

partitioned by (month string, day string)

row format delimited fields terminated by '\t';

2．正常的加載數據

（1）加載數據到二級分區表中

hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table

default.dept_partition2 partition(month='201709', day='13');

（2）查詢分區數據

hive (default)> select * from dept_partition2 where month='201709' and day='13';

3. Three ways to associate a partitioned table with data that was uploaded directly into a partition directory

（1）方式一：上傳數據後修復

上傳數據

hive (default)> dfs -mkdir -p

/user/hive/warehouse/dept_partition2/month=201709/day=12;

hive (default)> dfs -put /opt/module/datas/dept.txt /user/hive/warehouse/dept_partition2/month=201709/day=12;

查詢數據（查詢不到剛上傳的數據）

hive (default)> select * from dept_partition2 where month='201709' and day='12';

執行修復命令

hive> msck repair table dept_partition2;

再次查詢數據

hive (default)> select * from dept_partition2 where month='201709' and day='12';

（2）方式二：上傳數據後添加分區

上傳數據

hive (default)> dfs -mkdir -p

/user/hive/warehouse/dept_partition2/month=201709/day=11;

hive (default)> dfs -put /opt/module/datas/dept.txt /user/hive/warehouse/dept_partition2/month=201709/day=11;

執行添加分區

hive (default)> alter table dept_partition2 add partition(month='201709',

day='11');

查詢數據

hive (default)> select * from dept_partition2 where month='201709' and day='11';

（3）方式三：建立文件夾後load數據到分區

建立目錄

hive (default)> dfs -mkdir -p

/user/hive/warehouse/dept_partition2/month=201709/day=10;

上傳數據

hive (default)> load data local inpath '/opt/module/datas/dept.txt' into table

dept_partition2 partition(month='201709',day='10');

查詢數據

hive (default)> select * from dept_partition2 where month='201709' and day='10';

4.7 修改表

4.7.1 重命名錶

1．語法

ALTER TABLE table_name RENAME TO new_table_name

2．實操案例

hive (default)> alter table dept_partition2 rename to dept_partition3;

4.7.2 增長、修改和刪除表分區

詳見4.6.1分區表基本操做。

4.7.3 增長/修改/替換列信息

1．語法

更新列

ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]

增長和替換列

ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)

Note: ADD appends a new column after all existing columns (but before the partition columns), whereas REPLACE replaces all columns of the table.

2．實操案例

（1）查詢表結構

hive> desc dept_partition;

（2）添加列

hive (default)> alter table dept_partition add columns(deptdesc string);

（3）查詢表結構

hive> desc dept_partition;

（4）更新列

hive (default)> alter table dept_partition change column deptdesc desc int;

（5）查詢表結構

hive> desc dept_partition;

（6）替換列

hive (default)> alter table dept_partition replace columns(deptno string, dname

string, loc string);

（7）查詢表結構

hive> desc dept_partition;

4.8 刪除表

hive (default)> drop table dept_partition;

第5章 DML數據操做

5.1 數據導入

5.1.1 向表中裝載數據（Load）

1．語法

hive> load data [local] inpath '/opt/module/datas/student.txt' [overwrite] into table student [partition (partcol1=val1,…)];

（1）load data:表示加載數據

（2）local:表示從本地加載數據到hive表；不然從HDFS加載數據到hive表

（3）inpath:表示加載數據的路徑

（4）overwrite:表示覆蓋表中已有數據，不然表示追加

（5）into table:表示加載到哪張表

（6）student:表示具體的表

（7）partition:表示上傳到指定分區

2．實操案例

（0）建立一張表

hive (default)> create table student(id string, name string) row format delimited fields terminated by '\t';

（1）加載本地文件到hive

hive (default)> load data local inpath '/opt/module/datas/student.txt' into table default.student;

（2）加載HDFS文件到hive中

上傳文件到HDFS

hive (default)> dfs -put /opt/module/datas/student.txt /user/atguigu/hive;

加載HDFS上數據

hive (default)> load data inpath '/user/atguigu/hive/student.txt' into table default.student;

（3）加載數據覆蓋表中已有的數據

上傳文件到HDFS

hive (default)> dfs -put /opt/module/datas/student.txt /user/atguigu/hive;

加載數據覆蓋表中已有的數據

hive (default)> load data inpath '/user/atguigu/hive/student.txt' overwrite into table default.student;

5.1.2 經過查詢語句向表中插入數據（Insert）

1．建立一張分區表

hive (default)> create table student(id int, name string) partitioned by (month string) row format delimited fields terminated by '\t';

2．基本插入數據

hive (default)> insert into table student partition(month='201709') values(1,'wangwu');

3．基本模式插入（根據單張表查詢結果）

hive (default)> insert overwrite table student partition(month='201708')

select id, name from student where month='201709';

4. Multi-insert mode (a single scan of the source table feeds several inserts)

hive (default)> from student

insert overwrite table student partition(month='201707')

select id, name where month='201709'

insert overwrite table student partition(month='201706')

select id, name where month='201709';
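The point of the multi-insert form is that the source table is read once while several targets are written. The Python sketch below models that idea under simplifying assumptions: the sample rows are hypothetical, each target starts empty to mimic insert overwrite, and multi_insert is an illustrative helper rather than anything Hive exposes.

```python
# Hypothetical rows of the partitioned student table: (id, name, month).
student = [(1, "wangwu", "201709"), (2, "lisi", "201708")]

def multi_insert(source, branches):
    """Scan source once; each branch is (target_partition, predicate on month).

    Returns the new contents of every target partition and the number of
    rows read from the source.
    """
    targets = {partition: [] for partition, _ in branches}  # overwrite: start empty
    rows_read = 0
    for row_id, name, month in source:
        rows_read += 1
        for partition, predicate in branches:
            if predicate(month):
                targets[partition].append((row_id, name))
    return targets, rows_read

branches = [
    ("201707", lambda month: month == "201709"),
    ("201706", lambda month: month == "201709"),
]
targets, rows_read = multi_insert(student, branches)
print(targets, rows_read)
```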

5.1.3 查詢語句中建立表並加載數據（As Select）

詳見4.5.1章建立表。

根據查詢結果建立表（查詢的結果會添加到新建立的表中）

create table if not exists student3

as select id, name from student;

5.1.4 建立表時經過Location指定加載數據路徑

1．建立表，並指定在hdfs上的位置

hive (default)> create table if not exists student5(

id int, name string

)

row format delimited fields terminated by '\t'

location '/user/hive/warehouse/student5';

2．上傳數據到hdfs上

hive (default)> dfs -put /opt/module/datas/student.txt

/user/hive/warehouse/student5;

3．查詢數據

hive (default)> select * from student5;

5.1.5 Import數據到指定Hive表中

注意：先用export導出後，再將數據導入。

hive (default)> import table student2 partition(month='201709') from

'/user/hive/warehouse/export/student';

5.2 數據導出

5.2.1 Insert導出

1．將查詢的結果導出到本地

hive (default)> insert overwrite local directory '/opt/module/datas/export/student'

select * from student;

2．將查詢的結果格式化導出到本地

hive(default)>insert overwrite local directory '/opt/module/datas/export/student1'

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' select * from student;

3．將查詢的結果導出到HDFS上(沒有local)

hive (default)> insert overwrite directory '/user/atguigu/student2'

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

select * from student;

5.2.2 Hadoop命令導出到本地

hive (default)> dfs -get /user/hive/warehouse/student/month=201709/000000_0

/opt/module/datas/export/student3.txt;

5.2.3 Hive Shell 命令導出

基本語法：（hive -f/-e 執行語句或者腳本 > file）

[atguigu@hadoop102 hive]$ bin/hive -e 'select * from default.student;' > /opt/module/datas/export/student4.txt

5.2.4 Export導出到HDFS上

hive (default)> export table default.student to '/user/hive/warehouse/export/student';

5.2.5 Sqoop導出

後續課程專門講。

5.3 清除表中數據（Truncate）

Note: Truncate can only clear managed tables; it cannot delete the data of external tables.

hive (default)> truncate table student;

Chapter 6 Queries

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select

查詢語句語法：

[WITH CommonTableExpression (, CommonTableExpression)*] (Note: Only available

starting with Hive 0.13.0)

SELECT [ALL | DISTINCT] select_expr, select_expr, ...

FROM table_reference

[WHERE where_condition]

[GROUP BY col_list]

[ORDER BY col_list]

[CLUSTER BY col_list

| [DISTRIBUTE BY col_list] [SORT BY col_list]

]

[LIMIT number]

6.1 基本查詢（Select…From）

6.1.1 全表和特定列查詢

1．全表查詢

hive (default)> select * from emp;

2．選擇特定列查詢

hive (default)> select empno, ename from emp;

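Notes:

(1) SQL is case-insensitive.

(2) SQL can be written on one line or across several lines.

(3) Keywords cannot be abbreviated or split across lines.

(4) Each clause is generally written on its own line.

(5) Use indentation to improve readability.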

6.1.2 列別名

1．重命名一個列

2．便於計算

3．緊跟列名，也能夠在列名和別名之間加入關鍵字'AS'

4．案例實操

查詢名稱和部門

hive (default)> select ename AS name, deptno dn from emp;

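6.1.3 Arithmetic operators

Table 6-3

Operator	Description
A+B	A plus B
A-B	A minus B
A*B	A multiplied by B
A/B	A divided by B
A%B	Remainder of A divided by B
A&B	Bitwise AND of A and B
A|B	Bitwise OR of A and B
A^B	Bitwise XOR of A and B
~A	Bitwise NOT of A

Hands-on example

Display every employee's salary increased by 1.

hive (default)> select sal +1 from emp;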

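6.1.4 Common functions

1. Total number of rows (count)

hive (default)> select count(*) cnt from emp;

2. Maximum salary (max)

hive (default)> select max(sal) max_sal from emp;

3. Minimum salary (min)

hive (default)> select min(sal) min_sal from emp;

4. Sum of salaries (sum)

hive (default)> select sum(sal) sum_sal from emp;

5. Average salary (avg)

hive (default)> select avg(sal) avg_sal from emp;

6.1.5 The Limit clause

A typical query returns many rows. The LIMIT clause restricts the number of rows returned.

hive (default)> select * from emp limit 5;

6.2 The Where clause

1. Use the WHERE clause to filter out rows that do not satisfy a condition

2. The WHERE clause directly follows the FROM clause

3. Hands-on example

Find all employees whose salary is greater than 1000

hive (default)> select * from emp where sal >1000;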

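6.2.1 Comparison operators (Between/In/Is Null)

1) The table below describes the predicate operators. These operators can also be used in JOIN…ON and HAVING clauses.

Table 6-4

Operator	Supported data types	Description
A=B	Primitive types	TRUE if A equals B, otherwise FALSE
A<=>B	Primitive types	TRUE if both A and B are NULL and FALSE if exactly one of them is NULL; for non-NULL operands the result is the same as the equality (=) operator
A<>B, A!=B	Primitive types	NULL if A or B is NULL; TRUE if A is not equal to B, otherwise FALSE
A<B	Primitive types	NULL if A or B is NULL; TRUE if A is less than B, otherwise FALSE
A<=B	Primitive types	NULL if A or B is NULL; TRUE if A is less than or equal to B, otherwise FALSE
A>B	Primitive types	NULL if A or B is NULL; TRUE if A is greater than B, otherwise FALSE
A>=B	Primitive types	NULL if A or B is NULL; TRUE if A is greater than or equal to B, otherwise FALSE
A [NOT] BETWEEN B AND C	Primitive types	NULL if any of A, B, or C is NULL. TRUE if A is greater than or equal to B and less than or equal to C, otherwise FALSE. The NOT keyword inverts the result.
A IS NULL	All data types	TRUE if A is NULL, otherwise FALSE
A IS NOT NULL	All data types	TRUE if A is not NULL, otherwise FALSE
IN(value1, value2)	All data types	TRUE if the value equals any value in the list
A [NOT] LIKE B	STRING	B is a simple SQL pattern (not a full regular expression). TRUE if A matches it, otherwise FALSE. In the pattern, 'x%' means A must start with the letter 'x', '%x' means A must end with 'x', and '%x%' means A contains 'x' at the start, at the end, or anywhere in between. The NOT keyword inverts the result.
A RLIKE B, A REGEXP B	STRING	B is a Java regular expression. TRUE if any substring of A matches B, otherwise FALSE. Matching uses the JDK regular expression implementation, so the pattern follows Java's regex rules. Unlike LIKE, the pattern does not have to cover the whole string: 'foobar' RLIKE 'foo' is TRUE, whereas 'foobar' LIKE 'foo' is FALSE.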

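2) Hands-on examples

(1) Find all employees whose salary equals 5000

hive (default)> select * from emp where sal =5000;

(2) Find employees whose salary is between 500 and 1000

hive (default)> select * from emp where sal between 500 and 1000;

(3) Find all employees whose comm is NULL

hive (default)> select * from emp where comm is null;

(4) Find employees whose salary is 1500 or 5000

hive (default)> select * from emp where sal IN (1500, 5000);

6.2.2 Like and RLike

1) Use the LIKE operator to select values that resemble a pattern

2) The pattern may contain literal characters or digits plus two wildcards:

% stands for zero or more characters (any number of characters).

_ stands for exactly one character.

3) The RLIKE clause is a Hive extension of this feature; it lets the match condition be written as a Java regular expression, which is a more powerful language.

4) Hands-on examples

(1) Find employees whose salary starts with 2

hive (default)> select * from emp where sal LIKE '2%';

(2) Find employees whose salary has 2 as its second digit

hive (default)> select * from emp where sal LIKE '_2%';

(3) Find employees whose salary contains a 2

hive (default)> select * from emp where sal RLIKE '[2]';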
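
The difference between LIKE, RLIKE, and the NULL-safe <=> operator is easiest to see on literals. The statements below are a small sketch, assuming Hive 0.13.0 or newer (where SELECT may omit the FROM clause); the comment above each statement gives the value it is expected to return.

```sql
-- false: LIKE has to match the whole string
select 'foobar' like 'foo';
-- true: the % wildcard covers the rest of the string
select 'foobar' like 'foo%';
-- true: RLIKE succeeds when any substring matches
select 'foobar' rlike 'foo';
-- false: ^ anchors the pattern at the start of the string
select 'foobar' rlike '^bar';
-- true: <=> treats two NULLs as equal
select null <=> null;
-- false: only one side is NULL
select 1 <=> null;
```

This is why the RLIKE '[2]' query above finds every salary that contains a 2, while a LIKE pattern needs explicit % wildcards to do the same.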

6.2.3 Logical operators (And/Or/Not)

Table 6-5

Operator	Meaning
AND	Logical AND
OR	Logical OR
NOT	Logical NOT

Hands-on examples

(1) Salary greater than 1000 and department 30

hive (default)> select * from emp where sal>1000 and deptno=30;

(2) Salary greater than 1000 or department 30

hive (default)> select * from emp where sal>1000 or deptno=30;

(3) Employees in any department other than 20 and 30

hive (default)> select * from emp where deptno not IN(30, 20);

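6.3 Grouping

6.3.1 The Group By clause

GROUP BY is usually used together with aggregate functions: it groups the result by one or more columns and then applies the aggregate to each group.

Hands-on examples:

(1) Average salary of each department in the emp table

hive (default)> select t.deptno, avg(t.sal) avg_sal from emp t group by t.deptno;

(2) Highest salary for each job within each department of emp

hive (default)> select t.deptno, t.job, max(t.sal) max_sal from emp t group by t.deptno, t.job;

6.3.2 The Having clause

1. Differences between having and where

(1) where acts on the columns of the table and selects rows; having acts on the columns of the query result and filters groups.

(2) Aggregate functions cannot be used after where, but they can be used after having.

(3) having is only used in group by statements.

2. Hands-on example

(1) Find the departments whose average salary is greater than 2000

Average salary of each department

hive (default)> select deptno, avg(sal) from emp group by deptno;

Departments whose average salary is greater than 2000

hive (default)> select deptno, avg(sal) avg_sal from emp group by deptno having avg_sal > 2000;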
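
Because where filters rows before grouping and having filters groups afterwards, the two can be combined in one statement. The following is a sketch against the same emp table; the threshold values are arbitrary and chosen only for illustration.

```sql
-- Rows with sal <= 1000 are dropped first, then the remaining rows are grouped,
-- and finally only groups whose average exceeds 2000 are kept.
select deptno, avg(sal) as avg_sal
from emp
where sal > 1000
group by deptno
having avg(sal) > 2000;
```

Moving a row-level condition into where instead of having reduces the amount of data that reaches the grouping stage.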

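6.4 The Join clause

6.4.1 Equi-joins

Hive supports the usual SQL JOIN statements. The Hive version used in this course supports only equi-joins; non-equi join conditions are accepted only by newer releases (Hive 2.2.0 and later).

Hands-on example

(1) Match the employee table and the department table on equal department numbers, and query the employee number, employee name, and department name;

hive (default)> select e.empno, e.ename, d.deptno, d.dname from emp e join dept d on e.deptno = d.deptno;

6.4.2 Table aliases

1. Benefits

(1) Aliases simplify the query.

(2) Prefixing columns with the table name can improve execution efficiency.

2. Hands-on example

Join the employee table and the department table

hive (default)> select e.empno, e.ename, d.deptno from emp e join dept d on e.deptno = d.deptno;

6.4.3 Inner join

Inner join: a row is kept only when both joined tables contain data matching the join condition.

hive (default)> select e.empno, e.ename, d.deptno from emp e join dept d on e.deptno = d.deptno;

6.4.4 Left outer join

Left outer join: every record from the table on the left of the JOIN operator that satisfies the WHERE clause is returned.

hive (default)> select e.empno, e.ename, d.deptno from emp e left join dept d on e.deptno = d.deptno;

6.4.5 Right outer join

Right outer join: every record from the table on the right of the JOIN operator that satisfies the WHERE clause is returned.

hive (default)> select e.empno, e.ename, d.deptno from emp e right join dept d on e.deptno = d.deptno;

6.4.6 Full outer join

Full outer join: every record from both tables that satisfies the WHERE clause is returned. Where one table has no matching value for the join column, NULL is used in its place.

hive (default)> select e.empno, e.ename, d.deptno from emp e full join dept d on e.deptno = d.deptno;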
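
The NULL padding of outer joins can be used to find rows that have no partner in the other table. A minimal sketch, assuming the emp and dept tables from the examples above:

```sql
-- Unmatched emp rows come back with d.deptno set to NULL,
-- so filtering on that NULL keeps only employees without a department row.
select e.empno, e.ename, e.deptno
from emp e
left join dept d on e.deptno = d.deptno
where d.deptno is null;
```

The same query written with an inner join would return nothing, because the inner join discards unmatched rows before the filter is applied.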

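6.4.7 Joining multiple tables

Note: joining n tables requires at least n-1 join conditions. For example, joining three tables requires at least two join conditions.

Data preparation

1. Create the location table

create table if not exists default.location(
loc int,
loc_name string
)
row format delimited fields terminated by '\t';

2. Load the data

hive (default)> load data local inpath '/opt/module/datas/location.txt' into table default.location;

3. Multi-table join query

hive (default)> SELECT e.ename, d.deptno, l.loc_name
FROM emp e
JOIN dept d
ON d.deptno = e.deptno
JOIN location l
ON d.loc = l.loc;

In most cases Hive starts one MapReduce job for each pair of joined tables. In this example it first starts a MapReduce job that joins table e and table d, and then starts a second MapReduce job that joins the output of the first job with table l.

Note: why are tables d and l not joined first? Because Hive always executes joins from left to right.

6.4.8 Cartesian product

1. A Cartesian product is produced under the following conditions

(1) The join condition is omitted

(2) The join condition is invalid

(3) Every row of every table is joined with every other row

2. Hands-on example

hive (default)> select empno, dname from emp, dept;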

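6.4.9 OR is not supported in join predicates

In the Hive version used in this course, the following statement is rejected because the ON clause contains or:

hive (default)> select e.empno, e.ename, d.deptno from emp e join dept d on e.deptno = d.deptno or e.ename = d.dname;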

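6.5 Sorting

6.5.1 Global sort (Order By)

Order By: global sort, performed by a single Reducer

1. Sorting with the ORDER BY clause

ASC (ascend): ascending order (the default)

DESC (descend): descending order

2. The ORDER BY clause goes at the end of the SELECT statement

3. Hands-on examples

(1) List employees by salary in ascending order

hive (default)> select * from emp order by sal;

(2) List employees by salary in descending order

hive (default)> select * from emp order by sal desc;

6.5.2 Sorting by an alias

Sort by twice the employee's salary

hive (default)> select ename, sal*2 twosal from emp order by twosal;

6.5.3 Sorting by multiple columns

Sort by department and salary in ascending order

hive (default)> select ename, deptno, sal from emp order by deptno, sal ;

6.5.4 Sorting within each reducer (Sort By)

Sort By: sorts the rows inside each Reducer; the overall result set is not globally sorted.

1. Set the number of reducers

hive (default)> set mapreduce.job.reduces=3;

2. Check the number of reducers

hive (default)> set mapreduce.job.reduces;

3. View employees sorted by department number in descending order

hive (default)> select * from emp sort by deptno desc;

4. Write the query result to files (sorted by department number in descending order)

hive (default)> insert overwrite local directory '/opt/module/datas/sortby-result'
select * from emp sort by deptno desc;

6.5.5 Partitioned sort (Distribute By)

Distribute By: similar to the partitioner in MapReduce; it decides which reducer each row goes to and is used together with sort by.

Note that Hive requires the DISTRIBUTE BY clause to be written before the SORT BY clause.

When testing distribute by, be sure to allocate more than one reducer, otherwise its effect cannot be observed.

Hands-on example:

(1) Partition by department number first, then sort by employee number in descending order.

hive (default)> set mapreduce.job.reduces=3;

hive (default)> insert overwrite local directory '/opt/module/datas/distribute-result' select * from emp distribute by deptno sort by empno desc;

6.5.6 Cluster By

When distribute by and sort by use the same column, cluster by can be used instead.

cluster by combines the function of distribute by with that of sort by. However, the sort is always ascending; ASC or DESC cannot be specified.

1) The following two statements are equivalent

hive (default)> select * from emp cluster by deptno;

hive (default)> select * from emp distribute by deptno sort by deptno;

Note: partitioning by department number does not mean one department per partition; departments 20 and 30 may well end up in the same partition.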
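
Since cluster by cannot sort in descending order, the longer distribute by plus sort by form is still needed when a direction has to be given. A short sketch, assuming the emp table and three reducers as in the earlier examples:

```sql
set mapreduce.job.reduces=3;

-- Same distribution as "cluster by deptno", but each reducer's output
-- is sorted by deptno in descending order.
select * from emp distribute by deptno sort by deptno desc;
```

As with sort by in general, the ordering holds within each reducer's output only, not across the whole result.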

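6.6 Bucketing and sampling queries

6.6.1 Storing data in bucketed tables

Partitioning applies to the storage path of the data; bucketing applies to the data files.

Partitioning offers a convenient way to isolate data and optimize queries. However, not every data set can be partitioned sensibly, especially given the earlier concern about choosing an appropriate partition size.

Bucketing is another technique for breaking a data set into more manageable parts.

1. Create the bucketed table first and load a data file into it directly

(1) Data preparation

(2) Create the bucketed table

create table stu_buck(id int, name string)
clustered by(id)
into 4 buckets
row format delimited fields terminated by '\t';

(3) Inspect the table structure

hive (default)> desc formatted stu_buck;

Num Buckets: 4

(4) Load data into the bucketed table

hive (default)> load data local inpath '/opt/module/datas/student.txt' into table stu_buck;

(5) Check whether the table was split into 4 buckets, as shown in Figure 6-7

Figure 6-7 Not bucketed

The table was not split into 4 buckets. Why? load data only copies the file into the table directory; it does not hash the rows into bucket files.

2. Create the bucketed table and load the data through a subquery

(1) Create an ordinary stu table first

create table stu(id int, name string)
row format delimited fields terminated by '\t';

(2) Load data into the ordinary stu table

load data local inpath '/opt/module/datas/student.txt' into table stu;

(3) Clear the data in the stu_buck table

truncate table stu_buck;

select * from stu_buck;

(4) Load data into the bucketed table through a subquery

insert into table stu_buck
select id, name from stu;

(5) There is still only one bucket, as shown in Figure 6-8

Figure 6-8 Not bucketed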

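(6) A property has to be set (hive.enforce.bucketing applies to Hive 1.x; from Hive 2.0 onward bucketing is always enforced and the property no longer exists)

hive (default)> set hive.enforce.bucketing=true;

hive (default)> set mapreduce.job.reduces=-1;

hive (default)> insert into table stu_buck
select id, name from stu;

Figure 6-9 Bucketed

(7) Query the bucketed data

hive (default)> select * from stu_buck;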

stu_buck.id stu_buck.name

1004 ss4

1008 ss8

1012 ss12

1016 ss16

1001 ss1

1005 ss5

1009 ss9

1013 ss13

1002 ss2

1006 ss6

1010 ss10

1014 ss14

1003 ss3

1007 ss7

1011 ss11

1015 ss15
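
The grouping in the output above (1004, 1008, ... first, then 1001, 1005, ...) follows from how rows are assigned to buckets: the hash of the bucketing column modulo the number of buckets. The query below is a sketch that recomputes this assignment on the unbucketed stu table; it assumes the legacy bucketing hash used before Hive 3, where the hash of an int is the value itself, and bucket_no is a made-up alias.

```sql
-- hash() and pmod() are built-in Hive functions.
-- With 4 buckets, id 1004 maps to bucket 0, 1001 to bucket 1, and so on.
select id, name, pmod(hash(id), 4) as bucket_no
from stu
order by bucket_no, id;
```

Rows with the same bucket_no are the ones that end up in the same bucket file of stu_buck.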

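6.6.2 Bucket sampling queries

With very large data sets, users sometimes need a representative sample of the result rather than the whole result. Hive can meet this need by sampling a table.

Query the data in table stu_buck.

hive (default)> select * from stu_buck tablesample(bucket 1 out of 4 on id);

Note: tablesample is the sampling clause. Syntax: TABLESAMPLE(BUCKET x OUT OF y).

y must be a multiple or a factor of the table's total number of buckets. Hive decides the sampling ratio from y. For example, if the table has 4 buckets, y=2 samples (4/2=) 2 buckets of data, and y=8 samples (4/8=) 1/2 of one bucket.

x is the bucket to start from. When more than one bucket is taken, each following bucket number is the current one plus y. For example, with 4 buckets in total, tablesample(bucket 1 out of 2) samples (4/2=) 2 buckets of data: bucket 1 (x) and bucket 3 (x+y).

Note: x must be less than or equal to y, otherwise the query fails with: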

FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table stu_buck
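
The rules for x and y can be tried out directly on stu_buck. The statements below are a sketch based on the description above, assuming the table still has 4 buckets clustered by id; the comments state which buckets that description predicts.

```sql
-- y = 4: 4/4 = 1 bucket, namely bucket 1.
select * from stu_buck tablesample(bucket 1 out of 4 on id);

-- y = 2: 4/2 = 2 buckets, namely bucket 1 and bucket 3 (1 + 2).
select * from stu_buck tablesample(bucket 1 out of 2 on id);

-- y = 8: 4/8 = half of one bucket, taken from bucket 1.
select * from stu_buck tablesample(bucket 1 out of 8 on id);
```

Sampling on the bucketing column lets Hive read only the selected bucket files instead of scanning the whole table.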

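6.7 Other common query functions

6.7.1 Assigning a value to NULL fields

Function description

NVL: assigns a value to data that is NULL. Its format is NVL(string1, replace_with). If string1 is NULL, NVL returns replace_with; otherwise it returns string1. If both arguments are NULL, it returns NULL.

Data preparation: use the employee table
Query: if an employee's comm is NULL, use -1 instead

hive (default)> select nvl(comm,-1) from emp;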

_c0

20.0

300.0

500.0

-1.0

1400.0

-1.0

0.0

-1.0

Query: if an employee's comm is NULL, use the manager's id (mgr) instead

hive (default)> select nvl(comm,mgr) from emp;

_c0

20.0

300.0

500.0

7839.0

1400.0

7839.0

7566.0

NULL

0.0

7788.0

7698.0

7566.0
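
The second result still contains a NULL, because that employee has neither comm nor mgr. Hive's built-in COALESCE accepts any number of arguments and returns the first one that is not NULL, so a final fallback can be added. A sketch, using the same emp table:

```sql
-- Equivalent to nvl(comm, -1).
select coalesce(comm, -1) from emp;

-- Falls back to mgr, and to -1 when mgr is NULL as well.
select coalesce(comm, mgr, -1) from emp;
```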

6.7.2 CASE WHEN

1. Data preparation

name	dept_id	sex
悟空	A	男
大海	A	男
宋宋	B	男
鳳姐	A	女
婷姐	B	女
婷婷	B	女

2. Requirement

Count how many men and women there are in each department. The expected result is:

A 2 1

B 1 2

3. Create a local file emp_sex.txt with the data (fields separated by tabs)

[atguigu@hadoop102 datas]$ vi emp_sex.txt

悟空	A	男
大海	A	男
宋宋	B	男
鳳姐	A	女
婷姐	B	女
婷婷	B	女

4. Create the Hive table and load the data

create table emp_sex(
name string,
dept_id string,
sex string)
row format delimited fields terminated by "\t";

load data local inpath '/opt/module/datas/emp_sex.txt' into table emp_sex;

5. Query the data as required

select
dept_id,
sum(case sex when '男' then 1 else 0 end) male_count,
sum(case sex when '女' then 1 else 0 end) female_count
from
emp_sex
group by
dept_id;
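
When each CASE has only one branch, Hive's built-in if(condition, value_if_true, value_if_false) function is a shorter way to write the same conditional count. A sketch against the emp_sex table created above:

```sql
select
    dept_id,
    sum(if(sex = '男', 1, 0)) as male_count,
    sum(if(sex = '女', 1, 0)) as female_count
from emp_sex
group by dept_id;
```

CASE WHEN remains the better fit once there are three or more branches.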

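6.7.2 Rows to columns

1. Function descriptions

CONCAT(string A/col, string B/col…): returns the concatenation of the input strings; any number of input strings is supported;

CONCAT_WS(separator, str1, str2,...): a special form of CONCAT(). The first argument is the separator placed between the remaining arguments. The separator can be a string just like the other arguments. If the separator is NULL, the return value is also NULL. The function skips any NULL arguments after the separator. The separator is inserted between the strings being joined;

COLLECT_SET(col): accepts only primitive data types. It aggregates the values of a column with duplicates removed and produces a field of type array.

2. Data preparation

Table 6-6 Data preparation

name	constellation	blood_type
孫悟空	白羊座	A
大海	射手座	A
宋宋	白羊座	B
豬八戒	白羊座	A
鳳姐	射手座	A

3. Requirement

Group together the people who share the same constellation and blood type. The expected result is:

射手座,A 大海|鳳姐

白羊座,A 孫悟空|豬八戒

白羊座,B 宋宋

4. Create a local file constellation.txt with the data (fields separated by tabs)

[atguigu@hadoop102 datas]$ vi constellation.txt

孫悟空	白羊座	A
大海	射手座	A
宋宋	白羊座	B
豬八戒	白羊座	A
鳳姐	射手座	A

5. Create the Hive table and load the data

create table person_info(
name string,
constellation string,
blood_type string)
row format delimited fields terminated by "\t";

load data local inpath "/opt/module/datas/constellation.txt" into table person_info;

6. Query the data as required

select
t1.base,
concat_ws('|', collect_set(t1.name)) name
from
(select
name,
concat(constellation, ",", blood_type) base
from
person_info) t1
group by
t1.base;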

6.7.3 Columns to rows

1. Function descriptions

EXPLODE(col): splits a complex array or map value in one Hive column into multiple rows.

LATERAL VIEW

Usage: LATERAL VIEW udtf(expression) tableAlias AS columnAlias

Explanation: used together with UDTFs such as explode (often combined with split). It expands one column into multiple rows, and the expanded rows can then be aggregated.

2. Data preparation

Table 6-7 Data preparation

movie	category
《疑犯追蹤》	懸疑,動做,科幻,劇情
《Lie to me》	懸疑,警匪,動做,心理,劇情
《戰狼2》	戰爭,動做,災難

3. Requirement

Expand the array of categories for each movie. The expected result is:

《疑犯追蹤》	懸疑
《疑犯追蹤》	動做
《疑犯追蹤》	科幻
《疑犯追蹤》	劇情
《Lie to me》	懸疑
《Lie to me》	警匪
《Lie to me》	動做
《Lie to me》	心理
《Lie to me》	劇情
《戰狼2》	戰爭
《戰狼2》	動做
《戰狼2》	災難

4. Create a local file movie.txt with the data (movie and category list separated by a tab)

[atguigu@hadoop102 datas]$ vi movie.txt

《疑犯追蹤》	懸疑,動做,科幻,劇情
《Lie to me》	懸疑,警匪,動做,心理,劇情
《戰狼2》	戰爭,動做,災難

5. Create the Hive table and load the data

create table movie_info(
movie string,
category array<string>)
row format delimited fields terminated by "\t"
collection items terminated by ",";

load data local inpath "/opt/module/datas/movie.txt" into table movie_info;

6. Query the data as required

select
movie,
category_name
from
movie_info lateral view explode(category) table_tmp as category_name;
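
The function description mentions that the expanded rows can be aggregated. A minimal sketch of that step, assuming the movie_info table created above; movie_count is a made-up alias.

```sql
-- Each (movie, category) pair becomes one row, which can then be grouped.
select category_name, count(*) as movie_count
from movie_info
lateral view explode(category) table_tmp as category_name
group by category_name;
```

With the sample data, every category that appears in more than one movie is counted once per movie.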

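6.7.4 Window functions

1. Function descriptions

OVER(): specifies the window of data that the analytic function operates on; the window may change from row to row

CURRENT ROW: the current row

n PRECEDING: the n rows before the current row

n FOLLOWING: the n rows after the current row

UNBOUNDED: the boundary of the partition; UNBOUNDED PRECEDING means starting from the first row, UNBOUNDED FOLLOWING means up to the last row

LAG(col,n): the value of col from the n-th row before the current row

LEAD(col,n): the value of col from the n-th row after the current row

NTILE(n): distributes the rows of an ordered partition into the given number of groups. The groups are numbered starting from 1, and for each row NTILE returns the number of the group the row belongs to. Note: n must be of type int.

2. Data preparation: name, orderdate, cost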

jack,2017-01-01,10

tony,2017-01-02,15

jack,2017-02-03,23

tony,2017-01-04,29

jack,2017-01-05,46

jack,2017-04-06,42

tony,2017-01-07,50

jack,2017-01-08,55

mart,2017-04-08,62

mart,2017-04-09,68

neil,2017-05-10,12

mart,2017-04-11,75

neil,2017-06-12,80

mart,2017-04-13,94

3. Requirements

(1) Find the customers who made a purchase in April 2017 and the total number of such customers
(2) Show each customer's purchase details together with the monthly purchase total
(3) In the scenario above, accumulate cost by date
(4) Show each customer's previous purchase time
(5) Show the orders in the first 20% by time

4. Create a local file business.txt with the data

[atguigu@hadoop102 datas]$ vi business.txt

5. Create the Hive table and load the data

create table business(
name string,
orderdate string,
cost int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

load data local inpath "/opt/module/datas/business.txt" into table business;

6. Query the data as required

Customers who made a purchase in April 2017 and the total number of such customers

select name,count(*) over ()
from business
where substring(orderdate,1,7) = '2017-04'
group by name;

Each customer's purchase details together with the monthly purchase total

select name,orderdate,cost,sum(cost) over(partition by month(orderdate)) from
business;

In the scenario above, accumulate cost by date

select name,orderdate,cost,
sum(cost) over() as sample1, -- sum over all rows
sum(cost) over(partition by name) as sample2, -- group by name, sum within the group
sum(cost) over(partition by name order by orderdate) as sample3, -- group by name, running total within the group
sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as sample4 , -- same as sample3: aggregate from the start to the current row
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5, -- current row plus the previous row
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING ) as sample6, -- current row plus the previous and the next row
sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7 -- current row plus all following rows
from business;

Each customer's previous purchase time

select name,orderdate,cost,
lag(orderdate,1,'1900-01-01') over(partition by name order by orderdate ) as time1, lag(orderdate,2) over (partition by name order by orderdate) as time2
from business;
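
LEAD is described in the function list but not used in the queries. It mirrors LAG by looking forward instead of backward. A sketch on the same business table; the default value '9999-12-31' and the alias next_time are arbitrary choices for illustration.

```sql
-- For each purchase, the date of the same customer's next purchase;
-- the last purchase of each customer gets the default value.
select name, orderdate, cost,
       lead(orderdate, 1, '9999-12-31') over (partition by name order by orderdate) as next_time
from business;
```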

Orders in the first 20% by time

select * from (
select name,orderdate,cost, ntile(5) over(order by orderdate) sorted
from business
) t
where sorted = 1;

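6.7.5 Rank

1. Function descriptions

RANK(): tied rows share the same rank and the following ranks are skipped, so the last rank still equals the number of rows

DENSE_RANK(): tied rows share the same rank and no ranks are skipped, so the highest rank can be smaller than the number of rows

ROW_NUMBER(): numbers the rows consecutively in sort order

2. Data preparation

Table 6-7 Data preparation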

name	subject	score
孫悟空	語文	87
孫悟空	數學	95
孫悟空	英語	68
大海	語文	94
大海	數學	56
大海	英語	84
宋宋	語文	64
宋宋	數學	86
宋宋	英語	84
婷婷	語文	65
婷婷	數學	85
婷婷	英語	78

3. Requirement

Rank the scores within each subject.

4. Create a local file score.txt with the data

[atguigu@hadoop102 datas]$ vi score.txt

5. Create the Hive table and load the data

create table score(
name string,
subject string,
score int)
row format delimited fields terminated by "\t";

load data local inpath '/opt/module/datas/score.txt' into table score;

6. Query the data as required

select name,
subject,
score,
rank() over(partition by subject order by score desc) rp,
dense_rank() over(partition by subject order by score desc) drp,
row_number() over(partition by subject order by score desc) rmp
from score;

name	subject	score	rp	drp	rmp
孫悟空	數學	95	1	1	1
宋宋	數學	86	2	2	2
婷婷	數學	85	3	3	3
大海	數學	56	4	4	4
宋宋	英語	84	1	1	1
大海	英語	84	1	1	2
婷婷	英語	78	3	2	3
孫悟空	英語	68	4	3	4
大海	語文	94	1	1	1
孫悟空	語文	87	2	2	2
婷婷	語文	65	3	3	3
宋宋	語文	64	4	4	4
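
A common use of these ranking functions is picking the top N rows per group. Window functions cannot be referenced directly in WHERE, so the ranking is computed in a subquery first. A sketch on the score table above; the alias rn and the cutoff of 2 are illustrative.

```sql
-- Top two scores per subject. row_number() breaks ties arbitrarily;
-- swap in rank() or dense_rank() to keep all tied rows.
select name, subject, score
from (
    select name, subject, score,
           row_number() over (partition by subject order by score desc) as rn
    from score
) t
where rn <= 2;
```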

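Chapter 7 Functions

7.1 Built-in functions

1. List the built-in functions

hive> show functions;

2. Show the usage of a built-in function

hive> desc function upper;

3. Show the detailed usage of a built-in function

hive> desc function extended upper;

7.2 User-defined functions

1) Hive ships with a number of functions, such as max/min, but their number is limited; it can be extended conveniently with user-defined functions.

2) When the built-in functions cannot meet your business needs, consider writing a user-defined function (UDF: user-defined function).

3) User-defined functions fall into three categories:

(1) UDF (User-Defined-Function)

One row in, one row out

(2) UDAF (User-Defined Aggregation Function)

Aggregate function, many rows in, one row out

Similar to: count/max/min

(3) UDTF (User-Defined Table-Generating Functions)

One row in, many rows out

For example lateral view explode()

4) Official documentation

https://cwiki.apache.org/confluence/display/Hive/HivePlugins

5) Programming steps:

(1) Extend org.apache.hadoop.hive.ql.exec.UDF

(2) Implement the evaluate function; evaluate can be overloaded;

(3) Create the function in the Hive command-line window

a) Add the jar

add jar linux_jar_path

b) Create the function

create [temporary] function [dbname.]function_name AS class_name;

(4) Drop the function in the Hive command-line window

Drop [temporary] function [if exists] [dbname.]function_name;

6) Notes

(1) A UDF must have a return type. It may return null, but the return type cannot be void;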

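7.3 Writing a custom UDF

1. Create a Maven project named Hive

2. Add the dependency (the artifactId and version lines are missing from the source; the class imported below ships in the hive-exec artifact)

<dependencies>
<dependency>
<groupId>org.apache.hive</groupId>
</dependency>
</dependencies>

3. Create a class

package com.atguigu.hive;

import org.apache.hadoop.hive.ql.exec.UDF;

public class Lower extends UDF {

    // Returns the input in lower case, or null when the input is null.
    public String evaluate(final String s) {
        if (s == null) {
            return null;
        }
        return s.toLowerCase();
    }
}

4. Package it as a jar and upload it to the server as /opt/module/jars/udf.jar

5. Add the jar to Hive's classpath

hive (default)> add jar /opt/module/jars/udf.jar;

6. Create a temporary function bound to the compiled Java class

hive (default)> create temporary function mylower as "com.atguigu.hive.Lower";

7. The custom function mylower can now be used in HQL

hive (default)> select ename, mylower(ename) lowername from emp;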

Chapter 8 Compression and storage

8.1 Compiling the Hadoop source with Snappy support

8.1.1 Preparing resources

1. Connect CentOS to the network

Configure CentOS so that it can reach the Internet. The Linux virtual machine should be able to ping www.baidu.com.

Note: compile as the root user to reduce problems with directory permissions

2. Prepare the packages (Hadoop source, JDK 8, Snappy, Maven, protobuf)

(1) hadoop-2.7.2-src.tar.gz

(2) jdk-8u144-linux-x64.tar.gz

(3) snappy-1.1.3.tar.gz

(4) apache-maven-3.0.5-bin.tar.gz

(5) protobuf-2.5.0.tar.gz

8.1.2 Installing the packages

Note: all operations must be carried out as the root user

1. Unpack the JDK, configure the JAVA_HOME and PATH environment variables, and verify with java -version (every installation below should likewise be verified)

[root@hadoop101 software]# tar -zxf jdk-8u144-linux-x64.tar.gz -C /opt/module/

[root@hadoop101 software]# vi /etc/profile

#JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_144
export PATH=$PATH:$JAVA_HOME/bin

[root@hadoop101 software]# source /etc/profile

Verification command: java -version

2. Unpack Maven and configure MAVEN_HOME and PATH

[root@hadoop101 software]# tar -zxvf apache-maven-3.0.5-bin.tar.gz -C /opt/module/

[root@hadoop101 apache-maven-3.0.5]# vi /etc/profile

#MAVEN_HOME
export MAVEN_HOME=/opt/module/apache-maven-3.0.5
export PATH=$PATH:$MAVEN_HOME/bin

[root@hadoop101 software]# source /etc/profile

Verification command: mvn -version

8.1.3 Compiling the source

1. Prepare the build environment

[root@hadoop101 software]# yum install svn

[root@hadoop101 software]# yum install autoconf automake libtool cmake

[root@hadoop101 software]# yum install ncurses-devel

[root@hadoop101 software]# yum install openssl-devel

[root@hadoop101 software]# yum install gcc*

2. Build and install snappy

[root@hadoop101 software]# tar -zxvf snappy-1.1.3.tar.gz -C /opt/module/

[root@hadoop101 module]# cd snappy-1.1.3/

[root@hadoop101 snappy-1.1.3]# ./configure

[root@hadoop101 snappy-1.1.3]# make

[root@hadoop101 snappy-1.1.3]# make install

# Check the snappy library files

[root@hadoop101 snappy-1.1.3]# ls -lh /usr/local/lib |grep snappy

3. Build and install protobuf

[root@hadoop101 software]# tar -zxvf protobuf-2.5.0.tar.gz -C /opt/module/

[root@hadoop101 module]# cd protobuf-2.5.0/

[root@hadoop101 protobuf-2.5.0]# ./configure

[root@hadoop101 protobuf-2.5.0]# make

[root@hadoop101 protobuf-2.5.0]# make install

# Check the protobuf version to confirm the installation succeeded
[root@hadoop101 protobuf-2.5.0]# protoc --version

4. Build hadoop native

[root@hadoop101 software]# tar -zxvf hadoop-2.7.2-src.tar.gz

[root@hadoop101 software]# cd hadoop-2.7.2-src/

[root@hadoop101 hadoop-2.7.2-src]# mvn clean package -DskipTests -Pdist,native -Dtar -Dsnappy.lib=/usr/local/lib -Dbundle.snappy

After a successful build, /opt/software/hadoop-2.7.2-src/hadoop-dist/target/hadoop-2.7.2.tar.gz is the newly generated binary distribution with Snappy support.

8.2 Hadoop compression configuration

8.2.1 Compression codecs supported by MR

Table 6-8

Compression format	Tool	Algorithm	File extension	Splittable
DEFLATE	None	DEFLATE	.deflate	No
Gzip	gzip	DEFLATE	.gz	No
bzip2	bzip2	bzip2	.bz2	Yes
LZO	lzop	LZO	.lzo	Yes (when indexed)
Snappy	None	Snappy	.snappy	No

To support multiple compression/decompression algorithms, Hadoop introduces codecs, as listed in the following table:

Table 6-9

Compression format	Codec
DEFLATE	org.apache.hadoop.io.compress.DefaultCodec
gzip	org.apache.hadoop.io.compress.GzipCodec
bzip2	org.apache.hadoop.io.compress.BZip2Codec
LZO	com.hadoop.compression.lzo.LzopCodec
Snappy	org.apache.hadoop.io.compress.SnappyCodec

Compression performance comparison:

Table 6-10

Algorithm	Original file size	Compressed file size	Compression speed	Decompression speed
gzip	8.3GB	1.8GB	17.5MB/s	58MB/s
bzip2	8.3GB	1.1GB	2.4MB/s	9.5MB/s
LZO	8.3GB	2.9GB	49.3MB/s	74.6MB/s

http://google.github.io/snappy/

On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more.

8.2.2 Compression parameters

To enable compression in Hadoop, the following parameters can be configured (in mapred-site.xml):

Table 6-11

Parameter	Default	Stage	Recommendation
io.compression.codecs (configured in core-site.xml)	org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec, org.apache.hadoop.io.compress.Lz4Codec	Input compression	Hadoop uses the file extension to decide whether a codec is supported
mapreduce.map.output.compress	false	Mapper output	Set to true to enable compression
mapreduce.map.output.compress.codec	org.apache.hadoop.io.compress.DefaultCodec	Mapper output	Use the LZO, LZ4, or Snappy codec to compress data at this stage
mapreduce.output.fileoutputformat.compress	false	Reducer output	Set to true to enable compression
mapreduce.output.fileoutputformat.compress.codec	org.apache.hadoop.io.compress.DefaultCodec	Reducer output	Use a standard tool or codec such as gzip or bzip2
mapreduce.output.fileoutputformat.compress.type	RECORD	Reducer output	Compression type used for SequenceFile output: NONE, RECORD, or BLOCK

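8.3 Enabling compression of map output

Compressing the map output reduces the amount of data transferred between the map and reduce tasks of a job. The configuration is as follows:

Hands-on example:

1. Enable compression of Hive's intermediate data

hive (default)> set hive.exec.compress.intermediate=true;

2. Enable compression of map output in MapReduce

hive (default)> set mapreduce.map.output.compress=true;

3. Set the compression codec for map output in MapReduce

hive (default)> set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

4. Run a query

hive (default)> select count(ename) name from emp;

8.4 Enabling compression of reduce output

When Hive writes output into a table, the output can be compressed as well. The property hive.exec.compress.output controls this feature. Users may want to keep the default value false in the default configuration file, so that output is uncompressed plain text by default, and set the value to true in a query or script to enable output compression.

Hands-on example:

1. Enable compression of Hive's final output

hive (default)> set hive.exec.compress.output=true;

2. Enable compression of the final MapReduce output

hive (default)> set mapreduce.output.fileoutputformat.compress=true;

3. Set the compression codec for the final MapReduce output

hive (default)> set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

4. Set the final MapReduce output to block compression

hive (default)> set mapreduce.output.fileoutputformat.compress.type=BLOCK;

5. Check whether the output is a compressed file

hive (default)> insert overwrite local directory '/opt/module/datas/distribute-result' select * from emp distribute by deptno sort by empno desc;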

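8.5 File storage formats

The main storage formats supported by Hive are TEXTFILE, SEQUENCEFILE, ORC, and PARQUET.

8.5.1 Column storage and row storage

Figure 6-10 Column storage and row storage

In Figure 6-10 the logical table is on the left; the first layout on the right is row storage and the second is column storage.

1. Characteristics of row storage

When looking up an entire row that satisfies a condition, column storage has to visit each column's cluster to find that row's value for every column, whereas row storage only has to locate one value and the rest are stored next to it, so row storage is faster for this kind of query.

2. Characteristics of column storage

Because the data of each column is stored together, a query that needs only a few columns reads far less data. All values in a column share the same data type, so column storage can apply compression algorithms tailored to that type.

TEXTFILE and SEQUENCEFILE are row-based storage formats;

ORC and PARQUET are column-based storage formats.

8.5.2 TextFile format

The default format. Data is not compressed, so disk overhead and parsing overhead are both high. It can be combined with Gzip or Bzip2, but with Gzip Hive cannot split the data and therefore cannot process it in parallel.

8.5.3 ORC format

ORC (Optimized Row Columnar) is a storage format introduced in Hive 0.11.

As Figure 6-11 shows, each ORC file consists of one or more stripes of about 250 MB each (the default depends on the Hive version; Table 6-12 below lists an orc.stripe.size default of 64 MB). A stripe is roughly equivalent to a RowGroup, but its size grows from 4 MB to 250 MB, which should improve sequential read throughput. Each stripe has three parts: Index Data, Row Data, and Stripe Footer:

Figure 6-11 ORC format

1) Index Data: a lightweight index, by default with one entry every 10,000 rows. The index probably only records the offset of each field of a row within Row Data.

2) Row Data: holds the actual data. A batch of rows is taken first and then stored column by column. Each column is encoded and split into multiple Streams for storage.

3) Stripe Footer: holds the type, length, and other information of each Stream.

Each file has a File Footer, which stores the number of rows in each stripe, the data type of each column, and so on. At the very end of each file is a PostScript, which records the compression type of the whole file and the length of the File Footer. When reading a file, the reader seeks to the end of the file to read the PostScript, parses the File Footer length from it, then reads the File Footer, parses the information about each stripe from it, and finally reads the stripes; in other words, the file is read from back to front.

8.5.4 Parquet format

Parquet is a columnar storage format aimed at analytical workloads. It was developed jointly by Twitter and Cloudera and graduated from the Apache Incubator to become an Apache top-level project in May 2015.

Parquet files are stored in binary form, so they cannot be read directly. A file contains both its data and its metadata, so the Parquet format is self-describing.

Typically, when storing Parquet data, the row group size is set according to the block size. Since a block is generally the smallest unit of data processed by a Mapper task, each row group can then be handled by one Mapper task, which increases the parallelism of the job. The Parquet file format is shown in Figure 6-12.

Figure 6-12 Parquet format

The figure shows the contents of a Parquet file. One file can store multiple row groups. The file begins and ends with the Magic Code, which is used to verify that it is a Parquet file. Footer length records the size of the file metadata; from this value and the file length the offset of the metadata can be computed. The file metadata contains the metadata of every row group and the schema of the data stored in the file. In addition to the row group metadata, each page starts with its own metadata. Parquet has three types of pages: data pages, dictionary pages, and index pages. A data page stores the values of a column in the current row group; a dictionary page stores the encoding dictionary of the column values, and each column chunk contains at most one dictionary page; an index page is meant to store the index of a column in the current row group, but index pages are not yet supported in Parquet.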

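8.5.5 Comparing the mainstream storage formats

The formats are compared from two angles: compression ratio of the stored files and query speed.

Compression ratio test:

1. Test data

2. TextFile

(1) Create a table stored as TEXTFILE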

create table log_text (

track_time string,

url string,

session_id string,

referer string,

ip string,

end_user_id string,

city_id string

)

row format delimited fields terminated by '\t'

stored as textfile ;

(2) Load data into the table

hive (default)> load data local inpath '/opt/module/datas/log.data' into table log_text ;

(3) Check the size of the table data

hive (default)> dfs -du -h /user/hive/warehouse/log_text;

18.1 M /user/hive/warehouse/log_text/log.data

3. ORC

(1) Create a table stored as ORC

create table log_orc(

track_time string,

url string,

session_id string,

referer string,

ip string,

end_user_id string,

city_id string

)

row format delimited fields terminated by '\t'

stored as orc ;

(2) Load data into the table

hive (default)> insert into table log_orc select * from log_text ;

(3) Check the size of the table data

hive (default)> dfs -du -h /user/hive/warehouse/log_orc/ ;

2.8 M /user/hive/warehouse/log_orc/000000_0

4. Parquet

(1) Create a table stored as Parquet

create table log_parquet(

track_time string,

url string,

session_id string,

referer string,

ip string,

end_user_id string,

city_id string

)

row format delimited fields terminated by '\t'

stored as parquet ;

(2) Load data into the table

hive (default)> insert into table log_parquet select * from log_text ;

(3) Check the size of the table data

hive (default)> dfs -du -h /user/hive/warehouse/log_parquet/ ;

13.1 M /user/hive/warehouse/log_parquet/000000_0

Compression ratio summary:

ORC > Parquet > TextFile

Query speed test:

1. TextFile

hive (default)> select count(*) from log_text;

_c0

100000

Time taken: 21.54 seconds, Fetched: 1 row(s)

Time taken: 21.08 seconds, Fetched: 1 row(s)

Time taken: 19.298 seconds, Fetched: 1 row(s)

2．ORC

hive (default)> select count(*) from log_orc;

_c0

100000

Time taken: 20.867 seconds, Fetched: 1 row(s)

Time taken: 22.667 seconds, Fetched: 1 row(s)

Time taken: 18.36 seconds, Fetched: 1 row(s)

3．Parquet

hive (default)> select count(*) from log_parquet;

_c0

100000

Time taken: 22.922 seconds, Fetched: 1 row(s)

Time taken: 21.074 seconds, Fetched: 1 row(s)

Time taken: 18.384 seconds, Fetched: 1 row(s)

Query speed summary: the three formats perform about the same on this query.

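8.6 Combining storage formats and compression

8.6.1 Adding Snappy support to the Hadoop cluster

1. Look up the usage of the hadoop checknative command

[atguigu@hadoop104 hadoop-2.7.2]$ hadoop

checknative [-a|-h] check native hadoop and compression libraries availability

2. Check which compression libraries Hadoop supports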

[atguigu@hadoop104 hadoop-2.7.2]$ hadoop checknative

17/12/24 20:32:52 WARN bzip2.Bzip2Factory: Failed to load/initialize native-bzip2 library system-native, will use pure-Java version

17/12/24 20:32:52 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library

Native library checking:

hadoop: true /opt/module/hadoop-2.7.2/lib/native/libhadoop.so

zlib: true /lib64/libz.so.1

snappy: false

lz4: true revision:99

bzip2: false

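3. Copy the compiled hadoop-2.7.2.tar.gz package with Snappy support into /opt/software on hadoop102

4. Unpack hadoop-2.7.2.tar.gz in the current directory

[atguigu@hadoop102 software]$ tar -zxvf hadoop-2.7.2.tar.gz

5. Go to /opt/software/hadoop-2.7.2/lib/native, where the shared libraries that provide Snappy support can be seen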

[atguigu@hadoop102 native]$ pwd

/opt/software/hadoop-2.7.2/lib/native

[atguigu@hadoop102 native]$ ll

-rw-r--r--. 1 atguigu atguigu 472950 9月 1 10:19 libsnappy.a

-rwxr-xr-x. 1 atguigu atguigu 955 9月 1 10:19 libsnappy.la

lrwxrwxrwx. 1 atguigu atguigu 18 12月 24 20:39 libsnappy.so -> libsnappy.so.1.3.0

lrwxrwxrwx. 1 atguigu atguigu 18 12月 24 20:39 libsnappy.so.1 -> libsnappy.so.1.3.0

-rwxr-xr-x. 1 atguigu atguigu 228177 9月 1 10:19 libsnappy.so.1.3.0

6. Copy everything in /opt/software/hadoop-2.7.2/lib/native to /opt/module/hadoop-2.7.2/lib/native on the development cluster

[atguigu@hadoop102 native]$ cp ../native/* /opt/module/hadoop-2.7.2/lib/native/

7. Distribute to the cluster

[atguigu@hadoop102 lib]$ xsync native/

8. Check again which compression libraries Hadoop supports

[atguigu@hadoop102 hadoop-2.7.2]$ hadoop checknative

17/12/24 20:45:02 WARN bzip2.Bzip2Factory: Failed to load/initialize native-bzip2 library system-native, will use pure-Java version

17/12/24 20:45:02 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library

Native library checking:

hadoop: true /opt/module/hadoop-2.7.2/lib/native/libhadoop.so

zlib: true /lib64/libz.so.1

snappy: true /opt/module/hadoop-2.7.2/lib/native/libsnappy.so.1

lz4: true revision:99

bzip2: false

9. Restart the Hadoop cluster and Hive

8.6.2 Testing storage and compression

Official documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC

Compression settings for ORC storage:

Table 6-12

Key	Default	Notes
orc.compress	ZLIB	high level compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size	262,144	number of bytes in each compression chunk
orc.stripe.size	67,108,864	number of bytes in each stripe
orc.row.index.stride	10,000	number of rows between index entries (must be >= 1000)
orc.create.index	true	whether to create row indexes
orc.bloom.filter.columns	""	comma separated list of column names for which bloom filter should be created
orc.bloom.filter.fpp	0.05	false positive probability for bloom filter (must >0.0 and <1.0)

1. Create an uncompressed ORC table

(1) Table definition

create table log_orc_none(

track_time string,

url string,

session_id string,

referer string,

ip string,

end_user_id string,

city_id string

)

row format delimited fields terminated by '\t'

stored as orc tblproperties ("orc.compress"="NONE");

(2) Insert the data

hive (default)> insert into table log_orc_none select * from log_text ;

(3) Check the data size after the insert

hive (default)> dfs -du -h /user/hive/warehouse/log_orc_none/ ;

7.7 M /user/hive/warehouse/log_orc_none/000000_0

2. Create an ORC table compressed with SNAPPY

(1) Table definition

create table log_orc_snappy(

track_time string,

url string,

session_id string,

referer string,

ip string,

end_user_id string,

city_id string

)

row format delimited fields terminated by '\t'

stored as orc tblproperties ("orc.compress"="SNAPPY");

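(2) Insert the data

hive (default)> insert into table log_orc_snappy select * from log_text ;

(3) Check the data size after the insert

hive (default)> dfs -du -h /user/hive/warehouse/log_orc_snappy/ ;

3.8 M /user/hive/warehouse/log_orc_snappy/000000_0

3. The ORC table created with default settings in the previous section had the following size after loading

2.8 M /user/hive/warehouse/log_orc/000000_0

This is even smaller than the Snappy-compressed table. The reason is that ORC files use ZLIB compression by default, and ZLIB compresses more tightly than Snappy.

4. Summary of storage formats and compression

In real projects, Hive tables are usually stored as orc or parquet, and the compression codec is usually snappy or lzo.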
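
The experiments above cover ORC only. For Parquet, the compression codec can likewise be chosen per table. The following is a sketch, assuming a Hive version that honors the parquet.compression table property and the log_text table from section 8.5.5; the table name log_parquet_snappy is made up for illustration.

```sql
-- Hypothetical table name; same columns as log_text.
create table log_parquet_snappy(
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
)
stored as parquet tblproperties ("parquet.compression"="SNAPPY");

insert into table log_parquet_snappy select * from log_text;
```

Running dfs -du -h on the new table directory afterwards allows a comparison with the 13.1 M measured for the uncompressed Parquet table.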

Chapter 9 Enterprise-level tuning

9.1 Fetch

Fetch means that Hive can answer certain queries without running MapReduce. For example, for SELECT * FROM employees; Hive can simply read the files in the storage directory of the employees table and print the result to the console.

In hive-default.xml.template, hive.fetch.task.conversion defaults to more; older Hive versions default to minimal. With the value more, full-table selects, column selects, and limit queries do not run MapReduce.

<property>
<name>hive.fetch.task.conversion</name>
<value>more</value>
<description>
Expects one of [none, minimal, more].
Some select queries can be converted to single FETCH task minimizing latency.
Currently the query should be single sourced not having any subquery and should not have
any aggregations or distincts (which incurs RS), lateral views and joins.
0. none : disable hive.fetch.task.conversion
1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
2. more : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
</description>
</property>

Hands-on example:

1) Set hive.fetch.task.conversion to none and run the queries; each of them runs a MapReduce job.

hive (default)> set hive.fetch.task.conversion=none;

hive (default)> select * from emp;

hive (default)> select ename from emp;

hive (default)> select ename from emp limit 3;

2) Set hive.fetch.task.conversion to more and run the queries; none of the following runs a MapReduce job.

hive (default)> set hive.fetch.task.conversion=more;

hive (default)> select * from emp;

hive (default)> select ename from emp;

hive (default)> select ename from emp limit 3;

9.2 本地模式

大多數的Hadoop Job是須要Hadoop提供的完整的可擴展性來處理大數據集的。不過，有時Hive的輸入數據量是很是小的。在這種狀況下，爲查詢觸發執行任務消耗的時間可能會比實際job的執行時間要多的多。對於大多數這種狀況，Hive能夠經過本地模式在單臺機器上處理全部的任務。對於小數據集，執行時間能夠明顯被縮短。

用戶能夠經過設置hive.exec.mode.local.auto的值爲true，來讓Hive在適當的時候自動啓動這個優化。

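-- Enable local MR
set hive.exec.mode.local.auto=true;

-- Maximum input size for local MR; local MR is used when the input is smaller than this value. Default: 134217728 (128 MB)
set hive.exec.mode.local.auto.inputbytes.max=50000000;

-- Maximum number of input files for local MR; local MR is used when the file count is smaller than this value. Default: 4
set hive.exec.mode.local.auto.input.files.max=10;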
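The two thresholds above can be read as a simple predicate. The following Java sketch only illustrates that reading; the class and method names are hypothetical, and Hive's real decision may take further conditions into account that this section does not describe.

```java
// Illustrative model of the local-mode thresholds described above.
// LocalModeDecision and shouldRunLocally are hypothetical names, not Hive APIs.
final class LocalModeDecision {

    // Local mode is chosen only when the feature is on and both limits are respected.
    static boolean shouldRunLocally(boolean autoEnabled, long inputBytes, int inputFiles,
                                    long maxInputBytes, int maxInputFiles) {
        return autoEnabled && inputBytes < maxInputBytes && inputFiles < maxInputFiles;
    }

    public static void main(String[] args) {
        long maxBytes = 134_217_728L; // default quoted in the text (128 MB)
        int maxFiles = 4;             // default quoted in the text
        // A 50 MB input in 2 files qualifies; a 500 MB input does not.
        System.out.println(shouldRunLocally(true, 50_000_000L, 2, maxBytes, maxFiles));
        System.out.println(shouldRunLocally(true, 500_000_000L, 2, maxBytes, maxFiles));
    }
}
```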

Hands-on example:

1) Enable local mode and run a query

hive (default)> set hive.exec.mode.local.auto=true;

hive (default)> select * from emp cluster by deptno;

Time taken: 1.328 seconds, Fetched: 14 row(s)

2) Disable local mode and run the same query

hive (default)> set hive.exec.mode.local.auto=false;

hive (default)> select * from emp cluster by deptno;

Time taken: 20.09 seconds, Fetched: 14 row(s)

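9.3 Table Optimization

9.3.1 Small Table JOIN Big Table

Put the table whose keys are relatively scattered and whose data volume is small on the left side of the join; this effectively reduces the chance of out-of-memory errors. Going further, a map join can load a small dimension table (fewer than 1,000 records) into memory first and complete the join on the map side, with no reduce phase.

Testing shows that newer Hive versions have optimized both small-table JOIN big-table and big-table JOIN small-table; whether the small table is on the left or the right no longer makes a noticeable difference.

Hands-on example

1. Requirement

Compare the efficiency of big-table JOIN small-table and small-table JOIN big-table

2. Statements that create the big table, the small table and the post-join table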

-- Create the big table

create table bigtable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';

-- Create the small table

create table smalltable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';

-- Create the post-join table

create table jointable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';

3. Load data into the big table and the small table

hive (default)> load data local inpath '/opt/module/datas/bigtable' into table bigtable;

hive (default)>load data local inpath '/opt/module/datas/smalltable' into table smalltable;

4. Disable automatic map join (enabled by default)

set hive.auto.convert.join = false;

5. Run the small-table JOIN big-table statement

insert overwrite table jointable

select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url

from smalltable s

left join bigtable b

on b.id = s.id;

Time taken: 35.921 seconds

No rows affected (44.456 seconds)

6. Run the big-table JOIN small-table statement

insert overwrite table jointable

select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url

from bigtable b

left join smalltable s

on s.id = b.id;

Time taken: 34.196 seconds

No rows affected (26.287 seconds)

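9.3.2 Big Table JOIN Big Table

1. Filtering empty keys

A join sometimes times out because certain keys have too much data: all rows with the same key are sent to the same reducer, which then runs out of memory. We should analyze these abnormal keys carefully. In many cases their data is abnormal and should be filtered out in the SQL statement, for example when the key column is null, as follows:

Hands-on example

(1) Configure the history server

Configure mapred-site.xml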

<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop102:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop102:19888</value>
</property>

Start the history server

sbin/mr-jobhistory-daemon.sh start historyserver

View the job history

http://hadoop102:19888/jobhistory

(2) Create the original data table, the null-id table and the merged table

-- Create the original table

create table ori(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';

-- Create the null-id table

create table nullidtable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';

-- Create the post-join table

create table jointable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';

(3) Load the original data and the null-id data into their tables

hive (default)> load data local inpath '/opt/module/datas/ori' into table ori;

hive (default)> load data local inpath '/opt/module/datas/nullid' into table nullidtable;

(4) Test without filtering null ids

hive (default)> insert overwrite table jointable

select n.* from nullidtable n left join ori o on n.id = o.id;

Time taken: 42.038 seconds

Time taken: 37.284 seconds

(5) Test with null ids filtered out

hive (default)> insert overwrite table jointable

select n.* from (select * from nullidtable where id is not null ) n left join ori o on n.id = o.id;

Time taken: 31.725 seconds

Time taken: 28.876 seconds

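2. Converting empty keys

Sometimes a large amount of data has an empty key, yet the data is not abnormal and must be included in the join result. In that case we can assign a random value to the empty key field in table a, so that the rows are spread randomly and evenly across different reducers. For example:

Hands-on example:

Without random distribution of null values:

(1) Set the number of reducers to 5

set mapreduce.job.reduces = 5;

(2) JOIN the two tables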

insert overwrite table jointable

select n.* from nullidtable n left join ori b on n.id = b.id;

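Result: as Figure 6-13 shows, data skew appears: some reducers consume far more resources than the others.

Figure 6-13 Empty key conversion

With random distribution of null values:

(1) Set the number of reducers to 5

set mapreduce.job.reduces = 5;

(2) JOIN the two tables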

insert overwrite table jointable

select n.* from nullidtable n full join ori o on

case when n.id is null then concat('hive', rand()) else n.id end = o.id;

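Result: as Figure 6-14 shows, the data skew is eliminated and resource consumption is balanced across the reducers.

Figure 6-14 Random distribution of null values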
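Why the substituted key removes the skew can be seen in a small simulation. The Java sketch below is a simplified model, not Hive code: it assumes reducers are chosen by hashing the join key, and it uses a per-row counter as the salt instead of rand() so that the run is reproducible. All names are hypothetical.

```java
// Simplified model of how join keys are spread over reducers.
// Assumption: a reducer is picked by hashing the key; Hive's real partitioner may differ.
final class NullKeySalting {

    static int partition(String key, int reducers) {
        return Math.floorMod(key.hashCode(), reducers);
    }

    // ids[i] == null marks a row with an empty join key. Returns rows per reducer.
    static int[] load(String[] ids, int reducers, boolean saltNulls) {
        int[] perReducer = new int[reducers];
        for (int i = 0; i < ids.length; i++) {
            String key = ids[i];
            if (key == null) {
                // Unsalted: every empty key looks the same, so all such rows share a reducer.
                // Salted: each empty key gets a distinct placeholder ("hive" + counter).
                key = saltNulls ? "hive" + i : "";
            }
            perReducer[partition(key, reducers)]++;
        }
        return perReducer;
    }

    static int max(int[] values) {
        int best = 0;
        for (int v : values) {
            best = Math.max(best, v);
        }
        return best;
    }

    public static void main(String[] args) {
        String[] ids = new String[1000]; // 1000 rows, all with an empty key
        System.out.println(java.util.Arrays.toString(load(ids, 5, false)));
        System.out.println(java.util.Arrays.toString(load(ids, 5, true)));
    }
}
```

Because the placeholder values never match a real id in the other table, the salted rows still come out of a left join unmatched, which is the same outcome the unsalted null keys would have had.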

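9.3.3 MapJoin

If MapJoin is not specified or its conditions are not met, the Hive parser converts the join into a Common Join, i.e. the join is completed in the reduce phase, which is prone to data skew. MapJoin loads the entire small table into memory and performs the join on the map side, avoiding reducer processing.

1. Parameters for enabling MapJoin

(1) Enable automatic MapJoin selection (default is true)

set hive.auto.convert.join = true;

(2) Threshold separating big and small tables (tables under 25 MB are treated as small by default):

set hive.mapjoin.smalltable.filesize=25000000;

2. How MapJoin works, as shown in Figure 6-15

Figure 6-15 How MapJoin works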
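Since the figure is not reproduced here, the core idea can be sketched in plain Java: build an in-memory lookup from the small table, then stream the big table past it. This is only a conceptual sketch under the assumption that keys in the small table are unique; the class name and row layout are hypothetical and this is not how Hive implements MapJoin internally.

```java
// Conceptual hash join: the small table is held in memory, the big table is streamed.
final class MapSideJoinSketch {

    // small rows: {id, name}; big rows: {id, payload}. Returns "id<TAB>payload<TAB>name".
    static java.util.List<String> join(String[][] small, String[][] big) {
        java.util.Map<String, String> lookup = new java.util.HashMap<>();
        for (String[] row : small) {
            lookup.put(row[0], row[1]); // assumes unique ids in the small table
        }
        java.util.List<String> out = new java.util.ArrayList<>();
        for (String[] row : big) {
            String match = lookup.get(row[0]);
            if (match != null) { // inner join: unmatched big-table rows are dropped
                out.add(row[0] + "\t" + row[1] + "\t" + match);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] small = {{"1", "a"}, {"2", "b"}};
        String[][] big = {{"1", "x"}, {"3", "y"}, {"2", "z"}, {"1", "w"}};
        for (String line : join(small, big)) {
            System.out.println(line);
        }
    }
}
```

No shuffle by key is needed in this scheme, which is why a skewed key in the big table does not overload a single reducer.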

Hands-on example:

(1) Enable MapJoin (default is true)

set hive.auto.convert.join = true;

(2) Run the small-table JOIN big-table statement

insert overwrite table jointable

select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url

from smalltable s

join bigtable b

on s.id = b.id;

Time taken: 24.594 seconds

(3) Run the big-table JOIN small-table statement

insert overwrite table jointable

select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url

from bigtable b

join smalltable s

on s.id = b.id;

Time taken: 24.315 seconds

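9.3.4 Group By

By default, all rows with the same key leave the map phase for a single reducer, so one oversized key causes skew.

Not every aggregation has to be completed on the reduce side. Many aggregations can be partially performed on the map side first, with the final result produced on the reduce side.

1. Parameters for enabling map-side aggregation

(1) Whether to aggregate on the map side; default true

hive.map.aggr = true

(2) Number of rows over which map-side aggregation is performed

hive.groupby.mapaggr.checkinterval = 100000

(3) Load-balance when the data is skewed (default false)

hive.groupby.skewindata = true

When this option is set to true, the generated query plan has two MR jobs. In the first MR job, map output is distributed randomly to the reducers; each reducer performs a partial aggregation and emits its result, so rows with the same Group By key may go to different reducers, which balances the load. The second MR job distributes the pre-aggregated results to reducers by Group By key (which guarantees that the same Group By key reaches the same reducer) and completes the final aggregation.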
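The two-job plan described above can be imitated with a short Java simulation of a count per key. It is a sketch only: the class name, the seeded random generator and the in-memory maps are assumptions made for illustration and say nothing about Hive's actual operators.

```java
// Simulates the two-stage aggregation: random spread plus partial counts, then a merge by key.
final class TwoStageGroupBy {

    static java.util.Map<String, Long> count(String[] keys, int reducers, long seed) {
        java.util.Random random = new java.util.Random(seed);

        // Stage 1: rows go to a random reducer, so a hot key is shared by all of them.
        java.util.List<java.util.Map<String, Long>> partials = new java.util.ArrayList<>();
        for (int r = 0; r < reducers; r++) {
            partials.add(new java.util.HashMap<>());
        }
        for (String key : keys) {
            partials.get(random.nextInt(reducers)).merge(key, 1L, Long::sum);
        }

        // Stage 2: partial results are regrouped by key and summed into the final counts.
        java.util.Map<String, Long> totals = new java.util.TreeMap<>();
        for (java.util.Map<String, Long> partial : partials) {
            for (java.util.Map.Entry<String, Long> e : partial.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "a", "a", "b"};
        System.out.println(count(keys, 3, 42L));
    }
}
```

The second stage sees at most one partial row per key per first-stage reducer, so even a very hot key contributes only a handful of rows to the final reducer.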

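9.3.5 Count(Distinct) Deduplicated Counting

With little data this does not matter. With a lot of data, COUNT DISTINCT has to be completed by a single reduce task, and that one reducer has so much data to process that the whole job struggles to finish. COUNT DISTINCT is therefore usually replaced by GROUP BY followed by COUNT:

Hands-on example

1. Create a big table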

hive (default)> create table bigtable(id bigint, time bigint, uid string, keyword

string, url_rank int, click_num int, click_url string) row format delimited

fields terminated by '\t';

2. Load the data

hive (default)> load data local inpath '/opt/module/datas/bigtable' into table

bigtable;

3. Set the number of reducers to 5

set mapreduce.job.reduces = 5;

4. Run the distinct-id count

hive (default)> select count(distinct id) from bigtable;

Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 7.12 sec HDFS Read: 120741990 HDFS Write: 7 SUCCESS

Total MapReduce CPU Time Spent: 7 seconds 120 msec

100001

Time taken: 23.607 seconds, Fetched: 1 row(s)

5. Deduplicate ids with GROUP BY

hive (default)> select count(id) from (select id from bigtable group by id) a;

Stage-Stage-1: Map: 1 Reduce: 5 Cumulative CPU: 17.53 sec HDFS Read: 120752703 HDFS Write: 580 SUCCESS

Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 4.29 sec HDFS Read: 9409 HDFS Write: 7 SUCCESS

Total MapReduce CPU Time Spent: 21 seconds 820 msec

_c0

100001

Time taken: 50.795 seconds, Fetched: 1 row(s)

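Although this takes one extra job (and is slower on the small test table above), it is well worth it when the data volume is large.

9.3.6 Cartesian Product

Avoid Cartesian products wherever possible. When a join has no ON condition, or an invalid one, Hive can use only one reducer to compute the Cartesian product.

9.3.7 Row and Column Filtering

Column pruning: in SELECT, take only the columns you need; if the table is partitioned, filter on the partition wherever possible, and avoid SELECT *.

Row pruning: in partition pruning with an outer join, if the filter on the secondary table is written in the WHERE clause, the full tables are joined first and filtered afterwards. For example:

Hands-on example:

1. Join the two tables first, then filter with WHERE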

hive (default)> select o.id from bigtable b

join ori o on o.id = b.id

where o.id <= 10;

Time taken: 34.406 seconds, Fetched: 100 row(s)

2. Filter in a subquery first, then join

hive (default)> select b.id from bigtable b

join (select id from ori where id <= 10 ) o on b.id = o.id;

Time taken: 30.058 seconds, Fetched: 100 row(s)

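9.3.8 Dynamic Partition Tuning

In a relational database, when you insert into a partitioned table the database automatically places each row in the right partition based on the partition column's value. Hive provides a similar mechanism, dynamic partitioning (Dynamic Partition), but it has to be configured before use.

1. Parameters for enabling dynamic partitioning

(1) Enable dynamic partitioning (default true, enabled)

hive.exec.dynamic.partition=true

(2) Set non-strict mode (the dynamic partition mode defaults to strict, which requires at least one partition to be specified statically; nonstrict allows all partition columns to be dynamic).

hive.exec.dynamic.partition.mode=nonstrict

(3) Maximum total number of dynamic partitions that can be created across all nodes running MR.

hive.exec.max.dynamic.partitions=1000

(4) Maximum number of dynamic partitions that can be created on each node running MR. Set this according to the actual data. For example, if the source data covers one year, so the day column has 365 distinct values, this parameter must be set above 365; with the default of 100 the job fails.

hive.exec.max.dynamic.partitions.pernode=100

(5) Maximum number of HDFS files that can be created in the whole MR job.

hive.exec.max.created.files=100000

(6) Whether to throw an exception when an empty partition is generated. Usually this does not need to be set.

hive.error.on.empty.partition=false

2. Hands-on example

Requirement: insert the data in ori into the matching partitions of the target table ori_partitioned_target according to time (e.g. 20111230000008).

(1) Create the partitioned table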

create table ori_partitioned(id bigint, time bigint, uid string, keyword string,

url_rank int, click_num int, click_url string)

partitioned by (p_time bigint)

row format delimited fields terminated by '\t';

(2) Load data into the partitioned table

hive (default)> load data local inpath '/home/atguigu/ds1' into table

ori_partitioned partition(p_time='20111230000010') ;

hive (default)> load data local inpath '/home/atguigu/ds2' into table ori_partitioned partition(p_time='20111230000011') ;

(3) Create the target partitioned table

create table ori_partitioned_target(id bigint, time bigint, uid string,

keyword string, url_rank int, click_num int, click_url string) PARTITIONED BY (p_time STRING) row format delimited fields terminated by '\t';

(4) Configure dynamic partitioning

set hive.exec.dynamic.partition = true;

set hive.exec.dynamic.partition.mode = nonstrict;

set hive.exec.max.dynamic.partitions = 1000;

set hive.exec.max.dynamic.partitions.pernode = 100;

set hive.exec.max.created.files = 100000;

set hive.error.on.empty.partition = false;

hive (default)> insert overwrite table ori_partitioned_target partition (p_time)

select id, time, uid, keyword, url_rank, click_num, click_url, p_time from ori_partitioned;

(5) View the partitions of the target table

hive (default)> show partitions ori_partitioned_target;

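9.3.9 Bucketing

See Section 6.6.

9.3.10 Partitioning

See Section 4.6.

9.4 Data Skew

9.4.1 Setting a Reasonable Number of Maps

1) Normally a job produces one or more map tasks from its input directory.

The main determining factors are: the total number of input files, the size of the input files, and the block size configured on the cluster.

2) Is a larger number of maps always better?

No. If a job has many small files (far smaller than the 128 MB block size), each small file is treated as a block and handled by its own map task. When a map task takes far longer to start and initialize than to do its actual processing, a great deal of resources is wasted. In addition, the number of maps that can run at the same time is limited.

3) Is it enough to make sure every map handles a file block of close to 128 MB?

Not necessarily. Take a 127 MB file, which would normally be handled by one map. If the file has only one or two small fields but tens of millions of records, and the map logic is fairly complex, a single map task will certainly be slow.

To address issues 2 and 3 above, we use two approaches: reducing the number of maps and increasing the number of maps.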

9.4.2 Merging Small Files

Merge small files before the map phase to reduce the number of maps: CombineHiveInputFormat (the system default format) can merge small files, whereas HiveInputFormat cannot.

set hive.input.format= org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

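9.4.3 Increasing the Number of Maps for Complex Files

When the input files are large, the task logic is complex and the maps run very slowly, consider increasing the number of maps so that each map processes less data, which improves the job's execution efficiency.

To increase the number of maps, use the formula computeSplitSize(Math.max(minSize, Math.min(maxSize, blocksize))) = blocksize = 128M and adjust maxSize: setting maxSize below blocksize increases the number of maps.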
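The formula is easy to check by hand with a few lines of Java. This sketch only reproduces the max/min expression quoted above; the class name is hypothetical, and the minSize and maxSize values in main are illustrative assumptions rather than defaults taken from this document.

```java
// Reproduces the split-size expression quoted above: max(minSize, min(maxSize, blockSize)).
final class SplitSizeSketch {

    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // 128 MB block
        // With a very large maxSize the split size equals the block size.
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));
        // Lowering maxSize below the block size shrinks each split, so more maps are needed.
        System.out.println(computeSplitSize(blockSize, 1L, 100L));
    }
}
```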

Hands-on example:

1. Run the query

hive (default)> select count(*) from emp;

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1

2. Set the maximum split size to 100 bytes

hive (default)> set mapreduce.input.fileinputformat.split.maxsize=100;

hive (default)> select count(*) from emp;

Hadoop job information for Stage-1: number of mappers: 6; number of reducers: 1

9.4.4 Setting a Reasonable Number of Reducers

1. Method one for adjusting the number of reducers

(1) Amount of data processed per reducer; default 256 MB

hive.exec.reducers.bytes.per.reducer=256000000

(2) Maximum number of reducers per job; default 1009

hive.exec.reducers.max=1009

(3) Formula for the number of reducers

N = min(parameter 2, total input size / parameter 1)
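A worked instance of the formula, written as a small Java sketch. Rounding the division up and returning at least one reducer are assumptions added here so that the result is a usable integer; the text itself only gives the min(...) expression, and the class name is hypothetical.

```java
// Worked example of N = min(parameter 2, total input size / parameter 1).
final class ReducerEstimate {

    static int estimate(long totalInputBytes, long bytesPerReducer, int maxReducers) {
        // Assumption: round up, so a partial chunk of data still gets a reducer.
        long byVolume = (totalInputBytes + bytesPerReducer - 1) / bytesPerReducer;
        return (int) Math.max(1L, Math.min((long) maxReducers, byVolume));
    }

    public static void main(String[] args) {
        long bytesPerReducer = 256_000_000L; // hive.exec.reducers.bytes.per.reducer
        int maxReducers = 1009;              // hive.exec.reducers.max
        System.out.println(estimate(1_000_000_000L, bytesPerReducer, maxReducers));     // about 1 GB of input
        System.out.println(estimate(1_000_000_000_000L, bytesPerReducer, maxReducers)); // about 1 TB of input
    }
}
```

For roughly 1 GB of input the volume term decides the result; for roughly 1 TB the cap of 1009 takes over.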

2. Method two for adjusting the number of reducers

Modify it in Hadoop's mapred-default.xml

Set the number of reducers for each job

set mapreduce.job.reduces = 15;

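3. More reducers is not always better

1) Starting and initializing too many reducers also costs time and resources;

2) In addition, there are as many output files as reducers. If many small files are produced and then used as input to the next job, the too-many-small-files problem appears again.

When setting the number of reducers, keep these two principles in mind: use a suitable number of reducers for large data volumes, and make sure each reducer handles a suitable amount of data.

9.5 Parallel Execution

Hive turns a query into one or more stages. A stage can be a MapReduce stage, a sampling stage, a merge stage, a limit stage, or any other stage Hive needs during execution. By default Hive runs only one stage at a time. A given job, however, may contain many stages that are not fully dependent on one another, so some of them can run in parallel, which may shorten the job's total execution time; the more stages that can run in parallel, the sooner the job is likely to finish.

Setting hive.exec.parallel to true enables concurrent execution. On a shared cluster, note that cluster utilization rises as the number of parallel stages in a job increases.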

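-- Enable parallel execution of stages
set hive.exec.parallel=true;

-- Maximum degree of parallelism allowed for one SQL statement; default 8
set hive.exec.parallel.thread.number=16;

Of course, this only helps when the system has spare resources; without free resources, nothing can actually run in parallel.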

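9.6 Strict Mode

Hive provides a strict mode that prevents users from running queries that may have unexpected harmful effects.

The hive.mapred.mode property defaults to nonstrict. To enable strict mode, set hive.mapred.mode to strict; strict mode prohibits three types of queries.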

<property>
<name>hive.mapred.mode</name>
<value>strict</value>
<description>
The mode in which the Hive operations are being performed.
In strict mode, some risky queries are not allowed to run. They include:
Cartesian Product.
No partition being picked up for a query.
Comparing bigints and strings.
Comparing bigints and doubles.
Orderby without limit.
</description>
</property>

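1) For partitioned tables, a query is not allowed to run unless its WHERE clause filters on a partition column to limit the range. In other words, users may not scan all partitions. The reason is that partitioned tables usually hold very large and fast-growing data sets, and a query without a partition filter could consume an unacceptable amount of resources.
2) Queries that use order by must also use limit. To perform the sort, order by sends all result rows to a single reducer; forcing users to add a LIMIT clause keeps that reducer from running for an excessively long time.
3) Cartesian product queries are restricted. Users who know relational databases well may expect to write a JOIN without an ON clause and put the condition in WHERE instead, since a relational optimizer can efficiently turn the WHERE clause into the ON clause. Unfortunately Hive does not perform this optimization, so if the tables are large enough the query gets out of control.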

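9.7 JVM Reuse

JVM reuse is a Hadoop tuning parameter that has a large effect on Hive performance, especially in scenarios where small files are hard to avoid or where there are very many tasks, most of which run only briefly.

Hadoop's default configuration usually forks a new JVM to run each map and reduce task. JVM startup can then add considerable overhead, especially when a job contains hundreds or thousands of tasks. JVM reuse lets a JVM instance be reused N times within the same job. N is configured in Hadoop's mapred-site.xml, usually between 10 and 20; the exact value should be determined by testing against the actual workload.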

<property>
<name>mapreduce.job.jvm.numtasks</name>
<description>How many tasks to run per jvm. If set to -1, there is
no limit.
</description>
</property>

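The drawback of this feature is that JVM reuse holds on to the task slots it has used, so they can be reused, until the job finishes. If an "unbalanced" job has a few reduce tasks that run much longer than the others, the reserved slots sit idle, unusable by other jobs, until all tasks have finished.

9.8 Speculative Execution

In a distributed cluster, program bugs (including bugs in Hadoop itself), unbalanced load or uneven resource distribution can make the tasks of one job run at different speeds. Some tasks may be noticeably slower than others (for example, one task of a job is only 50% done while all the others have finished), and they drag down the job's overall progress. To avoid this, Hadoop uses speculative execution (Speculative Execution): following certain rules, it identifies the straggler tasks and starts a backup task for each one, which processes the same data in parallel with the original task; the result of whichever finishes successfully first is used as the final result.

Parameters that enable speculative execution, configured in Hadoop's mapred-site.xml: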

<property>
<name>mapreduce.map.speculative</name>
<description>If true, then multiple instances of some map tasks
may be executed in parallel.</description>
</property>
<property>
<name>mapreduce.reduce.speculative</name>
<description>If true, then multiple instances of some reduce tasks
may be executed in parallel.</description>
</property>

Hive itself also provides a setting to control reduce-side speculative execution:

<property>
<name>hive.mapred.reduce.tasks.speculative.execution</name>
<description>Whether speculative execution for reducers should be turned on. </description>
</property>

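It is hard to give concrete advice on tuning these speculative-execution settings. If users are very sensitive to runtime variance, they can turn these features off. If map or reduce tasks run for a long time because the input is very large, the waste caused by speculative execution is enormous.

9.9 Compression

See Chapter 8.

9.10 Execution Plan (Explain)

1. Basic syntax

EXPLAIN [EXTENDED | DEPENDENCY | AUTHORIZATION] query

2. Hands-on example

(1) View the execution plan of the following statements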

hive (default)> explain select * from emp;

hive (default)> explain select deptno, avg(sal) avg_sal from emp group by deptno;

(2) View the detailed execution plan

hive (default)> explain extended select * from emp;

hive (default)> explain extended select deptno, avg(sal) avg_sal from emp group by deptno;

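Chapter 10 Hive in Practice: Gulivideo

10.1 Requirements

Compute the regular metrics of the Gulivideo video site, including various TopN metrics:

-- Top 10 videos by view count

-- Top 10 video categories by popularity

-- Categories of the Top 20 videos by view count

-- Category rank of the videos related to the Top 50 videos by view count

-- Top 10 videos by popularity in each category

-- Top 10 videos by traffic in each category

-- Top 10 users by number of uploaded videos, and the videos they uploaded

-- Top 10 videos by view count in each category

10.2 Project

10.2.1 Data Structure

1. Video table

Table 6-13 Video table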

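Field	Note	Description
video id	unique video id	11-character string
uploader	video uploader	username of the uploading user, String
age	video age	whole number of days the video has been on the platform
category	video category	category assigned when the video was uploaded
length	video length	video length as an integer
views	view count	number of times the video has been viewed
rate	video rating	out of 5
ratings	traffic	video traffic, integer
comments	comment count	integer number of comments on the video
related ids	related video ids	ids of related videos, at most 20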

2. User table

Table 6-14 User table

Field	Note	Type
uploader	uploader username	string
videos	number of uploaded videos	int
friends	number of friends	int

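10.2.2 ETL of the Raw Data

Looking at the raw data, a video can belong to several categories, separated by the & character with a space on each side, and it can also have several related videos, separated by "\t". To make fields with multiple sub-elements easier to work with during analysis, we first restructure and clean the data: all categories are separated by "&" with the surrounding spaces removed, and the related video ids are also separated by "&".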
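To make the target format concrete, here is a compact Java sketch of the same cleaning rule applied to one made-up record. The class name EtlLineSketch and the sample values are assumptions for illustration only; it is a shorter restatement of the rule, not a replacement for the utility class used in the project.

```java
// Compact restatement of the cleaning rule for one input line.
// EtlLineSketch and the sample record are illustrative, not taken from the real data set.
final class EtlLineSketch {

    static String clean(String line) {
        String[] fields = line.split("\t");
        if (fields.length < 9) {
            return null; // incomplete records are dropped
        }
        fields[3] = fields[3].replace(" ", ""); // "People & Blogs" -> "People&Blogs"
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                // Fields up to the first related id stay tab-separated; further related ids use "&".
                sb.append(i <= 9 ? "\t" : "&");
            }
            sb.append(fields[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String raw = "vid0000001\tuser1\t700\tPeople & Blogs\t120\t1000\t4.5\t30\t40\trelA\trelB\trelC";
        System.out.println(clean(raw));
    }
}
```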

1. ETL: ETLUtil

public class ETLUtil {
    // Cleans one raw line; returns null when the line has fewer than 9 fields.
    public static String oriString2ETLString(String ori){
        StringBuilder etlString = new StringBuilder();
        String[] splits = ori.split("\t");
        if(splits.length < 9) return null;
        // Remove the spaces around "&" in the category field
        splits[3] = splits[3].replace(" ", "");
        for(int i = 0; i < splits.length; i++){
            if(i < 9){
                // The first 9 fields stay tab-separated
                if(i == splits.length - 1){
                    etlString.append(splits[i]);
                }else{
                    etlString.append(splits[i] + "\t");
                }
            }else{
                // Related video ids are joined with "&"
                if(i == splits.length - 1){
                    etlString.append(splits[i]);
                }else{
                    etlString.append(splits[i] + "&");
                }
            }
        }
        return etlString.toString();
    }
}

2. ETL: Mapper

import java.io.IOException;

import org.apache.commons.lang.StringUtils;

import org.apache.hadoop.io.NullWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Mapper;

import com.atguigu.util.ETLUtil;

public class VideoETLMapper extends Mapper<Object, Text, NullWritable, Text>{

Text text = new Text();

@Override

protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {

String etlString = ETLUtil.oriString2ETLString(value.toString());

if(StringUtils.isBlank(etlString)) return;

text.set(etlString);

context.write(NullWritable.get(), text);

}

}

3. ETL: Runner

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.fs.FileSystem;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.NullWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import org.apache.hadoop.util.Tool;

import org.apache.hadoop.util.ToolRunner;

public class VideoETLRunner implements Tool {

private Configuration conf = null;

@Override

public void setConf(Configuration conf) {

this.conf = conf;

}

@Override

public Configuration getConf() {

return this.conf;

}

@Override

public int run(String[] args) throws Exception {

conf = this.getConf();

conf.set("inpath", args[0]);

conf.set("outpath", args[1]);

Job job = Job.getInstance(conf);

job.setJarByClass(VideoETLRunner.class);

job.setMapperClass(VideoETLMapper.class);

job.setMapOutputKeyClass(NullWritable.class);

job.setMapOutputValueClass(Text.class);

job.setNumReduceTasks(0);

this.initJobInputPath(job);

this.initJobOutputPath(job);

return job.waitForCompletion(true) ? 0 : 1;

}

private void initJobOutputPath(Job job) throws IOException {

Configuration conf = job.getConfiguration();

String outPathString = conf.get("outpath");

FileSystem fs = FileSystem.get(conf);

Path outPath = new Path(outPathString);

if(fs.exists(outPath)){

fs.delete(outPath, true);

}

FileOutputFormat.setOutputPath(job, outPath);

}

private void initJobInputPath(Job job) throws IOException {

Configuration conf = job.getConfiguration();

String inPathString = conf.get("inpath");

FileSystem fs = FileSystem.get(conf);

Path inPath = new Path(inPathString);

if(fs.exists(inPath)){

FileInputFormat.addInputPath(job, inPath);

}else{

throw new RuntimeException("Directory does not exist in HDFS: " + inPathString);

}

}

public static void main(String[] args) {

try {

int resultCode = ToolRunner.run(new VideoETLRunner(), args);

if(resultCode == 0){

System.out.println("Success!");

}else{

System.out.println("Fail!");

}

System.exit(resultCode);

} catch (Exception e) {

e.printStackTrace();

System.exit(1);

}

}

}

4. Run the ETL

$ bin/yarn jar ~/softwares/jars/gulivideo-0.0.1-SNAPSHOT.jar \

com.atguigu.etl.ETLVideosRunner \

/gulivideo/video/2008/0222 \

/gulivideo/output/video/2008/0222

10.3 Preparation

10.3.1 Creating the Tables

Create tables: gulivideo_ori, gulivideo_user_ori

Create tables: gulivideo_orc, gulivideo_user_orc

gulivideo_ori:

create table gulivideo_ori(

videoId string,

uploader string,

age int,

category array<string>,

length int,

views int,

rate float,

ratings int,

comments int,

relatedId array<string>)

row format delimited

fields terminated by "\t"

collection items terminated by "&"

stored as textfile;

gulivideo_user_ori：

create table gulivideo_user_ori(

uploader string,

videos int,

friends int)

row format delimited

fields terminated by "\t"

stored as textfile;

Then insert the raw data into the ORC tables

gulivideo_orc:

create table gulivideo_orc(

videoId string,

uploader string,

age int,

category array<string>,

length int,

views int,

rate float,

ratings int,

comments int,

relatedId array<string>)

clustered by (uploader) into 8 buckets

row format delimited fields terminated by "\t"

collection items terminated by "&"

stored as orc;

gulivideo_user_orc：

create table gulivideo_user_orc(

uploader string,

videos int,

friends int)

row format delimited

fields terminated by "\t"

stored as orc;

10.3.2 Loading the ETL Output

gulivideo_ori：

load data inpath "/gulivideo/output/video/2008/0222" into table gulivideo_ori;

gulivideo_user_ori：

load data inpath "/gulivideo/user/2008/0903" into table gulivideo_user_ori;

10.3.3 Inserting Data into the ORC Tables

gulivideo_orc：

insert into table gulivideo_orc select * from gulivideo_ori;

gulivideo_user_orc：

insert into table gulivideo_user_orc select * from gulivideo_user_ori;

10.4 Business Analysis

10.4.1 Top 10 Videos by View Count

Approach: use order by on the views column for a global sort, and show only the first 10 rows.

Final code:

select

videoId,

uploader,

age,

category,

length,

views,

rate,

ratings,

comments

from

gulivideo_orc

order by

views

desc limit

10;

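10.4.2 Top 10 Video Categories by Popularity

Approach:

1) That is, count how many videos each category has and show the 10 categories with the most videos.

2) We need to group by category and then count the videoIds in each group.

3) In the current table structure, one video maps to one or more categories. To group by category, first explode the category column into rows, then count.

4) Finally, sort by popularity and show the first 10 rows.

Final code: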

select

category_name as category,

count(t1.videoId) as hot

from (

select

videoId,

category_name

from

gulivideo_orc lateral view explode(category) t_catetory as category_name) t1

group by

t1.category_name

order by

hot

desc limit

10;

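10.4.3 Categories of the 20 Most-Viewed Videos and the Number of Top 20 Videos in Each Category

Approach:

1) First find all the information for the 20 most-viewed videos, in descending order

2) Split out the category of these 20 rows (explode the column into rows)

3) Finally, query the category names and how many Top 20 videos each category contains

Final code: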

select

category_name as category,

count(t2.videoId) as hot_with_views

from (

select

videoId,

category_name

from (

select

*

from

gulivideo_orc

order by

views

desc limit

20) t1 lateral view explode(category) t_catetory as category_name) t2

group by

category_name

order by

hot_with_views

desc;

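10.4.4 Category Rank of the Videos Related to the Top 50 Most-Viewed Videos

Approach:

1) Query all the information for the 50 most-viewed videos (which includes each video's related videos) as temporary table t1

t1: the 50 most-viewed videos

select

*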

from

gulivideo_orc

order by

views

desc limit

50;

2) Explode the relatedId column of these 50 videos into rows as temporary table t2

t2: related video ids exploded into rows

select

explode(relatedId) as videoId

from

t1;

3) Inner join the related video ids with the gulivideo_orc table

t5: yields two columns, category and the related video id found earlier

(select

distinct(t2.videoId),

t3.category

from

t2

inner join

gulivideo_orc t3 on t2.videoId = t3.videoId) t4 lateral view explode(category) t_catetory as category_name;

4) Group by video category, count the videos in each group, then rank

Final code:

select

category_name as category,

count(t5.videoId) as hot

from (

select

videoId,

category_name

from (

select

distinct(t2.videoId),

t3.category

from (

select

explode(relatedId) as videoId

from (

select

*

from

gulivideo_orc

order by

views

desc limit

50) t1) t2

inner join

gulivideo_orc t3 on t2.videoId = t3.videoId) t4 lateral view explode(category) t_catetory as category_name) t5

group by

category_name

order by

hot

desc;

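10.4.5 Top 10 Videos by Popularity in Each Category, Using Music as an Example

Approach:

1) To find the Top 10 videos by popularity in the Music category, we first need to find the Music category, so category must be exploded. We can create a table to hold the data with categoryId exploded.

2) Insert data into the exploded-category table.

3) Compute the video popularity for the chosen category (Music).

Final code:

Create the category table: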

create table gulivideo_category(

videoId string,

uploader string,

age int,

categoryId string,

length int,

views int,

rate float,

ratings int,

comments int,

relatedId array<string>)

row format delimited

fields terminated by "\t"

collection items terminated by "&"

stored as orc;

Insert data into the category table:

insert into table gulivideo_category

select

videoId,

uploader,

age,

categoryId,

length,

views,

rate,

ratings,

comments,

relatedId

from

gulivideo_orc lateral view explode(category) catetory as categoryId;

Top 10 for the Music category (other categories work the same way)

select

videoId,

views

from

gulivideo_category

where

categoryId = "Music"

order by

views

desc limit

10;

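10.4.6 Top 10 Videos by Traffic in Each Category, Using Music as an Example

Approach:

1) Create the exploded video category table (the table with categoryId exploded into rows)

2) Sort by ratings

Final code: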

select

videoId,

views,

ratings

from

gulivideo_category

where

categoryId = "Music"

order by

ratings

desc limit

10;

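10.4.7 Top 10 Users by Number of Uploads and Their Top 20 Most-Viewed Uploaded Videos

Approach:

1) First find the user information of the 10 users who uploaded the most videos

select

*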

from

gulivideo_user_orc

order by

videos

desc limit

10;

2) Join with the gulivideo_orc table on the uploader column and sort the result by views.

Final code:

select

t2.videoId,

t2.views,

t2.ratings,

t1.videos,

t1.friends

from (

select

*

from

gulivideo_user_orc

order by

videos desc

limit

10) t1

join

gulivideo_orc t2

on t1.uploader = t2.uploader

order by

views desc

limit

20;

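10.4.8 Top 10 Videos by View Count in Each Category

Approach:

1) First obtain the table data with categoryId exploded

2) In a subquery, partition by categoryId, sort within each partition and generate an increasing number; name this column rank

3) From the temporary table produced by the subquery, select the rows whose rank is at most 10.

Final code: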

select

t1.*

from (

select

videoId,

categoryId,

views,

row_number() over(partition by categoryId order by views desc) rank from gulivideo_category) t1

where

rank <= 10;
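For readers less familiar with window functions, the effect of row_number() over(partition by ... order by ... desc) followed by the rank filter can be mirrored in plain Java. The sketch below is only an analogy with made-up rows; the class name and data are hypothetical.

```java
// Mirrors "row_number() over(partition by categoryId order by views desc) <= n" in plain Java.
final class TopNPerGroup {

    // rows: {videoId, categoryId, views}. Returns the top-n videoIds per category.
    static java.util.Map<String, java.util.List<String>> topN(String[][] rows, int n) {
        String[][] sorted = rows.clone();
        // Order all rows by views, highest first (the "order by views desc" part).
        java.util.Arrays.sort(sorted,
                (a, b) -> Long.compare(Long.parseLong(b[2]), Long.parseLong(a[2])));
        java.util.Map<String, java.util.List<String>> result = new java.util.TreeMap<>();
        for (String[] row : sorted) {
            // One bucket per category (the "partition by categoryId" part).
            java.util.List<String> bucket =
                    result.computeIfAbsent(row[1], k -> new java.util.ArrayList<>());
            if (bucket.size() < n) { // keep only rows whose row number is at most n
                bucket.add(row[0]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"v1", "Music", "10"}, {"v2", "Music", "30"},
            {"v3", "Music", "20"}, {"v4", "Comedy", "5"}
        };
        System.out.println(topN(rows, 2));
    }
}
```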

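Chapter 11 Common Errors and Solutions

1) SecureCRT 7.3 shows garbled characters or cannot delete data: uninstall the portable version of SecureCRT, or operate directly in the virtual machine, or switch to the installed version of SecureCRT.

2) Cannot connect to the MySQL database

(1) The wrong driver package was imported. mysql-connector-java-5.1.27-bin.jar is what should go into /opt/module/hive/lib; instead mysql-connector-java-5.1.27.tar.gz was mistakenly placed under hive/lib.

(2) The host in the user table was not changed to % everywhere; it was changed to localhost instead.

3) Hive's default input format is CombineHiveInputFormat, which merges small files.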

hive (default)> set hive.input.format;

hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat

Switching to HiveInputFormat produces output files according to the number of partitions.

hive (default)> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

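4) MapReduce programs cannot run

Hadoop's YARN may not have been started.

5) Starting the MySQL service reports the error "MySQL server PID file could not be found!".

Create hadoop102.pid under /var/lock/subsys/mysql and put the following content in it: 4396

6) The error "service mysql status MySQL is not running, but lock file (/var/lock/subsys/mysql[failed])" is reported.

Solution: create the file -rw-rw----. 1 mysql mysql 5 Dec 22 16:41 hadoop102.pid under /var/lib/mysql and change its permissions to 777.

7) JVM heap memory overflow

Description: java.lang.OutOfMemoryError: Java heap space

Solution: add the following to yarn-site.xml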

<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
</property>
<property>
<name>mapred.child.java.opts</name>
</property>
