Author: foochane
Original link: https://foochane.cn/article/2019062501.html
Hive is a tool that translates SQL into MR programs. It lets users map files on HDFS to table structures and then run SQL queries and analysis over those tables (the files on HDFS). Hive stores user-defined databases, table definitions, and other metadata in its metastore (which can be a local Derby database or a remote MySQL database).
With Hive there is no need to write MR programs by hand; writing an SQL script is enough. Since Hive 2, the MapReduce engine is deprecated in favor of engines such as Spark and Tez.
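To see what Hive will actually run for a query, you can prefix it with EXPLAIN; a minimal sketch (the table name here is just a placeholder):

-- prints the stage plan (MapReduce stages, or Spark/Tez tasks) that Hive generates
explain select count(*) from some_table;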
Before starting Hive, start HDFS and YARN first.
Then enter the command $ hive:
hadoop@Master:~$ hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/bigdata/hive-2.3.5/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/bigdata/hadoop-2.7.1/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in file:/usr/local/bigdata/hive-2.3.5/conf/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> show databases;
OK
dbtest
default
Time taken: 3.539 seconds, Fetched: 2 row(s)
hive>
Tips:
Make the prompt show the current database:
hive> set hive.cli.print.current.db=true;
Show column headers in query results:
hive> set hive.cli.print.header=true;
These settings only affect the current session. To make them permanent, create a .hiverc file in the current user's home directory and add the following:
set hive.cli.print.current.db=true;
set hive.cli.print.header=true;
You can also start Hive as a server and then connect to it from any machine with the beeline client for interactive queries.
Hive is a single-node server that can be installed on any machine; it accesses the HDFS cluster.
Start the Hive server:
$ nohup hiveserver2 1>/dev/null 2>&1 &
Once started, you can connect with beeline. Beeline is a client that can run on any machine, as long as it can reach the Hive server.
Connect with beeline on the local machine:
$ beeline -u jdbc:hive2://localhost:10000 -n hadoop -p hadoop
Connect with beeline from any machine, using the server's hostname:
$ beeline -u jdbc:hive2://Master:10000 -n hadoop -p hadoop
Example:
hadoop@Master:~$ beeline -u jdbc:hive2://Master:10000 -n hadoop -p hadoop
Connecting to jdbc:hive2://Master:10000
19/06/25 01:50:12 INFO jdbc.Utils: Supplied authorities: Master:10000
19/06/25 01:50:12 INFO jdbc.Utils: Resolved authority: Master:10000
19/06/25 01:50:13 INFO jdbc.HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://Master:10000
Connected to: Apache Hive (version 2.3.5)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1.spark2 by Apache Hive
0: jdbc:hive2://Master:10000>
If the connection fails with an error like the following, the Hadoop proxy-user settings are missing:
errorMessage:Failed to open new session: java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: hadoop is not allowed to impersonate hadoop), serverProtocolVersion:null)
Add the following to core-site.xml in the Hadoop configuration, then restart the Hadoop cluster:
<property>
  <name>hadoop.proxyuser.hadoop.groups</name>
  <value>hadoop</value>
  <description>Allow the superuser hadoop to impersonate any member of the group hadoop</description>
</property>
<property>
  <name>hadoop.proxyuser.hadoop.hosts</name>
  <value>Master,127.0.0.1,localhost</value>
  <description>The superuser can connect only from Master, 127.0.0.1 or localhost to impersonate a user</description>
</property>
You can also run SQL directly from the command line with hive -e; a single invocation can run multiple SQL statements, separated by ;.
hive -e "sql1;sql2;sql3;sql4"
Alternatively, use hive -f: write the SQL statements into a file, e.g. q.hql, then execute it:
bin/hive -f q.hql
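For instance, q.hql might contain several statements (these particular statements are only an illustration):

show databases;
use db1;
select * from t_test;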
You can also put method 3 into an xxx.sh script and then run that script.
create database db1;
Example:
0: jdbc:hive2://Master:10000> create database db1;
No rows affected (1.123 seconds)
0: jdbc:hive2://Master:10000> show databases;
+----------------+--+
| database_name  |
+----------------+--+
| db1            |
| dbtest         |
| default        |
+----------------+--+
On success, Hive creates a folder under /user/hive/warehouse/ named db1.db.
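You can confirm the database's location with describe database (a sketch; the exact URI depends on your fs.defaultFS):

describe database db1;
-- shows a location such as .../user/hive/warehouse/db1.db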
drop database db1;
Example:
0: jdbc:hive2://Master:10000> drop database db1;
No rows affected (0.969 seconds)
0: jdbc:hive2://Master:10000> show databases;
+----------------+--+
| database_name  |
+----------------+--+
| dbtest         |
| default        |
+----------------+--+
use db1;
create table t_test(id int,name string,age int)
row format delimited
fields terminated by ',';
Example:
0: jdbc:hive2://Master:10000> use db1;
No rows affected (0.293 seconds)
0: jdbc:hive2://Master:10000> create table t_test(id int,name string,age int)
0: jdbc:hive2://Master:10000> row format delimited
0: jdbc:hive2://Master:10000> fields terminated by ',';
No rows affected (1.894 seconds)
0: jdbc:hive2://Master:10000> desc db1.t_test;
+-----------+------------+----------+--+
| col_name  | data_type  | comment  |
+-----------+------------+----------+--+
| id        | int        |          |
| name      | string     |          |
| age       | int        |          |
+-----------+------------+----------+--+
3 rows selected (0.697 seconds)
After the table is created, Hive creates a table directory in the warehouse: /user/hive/warehouse/db1.db/t_test
create external table t_test1(id int,name string,age int)
row format delimited
fields terminated by ','
location '/user/hive/external/t_test1';
Here location refers to a directory on HDFS. You can put files in the matching format directly into that directory, and they become visible through the Hive table.
Example:
0: jdbc:hive2://Master:10000> create external table t_test1(id int,name string,age int)
0: jdbc:hive2://Master:10000> row format delimited
0: jdbc:hive2://Master:10000> fields terminated by ','
0: jdbc:hive2://Master:10000> location '/user/hive/external/t_test1';
No rows affected (0.7 seconds)
0: jdbc:hive2://Master:10000> desc db1.t_test1;
+-----------+------------+----------+--+
| col_name  | data_type  | comment  |
+-----------+------------+----------+--+
| id        | int        |          |
| name      | string     |          |
| age       | int        |          |
+-----------+------------+----------+--+
3 rows selected (0.395 seconds)
Create a local test file user.data:
1,xiaowang,28
2,xiaoli,18
3,xiaohong,23
Put it into HDFS:
$ hdfs dfs -mkdir -p /user/hive/external/t_test1
$ hdfs dfs -put ./user.data /user/hive/external/t_test1
Now the data can be queried through the Hive table:
0: jdbc:hive2://Master:10000> select * from db1.t_test1;
+-------------+---------------+--------------+--+
| t_test1.id  | t_test1.name  | t_test1.age  |
+-------------+---------------+--------------+--+
| 1           | xiaowang      | 28           |
| 2           | xiaoli        | 18           |
| 3           | xiaohong      | 23           |
+-------------+---------------+--------------+--+
3 rows selected (8 seconds)
Note: dropping an external table does not delete the files in HDFS. That is, if the table db1.t_test1 is dropped, the file /user/hive/external/t_test1/user.data on HDFS is not deleted.
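A sketch of the contrast (both drop statements are illustrative):

-- external table: only the metadata is removed; user.data stays on HDFS
drop table db1.t_test1;
-- a managed (internal) table such as db1.t_test would lose its warehouse directory as well
-- drop table db1.t_test;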
Loading data is essentially placing data files into the table's directory, and you can do it with a Hive command:
load data [local] inpath '/data/path' [overwrite] into table t_test;
Adding local means the source file is on the local filesystem.
Import local data:
load data local inpath '/home/hadoop/user.data' into table t_test;
Example:
0: jdbc:hive2://Master:10000> load data local inpath '/home/hadoop/user.data' into table t_test;
No rows affected (2.06 seconds)
0: jdbc:hive2://Master:10000> select * from db1.t_test;
+------------+--------------+-------------+--+
| t_test.id  | t_test.name  | t_test.age  |
+------------+--------------+-------------+--+
| 1          | xiaowang     | 28          |
| 2          | xiaoli       | 18          |
| 3          | xiaohong     | 23          |
+------------+--------------+-------------+--+
Import data already in HDFS:
load data inpath '/user/hive/external/t_test1/user.data' into table t_test;
Example:
0: jdbc:hive2://Master:10000> load data inpath '/user/hive/external/t_test1/user.data' into table t_test;
No rows affected (1.399 seconds)
0: jdbc:hive2://Master:10000> select * from db1.t_test;
+------------+--------------+-------------+--+
| t_test.id  | t_test.name  | t_test.age  |
+------------+--------------+-------------+--+
| 1          | xiaowang     | 28          |
| 2          | xiaoli       | 18          |
| 3          | xiaohong     | 23          |
| 1          | xiaowang     | 28          |
| 2          | xiaoli       | 18          |
| 3          | xiaohong     | 23          |
+------------+--------------+-------------+--+
6 rows selected (0.554 seconds)
Note: importing from the local filesystem leaves the local file unchanged; importing from HDFS moves the source file into the corresponding warehouse directory.
Partitioning stores data in subdirectories, so that queries can restrict reads to a precise range of the data.
create table t_test1(id int,name string,age int,create_time bigint)
partitioned by (day string,country string)
row format delimited
fields terminated by ',';
Load data into specific partitions:
> load data [local] inpath '/data/path1' [overwrite] into table t_test1 partition(day='2019-06-04',country='China');
> load data [local] inpath '/data/path2' [overwrite] into table t_test1 partition(day='2019-06-05',country='China');
> load data [local] inpath '/data/path3' [overwrite] into table t_test1 partition(day='2019-06-04',country='England');
After loading, the resulting directory structure is:
/user/hive/warehouse/db1.db/t_test1/day=2019-06-04/country=China/...
/user/hive/warehouse/db1.db/t_test1/day=2019-06-04/country=England/...
/user/hive/warehouse/db1.db/t_test1/day=2019-06-05/country=China/...
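With this layout, a query that filters on the partition columns only scans the matching subdirectories; a sketch:

select * from t_test1
where day='2019-06-04' and country='China';
-- reads only .../day=2019-06-04/country=China/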
A basic query:
select * from t_table where a<1000 and b>0;
各種join
測試數據:
a.txt:
a,1
b,2
c,3
d,4
b.txt:
b,16
c,17
d,18
e,19
Create the tables and load the data:
create table t_a(name string,num int)
row format delimited
fields terminated by ',';

create table t_b(name string,age int)
row format delimited
fields terminated by ',';

load data local inpath '/home/hadoop/a.txt' into table t_a;
load data local inpath '/home/hadoop/b.txt' into table t_b;
The table contents:
0: jdbc:hive2://Master:10000> select * from t_a;
+-----------+----------+--+
| t_a.name  | t_a.num  |
+-----------+----------+--+
| a         | 1        |
| b         | 2        |
| c         | 3        |
| d         | 4        |
+-----------+----------+--+
4 rows selected (0.523 seconds)
0: jdbc:hive2://Master:10000> select * from t_b;
+-----------+----------+--+
| t_b.name  | t_b.age  |
+-----------+----------+--+
| b         | 16       |
| c         | 17       |
| d         | 18       |
| e         | 19       |
+-----------+----------+--+
4 rows selected (0.482 seconds)
Inner join, specifying the join condition:
select a.*,b.* from t_a a join t_b b on a.name=b.name;
Example:
0: jdbc:hive2://Master:10000> select a.*,b.*
0: jdbc:hive2://Master:10000> from
0: jdbc:hive2://Master:10000> t_a a join t_b b on a.name=b.name;
....
+---------+--------+---------+--------+--+
| a.name  | a.num  | b.name  | b.age  |
+---------+--------+---------+--------+--+
| b       | 2      | b       | 16     |
| c       | 3      | c       | 17     |
| d       | 4      | d       | 18     |
+---------+--------+---------+--------+--+
Left outer join:
select a.*,b.*
from t_a a left outer join t_b b on a.name=b.name;
Example:
0: jdbc:hive2://Master:10000> select a.*,b.*
0: jdbc:hive2://Master:10000> from
0: jdbc:hive2://Master:10000> t_a a left outer join t_b b on a.name=b.name;
...
+---------+--------+---------+--------+--+
| a.name  | a.num  | b.name  | b.age  |
+---------+--------+---------+--------+--+
| a       | 1      | NULL    | NULL   |
| b       | 2      | b       | 16     |
| c       | 3      | c       | 17     |
| d       | 4      | d       | 18     |
+---------+--------+---------+--------+--+
Right outer join:
select a.*,b.*
from t_a a right outer join t_b b on a.name=b.name;
Example:
0: jdbc:hive2://Master:10000> select a.*,b.*
0: jdbc:hive2://Master:10000> from
0: jdbc:hive2://Master:10000> t_a a right outer join t_b b on a.name=b.name;
....
+---------+--------+---------+--------+--+
| a.name  | a.num  | b.name  | b.age  |
+---------+--------+---------+--------+--+
| b       | 2      | b       | 16     |
| c       | 3      | c       | 17     |
| d       | 4      | d       | 18     |
| NULL    | NULL   | e       | 19     |
+---------+--------+---------+--------+--+
Full outer join:
select a.*,b.*
from t_a a full outer join t_b b on a.name=b.name;
Example:
0: jdbc:hive2://Master:10000> select a.*,b.*
0: jdbc:hive2://Master:10000> from
0: jdbc:hive2://Master:10000> t_a a full outer join t_b b on a.name=b.name;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+---------+--------+---------+--------+--+
| a.name  | a.num  | b.name  | b.age  |
+---------+--------+---------+--------+--+
| a       | 1      | NULL    | NULL   |
| b       | 2      | b       | 16     |
| c       | 3      | c       | 17     |
| d       | 4      | d       | 18     |
| NULL    | NULL   | e       | 19     |
+---------+--------+---------+--------+--+
Left semi join: rows in table a that also exist in table b. Note that a left semi join may only select columns from the left table.
select a.* from t_a a left semi join t_b b on a.name=b.name;
Example:
0: jdbc:hive2://Master:10000> select a.*
0: jdbc:hive2://Master:10000> from
0: jdbc:hive2://Master:10000> t_a a left semi join t_b b on a.name=b.name;
.....
+---------+--------+--+
| a.name  | a.num  |
+---------+--------+--+
| b       | 2      |
| c       | 3      |
| d       | 4      |
+---------+--------+--+
Build test data:
192.168.33.3,http://www.xxx.cn/stu,2019-08-04 15:30:20
192.168.33.3,http://www.xxx.cn/teach,2019-08-04 15:35:20
192.168.33.4,http://www.xxx.cn/stu,2019-08-04 15:30:20
192.168.33.4,http://www.xxx.cn/job,2019-08-04 16:30:20
192.168.33.5,http://www.xxx.cn/job,2019-08-04 15:40:20
192.168.33.3,http://www.xxx.cn/stu,2019-08-05 15:30:20
192.168.44.3,http://www.xxx.cn/teach,2019-08-05 15:35:20
192.168.33.44,http://www.xxx.cn/stu,2019-08-05 15:30:20
192.168.33.46,http://www.xxx.cn/job,2019-08-05 16:30:20
192.168.33.55,http://www.xxx.cn/job,2019-08-05 15:40:20
192.168.133.3,http://www.xxx.cn/register,2019-08-06 15:30:20
192.168.111.3,http://www.xxx.cn/register,2019-08-06 15:35:20
192.168.34.44,http://www.xxx.cn/pay,2019-08-06 15:30:20
192.168.33.46,http://www.xxx.cn/excersize,2019-08-06 16:30:20
192.168.33.55,http://www.xxx.cn/job,2019-08-06 15:40:20
192.168.33.46,http://www.xxx.cn/excersize,2019-08-06 16:30:20
192.168.33.25,http://www.xxx.cn/job,2019-08-06 15:40:20
192.168.33.36,http://www.xxx.cn/excersize,2019-08-06 16:30:20
192.168.33.55,http://www.xxx.cn/job,2019-08-06 15:40:20
Create a partitioned table and load the data:
create table t_pv(ip string,url string,time string)
partitioned by (dt string)
row format delimited
fields terminated by ',';

load data local inpath '/home/hadoop/pv.log.0804' into table t_pv partition(dt='2019-08-04');
load data local inpath '/home/hadoop/pv.log.0805' into table t_pv partition(dt='2019-08-05');
load data local inpath '/home/hadoop/pv.log.0806' into table t_pv partition(dt='2019-08-06');
View the data:
0: jdbc:hive2://Master:10000> select * from t_pv;
+----------------+------------------------------+----------------------+-------------+--+
| t_pv.ip        | t_pv.url                     | t_pv.time            | t_pv.dt     |
+----------------+------------------------------+----------------------+-------------+--+
| 192.168.33.3   | http://www.xxx.cn/stu        | 2019-08-04 15:30:20  | 2019-08-04  |
| 192.168.33.3   | http://www.xxx.cn/teach      | 2019-08-04 15:35:20  | 2019-08-04  |
| 192.168.33.4   | http://www.xxx.cn/stu        | 2019-08-04 15:30:20  | 2019-08-04  |
| 192.168.33.4   | http://www.xxx.cn/job        | 2019-08-04 16:30:20  | 2019-08-04  |
| 192.168.33.5   | http://www.xxx.cn/job        | 2019-08-04 15:40:20  | 2019-08-05  |
| 192.168.33.3   | http://www.xxx.cn/stu        | 2019-08-05 15:30:20  | 2019-08-05  |
| 192.168.44.3   | http://www.xxx.cn/teach      | 2019-08-05 15:35:20  | 2019-08-05  |
| 192.168.33.44  | http://www.xxx.cn/stu        | 2019-08-05 15:30:20  | 2019-08-05  |
| 192.168.33.46  | http://www.xxx.cn/job        | 2019-08-05 16:30:20  | 2019-08-05  |
| 192.168.33.55  | http://www.xxx.cn/job        | 2019-08-05 15:40:20  | 2019-08-06  |
| 192.168.133.3  | http://www.xxx.cn/register   | 2019-08-06 15:30:20  | 2019-08-06  |
| 192.168.111.3  | http://www.xxx.cn/register   | 2019-08-06 15:35:20  | 2019-08-06  |
| 192.168.34.44  | http://www.xxx.cn/pay        | 2019-08-06 15:30:20  | 2019-08-06  |
| 192.168.33.46  | http://www.xxx.cn/excersize  | 2019-08-06 16:30:20  | 2019-08-06  |
| 192.168.33.55  | http://www.xxx.cn/job        | 2019-08-06 15:40:20  | 2019-08-06  |
| 192.168.33.46  | http://www.xxx.cn/excersize  | 2019-08-06 16:30:20  | 2019-08-06  |
| 192.168.33.25  | http://www.xxx.cn/job        | 2019-08-06 15:40:20  | 2019-08-06  |
| 192.168.33.36  | http://www.xxx.cn/excersize  | 2019-08-06 16:30:20  | 2019-08-06  |
| 192.168.33.55  | http://www.xxx.cn/job        | 2019-08-06 15:40:20  | 2019-08-06  |
+----------------+------------------------------+----------------------+-------------+--+
View the table's partitions:
show partitions t_pv;
0: jdbc:hive2://Master:10000> show partitions t_pv;
+----------------+--+
| partition      |
+----------------+--+
| dt=2019-08-04  |
| dt=2019-08-05  |
| dt=2019-08-06  |
+----------------+--+
3 rows selected (0.575 seconds)
Use a function on a column in the select list, e.g. upper():
select ip,upper(url),time
from t_pv;
0: jdbc:hive2://Master:10000> select ip,upper(url),time
0: jdbc:hive2://Master:10000> from t_pv;
+----------------+------------------------------+----------------------+--+
| ip             | _c1                          | time                 |
+----------------+------------------------------+----------------------+--+
| 192.168.33.3   | HTTP://WWW.XXX.CN/STU        | 2019-08-04 15:30:20  |
| 192.168.33.3   | HTTP://WWW.XXX.CN/TEACH      | 2019-08-04 15:35:20  |
| 192.168.33.4   | HTTP://WWW.XXX.CN/STU        | 2019-08-04 15:30:20  |
| 192.168.33.4   | HTTP://WWW.XXX.CN/JOB        | 2019-08-04 16:30:20  |
| 192.168.33.5   | HTTP://WWW.XXX.CN/JOB        | 2019-08-04 15:40:20  |
| 192.168.33.3   | HTTP://WWW.XXX.CN/STU        | 2019-08-05 15:30:20  |
| 192.168.44.3   | HTTP://WWW.XXX.CN/TEACH      | 2019-08-05 15:35:20  |
| 192.168.33.44  | HTTP://WWW.XXX.CN/STU        | 2019-08-05 15:30:20  |
| 192.168.33.46  | HTTP://WWW.XXX.CN/JOB        | 2019-08-05 16:30:20  |
| 192.168.33.55  | HTTP://WWW.XXX.CN/JOB        | 2019-08-05 15:40:20  |
| 192.168.133.3  | HTTP://WWW.XXX.CN/REGISTER   | 2019-08-06 15:30:20  |
| 192.168.111.3  | HTTP://WWW.XXX.CN/REGISTER   | 2019-08-06 15:35:20  |
| 192.168.34.44  | HTTP://WWW.XXX.CN/PAY        | 2019-08-06 15:30:20  |
| 192.168.33.46  | HTTP://WWW.XXX.CN/EXCERSIZE  | 2019-08-06 16:30:20  |
| 192.168.33.55  | HTTP://WWW.XXX.CN/JOB        | 2019-08-06 15:40:20  |
| 192.168.33.46  | HTTP://WWW.XXX.CN/EXCERSIZE  | 2019-08-06 16:30:20  |
| 192.168.33.25  | HTTP://WWW.XXX.CN/JOB        | 2019-08-06 15:40:20  |
| 192.168.33.36  | HTTP://WWW.XXX.CN/EXCERSIZE  | 2019-08-06 16:30:20  |
| 192.168.33.55  | HTTP://WWW.XXX.CN/JOB        | 2019-08-06 15:40:20  |
+----------------+------------------------------+----------------------+--+
select url,count(1)   -- runs the aggregate over each group
from t_pv
group by url;
0: jdbc:hive2://Master:10000> select url,count(1)
0: jdbc:hive2://Master:10000> from t_pv
0: jdbc:hive2://Master:10000> group by url;
·····
+------------------------------+------+--+
| url                          | _c1  |
+------------------------------+------+--+
| http://www.xxx.cn/excersize  | 3    |
| http://www.xxx.cn/job        | 7    |
| http://www.xxx.cn/pay        | 1    |
| http://www.xxx.cn/register   | 2    |
| http://www.xxx.cn/stu        | 4    |
| http://www.xxx.cn/teach      | 2    |
+------------------------------+------+--+
You can give the _c1 column a name:
select url,count(1) as count
from t_pv
group by url;
select url,max(ip) from t_pv group by url;
0: jdbc:hive2://Master:10000> select url,max(ip)
0: jdbc:hive2://Master:10000> from t_pv
0: jdbc:hive2://Master:10000> group by url;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+------------------------------+----------------+--+
| url                          | _c1            |
+------------------------------+----------------+--+
| http://www.xxx.cn/excersize  | 192.168.33.46  |
| http://www.xxx.cn/job        | 192.168.33.55  |
| http://www.xxx.cn/pay        | 192.168.34.44  |
| http://www.xxx.cn/register   | 192.168.133.3  |
| http://www.xxx.cn/stu        | 192.168.33.44  |
| http://www.xxx.cn/teach      | 192.168.44.3   |
+------------------------------+----------------+--+
select ip,url,max(time) from t_pv group by ip,url;
0: jdbc:hive2://Master:10000> select ip,url,max(time)
0: jdbc:hive2://Master:10000> from t_pv
0: jdbc:hive2://Master:10000> group by ip,url;
.....
+----------------+------------------------------+----------------------+--+
| ip             | url                          | _c2                  |
+----------------+------------------------------+----------------------+--+
| 192.168.111.3  | http://www.xxx.cn/register   | 2019-08-06 15:35:20  |
| 192.168.133.3  | http://www.xxx.cn/register   | 2019-08-06 15:30:20  |
| 192.168.33.25  | http://www.xxx.cn/job        | 2019-08-06 15:40:20  |
| 192.168.33.3   | http://www.xxx.cn/stu        | 2019-08-05 15:30:20  |
| 192.168.33.3   | http://www.xxx.cn/teach      | 2019-08-04 15:35:20  |
| 192.168.33.36  | http://www.xxx.cn/excersize  | 2019-08-06 16:30:20  |
| 192.168.33.4   | http://www.xxx.cn/job        | 2019-08-04 16:30:20  |
| 192.168.33.4   | http://www.xxx.cn/stu        | 2019-08-04 15:30:20  |
| 192.168.33.44  | http://www.xxx.cn/stu        | 2019-08-05 15:30:20  |
| 192.168.33.46  | http://www.xxx.cn/excersize  | 2019-08-06 16:30:20  |
| 192.168.33.46  | http://www.xxx.cn/job        | 2019-08-05 16:30:20  |
| 192.168.33.5   | http://www.xxx.cn/job        | 2019-08-04 15:40:20  |
| 192.168.33.55  | http://www.xxx.cn/job        | 2019-08-06 15:40:20  |
| 192.168.34.44  | http://www.xxx.cn/pay        | 2019-08-06 15:30:20  |
| 192.168.44.3   | http://www.xxx.cn/teach      | 2019-08-05 15:35:20  |
+----------------+------------------------------+----------------------+--+
Several equivalent ways to write the same grouped query with where and having:
select dt,'http://www.xxx.cn/job',count(1),max(ip)
from t_pv
where url='http://www.xxx.cn/job'
group by dt having dt>'2019-08-04';

select dt,max(url),count(1),max(ip)
from t_pv
where url='http://www.xxx.cn/job'
group by dt having dt>'2019-08-04';

select dt,url,count(1),max(ip)
from t_pv
where url='http://www.xxx.cn/job'
group by dt,url having dt>'2019-08-04';

select dt,url,count(1),max(ip)
from t_pv
where url='http://www.xxx.cn/job' and dt>'2019-08-04'
group by dt,url;
Count records and the largest ip per day and url after 2019-08-04:
select dt,url,count(1),max(ip)
from t_pv
where dt>'2019-08-04'
group by dt,url;

Keep only groups with more than two records, filtering on the aliased count in a having clause:
select dt,url,count(1) as cnts,max(ip)
from t_pv
where dt>'2019-08-04'
group by dt,url having cnts>2;

Or, equivalently, with a subquery:
select dt,url,cnts,max_ip
from
(select dt,url,count(1) as cnts,max(ip) as max_ip
 from t_pv
 where dt>'2019-08-04'
 group by dt,url) tmp
where cnts>2;
TINYINT (1-byte signed integer, from -128 to 127)
SMALLINT (2-byte signed integer, from -32,768 to 32,767)
INT/INTEGER (4-byte signed integer, from -2,147,483,648 to 2,147,483,647)
BIGINT (8-byte signed integer, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807)
FLOAT (4-byte single precision floating point number)
DOUBLE (8-byte double precision floating point number)
Example:
create table t_test(a string,b int,c bigint,d float,e double,f tinyint,g smallint)
TIMESTAMP (Note: Only available starting with Hive 0.8.0)
DATE (Note: Only available starting with Hive 0.12.0)
Example: suppose we have the following data file:
1,zhangsan,1985-06-30
2,lisi,1986-07-10
3,wangwu,1985-08-09
Then we can create a table to map the data:
create table t_customer(id int,name string,birthday date)
row format delimited fields terminated by ',';
Then load the data:
load data local inpath '/root/customer.dat' into table t_customer;
After that, queries return birthday as a proper DATE value.
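For example (a small sketch; the string literal is compared as a date):

select name, year(birthday) as birth_year
from t_customer
where birthday >= '1986-01-01';
-- returns lisi, 1986 for the sample data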
STRING
VARCHAR (Note: Only available starting with Hive 0.12.0)
CHAR (Note: Only available starting with Hive 0.13.0)
BOOLEAN
BINARY (Note: Only available starting with Hive 0.8.0)
Given the following data:
玩具總動員4,湯姆·漢克斯:蒂姆·艾倫:安妮·波茨,2019-06-21
流浪地球,屈楚蕭:吳京:李光潔:吳孟達,2019-02-05
千與千尋,柊瑠美:入野自由:夏木真理:菅原文太,2019-06-21
戰狼2,吳京:弗蘭克·格里羅:吳剛:張翰:盧靖姍,2017-08-16
Create the table and load the data:
-- create the table mapping
create table t_movie(movie_name string,actors array<string>,first_show date)
row format delimited fields terminated by ','
collection items terminated by ':';
-- load the data
load data local inpath '/home/hadoop/actor.dat' into table t_movie;
0: jdbc:hive2://Master:10000> select * from t_movie;
+---------------------+-----------------------------------+---------------------+--+
| t_movie.movie_name  | t_movie.actors                    | t_movie.first_show  |
+---------------------+-----------------------------------+---------------------+--+
| 玩具總動員4          | ["湯姆·漢克斯","蒂姆·艾倫","安妮·波茨"]           | 2019-06-21          |
| 流浪地球             | ["屈楚蕭","吳京","李光潔","吳孟達"]              | 2019-02-05          |
| 千與千尋             | ["柊瑠美","入野自由","夏木真理","菅原文太"]        | 2019-06-21          |
| 戰狼2               | ["吳京","弗蘭克·格里羅","吳剛","張翰","盧靖姍"]    | 2017-08-16          |
+---------------------+-----------------------------------+---------------------+--+
select movie_name,actors[0],first_show from t_movie;
0: jdbc:hive2://Master:10000> select movie_name,actors[0],first_show from t_movie;
+-------------+----------+-------------+--+
| movie_name  | _c1      | first_show  |
+-------------+----------+-------------+--+
| 玩具總動員4  | 湯姆·漢克斯  | 2019-06-21  |
| 流浪地球     | 屈楚蕭     | 2019-02-05  |
| 千與千尋     | 柊瑠美     | 2019-06-21  |
| 戰狼2       | 吳京      | 2017-08-16  |
+-------------+----------+-------------+--+
select movie_name,actors,first_show from t_movie where array_contains(actors,'吳京');
0: jdbc:hive2://Master:10000> select movie_name,actors,first_show
0: jdbc:hive2://Master:10000> from t_movie where array_contains(actors,'吳京');
+-------------+-----------------------------------+-------------+--+
| movie_name  | actors                            | first_show  |
+-------------+-----------------------------------+-------------+--+
| 流浪地球     | ["屈楚蕭","吳京","李光潔","吳孟達"]          | 2019-02-05  |
| 戰狼2       | ["吳京","弗蘭克·格里羅","吳剛","張翰","盧靖姍"]  | 2017-08-16  |
+-------------+-----------------------------------+-------------+--+
select movie_name,size(actors) as actor_number,first_show
from t_movie;
0: jdbc:hive2://Master:10000> select movie_name,size(actors) as actor_number,first_show
0: jdbc:hive2://Master:10000> from t_movie;
+-------------+---------------+-------------+--+
| movie_name  | actor_number  | first_show  |
+-------------+---------------+-------------+--+
| 玩具總動員4  | 3             | 2019-06-21  |
| 流浪地球     | 4             | 2019-02-05  |
| 千與千尋     | 4             | 2019-06-21  |
| 戰狼2       | 5             | 2017-08-16  |
+-------------+---------------+-------------+--+
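An array can also be flattened to one row per element with explode() in a lateral view; a minimal sketch against the same table:

select movie_name, actor
from t_movie lateral view explode(actors) tmp as actor;
-- one (movie_name, actor) row per array element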
Data for a map<string,string> field:
1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28
2,lisi,father:mayun#mother:huangyi#brother:guanyu,22
3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29
4,mayun,father:mayongzhen#mother:angelababy,26
Create the table and load the data:
-- create a table mapping the data above
create table t_family(id int,name string,family_members map<string,string>,age int)
row format delimited fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';
-- load the data
load data local inpath '/root/hivetest/fm.dat' into table t_family;
0: jdbc:hive2://Master:10000> select * from t_family;
+--------------+----------------+----------------------------------------------------------------+---------------+--+
| t_family.id  | t_family.name  | t_family.family_members                                        | t_family.age  |
+--------------+----------------+----------------------------------------------------------------+---------------+--+
| 1            | zhangsan       | {"father":"xiaoming","mother":"xiaohuang","brother":"xiaoxu"}  | 28            |
| 2            | lisi           | {"father":"mayun","mother":"huangyi","brother":"guanyu"}       | 22            |
| 3            | wangwu         | {"father":"wangjianlin","mother":"ruhua","sister":"jingtian"}  | 29            |
| 4            | mayun          | {"father":"mayongzhen","mother":"angelababy"}                  | 26            |
+--------------+----------------+----------------------------------------------------------------+---------------+--+
Select each person's father and sister:
select id,name,family_members["father"] as father,family_members["sister"] as sister,age from t_family;
List each person's relation types with map_keys():
select id,name,map_keys(family_members) as relations,age from t_family;
List each person's relatives' names with map_values():
select id,name,map_values(family_members) as relations,age from t_family;
Count each person's relatives with size():
select id,name,size(family_members) as relations,age from t_family;
Find everyone who has a brother, and the brother's name:
-- option 1: in a single statement
select id,name,age,family_members['brother']
from t_family
where array_contains(map_keys(family_members),'brother');
-- option 2: with a subquery
select id,name,age,family_members['brother']
from
(select id,name,age,map_keys(family_members) as relations,family_members
 from t_family) tmp
where array_contains(relations,'brother');
Data for a struct field:
1,zhangsan,18:male:深圳
2,lisi,28:female:北京
3,wangwu,38:male:廣州
4,laowang,26:female:上海
5,yangyang,35:male:杭州
Create the table and load the data:
-- create a table mapping the data above
drop table if exists t_user;
create table t_user(id int,name string,info struct<age:int,sex:string,addr:string>)
row format delimited fields terminated by ','
collection items terminated by ':';
-- load the data
load data local inpath '/home/hadoop/user.dat' into table t_user;
0: jdbc:hive2://Master:10000> select * from t_user;
+------------+--------------+-----------------------------------------+--+
| t_user.id  | t_user.name  | t_user.info                             |
+------------+--------------+-----------------------------------------+--+
| 1          | zhangsan     | {"age":18,"sex":"male","addr":"深圳"}    |
| 2          | lisi         | {"age":28,"sex":"female","addr":"北京"}  |
| 3          | wangwu       | {"age":38,"sex":"male","addr":"廣州"}    |
| 4          | laowang      | {"age":26,"sex":"female","addr":"上海"}  |
| 5          | yangyang     | {"age":35,"sex":"male","addr":"杭州"}    |
+------------+--------------+-----------------------------------------+--+
select id,name,info.addr from t_user;
0: jdbc:hive2://Master:10000> select id,name,info.addr
0: jdbc:hive2://Master:10000> from t_user;
+-----+-----------+-------+--+
| id  | name      | addr  |
+-----+-----------+-------+--+
| 1   | zhangsan  | 深圳   |
| 2   | lisi      | 北京   |
| 3   | wangwu    | 廣州   |
| 4   | laowang   | 上海   |
| 5   | yangyang  | 杭州   |
+-----+-----------+-------+--+
Test a function:
select substr("abcdef",1,3);
0: jdbc:hive2://Master:10000> select substr("abcdef",1,3);
+------+--+
| _c0  |
+------+--+
| abc  |
+------+--+
from_unixtime(1496483430,'yyyy-MM-dd HH:mm:ss')
Returns: '2017-06-03 17:50:30' (with a UTC+8 system timezone)
select cast("8" as int); select cast("2019-2-3" as data)
substr("abcde",1,3) --> 'abc' concat('abc','def') --> 'abcdef'
0: jdbc:hive2://Master:10000> select substr("abcde",1,3);
+------+--+
| _c0  |
+------+--+
| abc  |
+------+--+
1 row selected (0.152 seconds)
0: jdbc:hive2://Master:10000> select concat('abc','def');
+---------+--+
| _c0     |
+---------+--+
| abcdef  |
+---------+--+
1 row selected (0.165 seconds)
get_json_object('{"key1":3333,"key2":4444}','$.key1')
Returns: 3333
json_tuple('{"key1":3333,"key2":4444}','key1','key2') as (key1,key2)
Returns: 3333, 4444
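get_json_object is a scalar function, so it can be tried directly in a bare select (a sketch):

select get_json_object('{"key1":3333,"key2":4444}','$.key1');
-- returns 3333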
parse_url_tuple('http://www.xxxx.cn/bigdata?userid=8888','HOST','PATH','QUERY','QUERY:userid')
Returns: www.xxxx.cn  /bigdata  userid=8888  8888
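parse_url_tuple is a table-generating function, so it is normally applied to a column; a sketch assuming the t_pv table from the earlier examples:

select parse_url_tuple(url,'HOST','PATH','QUERY') as (host, path, query_string)
from t_pv;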
The test data is as follows:
1,zhangsan:18-1999063117:30:00-beijing
2,lisi:28-1989063117:30:00-shanghai
3,wangwu:20-1997063117:30:00-tieling
Create a table and map the data:
create table t_user_info(info string) row format delimited;
Load the data:
load data local inpath '/root/udftest.data' into table t_user_info;
Requirement: use the table above to generate a new table as follows:
t_user: uid, uname, age, birthday, address
Idea: define a custom function parse_user_info() that takes one line of the data above and returns the field at a given position after splitting.
The requirement can then be met with the following SQL:
create table t_user as
select
  parse_user_info(info,0) as uid,
  parse_user_info(info,1) as uname,
  parse_user_info(info,2) as age,
  parse_user_info(info,3) as birthday_date,
  parse_user_info(info,4) as birthday_time,
  parse_user_info(info,5) as address
from t_user_info;
The key is implementing the custom parse_user_info() function.
1. Write a Java class that implements the required functionality:
package com.doit.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

public class UserInfoParser extends UDF {
    // input line format: 1,zhangsan:18-1999063117:30:00-beijing
    public String evaluate(String line, int index) {
        // normalize all three delimiters to \001, then split into
        // [uid, uname, age, birthdate+hour, minute, second, address]
        String newLine = line.replaceAll(",", "\001").replaceAll(":", "\001").replaceAll("-", "\001");
        String[] split = newLine.split("\001");
        // reassemble the fields: uid, uname, age, birthday date, birthday time, address
        StringBuilder sb = new StringBuilder();
        sb.append(split[0]).append("\t")
          .append(split[1]).append("\t")
          .append(split[2]).append("\t")
          .append(split[3].substring(0, 8)).append("\t")
          .append(split[3].substring(8, 10)).append(split[4]).append(split[5]).append("\t")
          .append(split[6]);
        // return the field at the requested position
        return sb.toString().split("\t")[index];
    }
}
2. Package the class into a jar: d:/up.jar
3. Upload the jar to the machine where Hive runs: /root/up.jar
4. Add the jar at the Hive prompt:
hive> add jar /root/up.jar;
5. Create a Hive function name bound to the Java class in the jar:
hive> create temporary function parse_user_info as 'com.doit.hive.udf.UserInfoParser';
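A quick smoke test that the function resolves (the expected values follow from the sample data above):

hive> select parse_user_info(info,1) as uname from t_user_info;
-- zhangsan / lisi / wangwu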