Preparing the dept table:
-- create the dept table
CREATE TABLE dept(
    deptno int,
    dname string,
    loc string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS textfile;
Preparing the data file:
vi dept.txt
10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON
Preparing the emp table:
CREATE TABLE emp(
    empno int,
    ename string,
    job string,
    mgr int,
    hiredate string,
    sal int,
    comm int,
    deptno int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS textfile;
Preparing the emp data:
vi emp.txt
7369,SMITH,CLERK,7902,1980-12-17,800,null,20
7499,ALLEN,SALESMAN,7698,1981-02-20,1600,300,30
7521,WARD,SALESMAN,7698,1981-02-22,1250,500,30
7566,JONES,MANAGER,7839,1981-04-02,2975,null,20
7654,MARTIN,SALESMAN,7698,1981-09-28,1250,1400,30
7698,BLAKE,MANAGER,7839,1981-05-01,2850,null,30
7782,CLARK,MANAGER,7839,1981-06-09,2450,null,10
7788,SCOTT,ANALYST,7566,1987-04-19,3000,null,20
7839,KING,PRESIDENT,null,1981-11-17,5000,null,10
7844,TURNER,SALESMAN,7698,1981-09-08,1500,0,30
7876,ADAMS,CLERK,7788,1987-05-23,1100,null,20
7900,JAMES,CLERK,7698,1981-12-03,950,null,30
7902,FORD,ANALYST,7566,1981-12-02,3000,null,20
7934,MILLER,CLERK,7782,1982-01-23,1300,null,10
Load the data files into the tables:
load data local inpath '/home/hadoop/tmp/dept.txt' overwrite into table dept;
load data local inpath '/home/hadoop/tmp/emp.txt' overwrite into table emp;
Query:
select d.dname,d.loc,e.empno,e.ename,e.hiredate
from dept d join emp e on e.deptno = d.deptno;
* Note that this runs as a MapReduce job.
2. Hive partitions
Why Hive partitions
* To avoid full-table scans, Hive uses partitioning to divide the data; skipping irrelevant data improves query efficiency.
Difference between Hive and MySQL partitioning
* MySQL partitions on a column inside the table schema, while Hive partitions on a pseudo-column outside it.
Hive partitioning details
* The partition column is a pseudo-column, but it can still be used in queries.
* Partition columns are not case-sensitive.
* A partition can itself be partitioned (a partition of a partition), so a table may have multiple partition levels.
Choosing a partition key
* It depends on the business: any attribute that cleanly separates the data works, e.g. year, month, day, region, gender.
Partition keyword
* partitioned by(column)
What a partition really is
* A directory created under the table directory (or under a parent partition directory), whose name is column=value.
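Since a partition is just a column=value directory, the on-disk layout can be sketched in a few lines of Python. The `partition_path` helper and the warehouse paths are illustrative, not taken from this cluster:

```python
import os

def partition_path(table_dir, partition_spec):
    """Build the on-disk directory for a partition: one 'col=value'
    subdirectory per partition column, in declaration order."""
    parts = [f"{col}={val}" for col, val in partition_spec]
    return os.path.join(table_dir, *parts)

# Single-level partition, as in table u1:
print(partition_path("/user/hive/warehouse/u1", [("dt", "2018-10-14")]))
# -> /user/hive/warehouse/u1/dt=2018-10-14

# Two-level partition, as in table u2 further below:
print(partition_path("/user/hive/warehouse/u2", [("month", 9), ("day", 14)]))
# -> /user/hive/warehouse/u2/month=9/day=14
```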
Create a partitioned table:
create table if not exists u1(
    id int,
    name string,
    age int
) partitioned by(dt string)
row format delimited fields terminated by ' '
stored as textfile;
Prepare the data:
[hadoop@master tmp]$ more u1.txt
1 xm1 16
2 xm2 18
3 xm3 22
Load the data:
load data local inpath '/home/hadoop/tmp/u1.txt' into table u1 partition(dt="2018-10-14");
Query:
hive> select * from u1;
OK
1 xm1 16 2018-10-14
2 xm2 18 2018-10-14
3 xm3 22 2018-10-14
Time taken: 5.919 seconds, Fetched: 3 row(s)
Query a single partition:
hive> select * from u1 where dt='2018-10-15';
OK
1 xm1 16 2018-10-15
2 xm2 18 2018-10-15
3 xm3 22 2018-10-15
Time taken: 0.413 seconds, Fetched: 3 row(s)
Two-level partitions in Hive
Create table u2:
create table if not exists u2(
    id int,
    name string,
    age int
) partitioned by(month int,day int)
row format delimited fields terminated by ' '
stored as textfile;
Load data:
load data local inpath '/home/hadoop/tmp/u2.txt' into table u2 partition(month=9,day=14);
Query:
hive> select * from u2;
OK
1 xm1 16 9 14
2 xm2 18 9 14
Time taken: 0.303 seconds, Fetched: 2 row(s)
Modifying partitions:
Show partitions:
hive> show partitions u1;
OK
dt=2018-10-14
dt=2018-10-15
Add a partition:
hive> alter table u1 add partition(dt="2018-10-16");
OK
Show the newly added partition:
hive> show partitions u1;
OK
dt=2018-10-14
dt=2018-10-15
dt=2018-10-16
Time taken: 0.171 seconds, Fetched: 3 row(s)
Drop a partition:
hive> alter table u1 drop partition(dt="2018-10-15");
Dropped the partition dt=2018-10-15
OK
Time taken: 0.576 seconds
hive> select * from u1;
OK
1 xm1 16 2018-10-14
2 xm2 18 2018-10-14
3 xm3 22 2018-10-14
Time taken: 0.321 seconds, Fetched: 3 row(s)
3. Hive dynamic partitions
The hive-site.xml configuration file provides these parameters:
hive.exec.dynamic.partition=true;                 -- whether dynamic partitioning is allowed
hive.exec.dynamic.partition.mode=strict/nonstrict;  -- dynamic partition mode
    strict: at least one static partition column (with a fixed value) is required
    nonstrict: all partition columns may be dynamic
hive.exec.max.dynamic.partitions=1000;            -- maximum number of dynamic partitions overall
hive.exec.max.dynamic.partitions.pernode=100;     -- maximum number of partitions per node
A dynamic partition table is created with the same statement as a static one; the difference is in loading the data: a static partition table can be loaded from a local file, while a dynamic partition table must be populated with a from … insert into statement.
create table if not exists u3(
    id int,
    name string,
    age int
) partitioned by(month int,day int)
row format delimited fields terminated by ' '
stored as textfile;
Load data from u2 into u3:
from u2 insert into table u3 partition(month,day) select id,name,age,month,day;
FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
Fix:
Dynamic partition inserts require hive.exec.dynamic.partition.mode=nonstrict:
hive> set hive.exec.dynamic.partition.mode;
hive.exec.dynamic.partition.mode=strict
hive> set hive.exec.dynamic.partition.mode=nonstrict;
Then rerun the insert and it succeeds.
Query:
hive> select * from u3;
OK
1 xm1 16 9 14
2 xm2 18 9 14
Time taken: 0.451 seconds, Fetched: 2 row(s)
4. Hive bucketing
Purpose of bucketing
* Divides data at a finer granularity; makes sampling queries efficient; can improve query performance.
* Note: for such queries, bucketing offers higher efficiency than partitioning alone.
Bucketing principle and keyword
* The bucket column is a regular table column. By default Hive hashes the bucket column value and takes it modulo the total number of buckets; the result decides which bucket a row lands in. Every bucket holds data, but the row counts per bucket are not necessarily equal.
bucket
clustered by(id) into 4 buckets
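The hash-modulo rule can be sketched in Python. For an int column Hive's hash is the value itself, so with clustered by(id) into 4 buckets a row simply lands in bucket id % 4. The `bucket_of` helper is illustrative; the ids 1 through 9 are the ones used in the u4 example below:

```python
def bucket_of(key_hash, num_buckets):
    """Hive picks the bucket as (hash & Integer.MAX_VALUE) % num_buckets;
    for an int column the hash is simply the value itself."""
    return (key_hash & 0x7FFFFFFF) % num_buckets

ids = [1, 2, 3, 4, 5, 6, 7, 8, 9]
buckets = {}
for i in ids:
    buckets.setdefault(bucket_of(i, 4), []).append(i)

for b in sorted(buckets):
    print(f"bucket {b}: {buckets[b]}")
# bucket 0: [4, 8]
# bucket 1: [1, 5, 9]
# bucket 2: [2, 6]
# bucket 3: [3, 7]
```

This matches the u4 output further below, where one bucket file holds ids 8 and 4, another holds 9, 5, 1, and so on.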
What a bucket really is
* A file created under the table directory (or partition directory).
Bucketing example
* four buckets
create table if not exists u4(
    id int,
    name string,
    age int
) partitioned by(month int,day int)
clustered by(id) into 4 buckets
row format delimited fields terminated by ' '
stored as textfile;
Data cannot be loaded into a bucketed table with load: it does not report an error, but the data will not actually be bucketed.
To insert data into a bucketed table, set hive.enforce.bucketing=true;
First add the data to table u2:
1 xm1 16
2 xm2 18
3 xm3 22
4 xh4 20
5 xh5 22
6 xh6 23
7 xh7 25
8 xh8 28
9 xh9 32
load data local inpath '/home/hadoop/tmp/u2.txt' into table u2 partition(month=9,day=14);
Insert into the bucketed table:
from u2 insert into table u4 partition(month=9,day=14) select id,name,age where month = 9 and day = 14;
2019-03-31 15:43:26,755 Stage-1 map = 0%, reduce = 0%
2019-03-31 15:43:34,241 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.85 sec
2019-03-31 15:43:41,681 Stage-1 map = 100%, reduce = 25%, Cumulative CPU 1.95 sec
2019-03-31 15:43:45,855 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 3.21 sec
2019-03-31 15:43:47,927 Stage-1 map = 100%, reduce = 75%, Cumulative CPU 4.35 sec
2019-03-31 15:43:48,959 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.35 sec
MapReduce Total cumulative CPU time: 5 seconds 350 msec
Ended Job = job_1554061731326_0001
Loading data to table db_hive.u4 partition (month=9, day=14)
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 4 Cumulative CPU: 5.35 sec HDFS Read: 20301 HDFS Write: 405 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 350 msec
The job log shows Map: 1 Reduce: 4, one reducer per bucket.
Sampling a bucketed table: tablesample(bucket x out of y on id)
* x: the bucket to start from.
* y: the sampling denominator, generally a multiple or factor of the total bucket count; the query reads (total buckets)/y buckets.
* x must not be greater than y.
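For the common case where y divides the bucket count, the buckets that get read can be sketched as follows; the `sampled_buckets` helper is illustrative. (When y exceeds the bucket count, as in bucket 1 out of 8 below, Hive instead reads a fraction of a single bucket.)

```python
def sampled_buckets(x, y, num_buckets):
    """Buckets read by TABLESAMPLE(BUCKET x OUT OF y ON col) when y divides
    the bucket count: start at bucket x-1 and take every y-th bucket."""
    assert 1 <= x <= y <= num_buckets and num_buckets % y == 0
    return list(range(x - 1, num_buckets, y))

print(sampled_buckets(1, 4, 4))  # [0]    -> the bucket holding ids 8 and 4
print(sampled_buckets(2, 4, 4))  # [1]    -> ids 9, 5, 1
print(sampled_buckets(1, 2, 4))  # [0, 2] -> ids 8, 4, 6, 2
```

These selections line up with the query results shown below.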
hive> select * from u4;
OK
8 xh8 28 9 14
4 xh4 20 9 14
9 xh9 32 9 14
5 xh5 22 9 14
1 xm1 16 9 14
6 xh6 23 9 14
2 xm2 18 9 14
7 xh7 25 9 14
3 xm3 22 9 14
Time taken: 0.148 seconds, Fetched: 9 row(s)
hive> select * from u4 tablesample(bucket 1 out of 4 on id);
OK
8 xh8 28 9 14
4 xh4 20 9 14
Time taken: 0.149 seconds, Fetched: 2 row(s)
hive> select * from u4 tablesample(bucket 2 out of 4 on id);
OK
9 xh9 32 9 14
5 xh5 22 9 14
1 xm1 16 9 14
Time taken: 0.069 seconds, Fetched: 3 row(s)
hive> select * from u4 tablesample(bucket 1 out of 2 on id);
OK
8 xh8 28 9 14
4 xh4 20 9 14
6 xh6 23 9 14
2 xm2 18 9 14
Time taken: 0.089 seconds, Fetched: 4 row(s)
hive> select * from u4 tablesample(bucket 1 out of 8 on id) where age > 22;
OK
8 xh8 28 9 14
Time taken: 0.075 seconds, Fetched: 1 row(s)
Random sampling:
select * from u4 order by rand() limit 3;
OK
1 xm1 16 9 14
3 xm3 22 9 14
6 xh6 23 9 14
Time taken: 20.724 seconds, Fetched: 3 row(s) -- runs as a MapReduce job
hive> select * from u4 tablesample(3 rows);
OK
8 xh8 28 9 14
4 xh4 20 9 14
9 xh9 32 9 14
Time taken: 0.073 seconds, Fetched: 3 row(s)
hive> select * from u4 tablesample(30 percent);
OK
8 xh8 28 9 14
4 xh4 20 9 14
9 xh9 32 9 14
Time taken: 0.058 seconds, Fetched: 3 row(s)
hive> select * from u4 tablesample(3G);
OK
8 xh8 28 9 14
4 xh4 20 9 14
9 xh9 32 9 14
5 xh5 22 9 14
1 xm1 16 9 14
6 xh6 23 9 14
2 xm2 18 9 14
7 xh7 25 9 14
3 xm3 22 9 14
Time taken: 0.069 seconds, Fetched: 9 row(s)
hive> select * from u4 tablesample(3K);
OK
8 xh8 28 9 14
4 xh4 20 9 14
9 xh9 32 9 14
5 xh5 22 9 14
1 xm1 16 9 14
6 xh6 23 9 14
2 xm2 18 9 14
7 xh7 25 9 14
3 xm3 22 9 14
Time taken: 0.058 seconds, Fetched: 9 row(s)
* Partitioning vs. bucketing
* Partitioning uses a column outside the table schema; bucketing uses a column inside it.
* Partitioned tables can be loaded with load; bucketed tables must be populated with insert into.
* Partitioning is common; bucketing is used less often.
Hive data import
* load from the local filesystem
* load from HDFS
* insert into
* location: point the table at an existing directory
* like: clone another table's definition
* CTAS (create table as select)
* manually copy data files into the table directory
Hive data export
* insert into another table
* insert overwrite local directory: export to a local directory
* insert overwrite directory: export to an HDFS directory
Export to a file:
hive -S -e "use gp1801;select * from u2" > /home/out/02/result