Hive Operations on Hadoop

Preparing the dept table:

-- Create the dept table
CREATE TABLE dept(
  deptno int,
  dname string,
  loc string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS textfile;

Preparing the data file:

vi dept.txt
10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON

 

Preparing the emp table:

CREATE TABLE emp(
  empno int,
  ename string,
  job string,
  mgr int,
  hiredate string,
  sal int,
  comm int,
  deptno int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS textfile;

 

Preparing the emp data:

vi emp.txt
7369,SMITH,CLERK,7902,1980-12-17,800,null,20
7499,ALLEN,SALESMAN,7698,1981-02-20,1600,300,30
7521,WARD,SALESMAN,7698,1981-02-22,1250,500,30
7566,JONES,MANAGER,7839,1981-04-02,2975,null,20
7654,MARTIN,SALESMAN,7698,1981-09-28,1250,1400,30
7698,BLAKE,MANAGER,7839,1981-05-01,2850,null,30
7782,CLARK,MANAGER,7839,1981-06-09,2450,null,10
7788,SCOTT,ANALYST,7566,1987-04-19,3000,null,20
7839,KING,PRESIDENT,null,1981-11-17,5000,null,10
7844,TURNER,SALESMAN,7698,1981-09-08,1500,0,30
7876,ADAMS,CLERK,7788,1987-05-23,1100,null,20
7900,JAMES,CLERK,7698,1981-12-03,950,null,30
7902,FORD,ANALYST,7566,1981-12-02,3000,null,20
7934,MILLER,CLERK,7782,1982-01-23,1300,null,10

 

Load the data files into the tables:

load data local inpath '/home/hadoop/tmp/dept.txt' overwrite into table dept;
load data local inpath '/home/hadoop/tmp/emp.txt' overwrite into table emp;

 

Query:

select d.dname, d.loc, e.empno, e.ename, e.hiredate
from dept d join emp e on e.deptno = d.deptno;

* You can see that this runs as a MapReduce job.

 

2. Hive Partitions
Why partition in Hive
  * To avoid full-table scans, Hive introduces partitioning to divide the data; skipping the scan of irrelevant data improves query efficiency.
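The pruning idea can be sketched in Python (a toy in-memory model, not Hive's actual implementation):

```python
# Toy model of partition pruning: data is stored per partition,
# so a query that filters on the partition column reads only one slice.
table = {
    "dt=2018-10-14": [(1, "xm1", 16), (2, "xm2", 18)],
    "dt=2018-10-15": [(3, "xm3", 22)],
}

def scan(table, dt=None):
    """Read only the partitions matching the dt filter (or all of them)."""
    rows = []
    for part, data in table.items():
        if dt is None or part == "dt=" + dt:
            rows.extend(data)
    return rows

print(len(scan(table)))                # full scan: 3 rows
print(len(scan(table, "2018-10-14")))  # pruned scan: 2 rows
```

A filter on dt never touches the files of other partitions, which is the whole point of the technique.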


Difference between Hive and MySQL partitions
  * A MySQL partition column is a regular column inside the table; a Hive partition column is a pseudo-column outside the table's own data.


Hive partitioning
  * The partition column is a pseudo-column, but it can still be used in queries.
  * Partition column names are case-insensitive.
  * A partition can be a table-level partition or a partition of another partition, so a table can have multiple partition levels.


Choosing a partition key
  * It depends on the business: any attribute that cleanly separates the data works, e.g. year, month, day, region, gender.



Partition keyword
  * partitioned by(column)


What a partition really is
  * A directory created under the table directory (or under a parent partition's directory), named <column>=<value>.
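The directory layout can be sketched as follows (assuming the default warehouse path /user/hive/warehouse; yours may differ):

```python
# Hive stores each partition as a subdirectory named <column>=<value>
# under the table directory; nested partitions nest the directories.
WAREHOUSE = "/user/hive/warehouse"  # default warehouse path (assumption)

def partition_path(table, **parts):
    """Build the HDFS directory a partition's data files land in."""
    return "/".join([WAREHOUSE, table] + [f"{k}={v}" for k, v in parts.items()])

print(partition_path("u1", dt="2018-10-14"))
# /user/hive/warehouse/u1/dt=2018-10-14
print(partition_path("u2", month=9, day=14))
# /user/hive/warehouse/u2/month=9/day=14
```

The partition value lives only in the directory name, never inside the data files, which is why the partition column is a pseudo-column.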

Create a partitioned table:

create table if not exists u1(
  id int,
  name string,
  age int)
partitioned by(dt string)
row format delimited fields terminated by ' '
stored as textfile;

Prepare the data:

[hadoop@master tmp]$ more u1.txt
1 xm1 16
2 xm2 18
3 xm3 22

Load the data:

load data local inpath '/home/hadoop/tmp/u1.txt'  into table  u1 partition(dt="2018-10-14");

Query:

hive> select * from u1;
OK
1       xm1     16      2018-10-14
2       xm2     18      2018-10-14
3       xm3     22      2018-10-14
Time taken: 5.919 seconds, Fetched: 3 row(s)

 

Query a partition (the same file was also loaded into dt="2018-10-15"):

hive> select * from u1 where dt='2018-10-15';
OK
1       xm1     16      2018-10-15
2       xm2     18      2018-10-15
3       xm3     22      2018-10-15
Time taken: 0.413 seconds, Fetched: 3 row(s)

 

Hive second-level partitions

Create table u2:

create table if not exists u2(
  id int,
  name string,
  age int)
partitioned by(month int, day int)
row format delimited fields terminated by ' '
stored as textfile;

Load the data:

load data local inpath '/home/hadoop/tmp/u2.txt' into table u2 partition(month=9,day=14);

 

Query the data:

hive> select * from u2;
OK
1       xm1     16      9       14
2       xm2     18      9       14
Time taken: 0.303 seconds, Fetched: 2 row(s)

 

Modifying partitions:

List the partitions:

hive>  show partitions u1;
OK
dt=2018-10-14
dt=2018-10-15

 

增長分區:

hive> alter table u1 add partition(dt="2018-10-16");
OK

 

List the partitions again to see the new one:

hive> show partitions u1;
OK
dt=2018-10-14
dt=2018-10-15
dt=2018-10-16
Time taken: 0.171 seconds, Fetched: 3 row(s)

 

Drop a partition:

hive> alter table u1 drop partition(dt="2018-10-15");
Dropped the partition dt=2018-10-15
OK
Time taken: 0.576 seconds
hive> select * from u1;
OK
1       xm1     16      2018-10-14
2       xm2     18      2018-10-14
3       xm3     22      2018-10-14
Time taken: 0.321 seconds, Fetched: 3 row(s)

 

3. Hive Dynamic Partitions

The Hive configuration file hive-site.xml contains these parameters:

hive.exec.dynamic.partition=true;                   -- whether dynamic partitioning is allowed
hive.exec.dynamic.partition.mode=strict/nonstrict;  -- dynamic partition mode
    strict:    at least one static partition column is required (a fixed value must be given)
    nonstrict: all partition columns may be dynamic
hive.exec.max.dynamic.partitions=1000;              -- maximum number of dynamic partitions overall
hive.exec.max.dynamic.partitions.pernode=100;       -- maximum number of partitions per node

 

Creating a dynamic-partition table

The CREATE statement for a dynamic-partition table is the same as for a static one; the difference is in how data is loaded. A static-partition table can load from a local file, but a dynamic-partition table must be populated with a from … insert into statement.

create table if not exists u3(
  id int,
  name string,
  age int)
partitioned by(month int, day int)
row format delimited fields terminated by ' '
stored as textfile;

Load data, inserting the rows from u2 into u3:

from u2 insert into table u3 partition(month,day) select id,name,age,month,day;

FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
Fix:

Dynamic partition inserts require hive.exec.dynamic.partition.mode=nonstrict:
hive> set hive.exec.dynamic.partition.mode;
hive.exec.dynamic.partition.mode=strict
hive>  set hive.exec.dynamic.partition.mode=nonstrict;

Then the insert succeeds.
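What the dynamic insert does can be sketched as follows: the trailing columns of the select (month, day) decide which partition each row is routed to (a toy model, not Hive's implementation):

```python
# Rows from u2: (id, name, age, month, day). The last two columns are the
# dynamic partition values; each row is routed to its partition directory.
rows = [(1, "xm1", 16, 9, 14), (2, "xm2", 18, 9, 14)]

partitions = {}
for id_, name, age, month, day in rows:
    # Directory name comes from the partition-column values per row.
    partitions.setdefault(f"month={month}/day={day}", []).append((id_, name, age))

print(partitions)
# {'month=9/day=14': [(1, 'xm1', 16), (2, 'xm2', 18)]}
```

Each distinct (month, day) pair in the source data produces one partition directory, which is why the maximum-partition limits above exist.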

Query:

hive> select * from u3;
OK
1       xm1     16      9       14
2       xm2     18      9       14
Time taken: 0.451 seconds, Fetched: 2 row(s)

 

4. Hive Buckets

Purpose of bucketing
  * Buckets divide the data more finely; they make sampling queries efficient and can improve query performance.
  * Remember: for these sampling-style queries, bucketing gives higher query efficiency than partitioning.
Bucketing principle and keywords
  * The bucket column is a regular column of the table. By default Hive hashes the bucket column's value and takes it modulo the total number of buckets; the result decides which bucket the row goes to. Every bucket holds data, but the buckets do not necessarily hold equal numbers of rows.
     bucket
     clustered by(id) into 4 buckets
What a bucket really is
  * A file created inside the table directory (or partition directory).

Bucketing example
  * Four buckets

 

create table if not exists u4(
  id int,
  name string,
  age int)
partitioned by(month int, day int)
clustered by(id) into 4 buckets
row format delimited fields terminated by ' '
stored as textfile;

Data cannot be loaded into a bucketed table with load: the load will not fail, but the rows will not actually be bucketed.

Before inserting data into a bucketed table, run set hive.enforce.bucketing=true;

First add the data to table u2:

1 xm1 16
2 xm2 18
3 xm3 22
4 xh4 20
5 xh5 22
6 xh6 23
7 xh7 25
8 xh8 28
9 xh9 32

load data local inpath '/home/hadoop/tmp/u2.txt' into table u2 partition(month=9,day=14);

Insert it into the bucketed table:

from u2 insert into table u4 partition(month=9,day=14) select id,name,age where month = 9 and day = 14;
2019-03-31 15:43:26,755 Stage-1 map = 0%,  reduce = 0%
2019-03-31 15:43:34,241 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.85 sec
2019-03-31 15:43:41,681 Stage-1 map = 100%,  reduce = 25%, Cumulative CPU 1.95 sec
2019-03-31 15:43:45,855 Stage-1 map = 100%,  reduce = 50%, Cumulative CPU 3.21 sec
2019-03-31 15:43:47,927 Stage-1 map = 100%,  reduce = 75%, Cumulative CPU 4.35 sec
2019-03-31 15:43:48,959 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.35 sec
MapReduce Total cumulative CPU time: 5 seconds 350 msec
Ended Job = job_1554061731326_0001
Loading data to table db_hive.u4 partition (month=9, day=14)
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 4   Cumulative CPU: 5.35 sec   HDFS Read: 20301 HDFS Write: 405 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 350 msec

The load log shows Map: 1 Reduce: 4, i.e. one reducer per bucket.

Sampling a bucketed table: tablesample(bucket x out of y on id)
  *  x: the bucket to start reading from
  *  y: relates to the total number of buckets; it is normally a multiple or a factor of the bucket count.
  *  x must not be greater than y.
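When y is a factor of the bucket count, the clause reads every y-th bucket starting from bucket x; a sketch of which physical buckets get read (0-based indices, assuming 4 buckets as in u4):

```python
# Which bucket files tablesample(bucket x out of y) reads,
# for y a factor of the physical bucket count.
def sampled_buckets(x, y, num_buckets=4):
    """0-based indices of the buckets read by tablesample(bucket x out of y)."""
    return list(range(x - 1, num_buckets, y))

print(sampled_buckets(1, 4))  # [0]    -> one bucket out of four
print(sampled_buckets(2, 4))  # [1]
print(sampled_buckets(1, 2))  # [0, 2] -> every other bucket, half the data
```

When y exceeds the bucket count (e.g. bucket 1 out of 8 on a 4-bucket table), Hive instead reads roughly 4/8 = half of a single bucket, which matches the single-row result in the output below.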

hive> select * from u4;
OK
8       xh8     28      9       14
4       xh4     20      9       14
9       xh9     32      9       14
5       xh5     22      9       14
1       xm1     16      9       14
6       xh6     23      9       14
2       xm2     18      9       14
7       xh7     25      9       14
3       xm3     22      9       14
Time taken: 0.148 seconds, Fetched: 9 row(s)
hive> select * from u4 tablesample(bucket 1 out of 4 on id);
OK
8       xh8     28      9       14
4       xh4     20      9       14
Time taken: 0.149 seconds, Fetched: 2 row(s)
hive> select * from u4 tablesample(bucket 2 out of 4 on id);
OK
9       xh9     32      9       14
5       xh5     22      9       14
1       xm1     16      9       14
Time taken: 0.069 seconds, Fetched: 3 row(s)
hive> select * from u4 tablesample(bucket 1 out of 2 on id);
OK
8       xh8     28      9       14
4       xh4     20      9       14
6       xh6     23      9       14
2       xm2     18      9       14
Time taken: 0.089 seconds, Fetched: 4 row(s)
hive> select * from u4 tablesample(bucket 1 out of 8 on id) where age > 22;
OK
8       xh8     28      9       14
Time taken: 0.075 seconds, Fetched: 1 row(s)

Random query:

select * from u4 order by rand() limit 3;

OK
1       xm1     16      9       14
3       xm3     22      9       14
6       xh6     23      9       14
Time taken: 20.724 seconds, Fetched: 3 row(s)  -- this runs a MapReduce job
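The effect of order by rand() limit 3 is a uniform random sample of three rows, at the cost of a full MapReduce sort over the table; in plain Python the same selection looks like:

```python
import random

# The nine rows of u4 (id, name, age), as loaded above.
rows = [(1, "xm1", 16), (2, "xm2", 18), (3, "xm3", 22),
        (4, "xh4", 20), (5, "xh5", 22), (6, "xh6", 23),
        (7, "xh7", 25), (8, "xh8", 28), (9, "xh9", 32)]

# Pick 3 distinct rows uniformly at random.
sample = random.sample(rows, 3)
print(sample)
```

Unlike tablesample, this gives every row an equal chance of selection, which is why it is slower: every row must be read and sorted.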

hive> select * from u4 tablesample(3 rows);
OK
8       xh8     28      9       14
4       xh4     20      9       14
9       xh9     32      9       14
Time taken: 0.073 seconds, Fetched: 3 row(s)
hive> select * from u4 tablesample(30 percent);
OK
8       xh8     28      9       14
4       xh4     20      9       14
9       xh9     32      9       14
Time taken: 0.058 seconds, Fetched: 3 row(s)
hive> select * from u4 tablesample(3G);
OK
8       xh8     28      9       14
4       xh4     20      9       14
9       xh9     32      9       14
5       xh5     22      9       14
1       xm1     16      9       14
6       xh6     23      9       14
2       xm2     18      9       14
7       xh7     25      9       14
3       xm3     22      9       14
Time taken: 0.069 seconds, Fetched: 9 row(s)
hive> select * from u4 tablesample(3K);
OK
8       xh8     28      9       14
4       xh4     20      9       14
9       xh9     32      9       14
5       xh5     22      9       14
1       xm1     16      9       14
6       xh6     23      9       14
2       xm2     18      9       14
7       xh7     25      9       14
3       xm3     22      9       14
Time taken: 0.058 seconds, Fetched: 9 row(s)

* Partitions vs. buckets
* A partition column is outside the table's data; a bucket column is a regular table column.
* A partitioned table can be loaded with load; a bucketed table must be loaded with insert into.
* Partitioning is common; bucketing is used less often.

Importing data into Hive

* load from the local filesystem
* load from HDFS
* insert into
* location clause (point the table at existing data)
* like clause (clone another table's definition)
* CTAS statement (create table as select)
* manually copying data files into the table directory

Exporting data from Hive
* insert into another table
* insert overwrite local directory: export to a local directory
* insert overwrite directory: export to an HDFS directory

Export to a file:

hive -S -e "use gp1801;select * from u2" > /home/out/02/result
