Preparing the dept table:
-- create the dept table
CREATE TABLE dept(
    deptno int,
    dname string,
    loc string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS textfile;
Preparing the data file:
vi dept.txt
10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON
Preparing the emp table:
CREATE TABLE emp(
    empno int,
    ename string,
    job string,
    mgr int,
    hiredate string,
    sal int,
    comm int,
    deptno int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS textfile;
Preparing the emp data:
vi emp.txt
7369,SMITH,CLERK,7902,1980-12-17,800,null,20
7499,ALLEN,SALESMAN,7698,1981-02-20,1600,300,30
7521,WARD,SALESMAN,7698,1981-02-22,1250,500,30
7566,JONES,MANAGER,7839,1981-04-02,2975,null,20
7654,MARTIN,SALESMAN,7698,1981-09-28,1250,1400,30
7698,BLAKE,MANAGER,7839,1981-05-01,2850,null,30
7782,CLARK,MANAGER,7839,1981-06-09,2450,null,10
7788,SCOTT,ANALYST,7566,1987-04-19,3000,null,20
7839,KING,PRESIDENT,null,1981-11-17,5000,null,10
7844,TURNER,SALESMAN,7698,1981-09-08,1500,0,30
7876,ADAMS,CLERK,7788,1987-05-23,1100,null,20
7900,JAMES,CLERK,7698,1981-12-03,950,null,30
7902,FORD,ANALYST,7566,1981-12-02,3000,null,20
7934,MILLER,CLERK,7782,1982-01-23,1300,null,10
Load the data files into the tables:
load data local inpath '/home/hadoop/tmp/dept.txt' overwrite into table dept;
load data local inpath '/home/hadoop/tmp/emp.txt' overwrite into table emp;
Query:
select d.dname,d.loc,e.empno,e.ename,e.hiredate
from dept d join emp e on e.deptno = d.deptno;
* Note that this runs as a MapReduce job.
2. Hive partitions
Why Hive partitions
* To avoid full-table scans, Hive uses partitioning to divide the data; skipping irrelevant data improves query efficiency.
Difference between Hive and MySQL partitioning
* MySQL partitions on a column inside the table schema, while Hive partitions on a pseudo-column outside it.
Hive partitioning details
* The partition column is a pseudo-column, but it can still be used in queries.
* Partition columns are not case-sensitive.
* A partition can itself be partitioned (a partition of a partition), so a table may have multiple partition levels.
Choosing a partition key
* It depends on the business: any attribute that cleanly separates the data works, e.g. year, month, day, region, gender.
Partition keyword
* partitioned by(column)
What a partition really is
* A directory created under the table directory (or under a parent partition directory), whose name is column=value.
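Since a partition is just a column=value directory, the on-disk layout can be sketched in a few lines of Python. The `partition_path` helper and the warehouse paths are illustrative, not taken from this cluster:

```python
import os

def partition_path(table_dir, partition_spec):
    """Build the on-disk directory for a partition: one 'col=value'
    subdirectory per partition column, in declaration order."""
    parts = [f"{col}={val}" for col, val in partition_spec]
    return os.path.join(table_dir, *parts)

# Single-level partition, as in table u1:
print(partition_path("/user/hive/warehouse/u1", [("dt", "2018-10-14")]))
# -> /user/hive/warehouse/u1/dt=2018-10-14

# Two-level partition, as in table u2 further below:
print(partition_path("/user/hive/warehouse/u2", [("month", 9), ("day", 14)]))
# -> /user/hive/warehouse/u2/month=9/day=14
```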
Create a partitioned table:
create table if not exists u1(
    id int,
    name string,
    age int
) partitioned by(dt string)
row format delimited fields terminated by ' '
stored as textfile;
Prepare the data:
[hadoop@master tmp]$ more u1.txt
1 xm1 16
2 xm2 18
3 xm3 22
Load the data:
load data local inpath '/home/hadoop/tmp/u1.txt' into table u1 partition(dt="2018-10-14");
Query:
hive> select * from u1;
OK
1 xm1 16 2018-10-14
2 xm2 18 2018-10-14
3 xm3 22 2018-10-14
Time taken: 5.919 seconds, Fetched: 3 row(s)
Query a single partition:
hive> select * from u1 where dt='2018-10-15';
OK
1 xm1 16 2018-10-15
2 xm2 18 2018-10-15
3 xm3 22 2018-10-15
Time taken: 0.413 seconds, Fetched: 3 row(s)
Two-level partitions in Hive
Create table u2:
create table if not exists u2(
    id int,
    name string,
    age int
) partitioned by(month int,day int)
row format delimited fields terminated by ' '
stored as textfile;
Load data:
load data local inpath '/home/hadoop/tmp/u2.txt' into table u2 partition(month=9,day=14);
Query:
hive> select * from u2;
OK
1 xm1 16 9 14
2 xm2 18 9 14
Time taken: 0.303 seconds, Fetched: 2 row(s)
Modifying partitions:
Show partitions:
hive> show partitions u1;
OK
dt=2018-10-14
dt=2018-10-15
Add a partition:
hive> alter table u1 add partition(dt="2018-10-16");
OK
Show the newly added partition:
hive> show partitions u1;
OK
dt=2018-10-14
dt=2018-10-15
dt=2018-10-16
Time taken: 0.171 seconds, Fetched: 3 row(s)
Drop a partition:
hive> alter table u1 drop partition(dt="2018-10-15");
Dropped the partition dt=2018-10-15
OK
Time taken: 0.576 seconds
hive> select * from u1;
OK
1 xm1 16 2018-10-14
2 xm2 18 2018-10-14
3 xm3 22 2018-10-14
Time taken: 0.321 seconds, Fetched: 3 row(s)
3. Hive dynamic partitions
The hive-site.xml configuration file provides these parameters:
hive.exec.dynamic.partition=true;                 -- whether dynamic partitioning is allowed
hive.exec.dynamic.partition.mode=strict/nonstrict;  -- dynamic partition mode
    strict: at least one static partition column (with a fixed value) is required
    nonstrict: all partition columns may be dynamic
hive.exec.max.dynamic.partitions=1000;            -- maximum number of dynamic partitions overall
hive.exec.max.dynamic.partitions.pernode=100;     -- maximum number of partitions per node
A dynamic partition table is created with the same statement as a static one; the difference is in loading the data: a static partition table can be loaded from a local file, while a dynamic partition table must be populated with a from … insert into statement.
create table if not exists u3(
    id int,
    name string,
    age int
) partitioned by(month int,day int)
row format delimited fields terminated by ' '
stored as textfile;
Load data from u2 into u3:
from u2 insert into table u3 partition(month,day) select id,name,age,month,day;
FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
Fix:
Dynamic partition inserts require hive.exec.dynamic.partition.mode=nonstrict:
hive> set hive.exec.dynamic.partition.mode;
hive.exec.dynamic.partition.mode=strict
hive> set hive.exec.dynamic.partition.mode=nonstrict;
Then rerun the insert and it succeeds.
Query:
hive> select * from u3;
OK
1 xm1 16 9 14
2 xm2 18 9 14
Time taken: 0.451 seconds, Fetched: 2 row(s)
4. Hive bucketing
Purpose of bucketing
* Divides data at a finer granularity; makes sampling queries efficient; can improve query performance.
* Note: for such queries, bucketing offers higher efficiency than partitioning alone.
Bucketing principle and keyword
* The bucket column is a regular table column. By default Hive hashes the bucket column value and takes it modulo the total number of buckets; the result decides which bucket a row lands in. Every bucket holds data, but the row counts per bucket are not necessarily equal.
bucket
clustered by(id) into 4 buckets
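The hash-modulo rule can be sketched in Python. For an int column Hive's hash is the value itself, so with clustered by(id) into 4 buckets a row simply lands in bucket id % 4. The `bucket_of` helper is illustrative; the ids 1 through 9 are the ones used in the u4 example below:

```python
def bucket_of(key_hash, num_buckets):
    """Hive picks the bucket as (hash & Integer.MAX_VALUE) % num_buckets;
    for an int column the hash is simply the value itself."""
    return (key_hash & 0x7FFFFFFF) % num_buckets

ids = [1, 2, 3, 4, 5, 6, 7, 8, 9]
buckets = {}
for i in ids:
    buckets.setdefault(bucket_of(i, 4), []).append(i)

for b in sorted(buckets):
    print(f"bucket {b}: {buckets[b]}")
# bucket 0: [4, 8]
# bucket 1: [1, 5, 9]
# bucket 2: [2, 6]
# bucket 3: [3, 7]
```

This matches the u4 output further below, where one bucket file holds ids 8 and 4, another holds 9, 5, 1, and so on.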
What a bucket really is
* A file created under the table directory (or partition directory).
Bucketing example
* four buckets
create table if not exists u4(
    id int,
    name string,
    age int
) partitioned by(month int,day int)
clustered by(id) into 4 buckets
row format delimited fields terminated by ' '
stored as textfile;
Data cannot be loaded into a bucketed table with load: it does not report an error, but the data will not actually be bucketed.
To insert data into a bucketed table, set hive.enforce.bucketing=true;
First add the data to table u2:
1 xm1 16
2 xm2 18
3 xm3 22
4 xh4 20
5 xh5 22
6 xh6 23
7 xh7 25
8 xh8 28
9 xh9 32
load data local inpath '/home/hadoop/tmp/u2.txt' into table u2 partition(month=9,day=14);
Insert into the bucketed table:
from u2 insert into table u4 partition(month=9,day=14) select id,name,age where month = 9 and day = 14;
2019-03-31 15:43:26,755 Stage-1 map = 0%, reduce = 0%
2019-03-31 15:43:34,241 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.85 sec
2019-03-31 15:43:41,681 Stage-1 map = 100%, reduce = 25%, Cumulative CPU 1.95 sec
2019-03-31 15:43:45,855 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 3.21 sec
2019-03-31 15:43:47,927 Stage-1 map = 100%, reduce = 75%, Cumulative CPU 4.35 sec
2019-03-31 15:43:48,959 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.35 sec
MapReduce Total cumulative CPU time: 5 seconds 350 msec
Ended Job = job_1554061731326_0001
Loading data to table db_hive.u4 partition (month=9, day=14)
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 4 Cumulative CPU: 5.35 sec HDFS Read: 20301 HDFS Write: 405 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 350 msec
The job log shows Map: 1 Reduce: 4, one reducer per bucket.
Sampling a bucketed table: tablesample(bucket x out of y on id)
* x: the bucket to start from.
* y: the sampling denominator, generally a multiple or factor of the total bucket count; the query reads (total buckets)/y buckets.
* x must not be greater than y.
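For the common case where y divides the bucket count, the buckets that get read can be sketched as follows; the `sampled_buckets` helper is illustrative. (When y exceeds the bucket count, as in bucket 1 out of 8 below, Hive instead reads a fraction of a single bucket.)

```python
def sampled_buckets(x, y, num_buckets):
    """Buckets read by TABLESAMPLE(BUCKET x OUT OF y ON col) when y divides
    the bucket count: start at bucket x-1 and take every y-th bucket."""
    assert 1 <= x <= y <= num_buckets and num_buckets % y == 0
    return list(range(x - 1, num_buckets, y))

print(sampled_buckets(1, 4, 4))  # [0]    -> the bucket holding ids 8 and 4
print(sampled_buckets(2, 4, 4))  # [1]    -> ids 9, 5, 1
print(sampled_buckets(1, 2, 4))  # [0, 2] -> ids 8, 4, 6, 2
```

These selections line up with the query results shown below.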
hive> select * from u4;
OK
8 xh8 28 9 14
4 xh4 20 9 14
9 xh9 32 9 14
5 xh5 22 9 14
1 xm1 16 9 14
6 xh6 23 9 14
2 xm2 18 9 14
7 xh7 25 9 14
3 xm3 22 9 14
Time taken: 0.148 seconds, Fetched: 9 row(s)
hive> select * from u4 tablesample(bucket 1 out of 4 on id);
OK
8 xh8 28 9 14
4 xh4 20 9 14
Time taken: 0.149 seconds, Fetched: 2 row(s)
hive> select * from u4 tablesample(bucket 2 out of 4 on id);
OK
9 xh9 32 9 14
5 xh5 22 9 14
1 xm1 16 9 14
Time taken: 0.069 seconds, Fetched: 3 row(s)
hive> select * from u4 tablesample(bucket 1 out of 2 on id);
OK
8 xh8 28 9 14
4 xh4 20 9 14
6 xh6 23 9 14
2 xm2 18 9 14
Time taken: 0.089 seconds, Fetched: 4 row(s)
hive> select * from u4 tablesample(bucket 1 out of 8 on id) where age > 22;
OK
8 xh8 28 9 14
Time taken: 0.075 seconds, Fetched: 1 row(s)
Random sampling:
select * from u4 order by rand() limit 3;
OK
1 xm1 16 9 14
3 xm3 22 9 14
6 xh6 23 9 14
Time taken: 20.724 seconds, Fetched: 3 row(s) -- runs as a MapReduce job
hive> select * from u4 tablesample(3 rows);
OK
8 xh8 28 9 14
4 xh4 20 9 14
9 xh9 32 9 14
Time taken: 0.073 seconds, Fetched: 3 row(s)
hive> select * from u4 tablesample(30 percent);
OK
8 xh8 28 9 14
4 xh4 20 9 14
9 xh9 32 9 14
Time taken: 0.058 seconds, Fetched: 3 row(s)
hive> select * from u4 tablesample(3G);
OK
8 xh8 28 9 14
4 xh4 20 9 14
9 xh9 32 9 14
5 xh5 22 9 14
1 xm1 16 9 14
6 xh6 23 9 14
2 xm2 18 9 14
7 xh7 25 9 14
3 xm3 22 9 14
Time taken: 0.069 seconds, Fetched: 9 row(s)
hive> select * from u4 tablesample(3K);
OK
8 xh8 28 9 14
4 xh4 20 9 14
9 xh9 32 9 14
5 xh5 22 9 14
1 xm1 16 9 14
6 xh6 23 9 14
2 xm2 18 9 14
7 xh7 25 9 14
3 xm3 22 9 14
Time taken: 0.058 seconds, Fetched: 9 row(s)
* Partitioning vs. bucketing
* Partitioning uses a column outside the table schema; bucketing uses a column inside it.
* Partitioned tables can be loaded with load; bucketed tables must be populated with insert into.
* Partitioning is common; bucketing is used less often.
Hive data import
* load from the local filesystem
* load from HDFS
* insert into
* location: point the table at an existing directory
* like: clone another table's definition
* CTAS (create table as select)
* manually copy data files into the table directory
Hive data export
* insert into another table
* insert overwrite local directory: export to a local directory
* insert overwrite directory: export to an HDFS directory
Export to a file:
hive -S -e "use gp1801;select * from u2" > /home/out/02/result