Hive Partition Operations

Date: 2019-11-12

Tags: hive, partitions  Category: Hadoop


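1. Hive Partitions
(1) The partition concept:
Why create partitions: as a single table grows, a Hive SELECT normally scans the whole
table, which spends a lot of time on unnecessary work. Often only the part of the table
that is of interest needs to be scanned, so the partition concept was introduced at
table-creation time.
(1) Difference between Hive partitions and MySQL partitions: MySQL partitions on a column
that is already part of the table definition, while a Hive partition column is declared
separately in partitioned by(...) and is not stored in the table's data files.
(2) How to partition: partition according to the business (it depends entirely on the
scenario), for example by id, year, month, day, sex, or age group. Ideally the key spreads
the data evenly across files; a poor partitioning scheme directly slows queries down.
(3) Partition details:
1. A table can have one or more partitions; each partition exists as its own folder under the table's folder.
2. Table and column names are case-insensitive.
3. A partition appears as a column in the table structure and is visible with the describe command
(it is effectively a pseudo-column), but it holds no actual data in the files; it only identifies the partition.
4. Partitions can be nested: one-level, two-level, three-level, or deeper.
5. Partitions can be populated dynamically, statically, or in a mixed way:
Dynamic partitioning: the partition values are taken from the data being inserted
Static partitioning: the partition value is written explicitly in the statement
Mixed partitioning: static and dynamic partition columns are combined in one insert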
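
Details 1, 3 and 4 above can be pictured with a small sketch. The table demo_visits and its columns are hypothetical (they are not used in the cases below), and the path in the comment assumes the default warehouse directory /user/hive/warehouse in the default database.

```sql
-- Hypothetical two-level partitioned table, for illustration only.
create table if not exists demo_visits(
uid int,
page string
)
partitioned by(visit_year string, visit_month string)
row format delimited fields terminated by ',';

-- Register one partition explicitly; Hive creates its folder.
alter table demo_visits add if not exists
partition(visit_year='2019', visit_month='11');

-- Assuming the default warehouse location, the partition is a nested folder:
--   /user/hive/warehouse/demo_visits/visit_year=2019/visit_month=11/

-- The partition columns are listed after the regular columns,
-- under a "# Partition Information" heading.
describe demo_visits;
```

Each extra partition level adds one more folder level, which is why a filter on the leading partition column lets Hive skip whole directory trees.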


2. Partition Examples
Case 1: use a Hive partitioned table to store the student records of class 1608c by sex;
create table if not exists part_1608c(
sno int,
sname string,
sage int,
saddress string
)
partitioned by(sex string)
row format delimited fields terminated by ',';

Note: creating a partitioned table:
partitioned by(sex string) declares the partition column; it is not listed among the table's regular columns

vi part_1608c_nan.txt
10001,laowu,18,daxing
10002,laowang,48,fanshan
10003,laozhang,8,daxing
10004,laoxu,18,daxing

vi part_1608c_nv.txt
10005,xiaowu,28,daxing
10006,xiaowang,18,fanshan
10007,xiaozhang,18,daxing
10008,xiaoxu,18,daxing

*************Loading data into partitions with LOAD DATA**********************

Loading data into a partitioned table:
load data local inpath '/opt/data/part_1608c_nan.txt' into table part_1608c partition(sex='nan');
load data local inpath '/opt/data/part_1608c_nv.txt' into table part_1608c partition(sex='nv');

List the table's partitions:
show partitions part_1608c;
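
Beyond the plain listing, a partition can be inspected in more detail. The following is a minimal sketch; the dfs path assumes the default warehouse directory and the default database, which may differ on your cluster.

```sql
-- Restrict the listing to one partition spec.
show partitions part_1608c partition(sex='nan');

-- Storage details (location, input format, parameters) for one partition.
describe formatted part_1608c partition(sex='nan');

-- From the Hive CLI, list the table folder. After the two loads above it
-- should contain the subfolders sex=nan and sex=nv.
dfs -ls /user/hive/warehouse/part_1608c;
```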

Add partitions:
alter table part_1608c add partition(sex='bunan');
or, several in one statement:
alter table part_1608c add partition(sex='bunv') partition(sex='bunanbunv');
Rename a partition:
alter table part_1608c partition(sex='bunanbunv') rename to partition(sex='nannv');
Drop partitions:
alter table part_1608c drop partition(sex='bunv');
alter table part_1608c drop partition(sex='bunan');
alter table part_1608c drop partition(sex='nannv');

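Querying a partitioned table:
Note: in strict mode, a query against a partitioned table must have a where clause that filters on the partition column!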
set hive.mapred.mode=strict;
select * from part_1608c where sex='nan';
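
To make the strict-mode rule concrete, here is a short sketch against the same table. The rejected statement is left commented out so the script runs end to end.

```sql
set hive.mapred.mode=strict;

-- Rejected in strict mode: there is no predicate on the partition column sex.
-- select * from part_1608c;

-- Accepted: the filter on sex lets Hive read only the sex=nv folder.
select sno, sname from part_1608c where sex='nv';

-- The partition column can be selected and grouped like an ordinary column.
select sex, count(*) as students
from part_1608c
where sex in ('nan','nv')
group by sex;

-- Relax the check again for the rest of the session.
set hive.mapred.mode=nonstrict;
```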

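Case 2: use a Hive partitioned table to store the student records of class 1608c by sex, this time populated from an ordinary table;
Create an ordinary table: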
create table if not exists tb_students(
sno int,
sname string,
sage int,
saddress string,
sex string
)
row format delimited fields terminated by',';

vi tb_students.txt
10001,laowu,18,daxing,nan
10002,laowang,48,fanshan,nv
10003,laozhang,8,daxing,nan
10004,laoxu,18,daxing,nv
10005,xiaowu,28,daxing,nan
10006,xiaowang,18,fanshan,nv
10007,xiaozhang,18,daxing,nv
10008,xiaoxu,18,daxing,nan

load data local inpath '/opt/data/tb_students.txt' into table tb_students;

Create a partitioned table:
create table if not exists part_1608c2(
sno int,
sname string,
sage int,
saddress string
)
partitioned by(sex string)
row format delimited fields terminated by',';

***************Adding data to a partitioned table with INSERT INTO***************
insert into table part_1608c2 partition(sex='nan')
select sno,sname,sage,saddress from tb_students where sex='nan';

insert into table part_1608c2 partition(sex='nv')
select sno,sname,sage,saddress from tb_students where sex='nv';

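Summary: both ways of loading data shown so far (LOAD DATA and INSERT INTO with a literal partition value) are called static partitioning
Static partitioning: the statement itself names the partition and its value (sex='nan', sex='nv')
When to use static partitioning: when the partition columns and their values are known in advance and the number of partitions is small!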
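
One practical consequence of static partitioning is that a single partition can be reloaded on its own. A minimal sketch using the tables above:

```sql
-- insert into appends, so running the earlier statement twice duplicates rows.
-- insert overwrite replaces the content of just the named partition,
-- which makes the load safe to re-run.
insert overwrite table part_1608c2 partition(sex='nan')
select sno,sname,sage,saddress from tb_students where sex='nan';

-- Only sex='nan' was rewritten; sex='nv' is untouched.
select sex, count(*) as students
from part_1608c2
where sex in ('nan','nv')
group by sex;
```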

Dynamic Partition Example
Example: partition the student records by age
Create an ordinary table:
create table if not exists tb_students2(
sno int,
sname string,
saddress string,
sex string,
sage int
)
row format delimited fields terminated by',';

vi tb_students2.txt
10001,laowu,daxing,nan,18
10002,laowang,fanshan,nv,48
10003,laozhang,daxing,nan,8
10004,laoxu,daxing,nv,18
10005,xiaowu,daxing,nan,28
10006,xiaowang,fanshan,nv,18
10007,xiaozhang,daxing,nv,18
10008,xiaoxu,daxing,nan,18

load data local inpath '/opt/data/tb_students2.txt' into table tb_students2;

Create the partitioned table, partitioned by age:
create table if not exists part_students2(
sno int,
sname string,
saddress string,
sex string
)
partitioned by(sage int)
row format delimited fields terminated by',';

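Dynamic partitioning: partition by age dynamically
With dynamic partitioning, data can only reach the partitioned table as the result set of a query (insert ... select):

To use dynamic partitioning, it must be switched on; and because every partition column in this insert is dynamic, the partition mode must also be set to nonstrict!
1. Turn on dynamic partitioning:
set hive.exec.dynamic.partition=true;
2. Set the partition mode to nonstrict:
set hive.exec.dynamic.partition.mode=nonstrict;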

insert into table part_students2 partition(sage)
select sno,sname,saddress,sex,sage from tb_students2;

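Summary: inserting into a partitioned table as above is dynamic partitioning
Dynamic partitioning: use it when the number of partitions is not known before the insert and is not especially large
With dynamic partitioning, the partition values are not known until the data is inserted; Hive takes them from the trailing column(s) of the select list!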
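
Because the partition count is open-ended, Hive caps how many partitions one dynamic insert may create. The sketch below shows the two relevant settings; the numbers are the commonly documented defaults and are stated here as an assumption, so check them on your own installation.

```sql
-- Upper bounds on the partitions a single dynamic insert may create
-- (overall, and per mapper/reducer node).
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=100;

-- After the insert above, one partition per distinct sage value is expected
-- (the sample file contains the ages 8, 18, 28 and 48).
show partitions part_students2;

-- Reads only the sage=18 partition.
select sno,sname from part_students2 where sage=18;
```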

**************Mixed partitioning****************
Case 3: partition user records by country and city
Create the user table:
create table if not exists users(
ucard int,
uname string,
contry string,
city string
)
row format delimited fields terminated by'\t';

Load the data:
load data local inpath '/opt/data/city.txt' into table users;

Create a two-level partitioned table:
create table if not exists part_users(
ucard bigint,
uname string
)
partitioned by(contry string,city string)
row format delimited fields terminated by'\t';

Mixed partitioning: static and dynamic partition columns used together
insert into table part_users partition(contry="USA",city)
select ucard,uname,city from users where contry='USA';

insert into table part_users partition(contry="CH",city)
select ucard,uname,city from users where contry='CH';

insert into table part_users partition(contry="UK",city)
select ucard,uname,city from users where contry='UK';

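Note on mixed partitioning: the leading (parent) partition column must be static; the trailing (child) partition columns may be dynamic.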
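
The ordering rule can be seen from both sides in the sketch below. The first statement is an alternative to the three per-country inserts, not an addition to them (running both would load the rows twice). The value 'NY' in the commented-out statement is a made-up example.

```sql
-- Both partition columns dynamic: this needs nonstrict mode.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- The dynamic columns go last in the select list, in the same order
-- as in partitioned by(contry, city).
insert into table part_users partition(contry,city)
select ucard,uname,contry,city from users;

-- Not allowed: a dynamic column (contry) ahead of a static one (city).
-- insert into table part_users partition(contry,city='NY')
-- select ucard,uname,contry from users;
```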

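Static partitioning: every partition value is written out in partition(...).
Dynamic partitioning: every partition value comes from the query result.
Mixed partitioning: the leading values are written out, the trailing ones come from the query result.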

Case 4: when the data has already landed in an HDFS directory, how to create a Hive table that manages the landed data!
Simulate the landed data:
mkdir /opt/data/source
cd /opt/data/source
[root@Hadoop001 source]# mkdir 2016/01/01 -p
[root@Hadoop001 source]# mkdir 2016/01/02 -p
[root@Hadoop001 source]# mkdir 2016/01/03 -p
[root@Hadoop001 source]# mkdir 2016/01/04 -p
[root@Hadoop001 source]# mkdir 2016/02/04 -p
[root@Hadoop001 source]# mkdir 2016/02/03 -p
[root@Hadoop001 source]# mkdir 2016/02/02 -p
[root@Hadoop001 source]# mkdir 2016/02/01 -p
[root@Hadoop001 source]# mkdir 2016/03/01 -p
[root@Hadoop001 source]# mkdir 2016/03/02 -p
[root@Hadoop001 source]# mkdir 2016/03/03 -p
[root@Hadoop001 source]# mkdir 2016/03/04 -p

[root@Hadoop001 data]# cp flow.log ./source/2016/01/01/
[root@Hadoop001 data]# cp flow.log ./source/2016/01/02/
[root@Hadoop001 data]# cp flow.log ./source/2016/01/03/
[root@Hadoop001 data]# cp flow.log ./source/2016/03/01/
[root@Hadoop001 data]# cp flow.log ./source/2016/02/01/

Upload the simulated data to HDFS:

hadoop fs -put ./source /

Create an external partitioned table whose location points at the data directory:
CREATE external TABLE IF NOT EXISTS part_flow(
id string,
phonenumber bigint,
mac string,
ip string,
url string,
tiele string,
colum1 string,
colum2 string,
colum3 string,
upflow int,
downflow int
)
partitioned by(year int,month int,day int)
row format delimited fields terminated by'\t'
location '/source';

Add partitions for the source data:
alter table part_flow add partition(year=2016,month=01,day=01)
location 'hdfs:///source/2016/01/01';

alter table part_flow add partition(year=2016,month=03,day=01)
location 'hdfs:///source/2016/03/01/';

alter table part_flow add partition(year=2016,month=01,day=03) location 'hdfs:///source/2016/01/03/';
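
flow.log was also copied into 2016/01/02 and 2016/02/01 earlier, but those folders stay invisible to Hive until they are registered. A minimal sketch that registers them the same way and then checks the result:

```sql
-- Several partitions, each with its own location, in one statement.
alter table part_flow add if not exists
partition(year=2016,month=01,day=02) location 'hdfs:///source/2016/01/02/'
partition(year=2016,month=02,day=01) location 'hdfs:///source/2016/02/01/';

-- Every registered year/month/day combination should now be listed.
show partitions part_flow;

-- Reads only the files under /source/2016/01/01.
select count(*) from part_flow where year=2016 and month=1 and day=1;
```

Because part_flow is external, dropping a partition or the table later removes only the metadata; the files under /source stay in place.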