Hive Partitioned Tables: A Summary

Date: 2019-11-21


Most of this article is reproduced from: http://blog.csdn.net/u010330043/article/details/51277868

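Hive implements partitioning through the PARTITIONED BY clause, which is given when the table is created. The partitioning dimension is not a column stored in the data itself; the value that identifies a partition is supplied when data is inserted. To query a particular partition, use a WHERE clause, for example "WHERE tablename.partition_key > a". The syntax for creating a partitioned table is as follows.

CREATE TABLE table_name(
...
)
PARTITIONED BY (dt STRING, country STRING);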
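As a concrete sketch of the skeleton above, the following fills in the elided column list and adds a query that filters on the partition keys. The table name page_views, its columns user_id and url, and the filter values are hypothetical, chosen only for illustration.

```sql
-- Hypothetical table: page_views, user_id and url are illustrative names.
CREATE TABLE IF NOT EXISTS page_views (
  user_id STRING,
  url     STRING
)
PARTITIONED BY (dt STRING, country STRING);

-- Filtering on the partition keys lets Hive read only the matching directories.
SELECT user_id, url
FROM page_views
WHERE dt >= '2016-04-01'
  AND country = 'CN';
```

The partition keys appear in the WHERE clause like ordinary columns, even though they are declared separately from the column list.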

1. Creating partitions

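Hive has no elaborate partition types (range, list, hash, composite and so on). A partition column is not a field stored in the table's data; it is one of one or more pseudo-columns. In other words, the table's data files do not actually hold the partition columns or their values.

Note that the columns defined in the PARTITIONED BY clause are still proper columns of the table and can be queried like any other. They are called partition columns.

The data files do not contain values for these columns, however, because the values are derived from the directory names.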

Create a simple partitioned table:

hive> create table partition_test
(member_id string,
 name string
) 
partitioned by (stat_date string,province string) 
row format delimited 
fields terminated by ',';

This example defines stat_date and province as the partition columns. Normally a partition has to be created before it can be used. For example:

hive> alter table partition_test add partition (stat_date='2016-04-28',province='beijing');

This creates one partition. Hive also creates a corresponding directory in HDFS:

$ hadoop fs -ls /user/hive/warehouse/partition_test/stat_date=2016-04-28
/user/hive/warehouse/partition_test/stat_date=2016-04-28/province=beijing ---- the partition just created
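Besides listing the HDFS directory, the partition can also be checked from the Hive side. A minimal sketch, assuming the partition_test table and the partition added above:

```sql
-- List every partition registered for the table.
SHOW PARTITIONS partition_test;

-- Restrict the listing to one value of the leading partition column.
SHOW PARTITIONS partition_test PARTITION (stat_date='2016-04-28');
```

Each partition is reported as a path-like string of key=value pairs, mirroring the directory layout.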

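Each partition has its own directory. In this example stat_date is the top level and province is the second level.

All partitions with stat_date='20150128' but different province values sit under

/user/hive/warehouse/partition_test/stat_date=20150128

while partitions with different stat_date values sit under

/user/hive/warehouse/partition_test/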

For example:

$ hadoop fs -ls /user/hive/warehouse/partition_test/
Found 2 items
drwxr-xr-x - admin supergroup 0 2015-01-28 19:46 /user/hive/warehouse/partition_test/stat_date=20150126
drwxr-xr-x - admin supergroup 0 2015-01-29 09:53 /user/hive/warehouse/partition_test/stat_date=20150128

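Note that because the values of partition columns become part of the storage path, any special character in a value, such as '%', ':', '/' or '#', is escaped as % followed by the two-digit hexadecimal ASCII code. For example: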

hive> alter table partition_test add partition (stat_date='2011/07/28',province='zhejiang');
      OK
      Time taken: 4.644 seconds

$ hadoop fs -ls /user/hive/warehouse/partition_test/
Found 3 items
drwxr-xr-x - admin supergroup 0 2015-01-29 10:06 /user/hive/warehouse/partition_test/stat_date=2011%2F07%2F28
drwxr-xr-x - admin supergroup 0 2015-01-28 19:46 /user/hive/warehouse/partition_test/stat_date=20150126
drwxr-xr-x - admin supergroup 0 2015-01-29 09:53 /user/hive/warehouse/partition_test/stat_date=20150128

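2. Inserting data

An auxiliary non-partitioned table, partition_test_input, is used to prepare the data that goes into partition_test. The steps are as follows.
1) Inspect the structure and contents of partition_test_input:

hive> desc partition_test_input;  -- table structure
hive> select * from partition_test_input;  -- table data

2) Insert data into a partition of partition_test: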

insert overwrite table partition_test
partition(stat_date='2016-04-28',province='jiangsu')
select member_id,name from partition_test_input
where stat_date='2016-04-28'
and province='jiangsu';

To insert into several partitions in a single statement:

hive> from partition_test_input
insert overwrite table partition_test partition(stat_date='2016-04-28',province='jiangsu')
select member_id,name where stat_date='2016-04-28' and province='jiangsu'
insert overwrite table partition_test partition(stat_date='2016-04-28',province='sichuan')
select member_id,name where stat_date='2016-04-28' and province='sichuan'
insert overwrite table partition_test partition(stat_date='2016-04-28',province='beijing')
select member_id,name where stat_date='2016-04-28' and province='beijing';

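Pay particular attention here. In other databases, inserting into a partitioned table generally makes the system check whether the data belongs in that partition, and an error is raised if it does not. In Hive, what goes into a partition is entirely up to the user, because the partition keys are pseudo-columns that are not stored in the files. For example: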

hive> desc partition_test_input;
OK
stat_date string
member_id string
name string
province string

hive> select * from partition_test_input;
OK
20110526 1 liujiannan liaoning
20110526 2 wangchaoqun hubei
20110728 3 xuhongxing sichuan
20110728 4 zhudaoyong henan
20110728 5 zhouchengyu heilongjiang

I then insert data into partitions of partition_test:

hive> insert overwrite table partition_test partition(stat_date='20110728',province='henan') 
select member_id,name from partition_test_input 
where stat_date='20110728' and province='henan';

Total MapReduce jobs = 2
...
1 Rows loaded to partition_test
OK

hive> insert overwrite table partition_test partition(stat_date='20110527',province='liaoning') select member_id,name from partition_test_input;
Total MapReduce jobs = 2
...
5 Rows loaded to partition_test
OK

hive> select * from partition_test where stat_date='20110527' and province='liaoning';
OK
1 liujiannan 20110527 liaoning
2 wangchaoqun 20110527 liaoning
3 xuhongxing 20110527 liaoning
4 zhudaoyong 20110527 liaoning
5 zhouchengyu 20110527 liaoning

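The 5 rows in partition_test_input have different stat_date and province values, but once they are inserted into the partition (stat_date='20110527',province='liaoning') all 5 rows show the same stat_date and province. This is because the values of these two columns are read from the directory names, not from the data files.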
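Since Hive does not validate rows against the partition spec, the filter has to be written explicitly. The following is a sketch of what the second insert might look like if the intent were to load only the rows that really belong to one partition; the choice of partition values is an assumption about that intent, not something the original states.

```sql
-- Load only the source rows whose own stat_date and province match
-- the target partition (the values here are an assumed intent).
INSERT OVERWRITE TABLE partition_test
PARTITION (stat_date='20110526', province='liaoning')
SELECT member_id, name
FROM partition_test_input
WHERE stat_date = '20110526'
  AND province = 'liaoning';
```

With the sample data shown earlier, only the row for member 1 satisfies this filter.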

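3. Dynamic partitions

Inserting data the way shown above means writing one insert per partition, which becomes tedious when the source data is large. Dynamic partitioning solves this: rows are routed automatically to the matching partition based on the values the query returns.

Dynamic partitioning is enabled with the following settings: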

set hive.exec.dynamic.partition=true;  
set hive.exec.dynamic.partition.mode=nonstrict;

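Using dynamic partitions is simple. Suppose data is to be inserted under stat_date='2016-04-28', leaving Hive to decide which province sub-partition each row goes into. stat_date is then the static partition column and province is the dynamic partition column.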

hive> insert overwrite table partition_test partition(stat_date='2016-04-28',province)
select member_id,name,province from partition_test_input where stat_date='2016-04-28';

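Note that dynamic partitioning does not allow the primary partition to be dynamic while a sub-partition is static, because every primary partition would then have to create the sub-partition defined by the static column.

The following parameters limit dynamic partitioning:

hive.exec.max.dynamic.partitions.pernode: the maximum number of dynamic partitions that each mapper or reducer node may create; exceeding it raises an error (default 100).
hive.exec.max.dynamic.partitions: the maximum total number of dynamic partitions that one DML statement may create (default 1000).
hive.exec.max.created.files: the maximum number of files that all mappers and reducers of one MapReduce job may create (default 100000).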
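These limits can be raised for a session in the same way dynamic partitioning is switched on. A minimal sketch; the numbers are illustrative assumptions, not recommendations, and should be tuned to the cluster and the data:

```sql
-- Illustrative values only.
SET hive.exec.max.dynamic.partitions.pernode=500;
SET hive.exec.max.dynamic.partitions=5000;
SET hive.exec.max.created.files=200000;
```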

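Try to keep rows that share the same partition-column values in the same MapReduce task, so that each task creates as few new directories as possible. DISTRIBUTE BY can be used to group rows with the same partition-column values together: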

insert overwrite table partition_test
partition(stat_date,province)
select member_id,name,stat_date,province
from partition_test_input
distribute by stat_date,province;
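After a fully dynamic insert like this one, it can be useful to check which partitions were created and how many rows each received. A sketch, assuming the partition_test table from this article; row_count is just an illustrative alias:

```sql
-- Partitions now registered for the table.
SHOW PARTITIONS partition_test;

-- Row count per partition; the partition columns can be grouped on
-- like ordinary columns.
SELECT stat_date, province, COUNT(*) AS row_count
FROM partition_test
GROUP BY stat_date, province;
```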
