Hive Partitions and Buckets

Date: 2019-11-12

Tags: hive, partitions; Category: Hadoop


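Partition operations

Hive partitioning is enabled by specifying PARTITIONED BY when a table is created. The partitioning dimension is not an actual column in the data; the partition a row belongs to is given when the data is inserted. To query a particular partition, use a WHERE clause, for example "WHERE tablename.partition_key > a". The syntax for creating a partitioned table is as follows.

CREATE TABLE table_name(...) PARTITIONED BY (dt STRING, country STRING)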

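1. Creating partitions

Hive partitioned tables have none of the more elaborate partition types (range, list, hash, composite, and so on). A partition column is also not a real field of the table but one or more pseudo-columns: the table's data files do not actually store the partition column values.

Create a simple partitioned table.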

hive> create table partition_test(member_id string,name string) partitioned by (stat_date string,province string) row format delimited fields terminated by ',';

This example defines two partition columns, stat_date and province. Normally a partition has to be created before it can be used. For example:

hive> alter table partition_test add partition (stat_date='2015-01-18',province='beijing');

This creates one partition. Hive also creates a corresponding directory in HDFS.

$ hadoop fs -ls /user/hive/warehouse/partition_test/stat_date=2015-01-18
/user/hive/warehouse/partition_test/stat_date=2015-01-18/province=beijing    (the partition just created)

Every partition has its own directory. In the example above, stat_date is the top level and province is the second level.

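2. Inserting data

An auxiliary non-partitioned table, partition_test_input, is used to prepare the data to be inserted into partition_test. The steps are as follows.

1) Check the structure of partition_test_input with the following command.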

hive> desc partition_test_input;

2) Check the data in partition_test_input with the following command.

hive> select * from partition_test_input;

3) Insert data into a partition of partition_test with the following command.

insert overwrite table partition_test partition(stat_date='2015-01-18',province='jiangsu') select member_id,name from partition_test_input where stat_date='2015-01-18' and province='jiangsu';

To insert into several partitions at once, name the source table once in the leading FROM clause and omit it from each SELECT:

hive> from partition_test_input
insert overwrite table partition_test partition(stat_date='2015-01-18',province='jiangsu') select member_id,name where stat_date='2015-01-18' and province='jiangsu'
insert overwrite table partition_test partition(stat_date='2015-01-28',province='sichuan') select member_id,name where stat_date='2015-01-28' and province='sichuan'
insert overwrite table partition_test partition(stat_date='2015-01-28',province='beijing') select member_id,name where stat_date='2015-01-28' and province='beijing';

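3. Dynamic partitioning

With the method above, a large data source needs one insert statement per partition, which is tedious. Dynamic partitioning solves this: rows are routed automatically to the matching partition according to the values returned by the query.

Dynamic partitioning is turned on with the following settings:

hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;

Using it is simple. Suppose data is to be inserted under the partition stat_date='2015-01-18', and Hive is left to decide which province sub-partition each row goes to. Here stat_date is a static partition column and province is a dynamic partition column.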

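hive> insert overwrite table partition_test partition(stat_date='2015-01-18',province) select member_id,name,province from partition_test_input where stat_date='2015-01-18';

Note that dynamic partitioning does not allow a dynamic column as the primary partition with a static column as the sub-partition, because every primary partition would then have to create the sub-partition defined by the static column.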
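The routing and the ordering rule can be illustrated with a small Python model. This is a sketch under stated assumptions, not Hive's implementation: `DYNAMIC`, `check_spec`, `route`, and the sample rows are hypothetical names and data. As in the insert statement, dynamic partition values are taken from the trailing fields of each selected row.

```python
# Hypothetical model of a dynamic-partition insert (not Hive code).
from collections import defaultdict

DYNAMIC = object()  # marks a partition column whose value comes from the query result


def check_spec(spec):
    """Reject a spec in which a static column follows a dynamic one."""
    seen_dynamic = False
    for col, val in spec:
        if val is DYNAMIC:
            seen_dynamic = True
        elif seen_dynamic:
            raise ValueError(f"static column {col!r} follows a dynamic column")


def route(spec, rows):
    """Group rows by target partition; dynamic values come from the trailing row fields."""
    check_spec(spec)
    n_dyn = sum(1 for _, val in spec if val is DYNAMIC)
    out = defaultdict(list)
    for row in rows:
        split = len(row) - n_dyn
        data, dyn_vals = row[:split], iter(row[split:])
        key = tuple((col, next(dyn_vals) if val is DYNAMIC else val) for col, val in spec)
        out[key].append(data)
    return dict(out)


# Illustrative rows shaped like: select member_id, name, province
rows = [("m1", "alice", "jiangsu"), ("m2", "bob", "beijing"), ("m3", "carol", "jiangsu")]
result = route([("stat_date", "2015-01-18"), ("province", DYNAMIC)], rows)
for key, data in result.items():
    print(key, data)
```

Passing a spec such as `[("stat_date", DYNAMIC), ("province", "beijing")]` to `check_spec` raises an error, which mirrors the restriction described above.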

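hive.exec.max.dynamic.partitions.pernode: the maximum number of dynamic partitions that each mapper or reducer node may create; exceeding it raises an error (default 100).

hive.exec.max.dynamic.partitions: the maximum total number of dynamic partitions that a single DML statement may create (default 1000).

hive.exec.max.created.files: the maximum number of files that all mappers and reducers of a MapReduce job may create (default 100000).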

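Try to keep rows with the same partition column values in the same MapReduce task, so that each task creates as few new directories as possible. DISTRIBUTE BY groups rows with the same partition column values together, as in the following command.

hive> insert overwrite table partition_test partition(stat_date,province) select member_id,name,stat_date,province from partition_test_input distribute by stat_date,province;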
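The effect of DISTRIBUTE BY on the number of output files can be seen in a toy simulation. This is a hedged sketch, not how Hive is implemented: `files_written` is a hypothetical helper, `zlib.crc32` stands in for whatever hash Hive actually uses, and rows without DISTRIBUTE BY are simply assumed to be spread round-robin across reducers.

```python
# Hypothetical simulation of how DISTRIBUTE BY affects output file counts (not Hive code).
import zlib


def files_written(rows, num_reducers, distribute_by_partition):
    """Count distinct (reducer, partition) pairs, modelling one output file per pair."""
    pairs = set()
    for i, (stat_date, province) in enumerate(rows):
        if distribute_by_partition:
            # The same partition key always hashes to the same reducer.
            key = f"{stat_date}/{province}".encode()
            reducer = zlib.crc32(key) % num_reducers
        else:
            # Assumption: without DISTRIBUTE BY, rows are spread round-robin.
            reducer = i % num_reducers
        pairs.add((reducer, stat_date, province))
    return len(pairs)


# Three partitions with four rows each, interleaved (illustrative data).
keys = [("2015-01-18", "jiangsu"), ("2015-01-28", "sichuan"), ("2015-01-28", "beijing")]
rows = keys * 4

print("without distribute by:", files_written(rows, 4, False))
print("with distribute by:   ", files_written(rows, 4, True))
```

In this model, distributing by the partition columns caps the file count at the number of distinct partitions, which is also what keeps a job further from the limits listed above.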

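Bucket operations

A Hive table can be divided into partitions and, further, into buckets (BUCKET). Bucketing is declared with CLUSTERED BY on a table or partition, and the data inside each bucket can be sorted with SORTED BY.

Buckets serve two main purposes.

1) Data sampling;

2) Improving the efficiency of certain queries, for example map-side joins.

It is important to note that CLUSTERED BY and SORTED BY do not affect how data is loaded. This means the user is responsible for loading the data correctly, including bucketing and sorting it. 'set hive.enforce.bucketing=true' automatically sets the number of reducers in the final stage to match the number of buckets. The user can also set mapred.reduce.tasks manually to match the bucket count, but the recommended approach is: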

hive> set hive.enforce.bucketing=true;

An example follows.

1) Create a temporary table student_tmp and load data into it.

hive> desc student_tmp;
hive> select * from student_tmp;

2) Create the student table.

hive> create table student(id int,age int,name string) partitioned by (stat_date string) clustered by (id) sorted by (age) into 2 buckets row format delimited fields terminated by ',';

3) Set the configuration property.

hive> set hive.enforce.bucketing=true;

4) Insert data.

hive> from student_tmp insert overwrite table student partition(stat_date='2015-01-19') select id,age,name where stat_date='2015-01-18' sort by age;

5) View the file directory.

$ hadoop fs -ls /user/hive/warehouse/student/stat_date=2015-01-19/
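With two buckets, the partition directory is expected to hold two data files, one per bucket. The Python sketch below models how rows might be divided between them. It is only an illustration under assumptions: `bucket_of` uses the rule "non-negative hash modulo bucket count" with an integer hashing to itself, which matches the classic behavior as far as I know but may differ in newer Hive versions, and the student rows are made up.

```python
# Hypothetical model of CLUSTERED BY (id) SORTED BY (age) INTO 2 BUCKETS (not Hive code).


def bucket_of(id_value, num_buckets):
    """Assumed rule: non-negative hash modulo bucket count; an int hashes to itself."""
    return (id_value & 0x7FFFFFFF) % num_buckets


def write_buckets(rows, num_buckets):
    """Split (id, age, name) rows into buckets by id, then sort each bucket by age."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(row[0], num_buckets)].append(row)
    return [sorted(bucket, key=lambda r: r[1]) for bucket in buckets]


# Illustrative (id, age, name) rows.
students = [(1, 20, "zxm"), (2, 21, "ljz"), (3, 19, "cds"), (4, 18, "mac")]
for n, bucket in enumerate(write_buckets(students, 2)):
    print(f"bucket file {n}:", bucket)
```

Under these assumptions even ids land in the first file and odd ids in the second, and each file is ordered by age only within itself.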

6) View sampled data.

hive> select * from student tablesample(bucket 1 out of 2 on id);

tablesample is the sampling clause. Its syntax is as follows.

tablesample(bucket x out of y)

y must be a multiple or a factor of the total number of buckets in the table.
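The relationship between x, y, and the bucket count can be worked through with a short Python sketch. It is a model of the sampling arithmetic as I understand it, not Hive code, and `buckets_read` is a hypothetical helper: when y is a factor of the bucket count, every y-th bucket starting at x is read in full; when y is a multiple, only a fraction of one bucket is kept.

```python
# Hypothetical model of which buckets "bucket x out of y" touches (not Hive code).


def buckets_read(num_buckets, x, y):
    """Return (1-based bucket numbers read, fraction of each bucket kept)."""
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")
    if num_buckets % y == 0:
        # y is a factor of the bucket count: read every y-th bucket, each in full.
        return list(range(x, num_buckets + 1, y)), 1.0
    if y % num_buckets == 0:
        # y is a multiple of the bucket count: keep part of a single bucket.
        return [(x - 1) % num_buckets + 1], num_buckets / y
    raise ValueError("y is neither a factor nor a multiple of the bucket count")


print(buckets_read(2, 1, 2))    # the student table above: 2 buckets, bucket 1 out of 2
print(buckets_read(32, 3, 16))  # a table with 32 buckets, y is a factor
print(buckets_read(32, 3, 64))  # a table with 32 buckets, y is a multiple
```

For the student table, bucket 1 out of 2 therefore corresponds to reading the first of the two bucket files in full.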
