Summary of Hive Bucketed Tables

Date: 2019-11-21

This article is mainly reposted from: http://www.cnblogs.com/skyl/p/4737847.html

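In Hive, a table can be divided into partitions and buckets (BUCKET). Hive can further organize a table or a partition into buckets, so a bucket is a finer-grained division of the data. Bucketing hashes the specified column and takes the hash value modulo the number of buckets to decide which bucket each record is stored in. Buckets are declared with the CLUSTERED BY clause of the table definition, and the data inside each bucket can be sorted with SORTED BY.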
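The hash-and-modulo rule can be previewed with an ordinary query. The sketch below assumes the student0 table created in the example later in this article and uses Hive's built-in hash() and pmod() functions; the alias bucket_no is only illustrative. It approximates the assignment on older Hive versions such as the one used here, and newer versions may use a different bucketing hash.

```sql
-- Sketch: preview which of 2 buckets each row would fall into.
-- pmod keeps the result non-negative even when hash() is negative.
SELECT id,
       pmod(hash(id), 2) AS bucket_no  -- bucket_no is an illustrative alias
FROM student0
WHERE stat_date = '20120802';
```

The result can be compared with the bucket files listed in step 5 of the example.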

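Advantage 1: more efficient query processing. Buckets add extra structure to a table, and Hive can take advantage of that structure for certain queries. In particular, a join between two tables that are bucketed on the same column can use an efficient map-side join.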
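As a rough sketch of the map-side join case: assume a second table, here called student_score with a score column (both hypothetical, not part of this article), that is also clustered by id into 2 buckets. The bucket map join optimization can then be enabled with hive.optimize.bucketmapjoin before running the join. Whether the optimizer actually chooses a bucket map join depends on the Hive version and the table sizes.

```sql
-- student_score and its score column are hypothetical; the table is
-- assumed to be clustered by(id) into 2 buckets, like student1.
set hive.optimize.bucketmapjoin=true;

SELECT /*+ MAPJOIN(b) */ a.id, a.name, b.score
FROM student1 a
JOIN student_score b ON a.id = b.id;
```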

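Advantage 2: more efficient sampling. Sampling over the whole dataset is naturally slow, because it still has to read all of the data. If a table is bucketed on a column, a sample can instead be taken from a single bucket, selected by its number, so far less data is read.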

Disadvantage: bucketing has little effect on queries that filter on business fields rather than on the bucketing column.

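Note in particular that CLUSTERED BY and SORTED BY do not affect how data is loaded. The user is responsible for loading the data correctly, including bucketing and sorting it. Running 'set hive.enforce.bucketing=true' lets Hive automatically set the number of reducers to match the number of buckets. Users can also set mapred.reduce.tasks themselves to match the bucket count, but hive.enforce.bucketing is the recommended option.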
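For comparison, here is a sketch of the manual route, assuming the student0 and student1 tables from the example below. The reducer count is set to the bucket count by hand, and DISTRIBUTE BY sends rows with the same id hash to the same reducer.

```sql
-- Manual alternative to hive.enforce.bucketing:
-- match the number of reducers to the 2 buckets of student1.
set mapred.reduce.tasks=2;

FROM student0
INSERT OVERWRITE TABLE student1 PARTITION (stat_date='20120802')
SELECT id, age, name
WHERE stat_date = '20120802'
DISTRIBUTE BY id   -- same column as CLUSTERED BY
SORT BY age;       -- same column as SORTED BY
```

With this approach the reducer count has to be updated by hand whenever the bucket count of the target table changes.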

A worked example follows.

1) Create the temporary table student_tmp and load data into it.

hive> desc student_tmp;
hive> select * from student_tmp;

2) Create the bucketed table.

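Use the CLUSTERED BY clause to specify the column(s) used for bucketing and the number of buckets. The data in each bucket can additionally be sorted on one or more columns with SORTED BY (ascending by default when neither ASC nor DESC is given). Because joining corresponding buckets then becomes an efficient merge-sort, this can further improve the efficiency of map-side joins.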

hive> create table student0
      (id INT, 
       age INT, 
       name STRING
       )
     partitioned by(stat_date STRING)
     row format delimited 
     fields terminated by ','; 
OK
Time taken: 0.292 seconds

hive> create table student1
      ( id INT, 
        age INT, 
        name STRING
       ) 
      partitioned by(stat_date STRING) 
      clustered by(id) sorted by(age) into 2 buckets 
      row format delimited 
      fields terminated by ',';
OK
Time taken: 0.215 seconds

3) Set the configuration property so that Hive automatically allocates the number of reducers to match the number of buckets.

hive> set hive.enforce.bucketing=true;

4) Load the data.

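The bucketed table student1 is loaded with FROM ... INSERT ... SELECT, which runs a MapReduce job, whereas the plain table student0 is loaded with LOAD, which does not start MapReduce. In fact, the data files of a bucketed table are the reducer output files: bucket n corresponds to output file 00000n_0.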

[root@hadoop01 hive]# more bucket.txt
1,20,zxm
2,21,ljz
3,19,cds
4,18,mac
5,22,android
6,23,symbian
7,25,wp

hive> LOAD data local INPATH '/root/hive/bucket.txt' 
    > OVERWRITE INTO TABLE student0                  
    > partition(stat_date="20120802");

hive> from student0                                                   
    > insert overwrite table student1 partition(stat_date="20120802") 
    > select id,age,name where stat_date="20120802"                   
    > sort by age;

5) Inspect the file directory.

hive> dfs -ls /user/hive/warehouse/student1/stat_date=20120802;
Found 2 items
-rw-r--r--   1 root supergroup         31 2015-08-17 21:23 /user/hive/warehouse/student1/stat_date=20120802/000000_0
-rw-r--r--   1 root supergroup         39 2015-08-17 21:23 /user/hive/warehouse/student1/stat_date=20120802/000001_0

hive> dfs -text /user/hive/warehouse/student1/stat_date=20120802/000000_0;
6,23,symbian
2,21,ljz
4,18,mac

hive> dfs -text /user/hive/warehouse/student1/stat_date=20120802/000001_0;
7,25,wp
5,22,android
1,20,zxm
3,19,cds

6) Query sampled data.

hive> select * from student1                     
    > TableSample(bucket 1 out of 2 on id); 
OK
6       23      symbian 20120802
2       21      ljz     20120802
4       18      mac     20120802
Time taken: 10.871 seconds, Fetched: 3 row(s)
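The other half of the table can be sampled the same way. A minimal sketch, assuming the same student1 table: BUCKET 2 OUT OF 2 selects the second bucket, which in this walkthrough should correspond to the rows stored in file 000001_0.

```sql
-- Sample the second of the 2 buckets, hashing on the bucketing column id.
SELECT *
FROM student1
TABLESAMPLE(BUCKET 2 OUT OF 2 ON id);
```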

Note: TABLESAMPLE is the sampling clause. Syntax: TABLESAMPLE(BUCKET x OUT OF y [ON colname])

y must be a multiple or a factor of the table's bucket count. Hive determines the sampling ratio from y. For example, with 64 buckets: