How Bucketing Works in Hive

Date: 2019-11-06

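The textbook definition of bucketing:

　　A bucketed table stores rows in separate files according to the hash of a column value. Any Hive table or partition can be further divided into buckets.

The bucket for each row is determined by the hash of the bucketing column modulo the number of buckets, that is, the remainder of the division, not the quotient. (Other definitions online are more detailed but somewhat convoluted; the worked example later in this post makes it concrete.)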
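
As a rough illustration of the rule above, the following Python sketch computes a bucket index from a hash code. The names bucket_index and int_hash are made up for this example, and the sketch assumes that the hash of an INT column is the integer value itself and that the hash is masked to a non-negative number before the modulo; treat both as assumptions rather than a description of Hive's internals.

```python
INT_MAX = 0x7FFFFFFF  # mask that keeps the hash non-negative (assumption)


def int_hash(value: int) -> int:
    # Assumption: for an INT column the hash code is the value itself.
    return value


def bucket_index(hash_code: int, num_buckets: int) -> int:
    """Return the 0-based bucket index for a row's hash code."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    return (hash_code & INT_MAX) % num_buckets


if __name__ == "__main__":
    for age in (12, 45, 34, 23):
        print(age, "-> bucket index", bucket_index(int_hash(age), 4))
```

The index is 0-based here; the post numbers buckets from 1, so index 0 corresponds to bucket 1.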

Use cases: data sampling and map-side joins (map-join).

The practical part: how to bucket

1. Enable bucketing

set hive.enforce.bucketing=true;
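Default: false. When set to true, the MapReduce job automatically sets the number of reduce tasks to match the number of buckets. (This property exists in Hive 0.x and 1.x; it was removed in Hive 2.0, where bucketing is always enforced.)
(You can also set the number of reduce tasks yourself with mapred.reduce.tasks, but this is not recommended when bucketing.)
Note: the number of buckets (output files) a single job produces equals the number of reduce tasks.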

2. Load data into the bucketed table
insert into table bucket_table select columns from tbl;
insert overwrite table bucket_table select columns from tbl;

3. Sampling a bucketed table

select * from bucket_table tablesample(bucket 1 out of 4 on columns);
TABLESAMPLE syntax:
TABLESAMPLE(BUCKET x OUT OF y [ON colname])
x: the bucket to start sampling from
y: must be a multiple or a factor of the table's total number of buckets
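
To see what x and y select, here is a small Python sketch. It assumes the sample keeps the rows whose hash modulo y equals x - 1, which is a common reading of the syntax rather than something stated in this post, and the helper name sampled_buckets is made up for illustration. Under that assumption, a factor y reads several whole buckets, while a multiple y reads only part of one bucket.

```python
from fractions import Fraction


def sampled_buckets(x: int, y: int, total: int):
    """Return (1-based table buckets read, fraction of each bucket kept)."""
    if not 1 <= x <= y:
        raise ValueError("x must satisfy 1 <= x <= y")
    if total % y == 0:
        # y is a factor of total: whole buckets x, x+y, x+2y, ...
        return list(range(x, total + 1, y)), Fraction(1)
    if y % total == 0:
        # y is a multiple of total: only a slice of one bucket
        return [(x - 1) % total + 1], Fraction(total, y)
    raise ValueError("y must be a multiple or a factor of the bucket count")


if __name__ == "__main__":
    print(sampled_buckets(1, 4, 4))    # one whole bucket of a 4-bucket table
    print(sampled_buckets(3, 16, 32))  # two whole buckets of a 32-bucket table
    print(sampled_buckets(3, 64, 32))  # half of one bucket
```

For the 4-bucket table used later in this post, bucket x out of 4 simply reads bucket x in full.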

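4. Bucketing example (in detail)

The steps are as follows:

1. Start Hive (remote, all-in-one mode): ① service iptables stop // ② service mysqld start // ③ hive --service metastore // ④ hive (the usual routine)

2. Preparation: on node03, create a data file named ft in the /root/hivedata directory

① vim ft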

1       zhang   12
2       lisi    34
3       wange   23
4       zhouyu  15
5       guoji   45
6       xiafen  48
7       yanggu  78
8       liuwu   41
9       zhuto   66
10      madan   71
11      sichua  89

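Note: the fields here are separated by tab characters ('\t'), so the table must be created with fields terminated by '\t'. Otherwise the rows do not match the expected format and the loaded columns show up as NULL.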
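
The effect of a delimiter mismatch can be imitated with a few lines of Python. This is only a simulation under simple assumptions (missing columns and values that fail to parse as integers are read as NULL, shown as None); parse_row is a hypothetical helper, not part of Hive.

```python
def parse_row(line: str, delimiter: str):
    """Split a text line into (id, name, age); bad or missing fields become None."""
    fields = line.rstrip("\n").split(delimiter)
    fields += [None] * (3 - len(fields))  # missing columns are read as NULL

    def to_int(text):
        try:
            return int(text)
        except (TypeError, ValueError):
            return None  # unparseable integers are read as NULL

    return to_int(fields[0]), fields[1], to_int(fields[2])


if __name__ == "__main__":
    line = "1\tzhang\t12"
    print(parse_row(line, "\t"))  # delimiter matches the file
    print(parse_row(line, ","))   # delimiter does not match
    print(parse_row("", "\t"))    # an empty line in the file
```

With the matching delimiter the row parses cleanly; with the wrong one, or with an empty line, every column comes back as None, which mirrors a row of NULLs.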

② Create the table in the hehe database (hehe.db):

hive> CREATE TABLE ft( id INT, name STRING, age INT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY'\t';
OK
Time taken: 0.216 seconds
hive> load data local inpath'/root/hivedata/ft' into table ft;
Loading data to table hehe.ft
Table hehe.ft stats: [numFiles=1, totalSize=127]
OK
Time taken: 1.105 seconds
hive> select *from ft;
OK
1    zhang    12
2    lisi    34
3    wange    23
4    zhouyu    15
5    guoji    45
6    xiafen    48
7    yanggu    78
8    liuwu    41
9    zhuto    66
10    madan    71
11    sichua    89
NULL    NULL    NULL
Time taken: 0.229 seconds, Fetched: 12 row(s)

Next, create a bucketed table named fentong and insert the data from ft into it:

hive> create table fentong(
    > id  int,
    > name string,
    > age int) clustered by(age) into 4 buckets
    > row format delimited fields terminated by ',';


This creates a table that is divided into 4 buckets by the age column.

Insert the data:
hive> insert into table fentong select id, name, age from ft;

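OK! The bucketed table now contains the data created earlier; check with: select * from fentong;

③ Run a sample query: select id, name, age from fentong tablesample(bucket 1 out of 4 on age);

Many tutorials online explain this in a roundabout way that is hard to untangle at first, so here is a plain walkthrough: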

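How the rows are divided: ① The statement that created the bucketed table contains age int) clustered by(age) into 4 buckets, so this example splits the table into 4 buckets by age.

How rows land in the four buckets: take the value of the bucketing column (here age; for an integer column the hash is the value itself), divide it by the number of buckets, 4, and keep the remainder (the modulo). Remainder 0 goes to bucket 1, remainder 1 to bucket 2, remainder 2 to bucket 3, and remainder 3 to bucket 4.

② How to read this statement: select id, name, age from fentong tablesample(bucket 2 out of 4 on age);

It means: split the data into 4 buckets by age and return the rows in bucket x. With bucket 2 out of 4, that is the second bucket (remainder 1).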

③ Run the queries

hive> select id, name, age from fentong tablesample(bucket 1 out of 4 on age);
OK
NULL    NULL    NULL
6    xiafen    48
1    zhang    12

hive> select id, name, age from fentong tablesample(bucket 2 out of 4 on age);
OK
11    sichua    89
8    liuwu    41
5    guoji    45

hive> select id, name, age from fentong tablesample(bucket 3 out of 4 on age);
OK
9    zhuto    66
7    yanggu    78
2    lisi    34

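④ Working through the result: each query returns the rows whose age leaves the matching remainder when divided by 4. 12 and 48 leave 0, so they come back for bucket 1; 45, 41 and 89 leave 1 (bucket 2); 34, 78 and 66 leave 2 (bucket 3).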
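
The grouping seen in the query results can be reproduced with a short Python sketch over the rows of the ft file. It assumes the bucket number is age % 4 + 1, as described in this post, and leaves out the all-NULL row, which shows up in bucket 1 in the transcript. The function name group_by_bucket is made up for illustration.

```python
rows = [
    (1, "zhang", 12), (2, "lisi", 34), (3, "wange", 23), (4, "zhouyu", 15),
    (5, "guoji", 45), (6, "xiafen", 48), (7, "yanggu", 78), (8, "liuwu", 41),
    (9, "zhuto", 66), (10, "madan", 71), (11, "sichua", 89),
]
NUM_BUCKETS = 4


def group_by_bucket(rows, num_buckets):
    """Map each 1-based bucket number to the rows whose age falls into it."""
    buckets = {number: [] for number in range(1, num_buckets + 1)}
    for row in rows:
        age = row[2]
        buckets[age % num_buckets + 1].append(row)  # remainder 0 -> bucket 1
    return buckets


if __name__ == "__main__":
    for number, members in group_by_bucket(rows, NUM_BUCKETS).items():
        print("bucket", number, [(name, age) for _, name, age in members])
```

Buckets 1 to 3 match the three query outputs above (row order aside). By the same arithmetic, bucket 4 would hold wange (23), zhouyu (15) and madan (71), although the transcript does not show that query.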
