1. Overview of Partitioned Tables
A partitioned table is still a managed (internal) table. When creating the table, you can define one or more partition columns, so that you specify a concrete partition when loading data and can restrict queries to a specific partition, which improves efficiency. A partition can be thought of as a special column of the table. The keyword is partitioned by.
In effect, a partitioned table splits the table's data files into several labeled smaller files, which makes queries easier to target.
2. Creating a Partitioned Table
Here we take the emp.csv file exported from the emp table under the Oracle user scott and store it in Hive as a partitioned table, partitioned by department number. The fields of the emp table are as follows:
empno, ename, job, mgr, hiredate, salary, comm, deptno
7499, ALLEN, SALESMAN, 7698, 1981/2/20, 1600, 300, 30
hive> create table part_emp(
> empno int,
> ename string,
> job string,
> mgr int,
> hiredate string,
> salary float,
> comm float
> )
> partitioned by (deptno int)
> row format delimited fields terminated by ',';
OK
Time taken: 0.061 seconds
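One detail worth noting in the create table statement above: the partition column deptno appears only in the partitioned by clause, never in the regular column list. Declaring it in both places is invalid; a sketch of the mistake (the table name part_emp_bad is made up for illustration):

```sql
-- Invalid: deptno may not be both a regular column and a partition column.
-- Hive rejects this with a SemanticException ("Column repeated in partitioning columns").
create table part_emp_bad(
  empno int,
  ename string,
  deptno int                    -- already declared here ...
)
partitioned by (deptno int)     -- ... so it cannot be repeated here
row format delimited fields terminated by ',';
```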
Inspect the partitioned table. The # Partition Information section at the end lists the partition columns; here there is a single partition column, deptno.
hive> desc extended part_emp;
OK
empno int None
ename string None
job string None
mgr int None
hiredate string None
salary float None
comm float None
deptno int None
# Partition Information
# col_name data_type comment
deptno int None
3. Loading Data into a Partitioned Table
(1) Loading data with the load command
First load: partition deptno=10
hive> load data local inpath '/root/emp.csv_10' into table part_emp partition(deptno=10);
Copying data from file:/root/emp.csv_10
Copying file: file:/root/emp.csv_10
Loading data to table default.part_emp partition (deptno=10)
[Warning] could not update stats.
OK
Time taken: 2.267 seconds
Second load: partition deptno=20
hive> load data local inpath '/root/emp.csv_20' into table part_emp partition(deptno=20);
Copying data from file:/root/emp.csv_20
Copying file: file:/root/emp.csv_20
Loading data to table default.part_emp partition (deptno=20)
[Warning] could not update stats.
OK
Time taken: 8.151 seconds
Third load: partition deptno=30, again using the load command. (A partition can also be populated with an insert statement instead of load.)
hive> load data local inpath '/root/emp.csv_30' into table part_emp partition(deptno=30);
Copying data from file:/root/emp.csv_30
Copying file: file:/root/emp.csv_30
Loading data to table default.part_emp partition (deptno=30)
[Warning] could not update stats.
OK
Time taken: 7.344 seconds
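As an alternative to load, a partition can be populated with an insert statement that selects from another table. A hedged sketch, assuming a hypothetical staging table stage_emp that already holds all rows with the same eight columns:

```sql
-- stage_emp is a hypothetical unpartitioned table containing all employee rows.
-- The partition column value is supplied by the partition clause, so the
-- select list carries only the seven non-partition columns.
insert into table part_emp partition (deptno=30)
select empno, ename, job, mgr, hiredate, salary, comm
from stage_emp
where deptno = 30;
```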
4. Querying by Partition (the partition behaves much like a special column)
hive> select * from part_emp where deptno=10;
7782 CLARK MANAGER 7839 1981/6/9 2450.0 100.0 10
7839 KING PRESIDENT NULL 1981/11/17 5000.0 120.0 10
7934 MILLER CLERK 7782 1982/1/23 1300.0 133.0 10
8129 Abama MANAGER 7839 1981/6/9 2450.0 122.0 10
8131 Jimy PRESIDENT NULL 1981/11/17 5000.0 333.0 10
8136 Goodle CLERK 7782 1982/1/23 1300.0 421.0 10
View the partition information of the partitioned table:
hive> show partitions part_emp;
deptno=10
deptno=20
deptno=30
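Partitions can also be managed directly with DDL, without loading any data; a brief sketch using standard Hive alter table syntax:

```sql
-- Create an empty partition up front; its directory is created on HDFS immediately.
alter table part_emp add partition (deptno=40);

-- Remove a partition; for a managed table this also deletes the partition's data.
alter table part_emp drop partition (deptno=40);
```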
5. How a Partitioned Table Is Stored on HDFS
Each partition corresponds to one directory.
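Assuming the default warehouse location /user/hive/warehouse, the layout can be inspected from the hive shell with the dfs command. Based on the three partitions created above, the listing would look roughly like:

```
hive> dfs -ls /user/hive/warehouse/part_emp;
drwxr-xr-x   ...   /user/hive/warehouse/part_emp/deptno=10
drwxr-xr-x   ...   /user/hive/warehouse/part_emp/deptno=20
drwxr-xr-x   ...   /user/hive/warehouse/part_emp/deptno=30
```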
6. Comparing the Execution Plans of a Partitioned-Table Query and an Ordinary-Table Query
Ordinary table: the plan needs a MapReduce stage (Stage-1) with a Filter Operator that evaluates deptno = 10 against every row.
hive> explain select * from emp where deptno=10;
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME emp))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (TOK_TABLE_OR_COL deptno) 10))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
emp
TableScan
alias: emp
Filter Operator
predicate:
expr: (deptno = 10)
type: boolean
Select Operator
expressions:
expr: empno
type: int
expr: ename
type: string
expr: job
type: string
expr: mgr
type: int
expr: hiredate
type: string
expr: salary
type: float
expr: comm
type: float
expr: deptno
type: int
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
Partitioned table: the plan has no MapReduce stage and no Filter Operator at all. The predicate deptno=10 is resolved by partition pruning, so Hive reads only the deptno=10 directory in a simple Fetch stage.
hive> explain select * from part_emp where deptno=10;
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME part_emp))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (TOK_TABLE_OR_COL deptno) 10))))
STAGE DEPENDENCIES:
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
TableScan
alias: part_emp
Select Operator
expressions:
expr: empno
type: int
expr: ename
type: string
expr: job
type: string
expr: mgr
type: int
expr: hiredate
type: string
expr: salary
type: float
expr: comm
type: float
expr: deptno
type: string
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
ListSink
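Partition pruning only kicks in when the where clause filters on the partition column. A filter on an ordinary column still has to scan every partition directory; a sketch of such a query:

```sql
-- The deptno=10 query above touched only one directory; this query must read
-- deptno=10, deptno=20 and deptno=30, because ename is not a partition column.
select * from part_emp where ename = 'KING';
```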