[Repost] Hive Notes (Part 2)

Date: 2019-11-09


Reposted from: https://blog.51cto.com/xpleaf/2084985

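Types of tables in Hive

managed_table: managed table, also called an internal table

The lifecycle of the table's data is tied to the table definition: when the table is dropped, its data is deleted along with it.

Tables are created as managed tables by default.

You can view a table's details in the CLI with desc extended tableName, or look them up in the TBLS table of Hive's metadata database in MySQL.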

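external_table: external table

The lifecycle of the table's data is independent of the table definition: when the table is dropped, the underlying data remains in place.

In effect, the table is only a reference to the data.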
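
The difference between dropping a managed table and dropping an external table can be sketched with a toy model. This is not Hive code: the metastore and filesystem dictionaries, the function names, the t_managed table and the data.txt file are all assumptions made up for illustration.

```python
# Toy model (not Hive code): a "metastore" of table definitions and a
# "filesystem" that maps directories to the files they hold.
metastore = {}
filesystem = {}

def create_table(name, location, external=False):
    """Register a table; the directory is created if it does not exist yet."""
    metastore[name] = {"location": location, "external": external}
    filesystem.setdefault(location, [])

def drop_table(name):
    """Drop the definition; only a managed table takes its data with it."""
    table = metastore.pop(name)
    if not table["external"]:
        filesystem.pop(table["location"], None)

# Hypothetical managed table with one data file.
create_table("t_managed", "/user/hive/warehouse/t_managed")
filesystem["/user/hive/warehouse/t_managed"].append("data.txt")

# Data that already exists, referenced by an external table.
filesystem["/input/hive"] = ["hive-t6.txt"]
create_table("t6_external", "/input/hive", external=True)

drop_table("t_managed")
drop_table("t6_external")

print(sorted(filesystem))  # only the external table's directory is left
```

Both definitions are gone from the model's metastore afterwards, but only the directory that belonged to the managed table has been removed.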

Creating an external table:

create external table t6_external(
id int
);

Adding data:
alter table t6_external set location "/input/hive/hive-t6.txt";

The data location can also be specified when the external table is created:
create external table t6_external_1(
id int
) location "/input/hive/hive-t6.txt";
The HQL above fails with:
MetaException(message:hdfs://ns1/input/hive/hive-t6.txt is not a directory or unable to create one
That is, the location given at table creation must be a directory, not an individual file:

create external table t6_external_1(
id int
) location "/input/hive/";

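With an external table, delete operations are not allowed, but data can be added, and doing so changes the referenced text data in HDFS.

Typical uses of internal versus external tables:

External tables are generally used when data safety matters or when the data is shared across several departments.

External tables are also generally used when Hive is integrated with other frameworks (such as HBase).

Internal and external tables can be converted into each other: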

External ---> internal

alter table t6_external set tblproperties("EXTERNAL"="FALSE");

Internal ---> external

alter table t2 set tblproperties("EXTERNAL"="TRUE");

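Persistent tables and temporary tables

All of the tables above are persistent tables; their existence has nothing to do with the session.

Temporary tables:

A temporary table exists only within the session that created it. When the session ends, all of the table's data, including its metadata, disappears. The data is nominally held in memory for the duration of the session; in practice, after a temporary table is created, a /tmp/hive directory is created in HDFS to hold the temporary files, and that data is deleted as soon as the session exits.

Temporary tables do not show up in the metastore database.

They are typically used for temporary data storage or data exchange.

Characteristics of temporary tables

They cannot be partitioned tables.

Creating one is simple: it works like creating an external table, with the external keyword replaced by temporary.

Functional tables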

Partitioned tables

Suppose the up_web_log table is laid out as follows:

user/hive/warehouse/up_web_log/

web_log_2017-03-09.log

web_log_2017-03-10.log

web_log_2017-03-11.log

web_log_2017-03-12.log

....

web_log_2018-03-12.log

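This layout can be explained as follows:

The table stores web logs. In Hive, a table is a directory in HDFS. The web logs are collected per day, so at the end of each day the log data is loaded into Hive, which is why the up_web_log directory contains many files. To Hive, all of this log data belongs to the up_web_log table, and the table clearly keeps growing over time.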

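Problems with this table:

It is one large table holding all of the data. To look at a single day, you first have to define a date column in the table (for example dt) and then filter on it in the query with where dt="2017-03-12". That produces the right result, but the query has to read the data in every file under the table, so a large amount of irrelevant data is loaded and the HQL runs inefficiently.

So how can the table be optimized?

The optimization relies on the fact that a Hive table really manages a folder: a table maps to a directory in HDFS. On top of that table directory we can create one or more levels of subdirectories to split up the large table, naming each subfolder with some identifying feature, such as datadate=2017-03-09. A later query for one day only has to locate that subfolder and load the data under it, much like loading a table. The full table is no longer loaded, only part of it, which reduces the amount of data read and makes the HQL run faster.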

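This technique is called table partitioning, and such a table is called a partitioned table. Each subfolder is called a partition of the table.

A partition is made up as follows:

A partition consists of two parts, the partition column and a concrete partition value, joined by "=". The partition column behaves like an ordinary column of the table, so to query the data in one partition you filter on it: where datadate="2017-03-09"

The table is stored in HDFS as: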

user/hive/warehouse/up_web_log/

/datadate=2017-03-09

web_log_2017-03-09.log

/datadate=2017-03-10

web_log_2017-03-10.log

/datadate=2017-03-11

web_log_2017-03-11.log

/datadate=2017-03-12

web_log_2017-03-12.log

....

/datadate=2018-03-12

web_log_2018-03-12.log
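
The directory naming and the effect of filtering on a partition column can be sketched in a few lines of Python. This is only a toy model of the layout above, not how Hive is implemented: the helper names partition_dir and files_to_scan and the files dictionary are assumptions for illustration, and the table directory is written with a leading slash.

```python
import posixpath

TABLE_DIR = "/user/hive/warehouse/up_web_log"

def partition_dir(table_dir, **spec):
    """Build a partition directory such as .../datadate=2017-03-09."""
    parts = [f"{key}={value}" for key, value in spec.items()]
    return posixpath.join(table_dir, *parts)

# Toy listing: one log file under each partition directory.
files = {
    partition_dir(TABLE_DIR, datadate=day): [f"web_log_{day}.log"]
    for day in ("2017-03-09", "2017-03-10", "2017-03-11", "2017-03-12")
}

def files_to_scan(table_dir, **spec):
    """Return only the files in the partition that matches the filter."""
    return files.get(partition_dir(table_dir, **spec), [])

print(partition_dir(TABLE_DIR, datadate="2017-03-09"))
print(files_to_scan(TABLE_DIR, datadate="2017-03-09"))
```

A filter on datadate resolves to a single subdirectory here, so the files of the other days are never touched. The same helper also covers multi-level partitions, since each extra key adds one more directory level.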

Creating a partitioned table:

create table t7_partition (
id int
) partitioned by (dt date comment "date partition field");

load data local inpath '/opt/data/hive/hive-t6.txt' into table t7_partition;
FAILED: SemanticException [Error 10062]: Need to specify partition columns because the destination table is partitioned
Data cannot be loaded into a partitioned table directly; you must first state which partition (that is, which subfolder) it goes into.

DDL for partitioned tables:

Adding a partition:

alter table t7_partition add partition(dt="2017-03-10");

Listing partitions:

show partitions t7_partition;

Dropping a partition:

alter table t7_partition drop partition(dt="2017-03-10");

Adding data:

Loading data into a specific partition:

load data local inpath '/opt/data/hive/hive-t6.txt' into table t7_partition partition (dt="2017-03-10");

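Loading this way creates the partition automatically.

With multiple partition columns:

For example, tracking each school's enrollment and employment figures per year and per subject, or its employment figures per year: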

create table t7_partition_1 (
id int
) partitioned by (year int, school string);
Adding data:
load data local inpath '/opt/data/hive/hive-t6.txt' into table t7_partition_1 partition(year=2015, school='python');

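Bucketed tables

A problem with partitioned tables:

Some partitions can end up very large while others are very small, which makes query performance uneven. To avoid this, the data can be spread out relatively evenly using a technique called bucketing; a table organized this way is called a bucketed table.

Creating a bucketed table:

create table t8_bucket(
id int
) clustered by(id) into 3 buckets;

Adding data to a bucketed table:

Data can only be inserted by selecting from another table; the load command shown above cannot be used, because it does not split the data into buckets.
insert into t8_bucket select * from t7_partition_1 where year=2016 and school="mysql";
FAILED: SemanticException [Error 10044]: Line 1:12 Cannot insert into target table because column number/types are different 't8_bucket':
Table insclause-0 has 1 columns, but query has 3 columns.
The bucketed table has only one column, while the partitioned table has three (its partition columns count as columns), so when importing with insert into, the number of columns on both sides must match.
insert into t8_bucket select id from t7_partition_1 where year=2016 and school="mysql";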

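After adding the data, query the table:
> select * from t8_bucket;
OK
6
3
4
1
5
2
Time taken: 0.08 seconds, Fetched: 6 row(s)
The rows come back in a different order than the original, because the data has been split into 3 parts. Buckets are assigned by hashing, as follows:
6%3 = 0, 3%3 = 0: placed in bucket 1
4%3 = 1, 1%3 = 1: placed in bucket 2
5%3 = 2, 2%3 = 2: placed in bucket 3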
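
The bucket assignment above can be reproduced with a small Python sketch. It assumes that the hash of an int key is the value itself and that the bucket index is the non-negative hash modulo the number of buckets; newer Hive versions may use a different hash function, so treat this as an illustration of the idea rather than Hive's exact algorithm. The function name bucket_for is made up for this example.

```python
def bucket_for(value, num_buckets):
    """Bucket index for an int key, assuming its hash is the value itself."""
    # Mask to a non-negative 31-bit number before taking the modulo.
    return (value & 0x7FFFFFFF) % num_buckets

ids = [1, 2, 3, 4, 5, 6]
buckets = {b: [] for b in range(3)}
for i in ids:
    buckets[bucket_for(i, 3)].append(i)

print(buckets)  # {0: [3, 6], 1: [1, 4], 2: [2, 5]}
```

The grouping matches the query output above (6 and 3, then 4 and 1, then 5 and 2); the order of rows inside each bucket is not modelled here.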

Inspect the layout of table t8_bucket in HDFS:
hive (mydb1)> dfs -ls /user/hive/warehouse/mydb1.db/t8_bucket;
Found 3 items
-rwxr-xr-x 3 uplooking supergroup 4 2018-03-09 23:00 /user/hive/warehouse/mydb1.db/t8_bucket/000000_0
-rwxr-xr-x 3 uplooking supergroup 4 2018-03-09 23:00 /user/hive/warehouse/mydb1.db/t8_bucket/000001_0
-rwxr-xr-x 3 uplooking supergroup 4 2018-03-09 23:00 /user/hive/warehouse/mydb1.db/t8_bucket/000002_0
The data has been saved into 3 separate files under the t8_bucket directory.

Note: local mode does not take effect when operating on bucketed tables.

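Loading and exporting data

[] ==> optional, <> ==> required

Loading

load

load data [local] inpath 'path' [overwrite] into table tblName [partition_spec];

local:

present ==> load the data from the local Linux filesystem

absent ==> load the data from HDFS, which amounts to an mv operation ("absent" means the local keyword is omitted, not that the file is missing locally)

overwrite:

present ==> replace the data already in the table

absent ==> append the new data to the existing data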

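Loading from another table


insert <overwrite|into> [table] t_des select [...] from t_src [...];
(the table keyword is required when overwrite is used)

overwrite:

present ==> replace the data already in the table

absent (into) ==> append the new data to the existing data

==> the statement is translated into a MapReduce job
Points to note: the columns of t_des must correspond one-to-one with the [...] in select [...] from t_src.
 When overwrite is used, it must be followed by table, for example:
 insert overwrite table test select * from t8_bucket;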

Loading at table creation


create table t_des as select [...] from t_src [...];
This creates a table whose structure is given by the [...] in select [...] from t_src.
 e.g. create temporary table tmp as select distinct(id) from t8_bucket;

Loading with dynamic partitions

Quickly copying a table structure:


create table t_d_partition like t_partition_1;
hive (default)> show partitions t_partition_1;
 OK
 partition
 year=2015/class=bigdata
 year=2015/class=linux
 year=2016/class=bigdata
 year=2016/class=linux

To load all of the 2016 data into the corresponding partitions of t_d_partition:

insert into table t_d_partition partition(year=2016, class) select id, name, class from t_partition_1 where year=2016;

To load all of the data in t_partition_1 into the corresponding partitions of t_d_partition:


insert overwrite table t_d_partition partition(year, class) select id, name, year, class from t_partition_1;
(The message I got when running this:
 FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
 )
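
How a dynamic-partition insert distributes rows can be sketched as follows. The sketch assumes, as in the statements above, that the partition columns come last in the select list and in the same order as in the partition(...) clause. The rows and the route_rows helper are hypothetical; this models the routing only, not Hive itself.

```python
from collections import defaultdict

PARTITION_COLUMNS = ("year", "class")  # same order as partition(year, class)

def route_rows(rows, partition_columns):
    """Group rows by the values of their trailing partition columns."""
    n = len(partition_columns)
    partitions = defaultdict(list)
    for row in rows:
        data, part_values = row[:-n], row[-n:]
        path = "/".join(
            f"{col}={val}" for col, val in zip(partition_columns, part_values)
        )
        partitions[path].append(data)
    return dict(partitions)

# Hypothetical rows shaped like: select id, name, year, class from t_partition_1
rows = [
    (1, "alice", 2015, "bigdata"),
    (2, "bob", 2015, "linux"),
    (3, "carol", 2016, "bigdata"),
    (4, "dave", 2016, "linux"),
    (5, "erin", 2016, "linux"),
]

for path, data in sorted(route_rows(rows, PARTITION_COLUMNS).items()):
    print(path, data)
```

Each distinct combination of partition values becomes its own year=.../class=... path, and only the non-partition columns (id, name) are stored inside it.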

Other issues:


Deleting data from HDFS does not delete the table structure. The output of show partitions t_d_partition; comes from the metastore, so if you delete the data on HDFS by hand, its metadata is still there.
insert into t10_p_1 partition(year=2016, class) select * from t_partition_1;
 FAILED: SemanticException [Error 10094]: Line 1:30 Dynamic partition cannot be the parent of a static partition 'professional'
 A dynamic partition cannot be the parent directory of a static partition.
 hive.exec.dynamic.partition.mode needs to be set to nonstrict.
 <property>
 <name>hive.exec.max.dynamic.partitions</name>
 <value>1000</value>
 <description>Maximum number of dynamic partitions allowed to be created in total.</description>
 </property>

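Importing data on HDFS with import:


import table stu from '/data/stu';
Testing this currently produces the errors below. A possible cause, not confirmed here, is that import expects a directory previously written by export table rather than a plain data directory: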
 hive (mydb1)> import table test from '/input/hive/';
 FAILED: SemanticException [Error 10027]: Invalid path
 hive (mydb1)> import table test from 'hdfs://input/hive/';
 FAILED: SemanticException [Error 10324]: Import Semantic Analyzer Error
 hive (mydb1)> import table test from 'hdfs://ns1/input/hive/';
 FAILED: SemanticException [Error 10027]: Invalid path