Hadoop 2.6 + Hive 1.2.1 + spark-1.4.1(3)

Date: 2019-11-12


1. Creating tables

1) Creating a table definition

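create table user_table(
  id int,
  userid bigint,
  name string,
-- describe is a reserved keyword in Hive 1.2, so it is quoted with backticks
  `describe` string comment 'description of the user'
)
comment 'user information table'
-- partitions: each partition is a directory under the table's directory
partitioned by (country string, city string)
-- rows are assigned to buckets by hashing id; within each bucket they are sorted by userid.
-- Bucketing lets the relevant data fit into limited memory (a performance
-- optimization that also helps join, group by and distinct).
clustered by (id) sorted by (userid) into 32 buckets
-- delimiters used to parse the data; Hive does not transform raw data when it is loaded
row format delimited
-- separator between fields
  fields terminated by '\001'
-- separator between the elements of an array field
  collection items terminated by '\002'
-- separator between keys and values of a map field
  map keys terminated by '\003'
-- storage format (rcfile / textfile / sequencefile); textfile is fine for raw data
stored as textfile;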

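Summary:

Compared with textfile and SequenceFile, rcfile is a columnar format, so loading data into it costs more, but it offers a better compression ratio and faster query response. A data warehouse is typically written once and read many times, so on balance rcfile has a clear advantage over the other two formats.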
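As a minimal sketch of that trade-off, raw text can be kept in a textfile table and then rewritten into an rcfile table for repeated querying. The table name t1_rc is hypothetical, and t1 is assumed to be the single-column table created in section a).

```sql
-- Hypothetical columnar copy of t1; the name t1_rc is an assumption for illustration.
create table t1_rc(id string) stored as rcfile;

-- load data only moves or copies files and does not convert formats,
-- so the rows are rewritten into rcfile with an insert ... select.
insert overwrite table t1_rc select id from t1;
```

The extra cost is paid once, in the insert; later queries read the rcfile copy.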

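a) Table: managed (internal) table (identifiers are case-insensitive)

Create:

create table t1(id string);

create table t2(id string, name string) row format delimited fields terminated by '\t';

Load:

load data local inpath '/root/Downloads/seq100w.txt' into table t1;

load data inpath '/seq100w.txt' into table t1; (the file is moved within HDFS into the table's directory, here /hive/t1)

(It follows that moving a file in HDFS directly into the table's directory also makes its data readable through the table.)

load data local inpath '/root/Downloads/seq100w.txt' overwrite into table t1; (overwrite replaces the data already in the table)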

b) Partition: partitioned table

Create:

create table t3(id string) partitioned by (province string);

Load:

load data local inpath '/root/Downloads/seq100w.txt' into table t3 partition(province ='beijing');

List all partitions of a table:

hive> show partitions table_name;
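Because each partition is a directory, a filter on the partition column lets Hive read only the matching directory instead of the whole table. A small sketch, assuming t3 has been loaded as shown above:

```sql
-- Only the province=beijing directory of t3 needs to be scanned.
select count(*) from t3 where province = 'beijing';
```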

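c) Bucket Table: bucketed table

Create: create table t4(id string) clustered by (id) into 4 buckets; (bucketed by id)

Or, with sorted buckets: create table t4(id string) clustered by (id) sorted by (id asc) into 4 buckets; (rows in each bucket are sorted in ascending order, so joining two buckets becomes an efficient merge-sort, which further improves map-side join efficiency)

Enforce bucketing on insert: set hive.enforce.bucketing = true;

Load: insert into table t4 select id from t3 where province='beijing';

Overwrite: insert overwrite table bucket_table select name from stu;

Sampling: select * from bucket_table tablesample(bucket 1 out of 4 on id); (for a table with 4 buckets on id, this returns the first bucket, about a quarter of the data)

select * from bucket_table tablesample(bucket 1 out of 2 on id); (for a table with 4 buckets, this returns 4/2 = 2 buckets, the 1st and the 3rd, about half of the data)

select * from bucket_table tablesample(bucket 1 out of 4 on rand()); (samples rows at random rather than by the bucketing column, returning roughly a quarter of the rows; this scans the entire table)

When data is loaded into a bucketed table, Hive hashes the bucketing column and takes the result modulo the number of buckets, then writes the row to the corresponding file. Any one bucket therefore holds an effectively random set of users.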
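The hash-then-modulo rule can be approximated in a query. This sketch assumes that the built-in hash() function matches the hash Hive uses when bucketing, which is an assumption rather than something the post states; it is meant only to illustrate the idea for the 4-bucket table t4.

```sql
-- Approximate bucket number (0 to 3) for each id if it were inserted into t4.
-- pmod keeps the result non-negative even when hash(id) is negative.
select id, pmod(hash(id), 4) as bucket_no
from t3
where province = 'beijing'
limit 10;
```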

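d) External Table: external table

(t5 does not have to live in the Hive warehouse; its storage location can be chosen freely, here the /wlan directory)

Create: create external table t5(id string) location '/wlan'; (/wlan is a directory)

The EXTERNAL keyword creates an external table. The data is managed outside Hive, and Hive manages only the metadata (the table definition). The data is therefore not moved into Hive's warehouse directory but lives in the external location. When you run drop table on it, the metadata is removed, but the data remains in the external location and is not deleted by Hive.

hive> create external table t1(id string) row format delimited fields terminated by '\t' location '/wlan'; (specifying the delimiter, here \t, lets Hive parse the data so that queries do not return NULL; /wlan is a directory, and it is best to name it after the table so that it is easy to find and manage)

create external table hadoop_1(id int,name string) row format delimited fields terminated by '\t' location '/wenjianjia';

load data inpath '/wenjianjia/hello' into table hadoop_1 ;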
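One way to check the behaviour described above is to inspect the table and then drop it. This is a sketch that assumes the hadoop_1 table and the /wenjianjia directory from the example exist; the drop is destructive for the metadata, so try it only on a test table.

```sql
-- "Table Type" should read EXTERNAL_TABLE and "Location" should point at /wenjianjia.
describe formatted hadoop_1;

-- Dropping the external table removes only its metadata.
drop table hadoop_1;

-- The data files should still be listed (dfs runs an HDFS command from the Hive CLI).
dfs -ls /wenjianjia;
```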

2) Copying an existing table definition

Create new_table with the same structure as user_table:

create table new_table like user_table;

3) Renaming a table

hive> alter table new_table rename to new_table_1;

4) Creating a partitioned table

Create:

create table t3(id string) partitioned by (province string);

Load:

load data local inpath '/root/Downloads/seq100w.txt' into table t3 partition(province ='beijing');

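2. Deleting tables

1) Clearing the data in a table

hadoop fs -rmr /… (simply delete the table's data files in HDFS; on Hadoop 2.x, -rmr is deprecated in favour of -rm -r)

Take care not to delete the table's own directory in HDFS along with the data.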
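Hive also provides a truncate table statement (available since Hive 0.11) that empties a table without touching HDFS by hand. A sketch, assuming t1 and t3 are the managed (non-external) tables created earlier:

```sql
-- Remove all rows from a managed table; the table definition is kept.
truncate table t1;

-- For a partitioned table, a single partition can be emptied instead.
truncate table t3 partition (province = 'beijing');
```

This avoids the risk of removing the table's directory by mistake, since only the data is cleared.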

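2) Dropping a table

drop table test1;

3) Dropping a partition (removes the partition and the data in it)

hive> alter table dm_newuser_active_month drop partition (batch_date="201404");

When dropping a partition, the batch_date value must be enclosed in quotation marks.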

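3. Modifying table information

1) Adding a column

hive> alter table test1 add columns(name string);

2) Changing a column

Note: change replaces the specified column of an existing table. It modifies the table schema, not the data.

alter table table_name change old_column_name new_column_name new_type comment 'remark';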
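As a concrete instance of the template, the sketch below renames the name column added to test1 in step 1). The new column name user_name and the comment text are hypothetical, chosen only for illustration.

```sql
-- Rename column name to user_name, keeping the string type.
alter table test1 change name user_name string comment 'user name';

-- Check the result: the column list should now show user_name instead of name.
desc test1;
```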

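3) Replacing all columns

Note: replace replaces all columns of an existing table. It modifies the table schema, not the data.

alter table table_name replace columns(age int comment 'only keep the first column');

4) Adding a partition

hive> alter table ods_smail_mx_201404 add partition (day=20140401); (adds a single partition on its own)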

create table user_table_2(

id int,

name string

)

comment 'user information table'

partitioned by(dt string)

stored as textfile;

insert overwrite table user_table_2

partition(dt='2015-11-01')

select id, col2 name

from table_4;

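4. Inspecting tables

1) Show the create statement

show create table tmp_jzl_20150310_diff;

2) Show the table structure

desc tmp_jzl_20150310_diff;

3) Show the table's partitions

show partitions tmp_jzl_20150310_diff;

4) List the tables in a database

hive> use tmp;

List all tables in the tmp database:

hive> show tables;

List the tables in tmp whose names start with tmp_jzl_20150504:

hive> show tables 'tmp_jzl_20150504*';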

tmp_jzl_20150504_1

tmp_jzl_20150504_2

tmp_jzl_20150504_3

tmp_jzl_20150504_4
