Big Data Study Notes: Hive

These notes continue the big data series, picking up after the earlier MapReduce notes.

################################

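Installing Hive: tar -zxvf hive.gz -C /app

Edit the configuration in hive-default.xml (site-specific overrides normally go in hive-site.xml)

Add the JDBC driver jar to hive/lib

Add the jline jar to Hadoop

Start HDFS and YARN

Start Hive from hive/bin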

hive> show databases; //list the databases

truncate table table_name; //empty the table

drop table table_name; //drop the table

create table User_b(id int,name string) //define the table schema
> row format delimited
> fields terminated by ',';

hadoop fs -put unicom.dat /user/hive/warehouse/bigdata.db/user_b //put a file that matches the table format into HDFS

select id,name from user_b order by name; //Hive stores the data in HDFS and runs MapReduce jobs to query it
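The data file uploaded above has to match the table definition: one row per line, with fields separated by the delimiter named in `fields terminated by`. A minimal Python sketch of producing such text follows. The helper name `to_hive_lines` and the sample rows are hypothetical, and the sketch assumes plain delimited text with no quoting or escaping.

```python
def to_hive_lines(records, delimiter=","):
    """Serialize (id, name) records as one delimited text line per row."""
    lines = []
    for rec_id, name in records:
        # Plain delimited text has no quoting, so a delimiter or newline
        # inside a field would shift the columns when Hive reads the file.
        if delimiter in name or "\n" in name:
            raise ValueError(f"field contains delimiter or newline: {name!r}")
        lines.append(f"{int(rec_id)}{delimiter}{name}")
    return "\n".join(lines) + "\n"


# Hypothetical sample rows for a table declared as (id int, name string).
sample_text = to_hive_lines([(1, "alice"), (2, "bob")])
print(sample_text, end="")
```

Writing `sample_text` to a local file gives something that can be uploaded with `hadoop fs -put` as shown above.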

Starting Hive as a service:

Start the server: bin/hiveserver2

Start the client: ./beeline

beeline> !connect jdbc:hive2://localhost:10000

Hive details:

hive> create external table ext_user_b(id int ,name string)
> row format delimited fields terminated by '\t' //field delimiter
> stored as textfile //store the data as plain text files
> location '/class03'; //storage location

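Loading data with DML:

hive> load data local inpath '/home/hadoop/aaa.dat' into table ext_user_b; //into appends; write overwrite into table to replace the existing data

drop table external_table; //dropping an external table does not delete the underlying files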

Partitions:

create table part_user_b(id int, name string)
partitioned by (country string) //partition column; its name must differ from the columns defined in the table
row format delimited
fields terminated by ',';

hive> load data local inpath '/home/hadoop/aaaChina.dat' into table part_user_b partition (country='China'); //into appends; write overwrite into table to replace

hive> load data local inpath '/home/hadoop/aaajapan.dat' into table part_user_b partition (country='Japan');

select * from part_user_b where country='Japan'; //use where to query a specific partition

Adding a partition:

alter table part_user_b add partition (country='america'); //use drop instead of add to remove a partition

show partitions part_user_b; //list the partitions

select * from part_user_b where country='China'; //query one partition
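Each partition is stored as its own subdirectory named `column=value` under the table directory, which is why a `where` clause on the partition column only has to read one directory. A small Python sketch of that layout follows. The helper `partition_dir` is hypothetical, and the table directory is an assumption that reuses the `bigdata.db` warehouse path seen earlier in these notes.

```python
import posixpath


def partition_dir(table_dir, **partition):
    """Build the HDFS directory for one partition of a table.

    Each partition column contributes one 'column=value' path segment,
    in the order the columns are given.
    """
    segments = [f"{column}={value}" for column, value in partition.items()]
    return posixpath.join(table_dir, *segments)


# Assumed location: the table lives in the bigdata database under the
# default warehouse directory.
table_dir = "/user/hive/warehouse/bigdata.db/part_user_b"

for country in ("China", "Japan"):
    print(partition_dir(table_dir, country=country))
```

Listing the table directory with `hadoop fs -ls` should show one such subdirectory per loaded partition.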

Buckets:

Populate a bucketed table from a query result.

Create the table:

create table buck_user_b(id int, name string)
clustered by (id)
into 4 buckets
row format delimited fields terminated by ',';

Settings:

hive> set hive.enforce.bucketing = true;

hive> set mapred.reduce.tasks=4; //number of reducers; it should equal the bucket count declared on the table

Insert the data:

insert into table buck_user_b
select id, name from student cluster by (id);

Or: select id, name from student distribute by (id) sort by (id); //equivalent to cluster by (id)
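To see how rows end up in the 4 buckets, here is a rough Python sketch of the assignment rule. It assumes the classic behaviour of older Hive releases, where the hash of an int column is the value itself and the bucket is that hash (sign bit masked off) modulo the bucket count. Newer releases can use a different hash function, so treat this as an illustration rather than a description of every version. The function name `bucket_for` and the sample ids are hypothetical.

```python
from collections import defaultdict


def bucket_for(int_key, num_buckets):
    """Return the bucket index for an int key under the assumed rule.

    Assumption: hash(int) is the int itself; the sign bit is masked off
    before taking the remainder so the index is never negative.
    """
    return (int_key & 0x7FFFFFFF) % num_buckets


# Distribute hypothetical ids 1..8 over the 4 buckets declared on the table.
buckets = defaultdict(list)
for row_id in range(1, 9):
    buckets[bucket_for(row_id, 4)].append(row_id)

for index in sorted(buckets):
    print(index, buckets[index])
```

Under this rule, rows with the same `id` always land in the same bucket file, which is what makes bucketed sampling and bucketed joins possible.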

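Saving query results:

create table new_table as select * from source_table;

insert overwrite table table_name select * from source_table; //use into instead of overwrite to append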

Join queries:

select * from a inner join b on a.id=b.id; //returns only the rows whose id exists in both tables, combined into one result

//a.id a.name b.id b.name

2 xx 2 yy

3 xx 3 rr

4 xx 4 dd

select * from a left join b on a.id=b.id; //keeps every row of a; the b columns are NULL where there is no match

//a.id a.name b.id b.name

1 xe NULL NULL

2 xx 2 yy

3 xx 3 rr

4 xx 4 dd

select * from a left semi join b on a.id=b.id; //like inner join, but returns only the columns of the left table

select * from a full join b on a.id=b.id; //returns all rows from both tables
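The four join types can be compared side by side with a small in-memory model. The sketch below is plain Python, not Hive. The rows of `a` and `b` are taken from the sample output in these notes, under the assumption that `b` contains nothing beyond the rows shown, and the function names are made up for illustration.

```python
# Rows reconstructed from the sample output: (id, name).
a = [(1, "xe"), (2, "xx"), (3, "xx"), (4, "xx")]
b = [(2, "yy"), (3, "rr"), (4, "dd")]


def inner_join(left, right):
    """Only pairs whose ids match."""
    return [l + r for l in left for r in right if l[0] == r[0]]


def left_join(left, right):
    """Every left row; right columns are None (NULL) without a match."""
    out = []
    for l in left:
        matches = [r for r in right if r[0] == l[0]]
        if matches:
            out.extend(l + r for r in matches)
        else:
            out.append(l + (None, None))
    return out


def left_semi_join(left, right):
    """Left rows that have a match; only left columns, each row at most once."""
    right_ids = {r[0] for r in right}
    return [l for l in left if l[0] in right_ids]


def full_join(left, right):
    """Left join plus the right rows that matched nothing."""
    left_ids = {l[0] for l in left}
    unmatched = [(None, None) + r for r in right if r[0] not in left_ids]
    return left_join(left, right) + unmatched


for name, rows in [("inner", inner_join(a, b)), ("left", left_join(a, b)),
                   ("left semi", left_semi_join(a, b)), ("full", full_join(a, b))]:
    print(name, rows)
```

With this particular data the full join equals the left join, because every row of `b` has a partner in `a`; adding a row such as `(5, "zz")` to `b` would make the difference visible.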

Exercise:
Find the students who study in the same department as 劉晨
hive> select s1.Sname from student s1 left semi join student s2 on s1.Sdept=s2.Sdept and s2.Sname='劉晨';

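Hive functions:

See the official documentation for the built-in functions.

A simple user-defined function (UDF) example:

Write the function in Eclipse, then package it as a jar.

hive> add jar /home/hadoop/document/weide_udf_touppercase.jar; //add the jar

hive> create temporary function toup as 'cn.unicom.bigdata.hive.func.ToUppercase'; //bind the function name to the class in the jar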

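Processing JSON data with functions:

Built-in function: get_json_object(line, '$.id') as newid; //the JSON path is a quoted string; add one call per field to extract

get_json_object can follow nested paths such as '$.a.b', but more complex multi-level JSON is usually handled with a user-defined function.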
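To make the path syntax concrete, the following Python sketch mimics what a call like `get_json_object(line, '$.id')` does for simple dotted paths. It is only an approximation written for these notes: the function name `get_json_field` is hypothetical, array indexes and wildcards are not handled, and it returns Python values where Hive would return strings.

```python
import json


def get_json_field(line, path):
    """Follow a simple '$.a.b' path through a JSON document.

    Returns None when the JSON is invalid or the path does not exist,
    mirroring the NULL that Hive returns in those cases.
    """
    if not path.startswith("$"):
        return None
    try:
        node = json.loads(line)
    except ValueError:
        return None
    for key in path[1:].split("."):
        if not key:  # skip the empty piece before the first dot
            continue
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Hypothetical input line with one nested object.
line = '{"id": 7, "user": {"name": "bob"}}'
print(get_json_field(line, "$.id"))
print(get_json_field(line, "$.user.name"))
print(get_json_field(line, "$.missing"))
```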

transform:

A Python script can be used as the processing program for the data.
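A transform script reads rows from standard input, one per line with tab-separated columns by default, and writes the transformed rows to standard output in the same shape. Below is a minimal sketch that upper-cases the `name` column. The function names and the two-column row layout are assumptions chosen to match the `user_b` table, and the demonstration at the bottom uses in-memory streams so the block can be run without Hive.

```python
import io
import sys


def transform_line(line):
    """Turn one 'id<TAB>name' row into 'id<TAB>NAME'."""
    rec_id, name = line.rstrip("\n").split("\t", 1)
    return f"{rec_id}\t{name.upper()}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Stream rows from stdin to stdout, skipping blank lines."""
    for line in stdin:
        if line.strip():
            stdout.write(transform_line(line) + "\n")


# Demonstration with in-memory streams. In a script handed to Hive,
# these lines would be replaced by a plain call to main().
demo_out = io.StringIO()
main(io.StringIO("1\talice\n2\tbob\n"), demo_out)
print(demo_out.getvalue(), end="")
```

Assuming the script is saved as `upper_name.py` (a hypothetical name), it could be registered with `add file` and then invoked along the lines of `select transform(id, name) using 'python upper_name.py' as (id, name) from user_b;`.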