Removing duplicate data in Hive

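Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides SQL-like querying.

Hive metadata storage: the metadata is usually kept in a relational database such as MySQL (recommended) or Derby (an embedded database).

Hive's components: interpreter, compiler, optimizer, and executor.

Hive looks like a SQL database, but its use cases are completely different: Hive is only suited to batch statistical analysis of data.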

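Hive tables are divided into internal tables and external tables.

When an internal table is dropped, the data in the table is deleted along with it.

When an external table is dropped, the table itself is removed, but the external table's data is not deleted.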
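The difference between the two table types can be sketched in HiveQL as follows. The table names `demo_managed` and `demo_external` and the path `/tmp/demo_external` are hypothetical, chosen only for illustration.

```sql
-- Internal (managed) table: Hive owns both the metadata and the data files.
CREATE TABLE demo_managed (key STRING, value STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- External table: Hive only records metadata; the files live at LOCATION.
-- '/tmp/demo_external' is an assumed path, not one from this article.
CREATE EXTERNAL TABLE demo_external (key STRING, value STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/tmp/demo_external';

-- Dropping the internal table removes the table definition and its data.
DROP TABLE demo_managed;

-- Dropping the external table removes only the table definition;
-- the files under /tmp/demo_external stay where they are.
DROP TABLE demo_external;
```

An external table is usually the safer choice when other jobs also read the same files, since dropping the table by mistake does not lose the data.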

Before using Hive you need to start the Hadoop cluster, because Hive depends on the Hadoop cluster to do its work (before Hive 2.0).

The following shows how to handle duplicate data in Hive.

First, create a test table.

Create-table statement: create table hive_jdbc_test (key string,value string) partitioned by (day string) row format delimited fields terminated by ',' stored as textfile;

Sample data:
　　uuid,hello=>0
　　uuid,hello=>0
　　uuid,hello=>1
　　uuid,hello=>1
　　uuid,hello=>2
　　uuid,hello=>2
　　uuid,hello=>3

Load the data into the 2018-1-1 partition.

Now deduplicate the data in the Hive table:
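One way to do the load, as a minimal sketch: it assumes the sample rows were saved to a local file at `/tmp/hive_jdbc_test.txt`, which is a hypothetical path not given in the article.

```sql
-- Copy the local file into the day='2018-1-1' partition of the test table.
-- The file path is an assumption; point it at wherever the sample rows are saved.
LOAD DATA LOCAL INPATH '/tmp/hive_jdbc_test.txt'
INTO TABLE hive_jdbc_test PARTITION (day='2018-1-1');

-- Inspect what was loaded; the comma delimiter splits each line into key and value.
SELECT key, value
FROM hive_jdbc_test
WHERE day='2018-1-1';
```

With the sample data, each line splits at the comma, so `key` is `uuid` and `value` is the `hello=>N` part.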

insert overwrite table hive_jdbc_test partition(day='2018-1-1')
select key, value
from (select key, value,
             row_number() over (partition by key, value order by value desc) as rn
      from hive_jdbc_test
      where day='2018-1-1') t
where t.rn = 1;

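After this statement runs, the duplicates are gone: the partition keeps exactly one row for each distinct (key, value) pair, which is four rows for the sample data.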
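To confirm the result, a duplicate check along these lines can be run before and after the overwrite. This is a sketch that was not part of the original walkthrough; the alias `cnt` is chosen only for illustration.

```sql
-- List every (key, value) pair that occurs more than once in the partition.
-- Before deduplication the sample data yields three such pairs;
-- afterwards the query should return no rows.
SELECT key, value, COUNT(*) AS cnt
FROM hive_jdbc_test
WHERE day='2018-1-1'
GROUP BY key, value
HAVING COUNT(*) > 1;
```

Because the duplicated rows here are identical in every column, a simpler alternative to the window function would be to overwrite the partition with `SELECT DISTINCT key, value`. The `row_number()` form is the more general one, since it also works when you need to keep a specific row per key, such as the latest one.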