Hive Optimization in Practice

Date: 2020-01-21


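Background

At work I needed to sync a table from a PostgreSQL (pg) database into Hive, using the open-source tool Sqoop. The business table holds the most recent year of data: 366,830,898 rows, 71 columns, and 110 GB of space in pg. The pg table has no unique primary key; the same id can appear multiple times, and this is a normal scenario that the business allows.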

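Analysis

Full sync
Syncing the entire pg table into a Hive partitioned table on every run is very slow. Other jobs depend on this table, so a slow sync delays downstream data output, and there is also a risk of blocking the source table. A full sync is therefore not suitable.
Incremental sync
The number of rows changed per day is at most 2 million, about 700 MB of data. A trial sync of a single day's data with Sqoop was fast. Sqoop needs a split column to divide the work among map tasks, so indexes were first added in pg on the query column and the split column. The incremental data then has to be merged with the previous day's full partition. Because the same id can have multiple rows in both the incremental table and the full table, the merge cannot be done with A left join B where b.id is not null, nor with left semi join.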
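
The incremental import described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the host pg-host, database biz_db, table biz_table, user etl_user, the update_date filter column, the password file path, and the mapper count are all hypothetical, and id is assumed to be the split column. The script prints the command instead of running it.

```shell
#!/bin/sh
# All hosts, databases, tables, columns, and paths below are hypothetical placeholders.
curr_date="2020-01-21"

# Free-form query: Sqoop replaces $CONDITIONS with a per-mapper range on the split column.
query="SELECT * FROM biz_table WHERE update_date = '${curr_date}' AND \$CONDITIONS"

# Dry run: print the command. Remove the leading `echo` to actually run Sqoop.
echo sqoop import \
  --connect "jdbc:postgresql://pg-host:5432/biz_db" \
  --username "etl_user" \
  --password-file "/user/etl/pg.password" \
  --query "$query" \
  --split-by id \
  --num-mappers 8 \
  --target-dir "/tmp/sqoop/biz_table_inc/${curr_date}" \
  --hive-import \
  --hive-table "b" \
  --hive-partition-key dt \
  --hive-partition-value "$curr_date"
```

In this sketch the indexes mentioned above would sit on update_date (the query column) and id (the split column), so that each mapper's range query stays cheap on the pg side.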

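Attempts

The NOT IN approach
Select the rows of the original Hive table whose id does not appear in the incremental table, then UNION ALL them with the latest incremental data.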

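# prev_date is assumed to hold the day before curr_date as a string;
# '$curr_date - 1' inside the quotes is not evaluated as date arithmetic.
hive -e "
insert overwrite table target_table partition (dt = '$curr_date')
select
    a.*
from a
where
    a.dt = '$prev_date'
    and a.id not in (select id from b where b.dt = '$curr_date')
union all
select
    *
from b
where b.dt = '$curr_date'
"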

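The query filters table a on the previous day's partition, so the calling script has to pass that date in as a plain string. A minimal sketch of one way to derive it, assuming GNU date (the -d option behaves differently in BSD/macOS date) and a yyyy-mm-dd partition format; the variable name prev_date is introduced here for illustration only:

```shell
#!/bin/sh
curr_date="2020-01-21"

# GNU date: parse curr_date and step back one calendar day.
prev_date=$(date -d "$curr_date 1 day ago" +%F)

echo "curr_date=$curr_date prev_date=$prev_date"
```

The resulting value can then be interpolated into the HiveQL string in the same way as curr_date.
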
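In theory this approach works, but in practice the job ends up producing a large number of small files.
1. Tried setting the number of reducers manually with set mapred.reduce.tasks = 64; in practice it had no effect.
2. Tried merging files at the map stage first, for example with the settings below. The number of map tasks did drop, but the number of reducers stayed the same. Likewise, enabling file merging after the reduce stage had no effect.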

set mapred.min.split.size=100000000;
set mapred.max.split.size=100000000;
set mapred.min.split.size.per.node=50000000;
set mapred.min.split.size.per.rack=50000000;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
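
The source does not list which settings were used for merging files after the reduce stage. For reference, the Hive properties usually used for this are sketched below; the sizes are illustrative values, not the author's, and the variable name merge_settings is hypothetical.

```shell
#!/bin/sh
# Settings to prepend to the HiveQL passed to `hive -e`; the sizes are illustrative.
merge_settings="
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
set hive.merge.smallfiles.avgsize=16000000;
"

# Dry run: show what would be sent. A real call would look like:
#   hive -e "$merge_settings insert overwrite table ..."
printf '%s' "$merge_settings"
```

hive.merge.mapfiles and hive.merge.mapredfiles turn on merging at the end of map-only and map-reduce jobs respectively. hive.merge.smallfiles.avgsize is the average output file size below which Hive launches the extra merge step, and hive.merge.size.per.task is the target size of the merged files.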

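The NOT EXISTS approach
Rewriting NOT IN as NOT EXISTS, as shown below, launches 64 reducers for the reduce stage and runs faster than the flow above. However, mild data skew appeared: some nodes ran slowly, which dragged down the whole job, and the output files differed widely in size.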

# prev_date is assumed to hold the day before curr_date as a string.
hive -e "
set mapred.reduce.tasks = 64;
insert overwrite table target_table partition (dt = '$curr_date')
select
    a.*
from a
where
    a.dt = '$prev_date'
    and not exists (select 1 from b where b.dt = '$curr_date' and a.id = b.id)
union all
select
    *
from b
where b.dt = '$curr_date'
"

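The merge rule that both the NOT IN and the NOT EXISTS queries express is: keep every previous-day row whose id is absent from the increment, then append every incremental row, duplicates included. It can be checked on toy data. The sketch below uses awk rather than Hive, and the sample rows are invented for illustration.

```shell
#!/bin/sh
# Toy rows in "id,value" form; ids may repeat in both inputs.
prev_full="1,a
1,b
2,c
3,d"
increment="1,x
1,y
4,z"

merged=$(
  {
    # Tag incremental rows N and previous-day rows O; N rows are read first.
    printf '%s\n' "$increment" | sed 's/^/N,/'
    printf '%s\n' "$prev_full" | sed 's/^/O,/'
  } | awk -F, '
    $1 == "N" { changed[$2] = 1; new_rows[++n] = $2 "," $3; next }
    !($2 in changed) { print $2 "," $3 }               # anti-join on id
    END { for (i = 1; i <= n; i++) print new_rows[i] } # union all
  '
)
printf '%s\n' "$merged"
```

Both old rows for id 1 are dropped and both new rows for id 1 are kept, which is why a join that pairs rows by id would not give the same result: it would multiply the duplicated rows instead.
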
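The updated SQL below adds DISTRIBUTE BY rand(). DISTRIBUTE BY controls how map output is divided among the reducers: Hive assigns each row to a reducer based on the columns after DISTRIBUTE BY and the number of reducers, by default using a hash. rand() returns a random number between 0 and 1, and the rows are partitioned by that number. Because the value is random for every row, the data ends up spread very evenly across the reducers. With the settings below there are 64 reducers, each holding almost the same amount of data.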

# prev_date is assumed to hold the day before curr_date as a string.
hive -e "
set mapred.reduce.tasks = 64;
insert overwrite table target_table partition (dt = '$curr_date')
select
    a.*
from a
where
    a.dt = '$prev_date'
    and not exists (select 1 from b where b.dt = '$curr_date' and a.id = b.id)
union all
select
    *
from b
where b.dt = '$curr_date'
DISTRIBUTE BY rand();
"

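To see why distributing by a random value evens out reducer input, the following sketch simulates the two strategies with awk. It is not Hive code: id modulo the reducer count stands in for hashing a key, the row counts and the 8 reducers are made-up numbers, and half of the rows deliberately share one hot id.

```shell
#!/bin/sh
# Simulate 10,000 rows spread over 8 reducers; every second row uses hot id 1.
reducers=8

counts=$(awk -v n="$reducers" 'BEGIN {
  srand(42)
  for (i = 1; i <= 10000; i++) {
    id = (i % 2 == 0) ? 1 : i
    by_key[id % n]++              # stand-in for hash(id) mod reducers
    by_rand[int(rand() * n)]++    # stand-in for DISTRIBUTE BY rand()
  }
  for (r = 0; r < n; r++) {
    if (by_key[r] > max_key) max_key = by_key[r]
    if (by_rand[r] > max_rand) max_rand = by_rand[r]
  }
  print max_key, max_rand
}')

max_key=${counts% *}
max_rand=${counts#* }
echo "largest reducer input when distributing by key:  $max_key"
echo "largest reducer input when distributing by rand: $max_rand"
```

The busiest reducer receives far more rows when distribution follows the skewed key than when it follows the random value. This matches the behaviour described above: fewer stragglers and output files of similar size.
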
The result is as follows: (figure not available)