I wrote a Hive SQL query that ran extremely slowly: after more than an hour the job had only finished the map phase, and the reduce phase was stuck at 20%.
The Hive statement was as follows:
```sql
select count(distinct ip)
from (
  select ip as ip from comprehensive.f_client_boot_daily where year="2013" and month="10"
  union all
  select pub_ip as ip from f_app_boot_daily where year="2013" and month="10"
  union all
  select ip as ip from format_log.format_pv1 where year="2013" and month="10" and url_first_id=1
) d;
Analysis: the subquery `select ip as ip from comprehensive.f_client_boot_daily where year="2013" and month="10"` returns roughly 1 billion rows, `select pub_ip as ip from f_app_boot_daily where year="2013" and month="10"` roughly another 1 billion, and `select ip as ip from format_log.format_pv1 where year="2013" and month="10" and url_first_id=1` roughly 1 billion more, for a total of about 3 billion rows. With `count(distinct ...)` over a data set this large, Hive shuffles all of the rows to a single reducer, so that one reducer is severely skewed and becomes the bottleneck.
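One way to confirm this before running the job is to look at the query plan with Hive's `EXPLAIN` statement: for a global `count(distinct ...)` the plan shows a single reduce stage whose aggregation cannot be parallelized. A minimal sketch (reusing the tables from this post; the exact plan output depends on your Hive version):

```sql
-- Show the execution plan instead of running the query.
-- In the plan, look for the Reduce Operator Tree of the final stage:
-- a global count(DISTINCT ip) is computed by one reducer, which is
-- why the job stalls at the reduce phase on large inputs.
explain
select count(distinct ip)
from (
  select ip as ip from comprehensive.f_client_boot_daily where year="2013" and month="10"
  union all
  select pub_ip as ip from f_app_boot_daily where year="2013" and month="10"
  union all
  select ip as ip from format_log.format_pv1 where year="2013" and month="10" and url_first_id=1
) d;
```

Comparing this plan with the plan of the rewritten `group by` query makes the difference visible: the `group by` version distributes the deduplication step across many reducers.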
Solution:
First, replace `distinct` with a `group by` on `ip`, so the deduplication is spread across reducers. The rewritten SQL is:
```sql
select count(*)
from (
  select ip
  from (
    select ip as ip from comprehensive.f_client_boot_daily where year="2013" and month="10"
    union all
    select pub_ip as ip from f_app_boot_daily where year="2013" and month="10"
    union all
    select ip as ip from format_log.format_pv1 where year="2013" and month="10" and url_first_id=1
  ) d
  group by ip
) b;
```
Then, set a reasonable number of reducers so the data is spread across many machines: `set mapred.reduce.tasks=50;`
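Putting the two changes together, the full optimized script looks roughly like this (a sketch based on the statements above; 50 reducers is the value used in this post, and the right number for your cluster depends on data volume and slot capacity):

```sql
-- Spread the group-by stage across 50 reducers instead of letting
-- a single reducer handle all ~3 billion rows.
set mapred.reduce.tasks=50;

-- Deduplicate with GROUP BY (parallel across reducers), then count
-- the groups, instead of a global count(DISTINCT ip).
select count(*)
from (
  select ip
  from (
    select ip as ip from comprehensive.f_client_boot_daily where year="2013" and month="10"
    union all
    select pub_ip as ip from f_app_boot_daily where year="2013" and month="10"
    union all
    select ip as ip from format_log.format_pv1 where year="2013" and month="10" and url_first_id=1
  ) d
  group by ip
) b;
```

Note that `set` only affects the current session, so it must be issued before the query in the same script or CLI session.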
After these optimizations the speedup was dramatic: the whole job finished in roughly 20 minutes.