Hive query optimization: implementing distinct via group by (repost)

 


       I wrote a Hive SQL query whose execution was extremely slow: after more than an hour the job had only finished the map phase, and the reduce phase was at 20%.
The Hive query was as follows:

select count(distinct ip)
from (
    select ip as ip from comprehensive.f_client_boot_daily where year="2013" and month="10"
    union all
    select pub_ip as ip from f_app_boot_daily where year="2013" and month="10"
    union all
    select ip as ip from format_log.format_pv1 where year="2013" and month="10" and url_first_id=1
) d


       Analysis: select ip as ip from comprehensive.f_client_boot_daily where year="2013" and month="10" returns roughly 1 billion rows, select pub_ip as ip from f_app_boot_daily where year="2013" and month="10" roughly 1 billion rows, and select ip as ip from format_log.format_pv1 where year="2013" and month="10" and url_first_id=1 roughly another 1 billion rows, for about 3 billion rows in total. With data at this scale, count(distinct) shuffles all of the rows to a single reducer, causing severe data skew on that reducer.
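The skew can be illustrated with a toy shuffle model (a sketch only: the reducer count, fake IPs, and hash partitioning below are illustrative assumptions, not Hive internals). With a single-stage count(distinct), every row effectively shares one reduce key, while group by ip partitions rows across all reducers by a hash of the key.

```python
from collections import Counter

# Toy model of the MapReduce shuffle (assumption: simplified, not Hive's
# actual execution plan). 10,000 fake IP rows, 50 reducers.
NUM_REDUCERS = 50
rows = [f"10.0.{i % 256}.{i % 100}" for i in range(10_000)]

# count(distinct ip) in one stage: all rows share one reduce key,
# so reducer 0 receives every record.
distinct_load = Counter(0 for _ in rows)

# group by ip: rows are partitioned by hash(ip) across all reducers.
groupby_load = Counter(hash(ip) % NUM_REDUCERS for ip in rows)

print(max(distinct_load.values()))  # 10000: one reducer gets every row
print(max(groupby_load.values()))   # a few hundred at most: load is spread out
```

Under the group-by plan each reducer counts only its own share of the keys, which is also why raising the reducer count (as done below) actually helps.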
       Solution:
       First, use group by to group the data by ip. The rewritten SQL is as follows:

select count(*)
from (
    select ip
    from (
        select ip as ip from comprehensive.f_client_boot_daily where year="2013" and month="10"
        union all
        select pub_ip as ip from f_app_boot_daily where year="2013" and month="10"
        union all
        select ip as ip from format_log.format_pv1 where year="2013" and month="10" and url_first_id=1
    ) d
    group by ip
) b
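Note that the rewrite is exact, not an approximation: counting the rows of the grouped subquery gives the same answer as count(distinct ip), because group by produces exactly one output row per distinct key. A minimal Python sketch of the equivalence (the sample IPs are made up for illustration):

```python
# Sketch of why the two queries agree (sample data is made up).
ips = ["1.1.1.1", "2.2.2.2", "1.1.1.1", "3.3.3.3", "2.2.2.2"]

# select count(distinct ip) ...
count_distinct = len(set(ips))

# select count(*) from (select ip ... group by ip) b
grouped = {}
for ip in ips:
    grouped[ip] = grouped.get(ip, 0) + 1  # one group per distinct ip
count_grouped = len(grouped)

print(count_distinct, count_grouped)  # 3 3
```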

       Then, set a reasonable number of reducers so the data is spread across multiple machines: set mapred.reduce.tasks=50;
       After this optimization the speedup was dramatic: the entire job finished in only about 20 minutes.
