I wrote a Hive SQL query that ran extremely slowly: after more than an hour the job had only finished the map phase, and the reduce phase was stuck at 20%.
The Hive statement was as follows:
```sql
select count(distinct ip)
from (
  select ip as ip from comprehensive.f_client_boot_daily where year="2013" and month="10"
  union all
  select pub_ip as ip from f_app_boot_daily where year="2013" and month="10"
  union all
  select ip as ip from format_log.format_pv1 where year="2013" and month="10" and url_first_id=1
) d;
Analysis: the subquery `select ip as ip from comprehensive.f_client_boot_daily where year="2013" and month="10"` returns roughly 1 billion rows, `select pub_ip as ip from f_app_boot_daily where year="2013" and month="10"` roughly another 1 billion, and `select ip as ip from format_log.format_pv1 where year="2013" and month="10" and url_first_id=1` roughly 1 billion more, for a total of about 3 billion rows. With `count(distinct ...)` over a data set this large, Hive shuffles all of the rows to a single reducer, so that one reducer is severely skewed and becomes the bottleneck.
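One way to confirm this before running the job is to look at the query plan with Hive's `EXPLAIN` statement: for a global `count(distinct ...)` the plan shows a single reduce stage whose aggregation cannot be parallelized. A minimal sketch (reusing the tables from this post; the exact plan output depends on your Hive version):

```sql
-- Show the execution plan instead of running the query.
-- In the plan, look for the Reduce Operator Tree of the final stage:
-- a global count(DISTINCT ip) is computed by one reducer, which is
-- why the job stalls at the reduce phase on large inputs.
explain
select count(distinct ip)
from (
  select ip as ip from comprehensive.f_client_boot_daily where year="2013" and month="10"
  union all
  select pub_ip as ip from f_app_boot_daily where year="2013" and month="10"
  union all
  select ip as ip from format_log.format_pv1 where year="2013" and month="10" and url_first_id=1
) d;
```

Comparing this plan with the plan of the rewritten `group by` query makes the difference visible: the `group by` version distributes the deduplication step across many reducers.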
Solution:
First, replace `distinct` with a `group by` on `ip`, so the deduplication is spread across reducers. The rewritten SQL is:
```sql
select count(*)
from (
  select ip
  from (
    select ip as ip from comprehensive.f_client_boot_daily where year="2013" and month="10"
    union all
    select pub_ip as ip from f_app_boot_daily where year="2013" and month="10"
    union all
    select ip as ip from format_log.format_pv1 where year="2013" and month="10" and url_first_id=1
  ) d
  group by ip
) b;
```
Then, set a reasonable number of reducers so the data is spread across many machines: `set mapred.reduce.tasks=50;`
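Putting the two changes together, the full optimized script looks roughly like this (a sketch based on the statements above; 50 reducers is the value used in this post, and the right number for your cluster depends on data volume and slot capacity):

```sql
-- Spread the group-by stage across 50 reducers instead of letting
-- a single reducer handle all ~3 billion rows.
set mapred.reduce.tasks=50;

-- Deduplicate with GROUP BY (parallel across reducers), then count
-- the groups, instead of a global count(DISTINCT ip).
select count(*)
from (
  select ip
  from (
    select ip as ip from comprehensive.f_client_boot_daily where year="2013" and month="10"
    union all
    select pub_ip as ip from f_app_boot_daily where year="2013" and month="10"
    union all
    select ip as ip from format_log.format_pv1 where year="2013" and month="10" and url_first_id=1
  ) d
  group by ip
) b;
```

Note that `set` only affects the current session, so it must be issued before the query in the same script or CLI session.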
After these optimizations the speedup was dramatic: the whole job finished in roughly 20 minutes.