Hive SQL optimization: data skew

Date: 2020-04-18

Tags: hive, sql, optimization, data skew. Category: Hadoop


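This script runs slowly, mainly because of data skew on the reduce side. The table dw.fct_traffic_navpage_path_detl collects user click data, and clicks that end in a cart addition or an order are inevitably rare. As a result, the number of rows in which ordr_code is empty or cart_prod_id is NULL is extremely large, as shown below: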

select ordr_code, count(*) as a from dw.fct_traffic_navpage_path_detl where ds = '2015-05-10' group by ordr_code having a > 10000;

(empty string) 151722135

select cart_prod_id, count(*) as a from dw.fct_traffic_navpage_path_detl where ds = '2015-05-10' group by cart_prod_id having a > 10000;

NULL 127233335
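The two checks above can also be combined into a single pass that puts the skewed rows next to the partition total. This is only a sketch: it assumes ordr_code is a string column in which "empty" means either NULL or '', which the original does not state explicitly.

```sql
-- Sketch: share of rows with an empty/NULL join key in one partition.
-- Assumption: an "empty" ordr_code is either NULL or the empty string.
select
  count(*) as total_rows,
  sum(case when ordr_code is null or ordr_code = '' then 1 else 0 end) as empty_ordr_code_rows,
  sum(case when cart_prod_id is null then 1 else 0 end) as null_cart_prod_id_rows
from dw.fct_traffic_navpage_path_detl
where ds = '2015-05-10';
```

If the empty or NULL counts are close to total_rows, almost every row hashes to the same reducer when these columns are used as join keys.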

For the statement beginning with create table tmp_lifan_trfc_tpa as, BI added the following settings:

-- dw.univ_parnt_tranx_comb_detl never exceeds 120 MB.
-- On Hive on Tez, use hive.auto.convert.join.noconditionaltask.size instead, so that Tez generates a BROADCAST edge.
set hive.mapjoin.smalltable.filesize = 120000000;

set hive.auto.convert.join = true;
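For Hive on Tez, the equivalent configuration might look like the sketch below. The 120000000 threshold is simply carried over from the MapReduce setting above and is an assumption. On Tez this property bounds the combined size of the small tables in a join, so the right value may differ.

```sql
-- Sketch for Hive on Tez (the threshold value is an assumption carried over from above).
set hive.auto.convert.join = true;
-- Convert directly to a map join without a conditional backup task.
set hive.auto.convert.join.noconditionaltask = true;
-- Upper bound, in bytes, on the total size of the small tables to be broadcast.
set hive.auto.convert.join.noconditionaltask.size = 120000000;
```

Running explain on the statement afterwards and looking for a Map Join Operator is one way to confirm that the conversion took effect.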

The following part of the SQL was modified as well:

from dw.fct_traffic_navpage_path_detl t
-- use a map join to resolve the data skew
left outer join dw.univ_parnt_tranx_comb_detl o
  on t.ordr_code = o.parnt_ordr_code
  and t.cart_prod_id = o.comb_prod_id
  and o.ds = '2015-05-10'
-- modified part: after the join above, o.end_user_id is mostly NULL and therefore skewed,
-- so NULL keys are replaced with random numbers to spread them across reducers
left outer join bic.cust_first_ordr_tranx f
  on case when o.end_user_id is null then cast(rand(9) * 100 as bigint) else o.end_user_id end = f.end_user_id
  and f.first_ordr_date_id = '2015-05-10'
where t.ds = '2015-05-10';

After these changes, the SQL completes within an acceptable time.
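One caveat with the random-key replacement: cast(rand(9)*100 as bigint) yields values from 0 to 99, so a row whose key was NULL would match a real row in bic.cust_first_ordr_tranx if any end_user_id falls in that range. The variant below is a sketch rather than the author's method. It assumes end_user_id is a bigint and that real IDs are never negative (neither is stated in the original), and it maps NULL keys to negative values instead. The select count(*) is only a placeholder, because the original select list is not shown.

```sql
-- Sketch: spread NULL keys over -100..-1 so they cannot match a real end_user_id.
-- Assumptions: end_user_id is bigint and real IDs are non-negative.
select count(*)
from dw.fct_traffic_navpage_path_detl t
left outer join dw.univ_parnt_tranx_comb_detl o
  on t.ordr_code = o.parnt_ordr_code
  and t.cart_prod_id = o.comb_prod_id
  and o.ds = '2015-05-10'
left outer join bic.cust_first_ordr_tranx f
  on case when o.end_user_id is null
          then -1 - cast(rand(9) * 100 as bigint)
          else o.end_user_id
     end = f.end_user_id
  and f.first_ordr_date_id = '2015-05-10'
where t.ds = '2015-05-10';
```

Because this is a left outer join, the rows with substituted keys still appear in the result with NULL columns from f. The substitution only changes which reducer they are sent to.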
