Hive SQL Optimization Tips, Summarized by a Colleague

Date: 2019-11-07


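Hive is a tool that parses strings conforming to SQL syntax and generates MapReduce jobs that can run on Hadoop.

When using Hive, design your SQL around the characteristics of distributed computing as far as possible. This differs from working with a traditional relational database, so some habits of thought formed while developing against relational databases need to be dropped.

Basic principles: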

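1: Filter data as much and as early as possible to reduce the data volume at every stage. For partitioned tables, add the partition condition, and select only the columns that are actually needed.

    select ... from A
    join B
    on A.key = B.key
    where A.userid > 10
    and B.userid < 10
    and A.dt = '20120417'
    and B.dt = '20120417';

should be rewritten as: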

    select .... from (select .... from A
        where dt = '20120417'
        and userid > 10
    ) a
    join (select .... from B
        where dt = '20120417'
        and userid < 10
    ) b
    on a.key = b.key;

2: Keep operations as atomic as possible, and avoid packing complex logic into a single SQL statement.

Intermediate tables can be used to carry out the complex logic:

    drop table if exists tmp_table_1;
    create table if not exists tmp_table_1 as
    select ......;

    drop table if exists tmp_table_2;
    create table if not exists tmp_table_2 as
    select ......;

    drop table if exists result_table;
    create table if not exists result_table as
    select ......;

    drop table if exists tmp_table_1;
    drop table if exists tmp_table_2;

3: Keep the number of jobs launched by a single SQL statement to 5 or fewer where possible.
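One way to check how many jobs a statement is likely to launch is to look at its plan before running it. The sketch below uses Hive's EXPLAIN on a query over the tables A and B from principle 1. The column names and the aggregation are hypothetical, chosen only for illustration.

```sql
-- Hypothetical columns; substitute a real query.
-- EXPLAIN prints the plan without running the query.
EXPLAIN
SELECT a.userid, count(*) AS cnt
FROM A a
JOIN B b ON a.key = b.key
WHERE a.dt = '20120417'
  AND b.dt = '20120417'
GROUP BY a.userid;
```

The STAGE DEPENDENCIES section of the output lists the stages. Counting the ones marked as Map Reduce in STAGE PLANS gives a rough job count, since some stages (for example move or fetch stages) are not MapReduce jobs.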

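4: Use mapjoin with caution. As a rule, only tables with fewer than 2,000 rows and smaller than 1 MB should be used with it (these limits can be raised somewhat after the cluster is expanded). Take care to put the small table on the left side of the join (at present, many queries at TCL put the small table on the right). Otherwise the join consumes large amounts of disk and memory.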
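As a minimal sketch of the mapjoin advice above, the query below uses Hive's MAPJOIN hint and keeps the small table on the left. The tables small_dim and big_fact and their columns are hypothetical.

```sql
-- Hypothetical tables: small_dim is well under the size limits
-- given above, big_fact is a large partitioned table.
-- The MAPJOIN hint asks Hive to load small_dim into memory and
-- join it on the map side, which avoids a reduce phase.
SELECT /*+ MAPJOIN(s) */ s.key, s.name, b.value
FROM small_dim s
JOIN big_fact b ON s.key = b.key
WHERE b.dt = '20120417';
```

Depending on the Hive version and configuration, the hint may be ignored in favour of automatic conversion (controlled by settings such as hive.auto.convert.join), so it is worth confirming in the EXPLAIN output that a map join is actually planned.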

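5: Before writing SQL, understand the characteristics of the data itself. If the query has join or group by operations, check whether they will cause data skew.

If data skew occurs, handle it as follows:

    -- increase the number of reducers
    set hive.exec.reducers.max=200;
    set mapred.reduce.tasks=200;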

    -- number of rows after which map-side aggregation checks the size of its grouping keys; set according to the actual data volume
    set hive.groupby.mapaggr.checkinterval=100000;

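    -- set to true if the skew occurs during group by
    set hive.groupby.skewindata=true;

    -- a join key with more rows than this value is treated as skewed and split off; set according to the actual data volume
    set hive.skewjoin.key=100000;

    -- set to true if the skew occurs during join
    set hive.optimize.skewjoin=true;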
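To judge whether a join or group by key is skewed before tuning these settings, a simple row count per key is usually enough. The sketch below assumes the table A, its key column and its dt partition from principle 1. The limit of 20 is an arbitrary choice.

```sql
-- List the most frequent values of the join/group key.
-- A few keys with counts far above the rest indicate skew.
SELECT key, count(*) AS cnt
FROM A
WHERE dt = '20120417'
GROUP BY key
ORDER BY cnt DESC
LIMIT 20;
```

Comparing the top counts with a candidate value for hive.skewjoin.key shows which keys would be treated as skewed.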

6: If a union all has more than two branches, or each branch handles a large amount of data, split it into multiple insert into statements. In actual testing, this improved execution time by 50%.

    insert overwrite table tablename partition (dt= ....)
    select ..... from (
        select ... from A
        union all
        select ... from B
        union all
        select ... from C
    ) R
    where ...;

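This can be rewritten as follows. Because insert into appends rather than overwrites, the target partition must be empty beforehand for the result to match the insert overwrite version:

    insert into table tablename partition (dt= ....)
    select .... from A
    WHERE ...;

    insert into table tablename partition (dt= ....)
    select .... from B
    WHERE ...;

    insert into table tablename partition (dt= ....)
    select .... from C
    WHERE ...;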
