Take the following example: the target table is a partitioned table, dt is the partition key, and each partition holds roughly 1,000,000 rows. To get the row count for the previous day's partition (computed from the current date), the SQL is:
select count(house_id) from ods_edw_house.ods_t_second_house where dt = regexp_replace(date_sub(to_date(from_unixtime(unix_timestamp())),1),'-','')
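A plan like the one shown below can be reproduced by prefixing the statement with EXPLAIN; a minimal sketch, assuming the Hive CLI:

-- EXPLAIN prints the query plan without launching the MapReduce job
explain
select count(house_id) from ods_edw_house.ods_t_second_house
where dt = regexp_replace(date_sub(to_date(from_unixtime(unix_timestamp())),1),'-','');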
The execution plan is as follows:
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: ods_t_second_house
            Statistics: Num rows: 350710533 Data size: 223764328176 Basic stats: PARTIAL Column stats: PARTIAL
            Filter Operator
              predicate: (dt = regexp_replace(date_sub(to_date(from_unixtime(unix_timestamp())), 1), '-', '')) (type: boolean)
              Statistics: Num rows: 175355266 Data size: 111882163768 Basic stats: COMPLETE Column stats: PARTIAL
              Select Operator
                expressions: house_id (type: int)
                outputColumnNames: house_id
                Statistics: Num rows: 175355266 Data size: 111882163768 Basic stats: COMPLETE Column stats: PARTIAL
                Group By Operator
                  aggregations: count(house_id)
                  mode: hash
                  outputColumnNames: _col0
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: PARTIAL
                  File Output Operator
                    compressed: false
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        Group By Operator
          aggregations: count(_col0)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: PARTIAL
          ListSink
The plan shows more than 300 million rows being read, which means every partition is scanned; execution was extremely slow and the cluster's resources were exhausted. Suspecting this was related to how the value was passed into the WHERE condition, the SQL was rewritten:
select count(house_id) from ods_edw_house.ods_t_second_house where dt = '20161213';
Looking at the execution plan again, it has changed:
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: ods_t_second_house
            filterExpr: (dt = '20161213') (type: boolean)
            Statistics: Num rows: 1464007 Data size: 928111944 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: house_id (type: int)
              outputColumnNames: house_id
              Statistics: Num rows: 1464007 Data size: 928111944 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: count(house_id)
                mode: hash
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        Group By Operator
          aggregations: count(_col0)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
          ListSink
It is clear that adjusting the WHERE condition drastically changed the amount of data scanned, from over 350 million rows down to about 1.46 million (a single partition); efficiency improved by orders of magnitude, and the query consumed almost no cluster resources. The root cause lies in the original predicate:
where dt = regexp_replace(date_sub(to_date(from_unixtime(unix_timestamp())),1),'-','')
unix_timestamp() is non-deterministic: its value can change while the data is being scanned, so Hive cannot evaluate the predicate to a constant at compile time, and the planner falls back to scanning every partition. By contrast, dt = '20161213' is a fixed literal, so partition pruning happens at compile time and the full scan is avoided.
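If hard-coding the date is undesirable, one workaround is to compute the partition value before the query is compiled and inject it as a constant. A minimal sketch, assuming the Hive CLI and its hivevar substitution mechanism (the variable name dt_val is illustrative); because ${hivevar:...} references are expanded textually before parsing, the optimizer sees a plain string literal and can still prune partitions:

-- compute or pass in yesterday's date beforehand (e.g., from a scheduler),
-- then set it as a Hive variable
set hivevar:dt_val=20161213;
-- the reference below is replaced with the literal before compilation,
-- so partition pruning works exactly as in the hard-coded version
select count(house_id) from ods_edw_house.ods_t_second_house where dt = '${hivevar:dt_val}';

As an additional guardrail, set hive.mapred.mode=strict; makes Hive reject queries against partitioned tables whose WHERE clause cannot be used for partition pruning, which helps surface this class of problem before it hits the cluster (exact behavior varies by Hive version).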