Hive query performance: variables in WHERE conditions

Consider the following example. The target table is partitioned, with dt as the partition key and roughly 1,000,000 rows per partition. The goal is to count the rows in the current date's partition:

select count(house_id) from ods_edw_house.ods_t_second_house 
where dt = regexp_replace(date_sub(to_date(from_unixtime(unix_timestamp())),1),'-','');

The execution plan looks like this:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: ods_t_second_house
            Statistics: Num rows: 350710533 Data size: 223764328176 Basic stats: PARTIAL Column stats: PARTIAL
            Filter Operator
              predicate: (dt = regexp_replace(date_sub(to_date(from_unixtime(unix_timestamp())), 1), '-', '')) (type: boolean)
              Statistics: Num rows: 175355266 Data size: 111882163768 Basic stats: COMPLETE Column stats: PARTIAL
              Select Operator
                expressions: house_id (type: int)
                outputColumnNames: house_id
                Statistics: Num rows: 175355266 Data size: 111882163768 Basic stats: COMPLETE Column stats: PARTIAL
                Group By Operator
                  aggregations: count(house_id)
                  mode: hash
                  outputColumnNames: _col0
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: PARTIAL
                  File Output Operator
                    compressed: false
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        Group By Operator
          aggregations: count(_col0)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: PARTIAL
          ListSink

The plan shows more than 350 million rows being scanned, i.e. a scan across all partitions. The query runs slowly and exhausts cluster resources. Suspecting that the computed value in the WHERE condition was to blame, I rewrote the SQL:

select count(house_id) from ods_edw_house.ods_t_second_house 
where dt = '20161213';

The execution plan changes accordingly:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: ods_t_second_house
            filterExpr: (dt = '20161213') (type: boolean)
            Statistics: Num rows: 1464007 Data size: 928111944 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: house_id (type: int)
              outputColumnNames: house_id
              Statistics: Num rows: 1464007 Data size: 928111944 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: count(house_id)
                mode: hash
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        Group By Operator
          aggregations: count(_col0)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
          ListSink

The difference is obvious: with the adjusted WHERE condition, the amount of data scanned drops from over 350 million rows to about 1.46 million, the query runs orders of magnitude faster, and cluster resource consumption is negligible. The root cause lies in the original predicate:

where dt = regexp_replace(date_sub(to_date(from_unixtime(unix_timestamp())),1),'-','')

unix_timestamp() is non-deterministic: its value changes each time it is evaluated, so Hive cannot fold the predicate into a constant at compile time and therefore cannot prune partitions; the plan falls back to scanning every partition. With the literal dt = '20161213', the filter is a constant, so it is pushed into the TableScan (note the filterExpr line in the second plan) and only the matching partition is read.
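If the cutoff date must still be computed dynamically, a common workaround (a sketch under my own assumptions, not from the original article) is to compute the date string outside Hive and pass it in as a plain literal, for example from a shell wrapper:

```shell
#!/bin/sh
# Compute yesterday's date as yyyyMMdd outside Hive, so that the
# partition predicate reaching the compiler is a constant literal.
# Note: `date -d` is GNU coreutils syntax; BSD/macOS date differs.
dt=$(date -d "1 day ago" +%Y%m%d)

# Build the query with the literal substituted in. On a real cluster
# you would execute it with something like:
#   hive -e "$sql"
# or pass it as a variable:
#   hive --hivevar dt="$dt" -e "select ... where dt = '${hivevar:dt}'"
sql="select count(house_id) from ods_edw_house.ods_t_second_house where dt = '${dt}'"

echo "$sql"
```

Because the value arrives in the query text as a fixed string, the compiler sees the same constant predicate as in the dt = '20161213' version and can prune partitions normally.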
