SELECT uid, SUM(COUNT) FROM logs GROUP BY uid;
hive> SELECT * FROM logs; a 蘋果 5 a 橙子 3 a 蘋果 2 b 燒雞 1 hive> SELECT uid, SUM(COUNT) FROM logs GROUP BY uid; a 10 b 1
默認設置了hive.map.aggr=true,因此會在mapper端先group by一次,最後再把結果merge起來,爲了減小reducer處理的數據量。注意看explain的mode是不同的。mapper是hash,reducer是mergepartial。若是把hive.map.aggr=false,那將groupby放到reducer才作,他的mode是complete.html
hive> explain SELECT uid, sum(count) FROM logs group by uid; OK ABSTRACT SYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME logs))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL uid)) (TOK_SELEXPR (TOK_FUNCTION sum (TOK_TABLE_OR_COL count)))) (TOK_GROUPBY (TOK_TABLE_OR_COL uid)))) STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stage STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: logs TableScan // 掃描表 alias: logs Select Operator //選擇字段 expressions: expr: uid type: string expr: count type: int outputColumnNames: uid, count Group By Operator //這裏是由於默認設置了hive.map.aggr=true,會在mapper先作一次聚合,減小reduce須要處理的數據 aggregations: expr: sum(count) //彙集函數 bucketGroup: false keys: //鍵 expr: uid type: string mode: hash //hash方式,processHashAggr() outputColumnNames: _col0, _col1 Reduce Output Operator //輸出key,value給reducer key expressions: expr: _col0 type: string sort order: + Map-reduce partition columns: expr: _col0 type: string tag: -1 value expressions: expr: _col1 type: bigint Reduce Operator Tree: Group By Operator aggregations: expr: sum(VALUE._col0) //聚合 bucketGroup: false keys: expr: KEY._col0 type: string mode: mergepartial