Reposted from: analyzing how Hive's LIMIT works via the execution plan: http://yaoyinjie.blog.51cto.com/3189782/923378
Hive's distribute by
Order by produces a totally ordered result, but it does so through a single reducer, so it is very inefficient on large data sets. In many cases a global sort is not needed, and you can switch to Hive's non-standard extension sort by instead. Sort by produces one sorted output file per reducer. Sometimes you need to control which reducer a particular row is sent to, usually so that a subsequent aggregation can be performed there; Hive's distribute by clause does exactly that.
For this reason, distribute by is often used together with sort by.
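The shuffle semantics described above can be sketched in plain Python (illustrative only, not Hive internals): rows are routed to a reducer by hashing the distribute key, and each reducer then sorts only its own partition.

```python
# Sketch: simulate DISTRIBUTE BY (hash-partition on a key column) followed
# by SORT BY (each reducer sorts only its own rows, no global order).
def distribute_and_sort(rows, dist_key, sort_key, num_reducers=2):
    """Route each row to a reducer by hash(dist_key), then sort per reducer."""
    partitions = [[] for _ in range(num_reducers)]
    for row in rows:
        # All rows with the same distribute-key value land in the same partition.
        r = hash(row[dist_key]) % num_reducers
        partitions[r].append(row)
    # SORT BY yields one sorted file per reducer, not a totally ordered result.
    return [sorted(p, key=lambda row: row[sort_key]) for p in partitions]

rows = [("a", 3), ("b", 1), ("a", 2), ("b", 4)]
parts = distribute_and_sort(rows, dist_key=0, sort_key=1)
# Every row with uid "a" is in one partition, every "b" row in one partition,
# and each partition is sorted by the second column.
```

This mirrors why distribute by matters before an aggregation: it guarantees all rows sharing a key reach the same reducer, which sort by alone does not.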
The statement:
SELECT COUNT, COUNT(DISTINCT uid) FROM logs GROUP BY COUNT;
hive> SELECT * FROM logs;
OK
a	蘋果	3
a	橙子	3
a	燒雞	1
b	燒雞	3
hive> SELECT COUNT, COUNT(DISTINCT uid) FROM logs GROUP BY COUNT;
Group by count and compute the number of distinct users per group.
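Worked out in plain Python (a sketch of the query's semantics, not of Hive's execution), the four sample rows above give one distinct uid for count=1 and two for count=3:

```python
from collections import defaultdict

# The sample rows from the logs table above: (uid, item, count)
rows = [("a", "蘋果", 3), ("a", "橙子", 3), ("a", "燒雞", 1), ("b", "燒雞", 3)]

# SELECT count, COUNT(DISTINCT uid) FROM logs GROUP BY count
groups = defaultdict(set)
for uid, _item, cnt in rows:
    groups[cnt].add(uid)            # collect the distinct uids per count value

result = {cnt: len(uids) for cnt, uids in sorted(groups.items())}
print(result)  # {1: 1, 3: 2}
```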
1. The first step computes partial values in the mapper, keyed on both count and uid; if the aggregation is distinct and the combination has already been seen, the row is skipped. So the first stage keys on the (count, uid) combination, while the second stage keys on count alone.
2. The ReduceSink only executes at mapper.close(): at GroupByOperator.close() the buffered results are emitted. Note that although the key here is (count, uid), partitioning for the reduce phase is by count alone!
3. The distinct value computed in the first step is not actually used; only the value computed at reduce time is accurate. The map side merely collapses rows whose key combinations are identical. (For a plain count, by contrast, the partial results are merged later.)
4. distinct decides whether to add 1 by comparing against lastInvoke: since rows arrive at the reducer already sorted, it only checks whether the distinct field has changed, and does not add 1 if it has not.
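The four steps above can be sketched as follows (illustrative Python, not Hive's actual Java operators): map-side hash aggregation on (count, uid), a shuffle partitioned by count but sorted by (count, uid), and a reducer that counts a uid only when it differs from the previous one, like the lastInvoke check.

```python
from itertools import groupby

rows = [("a", 3), ("a", 3), ("a", 1), ("b", 3)]   # (uid, count) pairs

# Step 1: map-side hash aggregation keyed on (count, uid) — duplicate
# combinations collapse, shrinking the shuffle but not producing final values.
map_output = sorted(set((cnt, uid) for uid, cnt in rows))

# Step 2: shuffle — partitioned by count only, even though the sort key is
# (count, uid); here a single "reducer" sees everything, already sorted.

# Steps 3-4: the reducer walks rows in sorted order and, like the lastInvoke
# comparison, adds 1 only when the uid differs from the previous one.
distinct_per_count = {}
for cnt, pairs in groupby(map_output, key=lambda kv: kv[0]):
    n, last_uid = 0, None
    for _cnt, uid in pairs:
        if uid != last_uid:          # distinct field changed → count it
            n += 1
        last_uid = uid
    distinct_per_count[cnt] = n

print(distinct_per_count)  # {1: 1, 3: 2}
```

Because the shuffle sorts on (count, uid), each reducer sees every uid for a given count grouped together, which is what makes the change-detection trick correct.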
hive> explain select count, count(distinct uid) from logs group by count;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME logs))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL count)) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL uid)))) (TOK_GROUPBY (TOK_TABLE_OR_COL count))))
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        logs
          TableScan                    // table scan
            alias: logs
            Select Operator            // column pruning: only uid and count are needed
              expressions:
                    expr: count
                    type: int
                    expr: uid
                    type: string
              outputColumnNames: count, uid
              Group By Operator        // map-side aggregation first
                aggregations:
                      expr: count(DISTINCT uid)   // the aggregation expression
                bucketGroup: false
                keys:
                      expr: count
                      type: int
                      expr: uid
                      type: string
                mode: hash             // hash mode
                outputColumnNames: _col0, _col1, _col2
                Reduce Output Operator
                  key expressions:     // the output keys
                        expr: _col0    // count
                        type: int
                        expr: _col1    // uid
                        type: string
                  sort order: ++
                  Map-reduce partition columns:   // partitioned by the GROUP BY column
                        expr: _col0    // i.e. count
                        type: int
                  tag: -1
                  value expressions:
                        expr: _col2
                        type: bigint
      Reduce Operator Tree:
        Group By Operator              // the second, reduce-side aggregation
          aggregations:
            expr: