Differences and relationships among order by, distribute by, sort by, and cluster by in Hive

Date: 2019-11-06

order by

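order by sorts the data globally, with the same effect as order by in databases such as Oracle and MySQL. It runs in a single reducer, so it becomes very inefficient when the data volume is large.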

In addition, when set hive.mapred.mode=strict is in effect, a select that uses order by without a limit fails with the following error:

LIMIT must also be specified.

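sort by

sort by sorts the data separately inside each reducer, so it does not guarantee a global order. It is usually used together with distribute by, and distribute by must be written before sort by.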

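If mapred.reduce.tasks=1, the result is the same as with order by. If it is greater than 1, the output is split into several files; each file is sorted by the specified field, but no global order is guaranteed.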
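The per-reducer behaviour described above can be imitated with a small Python toy model. This is only a sketch and not Hive's implementation: the function name sort_by and the round-robin assignment of rows to reducers are assumptions made for illustration.

```python
def sort_by(rows, num_reducers):
    """Toy model of SORT BY: split rows across reducers, sort each one separately."""
    # Assumption: rows are dealt out round-robin. How Hive really assigns rows
    # when no DISTRIBUTE BY is given is not something a query should rely on.
    buckets = [rows[i::num_reducers] for i in range(num_reducers)]
    # Each reducer sorts only its own rows and writes its own output file.
    return [sorted(bucket) for bucket in buckets]


rows = [5, 3, 9, 1, 7, 2]

one_reducer = sort_by(rows, 1)
three_reducers = sort_by(rows, 3)

# Read the output files one after another, as a client would.
concatenated = [value for output_file in three_reducers for value in output_file]

print(one_reducer)     # [[1, 2, 3, 5, 7, 9]]
print(three_reducers)  # [[1, 5], [3, 7], [2, 9]]
print(concatenated)    # [1, 5, 3, 7, 2, 9]
```

With one reducer the single output file is fully sorted, which matches order by. With three reducers every file is sorted on its own, yet reading them in sequence does not give a globally sorted result.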

sort by is not affected by whether hive.mapred.mode is set to strict or nonstrict.

distribute by

DISTRIBUTE BY controls how the map output is partitioned among the reducers. Using DISTRIBUTE BY guarantees that records with the same key are sent to the same reducer.
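The same-key guarantee can be illustrated with a short Python sketch. It is a toy model under stated assumptions: the function distribute_by, the sample rows, and the use of CRC32 as the hash are all hypothetical stand-ins, not Hive's actual partitioner.

```python
import zlib


def distribute_by(rows, key, num_reducers):
    """Toy model of DISTRIBUTE BY: route each row to a reducer chosen from its key."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        # Assumption: CRC32 stands in for Hive's hash function. Any deterministic
        # hash gives the same guarantee that equal keys reach the same reducer.
        index = zlib.crc32(str(row[key]).encode("utf-8")) % num_reducers
        reducers[index].append(row)
    return reducers


# Hypothetical sample rows.
rows = [
    {"dept": "sales", "salary": 300},
    {"dept": "ops", "salary": 100},
    {"dept": "sales", "salary": 200},
    {"dept": "hr", "salary": 150},
    {"dept": "ops", "salary": 250},
]

reducers = distribute_by(rows, "dept", 2)
for index, assigned in enumerate(reducers):
    print(index, assigned)
```

Because the reducer index depends only on the key, both "sales" rows end up in the same list, and likewise both "ops" rows. Nothing is sorted at this stage; ordering inside each reducer is the job of sort by.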

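cluster by

Using distribute by and sort by together on the same column is equivalent to cluster by. However, cluster by does not let you specify asc or desc; it always sorts in ascending order.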
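The relationship between the two forms can be sketched in Python as follows. This is a toy model rather than Hive's implementation: the function names, the integer keys, and the CRC32 hash are assumptions chosen for illustration.

```python
import zlib


def distribute_and_sort(rows, num_reducers, descending=False):
    """Toy model of DISTRIBUTE BY k SORT BY k [ASC|DESC] on integer keys."""
    reducers = [[] for _ in range(num_reducers)]
    for value in rows:
        # Assumption: CRC32 of the key's text stands in for Hive's hash function.
        index = zlib.crc32(str(value).encode("utf-8")) % num_reducers
        reducers[index].append(value)
    # Each reducer sorts its own rows; the direction is the caller's choice.
    return [sorted(bucket, reverse=descending) for bucket in reducers]


def cluster_by(rows, num_reducers):
    """Toy model of CLUSTER BY k: same routing, but the order is always ascending."""
    return distribute_and_sort(rows, num_reducers, descending=False)


keys = [42, 7, 19, 7, 88, 3, 42]

clustered = cluster_by(keys, 2)
descending = distribute_and_sort(keys, 2, descending=True)

print(clustered)
print(descending)
```

In this model cluster_by has no direction parameter at all, which mirrors the restriction described above: a descending order within each reducer is only available by writing distribute by and sort by explicitly.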
