005. order by, distribute by, sort by, and cluster by in Hive

Date: 2019-12-09

Tags: 005.hive, hive, order, distribute, sort, cluster. Category: Hadoop


Usage notes for order by, distribute by, sort by, and cluster by queries

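-- Sort the weather data by year and temperature, ensuring that all rows with the same year end up in the same reducer partition

-- Single reducer (very slow on massive data)
select year, temperature
from records  -- hypothetical table name; the original omits the from clause
order by year asc, temperature desc
limit 100;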
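
The interaction between order by and strict mode can be sketched as follows. This is only an illustration: `records` is a hypothetical table name (the original queries do not name one), and the behaviour assumes a Hive version that honours the hive.mapred.mode setting.

```sql
-- `records` is a hypothetical table with columns year and temperature.
set hive.mapred.mode=strict;

-- In strict mode this statement is rejected, because order by has no limit:
-- select year, temperature from records order by year asc, temperature desc;

-- Adding a limit makes the same query acceptable in strict mode.
select year, temperature
from records
order by year asc, temperature desc
limit 100;
```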


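-- Multiple reducers (fast even on massive data)
select year, temperature
from records  -- hypothetical table name; the original omits the from clause
distribute by year
sort by year asc, temperature desc
limit 100;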
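
When the distribute by and sort by columns are the same and ascending order is acceptable, the pair can be written as a single cluster by. A minimal sketch, assuming a hypothetical table `records` and an arbitrary reducer count chosen for illustration:

```sql
-- `records` is a hypothetical table with columns year and temperature.
-- The reducer count of 4 is an arbitrary value for illustration.
set mapred.reduce.tasks=4;

-- Explicit form: rows with the same year go to the same reducer,
-- and each reducer sorts its own rows by year.
select year, temperature
from records
distribute by year
sort by year asc;

-- Shorthand form: cluster by distributes and sorts on the same column.
select year, temperature
from records
cluster by year;
```

In both forms each reducer's output is sorted by year, but the combined output across reducers is not globally ordered.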

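order by (global sort)
order by performs a global sort over the input, so it runs with a single reducer (multiple reducers cannot guarantee a global order). With only one reducer, large inputs take a long time to process. When hive.mapred.mode=strict, order by must be accompanied by a limit clause, which reduces the amount of data the reducer has to handle. For example, with limit 100 and 50 map tasks, each map task emits at most 100 rows, so the reducer receives at most 100 * 50 = 5000 rows.

distribute by (similar to bucketing)
distribute by partitions the rows by the specified fields: rows with the same field values go to the same reducer, and the data is spread across the output files of the different reducers.

sort by (similar to sorting within a bucket)
sort by is not a global sort. The data is sorted before it enters each reducer, so if you use sort by with mapred.reduce.tasks > 1, only the output of each individual reducer is guaranteed to be sorted, not the output as a whole.

cluster by
cluster by provides the function of distribute by and also that of sort by on the same columns. However, the sort is always ascending; asc or desc cannot be specified. For this reason cluster by is commonly regarded as distribute by + sort by on the same columns.

References:
http://blog.csdn.net/jojo52013145/article/details/19199595
http://blog.sina.com.cn/s/blog_9f48885501017aib.html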
