Optimizing distinct and group by in Hive

Date: 2021-07-10


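1. Avoid count distinct, which easily causes performance problems: select count(distinct user_id) from a; Because the values must be deduplicated, Hive routes all of the map-stage output to a single reduce task, and that one task becomes the bottleneck. The query can be optimized by running group by first and then count. After optimization: select count(*) from( select user_id from a group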
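The rewrite described above can be checked for result equivalence with a small sketch. It runs the two query shapes against an in-memory SQLite database, which is only a stand-in here: it says nothing about Hive's reducer behavior or performance. The table a and column user_id follow the text; the sample rows, the subquery alias t, and the IS NOT NULL filter are assumptions added for illustration. The filter matters because count(distinct ...) ignores NULLs, while group by produces a NULL group that count(*) would otherwise count.

```python
import sqlite3

# In-memory SQLite database used only to compare query results.
# It is NOT Hive; the sample rows below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (user_id INTEGER)")
conn.executemany(
    "INSERT INTO a (user_id) VALUES (?)",
    [(1,), (2,), (2,), (3,), (3,), (3,), (None,)],
)

# Original shape: deduplication and counting happen in one aggregate.
original = "SELECT COUNT(DISTINCT user_id) FROM a"

# Rewritten shape: deduplicate with GROUP BY first, then count the groups.
# The NULL filter is an assumption added so both queries agree on NULLs.
rewritten = """
SELECT COUNT(*) FROM (
    SELECT user_id
    FROM a
    WHERE user_id IS NOT NULL
    GROUP BY user_id
) t
"""

count_distinct = conn.execute(original).fetchone()[0]
count_grouped = conn.execute(rewritten).fetchone()[0]
print(count_distinct, count_grouped)
```

Both queries should report the same number of distinct non-NULL user_id values. In Hive the expected benefit of the second shape is that the group by stage can be spread across several reducers before the final count, though the actual gain depends on the data and the Hive version.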
