GROUP BY and DISTINCT in Hive

Date: 2019-11-18

Preface

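Only today did I clearly realize that GROUP BY also has a deduplicating effect. On reflection this makes sense: when rows are grouped by some column, all rows with the same value fall into one group, which amounts to deduplication. Let's look at the details.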

group by

Example 1:

hive> select * from test;
OK
zhao    15  20170807
zhao    14  20170809
zhao    15  20170809
zhao    16  20170809

hive> select name from test;
OK
zhao
zhao
zhao
zhao

hive> select name from test group by name;

...

OK
zhao
Time taken: 40.273 seconds, Fetched: 1 row(s)

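Grouping by name leaves a single row, so the result is deduplicated. Deduplication only merges rows that are identical in every selected column, so let's see what happens with two columns: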

hive> select name,age from test group by name,age;
...

OK
zhao    14
zhao    15
zhao    16
Time taken: 36.943 seconds, Fetched: 3 row(s)

hive> select name,age from test group by name;
FAILED: SemanticException [Error 10025]: Line 1:12 Expression not in GROUP BY key 'age'

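Grouping by name alone fails. That is reasonable: if rows are grouped only by name, Hive has no way to decide which of the differing age values to return for the group.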
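
If one row per name is really what you want, one option is to tell Hive how to collapse the differing age values by wrapping age in an aggregate function. This is a minimal sketch, not the only fix. It assumes the columns of test are named name and age, as in the queries above, and the output aliases are made up for illustration:

```sql
-- One row per name; each aggregate collapses that group's age values.
-- Aliases (min_age, max_age, distinct_ages, ages) are illustrative only.
select name,
       min(age)            as min_age,
       max(age)            as max_age,
       count(distinct age) as distinct_ages,
       collect_set(age)    as ages
from test
group by name;
```

With the four rows shown earlier, this should return a single row for zhao with a minimum age of 14, a maximum age of 16, and three distinct ages.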

distinct

This one is simpler; it just removes duplicates:

hive> select distinct name from test;
...

OK
zhao
Time taken: 37.047 seconds, Fetched: 1 row(s)

hive> select distinct name,age from test;
OK
zhao    14
zhao    15
zhao    16
Time taken: 39.131 seconds, Fetched: 3 row(s)

hive> select distinct(name),age from test;
OK
zhao    14
zhao    15
zhao    16
Time taken: 37.739 seconds, Fetched: 3 row(s)
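
The last query returns the same three rows as the one before it, which suggests that the parentheses do not restrict DISTINCT to name. DISTINCT is not a function. It applies to the whole select list, so (name) is just a parenthesized expression. As a sketch, the following statements should all produce the same set of (name, age) pairs on the test table above, although the row order is not guaranteed to match:

```sql
-- DISTINCT over the whole select list
select distinct name, age from test;

-- Parentheses around name change nothing
select distinct(name), age from test;

-- The same deduplication expressed with GROUP BY
select name, age from test group by name, age;
```

The third form is the one to reach for when you also need a per-group aggregate such as count(*), which DISTINCT alone cannot provide.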