GROUP BY and DISTINCT in Hive

Preface

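It was only today that I clearly understood that group by also has a deduplicating effect. Thinking it through, grouping by a column puts identical values into the same group, which amounts to removing duplicates. Let's look at it in detail.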

group by
  • Example 1:
hive> select * from test;
OK
zhao    15  20170807
zhao    14  20170809
zhao    15  20170809
zhao    16  20170809

hive> select name from test;
OK
zhao
zhao
zhao
zhao

hive> select name from test group by name;

...

OK
zhao
Time taken: 40.273 seconds, Fetched: 1 row(s)

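Grouping by name leaves a single row, so the duplicates are gone. Strictly speaking, rows can only be deduplicated when they are identical in every selected column, so let's try it with two columns: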

hive> select name,age from test group by name,age;
...

OK
zhao    14
zhao    15
zhao    16
Time taken: 36.943 seconds, Fetched: 3 row(s)
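
To make the collapsing visible, one option is to add count(*) to the same queries. This is only a sketch against the test table above, assuming its first two columns are named name and age as in the queries. Each output row then shows how many original rows were merged into that group.

```sql
-- Sketch against the sample table above (columns name and age as used in the queries).
-- There is a single distinct name, so all four rows fall into one group.
select name, count(*) as cnt from test group by name;
-- with the sample data: zhao 4

-- One group per distinct (name, age) pair.
select name, age, count(*) as cnt from test group by name, age;
-- with the sample data: (zhao, 14) -> 1, (zhao, 15) -> 2, (zhao, 16) -> 1
```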

hive> select name,age from test group by name;
FAILED: SemanticException [Error 10025]: Line 1:12 Expression not in GROUP BY key 'age'

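Grouping by name alone fails: when name is the only grouping key, Hive has no way to decide which of the differing age values to return for the group, so the error is reasonable.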
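
If the goal really is one row per name, a common way around this error is to wrap the non-grouped column in an aggregate function, so that Hive knows how to reduce the differing values to one. A minimal sketch using the built-in max and collect_set functions; which aggregate fits depends on what the query is meant to return:

```sql
-- age is not a grouping key, so it has to be aggregated.
select name, max(age) as max_age from test group by name;
-- with the sample data: zhao 16

-- collect_set gathers the distinct age values of each group into an array
-- (element order is not guaranteed).
select name, collect_set(age) as ages from test group by name;
```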

distinct

This one is also fairly simple; it just removes duplicates:

hive> select distinct name from test;
...

OK
zhao
Time taken: 37.047 seconds, Fetched: 1 row(s)

hive> select distinct name,age from test;
OK
zhao    14
zhao    15
zhao    16
Time taken: 39.131 seconds, Fetched: 3 row(s)

hive> select distinct(name),age from test;
OK
zhao    14
zhao    15
zhao    16
Time taken: 37.739 seconds, Fetched: 3 row(s)
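
The last two queries return the same rows because distinct is a modifier on the whole select list, not a function applied to one column; the parentheses in distinct(name) only wrap the expression name. As a sketch, all of the following are expected to return the same three rows on the sample table:

```sql
-- distinct applies to the (name, age) pair in both forms.
select distinct name, age from test;
select distinct (name), age from test;

-- Equivalent formulation with group by on every selected column.
select name, age from test group by name, age;
```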

Differences
  • On large data sets distinct tends to be less efficient, so group by is generally recommended.
  • As for the reason, I recommend this article
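
One commonly cited case where the difference shows up is counting distinct values. The sketch below is a general pattern, not taken from the recommended article: in many Hive setups a plain count(distinct ...) sends all values through a single reducer, while deduplicating with group by in a subquery first lets that step be spread across several reducers. Newer Hive versions may optimize this automatically, so it is worth comparing both forms on real data.

```sql
-- Straightforward form: count the distinct names directly.
select count(distinct name) from test;

-- Rewrite: deduplicate with group by first, then count the remaining rows.
-- The alias t is arbitrary; a subquery in the FROM clause needs a name.
select count(*) from (select name from test group by name) t;
-- both return 1 on the sample data
```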