Hive SQL FAQ (repost)

http://www.aboutyun.com/thread-14942-1-1.html

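Guiding questions
1. How do Hive queries and SQL queries differ, and how are they related?
2. What are the differences among distribute by, group by, and sort by?
3. What are the advantages and disadvantages of MapJoin?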



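Aggregate functions
1. count
count(*): counts every row, including rows that contain NULL values; count(1): also counts every row, because the constant 1 is never NULL; count(col): counts only the rows where col is not NULL
2. sum
sum(value convertible to a number) returns bigint for integer input. To add 1 to the result, cast the 1 to bigint so that both operands have the same type: sum(col)+cast(1 as bigint)
3. avg
avg(value convertible to a number) returns double
4. distinct: number of different values
count(distinct col)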
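
The aggregate functions above can be compared side by side in one query. This is a minimal sketch, assuming a hypothetical table orders(id int, amount int, coupon string) in which coupon is NULL for some rows; the table and column names are not from the original post.

```sql
-- Hypothetical table: orders(id int, amount int, coupon string)
select
  count(*)                         as all_rows,          -- every row, NULLs included
  count(1)                         as all_rows_again,    -- same value as count(*)
  count(coupon)                    as rows_with_coupon,  -- rows where coupon is not NULL
  count(distinct coupon)           as distinct_coupons,  -- different non-NULL coupon values
  sum(amount)                      as total_amount,      -- bigint for an int column
  sum(amount) + cast(1 as bigint)  as total_plus_one,    -- constant cast to the same type
  avg(amount)                      as avg_amount         -- double
from orders;
```

If coupon contains NULLs, rows_with_coupon comes out smaller than all_rows, while all_rows and all_rows_again stay equal.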

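Order by
Sorts by the given fields; several columns may be listed. The default order is ascending, and strings compare in dictionary order. Order by is a global sort: it runs in exactly one reduce. No matter how many reduces are configured, all rows end up in a single reduce for the sort, so think carefully before using order by on very large data.
select c1,other from tableName where condition order by c1,c2 [asc|desc]
For example: select * from usertable order by id asc,name desc;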

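Group by
Groups rows by the values of the given fields, putting rows with equal values together. Note that every column in the select list, other than aggregate functions, should also appear in group by. Group by can also take an expression, such as a substr call on a column, which groups by part of that column's value. having filters the groups further after group by has run.
Group by uses a reduce stage, so it is limited by the number of reduces, which is set through mapred.reduce.tasks. The number of output files equals the number of reduces, and the file size depends on how much data each reduce processes.
Known problems: heavy network load, and data skew. For skew, set the tuning parameter hive.groupby.skewindata to true. This launches two MapReduce jobs: the first spreads rows across reduces as if a random value had been added to each key and aggregates partially, and the second then performs the actual group by.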

select c1[,c2..], count(1).. from test_group where condition group by c1[,c2..] [having condition]
set mapred.reduce.tasks=n;
set hive.groupby.skewindata=true;
select name,count(1) as num from tableName group by name;
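
Grouping by an expression and filtering with having can be sketched as follows. The reducer count of 4, the threshold of 10, and the choice of the first character of name are illustrative assumptions; tableName is the placeholder table used above.

```sql
set mapred.reduce.tasks=4;          -- assumed value; also the number of output files
set hive.groupby.skewindata=true;   -- two-job plan to soften skewed keys

-- Group by the first character of name and keep only the larger groups.
select substr(name, 1, 1) as initial,
       count(1)           as num
from tableName
group by substr(name, 1, 1)
having count(1) > 10;
```

The expression in the select list is repeated verbatim in group by, which satisfies the rule that non-aggregate select columns must appear in the group by clause.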

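group by and distinct
Both remove duplicates, and both go through the same MapReduce flow, so tuning the number of reduces and setting hive.groupby.skewindata work for both. With group by, the selected columns must appear in the group by clause. distinct has no such clause, but when several columns follow distinct, rows are merged only if all of those columns are equal.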
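
The two forms below should return the same set of rows. This is a sketch assuming tableName has a second column city, which is a hypothetical name introduced only for the example.

```sql
-- distinct: a row is dropped only when both name and city repeat
select distinct name, city
from tableName;

-- group by: every selected column is listed in the group by clause
select name, city
from tableName
group by name, city;
```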

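Join (table joins)
Two tables m and n are joined according to the on condition; one record from m and one record from n are combined into a new record.
join: equi-join (inner join); a row is returned only when it has a match in both m and n.
For example:
tableA col1 col2
       1    w
       2    j
       4    g
tableB col3 col4
       1    y
       1    t
       5    z
select a.col1,a.col2,b.col4 from (select col1,col2 from tableA) a join (select col3,col4 from tableB) b on a.col1=b.col3
     result:
     1 w y
     1 w t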
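left outer join: every row of the left table is output whether or not it has a match in the right table; when there is no match, the right-side columns are NULL. Rows of the right table appear only when they match a row of the left table.
select a.col1,a.col2,b.col4 from (select col1,col2 from tableA) a left outer join (select col3,col4 from tableB) b on a.col1=b.col3
    result:
    1 w y
    1 w t
    2 j NULL
    4 g NULL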
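right outer join: the mirror image of left outer join; every row of the right table is output, and the left-side columns are NULL when there is no match.
select a.col1,a.col2,b.col4 from (select col1,col2 from tableA) a right outer join (select col3,col4 from tableB) b on a.col1=b.col3
    result:
    1 w y
    1 w t
    NULL NULL z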
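left semi join: similar to exists; it checks whether each row of the left table has a match in the right table and keeps the rows that do. Each left row is returned at most once, and columns of the right table may be used only in the on condition, not in the select list.
select a.col1,a.col2 from (select col1,col2 from tableA) a left semi join (select col3,col4 from tableB) b on a.col1=b.col3
    result:
    1 w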
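mapjoin: the join is completed on the map side and needs no reduce. It is an in-memory join and counts as an optimization, but it requires loading an entire table into memory.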

select m.col1 as col1,m.col2 as col2,n.col3 as col3
from
(select col1,col2 from tableA where condition)m   -- this condition is applied on the map side
[left outer|right outer|left semi] join
(select col3 from tableB)n
on m.col1=n.col3 [and condition]
where condition   -- this condition is applied on the reduce side
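Note: join conditions in Hive must be equalities; comparisons such as < and > cannot be used in the on clause (later Hive releases relax this restriction).
If there is a data skew problem, you can set:
set hive.optimize.skewjoin=true;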
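
Because left semi join behaves like exists, the same result can be written either way. This is a sketch against the sample tableA and tableB above; the exists form assumes a Hive version that accepts subqueries in the where clause (0.13 or later).

```sql
-- left semi join: only left-table columns may be selected
select a.col1, a.col2
from tableA a
left semi join tableB b on a.col1 = b.col3;

-- the same filter written with exists
select a.col1, a.col2
from tableA a
where exists (select 1 from tableB b where a.col1 = b.col3);
```

With the sample data, both queries return the single row 1 w: the value 1 occurs twice in tableB, but a semi join emits each matching left row only once.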

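MapJoin
On the map side, the small table is loaded into memory; the large table is then read and matched against the in-memory small table. The distributed cache is used to ship the small table.
Advantages and disadvantages:
Advantages: it consumes no reduce resources (reduce slots are comparatively scarce); it removes the reduce stage and speeds up execution; it lowers network load.
Disadvantages: it uses memory, so the table loaded into memory must not be too large, because every compute node loads its own copy; it produces many small files (one output file per map).
Using MapJoin:
Automatically
Configuration: set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=n (MapJoin is used when the small table in the query is no larger than this value. With many tasks each loading the table, however, memory consumption adds up.)
Manually
select /*+mapjoin(n)*/ m.col1,m.col2,n.col4 from (...)m join (...)n on m.col1=n.col3
That is, the /*+mapjoin(n)*/ hint in the select clause loads the small table n into memory.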
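
Both ways of requesting a MapJoin can be sketched on the sample tables from the Join section. The threshold of 25000000 bytes is an assumed value chosen for illustration, and treating tableB as the small table is likewise an assumption. In some Hive versions the hint is ignored unless hive.ignore.mapjoin.hint is set to false.

```sql
-- Automatic conversion: Hive decides based on the size of the small table.
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;   -- bytes; assumed threshold

select m.col1, m.col2, n.col4
from tableA m
join tableB n on m.col1 = n.col3;

-- Manual hint: name the alias of the table to hold in memory.
select /*+ mapjoin(n) */ m.col1, m.col2, n.col4
from tableA m
join tableB n on m.col1 = n.col3;
```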

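Distribute by and Sort by
distribute by col1,col2
Spreads rows across the reduces according to col1 and col2; rows with the same values go to the same reduce.
sort by col1,col2 [asc|desc]
Sorts by col1 and col2 (the sort happens inside each reduce).
Used together, the two guarantee that the output of each reduce is sorted, although the output as a whole is not.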
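
Combining the two clauses looks like this. It is a sketch only: the reducer count of 3 is an arbitrary assumption, and tableName with columns col1 and col2 is a placeholder.

```sql
set mapred.reduce.tasks=3;   -- assumed value; rows are spread over 3 reduces

-- Rows with the same col1 land in the same reduce,
-- and each reduce sorts its own rows by col1, then col2 descending.
select col1, col2
from tableName
distribute by col1
sort by col1 asc, col2 desc;
```

Each of the three outputs is sorted internally, but reading them one after another does not give a globally sorted result.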

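distribute by and group by
Both partition the data by key.
Both use a reduce stage.
distribute by only spreads the data out, whereas group by gathers equal keys together and must be followed by an aggregation.
order by and sort by
order by guarantees a global order.
sort by only guarantees that the output of each reduce is ordered; with a single reduce it has the same effect as order by.
Application scenarios
Too many small files (control the number of output files through the number of reduces)
Oversized files
Map output files of uneven size
Reduce output files of uneven size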
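
The small-files scenario can be sketched as a rewrite of the data through a fixed number of reduces. The table names small_files_table and merged_table and the reducer count of 10 are hypothetical, and the exact file count may still be affected by other merge-related settings.

```sql
set mapred.reduce.tasks=10;   -- assumed value; roughly one output file per reduce

-- Force a reduce stage so the output is written by 10 reduces
-- instead of by one map task per small input file.
insert overwrite table merged_table
select col1, col2
from small_files_table
distribute by col1;
```

If col1 is heavily skewed, the reduce outputs will be uneven in size, so the distribution key should be a column whose values are spread fairly evenly.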

cluster by
Gathers rows with the same value together and sorts them.
cluster by col
Equivalent to: distribute by col sort by col
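
The equivalence can be written out as two queries that should produce the same per-reduce output. tableName and its columns are placeholders.

```sql
-- Short form
select col1, col2
from tableName
cluster by col1;

-- Long form: same distribution key and same sort key
select col1, col2
from tableName
distribute by col1
sort by col1;
```

The long form is still needed when the sort key differs from the distribution key or when a descending sort is wanted.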

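union all
Merges the rows of several tables into one result. Hive does not support the plain union operation, only union all (later Hive releases add plain union).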
select col 
from(
select a as col from t1
union all
select b as col from t2
)tmp
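No reduce stage is involved.
Requirements:
Column names must be the same (use as to give aliases)
Column types must be the same
Column counts must be the same (across the subqueries)
The sub-selects inside the union must not carry aliases (in a join, by contrast, every subquery needs an alias)
To query the merged result, give the union as a whole an alias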
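
Querying the merged result looks like this. The sketch reuses t1, t2, and the alias tmp from the example above; the count per value is added only to show a query on top of the union.

```sql
select tmp.col,
       count(1) as num
from (
  select a as col from t1   -- aliased so both branches expose the same column name
  union all
  select b as col from t2
) tmp                       -- alias for the union as a whole
group by tmp.col;
```

The union itself needs no reduce, but the group by placed on top of it does.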