Advanced Hive Usage
1. Complex data types
Hive supports array, map, and struct.
Values of these types can be traversed and queried.
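As a minimal sketch of the three complex types, the HiveQL below declares one column of each kind and reads from it. The table employee_demo, its columns, and the delimiters are assumptions made for illustration; they do not come from the original notes.

```sql
-- Hypothetical table: every name and delimiter here is illustrative.
CREATE TABLE IF NOT EXISTS employee_demo (
  name    STRING,
  skills  ARRAY<STRING>,
  scores  MAP<STRING, INT>,
  address STRUCT<city:STRING, zip:STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
  MAP KEYS TERMINATED BY ':';

SELECT
  name,
  skills[0]        AS first_skill,  -- array: zero-based index
  size(skills)     AS skill_count,  -- number of array elements
  scores['math']   AS math_score,   -- map: lookup by key
  map_keys(scores) AS subjects,     -- map: all keys as an array
  address.city     AS city          -- struct: dot notation
FROM employee_demo
WHERE array_contains(skills, 'sql');
```

A lookup on a missing map key or an out-of-range array index typically yields NULL rather than an error.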
2. Views are supported
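A short sketch of defining and querying a view. It assumes a table cookie1 with the columns cookieid and pv, as used in the window-function examples of these notes; the view name and the threshold are hypothetical.

```sql
-- Hypothetical view name; cookie1(cookieid, pv, ...) is assumed to exist.
CREATE VIEW IF NOT EXISTS cookie_pv_view AS
SELECT cookieid, sum(pv) AS total_pv
FROM cookie1
GROUP BY cookieid;

-- A view is queried like a table; the underlying query runs at read time.
SELECT cookieid, total_pv
FROM cookie_pv_view
WHERE total_pv > 10;

DROP VIEW IF EXISTS cookie_pv_view;
```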
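3. Functions
3.1 A rich set of built-in functions
3.2 User-defined functions: write a custom Java processing class, add it to Hive as a jar file, define a temporary function bound to that class, and use it to process data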
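The registration steps can be sketched in HiveQL as follows. The jar path, the class name com.example.hive.udf.MyLower, and the function name my_lower are placeholders, and the Java class itself (typically one extending Hive's UDF or GenericUDF base class) is assumed to be already compiled into the jar.

```sql
-- Hypothetical jar path, class name, and function name.
ADD JAR /path/to/my-hive-udfs.jar;

-- Bind a session-scoped function name to the Java class.
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.MyLower';

-- Use it like a built-in function (cookie4.url is taken from the later examples).
SELECT my_lower(url) FROM cookie4 LIMIT 10;

-- The function disappears at the end of the session; it can also be dropped explicitly.
DROP TEMPORARY FUNCTION IF EXISTS my_lower;
```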
3.3 Parsing and handling JSON data: get_json_object, json_tuple
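A sketch of the two functions side by side. The table json_log with a single STRING column named line is an assumption, as is the sample document shown in the comment.

```sql
-- Hypothetical table json_log(line STRING), one JSON document per row, e.g.
-- {"user":{"id":"u1","name":"Ann"},"url":"/index","tags":["a","b"]}

-- get_json_object: one value per call, addressed by a JSONPath-like expression.
SELECT
  get_json_object(line, '$.user.id') AS user_id,
  get_json_object(line, '$.tags[0]') AS first_tag
FROM json_log;

-- json_tuple: several top-level keys in a single pass, used with LATERAL VIEW.
SELECT t.url, t.user_json
FROM json_log j
LATERAL VIEW json_tuple(j.line, 'url', 'user') t AS url, user_json;
```

json_tuple reads only top-level keys, so nested values such as user.id still need get_json_object (or a second json_tuple applied to user_json).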
3.4 Calling custom scripts (for example Python) from HQL through TRANSFORM
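A minimal sketch of a TRANSFORM call. The script normalize_url.py and its path are hypothetical; the script is assumed to read tab-separated rows from standard input and write tab-separated rows to standard output, which is how Hive exchanges data with it by default.

```sql
-- Hypothetical script; ship it to the cluster so every task can run it.
ADD FILE /path/to/normalize_url.py;

-- Each input row is sent to the script as "cookieid<TAB>url";
-- each line the script prints is split on tabs into the AS (...) columns.
SELECT TRANSFORM (cookieid, url)
       USING 'python normalize_url.py'
       AS (cookieid STRING, clean_url STRING)
FROM cookie4;
```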
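3.5 Analytic window functions
a. sum, avg, min, max: aggregation within a window
over (partition by col1 order by col2 rows between {unbounded | n} preceding and {current row | n following})
If ORDER BY is given without ROWS BETWEEN, the frame defaults to the start of the partition through the current row (Hive uses RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so rows tied with the current row on the ORDER BY value are included);
If ORDER BY is not given, all values in the partition are aggregated;
The key is to understand ROWS BETWEEN, also called the WINDOW clause:
PRECEDING: rows before the current row
FOLLOWING: rows after the current row
CURRENT ROW: the current row
UNBOUNDED: no limit, that is, the edge of the partition
UNBOUNDED PRECEDING: from the first row of the partition
UNBOUNDED FOLLOWING: to the last row of the partition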
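b. ntile, row_number, rank, dense_rank
NTILE(n): splits the ordered rows of a partition into n buckets and returns the bucket number of the current row
ROW_NUMBER(): numbers the rows of a partition in order, starting at 1, with no duplicates
RANK(): rank of the row within its partition; tied rows share a rank and leave a gap after it, e.g. 3, 3, 5
DENSE_RANK(): rank of the row within its partition; tied rows share a rank and leave no gap, e.g. 3, 3, 4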
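c. cume_dist, percent_rank
CUME_DIST: (number of rows with a value less than or equal to the current value) / (total rows in the partition)
PERCENT_RANK: (RANK of the current row in the partition - 1) / (total rows in the partition - 1)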
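d. lag, lead, first_value, last_value
LAG(col, n, DEFAULT): value of col from the row n rows before the current row in the window; DEFAULT is returned when no such row exists
LEAD(col, n, DEFAULT): value of col from the row n rows after the current row in the window; DEFAULT is returned when no such row exists
first_value(col1) over (partition by col2 order by col3): the first value in the sorted partition, up to the current row
last_value(col1) over (partition by col2 order by col3): the last value in the sorted partition, up to the current row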
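e. grouping sets, grouping__id, cube, rollup: commonly used for OLAP
grouping sets, grouping__id: aggregate separately over each listed combination of the GROUP BY columns and merge the results into one result set
cube: aggregates over all combinations of the GROUP BY columns
rollup: aggregates over hierarchical combinations of the GROUP BY columns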
grouping sets (list of GROUP BY column sets): aggregates over each listed combination of columns
grouping__id: a number identifying which grouping set a row belongs to
The third set (month, day) below is optional.
eg:
    select month, day, count(distinct cookieid) as uv, grouping__id
    from cookie5
    group by month, day
    grouping sets (month, day, (month, day))
    order by grouping__id;

cube: with cube aggregates over all combinations of the GROUP BY dimensions
eg:
    select month, day, count(distinct cookieid) as uv, grouping__id
    from cookie5
    group by month, day with cube
    order by grouping__id;

rollup: with rollup aggregates level by level, following the order of the GROUP BY dimensions
eg:
    select month, day, count(distinct cookieid) as uv, grouping__id
    from cookie5
    group by month, day with rollup
    order by grouping__id;

lag(column, n, default): value from n rows earlier in the window, shown shifted onto the current row
lead(column, n, default): value from n rows later in the window, shown shifted onto the current row
eg:
    select cookieid, createtime, url,
        row_number() over (partition by cookieid order by createtime) as rn,
        lag(createtime, 1, '1970-01-01 00:00:00') over (partition by cookieid order by createtime) as front_1_time,
        lead(createtime, 2, '2018-12-24 00:00:00') over (partition by cookieid order by createtime) as behind_2_time
    from cookie4;

first_value(column): the first value in the sorted window (sort in descending order to get the last value)
last_value(column): the last value in the sorted window up to the current row, which is normally the current row's own value
eg:
    select cookieid, createtime, url,
        row_number() over (partition by cookieid order by createtime) as rn,
        first_value(url) over (partition by cookieid order by createtime) as first1,
        first_value(url) over (partition by cookieid order by createtime desc) as last1,
        last_value(url) over (partition by cookieid order by createtime) as last2
    from cookie4;

CUME_DIST(): (number of rows with a value less than or equal to the current value) / (total rows in the partition)
PERCENT_RANK(): (RANK of the current row in the partition - 1) / (total rows in the partition - 1)
eg:
    select dept, userid, sal,
        cume_dist() over (order by sal) as rn1,
        cume_dist() over (partition by dept order by sal) as rn2
    from cookie3;

    select dept, userid, sal,
        percent_rank() over (order by sal) as rn1,                    -- whole result set treated as one partition
        rank() over (order by sal) as rn11,                           -- RANK within that partition
        sum(1) over (partition by null) as rn12,                      -- total rows in that partition
        percent_rank() over (partition by dept order by sal) as rn2,  -- per department
        rank() over (partition by dept order by sal) as rn21,         -- RANK within the department
        sum(1) over (partition by dept) as rn22                       -- total rows in the department
    from cookie3;
ntile(n): splits the rows of the window into n buckets
row_number(): sequence number of the row within the window, starting at 1
rank(): rank of the row within the window, with gaps after ties, e.g. 3, 3, 5
dense_rank(): rank of the row within the window, without gaps after ties, e.g. 3, 3, 4
eg:
    select cookieid, createtime, pv,
        ntile(2) over (partition by cookieid order by createtime) as rn1,
        row_number() over (partition by cookieid order by pv desc) as rn2,
        rank() over (partition by cookieid order by pv desc) as rn3,
        dense_rank() over (partition by cookieid order by pv desc) as rn4
    from cookie2
    order by cookieid, createtime;

sum|avg|min|max(column) over (partition by col1 order by col2 rows between <start> and <end>): aggregation over the rows of a window, with a freely defined frame.
<start> is unbounded preceding, n preceding, or current row; <end> is current row, n following, or unbounded following.
eg:
    select cookieid, createtime, pv,
        sum(pv) over (partition by cookieid order by createtime rows between unbounded preceding and current row) as pv1,  -- explicit frame: start of partition through the current row
        avg(pv) over (partition by cookieid order by createtime) as pv2,                                                   -- no ROWS clause: defaults to start of partition through the current row
        max(pv) over (partition by cookieid) as pv3,                                                                       -- all rows in the partition
        min(pv) over (partition by cookieid order by createtime rows between 3 preceding and current row) as pv4,          -- current row + 3 preceding rows
        sum(pv) over (partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv5,          -- current row + 3 preceding rows + 1 following row
        avg(pv) over (partition by cookieid order by createtime rows between current row and unbounded following) as pv6   -- current row + all following rows
    from cookie1;
4. Handling special delimiters: parse lines with RegexSerDe (regular expressions), or write a custom InputFormat
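As a sketch of the RegexSerDe approach, the table below reads lines whose fields are separated by the two-character delimiter "||", which a plain single-character FIELDS TERMINATED BY clause is not designed for. The table name, the column names, and the "||" delimiter are assumptions chosen for illustration.

```sql
-- Hypothetical table for lines such as: cookie1||2015-04-10 10:00:00||/index
CREATE TABLE IF NOT EXISTS regex_demo (
  cookieid   STRING,
  createtime STRING,
  url        STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- One capture group per column; backslashes are doubled inside the string literal.
  'input.regex' = '([^|]*)\\|\\|([^|]*)\\|\\|(.*)'
)
STORED AS TEXTFILE;

SELECT cookieid, url FROM regex_demo LIMIT 10;
```

Lines that do not match the pattern generally come back as rows of NULLs, so it is worth checking a sample of the data after loading.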
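LATERAL VIEW with explode
A lateral view is used together with UDTFs such as explode (often combined with split) to turn one row into multiple rows, which can then be aggregated. The lateral view first calls the UDTF on every row of the source table, the UDTF expands each row into one or more rows, and the lateral view then combines the results into a virtual table that can be given an alias.
The lateral view clause behaves like a virtual table joined to the source table (explode_lateral_view in these notes): each source row is paired with the rows generated from it.
explode cannot be nested inside another function call.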
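The behaviour described above can be sketched as follows. The table page_ads and its columns are hypothetical; adid_list is assumed to be a comma-separated STRING.

```sql
-- Hypothetical table page_ads(pageid STRING, adid_list STRING),
-- e.g. ('front_page', '1,2,3').

-- One output row per (pageid, ad id) pair.
SELECT p.pageid, t.adid
FROM page_ads p
LATERAL VIEW explode(split(p.adid_list, ',')) t AS adid;

-- The exploded rows can be aggregated like any other column.
SELECT t.adid, count(*) AS cnt
FROM page_ads p
LATERAL VIEW explode(split(p.adid_list, ',')) t AS adid
GROUP BY t.adid;
```

Here split is nested inside explode, which is allowed; the restriction applies to wrapping explode itself in another function.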