Hive Advanced Usage and Analytic Window Functions

Hive advanced usage
1. Support for complex data types
array, map, struct
Elements of these complex types can be traversed and queried.
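
As a rough sketch of how these types are declared and read: the table user_profile, its columns, and the delimiters below are all assumptions made for illustration and do not come from the original notes.

```sql
-- Hypothetical table; names and delimiters are assumptions for illustration.
CREATE TABLE user_profile (
  userid  STRING,
  tags    ARRAY<STRING>,
  props   MAP<STRING, STRING>,
  address STRUCT<city:STRING, zip:STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
  MAP KEYS TERMINATED BY ':';

SELECT userid,
       tags[0]         AS first_tag,  -- array element by zero-based index
       size(tags)      AS tag_count,  -- number of elements in the array
       props['device'] AS device,     -- map value looked up by key
       address.city    AS city        -- struct field by dot notation
FROM user_profile;
```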

2. Support for views

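3. Functions
3.1 A rich set of built-in functions
3.2 Custom Java processing classes (UDFs): add the class to Hive as a jar file, then define a temporary function bound to that class to apply custom processing to the data
3.3 JSON parsing and manipulation: get_json_object, json_tuple
3.4 Calling custom scripts (for example Python) from HQL through TRANSFORM
3.5 Analytic window functions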
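a. sum, avg, min, max: aggregation within a window
over (partition by col1 order by col2 rows between {unbounded | n} preceding and {current row | n following})
If ORDER BY is given but ROWS BETWEEN is not, the window defaults to the span from the start of the partition to the current row.
If ORDER BY is not given either, the aggregate covers every row in the partition.
The key is understanding ROWS BETWEEN, also called the WINDOW clause:
PRECEDING: rows before the current row
FOLLOWING: rows after the current row
CURRENT ROW: the current row
UNBOUNDED: no limit, that is, the edge of the partition
UNBOUNDED PRECEDING: from the first row of the partition
UNBOUNDED FOLLOWING: to the last row of the partition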
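b. ntile, row_number, rank, dense_rank
NTILE(n): splits the ordered rows of a partition into n buckets and returns the bucket number of the current row
ROW_NUMBER(): numbers the rows of a partition in order, starting from 1, with no duplicates
RANK(): ranks rows within the partition; ties share a rank and leave a gap after it (for example 3, 3, 5)
DENSE_RANK(): ranks rows within the partition; ties share a rank and leave no gap (for example 3, 3, 4)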
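c. cume_dist, percent_rank
CUME_DIST: (number of rows with a value less than or equal to the current value) / (total rows in the partition)
PERCENT_RANK: (RANK of the current row within the partition - 1) / (total rows in the partition - 1)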
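d. lag, lead, first_value, last_value
LAG(col, n, DEFAULT): returns col from the nth row before the current row in the window; DEFAULT is returned when no such row exists
LEAD(col, n, DEFAULT): returns col from the nth row after the current row in the window; DEFAULT is returned when no such row exists
first_value(col1) over (partition by col2 order by col3): the first value in the sorted partition, up to the current row
last_value(col1) over (partition by col2 order by col3): the last value in the sorted partition, up to the current row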
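e. grouping sets, grouping__id, cube, rollup: commonly used for OLAP
grouping sets, grouping__id: aggregates over each listed set of GROUP BY columns separately and unions the results into one result set
cube: aggregates over every combination of the GROUP BY columns
rollup: aggregates over the hierarchical combinations of the GROUP BY columns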
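
Before the window-function examples, here is a hedged sketch of the UDF registration, the JSON functions, and the TRANSFORM call listed in the function outline above. The jar path, the class com.example.hive.udf.MyLower, the function name my_lower, the table json_log, and the script parse_url.py are all hypothetical; only the Hive statements and built-in functions themselves are real.

```sql
-- UDF: register a Java class from a jar as a temporary function.
-- The jar path, class name, and function name are hypothetical.
ADD JAR /path/to/my-hive-udfs.jar;
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.MyLower';
SELECT my_lower(url) FROM cookie4;

-- JSON: assumes a hypothetical table json_log with a STRING column `line`
-- holding one JSON object per row.
SELECT get_json_object(line, '$.user.id') AS userid,  -- one JSON path per call
       get_json_object(line, '$.url')     AS url
FROM json_log;

-- json_tuple extracts several top-level keys in a single pass.
SELECT t.url, t.ts
FROM json_log
LATERAL VIEW json_tuple(line, 'url', 'ts') t AS url, ts;

-- TRANSFORM: stream rows through an external script.
-- parse_url.py is a hypothetical script that reads tab-separated lines
-- on stdin and writes tab-separated lines on stdout.
ADD FILE /path/to/parse_url.py;
SELECT TRANSFORM (cookieid, url)
       USING 'python parse_url.py'
       AS (cookieid STRING, host STRING)
FROM cookie4;
```

get_json_object re-parses the JSON string on every call, so json_tuple is usually the better fit when several keys are needed from the same column.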

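grouping sets (list of GROUP BY column sets): each entry is a different combination of the GROUP BY columns
grouping__id: an identifier for the grouping set that produced each row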
eg:
select month,day,count(distinct cookieid) as uv,grouping__id
from cookie5
group by month,day
grouping sets (month, day, (month, day))  -- the (month, day) set is optional
order by grouping__id;

cube: with cube aggregates over every combination of the GROUP BY dimensions
eg:
select month,day,count(distinct cookieid) as uv,grouping__id
from cookie5
group by month,day
with cube
order by grouping__id; 

rollup: with rollup aggregates level by level, following the order of the GROUP BY dimensions
eg:
select month,day,count(distinct cookieid) as uv,grouping__id
from cookie5
group by month,day
with rollup
order by grouping__id; 

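lag(column, n, default): the value from n rows earlier in the window, so values appear shifted down by n rows
lead(column, n, default): the value from n rows later in the window, so values appear shifted up by n rows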
eg:
select cookieid, createtime, url, 
  row_number() over (partition by cookieid order by createtime) as rn, 
  LAG(createtime,1,'1970-01-01 00:00:00') over (partition by cookieid order by createtime) as front_1_time, 
  LEAD(createtime,2,'2018-12-24 00:00:00') over (partition by cookieid order by createtime) as behind_2_time 
from  cookie4;

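first_value(column): the first value in the window after sorting (sort in descending order to get the last value instead)
last_value(column): the last value in the sorted window up to the current row, which, barring ties in the sort key, is the current row's own value
eg: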
select  cookieid, createtime, url, 
  row_number() over (partition by cookieid order by createtime) as rn, 
  first_value(url) over (partition by cookieid order by createtime) as first1,
  first_value(url) over (partition by cookieid order by createtime desc) as last1,
  last_value(url) over (partition by cookieid order by createtime) as last2 
from cookie4;
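
Because last_value with only an ORDER BY looks no further than the current row, getting the true last value of the partition needs an explicit frame. A minimal sketch, reusing cookie4 from the example above (the alias last_in_partition is made up here):

```sql
SELECT cookieid, createtime, url,
  last_value(url) OVER (
    PARTITION BY cookieid
    ORDER BY createtime
    -- widen the frame to the whole partition instead of stopping at the current row
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS last_in_partition
FROM cookie4;
```

This should return the same values as the first_value(url) ... order by createtime desc column (last1) above, without reversing the sort.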

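CUME_DIST(): (number of rows with a value less than or equal to the current value) / (total rows in the partition)
PERCENT_RANK(): (RANK of the current row within the partition - 1) / (total rows in the partition - 1)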
eg:
select  dept, userid, sal,
  cume_dist() over (order by sal) as rn1,
  cume_dist() over (partition by dept order by sal) as rn2
from cookie3;
select  dept, userid, sal,
  percent_rank() over (order by sal) as rn1, -- no partition: the whole table is one group
  rank() over (order by sal) as rn11, -- RANK within the group
  sum(1) over (partition by null) as rn12, -- total rows in the group
  percent_rank() over (partition by dept order by sal) as rn2,
  rank() over (partition by dept order by sal) as rn21,
  sum(1) over (partition by dept) as rn22 
from cookie3;

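ntile(n): splits the rows in the window into n buckets
row_number(): sequence number of each row within the window, starting from 1
rank(): rank within the window, with gaps after ties (for example 3, 3, 5)
dense_rank(): rank within the window, without gaps after ties (for example 3, 3, 4)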
eg:
select cookieid, createtime, pv,
  ntile(2) over (partition by cookieid order by createtime) as rn1,
  row_number() over (partition by cookieid order by pv desc) as rn2,
  rank() over (partition by cookieid order by pv desc) as rn3,
  dense_rank() over (partition by cookieid order by pv desc) as rn4
from  cookie2 
order by cookieid,createtime;

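sum|avg|min|max(column) over (partition by col1 order by col2 rows between <start> and <end>): aggregation over the rows in the window, with a freely defined range. <start> is unbounded preceding, n preceding, or current row; <end> is current row, n following, or unbounded following.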
eg:
select cookieid,createtime,pv, 
   sum(pv) over (partition by cookieid order by createtime rows between unbounded preceding and current row) as pv1, -- start of partition to current row, written out explicitly
   avg(pv) over (partition by cookieid order by createtime) as pv2,                                                  -- default window: start of partition to current row
   max(pv) over (partition by cookieid) as pv3,                                                                      -- all rows in the partition
   min(pv) over (partition by cookieid order by createtime rows between 3 preceding and current row) as pv4,         -- current row + 3 preceding rows
   sum(pv) over (partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv5,         -- current row + 3 preceding rows + 1 following row
   avg(pv) over (partition by cookieid order by createtime rows between current row and unbounded following) as pv6  -- current row + all following rows
from cookie1;

4. Handling special delimiters: RegexSerDe parsing with regular expressions, or a custom InputFormat
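
As a minimal sketch of the RegexSerDe approach, assume a hypothetical text file whose fields are separated by the two-character delimiter ||. The table name, column names, and regex are illustrative only; org.apache.hadoop.hive.serde2.RegexSerDe is the SerDe class that ships with Hive.

```sql
-- Hypothetical table over lines such as: 10.0.0.1||2018-12-24 10:00:00||/index.html
-- Each capture group in input.regex maps to one column, in order.
CREATE TABLE access_log_regex (
  ip      STRING,
  ts      STRING,
  request STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- [|] matches a literal pipe without needing backslash escapes
  "input.regex" = "([^|]*)[|][|]([^|]*)[|][|](.*)"
)
STORED AS TEXTFILE;
```

Lines that do not match the regex come back with every column NULL, so a quick count of NULL rows is a cheap sanity check on the pattern.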

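lateral view and the explode function

A lateral view is used together with UDTFs such as explode (typically applied to the array returned by split) to turn one row into several rows, which can then be aggregated. The lateral view first calls the UDTF for each row of the source table, the UDTF expands that row into one or more rows, and the lateral view then combines the results into a virtual table that supports a table alias.
The lateral view clause acts as a virtual table that is joined to the source table (explode_lateral_view in this example), pairing each source row with the rows generated from it, much like a Cartesian product.
explode cannot be nested inside another function.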
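
The behaviour described above can be sketched as follows, assuming explode_lateral_view has a key column id and a comma-separated STRING column tags (both column names are hypothetical, as is the alias t):

```sql
-- One output row per (id, tag) pair: each source row is paired with
-- the rows that explode generates from its own tags value.
SELECT id, tag
FROM explode_lateral_view
LATERAL VIEW explode(split(tags, ',')) t AS tag;

-- The exploded rows can then be aggregated like any other column.
SELECT tag, count(*) AS cnt
FROM explode_lateral_view
LATERAL VIEW explode(split(tags, ',')) t AS tag
GROUP BY tag;
```

With a plain LATERAL VIEW, source rows whose array is empty or NULL produce no output rows; LATERAL VIEW OUTER keeps them with a NULL tag.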
