Spark SQL: window functions in Spark SQL and the corresponding API

1、Types of window functions

  1. ranking: ranking functions
  2. analytic: analytic functions
  3. aggregate: aggregate functions
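
Function type  SQL           DataFrame API  Description
Ranking        rank          rank           Rank within the group; values may have gaps after ties
Ranking        dense_rank    denseRank      Rank within the group; values are always consecutive
Ranking        percent_rank  percentRank    Within the group: (rank - 1) / (count(score) - 1)
Ranking        ntile         ntile          Splits the ordered rows of a group into n contiguous buckets of near-equal size and returns the bucket index, starting from 1
Ranking        row_number    rowNumber      Plain row number within the group, similar to Excel row numbers
Analytic       cume_dist     cumeDist       Fraction of rows in the group at or before the current row, ties included
Analytic       first_value   firstValue     First value in the window frame, following the window ordering (the minimum only when ordering ascending by that column)
Analytic       last_value    lastValue      Last value in the window frame, following the window ordering
Analytic       lag           lag            Value from the row n rows before the current row
Analytic       lead          lead           Value from the row n rows after the current row
Aggregate      min           min            Minimum
Aggregate      max           max            Maximum
Aggregate      sum           sum            Sum
Aggregate      avg           avg            Average

Note: the camelCase DataFrame API names denseRank, percentRank, rowNumber and cumeDist date from Spark 1.4/1.5. From Spark 1.6 on, these functions are named dense_rank, percent_rank, row_number and cume_dist.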
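
To make the ntile and cume_dist rows of the table concrete, here is a minimal pure-Python sketch of their semantics. It is not Spark code: the helper names `ntile` and `cume_dist` are illustrative, and the input is assumed to be one partition already sorted by score in descending order (the four "E" scores from the example in section 3).

```python
from typing import List


def ntile(num_rows: int, n: int) -> List[int]:
    """Bucket index (1-based) for each row of an ordered partition.

    Rows are split into n contiguous buckets; when the rows do not divide
    evenly, the first (num_rows % n) buckets receive one extra row.
    """
    base, extra = divmod(num_rows, n)
    buckets: List[int] = []
    for bucket in range(1, n + 1):
        size = base + (1 if bucket <= extra else 0)
        buckets.extend([bucket] * size)
    return buckets


def cume_dist(ordered_scores: List[int]) -> List[float]:
    """Fraction of rows at or before each row, counting tied rows.

    Assumes ordered_scores is sorted in descending order,
    as in `order by score desc`.
    """
    total = len(ordered_scores)
    return [sum(1 for s in ordered_scores if s >= score) / total
            for score in ordered_scores]


scores = [100, 99, 99, 98]      # lesson "E", ordered by score desc
print(ntile(len(scores), 2))    # [1, 1, 2, 2]
print(ntile(len(scores), 3))    # [1, 1, 2, 3]
print(cume_dist(scores))        # [0.25, 0.75, 0.75, 1.0]
```

The ntile output shows that buckets are filled in order (1, 1, 2, 2) rather than round-robin (1, 2, 1, 2).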

2、Usage

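In each form below, partition by defines the groups and order by defines the row order within a group. When order by is present and no frame is given, the default frame runs from the start of the group to the current row (including rows tied with it), so the aggregate functions return running values. Omit order by to aggregate over the whole group.

count(...) over(partition by ... order by ...) -- number of rows in the group.
sum(...) over(partition by ... order by ...) -- sum within the group.
max(...) over(partition by ... order by ...) -- maximum within the group.
min(...) over(partition by ... order by ...) -- minimum within the group.
avg(...) over(partition by ... order by ...) -- average within the group.
rank() over(partition by ... order by ...) -- rank values may have gaps after ties.
dense_rank() over(partition by ... order by ...) -- rank values are always consecutive.
first_value(...) over(partition by ... order by ...) -- first value in the group, following the ordering.
last_value(...) over(partition by ... order by ...) -- last value in the frame; with the default frame this is the value at the current row (or its last tied row), so add rows between unbounded preceding and unbounded following to get the last value of the whole group.
lag(..., n) over(partition by ... order by ...) -- value from the row n rows before the current row.
lead(..., n) over(partition by ... order by ...) -- value from the row n rows after the current row.
ratio_to_report(...) over(partition by ...) -- the argument in parentheses is the numerator, and its sum over the window defined in over() is the denominator. This function comes from Oracle and does not appear to be a Spark SQL built-in; in Spark SQL the same ratio can be written as x / sum(x) over(partition by ...).
percent_rank() over(partition by ... order by ...) -- (rank - 1) / (number of rows in the group - 1).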
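
The effect of order by on aggregates, and the behaviour of lag, lead and a ratio-to-total, can be sketched in plain Python. This is an illustration of the semantics only, not Spark code: the helper names (`running_sum`, `lag`, `lead`, `ratio_to_total`) are hypothetical, and each function assumes it receives a single partition already sorted in descending order.

```python
from typing import List, Optional


def running_sum(ordered: List[int]) -> List[int]:
    """sum(x) over (partition by ... order by x desc) with the default frame.

    The default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW,
    so tied rows all see the same running total.
    """
    return [sum(v for v in ordered if v >= x) for x in ordered]


def lag(ordered: List[int], n: int = 1) -> List[Optional[int]]:
    """Value n rows before the current row, or None when there is none."""
    return [ordered[i - n] if i - n >= 0 else None
            for i in range(len(ordered))]


def lead(ordered: List[int], n: int = 1) -> List[Optional[int]]:
    """Value n rows after the current row, or None when there is none."""
    return [ordered[i + n] if i + n < len(ordered) else None
            for i in range(len(ordered))]


def ratio_to_total(values: List[int]) -> List[float]:
    """x / sum(x) over (partition by ...): each value's share of the group total."""
    total = sum(values)
    return [v / total for v in values]


scores = [100, 99, 99, 98]      # one partition, ordered by score desc
print(running_sum(scores))      # [100, 298, 298, 396]
print(lag(scores))              # [None, 100, 99, 99]
print(lead(scores))             # [99, 99, 98, None]
print(ratio_to_total(scores))   # four shares that add up to 1
```

Note how the two tied rows (99, 99) share the running total 298, while a sum without order by would give 396 on every row.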

3、A worked example

Sample data: /root/score.json/score.json, containing student name, lesson, and score.

{"name":"A","lesson":"Math","score":100}
{"name":"B","lesson":"Math","score":100}
{"name":"C","lesson":"Math","score":99}
{"name":"D","lesson":"Math","score":98}
{"name":"A","lesson":"E","score":100}
{"name":"B","lesson":"E","score":99}
{"name":"C","lesson":"E","score":99}
{"name":"D","lesson":"E","score":98}

select
name,lesson,score,
ntile(2) over (partition by lesson order by score desc ) as ntile_2,
ntile(3) over (partition by lesson order by score desc ) as ntile_3,
row_number() over (partition by lesson order by score desc ) as row_number,
rank() over (partition by lesson order by score desc ) as rank,
dense_rank() over (partition by lesson order by score desc ) as dense_rank, 
percent_rank() over (partition by lesson order by score desc ) as percent_rank 
from score 
order by lesson,name,score

The output is shown in the table below.

name  lesson  score  ntile_2  ntile_3  row_number  rank  dense_rank  percent_rank
A     E       100    1        1        1           1     1           0.0
B     E       99     1        1        2           2     2           0.3333333333333333
C     E       99     2        2        3           2     2           0.3333333333333333
D     E       98     2        3        4           4     3           1.0
A     Math    100    1        1        1           1     1           0.0
B     Math    100    1        1        2           1     1           0.0
C     Math    99     2        2        3           3     2           0.6666666666666666
D     Math    98     2        3        4           4     3           1.0
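
As a cross-check of the ranking columns, the following pure-Python sketch recomputes row_number, rank, dense_rank and percent_rank from the sample data. It is not Spark code, and `rank_partition` is a hypothetical helper. One assumption to keep in mind: ties are broken by input order here (Python's sort is stable), whereas Spark does not guarantee which tied row gets the lower row_number.

```python
from itertools import groupby
from typing import Dict, List

rows = [
    {"name": "A", "lesson": "Math", "score": 100},
    {"name": "B", "lesson": "Math", "score": 100},
    {"name": "C", "lesson": "Math", "score": 99},
    {"name": "D", "lesson": "Math", "score": 98},
    {"name": "A", "lesson": "E", "score": 100},
    {"name": "B", "lesson": "E", "score": 99},
    {"name": "C", "lesson": "E", "score": 99},
    {"name": "D", "lesson": "E", "score": 98},
]


def rank_partition(part: List[Dict]) -> List[Dict]:
    """Add row_number, rank, dense_rank and percent_rank to one partition."""
    ordered = sorted(part, key=lambda r: -r["score"])  # order by score desc
    out = []
    rank = dense = 0
    prev = None
    for i, row in enumerate(ordered, start=1):
        if row["score"] != prev:      # a new score value starts a new rank
            rank = i                  # rank jumps to the row number (gaps)
            dense += 1                # dense_rank advances by one (no gaps)
            prev = row["score"]
        pct = (rank - 1) / (len(ordered) - 1) if len(ordered) > 1 else 0.0
        out.append({**row, "row_number": i, "rank": rank,
                    "dense_rank": dense, "percent_rank": pct})
    return out


# partition by lesson: sort by the key, then group adjacent rows
by_lesson = sorted(rows, key=lambda r: r["lesson"])
result: List[Dict] = []
for _, group in groupby(by_lesson, key=lambda r: r["lesson"]):
    result.extend(rank_partition(list(group)))

for r in result:
    print(r["name"], r["lesson"], r["score"], r["row_number"],
          r["rank"], r["dense_rank"], r["percent_rank"])
```

The printed rows follow the same order as the table above (lesson E first, then Math), and the four computed columns match it.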


=================================================================================

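Original article. When reposting, please be sure to place the following passage at the beginning of the article (keeping the hyperlink).
This article is reposted from 程序媛說事兒; original link: http://www.javashuo.com/article/p-xkazofhw-ho.html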

=================================================================================