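Function Type | SQL | DataFrame API | Description |
---|---|---|---|
Ranking | rank | rank | Rank values may have gaps after ties |
Ranking | dense_rank | denseRank | Rank values are always consecutive (no gaps after ties) |
Ranking | percent_rank | percentRank | (rank - 1) / (count(score) - 1) within the same partition |
Ranking | ntile | ntile | Splits the ordered rows of a partition into n buckets of near-equal size and returns the bucket index, starting from 1 |
Ranking | row_number | rowNumber | A plain sequential row number, similar to row numbers in Excel |
Analytic | cume_dist | cumeDist | Fraction of rows in the partition that sort at or before the current row |
Analytic | first_value | firstValue | First value in the window frame (the minimum only when ordering ascending by that column) |
Analytic | last_value | lastValue | Last value in the window frame (the maximum only when ordering ascending by that column and the frame reaches the end of the partition) |
Analytic | lag | lag | Value from the row n rows before the current row |
Analytic | lead | lead | Value from the row n rows after the current row |
Aggregate | min | min | Minimum |
Aggregate | max | max | Maximum |
Aggregate | sum | sum | Sum |
Aggregate | avg | avg | Average |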
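The descriptions of ntile and cume_dist are easier to follow with a small worked example. The following is a minimal pure-Python sketch, not Spark code: the helper names `ntile` and `cume_dist` are hypothetical, and the input is assumed to be one partition that is already ordered by score descending.

```python
def ntile(num_rows, n):
    """Bucket index (1-based) for each row of an already-ordered partition.

    Rows are assigned in contiguous runs; when num_rows is not divisible
    by n, the first (num_rows % n) buckets each receive one extra row.
    """
    base, extra = divmod(num_rows, n)
    buckets = []
    for bucket in range(1, n + 1):
        size = base + (1 if bucket <= extra else 0)
        buckets.extend([bucket] * size)
    return buckets


def cume_dist(scores_desc):
    """Cumulative distribution for scores ordered descending.

    For each row: (rows at or before the current row, counting ties) / total rows.
    """
    total = len(scores_desc)
    return [sum(1 for s in scores_desc if s >= score) / total for score in scores_desc]


print(ntile(4, 2))
print(ntile(4, 3))
print(cume_dist([100, 99, 99, 98]))
```

Note that the buckets are filled in contiguous runs rather than round-robin, so with four rows and three buckets the first bucket holds two rows.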
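count(...) over(partition by ... order by ...)--row count within the partition; with order by and no explicit frame, this is a running count up to the current row.
sum(...) over(partition by ... order by ...)--sum within the partition (running, as above).
max(...) over(partition by ... order by ...)--maximum within the partition (running, as above).
min(...) over(partition by ... order by ...)--minimum within the partition (running, as above).
avg(...) over(partition by ... order by ...)--average within the partition (running, as above).
rank() over(partition by ... order by ...)--rank values may have gaps after ties.
dense_rank() over(partition by ... order by ...)--rank values are always consecutive.
first_value(...) over(partition by ... order by ...)--first value in the window frame.
last_value(...) over(partition by ... order by ...)--last value in the window frame.
lag(..., n) over(partition by ... order by ...)--value from the row n rows before the current row.
lead(..., n) over(partition by ... order by ...)--value from the row n rows after the current row.
ratio_to_report(...) over(partition by ...)--an Oracle analytic function: the expression inside ratio_to_report() is the numerator, and the over() clause defines the denominator (the sum of that expression over the partition).
percent_rank() over(partition by ... order by ...)--(rank - 1) / (rows in partition - 1).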
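How `order by` inside `over(...)` changes the aggregates and `last_value` is easy to miss, so here is a small sketch. It uses Python's built-in `sqlite3` module as a stand-in engine (an assumption made for convenience; it needs SQLite 3.25 or newer for window functions). Spark SQL documents the same default frame, `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`, when `order by` is present, but the snippet has not been run against Spark. The three-row table is illustrative and uses distinct scores to avoid ties.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE score (name TEXT, lesson TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO score VALUES (?, ?, ?)",
    [("A", "Math", 100), ("C", "Math", 99), ("D", "Math", 98)],
)

rows = conn.execute(
    """
    SELECT name,
           score,
           -- default frame with ORDER BY: start of partition up to current row
           sum(score) OVER (PARTITION BY lesson ORDER BY score DESC) AS running_sum,
           -- no ORDER BY: the frame is the whole partition
           sum(score) OVER (PARTITION BY lesson) AS partition_sum,
           -- default frame ends at the current row, so this is the current score
           last_value(score) OVER (PARTITION BY lesson ORDER BY score DESC) AS last_default,
           -- explicit full frame: the last value of the whole partition
           last_value(score) OVER (
               PARTITION BY lesson ORDER BY score DESC
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS last_full,
           lag(score, 1) OVER (PARTITION BY lesson ORDER BY score DESC) AS prev_score,
           lead(score, 1) OVER (PARTITION BY lesson ORDER BY score DESC) AS next_score
    FROM score
    ORDER BY score DESC
    """
).fetchall()

for row in rows:
    print(row)
```

With the default frame, `last_value` simply echoes the current row's score; spelling out the frame through `UNBOUNDED FOLLOWING` is what yields the last value of the partition. `lag` and `lead` return NULL (`None` in Python) at the edges of the partition.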
Sample data: /root/score.json/score.json, containing student name, lesson, and score
{"name":"A","lesson":"Math","score":100}
{"name":"B","lesson":"Math","score":100}
{"name":"C","lesson":"Math","score":99}
{"name":"D","lesson":"Math","score":98}
{"name":"A","lesson":"E","score":100}
{"name":"B","lesson":"E","score":99}
{"name":"C","lesson":"E","score":99}
{"name":"D","lesson":"E","score":98}
select name, lesson, score,
  ntile(2) over (partition by lesson order by score desc) as ntile_2,
  ntile(3) over (partition by lesson order by score desc) as ntile_3,
  row_number() over (partition by lesson order by score desc) as row_number,
  rank() over (partition by lesson order by score desc) as rank,
  dense_rank() over (partition by lesson order by score desc) as dense_rank,
  percent_rank() over (partition by lesson order by score desc) as percent_rank
from score
order by lesson, name, score
The output is shown in the table below.
name | lesson | score | ntile_2 | ntile_3 | row_number | rank | dense_rank | percent_rank |
---|---|---|---|---|---|---|---|---|
A | E | 100 | 1 | 1 | 1 | 1 | 1 | 0.0 |
B | E | 99 | 1 | 1 | 2 | 2 | 2 | 0.3333333333333333 |
C | E | 99 | 2 | 2 | 3 | 2 | 2 | 0.3333333333333333 |
D | E | 98 | 2 | 3 | 4 | 4 | 3 | 1.0 |
A | Math | 100 | 1 | 1 | 1 | 1 | 1 | 0.0 |
B | Math | 100 | 1 | 1 | 2 | 1 | 1 | 0.0 |
C | Math | 99 | 2 | 2 | 3 | 3 | 2 | 0.6666666666666666 |
D | Math | 98 | 2 | 3 | 4 | 4 | 3 | 1.0 |
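To check the rank, dense_rank, and percent_rank columns against the definitions, the table can be recomputed by hand. The following is a minimal pure-Python sketch that applies those definitions to the sample data; the function name `ranking_columns` is hypothetical and this is not how Spark computes them internally. The ntile and row_number columns are left out because their values for tied scores depend on the order in which the engine happens to emit the ties.

```python
from collections import defaultdict

records = [
    ("A", "Math", 100), ("B", "Math", 100), ("C", "Math", 99), ("D", "Math", 98),
    ("A", "E", 100), ("B", "E", 99), ("C", "E", 99), ("D", "E", 98),
]


def ranking_columns(records):
    """Return {(name, lesson): (rank, dense_rank, percent_rank)}.

    Mirrors: partition by lesson order by score desc.
    """
    by_lesson = defaultdict(list)
    for name, lesson, score in records:
        by_lesson[lesson].append((name, score))

    result = {}
    for lesson, rows in by_lesson.items():
        scores = [score for _, score in rows]
        distinct_desc = sorted(set(scores), reverse=True)
        for name, score in rows:
            # rank: 1 + number of rows with a strictly higher score (gaps after ties)
            rank = 1 + sum(1 for s in scores if s > score)
            # dense_rank: position among the distinct scores (no gaps)
            dense_rank = 1 + distinct_desc.index(score)
            # percent_rank: (rank - 1) / (rows in partition - 1)
            percent_rank = (rank - 1) / (len(scores) - 1) if len(scores) > 1 else 0.0
            result[(name, lesson)] = (rank, dense_rank, percent_rank)
    return result


table = ranking_columns(records)
for (name, lesson), cols in sorted(table.items(), key=lambda kv: (kv[0][1], kv[0][0])):
    print(name, lesson, *cols)
```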
Reference: Window functions in Spark SQL
=================================================================================
Original article. When reposting, be sure to place the following statement at the top of the article (keep the hyperlink).
This article is reposted from 程序媛說事兒; original link: http://www.javashuo.com/article/p-xkazofhw-ho.html
=================================================================================