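This article uses a business analytics case to show how SQL window functions are used. Working through the five requirements below shows how powerful window functions are: they make the SQL we write clearer and, to some extent, simplify the development work behind a requirement.
The analysis involves a single table, orders, and all operations are run in Hive. The data is as follows: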
```sql
-- Create the table
CREATE TABLE orders(
    order_id int,
    customer_id string,
    city string,
    add_time string,
    amount decimal(10,2));

-- Load the sample data
INSERT INTO orders VALUES
(1,"A","上海","2020-01-01 00:00:00.000000",200),
(2,"B","上海","2020-01-05 00:00:00.000000",250),
(3,"C","北京","2020-01-12 00:00:00.000000",200),
(4,"A","上海","2020-02-04 00:00:00.000000",400),
(5,"D","上海","2020-02-05 00:00:00.000000",250),
(5,"D","上海","2020-02-05 12:00:00.000000",300),
(6,"C","北京","2020-02-19 00:00:00.000000",300),
(7,"A","上海","2020-03-01 00:00:00.000000",150),
(8,"E","北京","2020-03-05 00:00:00.000000",500),
(9,"F","上海","2020-03-09 00:00:00.000000",250),
(10,"B","上海","2020-03-21 00:00:00.000000",600);
```
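In business terms, revenue growth for month m1 is computed as: 100 * (m1 - m0) / m0
Here m1 is the revenue of the given month and m0 is the revenue of the previous month. Technically, then, we need the revenue of each month and a way to pair each month's revenue with that of the previous month so that the calculation above can be made. The query is as follows: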
```sql
WITH monthly_revenue as (
    SELECT
        trunc(add_time,'MM') as month,
        sum(amount) as revenue
    FROM orders
    GROUP BY 1
)
,prev_month_revenue as (
    SELECT
        month,
        revenue,
        lag(revenue) over (order by month) as prev_month_revenue -- previous month's revenue
    FROM monthly_revenue
)
SELECT
    month,
    revenue,
    prev_month_revenue,
    round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) as revenue_growth
FROM prev_month_revenue
ORDER BY 1
```
Output:
month | revenue | prev_month_revenue | revenue_growth |
---|---|---|---|
2020-01-01 | 650 | NULL | NULL |
2020-02-01 | 1250 | 650 | 92.3 |
2020-03-01 | 1500 | 1250 | 20 |
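
The growth figures can be cross-checked outside Hive. The following is a minimal sketch in plain Python, not the article's method: it sums the sample amounts per month and carries the previous month's value forward in a variable, which is what `lag(revenue)` does in the query. The names `ORDERS` and `monthly_growth` are made up for this illustration, and the `YYYY-MM` prefix stands in for `trunc(add_time,'MM')`.

```python
from collections import defaultdict

# (add_time, amount) pairs copied from the sample orders table.
ORDERS = [
    ("2020-01-01", 200), ("2020-01-05", 250), ("2020-01-12", 200),
    ("2020-02-04", 400), ("2020-02-05", 250), ("2020-02-05", 300),
    ("2020-02-19", 300), ("2020-03-01", 150), ("2020-03-05", 500),
    ("2020-03-09", 250), ("2020-03-21", 600),
]


def monthly_growth(orders):
    """Return (month, revenue, growth_pct) rows; growth is None for the first month."""
    revenue = defaultdict(int)
    for add_time, amount in orders:
        revenue[add_time[:7]] += amount  # "YYYY-MM" prefix groups rows by month

    rows = []
    prev = None  # plays the role of lag(revenue) for the current row
    for month in sorted(revenue):
        cur = revenue[month]
        growth = None if prev is None else round(100 * (cur - prev) / prev, 1)
        rows.append((month, cur, growth))
        prev = cur
    return rows


for row in monthly_growth(ORDERS):
    print(row)
```

The first month has no predecessor, so its growth is `None`, matching the NULL in the Hive output.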
We can also group by city to see the revenue growth of a given city in a given month:
```sql
WITH monthly_revenue as (
    SELECT
        trunc(add_time,'MM') as month,
        city,
        sum(amount) as revenue
    FROM orders
    GROUP BY 1,2
)
,prev_month_revenue as (
    SELECT
        month,
        city,
        revenue,
        lag(revenue) over (partition by city order by month) as prev_month_revenue
    FROM monthly_revenue
)
SELECT
    month,
    city,
    revenue,
    round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) as revenue_growth
FROM prev_month_revenue
ORDER BY 2,1
```
Output:
month | city | revenue | revenue_growth |
---|---|---|---|
2020-01-01 | 上海 | 450 | NULL |
2020-02-01 | 上海 | 950 | 111.1 |
2020-03-01 | 上海 | 1000 | 5.3 |
2020-01-01 | 北京 | 200 | NULL |
2020-02-01 | 北京 | 300 | 50 |
2020-03-01 | 北京 | 500 | 66.7 |
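
To make the effect of `partition by city` concrete, here is a small Python sketch under the same assumptions as the sample data (it is an illustration, not Hive code, and `ORDERS` and `growth_by_city` are hypothetical names). The key point is that the "previous value" is reset for every city, so the first month of each city has no predecessor.

```python
from collections import defaultdict

# (city, add_time, amount) triples copied from the sample orders table.
ORDERS = [
    ("上海", "2020-01-01", 200), ("上海", "2020-01-05", 250),
    ("北京", "2020-01-12", 200), ("上海", "2020-02-04", 400),
    ("上海", "2020-02-05", 250), ("上海", "2020-02-05", 300),
    ("北京", "2020-02-19", 300), ("上海", "2020-03-01", 150),
    ("北京", "2020-03-05", 500), ("上海", "2020-03-09", 250),
    ("上海", "2020-03-21", 600),
]


def growth_by_city(orders):
    """Map each city to its list of (month, revenue, growth_pct) rows."""
    revenue = defaultdict(lambda: defaultdict(int))
    for city, add_time, amount in orders:
        revenue[city][add_time[:7]] += amount

    result = {}
    for city, by_month in revenue.items():
        prev = None  # reset for every city, like PARTITION BY city
        rows = []
        for month in sorted(by_month):
            cur = by_month[month]
            growth = None if prev is None else round(100 * (cur - prev) / prev, 1)
            rows.append((month, cur, growth))
            prev = cur
        result[city] = rows
    return result


for city, rows in growth_by_city(ORDERS).items():
    print(city, rows)
```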
A running total is the sum of the current element and all preceding elements, as in the following SQL:
```sql
WITH monthly_revenue as (
    SELECT
        trunc(add_time,'MM') as month,
        sum(amount) as revenue
    FROM orders
    GROUP BY 1
)
SELECT
    month,
    revenue,
    sum(revenue) over (order by month rows between unbounded preceding and current row) as running_total
FROM monthly_revenue
ORDER BY 1
```
Output:
month | revenue | running_total |
---|---|---|
2020-01-01 | 650 | 650 |
2020-02-01 | 1250 | 1900 |
2020-03-01 | 1500 | 3400 |
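
The frame `rows between unbounded preceding and current row` amounts to a prefix sum over the ordered rows. As a quick illustration in Python (a sketch, not part of the Hive workflow; the variable names are chosen here), `itertools.accumulate` produces the same running total from the monthly revenues shown above:

```python
from itertools import accumulate

# Monthly revenue as produced by the monthly_revenue CTE, ordered by month.
monthly_revenue = [("2020-01", 650), ("2020-02", 1250), ("2020-03", 1500)]

months = [month for month, _ in monthly_revenue]
# accumulate() yields the sum of the current element and all earlier ones.
running_total = list(accumulate(revenue for _, revenue in monthly_revenue))

for month, total in zip(months, running_total):
    print(month, total)
```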
We can also combine several window definitions in a single query:
```sql
SELECT
    order_id,
    customer_id,
    city,
    add_time,
    amount,
    sum(amount) over () as amount_total, -- total over all rows
    sum(amount) over (order by order_id rows between unbounded preceding and current row) as running_sum, -- running sum
    sum(amount) over (partition by customer_id order by add_time rows between unbounded preceding and current row) as running_sum_by_customer,
    avg(amount) over (order by add_time rows between 5 preceding and current row) as trailing_avg -- trailing (rolling) average
FROM orders
ORDER BY 1
```
Output:
order_id | customer_id | city | add_time | amount | amount_total | running_sum | running_sum_by_customer | trailing_avg |
---|---|---|---|---|---|---|---|---|
1 | A | 上海 | 2020-01-01 00:00:00.000000 | 200 | 3400 | 200 | 200 | 200 |
2 | B | 上海 | 2020-01-05 00:00:00.000000 | 250 | 3400 | 450 | 250 | 225 |
3 | C | 北京 | 2020-01-12 00:00:00.000000 | 200 | 3400 | 650 | 200 | 216.666667 |
4 | A | 上海 | 2020-02-04 00:00:00.000000 | 400 | 3400 | 1050 | 600 | 262.5 |
5 | D | 上海 | 2020-02-05 00:00:00.000000 | 250 | 3400 | 1300 | 250 | 260 |
5 | D | 上海 | 2020-02-05 12:00:00.000000 | 300 | 3400 | 1600 | 550 | 266.666667 |
6 | C | 北京 | 2020-02-19 00:00:00.000000 | 300 | 3400 | 1900 | 500 | 283.333333 |
7 | A | 上海 | 2020-03-01 00:00:00.000000 | 150 | 3400 | 2050 | 750 | 266.666667 |
8 | E | 北京 | 2020-03-05 00:00:00.000000 | 500 | 3400 | 2550 | 500 | 316.666667 |
9 | F | 上海 | 2020-03-09 00:00:00.000000 | 250 | 3400 | 2800 | 250 | 291.666667 |
10 | B | 上海 | 2020-03-21 00:00:00.000000 | 600 | 3400 | 3400 | 850 | 350 |
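
The frame `rows between 5 preceding and current row` covers the current row plus at most five rows before it, so early rows are averaged over fewer than six values. A minimal Python sketch of that frame follows; `trailing_avg` is a hypothetical helper written for this illustration, and the amounts are the sample orders sorted by add_time.

```python
def trailing_avg(values, preceding):
    """Average of each value together with up to `preceding` values before it."""
    result = []
    for i in range(len(values)):
        # The frame is clipped at the start of the list, as it is at the
        # start of a partition in SQL.
        window = values[max(0, i - preceding): i + 1]
        result.append(sum(window) / len(window))
    return result


# Order amounts sorted by add_time, copied from the sample data.
amounts = [200, 250, 200, 400, 250, 300, 300, 150, 500, 250, 600]
averages = trailing_avg(amounts, preceding=5)

for amount, avg in zip(amounts, averages):
    print(amount, round(avg, 6))
```

The first row is averaged over one value and the fourth over four, while every row from the sixth onward uses a full six-row frame.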
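The data above contains two rows with the same order_id: **(5,"D","上海","2020-02-05 00:00:00.000000",250)** and **(5,"D","上海","2020-02-05 12:00:00.000000",300)**. The data clearly needs to be cleaned and deduplicated, keeping only the latest row.
We first number the rows within each order_id group, newest first, and then keep only the most recent one: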
```sql
SELECT *
FROM (
    SELECT
        *,
        row_number() over (partition by order_id order by add_time desc) as rank
    FROM orders
) t
WHERE rank=1
```
Output:
t.order_id | t.customer_id | t.city | t.add_time | t.amount | t.rank |
---|---|---|---|---|---|
1 | A | 上海 | 2020-01-01 00:00:00.000000 | 200 | 1 |
2 | B | 上海 | 2020-01-05 00:00:00.000000 | 250 | 1 |
3 | C | 北京 | 2020-01-12 00:00:00.000000 | 200 | 1 |
4 | A | 上海 | 2020-02-04 00:00:00.000000 | 400 | 1 |
5 | D | 上海 | 2020-02-05 12:00:00.000000 | 300 | 1 |
6 | C | 北京 | 2020-02-19 00:00:00.000000 | 300 | 1 |
7 | A | 上海 | 2020-03-01 00:00:00.000000 | 150 | 1 |
8 | E | 北京 | 2020-03-05 00:00:00.000000 | 500 | 1 |
9 | F | 上海 | 2020-03-09 00:00:00.000000 | 250 | 1 |
10 | B | 上海 | 2020-03-21 00:00:00.000000 | 600 | 1 |
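
The "number the rows per key, keep row 1" pattern is equivalent to keeping, for each order_id, the row with the greatest add_time. A rough Python sketch of the same idea, assuming the timestamps are ISO-formatted strings so that string comparison matches chronological order (`ORDERS` and `keep_latest` are names invented for this example):

```python
# (order_id, customer_id, add_time, amount) rows copied from the sample table.
ORDERS = [
    (1, "A", "2020-01-01 00:00:00", 200), (2, "B", "2020-01-05 00:00:00", 250),
    (3, "C", "2020-01-12 00:00:00", 200), (4, "A", "2020-02-04 00:00:00", 400),
    (5, "D", "2020-02-05 00:00:00", 250), (5, "D", "2020-02-05 12:00:00", 300),
    (6, "C", "2020-02-19 00:00:00", 300), (7, "A", "2020-03-01 00:00:00", 150),
    (8, "E", "2020-03-05 00:00:00", 500), (9, "F", "2020-03-09 00:00:00", 250),
    (10, "B", "2020-03-21 00:00:00", 600),
]


def keep_latest(orders):
    """Keep one row per order_id: the one with the latest add_time."""
    latest = {}
    for row in orders:
        order_id, add_time = row[0], row[2]
        # Replace the stored row only if this one is newer.
        if order_id not in latest or add_time > latest[order_id][2]:
            latest[order_id] = row
    return sorted(latest.values())


cleaned = keep_latest(ORDERS)
for row in cleaned:
    print(row)
```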
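The cleaning step above removes the duplicate. Recomputing requirement 1 (monthly revenue growth) on the cleaned data, the correct SQL script is: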
```sql
WITH orders_cleaned as (
    SELECT *
    FROM (
        SELECT
            *,
            row_number() over (partition by order_id order by add_time desc) as rank
        FROM orders
    )t
    WHERE rank=1
)
,monthly_revenue as (
    SELECT
        trunc(add_time,'MM') as month,
        sum(amount) as revenue
    FROM orders_cleaned
    GROUP BY 1
)
,prev_month_revenue as (
    SELECT
        month,
        revenue,
        lag(revenue) over (order by month) as prev_month_revenue
    FROM monthly_revenue
)
SELECT
    month,
    revenue,
    round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) as revenue_growth
FROM prev_month_revenue
ORDER BY 1
```
Output:
month | revenue | revenue_growth |
---|---|---|
2020-01-01 | 650 | NULL |
2020-02-01 | 1000 | 53.8 |
2020-03-01 | 1500 | 50 |
Create a view over the cleaned data so that it can be reused later:
```sql
CREATE VIEW orders_cleaned AS
SELECT
    order_id,
    customer_id,
    city,
    add_time,
    amount
FROM (
    SELECT
        *,
        row_number() over (partition by order_id order by add_time desc) as rank
    FROM orders
)t
WHERE rank=1
```
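Selecting the top N rows per group is one of the most common uses of SQL window functions. The following SQL returns the two largest orders by amount for each month: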
```sql
WITH orders_ranked as (
    SELECT
        trunc(add_time,'MM') as month,
        *,
        row_number() over (partition by trunc(add_time,'MM') order by amount desc, add_time) as rank
    FROM orders_cleaned
)
SELECT
    month,
    order_id,
    customer_id,
    city,
    add_time,
    amount
FROM orders_ranked
WHERE rank <=2
ORDER BY 1
```
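
As a way to sanity-check the top-N logic, the sketch below reproduces it in plain Python on the deduplicated sample data. It is an illustration under stated assumptions rather than Hive output: `CLEANED` and `top_n_per_month` are hypothetical names, and the sort key mirrors `order by amount desc, add_time`, so ties on amount go to the earlier order.

```python
from collections import defaultdict

# (order_id, add_time, amount) rows of the deduplicated sample data.
CLEANED = [
    (1, "2020-01-01 00:00:00", 200), (2, "2020-01-05 00:00:00", 250),
    (3, "2020-01-12 00:00:00", 200), (4, "2020-02-04 00:00:00", 400),
    (5, "2020-02-05 12:00:00", 300), (6, "2020-02-19 00:00:00", 300),
    (7, "2020-03-01 00:00:00", 150), (8, "2020-03-05 00:00:00", 500),
    (9, "2020-03-09 00:00:00", 250), (10, "2020-03-21 00:00:00", 600),
]


def top_n_per_month(orders, n):
    """Map each month to the order_ids of its n largest orders."""
    by_month = defaultdict(list)
    for order_id, add_time, amount in orders:
        by_month[add_time[:7]].append((order_id, add_time, amount))

    top = {}
    for month, rows in by_month.items():
        # Largest amount first; earlier add_time breaks ties.
        rows.sort(key=lambda r: (-r[2], r[1]))
        top[month] = [r[0] for r in rows[:n]]
    return top


print(top_n_per_month(CLEANED, 2))
```

Because `row_number()` never assigns the same number twice within a partition, exactly two rows per month survive the filter even when amounts tie.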
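The following SQL computes the repeat purchase rate, defined as (number of customers with a repeat purchase) / (total number of customers) * 100%, and the typical ratio of the second order amount to the first: avg(second order amount / first order amount).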
```sql
WITH customer_orders as (
    SELECT
        *,
        row_number() over (partition by customer_id order by add_time) as customer_order_n,
        lag(amount) over (partition by customer_id order by add_time) as prev_order_amount
    FROM orders_cleaned
)
SELECT
    round(100.0*sum(case when customer_order_n=2 then 1 end)/count(distinct customer_id),1) as repeat_purchases, -- repeat purchase rate
    avg(case when customer_order_n=2 then 1.0*amount/prev_order_amount end) as revenue_expansion -- typical ratio of the second order amount to the first
FROM customer_orders
```
Output of the customer_orders CTE:
orders_cleaned.order_id | orders_cleaned.customer_id | orders_cleaned.city | orders_cleaned.add_time | orders_cleaned.amount | customer_order_n | prev_order_amount |
---|---|---|---|---|---|---|
1 | A | 上海 | 2020-01-01 00:00:00.000000 | 200 | 1 | NULL |
4 | A | 上海 | 2020-02-04 00:00:00.000000 | 400 | 2 | 200 |
7 | A | 上海 | 2020-03-01 00:00:00.000000 | 150 | 3 | 400 |
2 | B | 上海 | 2020-01-05 00:00:00.000000 | 250 | 1 | NULL |
10 | B | 上海 | 2020-03-21 00:00:00.000000 | 600 | 2 | 250 |
3 | C | 北京 | 2020-01-12 00:00:00.000000 | 200 | 1 | NULL |
6 | C | 北京 | 2020-02-19 00:00:00.000000 | 300 | 2 | 200 |
5 | D | 上海 | 2020-02-05 12:00:00.000000 | 300 | 1 | NULL |
8 | E | 北京 | 2020-03-05 00:00:00.000000 | 500 | 1 | NULL |
9 | F | 上海 | 2020-03-09 00:00:00.000000 | 250 | 1 | NULL |
Final output:
repeat_purchases | revenue_expansion |
---|---|
50 | 1.9666666666666668 |
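
Both metrics can be reproduced by hand from the deduplicated data. The Python sketch below is a cross-check rather than the article's implementation (`CLEANED` and `repeat_metrics` are names chosen for this example): a customer counts as a repeat buyer once they have a second order, and the expansion figure averages second order amount / first order amount over those customers.

```python
from collections import defaultdict

# (customer_id, add_time, amount) rows of the deduplicated sample data.
CLEANED = [
    ("A", "2020-01-01 00:00:00", 200), ("B", "2020-01-05 00:00:00", 250),
    ("C", "2020-01-12 00:00:00", 200), ("A", "2020-02-04 00:00:00", 400),
    ("D", "2020-02-05 12:00:00", 300), ("C", "2020-02-19 00:00:00", 300),
    ("A", "2020-03-01 00:00:00", 150), ("E", "2020-03-05 00:00:00", 500),
    ("F", "2020-03-09 00:00:00", 250), ("B", "2020-03-21 00:00:00", 600),
]


def repeat_metrics(orders):
    """Return (repeat purchase rate in %, mean ratio of second to first order)."""
    amounts_by_customer = defaultdict(list)
    # Sorting by add_time gives each customer's orders in purchase order.
    for customer_id, _, amount in sorted(orders, key=lambda r: r[1]):
        amounts_by_customer[customer_id].append(amount)

    repeaters = [a for a in amounts_by_customer.values() if len(a) >= 2]
    rate = round(100 * len(repeaters) / len(amounts_by_customer), 1)
    ratios = [a[1] / a[0] for a in repeaters]
    expansion = sum(ratios) / len(ratios) if ratios else None
    return rate, expansion


rate, expansion = repeat_metrics(CLEANED)
print(rate, expansion)
```

Customers A, B and C each have a second order, giving 3 of 6 customers and ratios of 2.0, 2.4 and 1.5.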
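This article covered the basic usage of SQL window functions and the scenarios they suit, illustrated with a concrete analysis case. Working through these examples should give a deeper understanding of SQL window functions.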