使用Hive SQL窗口函數進行商務數據分析

本文會從一個商務分析案例入手,說明SQL窗口函數的使用方式。經過本文的5個需求分析,能夠看出SQL窗口函數的功能十分強大,不只可以使咱們編寫的SQL邏輯更加清晰,並且在某種程度上能夠簡化需求開發。sql

數據準備

本文主要分析只涉及一張訂單表orders,操做過程在Hive中完成,具體數據以下:函數

-- 建表
CREATE TABLE orders(
    order_id int,
    customer_id string,
    city string,
    add_time string,
    amount decimal(10,2));

-- 準備數據                              
INSERT INTO orders VALUES
(1,"A","上海","2020-01-01 00:00:00.000000",200),
(2,"B","上海","2020-01-05 00:00:00.000000",250),
(3,"C","北京","2020-01-12 00:00:00.000000",200),
(4,"A","上海","2020-02-04 00:00:00.000000",400),
(5,"D","上海","2020-02-05 00:00:00.000000",250),
(5,"D","上海","2020-02-05 12:00:00.000000",300),
(6,"C","北京","2020-02-19 00:00:00.000000",300),
(7,"A","上海","2020-03-01 00:00:00.000000",150),
(8,"E","北京","2020-03-05 00:00:00.000000",500),
(9,"F","上海","2020-03-09 00:00:00.000000",250),
(10,"B","上海","2020-03-21 00:00:00.000000",600);

需求1:收入增加

在業務方面,第m1個月的收入增加計算以下:100 *(m1-m0)/ m0大數據

其中,m1是給定月份的收入,m0是上個月的收入。所以,從技術上講,咱們須要找到每月的收入,而後以某種方式將每月的收入與上一個收入相關聯,以便進行上述計算。計算當時以下:code

WITH
monthly_revenue as (
    SELECT
    trunc(add_time,'MM') as month,
    sum(amount) as revenue
    FROM orders
    GROUP BY 1
)
,prev_month_revenue as (
    SELECT 
    month,
    revenue,
    lag(revenue) over (order by month) as prev_month_revenue -- 上一月收入
    FROM monthly_revenue
)
SELECT 
  month,
  revenue,
  prev_month_revenue,
  round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) as revenue_growth
FROM prev_month_revenue
ORDER BY 1

結果輸出ci

month revenue prev_month_revenue revenue_growth
2020-01-01 650 NULL NULL
2020-02-01 1250 650 92.3
2020-03-01 1500 1250 20

咱們還能夠按照按城市分組進行統計,查看某個城市某個月份的收入增加狀況開發

WITH
monthly_revenue as (
    SELECT
     trunc(add_time,'MM') as month,
    city,
    sum(amount) as revenue
    FROM orders
    GROUP BY 1,2
)
,prev_month_revenue as (
    SELECT 
    month,
    city,
    revenue,
    lag(revenue) over (partition by city order by month) as prev_month_revenue
    FROM monthly_revenue
)
SELECT 
month,
city,
revenue,
round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) as revenue_growth
FROM prev_month_revenue
ORDER BY 2,1

結果輸出string

month city revenue revenue_growth
2020-01-01 上海 450 NULL
2020-02-01 上海 950 111.1
2020-03-01 上海 1000 5.3
2020-01-01 北京 200 NULL
2020-02-01 北京 300 50
2020-03-01 北京 500 66.7

需求2:累計求和

累計彙總,即當前元素和全部先前元素的總和,以下面的SQL:it

WITH
monthly_revenue as (
    SELECT
    trunc(add_time,'MM') as month,
    sum(amount) as revenue
    FROM orders
    GROUP BY 1
)
SELECT 
month,
revenue,
sum(revenue) over (order by month rows between unbounded preceding and current row) as running_total
FROM monthly_revenue
ORDER BY 1

結果輸出io

month revenue running_total
2020-01-01 650 650
2020-02-01 1250 1900
2020-03-01 1500 3400

咱們還可使用下面的組合方式進行分析,SQL以下:table

SELECT
   order_id,
   customer_id,
   city,
   add_time,
   amount,
   sum(amount) over () as amount_total, -- 全部數據求和
   sum(amount) over (order by order_id rows between unbounded preceding and current row) as running_sum, -- 累計求和
   sum(amount) over (partition by customer_id order by add_time rows between unbounded    preceding and current row) as running_sum_by_customer, 
   avg(amount) over (order by add_time rows between 5 preceding and current row) as  trailing_avg -- 滾動求平均
FROM orders
ORDER BY 1

結果輸出

order_id customer_id city add_time amount amount_total running_sum running_sum_by_customer trailing_avg
1 A 上海 2020-01-01 00:00:00.000000 200 3400 200 200 200
2 B 上海 2020-01-05 00:00:00.000000 250 3400 450 250 225
3 C 北京 2020-01-12 00:00:00.000000 200 3400 650 200 216.666667
4 A 上海 2020-02-04 00:00:00.000000 400 3400 1050 600 262.5
5 D 上海 2020-02-05 00:00:00.000000 250 3400 1300 250 260
5 D 上海 2020-02-05 12:00:00.000000 300 3400 1600 550 266.666667
6 C 北京 2020-02-19 00:00:00.000000 300 3400 1900 500 283.333333
7 A 上海 2020-03-01 00:00:00.000000 150 3400 2050 750 266.666667
8 E 北京 2020-03-05 00:00:00.000000 500 3400 2550 500 316.666667
9 F 上海 2020-03-09 00:00:00.000000 250 3400 2800 250 291.666667
10 B 上海 2020-03-21 00:00:00.000000 600 3400 3400 850

需求3:處理重複數據

從上面的數據能夠看出,存在兩條重複的數據**(5,"D","上海","2020-02-05 00:00:00.000000",250),
(5,"D","上海","2020-02-05 12:00:00.000000",300),**顯然須要對其進行清洗去重,保留最新的一條數據,SQL以下:

咱們先進行分組排名,而後保留最新的那條數據便可:

SELECT *
FROM (
    SELECT *,
    row_number() over (partition by order_id order by add_time desc) as rank
    FROM orders
) t
WHERE rank=1

結果輸出

t.order_id t.customer_id t.city t.add_time t.amount t.rank
1 A 上海 2020-01-01 00:00:00.000000 200 1
2 B 上海 2020-01-05 00:00:00.000000 250 1
3 C 北京 2020-01-12 00:00:00.000000 200 1
4 A 上海 2020-02-04 00:00:00.000000 400 1
5 D 上海 2020-02-05 12:00:00.000000 300 1
6 C 北京 2020-02-19 00:00:00.000000 300 1
7 A 上海 2020-03-01 00:00:00.000000 150 1
8 E 北京 2020-03-05 00:00:00.000000 500 1
9 F 上海 2020-03-09 00:00:00.000000 250 1
10 B 上海 2020-03-21 00:00:00.000000 600 1

通過上面的清洗過程,對數據進行了去重。從新計算上面的需求1,正確SQL腳本爲:

WITH
orders_cleaned as (
    SELECT *
    FROM (
        SELECT *,
        row_number() over (partition by order_id order by add_time desc) as rank
        FROM orders
    )t
    WHERE rank=1
)
,monthly_revenue as (
    SELECT
    trunc(add_time,'MM') as month,
    sum(amount) as revenue
    FROM orders_cleaned
    GROUP BY 1
)
,prev_month_revenue as (
    SELECT 
    month,
    revenue,
    lag(revenue) over (order by month) as prev_month_revenue
    FROM monthly_revenue
)
SELECT 
month,
revenue,
round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1) as revenue_growth
FROM prev_month_revenue
ORDER BY 1

結果輸出

month revenue revenue_growth
2020-01-01 650 NULL
2020-02-01 1000 53.8
2020-03-01 1500 50

將清洗後的數據建立成視圖,方便之後使用

CREATE VIEW orders_cleaned AS
SELECT
    order_id, 
    customer_id, 
    city, 
    add_time, 
    amount
FROM (
    SELECT *,
    row_number() over (partition by order_id order by add_time desc) as rank
    FROM orders
)t
WHERE rank=1

需求4:分組取TopN

分組取topN是最長見的SQL窗口函數使用場景,下面的SQL是計算每月份的top2訂單金額,以下:

WITH orders_ranked as (
    SELECT
    trunc(add_time,'MM') as month,
    *,
    row_number() over (partition by trunc(add_time,'MM') order by amount desc, add_time) as rank
    FROM orders_cleaned
)
SELECT 
    month,
    order_id,
    customer_id,
    city,
    add_time,
    amount
FROM orders_ranked
WHERE rank <=2
ORDER BY 1

需求5:重複購買行爲

下面的SQL計算重複購買率:重複購買的人數/總人數*100%以及第一筆訂單金額與第二筆訂單金額之間的典型差額:avg(第二筆訂單金額/第一筆訂單金額)

WITH customer_orders as (
    SELECT *,
    row_number() over (partition by customer_id order by add_time) as customer_order_n,
    lag(amount) over (partition by customer_id order by add_time) as prev_order_amount
    FROM orders_cleaned
)
SELECT
round(100.0*sum(case when customer_order_n=2 then 1 end)/count(distinct customer_id),1) as repeat_purchases,-- 重複購買率
avg(case when customer_order_n=2 then 1.0*amount/prev_order_amount end) as revenue_expansion -- 重複購買較上次購買差別,第一筆訂單金額與第二筆訂單金額之間的典型差額
FROM customer_orders

結果輸出

WITH結果輸出:

orders_cleaned.order_id orders_cleaned.customer_id orders_cleaned.city orders_cleaned.add_time orders_cleaned.amount customer_order_n prev_order_amount
1 A 上海 2020-01-01 00:00:00.000000 200 1 NULL
4 A 上海 2020-02-04 00:00:00.000000 400 2 200
7 A 上海 2020-03-01 00:00:00.000000 150 3 400
2 B 上海 2020-01-05 00:00:00.000000 250 1 NULL
10 B 上海 2020-03-21 00:00:00.000000 600 2 250
3 C 北京 2020-01-12 00:00:00.000000 200 1 NULL
6 C 北京 2020-02-19 00:00:00.000000 300 2 200
5 D 上海 2020-02-05 12:00:00.000000 300 1 NULL
8 E 北京 2020-03-05 00:00:00.000000 500 1 NULL
9 F 上海 2020-03-09 00:00:00.000000 250

最終結果輸出:

repeat_purchases revenue_expansion
50 1.9666666666666668

總結

本文主要分享了SQL窗口函數的基本使用方式以及使用場景,並結合了具體的分析案例。經過本文的分析案例,能夠加深對SQL窗口函數的理解。

公衆號『大數據技術與數倉』,回覆『資料』領取大數據資料包
相關文章
相關標籤/搜索