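DolphinDB is a new-generation, high-performance distributed time-series database that also offers rich data analysis and distributed computing features. This tutorial uses DolphinDB to analyze user behavior data from the Taobao app and, from there, to examine business questions.
Data source: User Behavior Data from Taobao for Recommendation (data set on Alibaba Cloud Tianchi)
In this tutorial, DolphinDB database and the data set are packaged into a Docker image. The image contains the DolphinDB distributed database dfs://user_behavior, which holds a single table, user, with the behavior records of nearly one million Taobao app users between 2017.11.25 and 2017.12.03. The table uses composite partitioning: the first level partitions by date, one partition per day, and the second level is a hash partition on userID, for 180 partitions in total. The queries in this tutorial use the following columns of the user table: userID, itemID, behavior, and behaveTime.
The behavior types have the following meanings:
- pv: view an item detail page
- buy: purchase an item
- cart: add an item to the shopping cart
- fav: add an item to favorites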
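To illustrate the composite partitioning described above, here is a minimal Python sketch (not DolphinDB code) of how a record could map to a partition. It assumes that the 180 partitions come from 9 daily partitions times 20 hash buckets, and it uses a plain modulo as a stand-in for the database's own hash function. The name `partition_of` is hypothetical.

```python
from datetime import date, timedelta

# Assumption: 180 partitions = 9 daily partitions x 20 hash buckets.
HASH_BUCKETS = 20

def partition_of(day, user_id, buckets=HASH_BUCKETS):
    """Return the (date partition, hash bucket) pair for one record.

    The modulo is a stand-in for the database's own hash function.
    """
    return day.strftime("%Y%m%d"), user_id % buckets

# The nine days covered by the data set: 2017-11-25 through 2017-12-03.
first_day = date(2017, 11, 25)
days = [first_day + timedelta(days=i) for i in range(9)]

# Enumerate every partition that 100 consecutive hypothetical user IDs can reach.
partitions = {partition_of(d, u) for d in days for u in range(100)}
print(len(partitions))  # 180
```

Because the date is the first-level key, a query that filters on a single day only needs to touch that day's hash buckets.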
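1. Download the Docker deployment package
DolphinDB and the data used in this tutorial are packaged into a Docker container. Make sure a Docker environment is set up before you start. For Docker installation instructions, see https://docs.docker.com/install/. Download the deployment package from http://www.dolphindb.cn/downloads/bigdata.tar.gz, then run the following commands in the directory that contains it.
Decompress the package: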
gunzip bigdata.tar.gz
Import the container snapshot as an image:
cat bigdata.tar | docker import - my/bigdata:v1
Get the ID of the image my/bigdata:v1:
docker images
Start the container (replace <image id> with the actual image ID):
docker run -dt -p 8888:8848 --name test <image id> /bin/bash ./dolphindb/start.sh
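Enter the host IP address followed by :8888 in the browser address bar, for example localhost:8888, to open DolphinDB Notebook. All of the code below is executed in DolphinDB Notebook.
The DolphinDB license in this Docker image is valid until 2019.09.01. If the license file has expired, download the community edition from the DolphinDB website and replace bigdata.tar/dolphindb/dolphindb.lic with the community edition license.
2. User behavior analysis
Check the amount of data: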
login("admin","123456")
user=loadTable("dfs://user_behavior","user")
select count(*) from user
98914533
The user table contains 98,914,533 records.
Analyze user behavior across the whole path from viewing an item to finally purchasing it:
PV=exec count(*) from user where behavior="pv"
88596903
UV=count(exec distinct userID from user)
987984
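Over these 9 days, the Taobao app had 88,596,903 page views and 987,984 unique visitors.
The exec keyword used above is specific to DolphinDB and is similar to select. The difference is that a select statement always returns a table, whereas exec returns a vector when it selects a single column, a scalar when it is used with an aggregate function, and a matrix when it is used with pivot by, which makes follow-up calculations more convenient.
Count the users who have only one behavior record: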
onceUserNum=count(select count(behavior) from user group by userID having count(behavior)=1)
92
jumpRate=onceUserNum\UV*100
0.009312
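Only 92 users left the app after a single recorded action, which is 0.0093% of all users and practically negligible. This suggests that Taobao is attractive enough to keep users in the app.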
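The PV, UV, and bounce-rate calculations above can be sketched in plain Python. This is an illustration of the logic only, not DolphinDB code, and the sample records are hypothetical. The last line re-checks the reported bounce rate from the two figures given in the text.

```python
from collections import Counter

# Hypothetical sample rows: (userID, behavior).
records = [
    (1, "pv"), (1, "pv"), (1, "cart"),
    (2, "pv"),
    (3, "pv"), (3, "buy"),
]

# PV: number of "pv" records. UV: number of distinct users.
pv = sum(1 for _, behavior in records if behavior == "pv")
uv = len({user_id for user_id, _ in records})

# Users with exactly one behavior record (the "group by ... having" step).
records_per_user = Counter(user_id for user_id, _ in records)
once_users = sum(1 for n in records_per_user.values() if n == 1)
bounce_rate = once_users / uv * 100

# Re-check the figure reported above: 92 such users out of 987,984 visitors.
reported_bounce_rate = round(92 / 987984 * 100, 6)
print(pv, uv, once_users, reported_bounce_rate)
```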
Count the records for each behavior type:
behaviors=select count(*) as num from user group by behavior
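Calculate the conversion rate from viewing to purchase intent:
Adding an item to the shopping cart and adding it to favorites can both be treated as purchase intent. Count the behaviors that show purchase intent: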
fav_cart=exec sum(num) from behaviors where behavior="fav" or behavior="cart"
8318654
intentRate=fav_cart\PV*100
9.389328
The conversion rate from viewing to purchase intent is only 9.39%.
buy=(exec num from behaviors where behavior="buy")[0]
1998976
buyRate=buy\PV*100
2.256259
intent_buy=buy\fav_cart*100
24.030041
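The conversion rate from viewing to final purchase is only 2.26%, while the conversion rate from purchase intent to final purchase is 24.03%. This suggests that most users add items they like to favorites or to the shopping cart but do not necessarily buy them right away.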
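As a worked check of the funnel arithmetic, the following Python sketch recomputes the three rates from the counts reported above. The helper name `pct` is hypothetical; the input numbers are taken from the query results in the text.

```python
# Counts reported by the queries above.
pv = 88_596_903        # page views
fav_cart = 8_318_654   # favorites + shopping cart additions
buy = 1_998_976        # purchases

def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(part / whole * 100, 2)

intent_rate = pct(fav_cart, pv)      # viewing -> purchase intent
buy_rate = pct(buy, pv)              # viewing -> purchase
intent_to_buy = pct(buy, fav_cart)   # purchase intent -> purchase

print(intent_rate, buy_rate, intent_to_buy)  # 9.39 2.26 24.03
```

Note that these are ratios of behavior counts, not of users, so they describe the funnel at the level of individual actions.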
Count the unique visitors for each behavior type:
userNums=select count(userID) as num from (select count(*) from user group by behavior,userID) group by behavior
pay_user_rate=(exec num from userNums where behavior="buy")[0]\UV*100
67.852313
Over these 9 days, 67.85% of Taobao app users made at least one purchase, which shows that most users do shop in the app.
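The nested query above first reduces the data to one row per (behavior, userID) pair and then counts rows per behavior. A minimal Python sketch of the same idea, using hypothetical sample records and a set per behavior to deduplicate users:

```python
from collections import defaultdict

# Hypothetical sample rows: (userID, behavior).
records = [
    (1, "pv"), (1, "pv"), (1, "buy"),
    (2, "pv"), (2, "cart"),
    (3, "pv"), (3, "buy"), (3, "buy"),
    (4, "fav"),
]

# Collect the distinct users seen for each behavior.
users_by_behavior = defaultdict(set)
for user_id, behavior in records:
    users_by_behavior[behavior].add(user_id)

user_nums = {behavior: len(users) for behavior, users in users_by_behavior.items()}
uv = len({user_id for user_id, _ in records})

# Share of all users who bought at least once.
pay_user_rate = user_nums["buy"] / uv * 100
print(user_nums, pay_user_rate)
```

User 3 bought twice but is counted once, which is the difference between this figure and the behavior-level purchase rate.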
Count the records of each behavior type per day:
dailyUserNums=select sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from user group by date(behaveTime) as date
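Traffic to the Taobao app rises noticeably on Fridays, Saturdays, and Sundays (the Saturdays and Sundays in this data set are 2017.11.25, 2017.11.26, 2017.12.02, and 2017.12.03).
iif is DolphinDB's conditional function. Its syntax is iif(cond, trueResult, falseResult), where cond is usually a Boolean expression: if cond is satisfied, trueResult is returned; otherwise falseResult is returned.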
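The sum(iif(...)) pattern turns one behavior column into several count columns. The Python sketch below mimics it with a scalar `iif` helper applied row by row; this helper is a hypothetical stand-in written for this example, and the sample records are made up.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows: (behaveTime, behavior).
records = [
    (datetime(2017, 11, 25, 9, 0, 0), "pv"),
    (datetime(2017, 11, 25, 9, 5, 0), "cart"),
    (datetime(2017, 11, 25, 21, 0, 0), "pv"),
    (datetime(2017, 11, 26, 10, 0, 0), "pv"),
    (datetime(2017, 11, 26, 10, 1, 0), "buy"),
]

def iif(cond, true_result, false_result):
    """Scalar stand-in for a conditional function: pick one of two values."""
    return true_result if cond else false_result

# Behavior code -> output column name, as in the query above.
COLUMNS = {"pv": "pageView", "fav": "favorite", "cart": "shoppingCart", "buy": "payment"}

# One row of counters per calendar day (the "group by date" step).
daily = defaultdict(lambda: dict.fromkeys(COLUMNS.values(), 0))
for ts, behavior in records:
    row = daily[ts.date()]
    for code, column in COLUMNS.items():
        row[column] += iif(behavior == code, 1, 0)

for day, row in sorted(daily.items()):
    print(day, row)
```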
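Next, count each behavior type in different time periods of a day. Two methods are shown here.
The first method computes each time period separately and then merges the results. For example, count user behaviors in each time period of the weekday 2017.11.29 (Wednesday).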
re1=select first(behaveTime) as time, sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from user where behaveTime between 2017.11.29T00:00:00 : 2017.11.29T05:59:59
re2=select first(behaveTime) as time, sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from user where behaveTime between 2017.11.29T06:00:00 : 2017.11.29T08:59:59
re3=select first(behaveTime) as time, sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from user where behaveTime between 2017.11.29T09:00:00 : 2017.11.29T11:59:59
re4=select first(behaveTime) as time, sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from user where behaveTime between 2017.11.29T12:00:00 : 2017.11.29T13:59:59
re5=select first(behaveTime) as time, sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from user where behaveTime between 2017.11.29T14:00:00 : 2017.11.29T17:59:59
re6=select first(behaveTime) as time, sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from user where behaveTime between 2017.11.29T18:00:00 : 2017.11.29T21:59:59
re7=select first(behaveTime) as time, sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from user where behaveTime between 2017.11.29T22:00:00 : 2017.11.29T23:59:59
re=unionAll([re1,re2,re3,re4,re5,re6,re7],false)
This method is straightforward but requires a lot of repeated code. The repeated code can, of course, be wrapped in a function.
def calculateBehavior(startTime,endTime){
    return select first(behaveTime) as time, sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from user where behaveTime between startTime : endTime
}
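With this function, only the start and end time of each period need to be specified.
The other method uses DolphinDB's Map-Reduce framework. For example, count user behaviors on the weekday 2017.11.29 (Wednesday).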
def caculate(t){
    return select first(behaveTime) as time, sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from t
}
ds1 = repartitionDS(<select * from user>, `behaveTime, RANGE,2017.11.29T00:00:00 2017.11.29T06:00:00 2017.11.29T09:00:00 2017.11.29T12:00:00 2017.11.29T14:00:00 2017.11.29T18:00:00 2017.11.29T22:00:00 2017.11.29T23:59:59)
WedBehavior = mr(ds1, caculate, , unionAll{, false})
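The repartitionDS function repartitions the user table by time range (without changing the table's original partitioning scheme) and produces multiple data sources. The mr function then processes these data sources in parallel: DolphinDB applies the caculate function to each data source and merges the individual results.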
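The repartition-then-map-reduce flow described above can be sketched in plain Python. This is an analogy under stated assumptions, not the DolphinDB API: the helpers `repartition` and `calculate` are hypothetical, the sample records are made up, the ranges are assumed to be half-open (left boundary included, right boundary excluded), and the sketch runs sequentially rather than in parallel.

```python
from bisect import bisect_right
from datetime import datetime
from functools import reduce

# Same boundaries as the 2017.11.29 example above.
BOUNDARIES = [datetime(2017, 11, 29, h) for h in (0, 6, 9, 12, 14, 18, 22)]
BOUNDARIES.append(datetime(2017, 11, 29, 23, 59, 59))

def repartition(records, boundaries):
    """Split records into ranges [b[i], b[i+1]); rows outside all ranges are dropped."""
    buckets = [[] for _ in range(len(boundaries) - 1)]
    for rec in records:
        i = bisect_right(boundaries, rec[0]) - 1
        if 0 <= i < len(buckets):
            buckets[i].append(rec)
    return buckets

def calculate(bucket):
    """Map step: count each behavior type within one time range."""
    counts = {"pv": 0, "fav": 0, "cart": 0, "buy": 0}
    for _, behavior in bucket:
        counts[behavior] += 1
    return counts

# Hypothetical sample rows: (behaveTime, behavior).
records = [
    (datetime(2017, 11, 29, 1, 30, 0), "pv"),
    (datetime(2017, 11, 29, 5, 59, 59), "pv"),
    (datetime(2017, 11, 29, 6, 0, 0), "cart"),
    (datetime(2017, 11, 29, 15, 0, 0), "buy"),
    (datetime(2017, 11, 29, 23, 59, 59), "pv"),  # equals the last boundary, so excluded
]

sources = repartition(records, BOUNDARIES)
mapped = [calculate(bucket) for bucket in sources]          # map
results = reduce(lambda acc, row: acc + [row], mapped, [])  # reduce: append rows
for row in results:
    print(row)
```

Under the half-open assumption, a record stamped exactly 23:59:59 falls outside the last range, which is worth keeping in mind when choosing the final boundary.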
On weekdays, usage of the Taobao app is highest in the early morning (00:00 to 06:00), followed by the afternoon (14:00 to 18:00).
Count user behaviors on Saturday (2017.11.25) and Sunday (2017.11.26):
ds2 = repartitionDS(<select * from user>, `behaveTime, RANGE,2017.11.25T00:00:00 2017.11.25T06:00:00 2017.11.25T09:00:00 2017.11.25T12:00:00 2017.11.25T14:00:00 2017.11.25T18:00:00 2017.11.25T22:00:00 2017.11.25T23:59:59)
SatBehavior = mr(ds2, caculate, , unionAll{, false})
ds3 = repartitionDS(<select * from user>, `behaveTime, RANGE,2017.11.26T00:00:00 2017.11.26T06:00:00 2017.11.26T09:00:00 2017.11.26T12:00:00 2017.11.26T14:00:00 2017.11.26T18:00:00 2017.11.26T22:00:00 2017.11.26T23:59:59)
SunBehavior = mr(ds3, caculate, , unionAll{, false})
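On Saturday and Sunday, usage of the Taobao app is higher than on weekdays in every time period. As on weekdays, the weekend usage peak is in the early morning (00:00 to 06:00).
3. Item analysis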
allItems=select distinct(itemID) from user
4142583
These 9 days involve 4,142,583 distinct items in total.
Count the number of purchases of each item:
itemBuyTimes=select count(userID) as times from user where behavior="buy" group by itemID order by times desc
Find the 20 best-selling items:
salesTop=select top 20 * from itemBuyTimes order by times desc
The item with ID 3122135 has the highest sales, with 1,408 purchases in total.
Count the number of items at each purchase count:
buyTimesItemNum=select count(itemID) as itemNums from itemBuyTimes group by times order by itemNums desc
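The results show that the largest group of purchased items (370,747 items) was bought only once in these 9 days, which is 8.95% of all items. The higher the purchase count, the fewer items reach it.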
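The two-step grouping above (purchases per item, then items per purchase count) amounts to a "count of counts". A minimal Python sketch with hypothetical item IDs; the last line re-checks the share reported in the text from the two figures it gives.

```python
from collections import Counter

# Hypothetical itemID of each "buy" record.
purchases = [101, 101, 101, 102, 103, 103, 104, 105]

# Step 1: purchases per item (like itemBuyTimes).
item_buy_times = Counter(purchases)

# Step 2: number of items at each purchase count (like buyTimesItemNum).
buy_times_item_num = Counter(item_buy_times.values())
print(buy_times_item_num.most_common())  # [(1, 3), (3, 1), (2, 1)]

# Re-check the reported share: 370,747 once-bought items out of 4,142,583 items.
reported_share = round(370_747 / 4_142_583 * 100, 2)
print(reported_share)  # 8.95
```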
Count the records of each behavior type for every item:
allItemsInfo=select sum(iif(behavior=="pv",1,0)) as pageView, sum(iif(behavior=="fav",1,0)) as favorite, sum(iif(behavior=="cart",1,0)) as shoppingCart, sum(iif(behavior=="buy",1,0)) as payment from user group by itemID
Find the 20 most-viewed items:
pvTop=select top 20 itemID,pageView from allItemsInfo order by pageView desc
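The most-viewed item has ID 812879, with 29,720 views, but it was purchased only 135 times and does not make the top 20 by sales.
Show the count of each behavior type for the 20 best-selling items: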
select * from ej(salesTop,allItemsInfo,`itemID) order by times desc
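The best-selling item, 3122135, has 1,777 views and does not make the top 20 by views, yet its conversion rate from viewing to purchase is as high as 79.2%. It may be an everyday necessity that users decide to buy without much browsing.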
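To make the comparison concrete, the Python sketch below mimics the equi-join on itemID and computes the view-to-purchase conversion for the two items discussed above. Only the view and purchase counts quoted in the text are used; the dictionary layout and the `conversion` helper are hypothetical.

```python
# Top sellers as (itemID, times); only the item quoted in the text is listed.
sales_top = [(3122135, 1408)]

# Per-item behavior counts quoted in the text (favorite/cart counts are not given).
item_stats = {
    3122135: {"pageView": 1777, "payment": 1408},
    812879: {"pageView": 29720, "payment": 135},
}

# Equi-join on itemID: keep top sellers that have a matching stats row.
joined = [
    {"itemID": item_id, "times": times, **item_stats[item_id]}
    for item_id, times in sales_top
    if item_id in item_stats
]

def conversion(stats):
    """View-to-purchase conversion as a percentage, rounded to two decimals."""
    return round(stats["payment"] / stats["pageView"] * 100, 2)

print(joined)
print(conversion(item_stats[3122135]))  # 79.23
print(conversion(item_stats[812879]))   # 0.45
```

The most-viewed item converts at well under 1%, while the best seller converts at about 79%, which is the contrast the paragraph above describes.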
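Further exercises:
(1) Calculate the hourly purchase rate of the Taobao app on 2017.11.25 (purchase rate = number of purchases / total number of behaviors * 100%)
(2) Find the user with the most purchases and the item that user bought most often
(3) Calculate the number of purchases of the item with ID 3122135 in each time period
(4) Count the records of each behavior type in each category
(5) Find the best-selling item in each category
This tutorial is for learning purposes only.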