HI GUIDES (皇包車) is a platform that provides Chinese-language chartered-car tour services worldwide for Chinese outbound travelers. It has 100,000 overseas Chinese driver-guides (司導), covering more than 90 countries, over 1,600 cities, and over 300 international airports. As of June 2017, it had served a cumulative 4 million Chinese outbound travelers.
As consumers' purchasing power grows and travel information becomes more transparent, traveler behavior is getting harder to predict, and the fixed-itinerary model of traditional travel agencies no longer meets travelers' needs. Offering more popular and better-fitting chartered-tour routes calls for big data: combining each user's personal preferences, attraction popularity, weather, traffic, and other dimensions to build a suite of travel-information solutions.
Competition page: https://www.dcjingsai.com/com...
HI GUIDES provides more than 50,000 records of users' in-app browsing behavior. Some of these users placed an order after browsing and enjoyed the premium travel service, while others did not.
Contestants must analyze users' profile information and browsing behavior to predict whether a user will purchase the premium travel service in the near future.
```python
import pandas as pd
import numpy as np
from sklearn import preprocessing
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

# Display Chinese labels correctly in matplotlib
plt.rcParams['font.sans-serif'] = [u'SimHei']
plt.rcParams['axes.unicode_minus'] = False

# Load the five training tables and their test counterparts
user_train = pd.read_csv(r'Data\trainingset\userProfile_train.csv')
action_train = pd.read_csv(r'Data\trainingset\action_train.csv')
comment_train = pd.read_csv(r'Data\trainingset\userComment_train.csv')
orderFuture_train = pd.read_csv(r'Data\trainingset\orderFuture_train.csv')
orderHistory_train = pd.read_csv(r'Data\trainingset\orderHistory_train.csv')
user_test = pd.read_csv(r'Data\test\userProfile_test.csv')
action_test = pd.read_csv(r'Data\test\action_test.csv')
comment_test = pd.read_csv(r'Data\test\userComment_test.csv')
orderFuture_test = pd.read_csv(r'Data\test\orderFuture_test.csv')
orderHistory_test = pd.read_csv(r'Data\test\orderHistory_test.csv')

# Concatenate train and test for joint exploration
user = pd.concat([user_train, user_test])
action = pd.concat([action_train, action_test])
comment = pd.concat([comment_train, comment_test])
orderHistory = pd.concat([orderHistory_train, orderHistory_test])
orderFuture = pd.concat([orderFuture_train, orderFuture_test])
```
Understanding the data is the foundation of analysis and modeling. There are five tables: user profiles (user), user comments (comment), user actions (action), historical orders (orderHistory), and future orders (orderFuture). Previews of each table follow.
```python
user.head()
```

|   | userid | gender | province | age |
|---|---|---|---|---|
| 0 | 100000000013 | 男 | NaN | 60後 |
| 1 | 100000000111 | NaN | 上海 | NaN |
| 2 | 100000000127 | NaN | 上海 | NaN |
| 3 | 100000000231 | 男 | 北京 | 70後 |
| 4 | 100000000379 | 男 | 北京 | NaN |
```python
action.head()
```

|   | userid | actionType | actionTime |
|---|---|---|---|
| 0 | 100000000013 | 1 | 1474300753 |
| 1 | 100000000013 | 5 | 1474300763 |
| 2 | 100000000013 | 6 | 1474300874 |
| 3 | 100000000013 | 5 | 1474300911 |
| 4 | 100000000013 | 6 | 1474300936 |
```python
orderHistory.head()
```

|   | userid | orderid | orderTime | orderType | city | country | continent |
|---|---|---|---|---|---|---|---|
| 0 | 100000000013 | 1000015 | 1481714516 | 0 | 柏林 | 德國 | 歐洲 |
| 1 | 100000000013 | 1000014 | 1501959643 | 0 | 舊金山 | 美國 | 北美洲 |
| 2 | 100000000393 | 1000033 | 1499440296 | 0 | 巴黎 | 法國 | 歐洲 |
| 3 | 100000000459 | 1000036 | 1480601668 | 0 | 紐約 | 美國 | 北美洲 |
| 4 | 100000000459 | 1000034 | 1479146723 | 0 | 巴厘島 | 印度尼西亞 | 亞洲 |
```python
orderFuture.head()
```

|   | orderType | userid |
|---|---|---|
| 0 | 0.0 | 100000000013 |
| 1 | 0.0 | 100000000111 |
| 2 | 0.0 | 100000000127 |
| 3 | 0.0 | 100000000231 |
| 4 | 0.0 | 100000000379 |
```python
comment.head()
```

|   | userid | orderid | rating | tags | commentsKeyWords |
|---|---|---|---|---|---|
| 0 | 100000000013 | 1000015 | 4.0 | NaN | ['很','簡陋','太','隨便'] |
| 1 | 100000000231 | 1000024 | 5.0 | 提早聯繫\|耐心等候 | ['很','細心'] |
| 2 | 100000000471 | 1000038 | 5.0 | NaN | NaN |
| 3 | 100000000637 | 1000040 | 5.0 | 主動熱情\|提早聯繫\|舉牌迎接\|主動搬運行李 | NaN |
| 4 | 100000000755 | 1000045 | 1.0 | 未舉牌服務 | NaN |
The user profile table covers 40,307 training users (plus 10,076 test users, 50,383 rows after concatenation). userid is the unique key, and missing values are severe, especially for gender and age.
```python
user.info()
```

```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50383 entries, 0 to 10075
Data columns (total 4 columns):
userid      50383 non-null int64
gender      19769 non-null object
province    45484 non-null object
age          5961 non-null object
dtypes: int64(1), object(3)
memory usage: 1.9+ MB
```
```python
# Share of the five most common provinces among all users
(user.province.value_counts() / user.province.value_counts().sum()).head().sum()
```

```
0.7712162518687891
```
Users come mainly from developed regions: Beijing, Shanghai, Guangdong, Jiangsu, and Zhejiang together account for 77% of all users.
```python
fig, axes = plt.subplots(figsize=(20, 10))
sns.countplot(x='province', data=user, order=user.province.value_counts().index.tolist())
```
Gender is available for 15,760 users: 54.7% female and 45.3% male.
```python
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
user.gender.value_counts().plot.bar(ax=axes[0])
axes[0].set_xticklabels(['女', '男'], rotation=0)
user.gender.value_counts().plot.pie(ax=axes[1], autopct='%.2f%%')
```
Age is available for 4,742 users, dominated by those born in the 1960s through the 1990s (60後 to 90後).
```python
fig, axes = plt.subplots(figsize=(10, 4))
user.age.value_counts().plot.bar()
plt.xticks(rotation=0)
```
Although female users outnumber male users overall, men outnumber women among the users who provided an age. It seems that even anonymously, women are reluctant to reveal their age.
```python
fig, axes = plt.subplots(figsize=(10, 4))
sns.countplot(x='age', data=user, hue='gender')
```
There are 9 action types in total: 1 is waking (opening) the app; 2-4 are product browsing, with no inherent order among them; 5-9 are sequential, running from filling in the order form to submitting the order and finally paying.
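For reference, these semantics can be written down as a mapping. Note this is only a sketch: the grouping of 1, 2-4, and 5-9 comes from the description above, but the finer stage names for 5-9 are my assumption, not an official data dictionary.

```python
# Assumed stage labels for the 9 action types; only the 1 / 2-4 / 5-9
# grouping is given by the competition, the finer 5-9 labels are guesses.
ACTION_STAGE = {1: 'open_app',
                2: 'browse', 3: 'browse', 4: 'browse',
                5: 'fill_form', 6: 'fill_form', 7: 'fill_form',
                8: 'submit_order', 9: 'pay'}
action['stage'] = action['actionType'].map(ACTION_STAGE)

# Distinct users reaching each stage, a rough conversion funnel
action.groupby('stage').userid.nunique()
```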
```python
import time

def time_convert(timestamp):
    """Convert a Unix timestamp to a 'YYYY-MM-DD HH:MM:SS' string (local time)."""
    return time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(timestamp))

action.actionTime = action.actionTime.map(lambda x: time_convert(x))
```
```python
action['year'] = action.actionTime.str[:4]
action['month'] = action.actionTime.str[5:7]
action['day'] = action.actionTime.str[8:10]
action['date'] = action.actionTime.str[:10]
action['time'] = action.actionTime.str[11:]
action['year_month'] = action.actionTime.str[:7]
action['hour'] = action.actionTime.str[11:13]
```
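As an aside, the same fields can be derived without string slicing via pandas' datetime accessors. A sketch on the raw training timestamps; note that `pd.to_datetime(..., unit='s')` yields UTC-based times, while `time.localtime` above uses the machine's local timezone, so the two can differ by the timezone offset:

```python
# Alternative: derive the calendar fields directly from the Unix timestamps.
dt = pd.to_datetime(action_train['actionTime'], unit='s')  # UTC-based
fields = pd.DataFrame({'date': dt.dt.strftime('%Y-%m-%d'),
                       'year_month': dt.dt.strftime('%Y-%m'),
                       'hour': dt.dt.strftime('%H')})
```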
MAU is measured as the number of distinct user IDs with any action in a month; PV is measured as the number of app wakes (action type 1). User activity peaks twice, in April-May and in October; the holiday seasons are the preferred time for outbound travel.
```python
fig, axes = plt.subplots(2, 1, figsize=(10, 10))
# Distinct users active in each month (dedupe per user per month)
action[action['year_month'] != '2016-08'].drop_duplicates(['userid', 'year_month']) \
    .groupby('year_month').userid.count().plot(ax=axes[0])
axes[0].set_title('獨立用戶月訪問量(MAU)')
# App wakes (action type 1) per month
action[action.actionType == 1].groupby('year_month').userid.count().plot(ax=axes[1])
axes[1].set_title('用戶月訪問量(PV)')
```
DAU is the number of distinct user IDs with activity in a day; daily PV is the count of type-1 actions that day. DAU peaks in early April, but PV over the same period is low relative to its own early-May peak, which means the average number of app wakes per user was low in early April, possibly because a user-acquisition campaign was running. Dividing the two metrics verifies this guess. Likewise, PV/DAU is relatively high before December 2016 and flattens out afterwards, suggesting the app entered a healthy, steady phase.
```python
fig, axes = plt.subplots(2, 1, figsize=(10, 10))
# Distinct users active on each date (dedupe per user per day)
action.drop_duplicates(['userid', 'date']).groupby('date').userid.count().plot(ax=axes[0])
axes[0].set_title('獨立用戶日訪問量(DAU)')
action[action['actionType'] == 1].groupby('date').userid.count().plot(ax=axes[1])
axes[1].set_title('用戶日訪問量(PV)')
```
```python
fig, axes = plt.subplots(figsize=(10, 5))
# PV / DAU: average number of app wakes per active user per day
(action[action['actionType'] == 1].groupby('date').userid.count()
 / action.drop_duplicates(['userid', 'date']).groupby('date').userid.count()).plot()
```
Click volume by hour has a very odd shape: traffic is low during the daytime (8:00 to 16:00), bottoming out around noon. This may be caused by missing data or a timezone error.
```python
fig, axes = plt.subplots(2, 1, figsize=(10, 10))
# Distinct users active in each hour of day
action.drop_duplicates(['userid', 'hour']).groupby('hour').userid.count().plot(ax=axes[0])
axes[0].set_title('獨立用戶小時訪問量(HAU)')
action[action['actionType'] == 1].groupby('hour').userid.count().plot(ax=axes[1])
axes[1].set_title('用戶日小時訪問量(PV)')
```
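One quick way to test the timezone hypothesis is to re-interpret the raw timestamps as UTC, shift them to China Standard Time, and re-plot. A sketch, assuming most users are in UTC+8:

```python
# If the midday dip moves to late night after shifting to UTC+8,
# the odd shape was a timezone artifact rather than missing data.
ts = pd.to_datetime(action_train['actionTime'], unit='s', utc=True)
hour_cst = ts.dt.tz_convert('Asia/Shanghai').dt.hour
hour_cst.value_counts().sort_index().plot(kind='bar')
```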
```python
# Collapse the browsing actions (2-4) into a single type
def vis_type(x):
    return 2 if x in [2, 3, 4] else x

action['visitor_type'] = action['actionType'].map(vis_type)
```
```python
fig, axes = plt.subplots(figsize=(10, 5))
diff_visitor = action.groupby(['hour', 'visitor_type']).userid.count().unstack()
plt.plot(diff_visitor)
plt.title('用戶日小時訪問量(PV)')
```
The conversion from waking the app (1) to browsing (2) looks normal, but the number of form fills (5) far exceeds that of actions 1 and 2; that is, a large share of forms are filled without the app being woken first, perhaps because users enter the form page through other channels, or because the data is badly incomplete. Similarly, the count of action 7 is smaller than that of action 8, again pointing to severe data loss.
```python
from pyecharts import options as opts
from pyecharts.charts import Funnel

df = action.groupby('visitor_type', as_index=False).userid.count().values.tolist()

def funnel_base() -> Funnel:
    c = (
        Funnel()
        .add("訪問量", df)
        .set_global_opts(title_opts=opts.TitleOpts(title="訪問轉化"))
    )
    return c

funnel_base().render_notebook()
```
The comment table has 9,863 rows. rating has no missing values, with a mean of 4.91; five-star reviews dominate.
```python
comment.rating.mean()
```

```
4.916672610845424
```
```python
from pyecharts.charts import Bar

bar = Bar()
bar.add_xaxis(comment.rating.value_counts().index.tolist())
bar.add_yaxis("評分", comment.rating.value_counts().values.tolist())
bar.render_notebook()
```
Taking a rating of 4 as the boundary between positive and negative reviews, word clouds of the tags are drawn as follows:
```python
import wordcloud

# Count tag occurrences in positive reviews (rating >= 4)
tags_count = comment[comment.rating >= 4].tags.str.split("|").dropna().apply(pd.value_counts).sum()

path = r'C:\Windows\Fonts\simhei.ttf'  # a font that supports Chinese
w = wordcloud.WordCloud(font_path=path, width=1400, height=1400, margin=2)
w.fit_words(tags_count)
plt.figure(dpi=1000)
plt.imshow(w)
plt.axis('off')
```
```python
# Same word cloud for negative reviews (rating < 4)
tags_count = comment[comment.rating < 4].tags.str.split("|").dropna().apply(pd.value_counts).sum()
w = wordcloud.WordCloud(font_path=path, width=1400, height=1400, margin=2)
w.fit_words(tags_count)
plt.figure(dpi=500)
plt.imshow(w)
plt.axis('off')
```
Comment keywords are split at the same 4-point boundary, with a word cloud for each side.
```python
Keyword_count = comment[comment['rating'] >= 4].commentsKeyWords.dropna() \
    .str[1:-1].str.split(',').apply(pd.value_counts).sum()
w = wordcloud.WordCloud(font_path=path, width=1400, height=1400, margin=2)
w.fit_words(Keyword_count)
plt.figure(dpi=1000)
plt.imshow(w)
plt.axis('off')
```
```python
Keyword_count = comment[comment['rating'] < 4].commentsKeyWords.dropna() \
    .str[1:-1].str.split(',').apply(pd.value_counts).sum()
w = wordcloud.WordCloud(font_path=path, width=1400, height=1400, margin=2)
w.fit_words(Keyword_count)
plt.figure(dpi=1000)
plt.imshow(w)
plt.axis('off')
```
This table describes users' historical orders, with 7 columns: user id, order id, order time, order type, travel city, country, and continent. An orderType of 1 means a premium travel service was purchased; 0 means an ordinary service.
The order data has 20,653 rows covering 10,637 users. The repurchase distribution is shown below:
```python
from pyecharts.charts import Pie

# Number of users by historical order count, bucketing 8+ orders together
order_number = orderHistory.groupby(['userid'], as_index=False).orderid.count() \
    .groupby('orderid', as_index=False).userid.count() \
    .rename(columns={'orderid': 'order_quantity', 'userid': 'count'})
order_number = pd.concat([order_number[:8],
                          pd.DataFrame([{'order_quantity': '8次以上',
                                         'count': order_number[8:]['count'].sum()}])])

def pie_base() -> Pie:
    c = (
        Pie()
        .add('', order_number.values.tolist())
        .set_global_opts(title_opts=opts.TitleOpts(title="全部服務用戶復購圖"))
        .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
    )
    return c

pie_base().render_notebook()
```
```python
# Same distribution restricted to premium-service orders
order_number = orderHistory[orderHistory.orderType == 1].groupby(['userid'], as_index=False).orderid.count() \
    .groupby('orderid', as_index=False).userid.count() \
    .rename(columns={'orderid': 'order_quantity', 'userid': 'count'})
order_number = pd.concat([order_number[:8],
                          pd.DataFrame([{'order_quantity': '8次以上',
                                         'count': order_number[8:]['count'].sum()}])])

def pie_base() -> Pie:
    c = (
        Pie()
        .add('', order_number.values.tolist())
        .set_global_opts(title_opts=opts.TitleOpts(title="精品服務用戶復購圖"))
        .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
    )
    return c

pie_base().render_notebook()
```
The comparison shows that the repurchase distribution for premium services is similar to the overall distribution.
```python
orderHistory.orderTime = orderHistory.orderTime.map(lambda x: time_convert(x))
orderHistory['year'] = orderHistory.orderTime.str[:4]
orderHistory['month'] = orderHistory.orderTime.str[5:7]
orderHistory['day'] = orderHistory.orderTime.str[8:10]
orderHistory['date'] = orderHistory.orderTime.str[:10]
orderHistory['time'] = orderHistory.orderTime.str[11:]
orderHistory['year_month'] = orderHistory.orderTime.str[:7]
orderHistory['hour'] = orderHistory.orderTime.str[11:13]
```
```python
jingpin_top10 = orderHistory[orderHistory.orderType == 1].city.value_counts()[:10]
bar = Bar()
bar.add_xaxis(jingpin_top10.index.tolist())
bar.add_yaxis("精品遊十大熱門城市", jingpin_top10.values.tolist())
bar.render_notebook()
```
```python
from pyecharts.globals import ThemeType

putong_top10 = orderHistory[orderHistory.orderType == 0].city.value_counts()[:10]
bar = Bar({"theme": ThemeType.ESSOS})
bar.add_xaxis(putong_top10.index.tolist())
bar.add_yaxis("普通遊十大熱門城市", putong_top10.values.tolist())
bar.render_notebook()
```
```python
continent_jingpin = orderHistory[orderHistory['orderType'] == 1].groupby(['continent'], as_index=False).orderid.count()
continent_putong = orderHistory[orderHistory['orderType'] == 0].groupby(['continent'], as_index=False).orderid.count()
# Ordinary orders have no South America rows; add a zero row so the two series align
continent_putong = pd.concat([continent_putong,
                              pd.DataFrame([{'continent': '南美洲', 'orderid': 0}])]).sort_values('continent')
bar = Bar()
bar.add_xaxis(continent_jingpin.continent.tolist())
bar.add_yaxis("精品遊大陸分佈", continent_jingpin.orderid.values.tolist())
bar.add_yaxis("普通遊大陸分佈", continent_putong.orderid.values.tolist())
bar.render_notebook()
```
```python
country_boutique = orderHistory[orderHistory['orderType'] == 1].groupby('country').country.count().sort_values(ascending=False)[:10]
country_ordinary = orderHistory[orderHistory['orderType'] == 0].groupby('country').country.count().sort_values(ascending=False)[:10]
bar = Bar()
bar.add_xaxis(country_boutique.index.tolist())
bar.add_yaxis("精品遊十大熱門國家", country_boutique.values.tolist())
bar.render_notebook()
```
```python
def bar_base_dict_config() -> Bar:
    c = (
        Bar({"theme": ThemeType.MACARONS})
        .add_xaxis(country_ordinary.index.tolist())
        .add_yaxis("普通遊十大熱門國家", country_ordinary.values.tolist())
    )
    return c

bar_base_dict_config().render_notebook()
```
Feature engineering covers all processing of features: extraction, combination, standardization, selection, binning of continuous features, and dummy encoding of categorical ones. Here only some features are extracted, and no feature selection is performed.
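For concreteness, binning and dummy encoding look like this on a toy frame (the column names here are hypothetical, not taken from the competition data):

```python
# Toy example: quartile-bin a continuous feature, then dummy-encode it
# together with a categorical column.
demo = pd.DataFrame({'action_count': [3, 17, 42, 8, 25],
                     'province': ['北京', '上海', '北京', '廣東', '上海']})
demo['action_count_bin'] = pd.qcut(demo['action_count'], q=4, duplicates='drop')
demo = pd.get_dummies(demo, columns=['action_count_bin', 'province'])
```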
```python
import pandas as pd
import numpy as np
from sklearn import preprocessing
import warnings
warnings.filterwarnings('ignore')

# Reload the raw tables (same as in the exploration phase)
user_train = pd.read_csv(r'Data\trainingset\userProfile_train.csv')
action_train = pd.read_csv(r'Data\trainingset\action_train.csv')
comment_train = pd.read_csv(r'Data\trainingset\userComment_train.csv')
orderFuture_train = pd.read_csv(r'Data\trainingset\orderFuture_train.csv')
orderHistory_train = pd.read_csv(r'Data\trainingset\orderHistory_train.csv')
user_test = pd.read_csv(r'Data\test\userProfile_test.csv')
action_test = pd.read_csv(r'Data\test\action_test.csv')
comment_test = pd.read_csv(r'Data\test\userComment_test.csv')
orderFuture_test = pd.read_csv(r'Data\test\orderFuture_test.csv')
orderHistory_test = pd.read_csv(r'Data\test\orderHistory_test.csv')

user = pd.concat([user_train, user_test])
action = pd.concat([action_train, action_test])
comment = pd.concat([comment_train, comment_test])
orderHistory = pd.concat([orderHistory_train, orderHistory_test])
orderFuture = pd.concat([orderFuture_train, orderFuture_test])

orderHistory = orderHistory.sort_values(by=['userid', 'orderTime'])

# Per-user order count and order-timestamp statistics
orderHistory_internal_table = orderHistory.groupby('userid').orderTime \
    .agg(['count', 'max', 'min', 'std', 'mean']).reset_index() \
    .rename(columns={'count': 'order_count', 'max': 'ordertime_max',
                     'min': 'ordertime_min', 'std': 'orderTime_std',
                     'mean': 'ordertime_mean'}).fillna(0)

# Counts of ordinary and premium orders in each user's history
orderHistory_internal_table = orderHistory_internal_table.merge(
    orderHistory[orderHistory['orderType'] == 0].groupby('userid').orderid.count()
    .reset_index().rename(columns={'orderid': 'ordinary_count'}), how='left', on='userid')
orderHistory_internal_table = orderHistory_internal_table.merge(
    orderHistory[orderHistory['orderType'] == 1].groupby('userid').orderid.count()
    .reset_index().rename(columns={'orderid': 'unordinary_count'}), how='left', on='userid')

# How many times each country / continent / city was visited
orderHistory_internal_table = orderHistory_internal_table.merge(
    pd.get_dummies(orderHistory[['userid', 'country', 'continent', 'city']])
    .groupby('userid', as_index=False).sum(), on='userid', how='left')

# Information about each user's most recent trip
orderHistory_internal_table = orderHistory_internal_table.merge(
    pd.get_dummies(orderHistory.groupby('userid', as_index=False).apply(lambda x: x.iloc[-1])
                   [['userid', 'orderType', 'city', 'country', 'continent']]),
    on='userid', how='left')

data = orderFuture.copy()               # orderFuture is the base table
data = data.merge(user)                 # join user profiles
data = data.merge(comment, how='left')  # join comments
data['tags'] = data.tags.apply(lambda x: 0 if pd.isnull(x) else 1)  # has tags or not
data['commentsKeyWords'] = data.commentsKeyWords.apply(lambda x: 0 if pd.isnull(x) else 1)  # has keywords or not
del data['orderid']                     # drop the orderid column

action = action.sort_values(by=['userid', 'actionTime'])  # sort by userid, actionTime

# Per-user action count and action-timestamp statistics
action_internal_table = action.groupby('userid').actionTime \
    .agg(['count', 'max', 'min', 'std', 'mean']).reset_index() \
    .rename(columns={'count': 'action_count', 'max': 'time_last_action',
                     'min': 'time_first_action', 'std': 'actiontime_std',
                     'mean': 'actiontime_mean'})

# (The ratio of actions 2-4 to 5-9 was planned but not implemented; see the sketch below.)

# The types of each user's last 1-20 actions
for i in range(20):
    action_internal_table = action_internal_table.merge(
        action.groupby('userid').actionType
        .apply(lambda x: x.iloc[-i-1] if len(x) > i else np.nan)
        .reset_index().rename(columns={'actionType': 'last_but{}_action_type'.format(i)}),
        how='left')

# The share of each action type per user
count = action.groupby('userid').actionType.count()
for i in range(1, 10):
    action_internal_table = action_internal_table.merge(
        (action[action['actionType'] == i].groupby('userid').actionType.count() / count)
        .reset_index().rename(columns={'actionType': 'rate_{}'.format(i)}).fillna(0),
        on='userid', how='left')

# The timestamps of each user's last 1-20 actions
for i in range(20):
    action_internal_table = action_internal_table.merge(
        action.groupby('userid').actionTime
        .apply(lambda x: x.iloc[-i-1] if len(x) > i else np.nan)
        .reset_index().rename(columns={'actionTime': 'last_but{}_action_time'.format(i)}),
        how='left')

data = data.merge(action_internal_table, on='userid', how='left')
data = data.merge(orderHistory_internal_table, on='userid', how='left')
data = data.fillna(-999)
data = pd.get_dummies(data)  # dummy-encode the remaining categorical variables
```
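The dangling comment in the block above (the ratio of browsing actions 2-4 to ordering actions 5-9) was never implemented. A sketch of how it could be added, not part of the original feature set:

```python
# Per-user ratio of browsing actions (2-4) to ordering actions (5-9);
# +1 in the denominator guards against users with no ordering actions.
browse_cnt = action[action['actionType'].between(2, 4)].groupby('userid').size()
order_cnt = action[action['actionType'].between(5, 9)].groupby('userid').size()
denom = order_cnt.reindex(browse_cnt.index).fillna(0) + 1
browse_order_ratio = (browse_cnt / denom).rename('browse_order_ratio')
# This would be merged into action_internal_table before building `data`:
# action_internal_table = action_internal_table.merge(
#     browse_order_ratio.reset_index(), on='userid', how='left')
```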
The model is xgboost, which does not require feature standardization.
```python
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score

# Split the labeled data into training and validation sets
X_trainval = data[data['userid'].isin(orderFuture_train.userid.tolist())].iloc[:, 2:]
y_trainval = data[data['userid'].isin(orderFuture_train.userid.tolist())].iloc[:, 0]
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval,
                                                  random_state=88, stratify=y_trainval)

# Build and train the xgboost classifier
xgb_cla = xgb.XGBClassifier(learning_rate=0.1,
                            n_estimators=1000,
                            max_depth=3,
                            min_child_weight=5,
                            gamma=0,
                            subsample=0.8,
                            colsample_bytree=0.8,
                            eta=0.05,
                            silent=1,
                            objective='binary:logistic',
                            scale_pos_weight=1).fit(X_train, y_train)

# Validation AUC
roc_auc_score(y_val, xgb_cla.predict_proba(X_val)[:, 1])

# Predict on the test users and write the submission file
X_test = data[data['userid'].isin(orderFuture_test.userid.tolist())].iloc[:, 2:]
predict = xgb_cla.predict_proba(X_test)[:, 1]
orderFuture_test['orderType'] = predict
orderFuture_test.to_csv('submission.csv', encoding='utf-8', index=False)
```
Final AUC:
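Note that KFold is imported above but never used; a cross-validated AUC would give a more stable estimate than a single split. A minimal sketch with the same hyperparameters:

```python
from sklearn.model_selection import StratifiedKFold

# 5-fold stratified CV on the training users; reports mean and std of AUC.
aucs = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=88)
for tr_idx, va_idx in skf.split(X_trainval, y_trainval):
    model = xgb.XGBClassifier(learning_rate=0.1, n_estimators=1000, max_depth=3,
                              min_child_weight=5, subsample=0.8, colsample_bytree=0.8,
                              objective='binary:logistic')
    model.fit(X_trainval.iloc[tr_idx], y_trainval.iloc[tr_idx])
    pred = model.predict_proba(X_trainval.iloc[va_idx])[:, 1]
    aucs.append(roc_auc_score(y_trainval.iloc[va_idx], pred))
print('CV AUC: %.4f +/- %.4f' % (np.mean(aucs), np.std(aucs)))
```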
Because of limited machine performance, some optimizations could not be completed, such as:
When modeling with sklearn in a real project, note the following: