This project is a competition hosted by Facebook on Kaggle; see the data-source link for the competition page and my personal GitHub for the complete code — feel free to take a look!
Data exploration — Data_Exploration.ipynb
Data preprocessing & feature engineering — Feature_Engineering.ipynb & Feature_Engineering2.ipynb
Model design and evaluation — Model_Design.ipynb
kaggle
numpy
pandas
matplotlib
sklearn
xgboost
mlxtend: provides the Stacking ensemble algorithm
The whole project takes roughly 60 minutes to run on Ubuntu with 8 GB of RAM; the results can be found in the submitted Jupyter notebook files.
Because the write-up is long, it is split into two posts covering four parts in total:
Data exploration
Data preprocessing and feature engineering
Model design
Evaluation and summary
```python
import numpy as np
import pandas as pd
%matplotlib inline
from IPython.display import display
```
```python
df_bids = pd.read_csv('bids.csv', low_memory=False)
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')
```
```python
df_bids.head()
```
| | bid_id | bidder_id | auction | merchandise | device | time | country | ip | url |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 8dac2b259fd1c6d1120e519fb1ac14fbqvax8 | ewmzr | jewelry | phone0 | 9759243157894736 | us | 69.166.231.58 | vasstdc27m7nks3 |
| 1 | 1 | 668d393e858e8126275433046bbd35c6tywop | aeqok | furniture | phone1 | 9759243157894736 | in | 50.201.125.84 | jmqlhflrzwuay9c |
| 2 | 2 | aa5f360084278b35d746fa6af3a7a1a5ra3xe | wa00e | home goods | phone2 | 9759243157894736 | py | 112.54.208.157 | vasstdc27m7nks3 |
| 3 | 3 | 3939ac3ef7d472a59a9c5f893dd3e39fh9ofi | jefix | jewelry | phone4 | 9759243157894736 | in | 18.99.175.133 | vasstdc27m7nks3 |
| 4 | 4 | 8393c48eaf4b8fa96886edc7cf27b372dsibi | jefix | jewelry | phone5 | 9759243157894736 | in | 145.138.5.37 | vasstdc27m7nks3 |
```python
df_train.head()  # df_train.dtypes
```
| | bidder_id | payment_account | address | outcome |
|---|---|---|---|---|
| 0 | 91a3c57b13234af24875c56fb7e2b2f4rb56a | a3d2de7675556553a5f08e4c88d2c228754av | a3d2de7675556553a5f08e4c88d2c228vt0u4 | 0.0 |
| 1 | 624f258b49e77713fc34034560f93fb3hu3jo | a3d2de7675556553a5f08e4c88d2c228v1sga | ae87054e5a97a8f840a3991d12611fdcrfbq3 | 0.0 |
| 2 | 1c5f4fc669099bfbfac515cd26997bd12ruaj | a3d2de7675556553a5f08e4c88d2c2280cybl | 92520288b50f03907041887884ba49c0cl0pd | 0.0 |
| 3 | 4bee9aba2abda51bf43d639013d6efe12iycd | 51d80e233f7b6a7dfdee484a3c120f3b2ita8 | 4cb9717c8ad7e88a9a284989dd79b98dbevyi | 0.0 |
| 4 | 4ab12bc61c82ddd9c2d65e60555808acqgos1 | a3d2de7675556553a5f08e4c88d2c22857ddh | 2a96c3ce94b3be921e0296097b88b56a7x1ji | 0.0 |
```python
# Check each table for missing values
print('Is there any missing value in bids?', df_bids.isnull().any().any())
print('Is there any missing value in train?', df_train.isnull().any().any())
print('Is there any missing value in test?', df_test.isnull().any().any())
```
```
Is there any missing value in bids? True
Is there any missing value in train? False
Is there any missing value in test? False
```
Checking all three datasets for missing values shows that the train and test user tables are complete, while the bids dataset does contain missing values; next we drill into bids to locate them.
```python
# nan_rows = df_bids[df_bids.isnull().any(axis=1)]
# print(nan_rows)
pd.isnull(df_bids).any()
```
```
bid_id         False
bidder_id      False
auction        False
merchandise    False
device         False
time           False
country         True
ip             False
url            False
dtype: bool
```
```python
missing_country = df_bids['country'].isnull().sum()
print('No. of missing country: ', missing_country)
normal_country = df_bids['country'].notnull().sum()
print('No. of normal country: ', normal_country)
```
```
No. of missing country:  8859
No. of normal country:  7647475
```
```python
import matplotlib.pyplot as plt

labels = ['unknown', 'normal']
sizes = [missing_country, normal_country]
explode = (0.1, 0)
fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')
plt.title('Distribution of missing countries vs. normal countries')
plt.show()
```
The analysis above shows that only a small fraction of the bid records are missing the country attribute. These missing values can be filled during preprocessing, following either of two approaches:
Group the raw bid data by bidder and fill each bidder's missing values with the country that bidder most frequently bids from.
Group the raw bid data by bidder, sort each group by time, and forward- or backward-fill missing values with the adjacent country records.
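The first strategy can be sketched on a toy frame (the column names follow bids.csv, but the rows themselves are made up for illustration):

```python
import pandas as pd

# Toy bid records: bidders 'a' and 'b' each have one missing country.
toy = pd.DataFrame({
    'bidder_id': ['a', 'a', 'a', 'b', 'b'],
    'time':      [1, 2, 3, 1, 2],
    'country':   ['us', None, 'us', None, 'in'],
})

def fill_mode(s):
    # mode() is empty when a group is all-NaN; leave such groups untouched
    m = s.mode()
    return s.fillna(m.iloc[0]) if not m.empty else s

# Fill each bidder's missing countries with that bidder's most frequent country
toy['country'] = toy.groupby('bidder_id')['country'].transform(fill_mode)
print(toy['country'].tolist())  # ['us', 'us', 'us', 'in', 'in']
```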
```python
# Number of records in each dataset,
# and whether each id column is a unique key
print(df_bids.shape[0])
print(len(df_bids['bid_id'].unique()))
print(df_train.shape[0])
print(len(df_train['bidder_id'].unique()))
print(df_test.shape[0])
print(len(df_test['bidder_id'].unique()))
```
```
7656334
7656334
2013
2013
4700
4700
```
```python
# Count the distinct values of each basic (categorical) feature, excluding time
print('total bidder in bids: ', len(df_bids['bidder_id'].unique()))
print('total auction in bids: ', len(df_bids['auction'].unique()))
print('total merchandise in bids: ', len(df_bids['merchandise'].unique()))
print('total device in bids: ', len(df_bids['device'].unique()))
print('total country in bids: ', len(df_bids['country'].unique()))
print('total ip in bids: ', len(df_bids['ip'].unique()))
print('total url in bids: ', len(df_bids['url'].unique()))
```
```
total bidder in bids:  6614
total auction in bids:  15051
total merchandise in bids:  10
total device in bids:  7351
total country in bids:  200
total ip in bids:  2303991
total url in bids:  1786351
```
From these basic counts we can see that:
The number of bidders appearing in bids is smaller than in train + test combined, i.e. the correspondence is not one-to-one; next we verify whether the bidders in bids all come from the train and test sets.
Merchandise and country have relatively few distinct values, so they can be extracted as natural categorical features, while the remaining features lend themselves to count statistics.
```python
lst_all_users = (df_train['bidder_id'].unique()).tolist() + (df_test['bidder_id'].unique()).tolist()
print('total bidders of train and test set', len(lst_all_users))
lst_bidder = (df_bids['bidder_id'].unique()).tolist()
print('total bidders in bids set', len(lst_bidder))
print('Are all bidders in bids from the train+test set? ', set(lst_bidder).issubset(set(lst_all_users)))
```
```
total bidders of train and test set 6713
total bidders in bids set 6614
Are all bidders in bids from the train+test set?  True
```
```python
lst_nobids = [i for i in lst_all_users if i not in lst_bidder]
print('No. of bidders never bid: ', len(lst_nobids))
lst_nobids_train = [i for i in lst_nobids if i in (df_train['bidder_id'].unique()).tolist()]
lst_nobids_test = [i for i in lst_nobids if i in (df_test['bidder_id'].unique()).tolist()]
print('No. of bidders never bid in train set: ', len(lst_nobids_train))
print('No. of bidders never bid in test set: ', len(lst_nobids_test))
```
```
No. of bidders never bid:  99
No. of bidders never bid in train set:  29
No. of bidders never bid in test set:  70
```
```python
data_source = ['train', 'test']
y_pos = np.arange(len(data_source))
num_never_bids = [len(lst_nobids_train), len(lst_nobids_test)]
plt.bar(y_pos, num_never_bids, align='center', alpha=0.5)
plt.xticks(y_pos, data_source)
plt.ylabel('bidders no bids')
plt.title('Source of no bids bidders')
plt.show()
```
```python
print(df_train[(df_train['bidder_id'].isin(lst_nobids_train)) & (df_train['outcome'] == 1.0)])
```
```
Empty DataFrame
Columns: [bidder_id, payment_account, address, outcome]
Index: []
```
The computation above shows 99 bidders with no bid records at all: 29 from the train set and 70 from the test set. None of the 29 train-set bidders is labeled as a bot, so the 70 test-set bidders can later be labeled as human, or assigned the mean prediction.
```python
# Check the proportion of bots in train
print((df_train[df_train['outcome'] == 1].shape[0] * 1.0) / df_train.shape[0] * 100, '%')
```
```
5.11674118231 %
```
Users labeled as bots account for roughly 5% of all users in the training set.
```python
df_train.groupby('outcome').size().plot(labels=['Human', 'Robot'], kind='pie',
                                        autopct='%.2f', figsize=(4, 4),
                                        title='Distribution of Human vs. Robots', legend=True)
```
```
<matplotlib.axes._subplots.AxesSubplot at 0x7f477135c5d0>
```
The class distribution in the training set is clearly imbalanced, so we will use AUC (which is insensitive to the class ratio) as the evaluation metric, and favor models from the Gradient Boosting family for training.
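As a quick sanity check of why AUC suits imbalanced data (the labels and scores below are made up): AUC depends only on how positives rank against negatives, not on how rare the positive class is.

```python
from sklearn.metrics import roc_auc_score

# One positive among three negatives vs. one positive among 300 negatives:
# the positive outranks every negative in both cases, so AUC is 1.0 for both.
y_small = [0, 0, 0, 1]
s_small = [0.1, 0.2, 0.3, 0.9]
y_big = [0] * 300 + [1]
s_big = ([0.1] * 100 + [0.2] * 100 + [0.3] * 100) + [0.9]
print(roc_auc_score(y_small, s_small), roc_auc_score(y_big, s_big))  # 1.0 1.0
```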
```python
import numpy as np
import pandas as pd
import pickle
%matplotlib inline
from IPython.display import display
```
```python
bids = pd.read_csv('bids.csv')
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
```
To handle the missing country attribute found in the bids data during exploration, we group the raw bid data by bidder, sort each group by time, and forward- and backward-fill the missing values with the adjacent country records.
```python
display(bids.head())
```
| | bid_id | bidder_id | auction | merchandise | device | time | country | ip | url |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 8dac2b259fd1c6d1120e519fb1ac14fbqvax8 | ewmzr | jewelry | phone0 | 9759243157894736 | us | 69.166.231.58 | vasstdc27m7nks3 |
| 1 | 1 | 668d393e858e8126275433046bbd35c6tywop | aeqok | furniture | phone1 | 9759243157894736 | in | 50.201.125.84 | jmqlhflrzwuay9c |
| 2 | 2 | aa5f360084278b35d746fa6af3a7a1a5ra3xe | wa00e | home goods | phone2 | 9759243157894736 | py | 112.54.208.157 | vasstdc27m7nks3 |
| 3 | 3 | 3939ac3ef7d472a59a9c5f893dd3e39fh9ofi | jefix | jewelry | phone4 | 9759243157894736 | in | 18.99.175.133 | vasstdc27m7nks3 |
| 4 | 4 | 8393c48eaf4b8fa96886edc7cf27b372dsibi | jefix | jewelry | phone5 | 9759243157894736 | in | 145.138.5.37 | vasstdc27m7nks3 |
```python
# pd.algos.is_monotonic_int64(bids.time.values, True)[0]
print('Is the time monotonically non-decreasing? ', pd.Index(bids['time']).is_monotonic_increasing)
```
```
Is the time monotonically non-decreasing?  False
```
```python
# bidder_group = bids.sort_values(['bidder_id', 'time']).groupby('bidder_id')
bids['country'] = bids.sort_values(['bidder_id', 'time']).groupby('bidder_id')['country'].ffill()
bids['country'] = bids.sort_values(['bidder_id', 'time']).groupby('bidder_id')['country'].bfill()
```
```python
display(bids.head())
```
| | bid_id | bidder_id | auction | merchandise | device | time | country | ip | url |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 8dac2b259fd1c6d1120e519fb1ac14fbqvax8 | ewmzr | jewelry | phone0 | 9759243157894736 | us | 69.166.231.58 | vasstdc27m7nks3 |
| 1 | 1 | 668d393e858e8126275433046bbd35c6tywop | aeqok | furniture | phone1 | 9759243157894736 | in | 50.201.125.84 | jmqlhflrzwuay9c |
| 2 | 2 | aa5f360084278b35d746fa6af3a7a1a5ra3xe | wa00e | home goods | phone2 | 9759243157894736 | py | 112.54.208.157 | vasstdc27m7nks3 |
| 3 | 3 | 3939ac3ef7d472a59a9c5f893dd3e39fh9ofi | jefix | jewelry | phone4 | 9759243157894736 | in | 18.99.175.133 | vasstdc27m7nks3 |
| 4 | 4 | 8393c48eaf4b8fa96886edc7cf27b372dsibi | jefix | jewelry | phone5 | 9759243157894736 | in | 145.138.5.37 | vasstdc27m7nks3 |
```python
print('Is there any missing value in bids?', bids.isnull().any().any())
# pickle.dump(bids, open('bids.pkl', 'wb'))
```
```
Is there any missing value in bids? True
```
```python
missing_country = bids['country'].isnull().sum()
print('No. of missing country: ', missing_country)
normal_country = bids['country'].notnull().sum()
print('No. of normal country: ', normal_country)
```
```
No. of missing country:  5
No. of normal country:  7656329
```
```python
nan_rows = bids[bids.isnull().any(axis=1)]
print(nan_rows)
```
```
          bid_id                              bidder_id auction  \
1351177  1351177  f3ab8c9ecc0d021ebc81e89f20c8267bn812w   jefix
2754184  2754184  88ef9cfdbec4c9e33f6c2e0b512e7a01dp2p2   cc5fs
2836631  2836631  29b8af2fea3881ef61911612372dac41vczqv   jqx39
3125892  3125892  df20f216cbb0b0df5a7b2e94b16a7853iyw9g   jqx39
5153748  5153748  5e05ec450e2dd64d7996a08bbbca4f126nzzk   jqx39

              merchandise    device              time country  \
1351177  office equipment   phone84  9767200789473684     NaN
2754184            mobile  phone150  9633363947368421     NaN
2836631           jewelry   phone72  9634034894736842     NaN
3125892   books and music  phone106  9635755105263157     NaN
5153748            mobile  phone267  9645270210526315     NaN

                      ip              url
1351177   80.211.119.111  g9pgdfci3yseml5
2754184     20.67.240.88  ctivbfq55rktail
2836631  149.210.107.205  vasstdc27m7nks3
3125892      26.23.62.59  ac9xlqtfg0cx5c5
5153748     145.7.194.40  0em0vg1f0zuxonw
```
```python
# print(bids[bids['bid_id'] == 1351177])
nan_bidder = nan_rows['bidder_id'].values.tolist()
# print(nan_bidder)
print(bids[bids['bidder_id'].isin(nan_bidder)])
```
```
          bid_id                              bidder_id auction  \
1351177  1351177  f3ab8c9ecc0d021ebc81e89f20c8267bn812w   jefix
2754184  2754184  88ef9cfdbec4c9e33f6c2e0b512e7a01dp2p2   cc5fs
2836631  2836631  29b8af2fea3881ef61911612372dac41vczqv   jqx39
3125892  3125892  df20f216cbb0b0df5a7b2e94b16a7853iyw9g   jqx39
5153748  5153748  5e05ec450e2dd64d7996a08bbbca4f126nzzk   jqx39

              merchandise    device              time country  \
1351177  office equipment   phone84  9767200789473684     NaN
2754184            mobile  phone150  9633363947368421     NaN
2836631           jewelry   phone72  9634034894736842     NaN
3125892   books and music  phone106  9635755105263157     NaN
5153748            mobile  phone267  9645270210526315     NaN

                      ip              url
1351177   80.211.119.111  g9pgdfci3yseml5
2754184     20.67.240.88  ctivbfq55rktail
2836631  149.210.107.205  vasstdc27m7nks3
3125892      26.23.62.59  ac9xlqtfg0cx5c5
5153748     145.7.194.40  0em0vg1f0zuxonw
```
Even after forward- and backward-filling countries per bidder over time, 5 bidders still have no country information; it turns out each of them placed only a single bid. Let us look at what else characterizes these 5 bidders.
```python
lst_nan_train = [i for i in nan_bidder if i in (train['bidder_id'].unique()).tolist()]
lst_nan_test = [i for i in nan_bidder if i in (test['bidder_id'].unique()).tolist()]
print('No. of bidders 1 bid in train set: ', len(lst_nan_train))
print('No. of bidders 1 bid in test set: ', len(lst_nan_test))
```
```
No. of bidders 1 bid in train set:  1
No. of bidders 1 bid in test set:  4
```
```python
print(train[train['bidder_id'] == lst_nan_train[0]]['outcome'])
```
```
546    0.0
Name: outcome, dtype: float64
```
Since these 5 bidders each placed only a single bid — 1 from the train set (labeled human) and 4 from the test set — and their activity is too sparse to be useful, we drop their bid records; the 4 test-set bidders will later be handled like the no-bid bidders, with their predictions filled by the final model's mean prediction.
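At submission time, that fill-with-the-mean step might look like the sketch below (the column names follow Kaggle's sampleSubmission format; the bidder ids and probabilities are made up):

```python
import pandas as pd

# u3 and u4 stand for bidders whose bid records were dropped (or who never bid),
# so the model produced no probability for them.
sub = pd.DataFrame({
    'bidder_id':  ['u1', 'u2', 'u3', 'u4'],
    'prediction': [0.9, 0.1, None, None],
})
# Fill the gaps with the mean of the predictions the model did produce
sub['prediction'] = sub['prediction'].fillna(sub['prediction'].mean())
print(sub['prediction'].tolist())  # [0.9, 0.1, 0.5, 0.5]
```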
```python
bid_to_drop = nan_rows.index.values.tolist()
# print(bid_to_drop)
bids.drop(bids.index[bid_to_drop], inplace=True)
```
```python
print('Is there any missing value in bids?', bids.isnull().any().any())
pickle.dump(bids, open('bids.pkl', 'wb'))
```
```
Is there any missing value in bids? False
```
From the earlier exploration, the dataset consists mostly of categorical or discrete features, so we first group the bid data by bidder and count the distinct values of each attribute — devices used, countries bid from, distinct IPs, and so on.
```python
# Group by bidder to compute per-bidder statistics
bidders = bids.groupby('bidder_id')
# pickle.dump(bids, open('bidders.pkl', 'wb'))
```
```python
# print(bidders['device'].count())
def feature_count(group):
    dct_cnt = {}
    dct_cnt['devices_c'] = group['device'].unique().shape[0]
    dct_cnt['countries_c'] = group['country'].unique().shape[0]
    dct_cnt['ip_c'] = group['ip'].unique().shape[0]
    dct_cnt['url_c'] = group['url'].unique().shape[0]
    dct_cnt['auction_c'] = group['auction'].unique().shape[0]
    dct_cnt['auc_mean'] = np.mean(group['auction'].value_counts())  # bids_c/auction_c
    # dct_cnt['dev_mean'] = np.mean(group['device'].value_counts())  # bids_c/devices_c
    dct_cnt['merch_c'] = group['merchandise'].unique().shape[0]
    dct_cnt['bids_c'] = group.shape[0]
    return pd.Series(dct_cnt)
```
```python
cnt_bidder = bidders.apply(feature_count)
```
```python
display(cnt_bidder.describe())
# cnt_bidder.to_csv('cnt_bidder.csv')
# print(cnt_bidder[cnt_bidder['merch_c'] == 2])
```
| | auc_mean | auction_c | bids_c | countries_c | devices_c | ip_c | merch_c | url_c |
|---|---|---|---|---|---|---|---|---|
| count | 6609.000000 | 6609.000000 | 6609.000000 | 6609.000000 | 6609.000000 | 6609.000000 | 6609.000000 | 6609.000000 |
| mean | 6.593493 | 57.850810 | 1158.470117 | 12.733848 | 73.492359 | 544.507187 | 1.000151 | 290.964140 |
| std | 30.009242 | 131.814053 | 9596.595169 | 22.556570 | 172.171106 | 3370.730666 | 0.012301 | 2225.912425 |
| min | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| 25% | 1.000000 | 2.000000 | 3.000000 | 1.000000 | 2.000000 | 2.000000 | 1.000000 | 1.000000 |
| 50% | 1.677419 | 10.000000 | 18.000000 | 3.000000 | 8.000000 | 12.000000 | 1.000000 | 5.000000 |
| 75% | 4.142857 | 47.000000 | 187.000000 | 12.000000 | 57.000000 | 111.000000 | 1.000000 | 36.000000 |
| max | 1327.366667 | 1726.000000 | 515033.000000 | 178.000000 | 2618.000000 | 111918.000000 | 2.000000 | 81376.000000 |
After grouping the bid data by bidder, we build a scatter matrix over the engineered count features to inspect their pairwise correlations.
```python
# Scatter matrix over every pair of features
pd.plotting.scatter_matrix(cnt_bidder, alpha=0.3, figsize=(16, 10), diagonal='kde');
```
From the per-bidder statistics in the table above (timestamps not yet considered), a few basic conclusions follow:
Comparing each statistic's maximum against its median and 75th percentile, every feature except the merchandise count contains some outlying values, which may be worth tracking as anomalous behavior.
All features are heavily skewed, so we log-transform them and re-plot the scatter matrix to re-check the correlations.
The merchandise count has tiny variance, and the median and even the 75th percentile show that most bidders bid within a single merchandise category; since exploration already suggested merchandise works better as a categorical feature (to be split into per-category statistics later), we drop it from the count features.
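The effect of the log transform on skew can be checked on synthetic data (a lognormal sample standing in for the heavy-tailed counts; nothing below comes from the competition data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Lognormal values mimic the long right tail of the count features
counts = pd.Series(np.exp(rng.normal(size=5000)))
# Skew is large before the transform and near zero after it
print(counts.skew(), np.log(counts).skew())
```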
```python
cnt_bidder.drop('merch_c', axis=1, inplace=True)
```
```python
cnt_bidder = np.log(cnt_bidder)
```
```python
pd.plotting.scatter_matrix(cnt_bidder, alpha=0.3, figsize=(16, 10), diagonal='kde');
```
The scatter matrix shows no strong correlations among the behavioral features. The ip count does show mild positive correlation with the bid count and the device count after the log transform, but since this only appears post-transform and is weak, all three features are kept.
To probe the anomalies noted above, we pick a few bot and human samples from the original train set and trace their per-bidder statistics for comparison.
```python
cnt_bidder.to_csv('cnt_bidder.csv')
```
```python
# Trace four samples: first 2 bots, last 2 humans
indices = ['9434778d2268f1fa2a8ede48c0cd05c097zey', 'aabc211b4cf4d29e4ac7e7e361371622pockb',
           'd878560888b11447e73324a6e263fbd5iydo1', '91a3c57b13234af24875c56fb7e2b2f4rb56a']
# Build a DataFrame for the chosen indices
samples = pd.DataFrame(cnt_bidder.loc[indices], columns=cnt_bidder.keys()).reset_index(drop=True)
print("Chosen samples of training dataset: (first 2 bots, last 2 humans)")
display(samples)
```
```
Chosen samples of training dataset: (first 2 bots, last 2 humans)
```
| | auc_mean | auction_c | bids_c | countries_c | devices_c | ip_c | url_c |
|---|---|---|---|---|---|---|---|
| 0 | 3.190981 | 5.594711 | 8.785692 | 4.174387 | 6.011267 | 8.147578 | 7.557995 |
| 1 | 2.780432 | 4.844187 | 7.624619 | 2.639057 | 3.178054 | 5.880533 | 1.609438 |
| 2 | 0.287682 | 1.098612 | 1.386294 | 1.098612 | 1.386294 | 1.386294 | 0.000000 |
| 3 | 0.287682 | 2.890372 | 3.178054 | 1.791759 | 2.639057 | 2.995732 | 0.000000 |
We use seaborn to visualize these four samples' percentile ranks as a heatmap.
```python
import matplotlib.pyplot as plt
import seaborn as sns

# Percentile ranks of the four samples within the whole feature table
pcts = 100. * cnt_bidder.rank(axis=0, pct=True).loc[indices].round(decimals=3)
print(pcts)
# Visualize the percentiles as a heatmap
sns.heatmap(pcts, yticklabels=['robot 1', 'robot 2', 'human 1', 'human 2'],
            annot=True, linewidth=.1, vmax=99, fmt='.1f', cmap='YlGnBu')
plt.title('Percentile ranks of\nsamples\' feature statistics')
plt.xticks(rotation=45, ha='center');
```
```
                                       auc_mean  auction_c  bids_c  \
bidder_id
9434778d2268f1fa2a8ede48c0cd05c097zey      94.9       94.6    97.0
aabc211b4cf4d29e4ac7e7e361371622pockb      92.4       87.2    92.3
d878560888b11447e73324a6e263fbd5iydo1      39.8       30.4    30.2
91a3c57b13234af24875c56fb7e2b2f4rb56a      39.8       60.2    53.0

                                       countries_c  devices_c  ip_c  url_c
bidder_id
9434778d2268f1fa2a8ede48c0cd05c097zey         95.4       95.6  96.7   97.4
aabc211b4cf4d29e4ac7e7e361371622pockb         77.3       63.8  84.8   50.3
d878560888b11447e73324a6e263fbd5iydo1         48.8       38.7  34.2   13.4
91a3c57b13234af24875c56fb7e2b2f4rb56a         63.7       56.8  56.2   13.4
```
The heatmap comparison shows that, apart from the merchandise statistic, every indicator is higher for the bots than for the human users, which suggests a baseline model built on simple statistical rules. The most discriminative feature appears to be auc_mean, a user's average number of bids per auction. We therefore start by applying standard outlier detection to the basic statistics above.
Since the ultimate goal is to find bots among the bidders, and common sense says bots bid far more frequently than humans, we can design a naive classifier from an outlier-detection angle: using the per-bidder count features computed earlier, build a list of suspicious bidders for each feature, then merge the per-feature lists and treat bidders who are anomalous in several features as bots.
```python
# Find the outliers for each feature (Tukey's 1.5 * IQR rule)
lst_outlier = []
for feature in cnt_bidder.keys():
    # 25th percentile
    Q1 = np.percentile(cnt_bidder[feature], 25)
    # 75th percentile
    Q3 = np.percentile(cnt_bidder[feature], 75)
    step = 1.5 * (Q3 - Q1)
    # Show the outliers
    # print("Data points considered outliers for the feature '{}':".format(feature))
    mask = ~((cnt_bidder[feature] >= Q1 - step) & (cnt_bidder[feature] <= Q3 + step))
    display(cnt_bidder[mask])
    lst_outlier += cnt_bidder[mask].index.values.tolist()
```
Having collected every bidder id flagged as an outlier in any feature, we count how many features each id is anomalous in. After some testing, and given that bots are under 5% of the train set, we settle on "anomalous in at least one feature" as the criterion; intersecting that list with the train set yields the subset the naive classifier will mark as bots.
```python
# print(len(lst_outlier))
from collections import Counter

freq_outlier = dict(Counter(lst_outlier))
perhaps_outlier = [i for i in freq_outlier if freq_outlier[i] >= 1]
print(len(perhaps_outlier))
```
```
214
```
```python
# basic_pred = test[test['bidder_id'].isin(perhaps_outlier)]['bidder_id'].tolist()
train_pred = train[train['bidder_id'].isin(perhaps_outlier)]['bidder_id'].tolist()
print(len(train_pred))
```
```
76
```
Data exploration showed the negative-to-positive ratio in this dataset is about 19:1, which is fairly imbalanced, so we use AUC (unaffected by the class ratio) as the metric and compute a baseline score for the naive classifier on the original training set.
```python
from sklearn.metrics import roc_auc_score

y_true = train['outcome']
naive_pred = pd.DataFrame(columns=['bidder_id', 'prediction'])
naive_pred['bidder_id'] = train['bidder_id']
naive_pred['prediction'] = np.where(naive_pred['bidder_id'].isin(train_pred), 1.0, 0.0)
basic_pred = naive_pred['prediction']
print(roc_auc_score(y_true, basic_pred))
```
```
0.54661464952
```
After the count-feature statistics above, the timestamp (the one non-categorical feature) is still unprocessed. From the earlier exploration, the merchandise and country categorical attributes can each be expanded from a single column into several per-category features, and the analysis above suggests a further grouping per bidder and auction. The remaining steps are:
Process the timestamps
Expand merchandise and country into multiple per-category statistics
Group by bidder and auction for further statistics
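The second step, expanding a categorical column into per-category counts, can be sketched with pd.crosstab on a toy frame (the rows are made up; in the real pipeline the inputs would be bids['bidder_id'] and bids['merchandise']):

```python
import pandas as pd

toy = pd.DataFrame({
    'bidder_id':   ['a', 'a', 'b', 'b', 'b'],
    'merchandise': ['jewelry', 'mobile', 'jewelry', 'jewelry', 'books'],
})
# One row per bidder, one column per merchandise category, cells = bid counts
counts = pd.crosstab(toy['bidder_id'], toy['merchandise'])
print(counts.loc['b', 'jewelry'])  # 2
```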
The main analysis concerns the time gaps between bids: within each auction in the bids table, compute the interval between each bid and the preceding bid (by any user), then for each bidder derive, over their intervals:
mean time interval
maximum time interval
minimum time interval
```python
from collections import defaultdict

def generate_timediff():
    bids_grouped = bids.groupby('auction')
    bds = defaultdict(list)
    for _, bids_auc in bids_grouped:
        last_row = None  # reset per auction so intervals never span auctions
        for i, row in bids_auc.iterrows():
            if last_row is None:
                last_row = row
                continue
            time_difference = row['time'] - last_row['time']
            bds[row['bidder_id']].append(time_difference)
            last_row = row
    df = []
    for key in bds.keys():
        df.append({'bidder_id': key,
                   'mean': np.mean(bds[key]),
                   'min': np.min(bds[key]),
                   'max': np.max(bds[key])})
    pd.DataFrame(df).to_csv('tdiff.csv', index=False)
```
```python
generate_timediff()
```
Since this post has reached the length limit, please continue with the second part: Identifying cheating bot users in auctions with machine learning (II).