Using Machine Learning to Identify Cheating Robot Bidders in Online Auctions

Human or Robot

This project is based on a Kaggle competition hosted by Facebook (see the data source link below for the competition page); the complete code is on my GitHub, feel free to take a look.

Code

  • Data exploration — Data_Exploration.ipynb

  • Data preprocessing & feature engineering — Feature_Engineering.ipynb & Feature_Engineering2.ipynb

  • Model design and evaluation — Model_Design.ipynb

Project data source

Additional packages required for the project


Because the article is quite long, it is split into two posts, covering four parts in total:

  • Data exploration

  • Data preprocessing and feature engineering

  • Model design

  • Evaluation and summary

Data Exploration

import numpy as np
import pandas as pd
%matplotlib inline
from IPython.display import display
df_bids = pd.read_csv('bids.csv', low_memory=False)
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')
df_bids.head()
bid_id bidder_id auction merchandise device time country ip url
0 0 8dac2b259fd1c6d1120e519fb1ac14fbqvax8 ewmzr jewelry phone0 9759243157894736 us 69.166.231.58 vasstdc27m7nks3
1 1 668d393e858e8126275433046bbd35c6tywop aeqok furniture phone1 9759243157894736 in 50.201.125.84 jmqlhflrzwuay9c
2 2 aa5f360084278b35d746fa6af3a7a1a5ra3xe wa00e home goods phone2 9759243157894736 py 112.54.208.157 vasstdc27m7nks3
3 3 3939ac3ef7d472a59a9c5f893dd3e39fh9ofi jefix jewelry phone4 9759243157894736 in 18.99.175.133 vasstdc27m7nks3
4 4 8393c48eaf4b8fa96886edc7cf27b372dsibi jefix jewelry phone5 9759243157894736 in 145.138.5.37 vasstdc27m7nks3
df_train.head()
# df_train.dtypes
bidder_id payment_account address outcome
0 91a3c57b13234af24875c56fb7e2b2f4rb56a a3d2de7675556553a5f08e4c88d2c228754av a3d2de7675556553a5f08e4c88d2c228vt0u4 0.0
1 624f258b49e77713fc34034560f93fb3hu3jo a3d2de7675556553a5f08e4c88d2c228v1sga ae87054e5a97a8f840a3991d12611fdcrfbq3 0.0
2 1c5f4fc669099bfbfac515cd26997bd12ruaj a3d2de7675556553a5f08e4c88d2c2280cybl 92520288b50f03907041887884ba49c0cl0pd 0.0
3 4bee9aba2abda51bf43d639013d6efe12iycd 51d80e233f7b6a7dfdee484a3c120f3b2ita8 4cb9717c8ad7e88a9a284989dd79b98dbevyi 0.0
4 4ab12bc61c82ddd9c2d65e60555808acqgos1 a3d2de7675556553a5f08e4c88d2c22857ddh 2a96c3ce94b3be921e0296097b88b56a7x1ji 0.0

Detecting Anomalous Data

# Check whether any of the tables contain null values
print 'Is there any missing value in bids?',df_bids.isnull().any().any()
print 'Is there any missing value in train?',df_train.isnull().any().any()
print 'Is there any missing value in test?',df_test.isnull().any().any()
Is there any missing value in bids? True
Is there any missing value in train? False
Is there any missing value in test? False

Checking all three datasets for null values shows that the train and test user datasets have no missing data, while the bids (bidding behavior) dataset does contain missing values. Next we look into bids to locate them.

# nan_rows = df_bids[df_bids.isnull().T.any().T]
# print nan_rows
pd.isnull(df_bids).any()
bid_id         False
bidder_id      False
auction        False
merchandise    False
device         False
time           False
country         True
ip             False
url            False
dtype: bool
missing_country = df_bids['country'].isnull().sum().sum()
print 'No. of missing country: ', missing_country
normal_country = df_bids['country'].notnull().sum().sum()
print 'No. of normal country: ', normal_country
No. of missing country:  8859
No. of normal country:  7647475
import matplotlib.pyplot as plt
labels = ['unknown', 'normal']
sizes = [missing_country, normal_country]
explode = (0.1, 0)
fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
ax1.axis('equal')
plt.title('Distribution of missing countries vs. normal countries')
plt.show()

(Figure: pie chart of missing vs. normal country values)

The analysis above shows that only a very small fraction of the bid records are missing the country attribute. These missing values can be imputed during preprocessing, with two possible approaches:

  • Group the raw bid records by bidder, find the country each bidder most often bids from, and fill the missing values with that bidder's usual country (see the sketch after this list).

  • Group the raw bid records by bidder and, in time order, forward- or backward-fill each group's missing values with the adjacent country.
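The first approach is not used later in this write-up; a minimal sketch of it, assuming the df_bids DataFrame loaded above and a hypothetical helper name, could look like this:

def fill_country_with_mode(df):
    # hypothetical helper: fill each bidder's missing countries with that bidder's
    # modal country; bidders whose bids never record a country keep NaN
    def impute(group):
        if group['country'].notnull().any():
            group['country'] = group['country'].fillna(group['country'].mode().iloc[0])
        return group
    return df.groupby('bidder_id', group_keys=False).apply(impute)

# df_bids_filled = fill_country_with_mode(df_bids)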

# Check the number of records in each dataset
# Verify whether each id column uniquely identifies its records
print df_bids.shape[0]
print len(df_bids['bid_id'].unique())
print df_train.shape[0]
print len(df_train['bidder_id'].unique())
print df_test.shape[0]
print len(df_test['bidder_id'].unique())
7656334
7656334
2013
2013
4700
4700
# Count the distinct values of each basic (categorical) feature, excluding time
print 'total bidder in bids: ', len(df_bids['bidder_id'].unique())
print 'total auction in bids: ', len(df_bids['auction'].unique())
print 'total merchandise in bids: ', len(df_bids['merchandise'].unique())
print 'total device in bids: ', len(df_bids['device'].unique())
print 'total country in bids: ', len(df_bids['country'].unique())
print 'total ip in bids: ', len(df_bids['ip'].unique())
print 'total url in bids: ', len(df_bids['url'].unique())
total bidder in bids:  6614
total auction in bids:  15051
total merchandise in bids:  10
total device in bids:  7351
total country in bids:  200
total ip in bids:  2303991
total url in bids:  1786351

From these basic counts we can see:

  • The number of bidders in the bids data is smaller than the total number of users in train + test, i.e. they are not in one-to-one correspondence. Next we verify whether every bidder in bids comes from the train or test set.

  • Merchandise and country have far fewer distinct values than the other features, so they are natural categorical features to extract and handle separately, while the remaining features are better suited to count statistics.

lst_all_users = (df_train['bidder_id'].unique()).tolist() + (df_test['bidder_id'].unique()).tolist()
print 'total bidders of train and test set',len(lst_all_users)
lst_bidder = (df_bids['bidder_id'].unique()).tolist()
print 'total bidders in bids set',len(lst_bidder)
print 'Is bidders in bids are all from train+test set? ',set(lst_bidder).issubset(set(lst_all_users))
total bidders of train and test set 6713
total bidders in bids set 6614
Is bidders in bids are all from train+test set?  True
lst_nobids = [i for i in lst_all_users if i not in lst_bidder]
print 'No. of bidders never bid: ',len(lst_nobids)
lst_nobids_train = [i for i in lst_nobids if i in (df_train['bidder_id'].unique()).tolist()]
lst_nobids_test = [i for i in lst_nobids if i in (df_test['bidder_id'].unique()).tolist()]
print 'No. of bidders never bid in train set: ',len(lst_nobids_train)
print 'No. of bidders never bid in test set: ',len(lst_nobids_test)
No. of bidders never bid:  99
No. of bidders never bid in train set:  29
No. of bidders never bid in test set:  70
data_source = ['train', 'test']
y_pos = np.arange(len(data_source))
num_never_bids = [len(lst_nobids_train), len(lst_nobids_test)]
plt.bar(y_pos, num_never_bids, align='center', alpha=0.5)
plt.xticks(y_pos, data_source)
plt.ylabel('bidders no bids')
plt.title('Source of no bids bidders')
plt.show()

(Figure: bar chart of bidders with no bids, by source dataset)

print df_train[(df_train['bidder_id'].isin(lst_nobids_train)) & (df_train['outcome']==1.0)]
Empty DataFrame
Columns: [bidder_id, payment_account, address, outcome]
Index: []

The computation above shows that 99 bidders have no bid records at all: 29 from the train set and 70 from the test set. None of the 29 train-set bidders is labeled as a robot, so the 70 test-set bidders can later either be labeled as human or be assigned the mean predicted value, as sketched below.
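A hedged sketch of that handling (hypothetical names, not code from the notebooks): once a submission table exists, bidders that never appear in bids.csv get either a constant 0.0 ("human") or the mean predicted probability of the remaining bidders.

def fill_no_bid_predictions(submission, no_bid_ids, strategy='mean'):
    # submission: DataFrame with columns ['bidder_id', 'prediction']
    # no_bid_ids: bidder_ids absent from bids.csv (both arguments are hypothetical)
    mask = submission['bidder_id'].isin(no_bid_ids)
    if strategy == 'mean':
        fill_value = submission.loc[~mask, 'prediction'].mean()
    else:  # treat them as human
        fill_value = 0.0
    submission.loc[mask, 'prediction'] = fill_value
    return submission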

# check the proportion of bots in the train set
print (df_train[df_train['outcome'] == 1].shape[0]*1.0) / df_train.shape[0] * 100,'%'
5.11674118231 %

Users labeled as robots account for roughly 5% of all users in the train set.

df_train.groupby('outcome').size().plot(labels=['Human', 'Robot'], kind='pie', autopct='%.2f', figsize=(4, 4), 
                                        title='Distribution of Human vs. Robots', legend=True)
<matplotlib.axes._subplots.AxesSubplot at 0x7f477135c5d0>

(Figure: pie chart of the human vs. robot class distribution)

The class distribution above shows that the positive and negative examples are imbalanced, so we will use AUC (which is unaffected by the class ratio) as the evaluation metric and favor models from the Gradient Boosting family; a sketch of such an evaluation setup follows.
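As a sketch of that setup (assuming a scikit-learn version that provides model_selection; the actual model design is covered in part two, so the estimator and its parameters here are placeholders):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def cv_auc(X, y, n_splits=5):
    # mean cross-validated AUC for a gradient-boosting baseline; X and y stand for
    # the per-bidder feature matrix and labels built in the feature-engineering part
    clf = GradientBoostingClassifier(random_state=0)
    return cross_val_score(clf, X, y, cv=n_splits, scoring='roc_auc').mean()

# print('mean CV AUC: ', cv_auc(X, y))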

Data Preprocessing and Feature Engineering

import numpy as np
import pandas as pd
import pickle
%matplotlib inline
from IPython.display import display
bids = pd.read_csv('bids.csv')
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

Handling Missing Data

To address the missing country attribute found during data exploration, we impute by grouping the raw bid records by bidder and, in time order, forward- and backward-filling each group's missing values with the adjacent country.

display(bids.head())
bid_id bidder_id auction merchandise device time country ip url
0 0 8dac2b259fd1c6d1120e519fb1ac14fbqvax8 ewmzr jewelry phone0 9759243157894736 us 69.166.231.58 vasstdc27m7nks3
1 1 668d393e858e8126275433046bbd35c6tywop aeqok furniture phone1 9759243157894736 in 50.201.125.84 jmqlhflrzwuay9c
2 2 aa5f360084278b35d746fa6af3a7a1a5ra3xe wa00e home goods phone2 9759243157894736 py 112.54.208.157 vasstdc27m7nks3
3 3 3939ac3ef7d472a59a9c5f893dd3e39fh9ofi jefix jewelry phone4 9759243157894736 in 18.99.175.133 vasstdc27m7nks3
4 4 8393c48eaf4b8fa96886edc7cf27b372dsibi jefix jewelry phone5 9759243157894736 in 145.138.5.37 vasstdc27m7nks3
# pd.algos.is_monotonic_int64(bids.time.values, True)[0]
print 'Is the time monotonically non-decreasing? ', pd.Index(bids['time']).is_monotonic
Is the time monotonically non-decreasing?  False
# bidder_group = bids.sort_values(['bidder_id', 'time']).groupby('bidder_id')
bids['country'] = bids.sort_values(['bidder_id', 'time']).groupby('bidder_id')['country'].ffill()
bids['country'] = bids.sort_values(['bidder_id', 'time']).groupby('bidder_id')['country'].bfill()
display(bids.head())
bid_id bidder_id auction merchandise device time country ip url
0 0 8dac2b259fd1c6d1120e519fb1ac14fbqvax8 ewmzr jewelry phone0 9759243157894736 us 69.166.231.58 vasstdc27m7nks3
1 1 668d393e858e8126275433046bbd35c6tywop aeqok furniture phone1 9759243157894736 in 50.201.125.84 jmqlhflrzwuay9c
2 2 aa5f360084278b35d746fa6af3a7a1a5ra3xe wa00e home goods phone2 9759243157894736 py 112.54.208.157 vasstdc27m7nks3
3 3 3939ac3ef7d472a59a9c5f893dd3e39fh9ofi jefix jewelry phone4 9759243157894736 in 18.99.175.133 vasstdc27m7nks3
4 4 8393c48eaf4b8fa96886edc7cf27b372dsibi jefix jewelry phone5 9759243157894736 in 145.138.5.37 vasstdc27m7nks3
print 'Is there any missing value in bids?',bids.isnull().any().any()
# pickle.dump(bids, open('bids.pkl', 'w'))
Is there any missing value in bids? True
missing_country = bids['country'].isnull().sum().sum()
print 'No. of missing country: ', missing_country
normal_country = bids['country'].notnull().sum().sum()
print 'No. of normal country: ', normal_country
No. of missing country:  5
No. of normal country:  7656329
nan_rows = bids[bids.isnull().T.any().T]
print nan_rows
bid_id                              bidder_id auction  \
1351177  1351177  f3ab8c9ecc0d021ebc81e89f20c8267bn812w   jefix   
2754184  2754184  88ef9cfdbec4c9e33f6c2e0b512e7a01dp2p2   cc5fs   
2836631  2836631  29b8af2fea3881ef61911612372dac41vczqv   jqx39   
3125892  3125892  df20f216cbb0b0df5a7b2e94b16a7853iyw9g   jqx39   
5153748  5153748  5e05ec450e2dd64d7996a08bbbca4f126nzzk   jqx39   

              merchandise    device              time country  \
1351177  office equipment   phone84  9767200789473684     NaN   
2754184            mobile  phone150  9633363947368421     NaN   
2836631           jewelry   phone72  9634034894736842     NaN   
3125892   books and music  phone106  9635755105263157     NaN   
5153748            mobile  phone267  9645270210526315     NaN   

                      ip              url  
1351177   80.211.119.111  g9pgdfci3yseml5  
2754184     20.67.240.88  ctivbfq55rktail  
2836631  149.210.107.205  vasstdc27m7nks3  
3125892      26.23.62.59  ac9xlqtfg0cx5c5  
5153748     145.7.194.40  0em0vg1f0zuxonw
# print bids[bids['bid_id']==1351177]
nan_bidder = nan_rows['bidder_id'].values.tolist()
# print nan_bidder
print bids[bids['bidder_id'].isin(nan_bidder)]
bid_id                              bidder_id auction  \
1351177  1351177  f3ab8c9ecc0d021ebc81e89f20c8267bn812w   jefix   
2754184  2754184  88ef9cfdbec4c9e33f6c2e0b512e7a01dp2p2   cc5fs   
2836631  2836631  29b8af2fea3881ef61911612372dac41vczqv   jqx39   
3125892  3125892  df20f216cbb0b0df5a7b2e94b16a7853iyw9g   jqx39   
5153748  5153748  5e05ec450e2dd64d7996a08bbbca4f126nzzk   jqx39   

              merchandise    device              time country  \
1351177  office equipment   phone84  9767200789473684     NaN   
2754184            mobile  phone150  9633363947368421     NaN   
2836631           jewelry   phone72  9634034894736842     NaN   
3125892   books and music  phone106  9635755105263157     NaN   
5153748            mobile  phone267  9645270210526315     NaN   

                      ip              url  
1351177   80.211.119.111  g9pgdfci3yseml5  
2754184     20.67.240.88  ctivbfq55rktail  
2836631  149.210.107.205  vasstdc27m7nks3  
3125892      26.23.62.59  ac9xlqtfg0cx5c5  
5153748     145.7.194.40  0em0vg1f0zuxonw

After the per-bidder, time-ordered forward and backward filling, 5 bids still lack country information. It turns out that each of them belongs to a bidder with only a single bid. Let's see what else characterizes these 5 bidders.

lst_nan_train = [i for i in nan_bidder if i in (train['bidder_id'].unique()).tolist()]
lst_nan_test = [i for i in nan_bidder if i in (test['bidder_id'].unique()).tolist()]
print 'No. of bidders 1 bid in train set: ',len(lst_nan_train)
print 'No. of bidders 1 bid in test set: ',len(lst_nan_test)
No. of bidders 1 bid in train set:  1
No. of bidders 1 bid in test set:  4
print train[train['bidder_id']==lst_nan_train[0]]['outcome']
546    0.0
Name: outcome, dtype: float64

Since each of these 5 bidders has only a single bid, with 1 coming from the train set (labeled human) and 4 from the test set, and their behavior is too sparse to be informative, we drop their bid records. The 4 test-set bidders will later be handled like the bidders with no bids: their predictions are filled with the final model's mean predicted value.

bid_to_drop = nan_rows.index.values.tolist()
# print bid_to_drop
bids.drop(bids.index[bid_to_drop], inplace=True)
print 'Is there any missing value in bids?',bids.isnull().any().any()
pickle.dump(bids, open('bids.pkl', 'w'))
Is there any missing value in bids? False

Basic Count Features

Based on the earlier exploration, most of the dataset consists of categorical or discrete data, so we first group the bid records by bidder and count the distinct values of each attribute, e.g. the number of devices used, countries bid from, distinct IPs, and so on.

# group by bidder to do some statistics
bidders = bids.groupby('bidder_id')
# pickle.dump(bids, open('bidders.pkl', 'w'))
# print bidders['device'].count()
def feature_count(group):
    dct_cnt = {}
    dct_cnt['devices_c'] = group['device'].unique().shape[0]
    dct_cnt['countries_c'] = group['country'].unique().shape[0]
    dct_cnt['ip_c'] = group['ip'].unique().shape[0]
    dct_cnt['url_c'] = group['url'].unique().shape[0]    
    dct_cnt['auction_c'] = group['auction'].unique().shape[0]
    dct_cnt['auc_mean'] = np.mean(group['auction'].value_counts())    # bids_c/auction_c
#     dct_cnt['dev_mean'] = np.mean(group['device'].value_counts())    # bids_c/devices_c
    dct_cnt['merch_c'] = group['merchandise'].unique().shape[0]
    dct_cnt['bids_c'] = group.shape[0]
    dct_cnt = pd.Series(dct_cnt)
    return dct_cnt
cnt_bidder = bidders.apply(feature_count)
display(cnt_bidder.describe())
# cnt_bidder.to_csv('cnt_bidder.csv')
# print cnt_bidder[cnt_bidder['merch_c']==2]
auc_mean auction_c bids_c countries_c devices_c ip_c merch_c url_c
count 6609.000000 6609.000000 6609.000000 6609.000000 6609.000000 6609.000000 6609.000000 6609.000000
mean 6.593493 57.850810 1158.470117 12.733848 73.492359 544.507187 1.000151 290.964140
std 30.009242 131.814053 9596.595169 22.556570 172.171106 3370.730666 0.012301 2225.912425
min 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
25% 1.000000 2.000000 3.000000 1.000000 2.000000 2.000000 1.000000 1.000000
50% 1.677419 10.000000 18.000000 3.000000 8.000000 12.000000 1.000000 5.000000
75% 4.142857 47.000000 187.000000 12.000000 57.000000 111.000000 1.000000 36.000000
max 1327.366667 1726.000000 515033.000000 178.000000 2618.000000 111918.000000 2.000000 81376.000000

Feature Correlations

After grouping the bid records by bidder, we build a scatter matrix over the count features to inspect their pairwise correlations.

# Build a scatter matrix for every pair of features in the data
pd.scatter_matrix(cnt_bidder, alpha = 0.3, figsize = (16,10), diagonal = 'kde');

(Figure: scatter matrix of the count features)

From the per-bidder statistics above (timestamps not yet considered), we can draw a few basic conclusions:

  • Comparing each feature's maximum with its median and 75th percentile shows that, apart from the merchandise count, every feature contains some outlying values, which may be worth examining as anomalous behavior.

  • The features are heavily skewed, so we apply a log transform and plot the scatter matrix again to re-examine the correlations.

  • The merchandise count has very little variance, and the median and even the 75th percentile show that most bidders bid within a single merchandise category. Since the exploration phase already suggested merchandise is best treated as categorical, we will later count the individual categories separately and drop this column from the count features.

cnt_bidder.drop('merch_c', axis=1, inplace=True)
cnt_bidder = np.log(cnt_bidder)
pd.scatter_matrix(cnt_bidder, alpha = 0.3, figsize = (16,10), diagonal = 'kde');

(Figure: scatter matrix of the log-transformed count features)

The scatter matrix above shows no strong correlations among the behavioral features. The IP count does exhibit a slight positive correlation with the bid count and the device count after the log transform, but since this only appears after the transform and remains weak, all three features are kept.
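As a quick numeric cross-check of that visual impression (not part of the original notebook), the pairwise correlations of the log-transformed counts can be printed directly:

# Pearson correlations of the log-transformed count features
corr = cnt_bidder.corr()
display(corr.round(2))
# e.g. corr.loc['ip_c', 'bids_c'] is the ip-count vs. bid-count correlation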

To follow up on the anomalous behavior mentioned above, we pick a few robot and human samples from the original train set and trace their per-bidder statistics for comparison.

cnt_bidder.to_csv('cnt_bidder.csv')
# trace samples, first 2 bots, last 2 humans
indices = ['9434778d2268f1fa2a8ede48c0cd05c097zey','aabc211b4cf4d29e4ac7e7e361371622pockb',
           'd878560888b11447e73324a6e263fbd5iydo1','91a3c57b13234af24875c56fb7e2b2f4rb56a']

# build a DataFrame for the choosed indices
samples = pd.DataFrame(cnt_bidder.loc[indices], columns = cnt_bidder.keys()).reset_index(drop = True)
print "Chosen samples of training dataset:(first 2 bots, last 2 humen)"
display(samples)
Chosen samples of training dataset: (first 2 bots, last 2 humans)
auc_mean auction_c bids_c countries_c devices_c ip_c url_c
0 3.190981 5.594711 8.785692 4.174387 6.011267 8.147578 7.557995
1 2.780432 4.844187 7.624619 2.639057 3.178054 5.880533 1.609438
2 0.287682 1.098612 1.386294 1.098612 1.386294 1.386294 0.000000
3 0.287682 2.890372 3.178054 1.791759 2.639057 2.995732 0.000000

We use seaborn to visualize a heatmap of the percentile ranks of these four samples.

import matplotlib.pyplot as plt
import seaborn as sns

# look at percentile ranks
pcts = 100. * cnt_bidder.rank(axis=0, pct=True).loc[indices].round(decimals=3)
print pcts

# visualize percentiles with heatmap
sns.heatmap(pcts, yticklabels=['robot 1', 'robot 2', 'human 1', 'human 2'], annot=True, linewidth=.1, vmax=99, 
            fmt='.1f', cmap='YlGnBu')
plt.title('Percentile ranks of\nsamples\' feature statistics')
plt.xticks(rotation=45, ha='center');
auc_mean  auction_c  bids_c  \
bidder_id                                                            
9434778d2268f1fa2a8ede48c0cd05c097zey      94.9       94.6    97.0   
aabc211b4cf4d29e4ac7e7e361371622pockb      92.4       87.2    92.3   
d878560888b11447e73324a6e263fbd5iydo1      39.8       30.4    30.2   
91a3c57b13234af24875c56fb7e2b2f4rb56a      39.8       60.2    53.0   

                                       countries_c  devices_c  ip_c  url_c  
bidder_id                                                                   
9434778d2268f1fa2a8ede48c0cd05c097zey         95.4       95.6  96.7   97.4  
aabc211b4cf4d29e4ac7e7e361371622pockb         77.3       63.8  84.8   50.3  
d878560888b11447e73324a6e263fbd5iydo1         48.8       38.7  34.2   13.4  
91a3c57b13234af24875c56fb7e2b2f4rb56a         63.7       56.8  56.2   13.4

(Figure: heatmap of percentile ranks for the four sample bidders)

The heatmap comparison shows that, apart from the merchandise-related statistics, every metric is higher for the robots than for the human users, so we can design a baseline model based on rules over these basic statistics. The most pronounced difference is in auc_mean, a bidder's average number of bids per auction. Let's start by applying outlier detection to flag the anomalies in the basic statistics above.

Designing a Naive Classifier

Since the ultimate goal is to find the robots among the bidders, and common sense says a robot's bidding activity should be far more frequent than a human's, we can design a naive classifier from an outlier-detection perspective: using the per-bidder count features computed earlier, build a list of suspected anomalous bidders for each feature, then merge the per-feature lists and treat bidders who are anomalous on multiple features as robots.

# find the outliers for each feature
lst_outlier = []
for feature in cnt_bidder.keys():
    # percentile  25th
    Q1 = np.percentile(cnt_bidder[feature], 25)
    # percentile  75th
    Q3 = np.percentile(cnt_bidder[feature], 75)
    step = 1.5 * (Q3 - Q1)    
    # show outliers
    # print "Data points considered outliers for the feature '{}':".format(feature)
    display(cnt_bidder[~((cnt_bidder[feature] >= Q1 - step) & (cnt_bidder[feature] <= Q3 + step))])
    lst_outlier += cnt_bidder[~((cnt_bidder[feature] >= Q1 - step) & (cnt_bidder[feature] <= Q3 + step))].index.values.tolist()

After collecting the ids of all bidders that look like outliers on any feature, we tally how many features each bidder is anomalous on so that we can flag bidders anomalous on several of them. After some experimentation, and given that bots make up less than 5% of the original train set, we settle on flagging bidders who are anomalous on at least 1 feature. Intersecting this set with the train-set bidders yields a user subset that serves as the naive classifier's prediction.

# print len(lst_outlier)
from collections import Counter
freq_outlier = dict(Counter(lst_outlier))
perhaps_outlier = [i for i in freq_outlier if freq_outlier[i] >= 1]
print len(perhaps_outlier)
214
# basic_pred = test[test['bidder_id'].isin(perhaps_outlier)]['bidder_id'].tolist()
train_pred = train[train['bidder_id'].isin(perhaps_outlier)]['bidder_id'].tolist()
print len(train_pred)
76

Choosing an Evaluation Metric

From the earlier exploration we know the two classes are imbalanced at roughly 19:1 (humans to robots), so we adopt AUC, a metric unaffected by the class ratio, as the evaluation standard. We now score the naive classifier on the original train set to obtain a baseline.

from sklearn.metrics import roc_auc_score
y_true = train['outcome']
naive_pred = pd.DataFrame(columns=['bidder_id', 'prediction'])
naive_pred['bidder_id'] = train['bidder_id']
naive_pred['prediction'] = np.where(naive_pred['bidder_id'].isin(train_pred), 1.0, 0.0)
basic_pred = naive_pred['prediction']
print roc_auc_score(y_true, basic_pred)
0.54661464952

With the basic count features in place, the non-categorical timestamp feature has not been handled yet. In addition, as noted during exploration, the two categorical attributes merchandise and country can each be expanded from a single column into multiple per-category features, and the per-bidder grouping can be refined further into per-bidder, per-auction statistics:

  • Process the timestamps

  • Expand merchandise and country into separate per-category counts (see the sketch after this list)

  • Compute further statistics grouped by bidder and auction
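A hedged sketch of the second item, assuming the bids DataFrame from this section (the same pattern applies to country):

# per-bidder bid counts for each merchandise category
merch_counts = pd.crosstab(bids['bidder_id'], bids['merchandise'])
merch_counts.columns = ['merch_' + str(c).replace(' ', '_') for c in merch_counts.columns]
# these columns can then be joined onto the per-bidder feature table by bidder_id, e.g.
# cnt_bidder = cnt_bidder.join(merch_counts, how='left')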

Processing the Timestamps

The main idea is to analyze the time gaps between bids, i.e. the intervals between consecutive bids within the same auction.

Then, for each bidder, we compute the following statistics over the intervals between their bids and the immediately preceding bids:

  • the mean interval

  • the maximum interval

  • the minimum interval

from collections import defaultdict

def generate_timediff():
    bids_grouped = bids.groupby('auction')
    bds = defaultdict(list)

    for bids_auc in bids_grouped:
        # reset for every auction so that intervals never span two different auctions
        last_row = None
        for i, row in bids_auc[1].iterrows():
            if last_row is None:
                last_row = row
                continue

            # interval between the current bid and the previous bid in this auction,
            # attributed to the bidder who placed the current bid
            time_difference = row['time'] - last_row['time']
            bds[row['bidder_id']].append(time_difference)
            last_row = row

    # aggregate each bidder's intervals into mean / min / max and save to csv
    df = []
    for key in bds.keys():
        df.append({'bidder_id': key, 'mean': np.mean(bds[key]),
                   'min': np.min(bds[key]), 'max': np.max(bds[key])})

    pd.DataFrame(df).to_csv('tdiff.csv', index=False)
generate_timediff()
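For reference, an equivalent vectorized sketch (an alternative, not the notebook's code): sort the bids by time within each auction, diff the time column per auction, and aggregate per bidder.

def generate_timediff_vectorized():
    # hypothetical alternative producing the same per-bidder interval statistics
    sorted_bids = bids.sort_values(['auction', 'time']).copy()
    sorted_bids['tdiff'] = sorted_bids.groupby('auction')['time'].diff()
    stats = (sorted_bids.dropna(subset=['tdiff'])
             .groupby('bidder_id')['tdiff']
             .agg(['mean', 'min', 'max'])
             .reset_index())
    stats.to_csv('tdiff_vectorized.csv', index=False)

# generate_timediff_vectorized()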

Because of the length limit, the remaining content continues in part two: Using Machine Learning to Identify Cheating Robot Bidders in Online Auctions (2).
