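The dataset contains credit card transactions made by European cardholders in September 2013. It covers transactions that occurred over two days, in which 492 of 284,807 transactions were fraudulent. The dataset is highly imbalanced: the positive class (fraud) accounts for 0.172% of all transactions. For confidentiality reasons, the original features and further background information about the data cannot be provided. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. The feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Class' is the response variable: it takes the value 1 in case of fraud and 0 otherwise.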
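To make the imbalance concrete, here is a small sketch that uses only the two counts quoted above. The "always predict normal" baseline is a hypothetical classifier introduced purely for illustration; it is not part of the dataset description.

```python
# Counts quoted in the dataset description above.
total_transactions = 284807
fraud_transactions = 492

fraud_share = fraud_transactions / total_transactions
print(f"Fraud share: {fraud_share:.4%}")

# Hypothetical baseline: predict "normal" (0) for every transaction.
true_positives = 0
false_negatives = fraud_transactions
true_negatives = total_transactions - fraud_transactions

baseline_accuracy = (true_positives + true_negatives) / total_transactions
baseline_recall = true_positives / (true_positives + false_negatives)

print(f"Baseline accuracy: {baseline_accuracy:.4%}")
print(f"Baseline recall:   {baseline_recall:.4%}")
```

A model that never flags fraud still scores above 99.8% accuracy while catching nothing, which is why accuracy alone says little on data this skewed.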
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

data = pd.read_csv("creditcard.csv")
data.head()
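In the output above, the Class column is the label: 0 marks a normal transaction and 1 marks a fraudulent one.
The bar chart drawn below gives a direct view of how different the counts of normal and fraudulent records are.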
# Series.value_counts replaces the deprecated top-level pd.value_counts
count_classes = data['Class'].value_counts(sort=True).sort_index()
count_classes.plot(kind='bar')  # pandas can draw simple plots directly
plt.title("Fraud class histogram")
plt.xlabel("Class")
plt.ylabel("Frequency")
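# Preprocessing: standardize the data
from sklearn.preprocessing import StandardScaler

# reshape(-1, 1): -1 lets NumPy infer the number of rows.
# StandardScaler expects a 2-D array, so .values is needed before reshape.
# Add the standardized amount as a new feature column
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(['Time', 'Amount'], axis=1)
data.head()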
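As a rough illustration of what StandardScaler does to the Amount column, the sketch below standardizes a short list by hand. The sample amounts are made up for this example. StandardScaler divides by the population standard deviation (n in the denominator), which is why `pstdev` is used rather than `stdev`.

```python
from statistics import mean, pstdev

# Made-up transaction amounts, for illustration only.
amounts = [2.5, 10.0, 149.62, 378.66, 69.99]

mu = mean(amounts)
sigma = pstdev(amounts)  # population standard deviation

# z = (x - mean) / std: the result has mean 0 and standard deviation 1.
norm_amounts = [(x - mu) / sigma for x in amounts]

print([round(z, 3) for z in norm_amounts])
```

After this transformation the amount is on a scale comparable to the PCA components, so no single feature dominates the regularized model just because of its units.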
# loc indexes by label, iloc indexes by position.
# ix accepted both, but it has been removed from pandas:
# X = data.ix[:, data.columns != 'Class']
# y = data.ix[:, data.columns == 'Class']
X = data.iloc[:, data.columns != 'Class']  # feature data
y = data.iloc[:, data.columns == 'Class']  # label

# Number of data points in the minority (fraud) class
number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)

# Picking the indices of the normal class
normal_indices = data[data.Class == 0].index

# Out of the indices we picked, randomly select "x" number (number_records_fraud).
# replace=False samples without replacement, so no index is drawn twice.
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace=False)
random_normal_indices = np.array(random_normal_indices)

# Appending the 2 sets of indices
under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])

# Undersampled dataset. These are index labels, so select rows with loc
# (equivalent to iloc here only because the index is the default 0..n-1).
under_sample_data = data.loc[under_sample_indices, :]

X_undersample = under_sample_data.iloc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.iloc[:, under_sample_data.columns == 'Class']

# Showing the class ratio
print("Percentage of normal transactions:",
      len(under_sample_data[under_sample_data.Class == 0]) / len(under_sample_data))
print("Percentage of fraud transactions:",
      len(under_sample_data[under_sample_data.Class == 1]) / len(under_sample_data))
print("Total number of transactions in resampled data:", len(under_sample_data))
Percentage of normal transactions: 0.5
Percentage of fraud transactions: 0.5
Total number of transactions in resampled data: 984
# With newer versions of sklearn, the old import fails:
#   from sklearn.cross_validation import train_test_split
#   ModuleNotFoundError: No module named 'sklearn.cross_validation'
# That module has been removed; import from model_selection instead.
from sklearn.model_selection import train_test_split

# Whole dataset. test_size is the fraction of the data held out for the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

print("Number transactions train dataset:", len(X_train))
print("Number transactions test dataset:", len(X_test))
print("Total number of transactions:", len(X_train) + len(X_test))

# Undersampled dataset
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(
    X_undersample, y_undersample, test_size=0.3, random_state=0)

print("")
print("Number transactions train dataset:", len(X_train_undersample))
print("Number transactions test dataset:", len(X_test_undersample))
print("Total number of transactions:", len(X_train_undersample) + len(X_test_undersample))
Number transactions train dataset: 199364
Number transactions test dataset: 85443
Total number of transactions: 284807
Number transactions train dataset: 688
Number transactions test dataset: 296
Total number of transactions: 984
# Recall = TP / (TP + FN)
from sklearn.linear_model import LogisticRegression  # logistic regression model
# from sklearn.cross_validation import KFold, cross_val_score  # no longer supported in newer versions
from sklearn.model_selection import KFold, cross_val_score  # KFold splits the data into k parts for cross-validation
from sklearn.metrics import confusion_matrix, recall_score, classification_report
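The recall formula in the comment above can be checked by hand on a tiny example. The labels and predictions below are made up, and `recall` is a hypothetical helper written for this sketch (the post itself uses sklearn's `recall_score`).

```python
def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN); returns 0.0 when there are no positives."""
    positives = tp + fn
    return tp / positives if positives else 0.0


# Made-up labels: 1 = fraud, 0 = normal.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]

# True positives: frauds that were flagged. False negatives: frauds that were missed.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

print("TP:", tp, "FN:", fn, "recall:", recall(tp, fn))
```

Note that the false positive at position 6 does not affect recall at all; recall only measures how many of the actual frauds were caught.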
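The main question is then how strong the regularization penalty should be. Values such as 0.1, 1, or 10 are all plausible candidates, but which one works best is not known in advance, so cross-validation is used to evaluate which setting gives the better result. Here c_param_range = [0.01, 0.1, 1, 10, 100] plays the role of the λ parameter mentioned earlier, with one caveat: in scikit-learn, C is the inverse of the regularization strength, so a smaller C means a stronger penalty (C = 0.01 penalizes the most, C = 100 the least). Each of the five values is tried in turn.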
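The relationship between C and the penalty weight λ can be made explicit. The objective below is one common way to write L1-penalized logistic regression, following the form used in the scikit-learn user guide; the symbols (labels y_i in {-1, 1}, weights w, intercept b, n training samples) are notation assumed for this sketch, and the exact scaling differs between versions of the documentation.

```latex
\min_{w,\,b}\; \lVert w \rVert_1 \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i\,(x_i^{\top} w + b)\right)\right)
```

Dividing the whole expression by C leaves the minimizer unchanged and gives the loss plus (1/C) times the L1 norm, so C corresponds to 1/λ: a small C puts more weight on the penalty, a large C puts more weight on fitting the training data.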
def printing_Kfold_scores(x_train_data, y_train_data):
    fold = KFold(5, shuffle=False)

    # Different C parameters
    c_param_range = [0.01, 0.1, 1, 10, 100]

    # One row per candidate C (the original range(len(c_param_range), 2) was empty)
    result_table = pd.DataFrame(index=range(len(c_param_range)),
                                columns=['C_parameter', 'Mean recall score'])
    result_table['C_parameter'] = c_param_range

    # fold.split yields (train_indices, test_indices) pairs:
    # train_indices = indices[0], test_indices = indices[1]
    j = 0
    # Loop over the candidates to find the best regularization setting
    for c_param in c_param_range:
        print('-------------------------------------------')
        print('C parameter:', c_param)
        print('-------------------------------------------')
        print('')

        recall_accs = []
        for iteration, indices in enumerate(fold.split(x_train_data)):
            # Call the logistic regression model with a certain C parameter.
            # solver='liblinear' is set explicitly; it supports the L1 penalty.
            # If you see
            #   ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
            # raise max_iter (the default is 100).
            lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear', max_iter=10000)

            # Use the training data to fit the model. In this case, we use the portion
            # of the fold to train the model with indices[0]. We then predict on the
            # portion assigned as the 'test cross validation' with indices[1].
            lr.fit(x_train_data.iloc[indices[0], :],
                   y_train_data.iloc[indices[0], :].values.ravel())

            # Predict values using the test indices in the training data.
            # A DataFrame is passed here too, so feature names match those seen in fit.
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1], :])

            # Calculate the recall score and append it to a list for recall scores
            # representing the current c_parameter
            recall_acc = recall_score(y_train_data.iloc[indices[1], :].values.ravel(),
                                      y_pred_undersample)
            recall_accs.append(recall_acc)
            print('Iteration ', iteration, ': recall score = ', recall_acc)

        # The mean value of those recall scores is the metric we want to save.
        result_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')

    # Note: this line raises an error without astype('float64'), which the
    # original source code did not have (the column has object dtype).
    best_c = result_table.loc[result_table['Mean recall score'].astype('float64').idxmax()]['C_parameter']

    # Finally, we can check which C parameter is the best amongst the chosen.
    print('*********************************************************************************')
    print('Best model to choose from cross validation is with C parameter', best_c)
    print('*********************************************************************************')

    return best_c
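The function above relies on KFold(5, shuffle=False) to produce the train and test index pairs. The sketch below mimics that splitting in plain Python to show what the folds look like. `contiguous_folds` is a hypothetical helper written for this illustration, not sklearn's implementation; it follows the documented rule that the first n_samples % n_splits folds get one extra sample.

```python
def contiguous_folds(n_samples: int, n_splits: int):
    """Yield (train_indices, test_indices) for unshuffled k-fold splitting."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for k in range(n_splits):
        # The first `extra` folds are one sample larger than the rest.
        size = base + (1 if k < extra else 0)
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


# 688 is the size of the undersampled training set reported earlier.
for k, (train_idx, test_idx) in enumerate(contiguous_folds(688, 5)):
    print(f"fold {k}: train={len(train_idx)} test={len(test_idx)} "
          f"test rows {test_idx[0]}..{test_idx[-1]}")
```

Each fold's test block is a contiguous slice of positions. Because train_test_split shuffles by default, the rows of the undersampled training set are already mixed, so each slice contains both classes.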
best_c = printing_Kfold_scores(X_train_undersample,y_train_undersample)
C parameter: 0.01
Iteration 0 : recall score = 0.958904109589041
Iteration 1 : recall score = 0.9178082191780822
Iteration 2 : recall score = 1.0
Iteration 3 : recall score = 0.9864864864864865
Iteration 4 : recall score = 0.9545454545454546
Mean recall score 0.9635488539598128
C parameter: 0.1
Iteration 0 : recall score = 0.8356164383561644
Iteration 1 : recall score = 0.863013698630137
Iteration 2 : recall score = 0.9322033898305084
Iteration 3 : recall score = 0.9459459459459459
Iteration 4 : recall score = 0.8939393939393939
Mean recall score 0.8941437733404299
C parameter: 1
Iteration 0 : recall score = 0.8493150684931506
Iteration 1 : recall score = 0.863013698630137
Iteration 2 : recall score = 0.9830508474576272
Iteration 3 : recall score = 0.9459459459459459
Iteration 4 : recall score = 0.9090909090909091
Mean recall score 0.9100832939235539
C parameter: 10
Iteration 0 : recall score = 0.863013698630137
Iteration 1 : recall score = 0.863013698630137
Iteration 2 : recall score = 0.9830508474576272
Iteration 3 : recall score = 0.9324324324324325
Iteration 4 : recall score = 0.9242424242424242
Mean recall score 0.9131506202785514
C parameter: 100
Iteration 0 : recall score = 0.863013698630137
Iteration 1 : recall score = 0.863013698630137
Iteration 2 : recall score = 0.9830508474576272
Iteration 3 : recall score = 0.9459459459459459
Iteration 4 : recall score = 0.9242424242424242
Mean recall score 0.9158533229812542
Best model to choose from cross validation is with C parameter 0.01
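The selection step can be reproduced from the printed numbers alone. The sketch below copies the per-fold recall scores from the output above, averages them per C, and picks the C with the highest mean. The variable names are chosen for this example and do not come from the original code.

```python
from statistics import mean

# Per-fold recall scores copied from the output above.
fold_recalls = {
    0.01: [0.958904109589041, 0.9178082191780822, 1.0,
           0.9864864864864865, 0.9545454545454546],
    0.1: [0.8356164383561644, 0.863013698630137, 0.9322033898305084,
          0.9459459459459459, 0.8939393939393939],
    1: [0.8493150684931506, 0.863013698630137, 0.9830508474576272,
        0.9459459459459459, 0.9090909090909091],
    10: [0.863013698630137, 0.863013698630137, 0.9830508474576272,
         0.9324324324324325, 0.9242424242424242],
    100: [0.863013698630137, 0.863013698630137, 0.9830508474576272,
          0.9459459459459459, 0.9242424242424242],
}

# Mean recall per candidate C, then the C whose mean is largest.
mean_recalls = {c: mean(scores) for c, scores in fold_recalls.items()}
best_c_from_output = max(mean_recalls, key=mean_recalls.get)

for c, m in mean_recalls.items():
    print(f"C={c}: mean recall={m:.4f}")
print("Best C:", best_c_from_output)
```

The choice is made on recall over the undersampled training folds only, so it says nothing yet about precision or about behavior on the full, imbalanced dataset.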