This article applies machine learning to historical credit card transaction data to build a fraud-detection model that predicts whether a customer's card has been compromised.
Predicting credit card fraud matters a great deal for protecting customers and limiting bank losses. The dataset comes from Kaggle and contains transactions made with credit cards by European cardholders in September 2013. It covers two days of transactions, of which 492 out of 284,807 are fraudulent. The dataset is highly imbalanced: the positive class (fraud) accounts for only 0.172% of all transactions. Deciding whether a card has been compromised is a binary classification problem, which could be solved with logistic regression, SVM, or random forests, or trained with a boosting ensemble such as XGBoost; this article uses logistic regression.
import numpy as np
import pandas as pd
import sklearn as skl
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import gridspec
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import confusion_matrix, precision_recall_curve, auc, roc_auc_score, roc_curve, recall_score, classification_report
from sklearn.linear_model import LogisticRegression

data = pd.read_csv('../input/creditcard.csv')
data.info()
data.head()
The data has 31 columns and 284,807 rows. V1-V28 are anonymized numerical features, and the remaining integer column is the class label Class; the Amount and Time columns are on a very different scale and range from the other features.
print('No Frauds', round(data['Class'].value_counts()[0]/len(data) * 100, 2), '% of the dataset')
print('Frauds', round(data['Class'].value_counts()[1]/len(data) * 100, 2), '% of the dataset')

colors = ["#0101DF", "#DF0101"]
sns.countplot('Class', data=data, palette=colors)
plt.title('Class Distributions \n (0: No Fraud || 1: Fraud)', fontsize=14)
No Frauds 99.83 % of the dataset
Frauds 0.17 % of the dataset
The data is clearly very imbalanced. This imbalance means a model can look very accurate when predicting '0' while being unreliable when predicting '1'.
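To see why accuracy alone is misleading here, consider a baseline that always predicts the majority class. The sketch below (using the data frame loaded above; the X_demo/y_demo names and the split are only illustrative) reaches roughly 99.8% accuracy while catching no fraud at all.

from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score

# Illustrative baseline: always predict the majority class "0".
X_demo = data.drop('Class', axis=1)
y_demo = data['Class']
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

dummy = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
pred = dummy.predict(X_te)
print('Accuracy:', accuracy_score(y_te, pred))   # ~0.998, looks excellent
print('Recall  :', recall_score(y_te, pred))     # 0.0, not a single fraud detected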
As the plot shows, the Fraud and No Fraud classes are unbalanced. Building a model directly on this data causes the following problems:
1) Overfitting: because the sample is dominated by the majority class (No Fraud), the model may over-learn the majority-class characteristics and misclassify as a result.
2) Spurious feature associations: with such an imbalanced sample, it is easy to associate features with the wrong class.
Possible remedies are as follows:
1) Random undersampling
Select from the majority class (No Fraud) a number of samples equal to the Fraud count, so that the new balanced sample contains 50% Fraud and 50% No Fraud.
2) Oversampling
Use the SMOTE algorithm to create synthetic points from the minority class (Fraud) until the minority and majority classes are balanced. Unlike random undersampling, no rows are deleted; and precisely because nothing is removed, training takes more time.
Note that although we transform the data when applying random undersampling or oversampling, the model must still be evaluated on the original test set, not on the undersampled or oversampled data. Resampling only helps us fit a suitable model; the model ultimately serves the original dataset. Below we preprocess and build models with undersampling and oversampling in turn, then run predictions against the original data.
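A minimal sketch of this workflow, assuming a feature matrix X and label vector y like the ones constructed further below, and the imbalanced-learn package for SMOTE: split first, resample only the training part, and keep the test split untouched.

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Split first, so the test set keeps the original class distribution.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Resample only the training data; the test data is never resampled.
X_train_res, y_train_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

# model.fit(X_train_res, y_train_res)   # train on the resampled data
# evaluate on (X_test, y_test)          # always score on the original test split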
We also notice that the Amount column takes much larger values than the other features, which violates the assumption that features are on comparable scales; it therefore needs to be standardized, otherwise the Pearson correlation analysis below would be distorted.
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(12, 4))
bins = 30

ax1.hist(data[data.Class == 1]['Amount'], bins=bins)
ax1.set_title('Fraud')

ax2.hist(data[data.Class == 0]['Amount'], bins=bins)
ax2.set_title('Normal')

plt.xlabel('Amount ($)')
plt.ylabel('Number of Transactions')
plt.yscale('log')
plt.show()
Fraudulent transactions tend to involve smaller amounts than those of normal card users, which suggests that fraudsters prefer small purchases so as not to attract the cardholder's attention.
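A quick numerical check of this impression, using the data frame loaded above:

# Summary statistics of the transaction amount for each class (0: normal, 1: fraud).
print(data.groupby('Class')['Amount'].describe())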
fig, ax = plt.subplots(1, 2, figsize=(18, 4))

amount_val = data['Amount'].values
time_val = data['Time'].values

sns.distplot(amount_val, ax=ax[0], color='r')
ax[0].set_title('Distribution of Transaction Amount', fontsize=14)
ax[0].set_xlim([min(amount_val), max(amount_val)])

sns.distplot(time_val, ax=ax[1], color='b')
ax[1].set_title('Distribution of Transaction Time', fontsize=14)
ax[1].set_xlim([min(time_val), max(time_val)])

plt.show()
plt.figure(figsize=(12, 28*4))
v_features = data.iloc[:, 1:29].columns   # V1-V28
gs = gridspec.GridSpec(28, 1)
for i, cn in enumerate(data[v_features]):
    ax = plt.subplot(gs[i])
    sns.distplot(data[data.Class == 1][cn], bins=50)
    sns.distplot(data[data.Class == 0][cn], bins=50)
    ax.set_xlabel('')
    ax.set_title('histogram of feature: ' + str(cn))
plt.show()
Standardize Amount so that it is on a scale comparable to the other features.
from sklearn.preprocessing import StandardScaler

data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(['Time', 'Amount'], axis=1)
The data also needs to be resampled, either by random undersampling or by oversampling. Here we start with undersampling, which gives us a sample with balanced positive and negative classes.
X = data.loc[:, data.columns != 'Class']
y = data.loc[:, data.columns == 'Class']

# Build a balanced subsample: all fraud rows plus an equal number of random normal rows.
number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)
normal_indices = data[data.Class == 0].index
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace=False)
random_normal_indices = np.array(random_normal_indices)

under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])
under_sample_data = data.iloc[under_sample_indices, :]
X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(
    X_undersample, y_undersample, test_size=0.3, random_state=0)
f, (ax1, ax2) = plt.subplots(2, 1, figsize=(24, 20))

# Entire DataFrame
corr = data.corr()
sns.heatmap(corr, cmap='coolwarm_r', annot_kws={'size':20}, ax=ax1)
ax1.set_title("Imbalanced Correlation Matrix \n (don't use for reference)", fontsize=14)

sub_sample_corr = under_sample_data.corr()
sns.heatmap(sub_sample_corr, cmap='coolwarm_r', annot_kws={'size':20}, ax=ax2)
ax2.set_title('SubSample Correlation Matrix \n (use for reference)', fontsize=14)

plt.show()
As the comparison shows, if the imbalanced data is not handled, the correlation analysis can lead to wrong conclusions.
The heatmap shows that V14, V12 and V10 are negatively correlated with the class label: the lower these values, the more likely the transaction is fraudulent. V4, V11 and V19 are positively correlated: the higher these values, the more likely the transaction is fraudulent.
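These statements can be double-checked numerically on the balanced subsample built above (the exact ordering may shift slightly with the random draw):

# Correlation of each feature with the Class label on the balanced subsample.
class_corr = under_sample_data.corr()['Class'].drop('Class').sort_values()
print(class_corr.head(5))   # strongest negative correlations (e.g. V14, V12, V10)
print(class_corr.tail(5))   # strongest positive correlations (e.g. V4, V11, V19)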
Draw box plots for the positively and negatively correlated features listed above.
f, axes = plt.subplots(ncols=4, figsize=(20, 4))

# Positive correlations (the higher the feature value, the more likely a fraud transaction)
sns.boxplot(x="Class", y="V11", data=under_sample_data, palette=colors, ax=axes[0])
axes[0].set_title('V11 vs Class Positive Correlation')

sns.boxplot(x="Class", y="V4", data=under_sample_data, palette=colors, ax=axes[1])
axes[1].set_title('V4 vs Class Positive Correlation')

sns.boxplot(x="Class", y="V2", data=under_sample_data, palette=colors, ax=axes[2])
axes[2].set_title('V2 vs Class Positive Correlation')

sns.boxplot(x="Class", y="V19", data=under_sample_data, palette=colors, ax=axes[3])
axes[3].set_title('V19 vs Class Positive Correlation')

plt.show()
Before handling outliers, we should visualize the features we are going to use. V14, V12 and V10 are the negatively correlated features identified above; comparing their distributions, V14 is the only one with a roughly Gaussian shape. We determine cut-off thresholds from the quartiles and remove the outliers that fall outside them.
from scipy.stats import norm

f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20, 6))

v14_fraud_dist = under_sample_data['V14'].loc[under_sample_data['Class'] == 1].values
sns.distplot(v14_fraud_dist, ax=ax1, fit=norm, color='#FB8861')
ax1.set_title('V14 Distribution \n (Fraud Transactions)', fontsize=14)

v12_fraud_dist = under_sample_data['V12'].loc[under_sample_data['Class'] == 1].values
sns.distplot(v12_fraud_dist, ax=ax2, fit=norm, color='#56F9BB')
ax2.set_title('V12 Distribution \n (Fraud Transactions)', fontsize=14)

v10_fraud_dist = under_sample_data['V10'].loc[under_sample_data['Class'] == 1].values
sns.distplot(v10_fraud_dist, ax=ax3, fit=norm, color='#C5B3F9')
ax3.set_title('V10 Distribution \n (Fraud Transactions)', fontsize=14)

plt.show()
Looking at the distribution plots of V14, V12 and V10, we pick V14, which is closest to a normal distribution.
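The quartile rule applied feature by feature in the next block can be summarized in a small helper. This is only an illustrative sketch (the helper is not part of the original code); the 1.5 multiplier is the conventional IQR cut-off.

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) cut-offs using the k * IQR rule."""
    q25, q75 = np.percentile(values, 25), np.percentile(values, 75)
    cut_off = (q75 - q25) * k
    return q25 - cut_off, q75 + cut_off

# Example: cut-offs for V14 among the fraud transactions of the balanced subsample.
v14_lower, v14_upper = iqr_bounds(under_sample_data.loc[under_sample_data['Class'] == 1, 'V14'])
print('V14 bounds:', v14_lower, v14_upper)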
# -----> V14: removing outliers (highest negative correlation with Class)
v14_fraud = under_sample_data['V14'].loc[under_sample_data['Class'] == 1].values
q25, q75 = np.percentile(v14_fraud, 25), np.percentile(v14_fraud, 75)
print('Quartile 25: {} | Quartile 75: {}'.format(q25, q75))
v14_iqr = q75 - q25
print('iqr: {}'.format(v14_iqr))

v14_cut_off = v14_iqr * 1.5
v14_lower, v14_upper = q25 - v14_cut_off, q75 + v14_cut_off
print('Cut Off: {}'.format(v14_cut_off))
print('V14 Lower: {}'.format(v14_lower))
print('V14 Upper: {}'.format(v14_upper))

outliers = [x for x in v14_fraud if x < v14_lower or x > v14_upper]
print('Feature V14 Outliers for Fraud Cases: {}'.format(len(outliers)))
print('V14 outliers: {}'.format(outliers))

under_sample_data = under_sample_data.drop(
    under_sample_data[(under_sample_data['V14'] > v14_upper) | (under_sample_data['V14'] < v14_lower)].index)
print('----' * 44)

# -----> V12: removing outliers from fraud transactions
v12_fraud = under_sample_data['V12'].loc[under_sample_data['Class'] == 1].values
q25, q75 = np.percentile(v12_fraud, 25), np.percentile(v12_fraud, 75)
v12_iqr = q75 - q25
v12_cut_off = v12_iqr * 1.5
v12_lower, v12_upper = q25 - v12_cut_off, q75 + v12_cut_off
print('V12 Lower: {}'.format(v12_lower))
print('V12 Upper: {}'.format(v12_upper))

outliers = [x for x in v12_fraud if x < v12_lower or x > v12_upper]
print('V12 outliers: {}'.format(outliers))
print('Feature V12 Outliers for Fraud Cases: {}'.format(len(outliers)))

under_sample_data = under_sample_data.drop(
    under_sample_data[(under_sample_data['V12'] > v12_upper) | (under_sample_data['V12'] < v12_lower)].index)
print('Number of Instances after outliers removal: {}'.format(len(under_sample_data)))
print('----' * 44)

# -----> V10: removing outliers from fraud transactions
v10_fraud = under_sample_data['V10'].loc[under_sample_data['Class'] == 1].values
q25, q75 = np.percentile(v10_fraud, 25), np.percentile(v10_fraud, 75)
v10_iqr = q75 - q25
v10_cut_off = v10_iqr * 1.5
v10_lower, v10_upper = q25 - v10_cut_off, q75 + v10_cut_off
print('V10 Lower: {}'.format(v10_lower))
print('V10 Upper: {}'.format(v10_upper))

outliers = [x for x in v10_fraud if x < v10_lower or x > v10_upper]
print('V10 outliers: {}'.format(outliers))
print('Feature V10 Outliers for Fraud Cases: {}'.format(len(outliers)))

under_sample_data = under_sample_data.drop(
    under_sample_data[(under_sample_data['V10'] > v10_upper) | (under_sample_data['V10'] < v10_lower)].index)
print('Number of Instances after outliers removal: {}'.format(len(under_sample_data)))
After removing the outliers, the results are shown below.
f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20, 6))

colors = ['#B3F9C5', '#f9c5b3']

sns.boxplot(x="Class", y="V14", data=under_sample_data, ax=ax1, palette=colors)
ax1.set_title("V14 Feature \n Reduction of outliers", fontsize=14)
ax1.annotate('Fewer extreme \n outliers', xy=(0.98, -17.5), xytext=(0, -12),
             arrowprops=dict(facecolor='black'), fontsize=14)

sns.boxplot(x="Class", y="V12", data=under_sample_data, ax=ax2, palette=colors)
ax2.set_title("V12 Feature \n Reduction of outliers", fontsize=14)
ax2.annotate('Fewer extreme \n outliers', xy=(0.98, -17.3), xytext=(0, -12),
             arrowprops=dict(facecolor='black'), fontsize=14)

sns.boxplot(x="Class", y="V10", data=under_sample_data, ax=ax3, palette=colors)
ax3.set_title("V10 Feature \n Reduction of outliers", fontsize=14)
ax3.annotate('Fewer extreme \n outliers', xy=(0.95, -16.5), xytext=(0, -12),
             arrowprops=dict(facecolor='black'), fontsize=14)

plt.show()
Dimensionality reduction removes the unnecessary directions in a high-dimensional dataset and keeps only the informative ones. Here PCA and truncated SVD are used, reducing the balanced sample to two components for visualization.
import time
from sklearn.decomposition import PCA, TruncatedSVD

# Features and labels from the (outlier-cleaned) undersampled data
X = under_sample_data.drop('Class', axis=1)
y = under_sample_data['Class']

# PCA implementation
t0 = time.time()
X_reduced_pca = PCA(n_components=2, random_state=42).fit_transform(X.values)
t1 = time.time()
print("PCA took {:.2} s".format(t1 - t0))

# Truncated SVD
t0 = time.time()
X_reduced_svd = TruncatedSVD(n_components=2, algorithm='randomized', random_state=42).fit_transform(X.values)
t1 = time.time()
print("Truncated SVD took {:.2} s".format(t1 - t0))
import matplotlib.patches as mpatches

f, (ax1, ax2) = plt.subplots(1, 2, figsize=(24, 6))
f.suptitle('Clusters using Dimensionality Reduction', fontsize=14)

blue_patch = mpatches.Patch(color='#0A0AFF', label='No Fraud')
red_patch = mpatches.Patch(color='#AF0000', label='Fraud')

# PCA scatter plot
ax1.scatter(X_reduced_pca[:, 0], X_reduced_pca[:, 1], c=(y == 0), cmap='coolwarm', label='No Fraud', linewidths=2)
ax1.scatter(X_reduced_pca[:, 0], X_reduced_pca[:, 1], c=(y == 1), cmap='coolwarm', label='Fraud', linewidths=2)
ax1.set_title('PCA', fontsize=14)
ax1.grid(True)
ax1.legend(handles=[blue_patch, red_patch])

# TruncatedSVD scatter plot
ax2.scatter(X_reduced_svd[:, 0], X_reduced_svd[:, 1], c=(y == 0), cmap='coolwarm', label='No Fraud', linewidths=2)
ax2.scatter(X_reduced_svd[:, 0], X_reduced_svd[:, 1], c=(y == 1), cmap='coolwarm', label='Fraud', linewidths=2)
ax2.set_title('Truncated SVD', fontsize=14)
ax2.grid(True)
ax2.legend(handles=[blue_patch, red_patch])

plt.show()
Use K-fold cross-validation to build and evaluate a model on the undersampled sample, with recall as the evaluation metric.
def printing_Kfold_scores(x_train_data, y_train_data):
    fold = KFold(5, shuffle=False)

    c_param_range = [0.01, 0.1, 1, 10, 100]
    results_table = pd.DataFrame(index=range(len(c_param_range)),
                                 columns=['C_parameter', 'Mean recall score'])
    results_table['C_parameter'] = c_param_range

    j = 0
    for c_param in c_param_range:
        recall_accs = []
        for iteration, indices in enumerate(fold.split(x_train_data), start=1):
            # liblinear is used because it supports the L1 penalty
            lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')
            lr.fit(x_train_data.iloc[indices[0], :], y_train_data.iloc[indices[0], :].values.ravel())
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1], :].values)
            recall_acc = recall_score(y_train_data.iloc[indices[1], :].values, y_pred_undersample)
            recall_accs.append(recall_acc)
            print('Iteration ', iteration, ': recall score = ', recall_acc)

        results_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')

    best_c = results_table.loc[results_table['Mean recall score'].astype(float).idxmax()]['C_parameter']
    return best_c

best_c = printing_Kfold_scores(X_train_undersample, y_train_undersample)
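An equivalent and more compact way to run this search is scikit-learn's GridSearchCV; the following is only a sketch (scoring='recall' matches the metric used above, and the selected C should agree with best_c up to fold handling):

from sklearn.model_selection import GridSearchCV, KFold
from sklearn.linear_model import LogisticRegression

param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(
    LogisticRegression(penalty='l1', solver='liblinear'),
    param_grid,
    scoring='recall',                 # optimize mean recall across folds
    cv=KFold(n_splits=5, shuffle=False),
)
grid.fit(X_train_undersample, y_train_undersample.values.ravel())
print('Best C:', grid.best_params_['C'], '| mean recall:', grid.best_score_)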
import itertools

def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample, y_pred_undersample)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()
Tune the LR model's decision threshold and select the optimal value.
lr = LogisticRegression(C=0.01, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred_undersample_proba = lr.predict_proba(X_test_undersample.values)

thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

plt.figure(figsize=(10, 10))

j = 1
for i in thresholds:
    y_test_predictions_high_recall = y_pred_undersample_proba[:, 1] > i

    plt.subplot(3, 3, j)
    j += 1

    # Compute confusion matrix
    cnf_matrix = confusion_matrix(y_test_undersample, y_test_predictions_high_recall)
    np.set_printoptions(precision=2)
    print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

    # Plot non-normalized confusion matrix
    class_names = [0, 1]
    plot_confusion_matrix(cnf_matrix, classes=class_names, title='Threshold >= %s' % i)
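Instead of enumerating thresholds by hand, precision_recall_curve (already imported at the top) can pick a threshold that meets a recall target; the following is a sketch reusing lr and y_pred_undersample_proba from above, with target_recall as an illustrative value rather than part of the original analysis.

# Choose the threshold with the best precision among those that keep recall above a target.
precision, recall, pr_thresholds = precision_recall_curve(
    y_test_undersample.values.ravel(), y_pred_undersample_proba[:, 1])

target_recall = 0.9  # hypothetical business requirement
# precision[i] and recall[i] correspond to pr_thresholds[i]; the last elements (1.0, 0.0) have no threshold.
candidates = [(t, p, r) for t, p, r in zip(pr_thresholds, precision[:-1], recall[:-1]) if r >= target_recall]
best_threshold, best_precision, best_recall = max(candidates, key=lambda x: x[1])
print('threshold %.3f gives precision %.3f at recall %.3f' % (best_threshold, best_precision, best_recall))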
Next, analyse the model's ROC curve and use this model on the undersampled dataset; the results are as follows.
lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred = lr.predict(X_test.values)

# Compute confusion matrix on the original test set
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()

# ROC curve on the undersampled test set
y_pred_undersample_score = lr.fit(X_train_undersample, y_train_undersample.values.ravel()).decision_function(X_test_undersample.values)

fpr, tpr, thresholds = roc_curve(y_test_undersample.values.ravel(), y_pred_undersample_score)
roc_auc = auc(fpr, tpr)

# Plot ROC
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % roc_auc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([-0.1, 1.0])
plt.ylim([-0.1, 1.01])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
Now apply the same LR procedure to the original (imbalanced) data and evaluate the predictions.
best_c = printing_Kfold_scores(X_train, y_train)

lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train, y_train.values.ravel())
y_pred_undersample = lr.predict(X_test.values)

cnf_matrix = confusion_matrix(y_test, y_pred_undersample)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()

lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
y_pred_score = lr.fit(X_train, y_train.values.ravel()).decision_function(X_test.values)

fpr, tpr, thresholds = roc_curve(y_test.values.ravel(), y_pred_score)
roc_auc = auc(fpr, tpr)

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % roc_auc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([-0.1, 1.0])
plt.ylim([-0.1, 1.01])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
As the results show, training directly on the imbalanced data does not give a good recall; in other words, the model cannot reliably identify fraudulent transactions. Next we use the oversampling approach described earlier.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = pd.read_csv('../input/creditcard.csv')

data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(['Amount', 'Time'], axis=1)
import itertools
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import confusion_matrix, recall_score, classification_report

def printing_Kfold_scores(x_train_data, y_train_data):
    fold = KFold(5, shuffle=False)

    c_param_range = [0.01, 0.1, 1, 10, 100]
    results_table = pd.DataFrame(index=range(len(c_param_range)),
                                 columns=['C_parameter', 'Mean recall score'])
    results_table['C_parameter'] = c_param_range

    j = 0
    for c_param in c_param_range:
        recall_accs = []
        for iteration, indices in enumerate(fold.split(x_train_data), start=1):
            lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')
            lr.fit(x_train_data.iloc[indices[0], :], y_train_data.iloc[indices[0], :].values.ravel())
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1], :].values)
            recall_acc = recall_score(y_train_data.iloc[indices[1], :].values, y_pred_undersample)
            recall_accs.append(recall_acc)
            print('Iteration ', iteration, ': recall score = ', recall_acc)

        results_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')

    best_c = results_table.loc[results_table['Mean recall score'].astype(float).idxmax()]['C_parameter']
    return best_c

def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# Exclude the label column 'Class' from the features. Note that after adding
# normAmount the last column is no longer 'Class', so dropping the last column
# by position would leak the label into the features.
features = data.drop('Class', axis=1)
labels = data['Class']

features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.3, random_state=0)

# Apply SMOTE only to the training data.
oversampler = SMOTE(random_state=0)
os_features, os_labels = oversampler.fit_resample(features_train, labels_train)
os_features = pd.DataFrame(os_features)
os_labels = pd.DataFrame(os_labels)
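Before training, it is worth confirming that SMOTE balanced only the training split while the test split keeps its original distribution (a quick check):

# The resampled training labels should be roughly 50/50,
# while the held-out test split keeps the original imbalance.
print(pd.Series(os_labels.values.ravel()).value_counts())
print(labels_test.value_counts())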
best_c = printing_Kfold_scores(os_features, os_labels)

lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(os_features, os_labels.values.ravel())
y_pred = lr.predict(features_test.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(labels_test, y_pred)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ", float(cnf_matrix[1,1])/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()
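Finally, recall alone does not show how many false alarms the model raises. The following sketch summarizes the precision/F1 trade-off and the ROC AUC of the SMOTE-trained model on the untouched test split, using classification_report and roc_auc_score from sklearn.metrics:

from sklearn.metrics import classification_report, roc_auc_score

# Per-class precision / recall / F1 on the original, untouched test split.
print(classification_report(labels_test, y_pred, target_names=['No Fraud', 'Fraud']))

# Threshold-independent summary.
y_score = lr.predict_proba(features_test.values)[:, 1]
print('ROC AUC: %.3f' % roc_auc_score(labels_test, y_score))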