Data Analysis: A Credit Card Anti-Fraud Model

This article uses historical credit card transaction data and machine learning to build a credit card anti-fraud model that predicts whether a customer's card has been used fraudulently.

1. Project Background

Predicting credit card fraud is highly valuable for limiting losses to both customers and banks. The dataset for this project comes from Kaggle and contains transactions made with credit cards by European cardholders in September 2013. It covers transactions from two days, in which 492 of 284,807 transactions are frauds. The dataset is highly imbalanced: the positive class (frauds) accounts for only 0.172% of all transactions. Deciding whether a cardholder's card has been compromised is a binary classification problem; candidate methods include logistic regression, SVM and random forests, as well as boosting methods such as XGBoost. This article uses logistic regression.

2. Exploratory Data Analysis

2.1 Understanding the Data

import numpy as np
import pandas as pd
import sklearn as skl
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import gridspec
import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, precision_recall_curve, auc, roc_auc_score, roc_curve, recall_score, classification_report 
from sklearn.linear_model import LogisticRegression



data = pd.read_csv('../input/creditcard.csv')
data.info()
data.head()

The data has 31 columns and 284,807 rows: V1 to V28 are anonymized numerical features, one further column is the integer class attribute Class, and the Amount and Time columns differ in scale and character from the other features, as the quick check below shows.
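A quick way to see that scale difference is to compare summary statistics (a minimal sketch, reusing the DataFrame loaded above):

# Amount spans a far larger range than the anonymized V1-V28 features
print(data[['Time', 'Amount']].describe())
print(data.loc[:, 'V1':'V28'].describe().loc[['min', 'max']])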

2.2 Initial Data Exploration

print('No Frauds', round(data['Class'].value_counts()[0]/len(data) * 100,2), '% of the dataset')
print('Frauds', round(data['Class'].value_counts()[1]/len(data) * 100,2), '% of the dataset')
colors = ["#0101DF", "#DF0101"]

sns.countplot(x='Class', data=data, palette=colors)
plt.title('Class Distributions \n (0: No Fraud || 1: Fraud)', fontsize=14)

No Frauds 99.83 % of the dataset
Frauds 0.17 % of the dataset

The data is clearly very imbalanced. Such imbalance is likely to produce a model that is accurate when predicting '0' but inaccurate when predicting '1'.

2.3 Data Preprocessing

As the figure shows, the Fraud and No Fraud classes are imbalanced. Modeling the data directly causes the following problems:

1) Overfitting: because the sample is dominated by the majority class (No Fraud), the model may over-learn the majority class's features and misjudge the minority.

2) Wrong feature associations: with an imbalanced sample, feature attributes can easily be associated incorrectly.

Possible solutions:

1) Random undersampling

Build a balanced sample by drawing from the majority class the same number of examples as there are Fraud cases, so that Fraud and No Fraud each make up 50% of the new sample.

2) Oversampling

Use the SMOTE algorithm to create synthetic points for the minority class (Fraud) so that the minority and majority classes become balanced. Unlike random undersampling, no rows are deleted; and because nothing is removed, the training set is larger and training takes more time. A conceptual sketch of SMOTE follows.
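Conceptually, SMOTE places a synthetic point on the line segment between a minority-class sample and one of its nearest minority-class neighbors. The snippet below only illustrates that idea; the actual resampling later in this article uses imblearn's SMOTE:

import numpy as np

def smote_synthetic(x_i, x_neighbor, rng=np.random):
    # Interpolate at a random position between two minority-class samples
    lam = rng.uniform(0, 1)
    return x_i + lam * (x_neighbor - x_i)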

Note that although we transform the data when applying random undersampling or oversampling, the model must still be tested on the original test set, not on the undersampled or oversampled data. Resampling only serves model fitting; the model ultimately has to work on the original dataset. Below we process and model the data with undersampling and then with oversampling, and in each case evaluate against the original data. The essential pattern is sketched next.
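A minimal sketch of that workflow, with imblearn's RandomUnderSampler standing in for the manual undersampling done below (X and y stand for the feature matrix and labels):

from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split

# Split first, so the test set keeps the original, imbalanced class distribution
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Resample ONLY the training data; evaluate on the untouched X_test / y_test
X_train_res, y_train_res = RandomUnderSampler(random_state=0).fit_resample(X_train, y_train)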

We also find that the Amount column takes much larger values than the other features, breaking the assumption that features live on comparable scales. It therefore needs to be standardized; otherwise this single feature would dominate the scale-sensitive analyses that follow.

3. Feature Engineering

1. Plotting feature distributions

1) Distribution of transaction amounts for fraudulent vs. normal card use

f,(ax1,ax2) = plt.subplots(2, 1, sharex=True, figsize=(12,4))
bins=30
ax1.hist(data[data.Class ==1]['Amount'],bins=bins)
ax1.set_title('Fraud')
ax2.hist(data[data.Class == 0]['Amount'], bins=bins)
ax2.set_title('Normal')
plt.xlabel('Amount ($)')
plt.ylabel('Number of Transactions')
plt.yscale('log')
plt.show()

Fraudulent transactions involve smaller amounts than those of normal card users, suggesting that fraudsters prefer small purchases to avoid attracting the cardholder's attention.

2) Distribution of transaction time and transaction amount

fig, ax = plt.subplots(1, 2, figsize=(18,4))

amount_val = data['Amount'].values
time_val = data['Time'].values

sns.distplot(amount_val, ax=ax[0], color='r')
ax[0].set_title('Distribution of Transaction Amount', fontsize=14)
ax[0].set_xlim([min(amount_val), max(amount_val)])

sns.distplot(time_val, ax=ax[1], color='b')
ax[1].set_title('Distribution of Transaction Time', fontsize=14)
ax[1].set_xlim([min(time_val), max(time_val)])

plt.show()

3) Distributions of the other features

plt.figure(figsize=(12,28*4))
v_features = data.iloc[:, 1:29].columns
gs = gridspec.GridSpec(28, 1)
for i, cn in enumerate(data[v_features]):
    ax = plt.subplot(gs[i])
    sns.distplot(data[data.Class == 1][cn], bins=50)
    sns.distplot(data[data.Class == 0][cn], bins=50)
    ax.set_xlabel('')
    ax.set_title('histogram of feature:' + str(cn))
plt.show()

2. Handling the imbalanced data

Standardize Amount so that it has zero mean and unit variance:

from sklearn.preprocessing import StandardScaler

data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(['Time','Amount'],axis=1)

The data also needs resampling, either random undersampling or oversampling. Here we start with undersampling, which gives us a sample with balanced positive and negative classes.

X = data.loc[:, data.columns != 'Class']
y = data.loc[:, data.columns == 'Class']

number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)

normal_indices = data[data.Class == 0].index

random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace = False)
random_normal_indices = np.array(random_normal_indices)

under_sample_indices = np.concatenate([fraud_indices,random_normal_indices])

under_sample_data = data.iloc[under_sample_indices,:]

X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']


X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 0)
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(X_undersample ,y_undersample, test_size = 0.3, random_state = 0)

3. Feature correlation analysis

3.1 Correlation analysis

f, (ax1, ax2) = plt.subplots(2, 1, figsize=(24,20))

# Entire DataFrame
corr = data.corr()
sns.heatmap(corr, cmap='coolwarm_r', annot_kws={'size':20}, ax=ax1)
ax1.set_title("Imbalanced Correlation Matrix \n (don't use for reference)", fontsize=14)


sub_sample_corr = under_sample_data.corr()
sns.heatmap(sub_sample_corr, cmap='coolwarm_r', annot_kws={'size':20}, ax=ax2)
ax2.set_title('SubSample Correlation Matrix \n (use for reference)', fontsize=14)
plt.show()

Clearly, failing to handle the imbalanced data first could lead to wrong conclusions about correlations.

The heatmap shows that V14, V12 and V10 are negatively correlated with Class, meaning the lower these values, the more likely the transaction is a fraud. V4, V11 and V19 are positively correlated: the higher these values, the more likely the transaction is a fraud. The same relationships can be read numerically, as sketched below.
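To read these relationships as numbers instead of heatmap colors, sorting each feature's correlation with Class is enough (a small sketch on the undersampled frame defined above):

class_corr = under_sample_data.corr()['Class'].drop('Class').sort_values()
print(class_corr.head())  # strongest negative correlations (V14, V12, V10, ...)
print(class_corr.tail())  # strongest positive correlations (..., V4, V11)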

3.2 Box plots

Box plots of the positively and negatively correlated feature attributes identified above:

f, axes = plt.subplots(ncols=4, figsize=(20,4))

# Positive correlations (The higher the feature the probability increases that it will be a fraud transaction)
sns.boxplot(x="Class", y="V11", data=under_sample_data, palette=colors, ax=axes[0])
axes[0].set_title('V11 vs Class Positive Correlation')

sns.boxplot(x="Class", y="V4", data=under_sample_data, palette=colors, ax=axes[1])
axes[1].set_title('V4 vs Class Positive Correlation')


sns.boxplot(x="Class", y="V2", data=under_sample_data, palette=colors, ax=axes[2])
axes[2].set_title('V2 vs Class Positive Correlation')


sns.boxplot(x="Class", y="V19", data=under_sample_data, palette=colors, ax=axes[3])
axes[3].set_title('V19 vs Class Positive Correlation')

plt.show()

f, axes = plt.subplots(ncols=3, figsize=(20,4))

# Negative correlations (the lower the feature value, the more likely a fraud transaction)
sns.boxplot(x="Class", y="V14", data=under_sample_data, palette=colors, ax=axes[0])
axes[0].set_title('V14 vs Class Negative Correlation')

sns.boxplot(x="Class", y="V12", data=under_sample_data, palette=colors, ax=axes[1])
axes[1].set_title('V12 vs Class Negative Correlation')

sns.boxplot(x="Class", y="V10", data=under_sample_data, palette=colors, ax=axes[2])
axes[2].set_title('V10 vs Class Negative Correlation')

plt.show()

3.3 Outlier removal

Before removing outliers, we visualize the features we are going to use. Among the negatively correlated features V14, V12 and V10, V14 is the only one whose distribution looks Gaussian. We derive thresholds from the quartiles and drop the data outside them.

from scipy.stats import norm

f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20, 6))

v14_fraud_dist = under_sample_data['V14'].loc[under_sample_data['Class'] == 1].values
sns.distplot(v14_fraud_dist,ax=ax1, fit=norm, color='#FB8861')
ax1.set_title('V14 Distribution \n (Fraud Transactions)', fontsize=14)

v12_fraud_dist = under_sample_data['V12'].loc[under_sample_data['Class'] == 1].values
sns.distplot(v12_fraud_dist,ax=ax2, fit=norm, color='#56F9BB')
ax2.set_title('V12 Distribution \n (Fraud Transactions)', fontsize=14)


v10_fraud_dist = under_sample_data['V10'].loc[under_sample_data['Class'] == 1].values
sns.distplot(v10_fraud_dist,ax=ax3, fit=norm, color='#C5B3F9')
ax3.set_title('V10 Distribution \n (Fraud Transactions)', fontsize=14)


plt.show()

Looking at the distribution plots of V14, V12 and V10, V14 is the one closest to a normal distribution.

# -----> V14 Removing Outliers (Highest Negative Correlated with Labels)
v14_fraud = under_sample_data['V14'].loc[under_sample_data['Class'] == 1].values
q25, q75 = np.percentile(v14_fraud, 25), np.percentile(v14_fraud, 75)
print('Quartile 25: {} | Quartile 75: {}'.format(q25, q75))
v14_iqr = q75 - q25
print('iqr: {}'.format(v14_iqr))

v14_cut_off = v14_iqr * 1.5
v14_lower, v14_upper = q25 - v14_cut_off, q75 + v14_cut_off
print('Cut Off: {}'.format(v14_cut_off))
print('V14 Lower: {}'.format(v14_lower))
print('V14 Upper: {}'.format(v14_upper))

outliers = [x for x in v14_fraud if x < v14_lower or x > v14_upper]
print('Feature V14 Outliers for Fraud Cases: {}'.format(len(outliers)))
print('V14 outliers:{}'.format(outliers))

under_sample_data = under_sample_data.drop(under_sample_data[(under_sample_data['V14'] > v14_upper) | (under_sample_data['V14'] < v14_lower)].index)
print('----' * 44)

# -----> V12 removing outliers from fraud transactions
v12_fraud = under_sample_data['V12'].loc[under_sample_data['Class'] == 1].values
q25, q75 = np.percentile(v12_fraud, 25), np.percentile(v12_fraud, 75)
v12_iqr = q75 - q25

v12_cut_off = v12_iqr * 1.5
v12_lower, v12_upper = q25 - v12_cut_off, q75 + v12_cut_off
print('V12 Lower: {}'.format(v12_lower))
print('V12 Upper: {}'.format(v12_upper))
outliers = [x for x in v12_fraud if x < v12_lower or x > v12_upper]
print('V12 outliers: {}'.format(outliers))
print('Feature V12 Outliers for Fraud Cases: {}'.format(len(outliers)))
under_sample_data = under_sample_data.drop(under_sample_data[(under_sample_data['V12'] > v12_upper) | (under_sample_data['V12'] < v12_lower)].index)
print('Number of Instances after outliers removal: {}'.format(len(under_sample_data)))
print('----' * 44)


# Removing outliers V10 Feature
v10_fraud = under_sample_data['V10'].loc[under_sample_data['Class'] == 1].values
q25, q75 = np.percentile(v10_fraud, 25), np.percentile(v10_fraud, 75)
v10_iqr = q75 - q25

v10_cut_off = v10_iqr * 1.5
v10_lower, v10_upper = q25 - v10_cut_off, q75 + v10_cut_off
print('V10 Lower: {}'.format(v10_lower))
print('V10 Upper: {}'.format(v10_upper))
outliers = [x for x in v10_fraud if x < v10_lower or x > v10_upper]
print('V10 outliers: {}'.format(outliers))
print('Feature V10 Outliers for Fraud Cases: {}'.format(len(outliers)))
under_sample_data = under_sample_data.drop(under_sample_data[(under_sample_data['V10'] > v10_upper) | (under_sample_data['V10'] < v10_lower)].index)
print('Number of Instances after outliers removal: {}'.format(len(under_sample_data)))

After removing the outliers, the results are shown below:

f,(ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20,6))

colors = ['#B3F9C5', '#f9c5b3']

sns.boxplot(x="Class", y="V14", data=under_sample_data,ax=ax1, palette=colors)
ax1.set_title("V14 Feature \n Reduction of outliers", fontsize=14)
ax1.annotate('Fewer extreme \n outliers', xy=(0.98, -17.5), xytext=(0, -12),
            arrowprops=dict(facecolor='black'),
            fontsize=14)


sns.boxplot(x="Class", y="V12", data=under_sample_data, ax=ax2, palette=colors)
ax2.set_title("V12 Feature \n Reduction of outliers", fontsize=14)
ax2.annotate('Fewer extreme \n outliers', xy=(0.98, -17.3), xytext=(0, -12),
            arrowprops=dict(facecolor='black'),
            fontsize=14)

sns.boxplot(x="Class", y="V10", data=under_sample_data, ax=ax3, palette=colors)
ax3.set_title("V10 Feature \n Reduction of outliers", fontsize=14)
ax3.annotate('Fewer extreme \n outliers', xy=(0.95, -16.5), xytext=(0, -12),
            arrowprops=dict(facecolor='black'),
            fontsize=14)


plt.show()

3.4 Dimensionality reduction

Dimensionality reduction removes unneeded features from a high-dimensional dataset and keeps only the useful ones. Here we use PCA and truncated SVD, projecting the data to two components so the separation between the classes can be visualized.

import time
from sklearn.decomposition import PCA, TruncatedSVD

# New_df is from the random undersample data (fewer instances)
X = under_sample_data.drop('Class', axis=1)
y = under_sample_data['Class']

# PCA Implementation
t0 = time.time()
X_reduced_pca = PCA(n_components=2, random_state=42).fit_transform(X.values)
t1 = time.time()
print("PCA took {:.2} s".format(t1 - t0))

# TruncatedSVD
t0 = time.time()
X_reduced_svd = TruncatedSVD(n_components=2, algorithm='randomized', random_state=42).fit_transform(X.values)
t1 = time.time()
print("Truncated SVD took {:.2} s".format(t1 - t0))
import matplotlib.patches as mpatches

f, (ax1, ax2) = plt.subplots(1, 2, figsize=(24,6))
# labels = ['No Fraud', 'Fraud']
f.suptitle('Clusters using Dimensionality Reduction', fontsize=14)


blue_patch = mpatches.Patch(color='#0A0AFF', label='No Fraud')
red_patch = mpatches.Patch(color='#AF0000', label='Fraud')


# PCA scatter plot
ax1.scatter(X_reduced_pca[:,0], X_reduced_pca[:,1], c=(y == 0), cmap='coolwarm', label='No Fraud', linewidths=2)
ax1.scatter(X_reduced_pca[:,0], X_reduced_pca[:,1], c=(y == 1), cmap='coolwarm', label='Fraud', linewidths=2)
ax1.set_title('PCA', fontsize=14)

ax1.grid(True)

ax1.legend(handles=[blue_patch, red_patch])

# TruncatedSVD scatter plot
ax2.scatter(X_reduced_svd[:,0], X_reduced_svd[:,1], c=(y == 0), cmap='coolwarm', label='No Fraud', linewidths=2)
ax2.scatter(X_reduced_svd[:,0], X_reduced_svd[:,1], c=(y == 1), cmap='coolwarm', label='Fraud', linewidths=2)
ax2.set_title('Truncated SVD', fontsize=14)

ax2.grid(True)

ax2.legend(handles=[blue_patch, red_patch])

plt.show()

4. Modeling and Evaluation

4.1 K-fold cross-validation

Use K-fold cross-validation to model the undersampled sample, evaluating by recall; recall is the metric of interest here because failing to catch a fraud (a false negative) is the costliest error.

def printing_Kfold_scores(x_train_data, y_train_data):
    fold = KFold(5, shuffle=False)
    c_param_range = [0.01, 0.1, 1, 10, 100]

    results_table = pd.DataFrame(index=range(len(c_param_range)), columns=['C_parameter', 'Mean recall score'])
    results_table['C_parameter'] = c_param_range
    j = 0
    for c_param in c_param_range:
        recall_accs = []
        # indices[0] holds the training-fold rows, indices[1] the validation-fold rows
        for iteration, indices in enumerate(fold.split(x_train_data), start=1):
            # liblinear is required for the L1 penalty in current scikit-learn
            lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')
            lr.fit(x_train_data.iloc[indices[0], :], y_train_data.iloc[indices[0], :].values.ravel())
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1], :].values)
            recall_acc = recall_score(y_train_data.iloc[indices[1], :].values, y_pred_undersample)
            recall_accs.append(recall_acc)
            print('Iteration ', iteration, ': recall score = ', recall_acc)
        results_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')
    best_c = results_table.loc[results_table['Mean recall score'].astype(float).idxmax()]['C_parameter']
    return best_c

best_c = printing_Kfold_scores(X_train_undersample, y_train_undersample)

4.2 Confusion matrix

def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
   
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    
import itertools
lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample,y_pred_undersample)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()

4.3 Threshold analysis

Tune the logistic regression model's threshold and select the optimal value.

lr = LogisticRegression(C = 0.01, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample_proba = lr.predict_proba(X_test_undersample.values)

thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]

plt.figure(figsize=(10,10))

j = 1
for i in thresholds:
    y_test_predictions_high_recall = y_pred_undersample_proba[:,1] > i
    
    plt.subplot(3,3,j)
    j += 1
    
    # Compute confusion matrix
    cnf_matrix = confusion_matrix(y_test_undersample,y_test_predictions_high_recall)
    np.set_printoptions(precision=2)

    print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

    # Plot non-normalized confusion matrix
    class_names = [0,1]
    plot_confusion_matrix(cnf_matrix, classes=class_names, title='Threshold >= %s'%i)

4.4 ROC curve

Plot the model's ROC curve and evaluate the model on the undersampled dataset (under_sample); the results follow.

lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred = lr.predict(X_test.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test,y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()

y_pred_undersample_score = lr.fit(X_train_undersample,y_train_undersample.values.ravel()).decision_function(X_test_undersample.values)
fpr, tpr, thresholds = roc_curve(y_test_undersample.values.ravel(),y_pred_undersample_score)
roc_auc = auc(fpr,tpr)
# Plot ROC
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b',label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.0])
plt.ylim([-0.1,1.01])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()


4.5 Evaluation on the original data

Use the logistic regression model from before to make predictions on the original data:

best_c = printing_Kfold_scores(X_train,y_train)
lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train,y_train.values.ravel())
y_pred_undersample = lr.predict(X_test.values)
cnf_matrix = confusion_matrix(y_test,y_pred_undersample)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))
 
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()
lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')
y_pred_score = lr.fit(X_train,y_train.values.ravel()).decision_function(X_test.values)
fpr, tpr, thresholds = roc_curve(y_test.values.ravel(),y_pred_score)
roc_auc = auc(fpr,tpr)
 
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b',label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.0])
plt.ylim([-0.1,1.01])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

Clearly, training directly on the original data does not achieve good recall, i.e. it does not identify fraudulent transactions well. Next we apply the oversampling approach described earlier.

4.6 Oversampling

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

data = pd.read_csv('../input/creditcard.csv')

from sklearn.preprocessing import StandardScaler
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1,1)) 

data = data.drop(['Amount','Time'], axis=1)
import itertools
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import confusion_matrix,recall_score,classification_report

# Same helper as in section 4.1, repeated here so the oversampling script is self-contained
def printing_Kfold_scores(x_train_data, y_train_data):
    fold = KFold(5, shuffle=False)
    c_param_range = [0.01, 0.1, 1, 10, 100]

    results_table = pd.DataFrame(index=range(len(c_param_range)), columns=['C_parameter', 'Mean recall score'])
    results_table['C_parameter'] = c_param_range
    j = 0
    for c_param in c_param_range:
        recall_accs = []
        for iteration, indices in enumerate(fold.split(x_train_data), start=1):
            lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')
            lr.fit(x_train_data.iloc[indices[0], :], y_train_data.iloc[indices[0], :].values.ravel())
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1], :].values)
            recall_acc = recall_score(y_train_data.iloc[indices[1], :].values, y_pred_undersample)
            recall_accs.append(recall_acc)
            print('Iteration ', iteration, ': recall score = ', recall_acc)
        results_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')
    best_c = results_table.loc[results_table['Mean recall score'].astype(float).idxmax()]['C_parameter']
    return best_c

def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=plt.cm.Blues):

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

columns = data.columns
features_columns = columns.drop('Class')  # keep every column except the label; Class must not leak into the features
features = data[features_columns]

labels=data['Class']

features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.3, random_state=0)

oversampler=SMOTE(random_state=0)
os_features, os_labels = oversampler.fit_resample(features_train, labels_train)

os_features = pd.DataFrame(os_features)
os_labels = pd.DataFrame(os_labels)
best_c = printing_Kfold_scores(os_features,os_labels)
lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')
lr.fit(os_features,os_labels.values.ravel())
y_pred = lr.predict(features_test.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(labels_test,y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", float(cnf_matrix[1,1])/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()

5. Conclusions

  1. Class imbalance must always be handled; training on the raw imbalanced dataset may yield a meaningless model. Features whose scales differ widely also need to be standardized.
  2. The logistic regression model trained on undersampled data reaches a high recall, but its false positive count is very high, i.e. it flags many normal transactions as fraud. This is the drawback of undersampling; oversampling handles the data better here.
  3. Besides the regularization strength C, the decision threshold of the logistic regression model can also be tuned so that the metrics of interest are optimal.
  4. Dimensionality reduction in feature engineering is well worth doing when a high-dimensional dataset contains many unnecessary features.
  5. Remaining work: train the data with other machine-learning algorithms, such as decision trees, SVM, random forests, GBDT and XGBoost, and compare them to pick the best model; a scorecard could also be built from the model to score fraud risk. A sketch of one such comparison follows.
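As a starting point for point 5 (note that RandomForestClassifier is imported in the oversampling script above but never used), the sketch below shows one such comparison on the SMOTE-resampled data; the hyperparameters are illustrative, not tuned:

# Train a random forest on the same resampled data and compare its recall
rf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
rf.fit(os_features, os_labels.values.ravel())
rf_pred = rf.predict(features_test.values)
cnf_matrix_rf = confusion_matrix(labels_test, rf_pred)
print("Random forest recall: ", float(cnf_matrix_rf[1,1]) / (cnf_matrix_rf[1,0] + cnf_matrix_rf[1,1]))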