How to Prevent Data Leakage When Evaluating Machine Learning Models

This article discusses the problem of data leakage when evaluating model performance and how to avoid it.

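During model evaluation, data leakage occurs when information from the training set spills into the validation/test set. This biases the estimate of the model's performance on the validation/test set. Let's understand it with an example that uses the "Boston house prices" dataset from Scikit-Learn. The dataset has no missing values, so 100 missing values are introduced at random to demonstrate data leakage more clearly.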

 import numpy as np
 import pandas as pd
 from sklearn.datasets import load_boston
 from sklearn.preprocessing import StandardScaler
 from sklearn.pipeline import Pipeline
 from sklearn.impute import SimpleImputer
 from sklearn.neighbors import KNeighborsRegressor
 from sklearn.model_selection import cross_validate, train_test_split
 from sklearn.metrics import mean_squared_error
 
 #Importing the dataset (note: load_boston was removed in scikit-learn 1.2, so this needs an older version)
 boston = load_boston()
 data = pd.DataFrame(boston['data'],columns=boston['feature_names'])
 data['target'] = boston['target']
 
 
 #Split the input and target features
 X = data.iloc[:,:-1].copy()
 y = data.iloc[:,-1].copy()
 
 
 # Adding 100 random missing values
 np.random.seed(11)
 rand_cols = np.random.randint(0,X.shape[1],100)
 rand_rows = np.random.randint(0,X.shape[0],100)
 for i,j in zip(rand_rows,rand_cols):
    X.iloc[i,j] = np.nan
     
 #Splitting the data into training and test sets
 X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=11)
 
 #Initializing KNN Regressor
 knn = KNeighborsRegressor()
 
 #Initializing mode imputer
 imp = SimpleImputer(strategy='most_frequent')
 
 #Initializing StandardScaler
 standard_scaler = StandardScaler()
 
 #Imputing and scaling X_train
 X_train_impute = imp.fit_transform(X_train).copy()
 X_train_scaled = standard_scaler.fit_transform(X_train_impute).copy()
 
 #Running 5-fold cross-validation
 cv = cross_validate(estimator=knn,X=X_train_scaled,y=y_train,cv=5,scoring="neg_root_mean_squared_error",return_train_score=True)
 
 #Calculating mean of the training scores of cross-validation
 print(f'Training RMSE (with data leakage): {-1 * np.mean(cv["train_score"])}')
 
 #Calculating mean of the validation scores of cross-validation
 print(f'validation RMSE (with data leakage): {-1 * np.mean(cv["test_score"])}')
 
 #fitting the model to the training data
 knn.fit(X_train_scaled,y_train)
 
 #preprocessing the test data
 X_test_impute = imp.transform(X_test).copy()
 X_test_scaled = standard_scaler.transform(X_test_impute).copy()
 
 #Predictions and model evaluation on unseen data
 pred = knn.predict(X_test_scaled)
 print(f'RMSE on unseen data: {np.sqrt(mean_squared_error(y_test,pred))}')

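In the code above, 'X_train' is the training set (used for k-fold cross-validation) and 'X_test' is used to evaluate the model on unseen data. The code is an example of model evaluation with data leakage: the mode used to impute missing values (strategy='most_frequent') is computed on all of 'X_train'. Similarly, the mean and standard deviation used to scale the data are computed on all of 'X_train'. The missing values of 'X_train' are imputed, and 'X_train' is scaled, before the k-fold cross-validation runs.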
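To make the leak concrete, here is a minimal sketch that compares scaling statistics computed on every row with statistics computed on the training part of a single fold. It assumes a synthetic array, `X_demo` (a hypothetical name), standing in for 'X_train'; it is an illustration, not part of the original example.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train (assumption: 100 rows, 3 numeric features)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# Leaky approach: statistics are computed once, on every row
leaky_scaler = StandardScaler().fit(X_demo)

# Take the first split of a 5-fold cross-validation
train_idx, val_idx = next(iter(KFold(n_splits=5).split(X_demo)))

# Leak-free approach: statistics are computed on the training part only
fold_scaler = StandardScaler().fit(X_demo[train_idx])

# The means differ because the leaky scaler also saw the validation rows
mean_gap = np.abs(leaky_scaler.mean_ - fold_scaler.mean_)
print("Mean computed on all rows:     ", leaky_scaler.mean_)
print("Mean computed on training part:", fold_scaler.mean_)
print("Absolute difference:           ", mean_gap)
```

The gap is usually small, but it shows that the validation rows have already influenced the transformation applied to the training part.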

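In k-fold cross-validation, 'X_train' is split into 'k' folds. In each iteration, one fold is used for validation (let's call it the validation part) and the remaining folds are used for training (the training part). The training and validation parts in every iteration have missing values that were already imputed with the mode computed on all of 'X_train'. Similarly, they have already been scaled with the mean and standard deviation computed on all of 'X_train'. These imputation and scaling operations cause information from 'X_train' to leak into both the training and validation parts of the k-fold cross-validation. This leakage can bias the estimate of the model's performance on the validation part. The code below shows one way to avoid it by using a pipeline.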

 #Preprocessing and regressor pipeline
 pipeline = Pipeline(steps=[['imputer',imp],['scaler',standard_scaler],['regressor',knn]])
 
 #Running 5-fold cross-validation using pipeline as estimator
 cv = cross_validate(estimator=pipeline,X=X_train,y=y_train,cv=5,scoring="neg_root_mean_squared_error",return_train_score=True)
 
 #Calculating mean of the training scores of cross-validation
 print(f'Training RMSE (without data leakage): {-1 * np.mean(cv["train_score"])}')
 
 #Calculating mean of the validation scores of cross-validation
 print(f'validation RMSE (without data leakage): {-1 * np.mean(cv["test_score"])}')
 
 #fitting the pipeline to the training data
 pipeline.fit(X_train,y_train)
       
 #Predictions and model evaluation on unseen data
 pred = pipeline.predict(X_test)
 print(f'RMSE on unseen data: {np.sqrt(mean_squared_error(y_test,pred))}')

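In the code above, we have included the imputer, the scaler, and the regressor in the pipeline. In this example, 'X_train' is split into 5 folds, and in each iteration the pipeline uses only the training part to compute the mode that imputes missing values in the training and validation parts. Likewise, the mean and standard deviation used to scale the training and validation parts are computed on the training part. This eliminates the data leakage, because in every k-fold cross-validation iteration the imputation mode and the scaling mean and standard deviation are computed on the training part alone. In each iteration, these values are then used to impute and scale both the training and validation parts.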
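Written out by hand, the per-fold behaviour described above looks roughly like the loop below. This is a sketch under assumptions: `X_demo` and `y_demo` are synthetic, hypothetical stand-ins for 'X_train' and 'y_train', and the loop only illustrates the idea rather than reproducing scikit-learn's internals.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train (assumption, not the article's data)
rng = np.random.default_rng(11)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)
# Blank out up to 30 random cells to mimic missing values
X_demo[rng.integers(0, 200, 30), rng.integers(0, 4, 30)] = np.nan

fold_rmse = []
for train_idx, val_idx in KFold(n_splits=5).split(X_demo):
    X_tr, X_val = X_demo[train_idx], X_demo[val_idx]
    y_tr, y_val = y_demo[train_idx], y_demo[val_idx]

    # Imputation mode is learned from the training part only
    imputer = SimpleImputer(strategy="most_frequent").fit(X_tr)
    X_tr_imp, X_val_imp = imputer.transform(X_tr), imputer.transform(X_val)

    # Mean and standard deviation are learned from the training part only
    scaler = StandardScaler().fit(X_tr_imp)
    X_tr_scaled, X_val_scaled = scaler.transform(X_tr_imp), scaler.transform(X_val_imp)

    # Fit on the training part, score on the validation part
    model = KNeighborsRegressor().fit(X_tr_scaled, y_tr)
    fold_pred = model.predict(X_val_scaled)
    fold_rmse.append(np.sqrt(mean_squared_error(y_val, fold_pred)))

print(f"Mean validation RMSE over 5 folds: {np.mean(fold_rmse)}")
```

Putting the same three steps in a `Pipeline` and handing it to `cross_validate` avoids writing this loop and makes it harder to fit a transformer on the wrong rows by accident.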

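We can see differences between the training and validation RMSE computed with and without data leakage. Because the dataset is small, the differences are only slight. With a large dataset, the difference may be significant. The fact that the validation RMSE (with data leakage) is close to the RMSE on unseen data is only a coincidence.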
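Because `load_boston` is no longer available in recent scikit-learn releases, the comparison can also be tried on synthetic data. The sketch below assumes data generated with `make_regression` (the sample size, feature count, and noise level are arbitrary choices, not values from the original article) and prints both cross-validated estimates side by side. It makes no claim about which number will be larger.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the housing data (assumption)
X_syn, y_syn = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=11)
rng = np.random.default_rng(11)
# Blank out up to 60 random cells to mimic missing values
X_syn[rng.integers(0, 300, 60), rng.integers(0, 6, 60)] = np.nan

scoring = "neg_root_mean_squared_error"

# With leakage: impute and scale all rows once, then cross-validate the model alone
X_pre = StandardScaler().fit_transform(
    SimpleImputer(strategy="most_frequent").fit_transform(X_syn)
)
leaky_rmse = -cross_val_score(
    KNeighborsRegressor(), X_pre, y_syn, cv=5, scoring=scoring
).mean()

# Without leakage: preprocessing is refit inside every fold via the pipeline
pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"), StandardScaler(), KNeighborsRegressor()
)
clean_rmse = -cross_val_score(pipe, X_syn, y_syn, cv=5, scoring=scoring).mean()

print(f"Validation RMSE (with data leakage):    {leaky_rmse}")
print(f"Validation RMSE (without data leakage): {clean_rmse}")
```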

Therefore, using a pipeline for k-fold cross-validation prevents data leakage and gives a better estimate of the model's performance on unseen data.

Author: KSV Muralidhar

Original article: https://ksvmuralidhar.medium.com/how-to-avoid-data-leakage-while-evaluating-the-performance-of-a-machine-learning-model-ac30f2bb8586

Translated by the deephub translation team

This article was shared from the WeChat official account DeepHub IMBA (deephub-imba).