Cross Validation

Date: 2019-12-12

Training Set vs. Test Set

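In pattern recognition and machine learning research, a dataset is commonly divided into two subsets: a training set and a test set. The former is used to build the model; the latter is used to evaluate how accurately that model predicts unseen samples, which is formally called its generalization ability.
When splitting a complete dataset into a training set and a test set, the following rules must be observed: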

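1. Only the training set may be used while training the model. The test set may be used only after the model is complete, as the basis for judging its quality.
2. The training set must contain enough samples, usually more than 50% of the total.
3. Both subsets must be sampled uniformly from the complete set.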
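
The three rules above can be sketched with scikit-learn as follows. This is only an illustration under assumptions: the Iris dataset and the 105/45 split are chosen here for convenience, and `stratify=y` is one common way to approximate uniform sampling across classes, not the only one.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 samples, 50 per class

# Rule 2: keep more than half of the samples for training (105 of 150 here).
# Rule 3: stratify=y keeps the class proportions equal in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=45, stratify=y, random_state=0)

# Rule 1: from this point on, only X_train / y_train may be used to fit a model;
# X_test / y_test are touched once, after training is finished.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("samples per class in training set:", train_counts)
print("samples per class in test set:    ", test_counts)
```

Fixing `random_state` makes the split reproducible, which also removes the temptation to redraw the split until the test score looks good.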

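The last point is especially important. Uniform sampling aims to reduce the bias between the training/test sets and the complete set, but it is hard to achieve. The usual approach is random sampling, which approximates uniform sampling when there are enough samples. Randomness is also the blind spot of this approach, and it is where results are often manipulated. For example, when the recognition rate is unsatisfactory, one could redraw the training/test split again and again until the recognition rate on the test set looks good. Strictly speaking, that is cheating.

Cross Validation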

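Cross validation is a statistical method for assessing the performance of a classifier. The basic idea is to partition the original dataset in some way, using one part as the training set and the other as the validation set. The classifier is first trained on the training set, and the resulting model is then tested on the validation set; the outcome serves as the performance measure of the classifier. Common cross-validation methods are as follows: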

Hold-Out Method

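The original data is randomly divided into two groups, one used as the training set and the other as the validation set. The classifier is trained on the training set and then checked against the validation set, and the resulting classification accuracy is recorded as the performance measure of the classifier. The advantage of this method is its simplicity: the data only needs to be randomly split in two. Strictly speaking, however, the hold-out method is not really cross validation, because nothing is "crossed": each sample is used either for training or for validation, never for both. Because the split is random, the validation accuracy depends heavily on how the data happened to be divided, so the result is not very convincing.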
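
To make the dependence on the split concrete, the sketch below repeats the hold-out procedure with several random seeds. The dataset (Iris), the classifier (a linear SVM) and the 60/40 split are assumptions borrowed from the example later in this article; the seeds are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

accuracies = []
for seed in range(5):
    # Each seed produces a different random hold-out split of the same data.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.4, random_state=seed)
    clf = SVC(kernel="linear", C=1).fit(X_train, y_train)
    accuracies.append(clf.score(X_val, y_val))

print("hold-out accuracy per split:", accuracies)
print("spread (max - min):", max(accuracies) - min(accuracies))
```

Any single number from this list could have been reported as "the" hold-out accuracy, which is why one split alone is weak evidence.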

Double Cross Validation (2-fold Cross Validation, denoted 2-CV)

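The dataset is divided into two subsets of equal size, and the classifier is trained in two rounds. In the first round, one subset serves as the training set and the other as the test set; in the second round, the two are swapped and the classifier is trained again. What we care about is the recognition rate on the two test sets. In practice 2-CV is rarely used, mainly because the training set is too small to represent the distribution of the population, so the recognition rate in the testing stage tends to vary noticeably. In addition, the way the data is split into two subsets varies a lot in 2-CV, so it often fails to meet the requirement that an experiment be reproducible.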
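
The two rounds can be written out by hand, as in this minimal sketch. It assumes the Iris dataset and a linear SVM purely for illustration; `fold_a` and `fold_b` are names introduced here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Shuffle the sample indices once, then cut them into two equal halves.
rng = np.random.default_rng(0)
order = rng.permutation(len(X))
half = len(X) // 2
fold_a, fold_b = order[:half], order[half:]

rates = []
# Round 1 trains on A and tests on B; round 2 swaps the roles.
for train_idx, test_idx in [(fold_a, fold_b), (fold_b, fold_a)]:
    clf = SVC(kernel="linear", C=1).fit(X[train_idx], y[train_idx])
    rates.append(clf.score(X[test_idx], y[test_idx]))

print("recognition rate per round:", rates)
print("2-CV average:", sum(rates) / len(rates))
```

Changing the seed passed to `default_rng` changes both halves at once, which is the source of the variability described above.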

K-fold Cross Validation (denoted K-CV)

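The original data is divided into K groups (usually of equal size). Each subset is used once as the validation set while the remaining K-1 subsets form the training set, which yields K models. The mean validation accuracy of these K models is the performance measure of the classifier under K-CV. K must be at least 2; in practice one usually starts from 3 and tries 2 only when the dataset is small. K-CV helps reveal overfitting and underfitting, and its result is more convincing than that of a single split.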

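K-fold cross-validation (k-CV) is an extension of double cross-validation. The dataset is cut into k subsets of equal size; each subset serves once as the test set while the remaining samples form the training set. One k-CV experiment therefore builds k models and computes the average recognition rate over the k test sets. In practice, k must be large enough that the training set in each round has enough samples; k=10 is generally considered sufficient.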
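
Written as an explicit loop, the procedure looks like the sketch below. It assumes scikit-learn's `KFold` splitter, the Iris dataset and a linear SVM; shuffling with a fixed seed is a choice made here so that the run is repeatable.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

k = 10
kfold = KFold(n_splits=k, shuffle=True, random_state=0)

fold_rates = []
for train_idx, test_idx in kfold.split(X):
    # One model per round: trained on k-1 subsets, tested on the remaining one.
    model = SVC(kernel="linear", C=1).fit(X[train_idx], y[train_idx])
    fold_rates.append(model.score(X[test_idx], y[test_idx]))

mean_rate = float(np.mean(fold_rates))
print("models built:", len(fold_rates))
print("average recognition rate: %.3f" % mean_rate)
```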

Leave-One-Out Cross Validation (denoted LOO-CV)

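If the original data has N samples, LOO-CV is simply N-CV: each sample serves once as the validation set on its own, and the remaining N-1 samples form the training set. LOO-CV therefore yields N models, and the mean validation accuracy of these N models is the performance measure of the classifier under LOO-CV. Compared with K-CV, LOO-CV has two clear advantages: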

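(1) In each round almost all samples are used to train the model, so the training set is as close as possible to the distribution of the original data and the resulting estimate is more reliable.
(2) No random factor affects the experiment, which ensures that it can be reproduced.
The drawback of LOO-CV is its high computational cost, because the number of models to build equals the number of samples. When the dataset is large, LOO-CV becomes close to impractical, unless each model trains very quickly or the computation can be parallelized to cut the time required.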
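
A minimal sketch of LOO-CV with scikit-learn's `LeaveOneOut` splitter follows; the dataset and classifier are again assumptions made for illustration. Each fold contains a single sample, so every per-fold score is either 0 or 1.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

loo = LeaveOneOut()
n_models = loo.get_n_splits(X)  # one model per sample

# cross_val_score also accepts an n_jobs argument to run the folds in parallel,
# which is one way to address the cost discussed above.
scores = cross_val_score(SVC(kernel="linear", C=1), X, y, cv=loo)

print("models trained:", n_models)
print("LOO-CV accuracy: %.3f" % scores.mean())
```

Because the splits are fully determined by the data, running this twice gives the same per-fold scores.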

Common Mistakes When Using Cross-Validation

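Many studies in our lab use evolutionary algorithms (EA) together with classifiers, and the fitness function usually involves the recognition rate of a classifier. Cases where cross-validation is misused are quite common. As stated earlier, only training data may be used to build the model, so only the recognition rate on training data may be used in the fitness function. An EA is a method used during training to tune the model's parameters, so test data may be used only after the EA has finished evolving and the model parameters are fixed. How, then, should EA and cross-validation be combined? Cross-validation is by nature a way to estimate the generalization error of a classification method on a dataset; it is not a method for designing a classifier. Cross-validation therefore cannot be used inside the EA's fitness function: every sample that touches the fitness function belongs to the training set, so which samples would be left as the test set? If a fitness function uses the training or test recognition rate from cross-validation, the experiment can no longer be called cross-validation.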

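The correct way to combine EA with k-CV is to split the dataset into k equal subsets, take one subset as the test set each time and the other k-1 as the training set, and use only that training set in the EA's fitness function (how the training set is further used is unrestricted). A correct k-CV thus runs the EA k times and builds k classifiers. The k-CV test recognition rate is the average of the recognition rates that the k EA-trained classifiers achieve on their corresponding test sets.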
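
The structure described above can be sketched as follows. Everything about the "EA" here is a hypothetical stand-in: `evolve_C` is a toy mutate-and-select loop over the `C` parameter of a linear SVM, written only to show where the fitness function is allowed to look. The point of the sketch is the data flow, not the optimizer.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.svm import SVC


def evolve_C(X_train, y_train, generations=5, offspring=4, seed=0):
    """Toy stand-in for an EA: mutate log10(C) and keep the fittest value.

    The fitness function only ever sees samples from the training set passed in.
    """
    rng = np.random.default_rng(seed)
    # Split the training set again; the held-back part is still training data.
    X_fit, X_held, y_fit, y_held = train_test_split(
        X_train, y_train, test_size=0.3, stratify=y_train, random_state=seed)

    def fitness(log_c):
        model = SVC(kernel="linear", C=10.0 ** log_c).fit(X_fit, y_fit)
        return model.score(X_held, y_held)

    best_log_c = 0.0
    best_fitness = fitness(best_log_c)
    for _ in range(generations):
        # Offspring are Gaussian mutations of the current best individual.
        for candidate in best_log_c + rng.normal(0.0, 0.5, size=offspring):
            candidate = float(np.clip(candidate, -2.0, 2.0))
            score = fitness(candidate)
            if score > best_fitness:
                best_log_c, best_fitness = candidate, score
    return 10.0 ** best_log_c


X, y = load_iris(return_X_y=True)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

test_rates = []
for fold, (train_idx, test_idx) in enumerate(outer.split(X, y)):
    # One full evolution per fold, using the training part only.
    best_C = evolve_C(X[train_idx], y[train_idx], seed=fold)
    clf = SVC(kernel="linear", C=best_C).fit(X[train_idx], y[train_idx])
    # The test subset is used exactly once, after the parameters are fixed.
    test_rates.append(clf.score(X[test_idx], y[test_idx]))

print("per-fold test recognition rate:", test_rates)
print("k-CV estimate:", sum(test_rates) / len(test_rates))
```

With k=5 this runs five independent evolutions and builds five classifiers, matching the description: the reported figure is the mean of the five test rates, and no test sample ever influences the fitness.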

Example

import numpy as np
from sklearn import datasets
from sklearn import svm
# sklearn.cross_validation was removed in scikit-learn 0.20;
# its helpers now live in sklearn.model_selection
from sklearn.model_selection import train_test_split, cross_val_score

iris = datasets.load_iris()
iris.data.shape, iris.target.shape

((150, 4), (150,))

We can now quickly sample a training set while holding out 40% of the data for testing (evaluating) our classifier:

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)
X_train.shape, y_train.shape

((90, 4), (90,))

X_test.shape, y_test.shape

((60, 4), (60,))

clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)

0.96666666666666667

Computing cross-validated metrics

The simplest way to use cross-validation is to call the cross_val_score helper function on the estimator and the dataset.
The following example demonstrates how to estimate the accuracy of a linear kernel support vector machine on the iris dataset by splitting the data, fitting a model and computing the score 5 consecutive times (with different splits each time):

clf = svm.SVC(kernel='linear', C=1)
# 5-fold cross validation of the linear-kernel SVM on the iris dataset
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
scores

array([ 0.96666667, 1. , 0.96666667, 0.96666667, 1. ])

The mean score and a spread of two standard deviations (a rough stand-in for a 95% interval of the score estimate) are hence given by:

print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.98 (+/- 0.03)

By default, the score computed at each CV iteration is the score method of the estimator. It is possible to change this by using the scoring parameter:

# use the weighted average of the per-class F1 scores as the score
scores = cross_val_score(clf, iris.data, iris.target, cv=5, scoring='f1_weighted')
scores

array([ 0.96658312, 1. , 0.96658312, 0.96658312, 1. ])

See "The scoring parameter: defining model evaluation rules" in the scikit-learn documentation for details. In the case of the Iris dataset, the samples are balanced across target classes hence the accuracy and the F1-score are almost equal.
When the cv argument is an integer, cross_val_score uses the KFold or StratifiedKFold strategies by default, the latter being used if the estimator derives from ClassifierMixin.
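
The difference between the two strategies is easy to see on Iris, whose samples are stored sorted by class. The sketch below passes explicit splitter objects as `cv` instead of an integer; the choice of 3 folds is made here only because it lines up with the three classes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC(kernel="linear", C=1)

# Unshuffled KFold cuts the sorted data into three blocks of 50, so every test
# fold is a single class that the model never saw during training.
plain = cross_val_score(clf, X, y, cv=KFold(n_splits=3))

# StratifiedKFold keeps all three classes in every training and test fold.
stratified = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=3))

print("KFold (no shuffle):", plain)       # every fold scores 0.0
print("StratifiedKFold:   ", stratified)
```

This is the non-uniform sampling problem from the start of the article in its most extreme form; shuffling (`KFold(n_splits=3, shuffle=True, random_state=0)`) or stratifying avoids it.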

Grid Search with Cross Validation

>>> from sklearn import svm, datasets
>>> from sklearn.model_selection import GridSearchCV
>>> iris = datasets.load_iris()
>>> parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
>>> svr = svm.SVC()
>>> clf = GridSearchCV(svr, parameters,cv=5,scoring='f1_weighted')
>>> clf.fit(iris.data, iris.target)

GridSearchCV(cv=5, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'C': [1, 10], 'kernel': ('linear', 'rbf')},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='f1_weighted', verbose=0)

# View the best mean cross-validated score (weighted F1, since scoring='f1_weighted')
print('Best score for data:', clf.best_score_)
# View the best parameters for the model found using grid search
print('Best C:',clf.best_estimator_.C)
print('Best Kernel:',clf.best_estimator_.kernel)
print('Best Gamma:',clf.best_estimator_.gamma)

Best score for data: 0.979949874687
Best C: 1
Best Kernel: linear
Best Gamma: auto
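
Note that the grid search above is fitted on the whole dataset, so its best score was also used to pick the parameters and no sample is left as an untouched test set. One way to stay consistent with the rules stated earlier is sketched below: hold out a test set first, search on the training part only, and score the chosen model once on the held-out part. The 60/40 split and the variable names are assumptions made here.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Set the test set aside before any parameter selection happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)

parameters = {"kernel": ("linear", "rbf"), "C": [1, 10]}
search = GridSearchCV(SVC(), parameters, cv=5, scoring="f1_weighted")
search.fit(X_train, y_train)  # cross-validation runs inside the training set only

# With refit=True (the default), predict() uses the best model refitted on X_train.
test_f1 = f1_score(y_test, search.predict(X_test), average="weighted")

print("best parameters:", search.best_params_)
print("CV score used for selection:", search.best_score_)
print("weighted F1 on the untouched test set:", test_f1)
```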
