Model Evaluation and Validation

Training and Test Sets

Why split into training and test sets?

In machine learning we usually split the data into a training set and a test set: the model is trained on the training set and then evaluated on the test set. We train a model so that it can make accurate predictions in later practice, so we want it to perform well on future, real-world data, not merely on the training set. A model that focuses too much on the training set ends up memorizing the whole training set by rote instead of learning the underlying structure of the data. Such a model handles the training set extremely well but has no judgment on data it has never memorized. This is like a student who memorizes answers to exercises without understanding how to solve them; that way of learning will not produce good results in real work.

To tell whether a model has merely memorized the data or has actually learned its underlying structure, we use the test set to check whether the model can make accurate predictions on data it has never seen.

How to split data into training and test sets

from sklearn.model_selection import train_test_split
from numpy import random

random.seed(2)
X = random.random(size=(12, 4))
y = random.random(size=(12, 1))
# Hold out 25% of the 12 samples (3 rows) as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
print('X_train:\n')
print(X_train)
print('\ny_train:\n')
print(y_train)
print('\nX_test:\n')
print(X_test)
print('\ny_test:\n')
print(y_test)
X_train:

[[ 0.4203678   0.33033482  0.20464863  0.61927097]
 [ 0.22030621  0.34982629  0.46778748  0.20174323]
 [ 0.12715997  0.59674531  0.226012    0.10694568]
 [ 0.4359949   0.02592623  0.54966248  0.43532239]
 [ 0.79363745  0.58000418  0.1622986   0.70075235]
 [ 0.13457995  0.51357812  0.18443987  0.78533515]
 [ 0.64040673  0.48306984  0.50523672  0.38689265]
 [ 0.50524609  0.0652865   0.42812233  0.09653092]
 [ 0.85397529  0.49423684  0.84656149  0.07964548]]

y_train:

[[ 0.95374223]
 [ 0.02720237]
 [ 0.40627504]
 [ 0.53560417]
 [ 0.06714437]
 [ 0.08209492]
 [ 0.24717724]
 [ 0.8508505 ]
 [ 0.3663424 ]]

X_test:

[[ 0.29965467  0.26682728  0.62113383  0.52914209]
 [ 0.96455108  0.50000836  0.88952006  0.34161365]
 [ 0.56714413  0.42754596  0.43674726  0.77655918]]

y_test:

[[ 0.54420816]
 [ 0.99385201]
 [ 0.97058031]]

sklearn.model_selection.train_test_split(*arrays, **options)

Split arrays or matrices into random train and test subsets.
Quick utility that wraps input validation and next(ShuffleSplit().split(X, y)) and application to input data into a single call for splitting (and optionally subsampling) data in a one-liner.

Parameters:
*arrays : sequence of indexables with same length / shape[0]
Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.
test_size : float, int, or None (default is None)
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is automatically set to the complement of the train size. If train size is also None, test size is set to 0.25.
train_size : float, int, or None (default is None)
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.
random_state : int or RandomState
Pseudo-random number generator state used for random sampling.
stratify : array-like or None (default is None)
If not None, data is split in a stratified fashion, using this as the class labels.
Returns:
splitting : list, length=2 * len(arrays)
List containing train-test split of inputs.
New in version 0.16: If the input is sparse, the output will be a scipy.sparse.csr_matrix. Else, output type is the same as the input type.
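
As a quick illustration of the test_size, random_state, and stratify parameters, here is a minimal sketch; the toy data and the 2:1 class ratio are made up for this example:

from sklearn.model_selection import train_test_split
import numpy as np

# Toy imbalanced labels: 8 negatives, 4 positives (made up for illustration)
X = np.arange(24).reshape(12, 2)
y = np.array([0] * 8 + [1] * 4)

# stratify=y preserves the 2:1 class ratio in both subsets;
# random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
print(np.bincount(y_train), np.bincount(y_test))  # [6 3] [2 1]

Stratification matters mostly for imbalanced data: without it, a small test set can end up with too few (or zero) samples of the minority class.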

Confusion Matrix (Error Matrix)

Reference

A confusion matrix (Kohavi and Provost, 1998) contains information about actual and predicted classifications done by a classification system. Performance of such systems is commonly evaluated using the data in the matrix. The following table shows the confusion matrix for a two-class classifier.
The entries in the confusion matrix have the following meaning in the context of our study:

a is the number of correct predictions that an instance is negative,
b is the number of incorrect predictions that an instance is positive,
c is the number of incorrect predictions that an instance is negative, and
d is the number of correct predictions that an instance is positive.

                    Predicted Negative    Predicted Positive
  Actual Negative           a                     b
  Actual Positive           c                     d

Several standard terms have been defined for the two-class matrix:

  • The accuracy (AC) is the proportion of the total number of predictions that were correct. It is determined using the equation:

$$AC=\frac{a+d}{a+b+c+d}$$

  • The recall or true positive rate (TP) is the proportion of positive cases that were correctly identified, as calculated using the equation:

$$Recall=\frac{d}{c+d}$$

  • precision (P) is the proportion of the predicted positive cases that were correct, as calculated using the equation:
    $$P=\frac{d}{b+d}$$

The accuracy determined using equation 1 may not be an adequate performance measure when the number of negative cases is much greater than the number of positive cases (Kubat et al., 1998).
Accuracy is a poor measure when negative cases far outnumber positive cases: even with zero true positives, accuracy can still be very high.

Suppose there are 1000 cases, 995 of which are negative cases and 5 of which are positive cases. If the system classifies them all as negative, the accuracy would be 99.5%, even though the classifier missed all positive cases. Other performance measures account for this by including TP in a product: for example, geometric mean (g-mean) (Kubat et al., 1998), and F-Measure (Lewis and Gale, 1994).

$$\text{g-mean}=\sqrt{R\cdot P}$$

$$F_{\beta}=\frac{(\beta^2+1)PR}{\beta^2P+R}$$

The F1 score is the special case of the F-measure with $\beta = 1$.
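
To make these formulas concrete, here is a small sketch evaluating the all-negative classifier from the 1000-case example above (the zero-division handling is our own convention, not part of the definitions):

# Confusion-matrix entries for the all-negative classifier on
# 995 negative and 5 positive cases (a = TN, b = FP, c = FN, d = TP)
a, b, c, d = 995, 0, 5, 0

accuracy = (a + d) / (a + b + c + d)           # 0.995, despite missing every positive
recall = d / (c + d)                           # 0.0
precision = d / (b + d) if (b + d) else 0.0    # undefined here; reported as 0 by convention
f1 = (2 * precision * recall / (precision + recall)
      if (precision + recall) else 0.0)        # 0.0
print(accuracy, recall, precision, f1)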

sklearn.metrics.confusion_matrix

Compute confusion matrix to evaluate the accuracy of a classification.
By definition a confusion matrix C is such that $C_{i, j}$ is equal to the number of observations known to be in group i but predicted to be in group j.
Thus in binary classification, the count of true negatives is $C_{0,0}$, false negatives is $C_{1,0}$, true positives is $C_{1,1}$ and false positives is $C_{0,1}$.
Read more in the User Guide.

Parameters
y_true : array, shape = [n_samples]
Ground truth (correct) target values.
y_pred : array, shape = [n_samples]
Estimated targets as returned by a classifier.
labels : array, shape = [n_classes], optional
List of labels to index the matrix. This may be used to reorder or select a subset of labels. If none is given, those that appear at least once in y_true or y_pred are used in sorted order.
sample_weight : array-like of shape = [n_samples], optional
Sample weights.
Returns: C : array, shape = [n_classes, n_classes]
Confusion matrix

Examples

from sklearn.metrics import confusion_matrix
y_true = [1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 0, 0, 1, 1]  # predictions added so the example runs; consistent with the matrix below
confusion_matrix(y_true, y_pred)
array([[2, 1],
       [1, 2]])
from sklearn.metrics import confusion_matrix
y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
confusion_matrix(y_true, y_pred)
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])
y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]
confusion_matrix(y_true, y_pred, labels=["ant", "bird", "cat"])
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])
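
In the binary case the four counts can be unpacked directly from the matrix; this ravel() idiom comes from the scikit-learn documentation (the labels below are made up):

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 0, 1, 1, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 2 1 1 4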

Classifier performance metrics: the ROC curve and AUC

The ROC curve

ROC curve: receiver operating characteristic. Each point on the ROC curve reflects the sensitivity to the same signal stimulus.

  • Horizontal axis: false positive rate (FPR), the proportion of all negative instances that are predicted as positive; equal to 1 − Specificity.

  • Vertical axis: true positive rate (TPR), i.e. Sensitivity (positive-class coverage).

For a binary classification problem, each instance is assigned to the positive or the negative class. In practice, four outcomes can occur:

  • TP: the number of correct positive detections.
    If an instance is positive and is predicted positive, it is a true positive (True Positive, TP).

  • FN: misses, positives for which no correct match was found.
    If an instance is positive but is predicted negative, it is a false negative (False Negative, FN).

  • FP: false alarms, incorrect positive matches.
    If an instance is negative but is predicted positive, it is a false positive (False Positive, FP).

  • TN: correctly rejected non-matches.
    If an instance is negative and is predicted negative, it is a true negative (True Negative, TN).

The contingency table is as follows, with 1 denoting the positive class and 0 the negative class:

              Predicted 1    Predicted 0
  Actual 1        TP             FN
  Actual 0        FP             TN

From the table above, the formulas for the two axes follow:

(1) True positive rate (TPR) = TP/(TP+FN): the proportion of all actual positive instances that the classifier predicts as positive. (Sensitivity)

(2) False positive rate (FPR) = FP/(FP+TN): the proportion of all actual negative instances that the classifier predicts as positive. (1 − Specificity)

(3) True negative rate (TNR) = TN/(FP+TN): the proportion of all actual negative instances that the classifier predicts as negative; TNR = 1 − FPR. (Specificity)

Suppose we use a logistic regression classifier, which outputs for each instance the probability of being positive. By setting a threshold, say 0.6, instances with probability greater than or equal to 0.6 are classified positive and the rest negative. This yields one (FPR, TPR) pair, i.e. one point in the plane. As the threshold decreases, more and more instances are classified positive, but actual negatives are increasingly mixed in among them, so TPR and FPR increase together. At the maximum threshold the corresponding point is (0, 0); at the minimum threshold it is (1, 1).

In the figure below, the solid line in panel (a) is the ROC curve; each point on the line corresponds to one threshold.

Horizontal axis FPR = 1 − TNR = 1 − Specificity: the larger the FPR, the more actual negatives there are among the predicted positives.

Vertical axis TPR = Sensitivity (positive-class coverage): the larger the TPR, the more actual positives there are among the predicted positives.

The ideal target is TPR = 1, FPR = 0, i.e. the point (0, 1): the closer the ROC curve gets to (0, 1) and the farther it is from the 45° diagonal, the better; larger Sensitivity and Specificity mean better performance.

How to draw an ROC curve

For a specific classifier and test set, we obviously obtain only one classification result, i.e. a single (FPR, TPR) pair; to draw a curve we need a whole series of FPR and TPR values. Where do they come from? Let us first look at Wikipedia's definition of the ROC curve: a graphical plot that illustrates the performance of a binary classifier system "as its discrimination threshold is varied".

The key is the phrase "as its discrimination threshold is varied". What is this discrimination threshold? We have been ignoring an important capability of classifiers: probability output, i.e. how likely the classifier thinks a given sample is to belong to the positive (or negative) class. By looking into the internals of each classifier we can usually obtain some form of probability output, typically by mapping a real-valued score into the (0, 1) interval.

Suppose we already have the probability output (probability of being positive) for every sample; how do we vary the discrimination threshold? Sort the test samples by their positive-class probability in descending order. The figure below gives an example with 20 test samples: the "Class" column shows each sample's true label (p for positive, n for negative) and "Score" shows its probability of being positive.

Next, we take each "Score" value in turn, from high to low, as the threshold: a test sample is classified positive if its positive-class probability is greater than or equal to the threshold, and negative otherwise. For example, for the 4th sample in the figure, whose "Score" is 0.6, samples 1, 2, 3 and 4 are classified positive (their scores are all ≥ 0.6) and the remaining samples negative. Each choice of threshold yields one (FPR, TPR) pair, i.e. one point on the ROC curve. In total we obtain 20 (FPR, TPR) pairs, plotted as the ROC curve in the figure below.

Setting the threshold to 1 and to 0 gives the points (0, 0) and (1, 1) respectively. Connecting all the (FPR, TPR) pairs yields the ROC curve; the more threshold values, the smoother the curve.

In fact we do not strictly need the probability that each test sample is positive; any "score" the classifier assigns to the sample will do (and it need not lie in (0, 1)). A higher score means the classifier is more confident that the sample is positive, and each score value is used in turn as the threshold. I find converting scores to probabilities easier to understand.
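
A minimal sketch of this threshold sweep, with made-up scores and labels (not the 20-sample figure above):

import numpy as np

# Hypothetical scores (already sorted descending) and true labels
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3])
labels = np.array([1, 1, 0, 1, 0, 1, 0, 0])

P = labels.sum()          # number of positives
N = len(labels) - P       # number of negatives

points = [(0.0, 0.0)]     # threshold above the highest score
for t in scores:                        # each score is used as a threshold
    pred = (scores >= t).astype(int)    # positive if score >= threshold
    tp = int(((pred == 1) & (labels == 1)).sum())
    fp = int(((pred == 1) & (labels == 0)).sum())
    points.append((fp / N, tp / P))     # one (FPR, TPR) point per threshold

print(points)  # connect the points to trace the ROC curve; ends at (1.0, 1.0)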

AUC

AUC (Area Under the Curve) is the area under the ROC curve; for a classifier that does better than random guessing it lies between 0.5 and 1. As a single number, AUC gives a direct assessment of classifier quality: the larger the value, the better.

More precisely, the AUC is a probability: if you randomly pick one positive sample and one negative sample, the AUC is the probability that the classifier's score ranks the positive sample above the negative one. The larger the AUC, the more likely the classifier is to rank positives above negatives, and hence the better it separates the two classes.
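
This ranking interpretation can be verified numerically against sklearn.metrics.roc_auc_score; the scores below are made up for illustration:

import numpy as np
from sklearn.metrics import roc_auc_score

y = np.array([1, 1, 0, 1, 0, 0, 1, 0])
score = np.array([0.9, 0.7, 0.6, 0.55, 0.5, 0.4, 0.35, 0.2])

pos = score[y == 1]
neg = score[y == 0]
# Fraction of (positive, negative) pairs where the positive is ranked higher
# (ties count half)
pairwise = ((pos[:, None] > neg[None, :]).mean()
            + 0.5 * (pos[:, None] == neg[None, :]).mean())
print(pairwise, roc_auc_score(y, score))  # both print 0.75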

Why use ROC and AUC to evaluate classifiers

With so many metrics already available, why use ROC and AUC? Because the ROC curve has a very useful property: it remains essentially unchanged when the class distribution of the test set changes. Real datasets often exhibit class imbalance, i.e. a large gap between the numbers of positive and negative samples, and the class ratio in the test data may also drift over time. The figure below compares ROC curves with Precision-Recall curves:

In the figure above, (a) and (c) are ROC curves, while (b) and (d) are Precision-Recall curves.

(a) and (b) show the classifier's results on the original test set (with a balanced class distribution); (c) and (d) show its results after the number of negative samples in the test set is increased to 10 times the original. The ROC curves stay essentially the same, while the Precision-Recall curves change substantially.

from sklearn import datasets,svm,metrics
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

iris=datasets.load_iris()
X_train,X_test,y_train,y_test=train_test_split(iris.data,iris.target,test_size=0.5)
# Note: roc_curve is given hard class predictions here, so the "curve"
# has only a few points; pos_label=2 treats class 2 as the positive class
clf=svm.SVC(kernel='rbf',C=1,gamma=1).fit(X_train,y_train)
y_pred=clf.predict(X_test)
fpr, tpr, thresholds = metrics.roc_curve(y_test,y_pred, pos_label=2)
print('gamma=1 AUC= ',metrics.auc(fpr, tpr))

clf=svm.SVC(kernel='rbf',C=1,gamma=10).fit(X_train,y_train)
y_pred_rbf=clf.predict(X_test)
fpr_ga1, tpr_ga1, thresholds_ga1 = metrics.roc_curve(y_test,y_pred_rbf, pos_label=2)
print('gamma=10 AUC= ',metrics.auc(fpr_ga1, tpr_ga1))

neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train) 
y_pred_knn=neigh.predict(X_test)
fpr_knn, tpr_knn, thresholds_knn = metrics.roc_curve(y_test,y_pred_knn, pos_label=2)
print('knn AUC= ',metrics.auc(fpr_knn, tpr_knn))

plt.figure()
plt.plot([0,1],[0,1],'k--')
plt.plot(fpr,tpr,label='gamma=1')
plt.plot(fpr_ga1,tpr_ga1,label='gamma=10')
plt.plot(fpr_knn,tpr_knn,label='knn')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()
gamma=1 AUC=  0.927469135802
gamma=10 AUC=  0.927469135802
knn AUC=  0.936342592593

[Figure: ROC curves for the two SVMs and kNN]

Judging from the output above, the two SVM settings (gamma=1 and gamma=10) score identically on this split, and kNN does slightly better; with hard class predictions the ROC has too few points to separate the models clearly.
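
The longer example below is adapted from scikit-learn's "Feature transformations with ensembles of trees" demo (by Tim Head, as credited in the header): each tree ensemble maps samples into a one-hot encoding of the leaves they land in, a logistic regression is trained on that embedding, and the ROC curves of the combined pipelines are compared with those of the raw ensembles.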

# Author: Tim Head <betatim@gmail.com>
#
# License: BSD 3 clause

import numpy as np
np.random.seed(10)

import matplotlib.pyplot as plt

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (RandomTreesEmbedding, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.pipeline import make_pipeline

n_estimator = 10
X, y = make_classification(n_samples=80000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
# It is important to train the ensemble of trees on a different subset
# of the training data than the linear regression model to avoid
# overfitting, in particular if the total number of leaves is
# similar to the number of training samples
X_train, X_train_lr, y_train, y_train_lr = train_test_split(X_train,
                                                            y_train,
                                                            test_size=0.5)

# Unsupervised transformation based on totally random trees
rt = RandomTreesEmbedding(max_depth=3, n_estimators=n_estimator,
    random_state=0)

rt_lm = LogisticRegression()
pipeline = make_pipeline(rt, rt_lm)
pipeline.fit(X_train, y_train)
y_pred_rt = pipeline.predict_proba(X_test)[:, 1]
fpr_rt_lm, tpr_rt_lm, _ = roc_curve(y_test, y_pred_rt)

# Supervised transformation based on random forests
rf = RandomForestClassifier(max_depth=3, n_estimators=n_estimator)
rf_enc = OneHotEncoder()
rf_lm = LogisticRegression()
rf.fit(X_train, y_train)
rf_enc.fit(rf.apply(X_train))
rf_lm.fit(rf_enc.transform(rf.apply(X_train_lr)), y_train_lr)

y_pred_rf_lm = rf_lm.predict_proba(rf_enc.transform(rf.apply(X_test)))[:, 1]
fpr_rf_lm, tpr_rf_lm, _ = roc_curve(y_test, y_pred_rf_lm)

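# Supervised transformation based on gradient boosted trees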
grd = GradientBoostingClassifier(n_estimators=n_estimator)
grd_enc = OneHotEncoder()
grd_lm = LogisticRegression()
grd.fit(X_train, y_train)
grd_enc.fit(grd.apply(X_train)[:, :, 0])
grd_lm.fit(grd_enc.transform(grd.apply(X_train_lr)[:, :, 0]), y_train_lr)

y_pred_grd_lm = grd_lm.predict_proba(
    grd_enc.transform(grd.apply(X_test)[:, :, 0]))[:, 1]
fpr_grd_lm, tpr_grd_lm, _ = roc_curve(y_test, y_pred_grd_lm)


# The gradient boosted model by itself
y_pred_grd = grd.predict_proba(X_test)[:, 1]
fpr_grd, tpr_grd, _ = roc_curve(y_test, y_pred_grd)

# The random forest model by itself
y_pred_rf = rf.predict_proba(X_test)[:, 1]
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_rf)

plt.figure(1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_rt_lm, tpr_rt_lm, label='RT + LR')
plt.plot(fpr_rf, tpr_rf, label='RF')
plt.plot(fpr_rf_lm, tpr_rf_lm, label='RF + LR')
plt.plot(fpr_grd, tpr_grd, label='GBT')
plt.plot(fpr_grd_lm, tpr_grd_lm, label='GBT + LR')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()

plt.figure(2)
plt.xlim(0, 0.2)
plt.ylim(0.8, 1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_rt_lm, tpr_rt_lm, label='RT + LR')
plt.plot(fpr_rf, tpr_rf, label='RF')
plt.plot(fpr_rf_lm, tpr_rf_lm, label='RF + LR')
plt.plot(fpr_grd, tpr_grd, label='GBT')
plt.plot(fpr_grd_lm, tpr_grd_lm, label='GBT + LR')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve (zoomed in at top left)')
plt.legend(loc='best')
plt.show()

[Figure: ROC curves for RT + LR, RF, RF + LR, GBT and GBT + LR]

[Figure: the same ROC curves, zoomed in at the top left]

Errors

Mean absolute error (MAE)

The absolute value function is not differentiable at zero, which makes MAE awkward to optimize with gradient descent, so mean squared error is used instead.

Mean squared error (MSE)
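
For reference, over $n$ samples with true values $y_i$ and predictions $\hat{y}_i$:

$$MAE=\frac{1}{n}\sum_{i=1}^{n}|y_i-\hat{y}_i|$$

$$MSE=\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$$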

from sklearn import datasets,svm,metrics
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

iris=datasets.load_iris()
X_train,X_test,y_train,y_test=train_test_split(iris.data,iris.target,test_size=0.5)
clf=svm.SVC(kernel='rbf',C=1,gamma=1).fit(X_train,y_train)
y_pred=clf.predict(X_test)
# The iris targets are integer class labels, so MAE/MSE here simply
# measure how far the predicted label is from the true one
error=metrics.mean_absolute_error(y_test,y_pred)
print('mean_absolute_error: ',error)
print('mean_square_error: ',metrics.mean_squared_error(y_test,y_pred))
mean_absolute_error:  0.04
mean_square_error:  0.04

K Fold

import numpy as np
from sklearn.model_selection import KFold

X=np.array([0,1,2,3,4,5,6,7,8,9])
# 10 folds over 10 samples: each fold holds out exactly one sample
kf=KFold(n_splits=10, random_state=3, shuffle=True)
for train_indices,test_indices in kf.split(X):
    print (train_indices,test_indices)
[0 1 2 3 4 6 7 8 9] [5]
[0 1 2 3 5 6 7 8 9] [4]
[0 2 3 4 5 6 7 8 9] [1]
[0 1 3 4 5 6 7 8 9] [2]
[0 1 2 3 4 5 6 7 8] [9]
[0 1 2 3 4 5 7 8 9] [6]
[0 1 2 3 4 5 6 8 9] [7]
[1 2 3 4 5 6 7 8 9] [0]
[0 1 2 4 5 6 7 8 9] [3]
[0 1 2 3 4 5 6 7 9] [8]
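
In practice the split indices are used to train and score a model on each fold, and the averaged score is what you report. A minimal sketch on the iris data (the fold count and classifier are chosen just for illustration):

import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target

kf = KFold(n_splits=5, shuffle=True, random_state=3)
scores = []
for train_idx, test_idx in kf.split(X):
    clf = KNeighborsClassifier(n_neighbors=3)
    clf.fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))

print('mean accuracy over 5 folds:', np.mean(scores))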