08-07 Breaking Down the Workflow of Building a Machine Learning Application: Testing the Model

Date: 2019-11-10


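For classification problems we might use k-nearest neighbors, decision trees, logistic regression, naive Bayes, support vector machines, or random forests; for regression problems we might use linear regression, decision trees, or random forests. In industry, we cannot tell a customer, "Here are several models I trained; take whichever one you like." Instead, we use some model evaluation tool to pick the best of these models and deliver that one to the customer.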

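In the chapter on training models, we measured the performance of every model with its built-in score() method. For classification models, however, score() only gives a simple measure of performance (the mean accuracy), and for regression models it only computes the R2 score. Such a measurement is one-sided, so we usually evaluate models with the modules under sklearn.metrics and sklearn.model_selection.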

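1.1 Evaluation metrics (sklearn.metrics)

The sklearn.metrics module provides a wide range of evaluation metrics, and users can also define their own. The metrics fall into two main types:

* Functions ending in _score return a model score; generally, the higher the better
* Functions ending in _error or _loss return a model error; generally, the lower the better

Next, we walk through these metrics in detail for regression models and classification models.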
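As a minimal sketch of a user-defined metric, the snippet below wraps a custom function with make_scorer. The function name max_abs_error and the toy values are assumptions made up for illustration; they do not come from the original article.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer


def max_abs_error(y_true, y_pred):
    """Hypothetical custom metric: the largest absolute error (lower is better)."""
    return np.max(np.abs(np.asarray(y_true) - np.asarray(y_pred)))


# greater_is_better=False marks the metric as an error, so the scorer
# negates it and "higher is better" holds for every scorer
max_abs_scorer = make_scorer(max_abs_error, greater_is_better=False)

# Toy data, made up for illustration only
X = np.arange(4).reshape(-1, 1)
y = np.array([3.0, -0.5, 2.0, 7.0])

# A baseline that always predicts the mean of y (2.875)
model = DummyRegressor(strategy='mean').fit(X, y)

raw_error = max_abs_error(y, model.predict(X))
custom_score = max_abs_scorer(model, X, y)  # scorers are called as scorer(estimator, X, y)
print('max abs error: {:.3f}'.format(raw_error))
print('scorer value: {:.3f}'.format(custom_score))
```

A scorer built this way can be passed as the scoring argument of the cross-validation helpers discussed later in this article.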

1.2 Testing regression models

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
from sklearn import datasets
%matplotlib inline
font = FontProperties(fname='/Library/Fonts/Heiti.ttc')

Commonly used metrics for regression models include r2_score and explained_variance_score:

* explained_variance_score(y_true, y_pred, sample_weight=None, multioutput='uniform_average'): explained variance score (the share of the target's variance that the model accounts for)
* mean_absolute_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average'): mean absolute error
* mean_squared_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average'): mean squared error
* median_absolute_error(y_true, y_pred): median absolute error
* r2_score(y_true, y_pred, sample_weight=None, multioutput='uniform_average'): R squared (coefficient of determination)
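To get a feel for how these functions behave, here is a small sketch that evaluates all five on the same hand-made arrays. The values are toy numbers chosen for illustration, not data from the article.

```python
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_squared_error, median_absolute_error,
                             r2_score)

# Toy targets and predictions, made up for illustration only
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

results = {
    'explained_variance_score': explained_variance_score(y_true, y_pred),
    'mean_absolute_error': mean_absolute_error(y_true, y_pred),
    'mean_squared_error': mean_squared_error(y_true, y_pred),
    'median_absolute_error': median_absolute_error(y_true, y_pred),
    'r2_score': r2_score(y_true, y_pred),
}

for name, value in results.items():
    print('{}: {:.3f}'.format(name, value))
```

The two _score functions come out close to 1 here, while the three _error functions come out close to 0, which matches the naming convention described in section 1.1.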

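1.2.1 r2_score

r2_score reports the coefficient of determination \(R^2\), which can be understood as a standardized version of the MSE. \(R^2\) is defined as
\[ R^2 = 1-\frac{\frac{1}{n}\sum_{i=1}^n\left(y^{(i)}-\hat{y}^{(i)}\right)^2}{\frac{1}{n}\sum_{i=1}^n\left(y^{(i)}-\mu_{(y)}\right)^2} \]
where \(\mu_{(y)}\) is the mean of \(y\), so \(\frac{1}{n}\sum_{i=1}^n(y^{(i)}-\mu_{(y)})^2\) is the variance of \(y\), and the formula can be written as
\[ R^2 = 1-\frac{MSE}{Var(y)} \]
\(R^2\) is at most \(1\). If \(R^2=1\), the mean squared error is \(MSE=0\), meaning the model fits the data perfectly. \(R^2=0\) corresponds to always predicting the mean of \(y\), and a model that does worse than that gets a negative \(R^2\), so the value is not confined to the range \(0\) to \(1\).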
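The identity \(R^2 = 1 - MSE/Var(y)\) can be checked numerically. The sketch below uses made-up toy arrays and compares the hand-computed value with r2_score; the last line shows a deliberately bad set of predictions whose \(R^2\) is negative.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy targets and predictions, made up for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)
var_y = np.var(y_true)            # population variance (divides by n)
r2_manual = 1 - mse / var_y
r2_sklearn = r2_score(y_true, y_pred)
print('manual: {:.4f}, sklearn: {:.4f}'.format(r2_manual, r2_sklearn))

# Predictions that are worse than always predicting the mean
r2_bad = r2_score([1, 2, 3], [3, 2, 1])
print('R^2 of the reversed predictions: {:.1f}'.format(r2_bad))
```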

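# Coefficient of determination (R^2) example
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
boston = datasets.load_boston()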
X = boston.data
y = boston.target

lr = LinearRegression()
lr.fit(X, y)
lr_predict = lr.predict(X)

lr_r2 = r2_score(y, lr_predict)
print('R^2 score: {:.2f}'.format(lr_r2))

R^2 score: 0.74

1.2.2 explained_variance_score

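# Explained variance example
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import explained_variance_score

# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
boston = datasets.load_boston()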
X = boston.data
y = boston.target

lr = LinearRegression()
lr.fit(X, y)
lr_predict = lr.predict(X)

ex_var = explained_variance_score(y, lr_predict)
print('Explained variance: {:.2f}'.format(ex_var))

Explained variance: 0.74
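Both examples report 0.74. That agreement is expected for ordinary least squares with an intercept evaluated on its own training data, because the residuals then average to zero. The two metrics differ once the predictions carry a systematic offset: explained variance ignores a constant bias, while \(R^2\) penalizes it. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.metrics import explained_variance_score, r2_score

# Toy data, made up for illustration only
y_true = np.array([1.0, 2.0, 3.0])
y_biased = y_true + 1.0   # every prediction is off by the same constant

ev = explained_variance_score(y_true, y_biased)
r2 = r2_score(y_true, y_biased)
print('explained variance: {:.2f}'.format(ev))
print('R^2: {:.2f}'.format(r2))
```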

1.3 Testing classification models

Commonly used metrics for classification models include accuracy_score, precision_score, recall_score, and f1_score:

* accuracy_score(y_true, y_pred): accuracy
* auc(x, y, reorder=False): area under a curve given by its points, for example the ROC curve; a larger AUC indicates better performance
* average_precision_score(y_true, y_score, average='macro', sample_weight=None): average precision (AP) computed from prediction scores
* brier_score_loss(y_true, y_prob, sample_weight=None, pos_label=None): Brier score; the smaller the value, the better the model
* confusion_matrix(y_true, y_pred, labels=None, sample_weight=None): returns the confusion matrix, which is used to evaluate the accuracy of a classification
* f1_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None): F1 score
* log_loss(y_true, y_pred, eps=1e-15, normalize=True, sample_weight=None, labels=None): log loss, also called logistic loss or cross-entropy loss
* precision_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None): precision; precision = TP/(TP+FP)
* recall_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None): recall; recall = TP/(TP+FN)
* roc_auc_score(y_true, y_score, average='macro', sample_weight=None): area under the ROC curve (AUC); the larger the better
* roc_curve(y_true, y_score, pos_label=None, sample_weight=None, drop_intermediate=True): computes the coordinates of the ROC curve (FPR on the horizontal axis, TPR on the vertical axis)

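In binary classification, every sample falls into one of four cases according to the combination of its true class and the class predicted by the model: true positive, false positive, true negative, and false negative. Let TP, FP, TN, and FN denote the corresponding numbers of samples, so that \(\text{total number of samples} = TP+FP+TN+FN\).

TP: number of positive samples predicted as positive
FP: number of negative samples predicted as positive
TN: number of negative samples predicted as negative
FN: number of positive samples predicted as negative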

Confusion matrix	Actual 1	Actual 0
Predicted 1	True Positive (TP)	False Positive (FP)
Predicted 0	False Negative (FN)	True Negative (TN)
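The four counts can be read off confusion_matrix directly. The labels below are made up for illustration. Note that scikit-learn places the true labels on the rows and the predicted labels on the columns, which is the transpose of the layout used in the table above.

```python
from sklearn.metrics import confusion_matrix

# Toy labels, made up for illustration only
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# With labels=[0, 1] the matrix is [[TN, FP], [FN, TP]],
# so flattening it yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print('TP={} FP={} TN={} FN={}'.format(tp, fp, tn, fn))
print('accuracy={:.2f} precision={:.2f} recall={:.2f}'.format(
    accuracy, precision, recall))
```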

1.3.1 Accuracy

Accuracy (accuracy_score) is defined as
\[ A = \frac{TP+TN}{TP+FP+TN+FN} = \frac{\text{number of correctly predicted samples}}{\text{total number of samples}} \]

# Accuracy example
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

iris_data = datasets.load_iris()
X = iris_data.data
y = iris_data.target

lr = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=200)
lr = lr.fit(X, y)

y_pred = lr.predict(X)
print('Accuracy: {:.2f}'.format(
    accuracy_score(y, y_pred)))

Accuracy: 0.97

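1.3.2 Precision

Precision (precision_score) is defined as
\[ P = \frac{TP}{TP+FP} = \frac{\text{number of samples correctly predicted as positive}}{\text{number of samples predicted as positive}} \]

# Precision example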
from sklearn import datasets
from sklearn.metrics import precision_score
from sklearn.linear_model import LogisticRegression

iris_data = datasets.load_iris()
X = iris_data.data
y = iris_data.target

lr = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=200)
lr = lr.fit(X, y)

y_pred = lr.predict(X)
print('Precision: {:.2f}'.format(
    precision_score(y, y_pred, average='weighted')))

Precision: 0.97

1.3.3 Recall

Recall (recall_score) is defined as
\[ R = \frac{TP}{TP+FN} = \frac{\text{number of samples correctly predicted as positive}}{\text{number of actual positive samples}} \]

# Recall example
from sklearn.metrics import recall_score
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris_data = datasets.load_iris()
X = iris_data.data
y = iris_data.target

lr = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=200)
lr = lr.fit(X, y)

y_pred = lr.predict(X)
print('Recall: {:.2f}'.format(recall_score(y, y_pred, average='weighted')))

Recall: 0.97

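1.3.4 F1 score

A model is usually judged by both precision and recall, but the two tend to pull against each other: raising precision usually lowers recall, and raising recall usually lowers precision. In industry, the emphasis placed on each therefore depends on the problem.

Take cancer prediction, where the positive class is "healthy" and the negative class is "has cancer". Pushing for higher precision may cause healthy people to be told they have cancer; pushing for higher recall may cause cancer patients to be told they are healthy.

The \(F_1\) score (f1_score) is defined as
\[ F_1 = \frac{2 \cdot P \cdot R}{P+R} = \frac{2 \cdot TP}{2 \cdot TP+FP+FN} = \frac{2 \cdot TP}{\text{total number of samples}+TP-TN} \]

\(F_\beta\) is defined as:
\[ F_\beta = \frac{(1+\beta^2) \cdot P \cdot R}{\beta^2 \cdot P+R} \]

\(F_\beta\) is a weighted generalization of \(F_1\) that makes it possible to trade off precision against recall.

When \(\beta<1\), precision \(P\) carries more weight, that is, precision matters more
When \(\beta=1\), \(F_\beta = F_1\)
When \(\beta>1\), recall \(R\) carries more weight, that is, recall matters more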
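The effect of \(\beta\) is easiest to see on a classifier with high precision and low recall. The sketch below uses made-up labels (precision 1.0, recall 0.25) and compares the formula above with scikit-learn's fbeta_score; the helper name f_beta is introduced here only for illustration.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Toy labels, made up for illustration: one true positive, three false negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]

p = precision_score(y_true, y_pred)   # TP / (TP + FP) = 1 / 1
r = recall_score(y_true, y_pred)      # TP / (TP + FN) = 1 / 4


def f_beta(p, r, beta):
    """F_beta computed directly from the formula in the text."""
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


for beta in (0.5, 1.0, 2.0):
    manual = f_beta(p, r, beta)
    library = fbeta_score(y_true, y_pred, beta=beta)
    print('beta={}: manual={:.4f}, sklearn={:.4f}'.format(beta, manual, library))
```

With \(\beta=0.5\) the high precision dominates and the score is highest; with \(\beta=2\) the low recall dominates and the score is lowest.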

# F1 score example
from sklearn import datasets
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression

iris_data = datasets.load_iris()
X = iris_data.data
y = iris_data.target

lr = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=200)
lr = lr.fit(X, y)

y_pred = lr.predict(X)
print('F1 score: {:.2f}'.format(f1_score(y, y_pred, average='weighted')))

F1 score: 0.97

1.3.5 ROC curve

The ROC (receiver operating characteristic) curve can also measure how good a model is. As the name suggests, it is a curve: its horizontal axis is the false positive rate (FPR) and its vertical axis is the true positive rate (TPR), which are defined as
\[ FPR = \frac{FP}{FP+TN}, \qquad TPR = \frac{TP}{TP+FN} \]

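# ROC example
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.metrics import roc_curve
from sklearn.linear_model import LogisticRegression

# Keep only the first 100 samples (classes 0 and 1) to get a binary problem
iris_data = datasets.load_iris()
X = iris_data.data[0:100, :]
y = iris_data.target[0:100]

lr = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=200)
lr = lr.fit(X, y)

# roc_curve expects scores, so use the predicted probability of class 1
# instead of the hard labels returned by predict()
y_score = lr.predict_proba(X)[:, 1]
fpr, tpr, thresholds = roc_curve(y, y_score)
plt.xlabel('FPR', fontsize=15)
plt.ylabel('TPR', fontsize=15)
plt.title('FPR-TPR', fontsize=20)
plt.plot(fpr, tpr)
plt.show()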

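1.3.6 AUC

Because a ROC curve on its own sometimes cannot rank models precisely, the area enclosed by the ROC curve and the axes, called AUC (area under the ROC curve), is used to measure model quality. The larger the AUC, the better the model.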
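The relationship between the curve and the area can be sketched on four made-up samples: roc_curve returns the (FPR, TPR) points, auc integrates them, and roc_auc_score gives the same area in one call.

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy labels and scores, made up for illustration only
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (FPR, TPR) point per decision threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

area_from_curve = auc(fpr, tpr)                 # trapezoidal area under the points
area_direct = roc_auc_score(y_true, y_score)    # same quantity in one call
print('AUC from curve: {:.2f}'.format(area_from_curve))
print('AUC direct: {:.2f}'.format(area_direct))
```

Here three of the four (positive, negative) pairs are ranked correctly by the scores, which is why the area equals 0.75.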

# AUC example
from sklearn import datasets
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression

# Keep only the first 100 samples (classes 0 and 1) to get a binary problem
iris_data = datasets.load_iris()
X = iris_data.data[0:100, :]
y = iris_data.target[0:100]

lr = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=200)
lr = lr.fit(X, y)

# Score each sample with the predicted probability of class 1
y_score = lr.predict_proba(X)[:, 1]
# Compute the AUC
print('AUC: {:.2f}'.format(roc_auc_score(y, y_score)))

AUC: 1.00

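1.4 Underfitting and overfitting

As mentioned in Part 2 on linear regression, a model with \(0\) error is not necessarily the best one. A model is learned from the training set, and because the training set may contain noise, it does not necessarily represent the test set, let alone future data. Such a model may fit the training data very well yet fit future data poorly. This phenomenon is called overfitting.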

# Overfitting illustration
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
%matplotlib inline

# Define a small custom data set; sort by x so the fitted curves are drawn left to right
data_frame = {'x': [2, 1.5, 3, 3.2, 4.22, 5.2, 6, 6.7],
              'y': [0.5, 3.5, 5.5, 5.2, 5.5, 5.7, 5.5, 6.25]}
df = pd.DataFrame(data_frame).sort_values('x')
X, y = df.iloc[:, 0].values.reshape(-1, 1), df.iloc[:, 1].values.reshape(-1, 1)

# Linear regression
lr = LinearRegression()
lr.fit(X, y)


def poly_lr(degree):
    """Fit a polynomial regression of the given degree and return its predictions on X."""
    poly = PolynomialFeatures(degree=degree)
    X_poly = poly.fit_transform(X)
    lr_poly = LinearRegression()
    lr_poly.fit(X_poly, y)
    y_pred_poly = lr_poly.predict(X_poly)

    return y_pred_poly


def plot_lr():
    """Plot the sample points and the linear regression fit."""
    plt.scatter(X, y, c='k', edgecolors='white', s=50)
    plt.plot(X, lr.predict(X), color='r', label='lr')
    # Noise point
    plt.scatter(2, 0.5, c='r')
    plt.text(2, 0.5, s='$(2,0.5)$')

    plt.xlim(0, 7)
    plt.ylim(0, 8)
    plt.xlabel('x')
    plt.ylabel('y')
    plt.legend()


def plot_poly(degree, color):
    """Plot the sample points and the polynomial regression fit."""
    plt.scatter(X, y, c='k', edgecolors='white', s=50)
    plt.plot(X, poly_lr(degree), color=color, label='m={}'.format(degree))
    # Noise point
    plt.scatter(2, 0.5, c='r')
    plt.text(2, 0.5, s='$(2,0.5)$')

    plt.xlim(0, 7)
    plt.ylim(0, 8)
    plt.xlabel('x')
    plt.ylabel('y')
    plt.legend()


def run():
    plt.figure()
    plt.subplot(231)
    plt.title('Fig. 1 (linear regression)', color='r', fontsize=12)
    plot_lr()
    # (subplot position, figure number, polynomial degree, line color)
    panels = [(232, 2, 1, 'orange'), (233, 3, 3, 'gold'), (234, 4, 5, 'green'),
              (235, 5, 7, 'blue'), (236, 6, 10, 'violet')]
    for position, number, degree, color in panels:
        plt.subplot(position)
        plt.title('Fig. {} (degree-{} polynomial)'.format(number, degree),
                  color='r', fontsize=12)
        plot_poly(degree, color)
    plt.show()


run()

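As shown above, every panel contains the same 8 sample points, and the red point is clearly noise. The six panels are explained below. Do not worry too much about what linear regression and polynomial regression are for now; you will learn both later, and they are used here only as convenient examples.

Fig. 1: linear regression; the sample points lie far from the fitted line, which is generally called underfitting
Fig. 2: degree-1 polynomial regression, which is equivalent to linear regression
Fig. 3: degree-3 polynomial regression, which performs reasonably well
Fig. 4: degree-5 polynomial regression, which clearly overfits
Fig. 5: degree-7 polynomial regression, which passes through every sample point and is undoubtedly overfitting
Fig. 6: degree-10 polynomial regression; the fitted curve can no longer be told apart from the degree-7 one, and curves of even higher degree can be expected to look similar

The figures show that an overfitted model becomes complex: for linear regression it may need a higher-degree polynomial to fit the sample points, and the same holds for other machine learning algorithms. Although an overfitted model has nearly 0 error on the points it was fitted to, consider a new sample with \(x=2\): the overfitted curve would predict \(y=0.5\), that is, it would hand the noise value to the new sample, which is clearly unreasonable.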
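The same effect can be seen in numbers rather than pictures. The sketch below refits the 8 sample points with increasing polynomial degree and records the training MSE. It is an illustration under the assumption that include_bias=False is acceptable (LinearRegression fits its own intercept); it is not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

# The same 8 sample points as in the figure
X = np.array([2, 1.5, 3, 3.2, 4.22, 5.2, 6, 6.7]).reshape(-1, 1)
y = np.array([0.5, 3.5, 5.5, 5.2, 5.5, 5.7, 5.5, 6.25])

train_mse = {}
for degree in (1, 3, 5, 7):
    # include_bias=False because LinearRegression already fits an intercept
    X_poly = PolynomialFeatures(degree=degree, include_bias=False).fit_transform(X)
    model = LinearRegression().fit(X_poly, y)
    train_mse[degree] = mean_squared_error(y, model.predict(X_poly))
    print('degree={}: training MSE={:.6f}'.format(degree, train_mse[degree]))
```

The training error keeps shrinking as the degree grows, yet that says nothing about new data: the degree-7 curve reaches its low error by reproducing the noise point.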

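1.4.1 Cross-validation

Splitting the training data and cross-validating is also a good way to guard against overfitting.

Typically the data is split in some ratio into a training set and a test set. The training set is used to train the model, and the test set stands in for future unseen samples when evaluating it. Cross-validation can then be seen as repeating this train-and-test cycle. When there is enough data, it is split into a training set, a validation set, and a test set: the training set is used to fit the parameters, the validation set to tune the hyperparameters, and the test set to measure the model's performance.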
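A three-way split can be sketched with two calls to train_test_split. The sizes used here (90/30/30 on the iris data) are an arbitrary choice for illustration, not a recommendation from the article.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# First hold out the test set, then carve a validation set out of the rest.
# An integer test_size is an absolute number of samples.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=30, random_state=1, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=30, random_state=1, stratify=y_rest)

sizes = (len(X_train), len(X_val), len(X_test))
print('train/validation/test sizes:', sizes)
```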

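1.4.1.1 Simple cross-validation

Randomly split the data set into a training set and a test set in some ratio. Then train a series of models by varying the model parameters, test each one on the test set as soon as it is trained, and finally keep the best-performing model.

1. Set the initial value \(c=1\)
2. Train the model
3. Test the model, then set \(c=c+1\)
4. If \(c<11\), change the model parameters and go back to step 2; otherwise stop training
5. Select the best-performing model from the model set \(\{c_1,c_2,\cdots,c_{10}\}\)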
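The procedure above can be sketched as a short loop. The choice of KNeighborsClassifier and of n_neighbors as the parameter being varied is an assumption made for illustration; the article does not prescribe a particular model.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

scores = {}
c = 1                                              # initial value
while c < 11:                                      # stop once ten models are trained
    model = KNeighborsClassifier(n_neighbors=c)    # the parameter being varied
    model.fit(X_train, y_train)                    # train the model
    scores[c] = model.score(X_test, y_test)        # test the model
    c += 1

# Keep the parameter value whose model scored best on the test set
best_c = max(scores, key=scores.get)
print('best n_neighbors: {}, accuracy: {:.2f}'.format(best_c, scores[best_c]))
```

Because the test set is also used to pick the winner, the winning score tends to be optimistic; this is one reason a separate validation set is used when enough data is available.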

# Simple cross-validation
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load the iris data
iris_data = datasets.load_iris()
X = iris_data.data[:, [0, 1]]
y = iris_data.target

# random_state=1 makes the split reproducible; stratify=y keeps the class proportions
# the same in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

print('Samples per class (all data): {}'.format(np.bincount(y)))
print('Samples per class (training data): {}'.format(np.bincount(y_train)))
print('Samples per class (test data): {}'.format(np.bincount(y_test)))

Samples per class (all data): [50 50 50]
Samples per class (training data): [35 35 35]
Samples per class (test data): [15 15 15]

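1.4.1.2 Stratified k-fold cross-validation

Randomly split the data into \(k\) subsets (\(k\) is typically between \(2\) and \(20\)). Train on \(k-1\) subsets and test on the remaining one; repeat this \(k\) times so that each subset serves as the test set once, and pick the best model.

1. Split the data into \(k\) subsets
2. Choose \(k-1\) subsets to train the model
3. Use the remaining subset to test the model
4. Repeat steps 2-3 until there are \(k\) models
5. Select the best-performing of the \(k\) models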
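As a sketch of these steps with a model attached, the loop below trains one LogisticRegression per fold and scores it on the held-out fold. The choice of model, of 5 folds, and of shuffle=True is an assumption made for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = datasets.load_iris(return_X_y=True)

# shuffle=True is required for random_state to have an effect
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

fold_scores = []
for train_idx, test_idx in skf.split(X, y):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])            # train on k-1 subsets
    fold_scores.append(model.score(X[test_idx], y[test_idx]))  # test on the rest

print('accuracy per fold:', np.round(fold_scores, 3))
print('mean accuracy: {:.3f}'.format(np.mean(fold_scores)))
```

In practice the per-fold scores are usually averaged to estimate how well the modelling approach generalizes, rather than used only to pick a single fold's model.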

# Stratified k-fold cross-validation
import numpy as np
from sklearn import datasets
# StratifiedKFold stratifies the data according to the distribution of the original labels
from sklearn.model_selection import StratifiedKFold

# Load the iris data
iris_data = datasets.load_iris()
X = iris_data.data[:, [0, 1]]
y = iris_data.target

# n_splits=3 corresponds to k=3. random_state only has an effect together with
# shuffle=True (recent scikit-learn versions raise an error otherwise), so it is omitted
kfold = StratifiedKFold(n_splits=3)
kfold = kfold.split(X, y)

for k, (train_data, test_data) in enumerate(kfold):
    print(train_data, test_data)
    print('Iteration: {}'.format(k), 'Training set size: {}'.format(
        len(train_data)), 'Test set size: {}'.format(len(test_data)))

[ 17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  67  68  69
  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87
  88  89  90  91  92  93  94  95  96  97  98  99 117 118 119 120 121 122
 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
 141 142 143 144 145 146 147 148 149] [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  50
  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66 100 101
 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116]
Iteration: 0 Training set size: 99 Test set size: 51
[  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  34
  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52
  53  54  55  56  57  58  59  60  61  62  63  64  65  66  84  85  86  87
  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105
 106 107 108 109 110 111 112 113 114 115 116 134 135 136 137 138 139 140
 141 142 143 144 145 146 147 148 149] [ 17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  67
  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83 117 118
 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133]
Iteration: 1 Training set size: 99 Test set size: 51
[  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  50  51
  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69
  70  71  72  73  74  75  76  77  78  79  80  81  82  83 100 101 102 103
 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121
 122 123 124 125 126 127 128 129 130 131 132 133] [ 34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  84  85
  86  87  88  89  90  91  92  93  94  95  96  97  98  99 134 135 136 137
 138 139 140 141 142 143 144 145 146 147 148 149]
Iteration: 2 Training set size: 102 Test set size: 48

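1.4.1.3 Shuffle-split cross-validation

Shuffle-split cross-validation draws a fresh random train/test split on every iteration, so the test sets of different iterations may overlap.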

# Shuffle-split cross-validation
import numpy as np
from sklearn import datasets
# ShuffleSplit draws a random train/test split on each iteration
from sklearn.model_selection import ShuffleSplit

# Load the iris data
iris_data = datasets.load_iris()
X = iris_data.data[:, [0, 1]]
y = iris_data.target

# n_splits=3 means three random splits are drawn
kfold = ShuffleSplit(n_splits=3, random_state=1)
kfold = kfold.split(X)

for k, (train_data, test_data) in enumerate(kfold):
    print(train_data, test_data)
    print('Iteration: {}'.format(k), 'Training set size: {}'.format(
        len(train_data)), 'Test set size: {}'.format(len(test_data)))

[ 42  92  66  31  35  90  84  77  40 125  99  33  19  73 146  91 135  69
 128 114  48  53  28  54 108 112  17 119 103  58 118  18   4  45  59  39
  36 117 139 107 132 126  85 122  95  11 113 123  12   2 104   6 127 110
  65  55 144 138  46  62  74 116  93 100  89  10  34  32 124  38  83 111
 149  27  23  67   9 130  97 105 145  87 148 109  64  15  82  41  80  52
  26  76  43  24 136 121 143  49  21  70   3 142  30 147 106  47 115  13
  88   8  81  60   0   1  57  22  61  63   7  86  96  68  50 101  20  25
 134  71 129  79 133 137  72 140  37] [ 14  98  75  16 131  56 141  44  29 120  94   5 102  51  78]
Iteration: 0 Training set size: 135 Test set size: 15
[ 18  37  59 111  65 119 127 102 121 118  90 146   3  51 100 133 105  23
  57 123  49   9  72 126 124 145  68 143   6  13 120  89 135  22  99  92
 130  39  58  81  52 117   4  17 138  97  70 109 148  42  73 115   5  76
  38  86 122  80  95  34  60 129 112   7  26  19  14  30  15  44  20 137
 107  64  41  79  50 131 108 144 104   8  74  94 103  31  82  55 125  32
  54  48  83 149   2  33  93 136  35  75  63  29   0  46  78  66 140  67
 128 106  28  16  87  45  47 113  77  40  21 101  69  53  24 134  43 116
 141 142  25 147  56  61  96  10  84] [132   1 114  62 110  27  91  36  85  98  88  11 139  71  12]
Iteration: 1 Training set size: 135 Test set size: 15
[ 62 135  20  56  77  55  65  87   5  97 117  10 142  74  17  12  45 102
  50  96 124  48   8  47 122 148  29 130  71 147   7 128 104  91 140  79
  60 136  86  67  33  68   0 129  49 121  99  32  59 110 101  14   6 123
 108  37 107 111  21  26  42  58  75  78  90 145 139  63  38  18  40 119
 100 126 134  28  72 144  80  46 113 149  85   2  81 116  35 115 138 137
  16 125 105  11 120 141  76  93 109  88  57  41   9  53  95 106  92  66
  22  23  36  13 132  61  83  39  70 131 146  98  64 103  30  84  94 127
  82   1  43  27  89  52  73  69 112] [ 51 133  19  31  24  34 114  54 143   4  15  25 118   3  44]
Iteration: 2 Training set size: 135 Test set size: 15

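1.4.1.4 Leave-one-out cross-validation

Leave-one-out is a special case of \(k\)-fold cross-validation: for a data set \(T\) with \(n\) samples, \(k\)-fold cross-validation with \(k=n\) is leave-one-out cross-validation.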
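The special-case relationship can be checked directly: on a data set of \(n\) samples, an unshuffled KFold with as many folds as samples yields the same splits as LeaveOneOut. The five samples below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(10).reshape(5, 2)   # 5 made-up samples
n = len(X)

loo_splits = [(train.tolist(), test.tolist())
              for train, test in LeaveOneOut().split(X)]
kfold_splits = [(train.tolist(), test.tolist())
                for train, test in KFold(n_splits=n).split(X)]

same = loo_splits == kfold_splits
print('number of splits:', len(loo_splits))
print('identical to KFold(n_splits=n):', same)
```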

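# Leave-one-out cross-validation
import numpy as np
from sklearn import datasets
from sklearn.model_selection import LeaveOneOut

# Load the iris data
iris_data = datasets.load_iris()
X = iris_data.data[:, [0, 1]]
y = iris_data.target

loo = LeaveOneOut()
loo

LeaveOneOut()

loo.get_n_splits(X)

count = 0
for train_index, test_index in loo.split(X):
    # Only print the first 10 splits
    if count < 10:
        print("Training set size:", len(train_index), "Test set size:", len(test_index))
    count += 1
# Report the total once the loop has finished (one split per sample)
print('...\nNumber of iterations:', count)

Training set size: 149 Test set size: 1
Training set size: 149 Test set size: 1
Training set size: 149 Test set size: 1
Training set size: 149 Test set size: 1
Training set size: 149 Test set size: 1
Training set size: 149 Test set size: 1
Training set size: 149 Test set size: 1
Training set size: 149 Test set size: 1
Training set size: 149 Test set size: 1
Training set size: 149 Test set size: 1
...
Number of iterations: 150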

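1.4.1.5 Time series split

Time series split is generally used to test time series models. The idea behind the split is that each test sample is related to the samples that precede it, so the training indices always come before the test indices.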

import numpy as np
from sklearn.model_selection import TimeSeriesSplit
X = np.array([[1, 2], [2, 4], [3, 2], [2, 4], [1, 2], [3, 2]])
y = np.array([1, 3, 3, 4, 5, 4])
# max_train_size caps the number of training samples; n_splits is the number of splits
tscv = TimeSeriesSplit(n_splits=5, max_train_size=3)
tscv

TimeSeriesSplit(max_train_size=3, n_splits=5)

for train_index, test_index in tscv.split(X):
    print("Training indices:", train_index, "Test indices:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

Training indices: [0] Test indices: [1]
Training indices: [0 1] Test indices: [2]
Training indices: [0 1 2] Test indices: [3]
Training indices: [1 2 3] Test indices: [4]
Training indices: [2 3 4] Test indices: [5]

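1.5 Using cross-validation together with a model

Knowing how cross-validation works is one thing; the question is how to actually train a model with it. Write a for loop? No, we can use the cross-validation scoring helpers that ship with sklearn.

1.5.1 cross_val_score

cross_val_score is the most basic way to combine cross-validation with a model. You pass it the model, the data, and a scoring method, and it returns the score from each round of testing.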
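For intuition about what the helper does, the sketch below compares cross_val_score with an explicit loop over the same folds. Passing an explicit StratifiedKFold object and cloning the estimator per fold are choices made here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = datasets.load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5)

# Explicit loop: fit a fresh copy of the model on each training fold
manual_scores = []
for train_idx, test_idx in cv.split(X, y):
    model = clone(clf).fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[test_idx], y[test_idx]))

# The helper performs the same fit-and-score cycle in one call
library_scores = cross_val_score(clf, X, y, cv=cv, scoring='accuracy')

print('manual :', np.round(manual_scores, 4))
print('library:', np.round(library_scores, 4))
```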

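from sklearn.metrics import SCORERS

# Available scoring methods
# (newer scikit-learn releases expose the same names via sklearn.metrics.get_scorer_names())
SCORERS.keys()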

dict_keys(['explained_variance', 'r2', 'neg_median_absolute_error', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'accuracy', 'roc_auc', 'balanced_accuracy', 'average_precision', 'neg_log_loss', 'brier_score_loss', 'adjusted_rand_score', 'homogeneity_score', 'completeness_score', 'v_measure_score', 'mutual_info_score', 'adjusted_mutual_info_score', 'normalized_mutual_info_score', 'fowlkes_mallows_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted'])

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=1000)
scores = cross_val_score(clf, X, y, cv=10, scoring='accuracy')
scores

array([1.        , 0.93333333, 1.        , 1.        , 0.93333333,
       0.93333333, 0.93333333, 1.        , 1.        , 1.        ])

print('Accuracy: {:.4f} (+/- {:.4f})'.format(scores.mean(), scores.std()*2))

Accuracy: 0.9733 (+/- 0.0653)

1.5.2 cross_validate

Compared with cross_val_score, cross_validate can evaluate several metrics at once, and it also returns fit_time (the time spent training) and score_time (the time spent scoring) for each fold.

from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=1000)
cross_validate(clf, X, y, cv=10, scoring=[
    'accuracy', 'recall_weighted'], return_train_score=True)

{'fit_time': array([0.04038572, 0.06277108, 0.07863808, 0.03404975, 0.03079391,
        0.04499412, 0.04462409, 0.06048512, 0.05675983, 0.03511214]),
 'score_time': array([0.00144005, 0.00148797, 0.00143886, 0.00105596, 0.00098372,
        0.00138307, 0.00099993, 0.00111103, 0.0020051 , 0.00080705]),
 'test_accuracy': array([1.        , 0.93333333, 1.        , 1.        , 0.93333333,
        0.93333333, 0.93333333, 1.        , 1.        , 1.        ]),
 'train_accuracy': array([0.97037037, 0.97777778, 0.97037037, 0.97037037, 0.97777778,
        0.97777778, 0.98518519, 0.97037037, 0.97037037, 0.97777778]),
 'test_recall_weighted': array([1.        , 0.93333333, 1.        , 1.        , 0.93333333,
        0.93333333, 0.93333333, 1.        , 1.        , 1.        ]),
 'train_recall_weighted': array([0.97037037, 0.97777778, 0.97037037, 0.97037037, 0.97777778,
        0.97777778, 0.98518519, 0.97037037, 0.97037037, 0.97777778])}

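1.5.3 cross_val_predict

cross_val_predict returns the prediction for every sample, made while that sample was part of the test fold; in other words, every sample is used as test data once.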

from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=1000)
per_sample = cross_val_predict(clf, X, y, cv=10)
per_sample

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

from sklearn.metrics import accuracy_score

accuracy_score(y, per_sample)

0.9733333333333334

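1.6 Model-specific cross-validation

sklearn also provides estimators with built-in cross-validation for tuning their own hyperparameters, such as LassoCV and RidgeCV.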

from sklearn import datasets
from sklearn.metrics import r2_score
from sklearn.linear_model import Lasso, LassoCV

# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
boston = datasets.load_boston()
X = boston.data
y = boston.target

reg = Lasso()
reg.fit(X, y)

Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

y_pred = reg.predict(X)

'R^2 score: {:.2f}'.format(r2_score(y, y_pred))

'R^2 score: 0.68'

reg = LassoCV(cv=5)
reg.fit(X, y)

LassoCV(alphas=None, copy_X=True, cv=5, eps=0.001, fit_intercept=True,
    max_iter=1000, n_alphas=100, n_jobs=None, normalize=False,
    positive=False, precompute='auto', random_state=None,
    selection='cyclic', tol=0.0001, verbose=False)

y_pred = reg.predict(X)

'R^2 score: {:.2f}'.format(r2_score(y, y_pred))

'R^2 score: 0.70'
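What LassoCV adds over Lasso is the automatic choice of the regularization strength alpha. The sketch below shows this on synthetic data from make_regression, used here as an assumption because the Boston data set is unavailable in recent scikit-learn releases; the numbers it prints are not comparable with the results above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LassoCV

# Synthetic data, generated only for illustration
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

# LassoCV tries a grid of alpha values and keeps the one with the
# best 5-fold cross-validated error
reg_cv = LassoCV(cv=5, random_state=0).fit(X, y)
chosen_alpha = reg_cv.alpha_
print('alpha chosen by cross-validation: {:.4f}'.format(chosen_alpha))

# A plain Lasso can then be fitted with the selected alpha
reg = Lasso(alpha=chosen_alpha).fit(X, y)
print('R^2 on the training data: {:.2f}'.format(reg.score(X, y)))
```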
