K-Fold Cross-Validation: Differences and Connections Between StratifiedKFold and KFold

A key step in training a neural network is evaluating the model's generalization ability. When a model performs poorly, it is usually either too complex and overfits (high variance) or too simple and underfits (high bias). To address this, we can use two cross-validation techniques to evaluate generalization: holdout cross-validation and k-fold cross-validation.

In k-fold cross-validation, the average of the k fold results is used to evaluate the model, so tuning parameters with k-fold cross-validation is more stable than with the holdout method. Once the best parameters have been found, the final model is trained on the full original dataset with those parameters.
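As a minimal sketch of this idea (using a toy classifier and dataset that are not part of this post), scikit-learn's cross_val_score runs k-fold cross-validation, and the mean of the fold scores serves as the performance estimate:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy data and a simple classifier, purely for illustration
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# 5-fold cross-validation; the mean of the fold scores estimates generalization performance
scores = cross_val_score(clf, X, y, cv=5)
print("fold scores:", scores)
print("mean score:", scores.mean())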

1 KFold Explained

class sklearn.model_selection.KFold(n_splits=5, shuffle=False, random_state=None)

Splits the dataset into k consecutive folds (without shuffling by default) and returns the indices of each resulting subset.

Each fold is then used once as the validation set while the remaining k-1 folds form the training set.

Parameters:

n_splits: int, default 5 (the default was 3 before v0.22); the number of folds must be at least 2.

shuffle: optional, default False, i.e. the data are not shuffled before splitting.

random_state: optional, default None; if an int, random_state is the seed used by the random number generator.

Link: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn-model-selection-kfold
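As a small illustration (toy data, not taken from the scikit-learn docs), shuffle=True randomizes the sample order before splitting, and a fixed random_state makes the shuffled folds reproducible:

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(6, 2)
# shuffle=True randomizes the sample order before splitting;
# a fixed random_state makes the shuffled folds reproducible across runs
kf = KFold(n_splits=3, shuffle=True, random_state=42)
for train_index, test_index in kf.split(X):
    print("train:", train_index, "test:", test_index)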

An important method of this class:

split(self, X, y=None, groups=None)

 

Generates indices to split the data into training and test sets.

Parameters:
X  array-like, shape (n_samples, n_features)

Training data, where n_samples is the number of samples and n_features is the number of features.

y  array-like, shape (n_samples,)

The target variable for supervised learning problems.

groups  array-like, with shape (n_samples,), optional

Group labels for the samples used while splitting the dataset into train/test set.

Yields:
train  ndarray

The training set indices for that split.

test  ndarray

The testing set indices for that split.
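For example (a short illustrative snippet), split() yields arrays of row indices rather than the rows themselves, and those indices are then used to slice the original data:

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(8).reshape(4, 2)
kf = KFold(n_splits=2)
for train_index, test_index in kf.split(X):
    # the yielded values are index arrays; slice X with them to get the actual rows
    X_train, X_test = X[train_index], X[test_index]
    print("train indices:", train_index, "test indices:", test_index)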

2 StratifiedKFold Explained

class sklearn.model_selection.StratifiedKFold(n_splits=5, shuffle=False, random_state=None)

Stratified K-fold cross-validator that performs stratified sampling when splitting.

Splits the dataset into k consecutive folds (without shuffling by default) and returns the indices of each resulting subset.

This cross-validation object is a variation of KFold that returns stratified folds; the folds are made by preserving the percentage of samples for each class. This guarantees that the class proportions in each resulting subset match those of the original dataset.
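A quick way to see the stratification (an illustrative check with toy labels, not part of the original post) is to count the labels that land in each test fold:

import numpy as np
from collections import Counter
from sklearn.model_selection import StratifiedKFold

X = np.zeros((12, 1))                # the features do not matter here
y = np.array(["a"] * 8 + ["b"] * 4)  # 2:1 class ratio overall
skf = StratifiedKFold(n_splits=4)
for train_index, test_index in skf.split(X, y):
    # each test fold keeps the same 2:1 ratio as the full label set
    print(Counter(y[test_index]))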

Parameters:

n_splits: int, default 5 (the default was 3 before v0.22); the number of folds must be at least 2.

shuffle: optional, default False, i.e. the data are not shuffled before splitting.

random_state: optional, default None; if an int, random_state is the seed used by the random number generator.

In fact, the two classes take the same constructor parameters; what differs is how their split methods are used.

split(self, X, y, groups=None)

Generates the training and test set indices used to split the data.

Parameters:

X: array-like, shape (n_samples, n_features)

Training data, where n_samples is the number of samples and n_features is the number of features.

y: array-like, shape (n_samples,)

The target variable for supervised learning problems; the stratification is based on the y labels.

groups: object

Always ignored, exists for compatibility.

Yields:

train: ndarray, the training set indices for that split.

test: ndarray, the test set indices for that split.

3 GroupKFold

sklearn.model_selection.GroupKFold

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupKFold.html#sklearn.model_selection.GroupKFold
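GroupKFold is not covered further in this post, but as a brief sketch (with made-up toy data), it keeps all samples that share a group label in the same fold, which is useful when samples from the same source must not be split across train and test sets:

import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array([1, 1, 2, 2, 3, 3])  # e.g. pairs of samples from the same subject
gkf = GroupKFold(n_splits=3)
for train_index, test_index in gkf.split(X, y, groups):
    # every group appears in exactly one test fold, never in both train and test
    print("test groups:", groups[test_index])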

4 Worked Example

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Create the training data
train_num = np.array(range(10)).reshape(5, 2)
# Create the labels
label_str = np.array(["a", "b", "a", "a", "b"])
# print("train_num:\n", train_num)
# print("label_str:\n", label_str)
# Instantiate the stratified splitter
sfk = StratifiedKFold(n_splits=3)
for train_index, test_index in sfk.split(train_num, label_str):
    new_train_num, new_test_num = train_num[train_index], train_num[test_index]
    new_train_label, new_test_label = label_str[train_index], label_str[test_index]

    print('train set:{} \n'.format(new_train_num))
    print('train label set:{} \n'.format(new_train_label))
    print('test set:{} \n'.format(new_test_num))
    print('test label set:{} \n'.format(new_test_label))

Output:

train set:[[4 5]
 [6 7]
 [8 9]] 
train label set:['a' 'a' 'b'] 
test set:[[0 1]
 [2 3]] 
test label set:['a' 'b'] 
train set:[[0 1]
 [2 3]
 [6 7]] 
train label set:['a' 'b' 'a'] 
test set:[[4 5]
 [8 9]] 
test label set:['a' 'b'] 
train set:[[0 1]
 [2 3]
 [4 5]
 [8 9]] 
train label set:['a' 'b' 'a' 'b'] 
test set:[[6 7]] 
test label set:['a'] 

Under these conditions the code runs fine (scikit-learn may warn that the least populated class has only 2 members, fewer than n_splits=3, but the split still succeeds).

If you replace StratifiedKFold with KFold in the code above, the result looks similar, but in KFold's split() the y argument defaults to None and is ignored, so when you want stratified sampling it is better to use the StratifiedKFold class.
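To illustrate the difference with the same toy data as above, KFold accepts y in split() but ignores it, so the folds are cut purely by sample position and a fold's labels can end up imbalanced:

import numpy as np
from sklearn.model_selection import KFold

train_num = np.array(range(10)).reshape(5, 2)
label_str = np.array(["a", "b", "a", "a", "b"])
kf = KFold(n_splits=3)
# y is accepted but ignored: no attempt is made to balance the classes per fold
for train_index, test_index in kf.split(train_num, label_str):
    print("test labels:", label_str[test_index])

With these labels, the second fold's test set contains only class "a", which is exactly the situation stratification prevents.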

The code above uses n_splits=3, for which StratifiedKFold and KFold behave similarly. With n_splits=4, however, KFold still runs and produces output, while StratifiedKFold raises an error:

ValueError: n_splits=4 cannot be greater than the number of members in each class.

For this reason, when you need stratified cross-validation, stick with the StratifiedKFold class.
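The error is easy to reproduce with the same toy data (illustrative snippet): label_str contains only two "b" samples, so at most two stratified folds can each receive a "b":

import numpy as np
from sklearn.model_selection import StratifiedKFold

train_num = np.array(range(10)).reshape(5, 2)
label_str = np.array(["a", "b", "a", "a", "b"])  # only 2 samples of class "b"
skf = StratifiedKFold(n_splits=4)
# Iterating the generator raises:
# ValueError: n_splits=4 cannot be greater than the number of members in each class.
for train_index, test_index in skf.split(train_num, label_str):
    pass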

Typical usage of the KFold class looks like this:

import numpy as np
from sklearn.model_selection import KFold

# Create the training data
train_num = np.array(range(10)).reshape(5, 2)
# print("train_num:\n", train_num)

kf = KFold(n_splits=4)
for train_index, test_index in kf.split(train_num):
    new_train_num, new_test_num = train_num[train_index], train_num[test_index]
    print('train set:{} \n test set:{} \n'.format(new_train_num, new_test_num))

A fragment from my own application:

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy']
              )

# Split a validation set out of the training data:
# build a 5-fold stratified cross-validation split of the data
# (random_state is only meaningful when shuffle=True, so it is omitted here)
sfd = StratifiedKFold(n_splits=5, shuffle=False)
for train_index, val_index in sfd.split(train_images, train_labels):
    print("train_images.shape:", train_images.shape)
    new_train_images, new_val_images = train_images[train_index], train_images[val_index]
    new_train_labels, new_val_labels = train_labels[train_index], train_labels[val_index]
    print("new_train_images.shape:", new_train_images.shape)

    # Train the model
    history = model.fit(new_train_images, new_train_labels,
                        epochs=100,
                        # batch_size=512,
                        validation_data=(new_val_images, new_val_labels),
                        callbacks=[visualization.model_point, visualization.model_stop]
                        )
# Plot the loss and accuracy recorded during training
visualization.plot_history(history)
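Note that in the fragment above the model is compiled once outside the loop, so the weights learned in one fold carry over to the next, and only the last fold's history is plotted. A hedged sketch of an alternative (build_model() and the toy arrays below are stand-ins, not the project code above) rebuilds the model for each fold and averages the per-fold validation accuracy, in line with the idea from the introduction:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from tensorflow import keras

def build_model():
    # Hypothetical factory: returns a freshly compiled model so that
    # every fold starts training from scratch
    model = keras.Sequential([
        keras.layers.Dense(16, activation="relu", input_shape=(4,)),
        keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Toy stand-ins for train_images / train_labels
train_images = np.random.rand(60, 4).astype("float32")
train_labels = np.tile([0, 1, 2], 20)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_index, val_index in skf.split(train_images, train_labels):
    model = build_model()  # fresh weights for every fold
    history = model.fit(train_images[train_index], train_labels[train_index],
                        epochs=5, verbose=0,
                        validation_data=(train_images[val_index],
                                         train_labels[val_index]))
    fold_scores.append(history.history["val_accuracy"][-1])

print("mean validation accuracy:", np.mean(fold_scores))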

 

 

5 train_test_split

If you are not using k-fold cross-validation, simply call the train_test_split function to split the data in one go.

from sklearn.model_selection import train_test_split
train_images, val_images, train_labels, val_labels = \
    train_test_split(train_images, train_labels,
                     test_size=val_train_ratio,
                     )
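Incidentally, train_test_split can also do a stratified split through its stratify parameter; reusing the arrays from the snippet above (and writing the split fraction out explicitly), it keeps the class proportions the same in both parts:

from sklearn.model_selection import train_test_split

# stratify=train_labels keeps the label proportions identical
# in the training and validation splits
train_images, val_images, train_labels, val_labels = \
    train_test_split(train_images, train_labels,
                     test_size=0.2,
                     stratify=train_labels,
                     random_state=0)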

 

References

sklearn.model_selection.StratifiedKFold

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html

 

sklearn.model_selection.KFold

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold

 

sklearn.model_selection.GroupKFold

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupKFold.html#sklearn.model_selection.GroupKFold

 

『Sklearn』 Data Splitting Methods

http://www.javashuo.com/article/p-wdurtplz-de.html

 

Dataset splitting in sklearn

http://www.javashuo.com/article/p-xqibacgy-dn.html

 

A discussion of KFold and StratifiedKFold cross-splitting commonly used in binary-classification competitions

http://www.luyixian.cn/news_show_24013.aspx

 

f1_score and StratifiedKFold in Sklearn

https://www.jianshu.com/p/4b9f359b4898


The difference between StratifiedKFold and KFold

https://blog.csdn.net/zhangbaoanhadoop/article/details/79559011

A comparison of StratifiedKFold and KFold

https://www.jianshu.com/p/c84818b56fa0

TensorFlow Series (2): Machine Learning Fundamentals

http://blog.itpub.net/31555081/viewspace-2218763/

Evaluating model performance with K-fold cross-validation

https://ljalphabeta.gitbooks.io/python-/content/kfold.html
