Dataset splitting in sklearn

sklearn provides the following dataset-splitting classes:

KFold, GroupKFold, StratifiedKFold, LeaveOneGroupOut, LeavePGroupsOut, LeaveOneOut, LeavePOut, ShuffleSplit, GroupShuffleSplit, StratifiedShuffleSplit, PredefinedSplit, TimeSeriesSplit

① Dataset splitting — k-fold cross-validation: KFold, GroupKFold, StratifiedKFold

  • Split the full training set S into k disjoint subsets. If S contains m training examples, each subset holds m/k of them; call the subsets {s1, s2, ..., sk}.
  • In each round, take one of the subsets as the test set and the remaining k-1 as the training set.
  • Train the learner on those k-1 training subsets.
  • Evaluate the model on the held-out test set; the classification accuracy averaged over the k rounds is taken as the true accuracy of the model (or hypothesis).

This method makes full use of all the samples, but it is computationally heavier: the model must be trained k times and tested k times.
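The k-fold procedure described above can be sketched end-to-end with cross_val_score, which trains and scores the model once per fold and returns the k test scores. The iris dataset and LogisticRegression used here are illustrative choices, not part of the original text.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# One accuracy score per fold; their mean is the averaged
# classification rate described in the bullets above.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print(scores)
print(scores.mean())
```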

 KFold:

import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])
kf = KFold(n_splits=2)  # number of folds
kf.get_n_splits(X)
print(kf)
for train_index, test_index in kf.split(X):
    print("Train Index:", train_index, ",Test Index:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    #print(X_train, X_test, y_train, y_test)

#KFold(n_splits=2, random_state=None, shuffle=False)
#Train Index: [3 4 5] ,Test Index: [0 1 2]
#Train Index: [0 1 2] ,Test Index: [3 4 5]
GroupKFold:
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([1, 2, 3, 4, 5, 6])
group_kfold = GroupKFold(n_splits=2)
group_kfold.get_n_splits(X, y, groups)
print(group_kfold)
for train_index, test_index in group_kfold.split(X, y, groups):
    print("Train Index:", train_index, ",Test Index:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    #print(X_train, X_test, y_train, y_test)

#GroupKFold(n_splits=2)
#Train Index: [0 2 4] ,Test Index: [1 3 5]
#Train Index: [1 3 5] ,Test Index: [0 2 4]
StratifiedKFold: ensures that each fold preserves the class proportions of the full dataset
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 1, 1, 2, 2, 2])
skf = StratifiedKFold(n_splits=3)
skf.get_n_splits(X, y)
print(skf)
for train_index, test_index in skf.split(X, y):
    print("Train Index:", train_index, ",Test Index:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    #print(X_train, X_test, y_train, y_test)

#StratifiedKFold(n_splits=3, random_state=None, shuffle=False)
#Train Index: [1 2 4 5] ,Test Index: [0 3]
#Train Index: [0 2 3 5] ,Test Index: [1 4]
#Train Index: [0 1 3 4] ,Test Index: [2 5]
 

② Dataset splitting — leave-out methods: LeaveOneGroupOut, LeavePGroupsOut, LeaveOneOut, LeavePOut

  • Leave-one-out validation (LOO): given N samples, use each sample in turn as the test sample and the other N-1 samples as the training samples. This yields N classifiers and N test results, and the average of those N results measures the model's performance.
  • Compared with k-fold CV, LOO builds N models on the N samples instead of k; moreover, each of the N models is trained on N-1 samples rather than (k-1)*N/k. Assuming k is not large and k << N, LOO is more time-consuming than k-fold CV.
  • Leave-p-out validation (LeavePOut): given N samples, use every subset of P samples as the test set and the other N-P samples as the training set, yielding C(N, P) train-test pairs. Unlike LeaveOneOut and KFold, the test sets overlap when P > 1; when P = 1, it reduces to leave-one-out.
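LeaveOneGroupOut is named in the heading above but not demonstrated later; here is a minimal sketch. Each distinct group is held out once as the test set, so the number of splits equals the number of groups. The toy arrays below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 1, 2])
groups = np.array([1, 1, 2, 2])  # two groups -> two splits

logo = LeaveOneGroupOut()
print(logo.get_n_splits(X, y, groups))  # 2
for train_index, test_index in logo.split(X, y, groups):
    # each group's samples form the test set exactly once
    print("Train Index:", train_index, ",Test Index:", test_index)

#Train Index: [2 3] ,Test Index: [0 1]
#Train Index: [0 1] ,Test Index: [2 3]
```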

 LeaveOneOut: the test set holds exactly one sample

import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])
loo = LeaveOneOut()
loo.get_n_splits(X)
print(loo)
for train_index, test_index in loo.split(X, y):
    print("Train Index:", train_index, ",Test Index:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    #print(X_train, X_test, y_train, y_test)
#LeaveOneOut()
#Train Index: [1 2 3 4 5] ,Test Index: [0]
#Train Index: [0 2 3 4 5] ,Test Index: [1]
#Train Index: [0 1 3 4 5] ,Test Index: [2]
#Train Index: [0 1 2 4 5] ,Test Index: [3]
#Train Index: [0 1 2 3 5] ,Test Index: [4]
#Train Index: [0 1 2 3 4] ,Test Index: [5]
LeavePOut: the test set holds P samples
import numpy as np
from sklearn.model_selection import LeavePOut

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])
lpo = LeavePOut(p=3)
lpo.get_n_splits(X)
print(lpo)
for train_index, test_index in lpo.split(X, y):
    print("Train Index:", train_index, ",Test Index:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    #print(X_train, X_test, y_train, y_test)

#LeavePOut(p=3)
#Train Index: [3 4 5] ,Test Index: [0 1 2]
#Train Index: [2 4 5] ,Test Index: [0 1 3]
#Train Index: [2 3 5] ,Test Index: [0 1 4]
#Train Index: [2 3 4] ,Test Index: [0 1 5]
#Train Index: [1 4 5] ,Test Index: [0 2 3]
#Train Index: [1 3 5] ,Test Index: [0 2 4]
#Train Index: [1 3 4] ,Test Index: [0 2 5]
#Train Index: [1 2 5] ,Test Index: [0 3 4]
#Train Index: [1 2 4] ,Test Index: [0 3 5]
#Train Index: [1 2 3] ,Test Index: [0 4 5]
#Train Index: [0 4 5] ,Test Index: [1 2 3]
#Train Index: [0 3 5] ,Test Index: [1 2 4]
#Train Index: [0 3 4] ,Test Index: [1 2 5]
#Train Index: [0 2 5] ,Test Index: [1 3 4]
#Train Index: [0 2 4] ,Test Index: [1 3 5]
#Train Index: [0 2 3] ,Test Index: [1 4 5]
#Train Index: [0 1 5] ,Test Index: [2 3 4]
#Train Index: [0 1 4] ,Test Index: [2 3 5]
#Train Index: [0 1 3] ,Test Index: [2 4 5]
#Train Index: [0 1 2] ,Test Index: [3 4 5]
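LeavePGroupsOut, also named in the heading above but not shown, is the group analogue of LeavePOut: every combination of P groups is held out once, giving C(n_groups, P) splits. The toy arrays below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import LeavePGroupsOut

X = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array([1, 2, 1])
groups = np.array([1, 2, 3])

# hold out every pair of groups: C(3, 2) = 3 splits
lpgo = LeavePGroupsOut(n_groups=2)
print(lpgo.get_n_splits(X, y, groups))  # 3
for train_index, test_index in lpgo.split(X, y, groups):
    print("Train Index:", train_index, ",Test Index:", test_index)
```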

③ Dataset splitting — random splits: ShuffleSplit, GroupShuffleSplit, StratifiedShuffleSplit

  • The ShuffleSplit iterator produces a specified number of independent train/test splits: it first shuffles all samples randomly and then carves out the train/test pairs. A random seed (random_state) can be used to control the random number generator so that the results are reproducible.
  • ShuffleSplit is a good alternative to KFold cross-validation: it allows finer control over the number of iterations and the train/test sample proportions.
  • StratifiedShuffleSplit is a variant of ShuffleSplit that returns stratified splits, i.e. each split preserves the class proportions of the full dataset.

#ShuffleSplit shuffles the dataset and then splits it into training and test sets; the train/test proportions can be specified, and their sum may be less than 1

import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])
rs = ShuffleSplit(n_splits=3, test_size=.25, random_state=0)
rs.get_n_splits(X)
print(rs)
for train_index, test_index in rs.split(X, y):
    print("Train Index:", train_index, ",Test Index:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    #print(X_train, X_test, y_train, y_test)
print("==============================")
rs = ShuffleSplit(n_splits=3, train_size=.5, test_size=.25, random_state=0)
rs.get_n_splits(X)
print(rs)
for train_index, test_index in rs.split(X, y):
    print("Train Index:", train_index, ",Test Index:", test_index)

#ShuffleSplit(n_splits=3, random_state=0, test_size=0.25, train_size=None)
#Train Index: [1 3 0 4] ,Test Index: [5 2]
#Train Index: [4 0 2 5] ,Test Index: [1 3]
#Train Index: [1 2 4 0] ,Test Index: [3 5]
#==============================
#ShuffleSplit(n_splits=3, random_state=0, test_size=0.25, train_size=0.5)
#Train Index: [1 3 0] ,Test Index: [5 2]
#Train Index: [4 0 2] ,Test Index: [1 3]
#Train Index: [1 2 4] ,Test Index: [3 5]
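GroupShuffleSplit, listed in the heading above but not demonstrated, works like ShuffleSplit except that it shuffles and holds out whole groups rather than individual samples. The toy arrays below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 1, 2, 1, 2])
groups = np.array([1, 1, 2, 2, 3, 3])

# test_size=1 holds out one whole group (both of its samples) per split
gss = GroupShuffleSplit(n_splits=3, test_size=1, random_state=0)
for train_index, test_index in gss.split(X, y, groups):
    print("Train Index:", train_index, ",Test Index:", test_index)
```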

 #StratifiedShuffleSplit shuffles the dataset and splits it into training and test sets like ShuffleSplit (the proportions may sum to less than 1), but additionally ensures that each split preserves the class proportions of the full dataset

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 1, 2, 1, 2])
sss = StratifiedShuffleSplit(n_splits=3, test_size=.5, random_state=0)
sss.get_n_splits(X, y)
print(sss)
for train_index, test_index in sss.split(X, y):
    print("Train Index:", train_index, ",Test Index:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    #print(X_train, X_test, y_train, y_test)

#StratifiedShuffleSplit(n_splits=3, random_state=0, test_size=0.5, train_size=None)
#Train Index: [5 4 1] ,Test Index: [3 2 0]
#Train Index: [5 2 3] ,Test Index: [0 4 1]
#Train Index: [5 0 4] ,Test Index: [3 1 2]
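TimeSeriesSplit, the last iterator in the opening list, is worth a brief sketch: it never shuffles, and each successive training set contains all samples that come before the test set, which suits time-ordered data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])
tscv = TimeSeriesSplit(n_splits=3)
for train_index, test_index in tscv.split(X):
    # training indices always precede the test indices
    print("Train Index:", train_index, ",Test Index:", test_index)

#Train Index: [0 1 2] ,Test Index: [3]
#Train Index: [0 1 2 3] ,Test Index: [4]
#Train Index: [0 1 2 3 4] ,Test Index: [5]
```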