- Automatic validation with Keras
- Manual validation with Keras
- K-fold cross-validation with Keras
1 Splitting the Data
With a large dataset and a complex network, training takes a long time, so the data needs to be split into training, test, or validation sets. Keras offers two approaches:

- Automatic validation
- Manual validation
Keras can automatically hold out a portion of the data and evaluate on it after each epoch. The `validation_split` parameter of `fit()` specifies the fraction of the data used for validation, typically 20% or 33% of the total.
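One detail worth knowing: `validation_split` carves the validation samples off the *end* of the arrays, without shuffling. The split arithmetic can be sketched with a hypothetical `split_tail` helper (the `int(n * (1 - validation_split))` rounding assumes Keras's convention):

```python
def split_tail(n_samples, validation_split):
    """Mimic how validation_split carves off the validation set:
    the last `validation_split` fraction of the (unshuffled) samples."""
    n_train = int(n_samples * (1.0 - validation_split))
    return n_train, n_samples - n_train

# the Pima dataset has 768 rows; validation_split=0.33 holds out 254 of them
print(split_tail(768, 0.33))  # → (514, 254)
```

Because the held-out block always comes from the end, you should shuffle the data yourself beforehand if it is ordered by class.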
The following code adds automatic validation:
```python
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
# MLP with automatic validation set
from keras.models import Sequential
from keras.layers import Dense
import numpy
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load pima indians dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = Sequential()
model.add(Dense(12, input_dim=8, kernel_initializer='uniform', activation='relu'))
model.add(Dense(8, kernel_initializer='uniform', activation='relu'))
model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model
model.fit(X, Y, validation_split=0.33, epochs=150, batch_size=10)
```
During training, each epoch reports the loss and accuracy on both the training and the validation data:
2 Manual Validation
Keras can also validate against a set you split off yourself. Here we use scikit-learn's `train_test_split` function to divide the data 2:1 into training and validation sets. When calling `fit()`, pass the `validation_data` parameter, a tuple whose items are the validation inputs and outputs.
```python
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
# MLP with manual validation set
from keras.models import Sequential
from keras.layers import Dense
# from sklearn.cross_validation import train_test_split
# sklearn.cross_validation was removed in scikit-learn 0.20, so import
# train_test_split from model_selection instead
from sklearn.model_selection import train_test_split
import numpy
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load pima indians dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# split into 67% for train and 33% for test
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=seed)
# create model
model = Sequential()
model.add(Dense(12, input_dim=8, kernel_initializer='uniform', activation='relu'))
model.add(Dense(8, kernel_initializer='uniform', activation='relu'))
model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=150, batch_size=10)
```
As with automatic validation, Keras reports the training and validation results after each epoch:
3 Manual K-Fold Cross-Validation
The gold standard for evaluating machine-learning models is k-fold cross-validation, which estimates how well a model will generalize to future data. The method splits the data into K groups; one group is held out for validation while the rest are used for training, and the process repeats until every group has served as the validation set once.
Cross-validation is rarely used in deep learning because it is so computationally demanding. K is typically 5 or 10, and each fold requires its own full training run, so training time grows by that factor. However, when the dataset is small, cross-validation gives a better estimate with less variance.
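The splitting scheme described above can be sketched in a few lines of plain Python — a hypothetical `kfold_indices` helper, not part of any library:

```python
def kfold_indices(n_samples, k):
    """Yield (train, test) index lists for k-fold cross-validation.

    Each of the k contiguous folds serves as the test set exactly once;
    all remaining indices form the corresponding training set."""
    indices = list(range(n_samples))
    # distribute any remainder so fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# 10 samples, 5 folds: every sample lands in a test fold exactly once
folds = list(kfold_indices(10, 5))
```

Each training run sees 80% of the data in this 5-fold example, and the five test scores are averaged for the final estimate.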
scikit-learn provides the `StratifiedKFold` class, which we use here to split the data into 10 folds. The sampling is stratified, keeping the class distribution in each fold as close as possible to that of the whole dataset. We then train a model on each split, passing `verbose=0` to suppress the per-epoch output.
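To illustrate what "stratified" means, here is a toy sketch — a hypothetical `stratified_folds` helper, not the scikit-learn implementation — that assigns samples to folds round-robin within each class, so every fold preserves the overall class ratio:

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, round-robin within
    each class, so every fold keeps roughly the same class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

labels = [0] * 6 + [1] * 4           # 60% / 40% class split
folds = stratified_folds(labels, 2)  # each fold gets three 0s and two 1s
```

A plain (unstratified) split of a skewed dataset can easily produce a fold with almost no positive samples, which makes the fold's score meaningless; stratification avoids that.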
After each fold is trained, the script evaluates the model, prints its score, and records it. Finally, it prints the mean and standard deviation of the scores, giving a more reliable estimate of the model's performance:
```python
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
# MLP for Pima Indians Dataset with 10-fold cross validation
from keras.models import Sequential
from keras.layers import Dense
# sklearn.cross_validation was removed in scikit-learn 0.20;
# use model_selection instead
from sklearn.model_selection import StratifiedKFold
import numpy
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load pima indians dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# define 10-fold cross validation test harness
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
cvscores = []
for train, test in kfold.split(X, Y):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, kernel_initializer='uniform', activation='relu'))
    model.add(Dense(8, kernel_initializer='uniform', activation='relu'))
    model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # Fit the model
    model.fit(X[train], Y[train], epochs=150, batch_size=10, verbose=0)
    # evaluate the model
    scores = model.evaluate(X[test], Y[test], verbose=0)
    print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)
print("%.2f%% (+/- %.2f%%)" % (numpy.mean(cvscores), numpy.std(cvscores)))
```
The output is: