A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the k "folds":
1. A model is trained using k-1 of the folds as training data;
2. the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary test set), which is a major advantage in problems such as inverse inference where the number of samples is very small.
The passage above is quoted from the sklearn documentation's description of CV. It describes a rule common to all cross-validation: when solving a real problem, we can split the whole dataset into a train_set (e.g., 70%) and a test_set (30%), do cross-validation on the train_set, and only after averaging the results use the test_set to measure the model's accuracy. We should not run cross-validation directly on the whole dataset (this was one of my misconceptions about CV).
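To make that workflow concrete, here is a minimal sketch (my own illustration, not from the sklearn documentation): hold out a test set first, cross-validate on the training set only, and touch the test set just once at the end. It uses the iris dataset and the newer sklearn.model_selection module; in the older sklearn versions used later in this post, the same helpers live in sklearn.cross_validation.

from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

iris = datasets.load_iris()
# Hold out 30% of the data as the test set T; the rest is the training set D
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

clf = SVC(kernel='linear', C=1)
# 5-fold cross-validation on the training set only
scores = cross_val_score(clf, X_train, y_train, cv=5)
print("CV accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2))

# Only after model selection is finished: one final fit, one test-set evaluation
clf.fit(X_train, y_train)
print("Test accuracy: %0.3f" % clf.score(X_test, y_test))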
k-fold

I originally didn't plan to write anything about cross-validation, but I decided that I had quite a few misconceptions of my own, so I'm writing it down; if anyone reads this, feel free to point out mistakes.
1. A model is trained using k-1 of the folds as training data;
2. the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).
The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop.
Premise: the whole dataset has been split into a training set D (70%) and a test set T (30%).

The quoted passage above is the entire k-fold procedure (only the training set D is involved at this point):

1. Split the whole training set D into k equally sized subsets, then pick k-1 of them as the training data and train a model, model1.
2. Use the remaining subset Di as the validation set (it plays exactly the same role as a test set) to measure model1's accuracy. For model evaluation methods, you can refer to the ones implemented in sklearn.
3. Loop over this process k times, making sure no subset is used as the validation set twice, then average the k accuracies; that average is the accuracy of this classification method (see the sketch after this list).
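A minimal sketch of steps 1-3 (my own illustration; the toy data and the names X_D, y_D are assumptions), using sklearn's KFold to generate the k train/validation index splits:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

# Stand-in for the training set D; any (X, y) pair would do
rng = np.random.RandomState(0)
X_D = rng.rand(100, 4)
y_D = rng.randint(0, 2, 100)

kf = KFold(n_splits=5)
accuracies = []
for train_idx, val_idx in kf.split(X_D):
    model = SVC()                                  # step 1: train on k-1 folds
    model.fit(X_D[train_idx], y_D[train_idx])
    acc = model.score(X_D[val_idx], y_D[val_idx])  # step 2: validate on the held-out fold Di
    accuracies.append(acc)

print("k-fold accuracy: %0.3f" % np.mean(accuracies))  # step 3: average over the k folds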
Some may ask: which model weights θ does the averaged accuracy correspond to? To answer that, you have to be clear about what machine learning is trying to do. The goal is not to find the weights of some particular model, but, relative to the actual problem, to select a suitable model (e.g., a support vector machine) and suitable hyperparameters (e.g., the kernel function, C, and so on). The averaged accuracy above corresponds to that model + hyperparameter combination.
Once you understand k-fold, we can talk about GridSearch, because GridSearch defaults to 3-fold; without understanding cross-validation, GridSearch is hard to understand.

What it is for

GridSearch exists to solve the hyperparameter tuning problem. For example, a support vector machine's common hyperparameters include kernel, gamma, and C; tuning them by hand is too slow, and a hand-written loop can only run sequentially, not in parallel. Hence GridSearch: with it, you can directly find the best parameters.

How the parameters are passed in

param is of dict type (or a list of dicts); every combination of the values in each dict is fed into the classifier and run.
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

How the parameters are evaluated

Once the parameters are passed in, the predictive power of the model under each parameter combination has to be evaluated. GridSearch runs k-fold on the dataset, computes the mean accuracy of the model for each parameter combination, selects the best parameters, and returns them.
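To see exactly which combinations will be tried, sklearn provides a ParameterGrid helper that expands this same dict / list-of-dicts format (a small illustration of mine; in older sklearn versions it lives in sklearn.grid_search rather than sklearn.model_selection):

from sklearn.model_selection import ParameterGrid

tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]
for params in ParameterGrid(tuned_parameters):
    print(params)

This prints 2 * 4 = 8 rbf combinations plus 4 linear ones, 12 candidate parameter settings in total; GridSearch cross-validates each of them.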
In general, GridSearch runs k-fold only on the training set and does not use the test set. The test set is held back until the end: once GridSearch has selected the best model, the test set is used to measure that model's generalization ability.

Here is an example from sklearn:
from sklearn import datasets
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC

# Loading the Digits dataset
digits = datasets.load_digits()

# To apply a classifier on this data, we need to flatten the image, to
# turn the data in a (samples, feature) matrix:
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target

# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Set the parameters for the grid search
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

# Set the model evaluation method; if unclear, refer back to the k-fold section above
scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    # Construct the GridSearch classifier, 5-fold
    clf = GridSearchCV(SVC(), tuned_parameters, cv=5,
                       scoring='%s_weighted' % score)
    # Do k-fold only on the training set, then return the best model parameters
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:")
    print()
    # Print the best model parameters
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    for params, mean_score, scores in clf.grid_scores_:
        print("%0.3f (+/-%0.03f) for %r"
              % (mean_score, scores.std() * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    # Test the generalization ability of the best model on the test set
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()

The example above follows the usual pattern. The SVC in the example supports multi-class classification; by default it uses the one-vs-one (ovo) scheme. If you need to change that, you can set the parameter decision_function_shape='ovr'; see the SVC API documentation for details.
A few points to note

1. Does GridSearch support multi-class classification?

GridSearch merely assembles the parameter combinations and feeds the data into the model k-fold style, then evaluates the model's accuracy. It is not itself a new classification method, so as long as the estimator you choose can handle multi-class problems, GridSearch can too. The handwritten-digit example above is exactly such a multi-class problem. The scoring method you choose must also suit multi-class problems; when you evaluate the model with roc_auc, you need to pay attention to the data format.
2. GridSearch estimators are sometimes nested, for example in AdaBoost ensemble learning, so GridSearch needs to support nested parameters. A double underscore __ marks a parameter as nested, i.e., a parameter of the inner estimator. (I haven't tried this myself; I've only seen people say so...) Of course, GridSearch also has APIs aimed specifically at ensemble learning. A hedged sketch of the double-underscore syntax follows.
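Since the point above is untested here, treat this as an illustrative sketch rather than a verified recipe: it tunes the max_depth of the decision tree inside an AdaBoost ensemble through the nested name base_estimator__max_depth.

from sklearn.datasets import load_digits
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
# The inner estimator is reachable through the base_estimator attribute
# (newer sklearn versions renamed it to estimator, in which case the
# nested key becomes estimator__max_depth)
ada = AdaBoostClassifier(base_estimator=DecisionTreeClassifier())
param_grid = {
    'n_estimators': [50, 100],            # parameter of AdaBoost itself
    'base_estimator__max_depth': [1, 2],  # nested parameter of the inner tree
}
clf = GridSearchCV(ada, param_grid, cv=3)
clf.fit(X, y)
print(clf.best_params_)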
Nested parameters: this blog post has an example:
———2017.4.18