簡述
There are three general approaches to feature selection. Of the wrapper methods, sklearn implements only two, both based on recursive feature elimination:
recursive feature elimination (RFE)
Uses the coef_ or feature_importances_ attribute returned by the estimator to score the importance of each feature, then removes the least important feature from the current feature set. This step is repeated recursively on the remaining features until the desired number of features is reached.
RFECV
Uses cross-validation to find the optimal number of features. If removing a feature would hurt performance, no feature is removed. This method is quite good for selecting features for a single model, but it has two drawbacks: first, it is computationally expensive; second, the best feature subset changes along with the estimator, which can sometimes work against you.
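As a minimal sketch of the ranking signal RFE relies on, the snippet below fits two estimators on the iris data and reads off coef_ and feature_importances_ (these attribute names come from sklearn's API; the choice of estimators is just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Linear models expose per-feature weights via coef_
svc = LinearSVC().fit(X, y)
print(svc.coef_.shape)  # one weight vector per class: (3, 4)

# Tree ensembles expose feature_importances_ instead
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
print(rf.feature_importances_.shape)  # one importance per feature: (4,)
```

RFE only needs one of these two attributes to exist on the fitted estimator; it uses the magnitudes to decide which feature to drop next.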
RFE
Performance gains and losses
RFE's design makes fairly convenient manual feature selection possible, but it shares a weakness: the model can perform worse on the reduced dataset than on the original one. As with variance filtering, this is because the removed features still carried useful information. The code below demonstrates this phenomenon.
```python
from sklearn.feature_selection import RFE, RFECV
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn import model_selection

iris = load_iris()
X, y = iris.data, iris.target

# Feature selection
estimator = LinearSVC()
selector = RFE(estimator=estimator, n_features_to_select=2)
X_t = selector.fit_transform(X, y)

# Split into training and test sets
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)
X_train_t, X_test_t, y_train_t, y_test_t = model_selection.train_test_split(
    X_t, y, test_size=0.25, random_state=0, stratify=y)

# Train and evaluate
clf = LinearSVC()
clf_t = LinearSVC()
clf.fit(X_train, y_train)
clf_t.fit(X_train_t, y_train_t)
print("Original DataSet: test score=%s" % (clf.score(X_test, y_test)))
print("Selected DataSet: test score=%s" % (clf_t.score(X_test_t, y_test_t)))
```
```
Original DataSet: test score=0.973684210526
Selected DataSet: test score=0.947368421053
```
As the code above shows, the model's performance does indeed drop after applying RFE. Just as with variance filtering and univariate feature selection, this method should be used with some caution.
Some important attributes and parameters
- n_features_to_select: the number of features to keep; if an integer, exactly that many features are selected; if None, half of the features are kept
- step: if an integer ≥ 1, the number of features removed at each iteration; if a float in (0, 1), the fraction of features removed at each iteration
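The two parameters can be sketched on the iris data as follows (variable names here are just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# n_features_to_select=None keeps half of the 4 features
selector_half = RFE(estimator=LinearSVC(), n_features_to_select=None)
selector_half.fit(X, y)
print(selector_half.n_features_)  # 2

# step=0.5 removes half of the features per iteration: 4 -> 2 -> 1
selector_step = RFE(estimator=LinearSVC(), n_features_to_select=1, step=0.5)
selector_step.fit(X, y)
print(selector_step.n_features_)  # 1
```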
```python
print("N_features %s" % selector.n_features_)  # number of features kept
print("Support is %s" % selector.support_)     # whether each feature is kept
print("Ranking %s" % selector.ranking_)        # importance ranking
```
```
N_features 2
Support is [False True False True]
Ranking [3 1 2 1]
```
RFECV
Principle and characteristics
RFECV uses cross-validation to keep the best-performing feature subset. Within each cross-validation split, RFE is run and the estimator (which itself stays unchanged) is scored at every number of features; the feature count with the best average score across splits is then selected, and the corresponding feature subset is kept.
Some important attributes and parameters
- step: as in RFE, the number (integer) or fraction (float in (0, 1)) of features removed at each iteration
- scoring: a string naming a sklearn scorer
- cv:
  - defaults to 3-fold
  - an integer gives the number of folds
  - an object to be used as a cross-validation generator
  - an iterable yielding train/test splits

For integer or None inputs, if the estimator is a classifier and y is binary or multiclass, sklearn.model_selection.StratifiedKFold is used; in all other cases, sklearn.model_selection.KFold is used.
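A minimal sketch of passing an explicit cross-validation generator and a scorer name to RFECV (the particular choices of estimator, fold count, and scorer here are just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# cv as an explicit cross-validation generator object,
# scoring as the name of a sklearn scorer
selector = RFECV(estimator=LinearSVC(), step=1,
                 cv=StratifiedKFold(n_splits=5), scoring='accuracy')
selector.fit(X, y)
print(selector.n_features_)
```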
If the above isn't entirely clear, take a look at the examples below, or try it out yourself.
Example 1
Verify which features should actually be kept for the dataset used in the RFE section above:
```python
iris = load_iris()
X = iris.data
y = iris.target

estimator = LinearSVC()
selector = RFECV(estimator=estimator, cv=3)
selector.fit(X, y)
print("N_features %s" % selector.n_features_)
print("Support is %s" % selector.support_)
print("Ranking %s" % selector.ranking_)
print("Grid Scores %s" % selector.grid_scores_)
```
```
N_features 4
Support is [ True  True  True  True]
Ranking [1 1 1 1]
Grid Scores [ 0.91421569  0.94689542  0.95383987  0.96691176]
```
Well, it looks like all of them should be kept.
Example 2
The real power of RFECV:
```python
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.datasets import make_classification

# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000, n_features=25, n_informative=3,
                           n_redundant=2, n_repeated=0, n_classes=8,
                           n_clusters_per_class=1, random_state=0)

# Create the RFE object and compute a cross-validated score.
svc = SVC(kernel="linear")
# The "accuracy" scoring is proportional to the number of correct
# classifications
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(2),
              scoring='accuracy')
rfecv.fit(X, y)

print("Optimal number of features : %d" % rfecv.n_features_)
print("Ranking of features : %s" % rfecv.ranking_)

# Plot number of features VS. cross-validation scores
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()
```
```
Optimal number of features : 3
Ranking of features : [ 5  1 12 19 15  6 17  1  2 21 23 11 16 10 13 22  8 14  1 20  7  9  3  4 18]
```
(Key point coming up, ahem.)
RFECV tells us that only three features are needed, which matches how the data was constructed. It also shows RFECV's considerable potential; it looks set to become an important helper for feature selection from here on. (^o^)/~
Three special multiple-comparison feature selectors
False positive rate: SelectFpr
False discovery rate: SelectFdr
Family-wise error: SelectFwe
For their practical meaning, see wiki: Multiple_comparisons_problem
Here is a code demonstration:
```python
from sklearn.feature_selection import (SelectFdr, f_classif, SelectFpr,
                                       SelectFwe, chi2, mutual_info_classif)

iris = load_iris()
X = iris.data
y = iris.target

# alpha is the highest p-value to accept (the false positive rate);
# it defaults to 0.05, and score_func defaults to f_classif
selector1 = SelectFpr(score_func=mutual_info_classif, alpha=0.5)
selector1.fit(X, y)
print("\nScores of features %s" % selector1.scores_)
print("p-values of feature scores is %s" % selector1.pvalues_)
# mutual_info_classif produces no p-values, so transform() cannot filter here
# print("Shape after transform is ", selector1.transform(X).shape)

# alpha is the upper bound on the expected false discovery rate
selector2 = SelectFdr(score_func=f_classif, alpha=4.37695696e-80)
selector2.fit(X, y)
print("\nScores of features %s" % selector2.scores_)
print("p-values of feature scores is %s" % selector2.pvalues_)
print("Shape after transform is ", selector2.transform(X).shape)

# alpha is the upper bound on the family-wise error rate
selector3 = SelectFwe(score_func=chi2, alpha=1)
selector3.fit(X, y)
print("\nScores of features %s" % selector3.scores_)
print("p-values of feature scores is %s" % selector3.pvalues_)
print("Shape after transform is ", selector3.transform(X).shape)
```
Output:
```
Scores of features [ 0.54158942  0.21711645  0.99669173  0.99043692]
p-values of feature scores is None

Scores of features [  119.26450218    47.3644614   1179.0343277    959.32440573]
p-values of feature scores is [  1.66966919e-31   1.32791652e-16   3.05197580e-91   4.37695696e-85]
Shape after transform is  (150, 2)

Scores of features [  10.81782088    3.59449902  116.16984746   67.24482759]
p-values of feature scores is [  4.47651499e-03   1.65754167e-01   5.94344354e-26   2.50017968e-15]
Shape after transform is  (150, 4)
```
A generic univariate selector: GenericUnivariateSelect
Beyond the individual selectors above, sklearn also wraps them all in a single configurable class, GenericUnivariateSelect, which lets you pick the univariate selection strategy you need via hyperparameters. There are only three hyperparameters, making it very easy to use.
- score_func: the scoring function (same meaning as before)
- mode: the selection strategy, one of 'percentile', 'k_best', 'fpr', 'fdr', 'fwe'
- param: the threshold parameter of the chosen strategy; each of the selectors above has one, and it means the same thing here
Here is a simple example:
```python
from sklearn.feature_selection import GenericUnivariateSelect, f_classif

iris = load_iris()
X = iris.data
y = iris.target

# mode: {'percentile', 'k_best', 'fpr', 'fdr', 'fwe'}
selector = GenericUnivariateSelect(score_func=f_classif, mode='fpr', param=0.5)
selector.fit(X, y)
print("\nScores of features %s" % selector.scores_)
print("p-values of feature scores is %s" % selector.pvalues_)
print("Shape after transform is ", selector.transform(X).shape)
print("Support is ", selector.get_support())
print("Params is ", selector.get_params())
```
```
Scores of features [  119.26450218    47.3644614   1179.0343277    959.32440573]
p-values of feature scores is [  1.66966919e-31   1.32791652e-16   3.05197580e-91   4.37695696e-85]
Shape after transform is  (150, 4)
Support is  [ True  True  True  True]
Params is  {'mode': 'fpr', 'param': 0.5, 'score_func': <function f_classif at 0x7f6ecee7d7b8>}
```