Feature Selection

1. Removing features with low variance

 

sklearn.feature_selection.VarianceThreshold(threshold=0.0)
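As an illustration, here is a minimal sketch of how VarianceThreshold might be used (the toy data below is made up for this example): features whose variance does not exceed the threshold are dropped.

from sklearn.feature_selection import VarianceThreshold

# Toy data: the first and last columns are constant, the middle two vary.
X = [[0, 2, 0, 3],
     [0, 1, 4, 3],
     [0, 1, 1, 3]]

# Keep only the features whose variance is above the threshold.
selector = VarianceThreshold(threshold=0.1)
X_reduced = selector.fit_transform(X)
X_reduced.shape
#(3, 2)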

 

2. Univariate feature selection

 

sklearn.feature_selection.SelectKBest(score_func=<function f_classif>, k=10)

Keeps the k highest-scoring features and discards the rest.

sklearn.feature_selection.SelectPercentile(score_func=<function f_classif>, percentile=10)

 

f_regression (univariate linear regression test) is used for regression tasks;
chi2 (the chi-squared test), f_classif (the ANOVA F-value), and similar score functions are used for classification tasks.

 

SelectPercentile keeps the highest-scoring features up to the given percentage of all features.

SelectFpr keeps the features whose test p-values fall below alpha (a false-positive-rate test):

sklearn.feature_selection.SelectFpr(score_func=<function f_classif>, alpha=0.05)

 

GenericUnivariateSelect performs univariate selection with a configurable strategy (mode) and parameter:

sklearn.feature_selection.GenericUnivariateSelect(score_func=<function f_classif>, mode='percentile', param=1e-05)
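To make the univariate selectors above concrete, here is a small sketch using SelectKBest with the chi2 score function on the iris data (any of the other selectors can be substituted in the same way); it keeps the two highest-scoring features:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
X, y = iris.data, iris.target
X.shape
#(150, 4)

# Keep the 2 features with the highest chi-squared scores.
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)
X_new.shape
#(150, 2)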

 

 

3. Recursive feature elimination (RFE)

The main idea of recursive feature elimination is to repeatedly build a model (e.g. an SVM or a regression model), pick out the best (or worst) feature (for example according to its coefficient), set that feature aside, and then repeat the process on the remaining features until all features have been processed. The order in which features are eliminated is the feature ranking, so this is a greedy algorithm for finding an optimal feature subset.
The stability of RFE depends largely on which underlying model is used at each iteration. For example, if RFE is built on plain, unregularized regression, which is unstable, then RFE is unstable as well; if it is built on Ridge, whose regularized regression is stable, then RFE is stable.

 

 

class sklearn.feature_selection.RFECV(estimator, step=1, cv=None, scoring=None, estimator_params=None, verbose=0)

from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.feature_selection import RFE
import matplotlib.pyplot as plt

# Load the digits dataset
digits = load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target

# Create the RFE object and rank each pixel
svc = SVC(kernel="linear", C=1)
rfe = RFE(estimator=svc, n_features_to_select=1, step=1)
rfe.fit(X, y)
ranking = rfe.ranking_.reshape(digits.images[0].shape)

# Plot pixel ranking
plt.matshow(ranking)
plt.colorbar()
plt.title("Ranking of pixels with RFE")
plt.show()
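The class signature quoted above is actually RFECV, which wraps RFE with cross-validation so that the number of features to keep is chosen automatically. A brief sketch of its use (the exact constructor arguments vary slightly between scikit-learn versions):

from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFECV

iris = load_iris()
X, y = iris.data, iris.target

# Recursive feature elimination with cross-validation: the number of
# features to retain is picked by cross-validated accuracy.
svc = SVC(kernel="linear", C=1)
rfecv = RFECV(estimator=svc, step=1, cv=5, scoring="accuracy")
rfecv.fit(X, y)

rfecv.n_features_   # number of features selected by cross-validation
rfecv.support_      # boolean mask of the retained features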

4. Feature selection using SelectFromModel

SelectFromModel is a meta-transformer that can be used together with any estimator (machine-learning algorithm) that exposes a coef_ or feature_importances_ attribute after training.
If the corresponding coef_ or feature_importances_ values are below the configured threshold parameter, those features are considered unimportant and are removed. Besides specifying the threshold numerically, you can also pass a string so that a built-in heuristic is used to find a threshold. Available string arguments include "mean", "median", and floating-point multiples of these, e.g. "0.1*mean".

sklearn.feature_selection.SelectFromModel(estimator, threshold=None, prefit=False)
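For example, a minimal sketch of the string-threshold heuristic described above (the random-forest estimator and the iris data are only placeholders chosen for this illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris()
X, y = iris.data, iris.target

# Keep only the features whose importance exceeds the mean importance.
sfm = SelectFromModel(RandomForestClassifier(n_estimators=100), threshold="mean")
X_reduced = sfm.fit_transform(X, y)
X_reduced.shape
#(150, 2)  # typically two features survive for iris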

 

Used together with Lasso, the example below selects the two most informative features from the boston dataset.

import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import load_boston
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

# Load the boston dataset.
boston = load_boston()
X, y = boston['data'], boston['target']

# We use the base estimator LassoCV since the L1 norm promotes sparsity of features.
clf = LassoCV()

# Set a minimum threshold of 0.25
sfm = SelectFromModel(clf, threshold=0.25)
sfm.fit(X, y)
n_features = sfm.transform(X).shape[1]

# Reset the threshold till the number of features equals two.
# Note that the attribute can be set directly instead of repeatedly
# fitting the metatransformer.
while n_features > 2:
    sfm.threshold += 0.1
    X_transform = sfm.transform(X)
    n_features = X_transform.shape[1]

# Plot the selected two features from X.
plt.title(
    "Features selected from Boston using SelectFromModel with "
    "threshold %0.3f." % sfm.threshold)
feature1 = X_transform[:, 0]
feature2 = X_transform[:, 1]
plt.plot(feature1, feature2, 'r.')
plt.xlabel("Feature number 1")
plt.ylabel("Feature number 2")
plt.ylim([np.min(feature2), np.max(feature2)])
plt.show()

 

4.1. L1-based feature selection

 

L1 regularization adds the l1 norm of the coefficient vector w to the loss function as a penalty term. Because the penalty acts on the absolute values of the coefficients, it drives the coefficients of weak features to exactly 0. Models learned with L1 regularization therefore tend to be sparse (many coefficients w are 0), which makes L1 regularization a good feature-selection method.

 

from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
iris = load_iris()
X, y = iris.data, iris.target
X.shape
#(150, 4)
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
X_new.shape
#(150, 3)

 

 

4.2. Randomized sparse models

A limitation of L1-based sparse models is how they handle groups of correlated features: the model selects only one feature from each group. To mitigate this, randomized feature-selection methods can be used: perturb the design matrix or subsample the data, re-estimate the sparse model many times, and count how often each regressor is selected.

RandomizedLasso implements this strategy with Lasso, for regression settings:

sklearn.linear_model.RandomizedLasso(alpha='aic', scaling=0.5, sample_fraction=0.75, n_resampling=200, selection_threshold=0.25, fit_intercept=True, verbose=False, normalize=True, precompute='auto', max_iter=500, eps=2.2204460492503131e-16, random_state=None, n_jobs=1, pre_dispatch='3*n_jobs', memory=Memory(cachedir=None))
RandomizedLogisticRegression uses logistic regression and is suited to classification tasks:

sklearn.linear_model.RandomizedLogisticRegression(C=1, scaling=0.5, sample_fraction=0.75, n_resampling=200, selection_threshold=0.25, tol=0.001, fit_intercept=True, verbose=False, normalize=True, random_state=None, n_jobs=1, pre_dispatch='3*n_jobs', memory=Memory(cachedir=None))
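A brief usage sketch, assuming one of the older scikit-learn releases in which these classes were still available (they were deprecated and later removed in newer releases); RandomizedLogisticRegression is used in the same way for classification:

from sklearn.datasets import load_boston
from sklearn.linear_model import RandomizedLasso

boston = load_boston()
X, y = boston.data, boston.target

# Refit Lasso on many subsampled / rescaled versions of the data and
# count how often each feature is selected (stability selection).
rlasso = RandomizedLasso(alpha=0.025, n_resampling=200)
rlasso.fit(X, y)

rlasso.scores_        # selection frequency of each feature
rlasso.get_support()  # features above selection_threshold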

 

 

4.3. Tree-based feature selection
Tree-based estimators (see the sklearn.tree module and the forests of trees in the sklearn.ensemble module) can be used to compute feature importances, which in turn can be used to discard irrelevant features (when combined with the sklearn.feature_selection.SelectFromModel meta-transformer):

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
iris = load_iris()
X, y = iris.data, iris.target
X.shape
#(150, 4)
clf = ExtraTreesClassifier()
clf = clf.fit(X, y)
clf.feature_importances_
#array([ 0.04..., 0.05..., 0.4..., 0.4...])
model = SelectFromModel(clf, prefit=True)
X_new = model.transform(X)
X_new.shape
#(150, 2)

 

 

5. Feature selection as part of a pipeline
Feature selection is usually used as a preprocessing step before the actual learning. The recommended way to do this in scikit-learn is with sklearn.pipeline.Pipeline:

sklearn.pipeline.Pipeline(steps)
Pipeline of transforms with a final estimator.
It sequentially applies a list of transforms followed by a final estimator. The intermediate steps of the pipeline must be 'transforms', i.e. they must implement both the fit and transform methods; the final estimator only needs to implement fit.

The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. It therefore lets you set the parameters of the different steps using their names, separated by '__', as shown in the example below:

from sklearn import svm
from sklearn.datasets import samples_generator
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.pipeline import Pipeline
# generate some data to play with
X, y = samples_generator.make_classification(
    n_informative=5, n_redundant=0, random_state=42)
# ANOVA SVM-C
anova_filter = SelectKBest(f_regression, k=5)
clf = svm.SVC(kernel='linear')
anova_svm = Pipeline([('anova', anova_filter), ('svc', clf)])
# You can set the parameters using the names issued
# For instance, fit using a k of 10 in the SelectKBest
# and a parameter 'C' of the svm
anova_svm.set_params(anova__k=10, svc__C=.1).fit(X, y)
# Pipeline(steps=[...])
prediction = anova_svm.predict(X)
anova_svm.score(X, y)
# 0.77...
# getting the selected features chosen by anova_filter
anova_svm.named_steps['anova'].get_support()
#array([ True,  True,  True, False, False,  True, False,  True,  True,
#        True, False, False,  True, False,  True, False, False, False,
#        False,  True], dtype=bool)
A minimal syntax example:

from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

clf = Pipeline([
    ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
    ('classification', RandomForestClassifier())
])
clf.fit(X, y)

 

Author: 面向將來的歷史 | Source: CSDN | Original: https://blog.csdn.net/a1368783069/article/details/52048349
