[Feature] Feature selection

Date: 2019-11-18


Ref: 1.13. Feature selection

Ref: 1.13. 特徵選擇 (Feature selection)

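Outline

3.1 Filter methods

3.1.1 Variance threshold

3.1.2 Correlation coefficient

3.1.3 Chi-squared test

3.1.4 Mutual information

3.2 Wrapper

3.2.1 Recursive feature elimination

3.3 Embedded

3.3.1 Penalty-based feature selection

3.3.2 Tree-based feature selection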

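Class	Approach	Description
VarianceThreshold	Filter	Variance threshold
SelectKBest	Filter	Scores features with a chosen function, such as a correlation coefficient, the chi-squared test, or the maximal information coefficient
RFE	Wrapper	Recursively trains a base model and eliminates the features with the smallest weight coefficients
SelectFromModel	Embedded	Trains a base model and selects the features with the largest weight coefficients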

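Selection criteria

Features are chosen with two considerations in mind:

- Whether the feature varies: if a feature barely varies, for example its variance is close to 0, the samples are essentially identical on that feature, so it is of little use for telling samples apart.
- Correlation between the feature and the target: this one is fairly obvious. Features that correlate strongly with the target should be preferred. Apart from the variance method, every method covered in this post is based on correlation.

By how the selection is carried out, feature selection methods fall into three types:

- Filter: scores each feature by its variability or its correlation with the target, then selects features by a score threshold or by a fixed number of features to keep.
- Wrapper: uses an objective function (usually a predictive performance score) to select or exclude a few features at a time.
- Embedded: first trains a machine learning model to obtain a weight coefficient for each feature, then selects features in descending order of coefficient. It resembles the Filter approach, except that feature quality is determined through training.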

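Feature selection

Filter

1. Variance threshold

Suppose we have a dataset of boolean features and want to remove every feature that is 1 (or 0) in more than 80% of the samples. A boolean feature is a Bernoulli variable with variance p(1 - p), so the threshold is 0.8 * (1 - 0.8) = 0.16.

Result: the first column is removed, because it is 0 in five of the six samples (variance 5/36, about 0.14).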

>>> from sklearn.feature_selection import VarianceThreshold
>>> X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
>>> sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
>>> sel.fit_transform(X)
array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])
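As a quick check on the output above, here is a minimal sketch (assuming NumPy is installed alongside scikit-learn) that computes the per-column variances directly and compares them with the values the selector stores in its `variances_` attribute. The variable names `threshold` and `column_variances` are only illustrative.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]])
threshold = 0.8 * (1 - 0.8)

# Population variance of each column (ddof=0)
column_variances = X.var(axis=0)

# Fit the selector and inspect what it computed
sel = VarianceThreshold(threshold=threshold).fit(X)

print(column_variances)   # variances computed by hand
print(sel.variances_)     # variances stored by the fitted selector
print(sel.get_support())  # boolean mask of the columns that are kept
```

Only columns whose variance exceeds the threshold survive, which is why the first column disappears from the transformed array.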

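2. Chi-squared test

The chi-squared scorer supports sparse data. Two commonly used APIs:

(1) SelectKBest removes all features except the $k$ highest-scoring ones

(2) SelectPercentile removes all features except a user-specified top percentage by score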

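from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Load the iris dataset used below
iris = load_iris()

# Keep the 2 best features
rst = SelectKBest(chi2, k=2).fit_transform(iris.data, iris.target)
print(rst[:5])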
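SelectPercentile follows the same call pattern. The sketch below assumes the same iris data and keeps the top 50% of features by chi-squared score. The percentile value is chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile, chi2

iris = load_iris()

# Keep the top 50% of the 4 iris features, ranked by chi-squared score
selector = SelectPercentile(chi2, percentile=50)
X_new = selector.fit_transform(iris.data, iris.target)

print(selector.scores_)        # one chi-squared score per original feature
print(selector.get_support())  # boolean mask of the features that were kept
print(X_new.shape)
```

With four input features and `percentile=50`, two columns remain, so this should match what `SelectKBest(chi2, k=2)` keeps on this dataset.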

Parameter settings

Add noise columns (features) to test the scoring mechanism.

(1) For regression: f_regression

(2) For classification: chi2 or f_classif

#%%
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets, svm
from sklearn.feature_selection import SelectPercentile, f_classif, chi2

###############################################################################
# import some data to play with

# The iris dataset
iris = datasets.load_iris()

# Some noisy data not correlated
E = np.random.uniform(0, 0.1, size=(len(iris.data), 20))

# Add the noisy data to the informative features
X = np.hstack((iris.data, E))
y = iris.target

###############################################################################
plt.figure(1)
plt.clf()

X_indices = np.arange(X.shape[-1])

###############################################################################
# Univariate feature selection with chi-squared scoring
# (swap in the commented-out f_classif line to score with the F-test instead).
# We keep the 10% most significant features.

# selector = SelectPercentile(f_classif, percentile=10)
selector = SelectPercentile(chi2, percentile=10)
selector.fit(X, y)
scores = -np.log10(selector.pvalues_)
scores /= scores.max()
plt.bar(X_indices - .45, scores, width=.2,
        label=r'Univariate score ($-Log(p_{value})$)', color='g')
plt.legend(loc='upper right')
plt.show()

Result with f_classif

Result with chi2
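The same noise-column test can be sketched for the regression case with f_regression. The data here is synthetic and purely an assumption for illustration: two informative columns, three noise columns, and a target built from the informative ones.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.RandomState(0)
n_samples = 200

# Two informative columns followed by three pure-noise columns (synthetic data)
informative = rng.normal(size=(n_samples, 2))
noise = rng.normal(size=(n_samples, 3))
X = np.hstack((informative, noise))

# The target depends linearly on the informative columns only
y = 3.0 * informative[:, 0] - 2.0 * informative[:, 1] + rng.normal(scale=0.1, size=n_samples)

selector = SelectKBest(f_regression, k=2).fit(X, y)
print(selector.scores_)                    # F-statistic per feature
print(selector.get_support(indices=True))  # indices of the selected features
```

The noise columns should receive much lower F-scores than the first two, so the selector keeps the informative pair.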

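3. Pearson correlation coefficient

4. Mutual information

Link: https://www.zhihu.com/question/28641663/answer/41653367

To measure how each feature relates to the response variable, the usual tools in practice are the Pearson coefficient and mutual information.

The Pearson coefficient only measures linear correlation.

Mutual information captures many kinds of dependence well, but is somewhat more expensive to compute.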
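The difference between the two measures shows up on a nonlinear relationship. The following is a minimal sketch on synthetic data (an assumption, not taken from the linked answer), using `np.corrcoef` for the Pearson coefficient and scikit-learn's `mutual_info_regression` as the mutual information estimator.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, size=1000)

# One linear and one purely quadratic (symmetric) dependence on x
y_linear = 2 * x + rng.normal(scale=0.1, size=1000)
y_quadratic = x ** 2 + rng.normal(scale=0.05, size=1000)

# Pearson correlation only picks up the linear relationship
r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

# Mutual information (k-nearest-neighbour estimate) picks up both
X = x.reshape(-1, 1)
mi_linear = mutual_info_regression(X, y_linear, random_state=0)[0]
mi_quadratic = mutual_info_regression(X, y_quadratic, random_state=0)[0]

print("Pearson:", r_linear, r_quadratic)
print("Mutual information:", mi_linear, mi_quadratic)
```

On the quadratic target the Pearson coefficient stays near zero while the mutual information estimate is clearly positive.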

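Wrapper

1. Recursive feature elimination

The idea is to assign a score to every feature:

First, the predictive model is trained on the original features, and each feature is given a weight.

Next, the features with the smallest absolute weights are dropped from the feature set.

This repeats recursively until the number of remaining features reaches the desired number.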
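The loop described above can be sketched by hand. This is an illustrative approximation and not scikit-learn's own implementation. It assumes a linear SVC on the iris data, removes one feature per round, and aggregates the multi-class coefficients by summing their squares. The names `remaining` and `n_features_to_select` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

remaining = list(range(X.shape[1]))  # indices of features still in play
n_features_to_select = 2

while len(remaining) > n_features_to_select:
    # Train the model on the features that are still in the set
    model = SVC(kernel="linear", C=1).fit(X[:, remaining], y)

    # coef_ has one row per class pair; sum squared weights per feature
    importance = (model.coef_ ** 2).sum(axis=0)

    # Drop the feature with the smallest aggregated weight
    weakest = int(np.argmin(importance))
    remaining.pop(weakest)

print(remaining)  # indices of the surviving features
```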

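(1) Recursive feature elimination: an example of recursive feature elimination that shows how relevant each pixel is in a digit classification task.

(2) Recursive feature elimination with cross-validation: an example of recursive feature elimination that tunes the number of selected features automatically through cross-validation.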

print(__doc__)

from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.feature_selection import RFE
import matplotlib.pyplot as plt

# Load the digits dataset
digits = load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target

########################################################
# Create the RFE object and rank each pixel
svc = SVC(kernel="linear", C=1)
rfe = RFE(estimator=svc, n_features_to_select=1, step=1)

rfe.fit(X, y)
ranking = rfe.ranking_.reshape(digits.images[0].shape)

# Plot pixel ranking
plt.matshow(ranking)
plt.colorbar()
plt.title("Ranking of pixels with RFE")
plt.show()

Plotting the importance ranking of the 64 features (pixels) gives the following:

>>> print(ranking)

[[64 50 31 23 10 17 34 51]
 [57 37 30 43 14 32 44 52]
 [54 41 19 15 28  8 39 53]
 [55 45  9 18 20 38  1 59]
 [63 42 25 35 29 16  2 62]
 [61 40  5 11 13  6  4 58]
 [56 47 26 36 24  3 22 48]
 [60 49  7 27 33 21 12 46]]
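For the cross-validated variant, a minimal sketch with `RFECV` might look like this. It assumes a small synthetic dataset from `make_classification` instead of the digits data so that it runs quickly. The dataset parameters are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

# Synthetic data: 10 features, of which 3 are informative
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# Eliminate one feature per step and score each subset size with 5-fold CV
rfecv = RFECV(estimator=SVC(kernel="linear", C=1), step=1, cv=5, scoring="accuracy")
rfecv.fit(X, y)

print(rfecv.n_features_)  # number of features chosen by cross-validation
print(rfecv.support_)     # boolean mask of the selected features
print(rfecv.ranking_)     # rank 1 marks a selected feature
```

Unlike plain RFE, the number of features to keep is not fixed in advance. It is whatever subset size scores best under cross-validation.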

Embedded

1. Penalty-based feature selection

2. Tree-based feature selection

This topic has its own post; see: [Feature] Feature selection - Embedded topic
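As a brief preview of both embedded approaches, here is a hedged sketch using `SelectFromModel` on the iris data. The choice of `LinearSVC` with an L1 penalty, of `ExtraTreesClassifier`, and of the hyperparameter values are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# Penalty-based: an L1 penalty can drive some coefficients to zero,
# and SelectFromModel keeps the features whose coefficients survive
l1_model = LinearSVC(C=0.01, penalty="l1", dual=False, random_state=0).fit(X, y)
X_l1 = SelectFromModel(l1_model, prefit=True).transform(X)

# Tree-based: keep features whose importance is at least the mean importance
tree_model = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y)
X_tree = SelectFromModel(tree_model, prefit=True).transform(X)

print(X.shape, X_l1.shape, X_tree.shape)
```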

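Pipeline integration

In the code snippet below:

(1) We combine sklearn.svm.LinearSVC with sklearn.feature_selection.SelectFromModel to evaluate feature importance and select the most relevant features.

(2) A sklearn.ensemble.RandomForestClassifier is then trained on the transformed output, that is, on the selected features only.

Ref: sklearn.pipeline.Pipeline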

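from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# penalty="l1" requires dual=False in LinearSVC
clf = Pipeline([
  ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
  ('classification', RandomForestClassifier())
])

# X, y: the feature matrix and labels loaded earlier
clf.fit(X, y)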

Dimensionality reduction

1. Principal component analysis (PCA)

2. Linear discriminant analysis (LDA)

Goto: [Scikit-learn] 4.4 Dimensionality reduction - PCA

Ref: [Scikit-learn] 2.5 Dimensionality reduction - Probabilistic PCA & Factor Analysis

Ref: [Scikit-learn] 2.5 Dimensionality reduction - ICA

Goto: [Scikit-learn] 1.2 Dimensionality reduction - Linear and Quadratic Discriminant Analysis
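For orientation before following the links, a minimal sketch of both reducers on the iris data (an assumed example dataset) shows the main API difference: PCA is unsupervised and ignores the labels, while LDA is supervised and needs them.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA: project onto the 2 directions of largest variance (labels unused)
X_pca = PCA(n_components=2).fit_transform(X)

# LDA: project onto at most (n_classes - 1) = 2 class-separating directions
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)
```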

End.
