[ML] Feature Selectors

In Spark ML, the feature-related algorithms fall into three groups: Extractors (feature extraction), Transformers (feature transformation), and Selectors (feature selection).

Ref: Three feature selection algorithms in Spark ML (VectorSlicer / RFormula / ChiSqSelector)

 

 

1. Code Examples

VectorSlicer is only a way to "manually specify" features by index; it does not provide any criterion for selecting them.

RFormula is likewise only a way to "manually specify" features by column; it does not provide a selection criterion either.

VectorSlicer
from pyspark.ml.feature import VectorSlicer
from pyspark.ml.linalg import Vectors
from pyspark.sql.types import Row

df = spark.createDataFrame([
    Row(userFeatures=Vectors.sparse(3, {0: -2.0, 1: 2.3})),
    Row(userFeatures=Vectors.dense([-2.0, 2.3, 0.0]))])
df.show()

+--------------------+
|        userFeatures|
+--------------------+
|(3,[0,1],[-2.0,2.3])|
|      [-2.0,2.3,0.0]|
+--------------------+

slicer = VectorSlicer(inputCol="userFeatures", outputCol="features", indices=[1])
output = slicer.transform(df)
output.select("userFeatures", "features").show()

+--------------------+-------------+
|        userFeatures|     features|
+--------------------+-------------+
|(3,[0,1],[-2.0,2.3])|(1,[0],[2.3])|
|      [-2.0,2.3,0.0]|        [2.3]|
+--------------------+-------------+

RFormula
from pyspark.ml.feature import RFormula

dataset = spark.createDataFrame(
    [(7, "US", 18, 1.0),
     (8, "CA", 12, 0.0),
     (9, "NZ", 15, 0.0)],
    ["id", "country", "hour", "clicked"])

formula = RFormula(
    formula="clicked ~ country + hour",  # use two features; the country column is one-hot encoded automatically
    featuresCol="features",
    labelCol="label")

output = formula.fit(dataset).transform(dataset)
output.select("features", "label").show()

+--------------+-----+
|      features|label|
+--------------+-----+
|[0.0,0.0,18.0]|  1.0|
|[0.0,1.0,12.0]|  0.0|
|[1.0,0.0,15.0]|  0.0|
+--------------+-----+

ChiSqSelector
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([
    (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
    (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
    (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)],
    ["id", "features", "clicked"])

selector = ChiSqSelector(numTopFeatures=1, featuresCol="features",
                         outputCol="selectedFeatures", labelCol="clicked")

result = selector.fit(df).transform(df)
print("ChiSqSelector output with top %d features selected" % selector.getNumTopFeatures())
result.show()

ChiSqSelector output with top 1 features selected
+---+------------------+-------+----------------+
| id|          features|clicked|selectedFeatures|
+---+------------------+-------+----------------+
|  7|[0.0,0.0,18.0,1.0]|    1.0|          [18.0]|
|  8|[0.0,1.0,12.0,0.0]|    0.0|          [12.0]|
|  9|[1.0,0.0,15.0,0.1]|    0.0|          [15.0]|
+---+------------------+-------+----------------+
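Besides numTopFeatures, recent Spark versions also let ChiSqSelector pick features by other criteria through its selectorType parameter ("percentile", "fpr", "fdr", "fwe"). A minimal sketch of the percentile mode, reusing the df defined above; the 0.5 fraction is just an illustrative value:

# Keep the top 50% of features ranked by the chi-squared statistic.
pct_selector = ChiSqSelector(selectorType="percentile", percentile=0.5,
                             featuresCol="features", outputCol="selectedFeatures",
                             labelCol="clicked")
pct_selector.fit(df).transform(df).select("features", "selectedFeatures").show()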

 

 

2. Practical Notes

Ref: [Feature] Feature selection

Outline

3.1 Filter

3.1.1 Variance-based selection

3.1.2 Correlation coefficients

3.1.3 Chi-squared test    # <---- ChiSqSelector

3.1.4 Mutual information

3.2 Wrapper

3.2.1 Recursive feature elimination

3.3 Embedded

3.3.1 Penalty-based feature selection

3.3.2 Tree-model-based feature selection (see the sketch after this outline)
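For item 3.3.2, Spark ML's tree ensembles expose featureImportances, which can serve directly as a selection score. A minimal sketch, assuming a training DataFrame train_df with a "features" vector column and a "label" column (all names hypothetical):

from pyspark.ml.classification import RandomForestClassifier

# Hypothetical input: train_df has a "features" vector column and a "label" column.
rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=50, seed=42)
rf_model = rf.fit(train_df)

# featureImportances is a vector; larger values mean more useful features.
importances = rf_model.featureImportances.toArray()
ranked = sorted(enumerate(importances), key=lambda kv: kv[1], reverse=True)
print(ranked)   # e.g. keep the top-k indices and feed them to VectorSlicer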

  

Correlation coefficients

# Pearson correlation between two numeric columns of the fraud_pd DataFrame
fraud_pd.corr('balance', 'numTrans')

# "numerical" is the list of numeric column names (defined in the referenced post).
n_numerical = len(numerical)

# Build the upper triangle of the pairwise correlation matrix; the lower
# triangle is padded with None to keep the rows aligned.
corr = []
for i in range(0, n_numerical):
    temp = [None] * i

    for j in range(i, n_numerical):
        temp.append(fraud_pd.corr(numerical[i], numerical[j]))
    corr.append(temp)

print(corr)

Output: 

[[1.0,  0.00044,  0.00027],
 [None, 1.0,     -0.00028],
 [None, None,     1.0]]
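Besides the pairwise DataFrame.corr loop above, pyspark.ml.stat.Correlation (Spark 2.2+) can compute the whole correlation matrix in one pass. A minimal sketch, assuming the columns listed in numerical are all numeric and are first assembled into a single vector column:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

# Assemble the numeric columns of fraud_pd into one vector column.
assembler = VectorAssembler(inputCols=numerical, outputCol="num_features")
vec_df = assembler.transform(fraud_pd).select("num_features")

# Returns a one-row DataFrame holding the Pearson correlation matrix.
pearson = Correlation.corr(vec_df, "num_features").head()[0]
print(pearson.toArray())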

 

 

3. Embedded

Ref: [Feature] Feature selection - Embedded

Question: can spark.ml do lasso linear regression? As of 2.4.4 there appears to be no standalone Lasso estimator in spark.ml (though LinearRegression can apply an L1 penalty via elasticNetParam=1.0), while spark.mllib does provide one; the completeness of that functionality is not very satisfying. The relevant linear methods:

classification (SVMs, logistic regression)

linear regression (least squares, Lasso, ridge)

For the latter (lasso regression), the idea is to sample the data first and then use sklearn to draw the regularization "trajectory plot" (coefficient path).

The sub-dataset is built by sampling the Spark DataFrame (via the DataFrame API or Spark SQL), as sketched below.
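A minimal sketch of that workflow, assuming a Spark DataFrame df with numeric feature columns f1..f3, a numeric "label" column, and a 10% sample (all names and the fraction are hypothetical):

import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path

# Hypothetical schema: numeric feature columns f1..f3 plus a numeric "label" column in df.
feature_cols = ["f1", "f2", "f3"]

# Sample roughly 10% of the Spark DataFrame and bring it to the driver as pandas.
sample_pd = df.select(feature_cols + ["label"]).sample(fraction=0.1, seed=42).toPandas()
X = sample_pd[feature_cols].values
y = sample_pd["label"].values

# sklearn computes the whole lasso path; plot each coefficient against the penalty strength.
alphas, coefs, _ = lasso_path(X, y)
for idx, name in enumerate(feature_cols):
    plt.plot(alphas, coefs[idx], label=name)
plt.xscale("log")
plt.xlabel("alpha")
plt.ylabel("coefficient")
plt.legend()
plt.show()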

 

End.
