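Feature algorithms in SparkML fall into three groups: Extractors (feature extraction), Transformers (feature transformation), and Selectors (feature selection).
Ref: Three feature selection algorithms in SparkML (VectorSlicer/RFormula/ChiSqSelector)
VectorSlicer is only a way to manually specify features by index; it provides no criterion for choosing them.
RFormula is likewise only a way to manually specify features by column; it provides no criterion for choosing them either.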
VectorSlicer
from pyspark.ml.feature import VectorSlicer
from pyspark.ml.linalg import Vectors
from pyspark.sql.types import Row

df = spark.createDataFrame([
    Row(userFeatures=Vectors.sparse(3, {0: -2.0, 1: 2.3})),
    Row(userFeatures=Vectors.dense([-2.0, 2.3, 0.0]))])
df.show()

+--------------------+
|        userFeatures|
+--------------------+
|(3,[0,1],[-2.0,2.3])|
|      [-2.0,2.3,0.0]|
+--------------------+

slicer = VectorSlicer(inputCol="userFeatures", outputCol="features", indices=[1])
output = slicer.transform(df)
output.select("userFeatures", "features").show()

+--------------------+-------------+
|        userFeatures|     features|
+--------------------+-------------+
|(3,[0,1],[-2.0,2.3])|(1,[0],[2.3])|
|      [-2.0,2.3,0.0]|        [2.3]|
+--------------------+-------------+

RFormula
from pyspark.ml.feature import RFormula

dataset = spark.createDataFrame(
    [(7, "US", 18, 1.0),
     (8, "CA", 12, 0.0),
     (9, "NZ", 15, 0.0)],
    ["id", "country", "hour", "clicked"])

formula = RFormula(
    formula="clicked ~ country + hour",  # use two features; country is one-hot encoded automatically
    featuresCol="features",
    labelCol="label")

output = formula.fit(dataset).transform(dataset)
output.select("features", "label").show()

+--------------+-----+
|      features|label|
+--------------+-----+
|[0.0,0.0,18.0]|  1.0|
|[0.0,1.0,12.0]|  0.0|
|[1.0,0.0,15.0]|  0.0|
+--------------+-----+

ChiSqSelector
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([
    (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
    (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
    (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)],
    ["id", "features", "clicked"])

selector = ChiSqSelector(numTopFeatures=1, featuresCol="features",
                         outputCol="selectedFeatures", labelCol="clicked")

result = selector.fit(df).transform(df)

print("ChiSqSelector output with top %d features selected" % selector.getNumTopFeatures())
result.show()

ChiSqSelector output with top 1 features selected
+---+------------------+-------+----------------+
| id|          features|clicked|selectedFeatures|
+---+------------------+-------+----------------+
|  7|[0.0,0.0,18.0,1.0]|    1.0|          [18.0]|
|  8|[0.0,1.0,12.0,0.0]|    0.0|          [12.0]|
|  9|[1.0,0.0,15.0,0.1]|    0.0|          [15.0]|
+---+------------------+-------+----------------+
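To see roughly why the third feature is picked, here is a minimal pure-Python sketch of the Pearson chi-squared statistic, computed for each feature of the toy data above against the clicked label. It is not Spark's implementation: as far as I understand, ChiSqSelector ranks features by the p-value of this test rather than by the raw statistic, and the helper name chi_squared is made up for illustration.

```python
from collections import Counter

def chi_squared(feature, label):
    """Pearson chi-squared statistic for two equal-length categorical sequences.

    Hypothetical helper for illustration only; not part of pyspark.
    """
    n = len(feature)
    observed = Counter(zip(feature, label))   # joint counts
    feature_totals = Counter(feature)         # row totals
    label_totals = Counter(label)             # column totals
    stat = 0.0
    for f, f_count in feature_totals.items():
        for l, l_count in label_totals.items():
            expected = f_count * l_count / n  # count expected under independence
            stat += (observed[(f, l)] - expected) ** 2 / expected
    return stat

# Same toy rows as the ChiSqSelector example: (features, clicked)
rows = [
    ([0.0, 0.0, 18.0, 1.0], 1.0),
    ([0.0, 1.0, 12.0, 0.0], 0.0),
    ([1.0, 0.0, 15.0, 0.1], 0.0),
]
labels = [label for _, label in rows]
scores = [
    chi_squared([features[i] for features, _ in rows], labels)
    for i in range(4)
]
print(scores)
```

On this tiny dataset features 2 and 3 get the same, highest statistic and features 0 and 1 a lower one, so the statistic alone does not separate the top two; with only three rows the test is of course not meaningful.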
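Reference: [Feature] Feature selection
3.1 Filter
3.1.1 Variance threshold
3.1.2 Correlation coefficient
3.1.3 Chi-squared test # <---- ChiSqSelector
3.1.4 Mutual information
3.2 Wrapper
3.2.1 Recursive feature elimination
3.3 Embedded
3.3.1 Penalty-based feature selection
3.3.2 Tree-model-based feature selection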
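As a small illustration of the simplest Filter method in the outline (3.1.1), the sketch below drops columns whose variance does not exceed a threshold. It uses plain Python lists rather than a Spark DataFrame, and both the function name variance_threshold and the sample rows are assumptions made up for this example.

```python
from statistics import pvariance

def variance_threshold(rows, threshold=0.0):
    """Return indices of columns whose population variance exceeds threshold.

    Hypothetical helper for illustration; rows is a list of equal-length lists.
    """
    columns = list(zip(*rows))  # transpose rows into columns
    return [i for i, col in enumerate(columns) if pvariance(col) > threshold]

# Made-up sample: column 0 is constant, so it carries no information
rows = [
    [0.0, 1.0, 18.0],
    [0.0, 0.0, 12.0],
    [0.0, 1.0, 15.0],
]
kept = variance_threshold(rows, threshold=0.0)
print(kept)
```

A constant column is the clearest case; with a higher threshold, near-constant columns are dropped as well, so features should be on comparable scales before applying it.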
fraud_pd.corr('balance', 'numTrans')

n_numerical = len(numerical)
corr = []
for i in range(0, n_numerical):
    temp = [None] * i
    for j in range(i, n_numerical):
        temp.append(fraud_pd.corr(numerical[i], numerical[j]))
    corr.append(temp)
print(corr)
Output:
[[1.0, 0.00044, 0.00027],
[None, 1.0, -0.00028],
[None, None, 1.0]]
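One way to turn such an upper-triangular matrix into a selection decision is to list the pairs whose absolute correlation reaches a threshold, then drop one feature from each pair. The sketch below is an assumption about how one might do that: correlated_pairs and the column names f0..f2 are hypothetical, and the matrix is copied from the output above.

```python
def correlated_pairs(corr, names, threshold):
    """List (name_i, name_j, value) for off-diagonal entries with |value| >= threshold.

    Expects the upper-triangular layout used above (None below the diagonal).
    """
    pairs = []
    for i, row in enumerate(corr):
        for j in range(i + 1, len(row)):  # skip the diagonal and the None padding
            if abs(row[j]) >= threshold:
                pairs.append((names[i], names[j], row[j]))
    return pairs

corr = [[1.0, 0.00044, 0.00027],
        [None, 1.0, -0.00028],
        [None, None, 1.0]]
names = ["f0", "f1", "f2"]  # hypothetical column names

strong = correlated_pairs(corr, names, threshold=0.9)
print(strong)
```

Here no pair comes anywhere near the threshold, so the correlation filter would not remove any of the three features.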
Ref: [Feature] Feature selection - Embedded topic
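Question: can spark.ml do Lasso linear regression? 2.4.4 does not seem to have a dedicated Lasso estimator (although LinearRegression with elasticNetParam=1.0 applies a pure L1 penalty), whereas mllib lists Lasso explicitly, though its feature completeness is not very satisfying.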
classification (SVMs, logistic regression)
linear regression (least squares, Lasso, ridge)
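For the latter, sample the data first, then process it with sklearn and draw the "trajectory plot".
Sampling means using Spark SQL to draw a sub-dataset from the DataFrame.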
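Assuming the "trajectory plot" means a Lasso regularization path (how each coefficient shrinks as the L1 penalty grows), the idea can be sketched without sklearn. The code below is a hand-written coordinate-descent solver for the objective (1/(2n))·||y - Xw||² + alpha·||w||₁, not sklearn's or Spark's implementation; the function names and the toy data are made up for illustration.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; values within the penalty become 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimise (1/(2n))*||y - Xw||^2 + alpha*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                partial = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, alpha) / (z / n) if z > 0 else 0.0
    return w

# Made-up data: y depends only on the first feature (y = 2 * x0)
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.0, -2.0, 2.0, -2.0]

alphas = [0.0, 0.5, 1.0, 2.0]
path = {alpha: lasso_coordinate_descent(X, y, alpha) for alpha in alphas}
for alpha in alphas:
    print(alpha, path[alpha])
```

Plotting each coefficient in path against alpha gives the trajectory; the irrelevant second feature stays at zero throughout, and the first is driven to zero once alpha is large enough, which is what makes the L1 penalty usable as an embedded feature selector.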
End.