Spark Machine Learning API: Feature Processing (Part 2)

The Spark machine learning library ships two APIs: spark.mllib, the base API built on RDDs, and spark.ml, a higher-level API built on DataFrames. spark.ml is the more convenient and flexible of the two.

Spark's feature-processing APIs cover three areas: feature extraction, feature transformation, and feature selection. This article walks through the Feature Selectors part of the spark.ml feature-processing API by example.

Feature Selectors

1. VectorSlicer

VectorSlicer slices a subset of an existing feature vector out into a new feature vector. For example, if the original feature vector has length 10 and we want its entries 5 through 10 as the new feature vector, VectorSlicer makes that quick to do.
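As a concrete sketch of that example (hypothetical, not from the original post): VectorSlicer indices are 0-based, so entries 5 through 10 correspond to indices 4 through 9.

// Hypothetical slicer for the length-10 example above;
// indices are 0-based, so entries 5..10 are indices 4..9.
val tailSlicer = new VectorSlicer()
  .setInputCol("userFeatures")
  .setOutputCol("features")
  .setIndices(Array(4, 5, 6, 7, 8, 9))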


package com.lxw1234.spark.features.selectors

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.ml.feature.VectorSlicer
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

/**
* By  http://lxw1234.com
*/
object TestVectorSlicer extends App {

    val conf = new SparkConf().setMaster("local").setAppName("lxw1234.com")
    val sc = new SparkContext(conf)

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    // Build the feature data (an Array of Row)
    val data = Array(Row(Vectors.dense(-2.0, 2.3, 0.0)))

    // Name the entries of the feature vector f1, f2, and f3 via attributes
    val defaultAttr = NumericAttribute.defaultAttr
    val attrs = Array("f1", "f2", "f3").map(defaultAttr.withName)
    val attrGroup = new AttributeGroup("userFeatures", attrs.asInstanceOf[Array[Attribute]])

    // Build the DataFrame
    val dataRDD = sc.parallelize(data)
    val dataset = sqlContext.createDataFrame(dataRDD, StructType(Array(attrGroup.toStructField())))

    println("Original features:")
    dataset.take(1).foreach(println)

    // Build the slicer
    var slicer = new VectorSlicer().setInputCol("userFeatures").setOutputCol("features")

    // Slice the 1st and 3rd entries of the original vector by index
    slicer.setIndices(Array(0, 2))
    print("output1: ")
    println(slicer.transform(dataset).select("userFeatures", "features").first())

    // Slice f2 and f3 of the original vector by name
    slicer = new VectorSlicer().setInputCol("userFeatures").setOutputCol("features")
    slicer.setNames(Array("f2", "f3"))
    print("output2: ")
    println(slicer.transform(dataset).select("userFeatures", "features").first())

    // Indices and names can also be combined: slice the 1st entry and f2
    slicer = new VectorSlicer().setInputCol("userFeatures").setOutputCol("features")
    slicer.setIndices(Array(0)).setNames(Array("f2"))
    print("output3: ")
    println(slicer.transform(dataset).select("userFeatures", "features").first())

}

Program output:

Original features:
[[-2.0,2.3,0.0]]

output1: [[-2.0,2.3,0.0],[-2.0,0.0]]
output2: [[-2.0,2.3,0.0],[2.3,0.0]]
output3: [[-2.0,2.3,0.0],[-2.0,2.3]]
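The example above targets the older SQLContext and mllib.linalg API. On Spark 2.x the same slicing can be written against SparkSession and the newer ml.linalg vector types; a minimal sketch (the session name and builder settings here are assumptions, not from the original post):

import org.apache.spark.ml.feature.VectorSlicer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Minimal Spark 2.x sketch: SparkSession replaces SQLContext and
// ml.linalg.Vectors replaces mllib.linalg.Vectors.
val spark = SparkSession.builder().master("local").appName("lxw1234.com").getOrCreate()

val df = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(-2.0, 2.3, 0.0))
)).toDF("userFeatures")

val slicer = new VectorSlicer()
  .setInputCol("userFeatures")
  .setOutputCol("features")
  .setIndices(Array(0, 2))   // keep the 1st and 3rd entries

slicer.transform(df).show(false)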

 

2. RFormula

RFormula maps columns of a dataset to a feature vector and a Double-typed label according to an R model formula. For an introduction to R model formulae, see: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html
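Spark implements only a limited subset of the R formula operators: ~ (separates label from terms), + (adds a term), - (removes a term), : (interaction), and . (all columns except the label). As a hypothetical variation on the formula used in the example below, assuming the same columns id, country, hour, and clicked:

// "." pulls in every column except the label; "-" then drops id,
// leaving country and hour as the features.
val formula2 = new RFormula()
  .setFormula("clicked ~ . - id")
  .setFeaturesCol("features")
  .setLabelCol("label")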

package com.lxw1234.spark.features.selectors

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

import org.apache.spark.ml.feature.RFormula

/**
* By  http://lxw1234.com
*/
object TestRFormula extends App {

    val conf = new SparkConf().setMaster("local").setAppName("lxw1234.com")
    val sc = new SparkContext(conf)

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    // Build the dataset
    val dataset = sqlContext.createDataFrame(Seq(
      (7, "US", 18, 1.0),
      (8, "CA", 12, 0.0),
      (9, "NZ", 15, 0.0)
    )).toDF("id", "country", "hour", "clicked")
    dataset.select("id", "country", "hour", "clicked").show()

    // To predict clicked from country and hour,
    // build an RFormula with the expression clicked ~ country + hour
    val formula = new RFormula().setFormula("clicked ~ country + hour").setFeaturesCol("features").setLabelCol("label")
    // Produce the feature vectors and labels
    val output = formula.fit(dataset).transform(dataset)
    output.select("id", "country", "hour", "clicked", "features", "label").show()

}

Program output (the screenshot from the original post is omitted): the features column contains the one-hot-encoded country (with the last category dropped) followed by hour, and the label column is a copy of clicked.

3. ChiSqSelector

ChiSqSelector performs feature selection (dimensionality reduction) using the chi-squared test of independence: each feature is scored against the label, and the numTopFeatures features with the highest scores are kept.

package com.lxw1234.spark.features.selectors

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.mllib.linalg.Vectors

/**
* By  http://lxw1234.com
*/
object TestChiSqSelector extends App {

    val conf = new SparkConf().setMaster("local").setAppName("lxw1234.com")
    val sc = new SparkContext(conf)

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    // Build the dataset
    val data = Seq(
      (7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
      (8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
      (9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
    )
    val df = sc.parallelize(data).toDF("id", "features", "clicked")
    df.select("id", "features", "clicked").show()

    // Use the chi-squared test to reduce the original vector from 4 features to 3
    val selector = new ChiSqSelector().setNumTopFeatures(3).setFeaturesCol("features").setLabelCol("clicked").setOutputCol("selectedFeatures")

    val result = selector.fit(df).transform(df)
    result.show()

}

Program output (screenshot omitted): the result DataFrame gains a selectedFeatures column holding the three selected features for each row.
 
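Later Spark releases (2.1 and up) also expose alternative selection criteria on ChiSqSelector through setSelectorType; a minimal sketch, assuming the Spark 2.1+ API and the same df with features and clicked columns as above:

// Sketch assuming Spark 2.1+: select features by a chi-squared
// p-value threshold instead of a fixed top-N count.
val fprSelector = new ChiSqSelector()
  .setSelectorType("fpr")   // select by false positive rate
  .setFpr(0.05)             // keep features whose p-value is below 0.05
  .setFeaturesCol("features")
  .setLabelCol("clicked")
  .setOutputCol("selectedFeatures")

val fprResult = fprSelector.fit(df).transform(df)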