Spark's machine learning library ships two implementations: spark.mllib, the original lower-level API built on RDDs, and spark.ml, a higher-level API built on DataFrames that is generally more convenient and flexible to use.
Spark's feature-processing APIs cover three areas: feature extraction, feature transformation, and feature selection. This article introduces the Feature Selectors portion of the spark.ml feature-processing API through worked examples.
Feature Selectors
1. VectorSlicer
VectorSlicer cuts a sub-array out of an existing feature vector to form a new one. For example, if the original feature vector has length 10 and we want elements 5 through 10 as a new feature vector, VectorSlicer does this quickly. Slicing can be done by index (setIndices) or by attribute name (setNames); name-based slicing requires the vector column to carry attribute metadata, which is why the example below attaches an AttributeGroup.
package com.lxw1234.spark.features.selectors
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.ml.feature.VectorSlicer
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType
/**
* By http://lxw1234.com
*/
object TestVectorSlicer extends App {
val conf = new SparkConf().setMaster("local").setAppName("lxw1234.com")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
//Build the feature data
val data = Array(Row(Vectors.dense(-2.0, 2.3, 0.0)))
//Assign attribute (field) names f1, f2, f3 to the features
val defaultAttr = NumericAttribute.defaultAttr
val attrs = Array("f1", "f2", "f3").map(defaultAttr.withName)
val attrGroup = new AttributeGroup("userFeatures", attrs.asInstanceOf[Array[Attribute]])
//Build the DataFrame
val dataRDD = sc.parallelize(data)
val dataset = sqlContext.createDataFrame(dataRDD, StructType(Array(attrGroup.toStructField())))
print("原始特徵:")
dataset.take(1).foreach(println)
//Build the slicer
var slicer = new VectorSlicer().setInputCol("userFeatures").setOutputCol("features")
//Slice out columns 1 and 3 of the original feature vector by index (indices 0 and 2)
slicer.setIndices(Array(0,2))
print("output1: ")
println(slicer.transform(dataset).select("userFeatures", "features").first())
//Slice out f2 and f3 of the original feature vector by field name
slicer = new VectorSlicer().setInputCol("userFeatures").setOutputCol("features")
slicer.setNames(Array("f2","f3"))
print("output2: ")
println(slicer.transform(dataset).select("userFeatures", "features").first())
//Indices and field names can also be combined: slice out column 1 (index 0) and f2
slicer = new VectorSlicer().setInputCol("userFeatures").setOutputCol("features")
slicer.setIndices(Array(0)).setNames(Array("f2"))
print("output3: ")
println(slicer.transform(dataset).select("userFeatures", "features").first())
}
The program output is:
Original features: [[-2.0,2.3,0.0]]
output1: [[-2.0,2.3,0.0],[-2.0,0.0]]
output2: [[-2.0,2.3,0.0],[2.3,0.0]]
output3: [[-2.0,2.3,0.0],[-2.0,2.3]]
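The example above targets the Spark 1.x API (SQLContext and org.apache.spark.mllib.linalg.Vectors). As a minimal sketch, assuming Spark 2.x or later with its SparkSession entry point and the org.apache.spark.ml.linalg package, the same index-based slice could be written as follows (illustrative only, not part of the original example):
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorSlicer
import org.apache.spark.ml.linalg.Vectors
object TestVectorSlicer2 extends App {
  val spark = SparkSession.builder().master("local").appName("lxw1234.com").getOrCreate()
  //Without attribute metadata on the vector column, only index-based
  //slicing (setIndices) works; setNames would still need an AttributeGroup.
  val dataset = spark.createDataFrame(Seq(Tuple1(Vectors.dense(-2.0, 2.3, 0.0)))).toDF("userFeatures")
  val slicer = new VectorSlicer().setInputCol("userFeatures").setOutputCol("features").setIndices(Array(0, 2))
  slicer.transform(dataset).show(false)
  spark.stop()
}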
2. RFormula
RFormula transforms columns of a dataset into features according to an R model formula; the output is a feature vector plus a Double-typed label. For an introduction to R model formulae, see: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html
package com.lxw1234.spark.features.selectors
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.ml.feature.RFormula
/**
* By http://lxw1234.com
*/
object TestRFormula extends App {
val conf = new SparkConf().setMaster("local").setAppName("lxw1234.com")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
//Build the dataset
val dataset = sqlContext.createDataFrame(Seq(
(7, "US", 18, 1.0),
(8, "CA", 12, 0.0),
(9, "NZ", 15, 0.0)
)).toDF("id", "country", "hour", "clicked")
dataset.select("id", "country", "hour", "clicked").show()
//To predict clicked from country and hour,
//build an RFormula with the formula expression clicked ~ country + hour
val formula = new RFormula().setFormula("clicked ~ country + hour").setFeaturesCol("features").setLabelCol("label")
//Generate the feature vector and label
val output = formula.fit(dataset).transform(dataset)
output.select("id", "country", "hour", "clicked", "features", "label").show()
}
Program output (country is string-indexed and one-hot encoded with the last category dropped, then concatenated with hour):
+---+-------+----+-------+--------------+-----+
| id|country|hour|clicked|      features|label|
+---+-------+----+-------+--------------+-----+
|  7|     US|  18|    1.0|[0.0,0.0,18.0]|  1.0|
|  8|     CA|  12|    0.0|[1.0,0.0,12.0]|  0.0|
|  9|     NZ|  15|    0.0|[0.0,1.0,15.0]|  0.0|
+---+-------+----+-------+--------------+-----+
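Spark supports a limited subset of R formula operators: ~ (separate the label from the terms), + (concatenate terms), - (remove a term), : (interaction), and . (all columns except the label). As an illustrative sketch (this variation is not in the original article), the same label could be predicted from every column except id:
//Hypothetical variation: '.' pulls in all remaining columns, '- id' drops id
val allButId = new RFormula().setFormula("clicked ~ . - id").setFeaturesCol("features").setLabelCol("label")
allButId.fit(dataset).transform(dataset).show()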
3. ChiSqSelector
ChiSqSelector selects features (dimensionality reduction) by running a chi-squared independence test between each feature and the label, keeping the features the test ranks as most predictive.
package com.lxw1234.spark.features.selectors
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.mllib.linalg.Vectors
/**
* By http://lxw1234.com
*/
object TestChiSqSelector extends App {
val conf = new SparkConf().setMaster("local").setAppName("lxw1234.com")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
//Build the dataset
val data = Seq(
(7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
(8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
(9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
)
val df = sc.parallelize(data).toDF("id", "features", "clicked")
df.select("id", "features","clicked").show()
//Use the chi-squared test to reduce the original feature vector from 4 features to the top 3
val selector = new ChiSqSelector().setNumTopFeatures(3).setFeaturesCol("features").setLabelCol("clicked").setOutputCol("selectedFeatures")
val result = selector.fit(df).transform(df)
result.show()
}
The program prints the DataFrame with an added selectedFeatures column containing, for each row, the three features retained by the chi-squared test.
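To see which positions in the original feature vector were kept, the fitted model can be inspected. A minimal sketch, reusing the selector and df defined above:
val model = selector.fit(df)
//selectedFeatures holds the indices (into the original vector) of the retained features
println("selected feature indices: " + model.selectedFeatures.mkString(", "))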