Main purpose: to improve the classification performance of ML methods such as decision trees and random forests.
VectorIndexer indexes the categorical (discrete-valued) features inside a dataset's feature vectors.
It can automatically decide which features are categorical and index them. Concretely, you set a parameter maxCategories: any feature in the vector with at most maxCategories distinct values is re-indexed to 0~K (K <= maxCategories-1), while any feature with more distinct values than maxCategories is treated as continuous and left unchanged. This is convoluted in words; the example below makes it clear.
VectorIndexer helps index categorical features in datasets of Vectors. It can both automatically decide which features are categorical and convert original values to category indices. Specifically, it does the following: Take an input column of type Vector and a parameter maxCategories. Decide which features should be categorical based on the number of distinct values, where features with at most maxCategories are declared categorical. Compute 0-based category indices for each categorical feature. Index categorical features and transform original feature values to indices. Indexing categorical features allows algorithms such as Decision Trees and Tree Ensembles to treat categorical features appropriately, improving performance. This transformed data could then be passed to algorithms such as DecisionTreeRegressor that handle categorical features.
Here is an example using a simple dataset:
// Define the input/output columns and set maxCategories to 5:
// a feature (i.e., a column) with more than 5 distinct values is treated as continuous.
VectorIndexerModel featureIndexerModel = new VectorIndexer()
    .setInputCol("features")
    .setMaxCategories(5)
    .setOutputCol("indexedFeatures")
    .fit(rawData);
// Add it to a Pipeline.
Pipeline pipeline = new Pipeline()
    .setStages(new PipelineStage[]{labelIndexerModel, featureIndexerModel, dtClassifier, converter});
pipeline.fit(rawData).transform(rawData).select("features", "indexedFeatures").show(20, false);
// Output:
+-------------------------+-------------------------+
|features |indexedFeatures |
+-------------------------+-------------------------+
|(3,[0,1,2],[2.0,5.0,7.0])|(3,[0,1,2],[2.0,1.0,1.0])|
|(3,[0,1,2],[3.0,5.0,9.0])|(3,[0,1,2],[3.0,1.0,2.0])|
|(3,[0,1,2],[4.0,7.0,9.0])|(3,[0,1,2],[4.0,3.0,2.0])|
|(3,[0,1,2],[2.0,4.0,9.0])|(3,[0,1,2],[2.0,0.0,2.0])|
|(3,[0,1,2],[9.0,5.0,7.0])|(3,[0,1,2],[9.0,1.0,1.0])|
|(3,[0,1,2],[2.0,5.0,9.0])|(3,[0,1,2],[2.0,1.0,2.0])|
|(3,[0,1,2],[3.0,4.0,9.0])|(3,[0,1,2],[3.0,0.0,2.0])|
|(3,[0,1,2],[8.0,4.0,9.0])|(3,[0,1,2],[8.0,0.0,2.0])|
|(3,[0,1,2],[3.0,6.0,2.0])|(3,[0,1,2],[3.0,2.0,0.0])|
|(3,[0,1,2],[5.0,9.0,2.0])|(3,[0,1,2],[5.0,4.0,0.0])|
+-------------------------+-------------------------+
Analysis: the feature vector contains 3 features: feature 0, feature 1, and feature 2. In Row=1, for example, the values 2.0, 5.0, 7.0 are transformed into 2.0, 1.0, 1.0. Notice that only features 1 and 2 were transformed; feature 0 was not. That is because feature 0 takes 6 distinct values (2, 3, 4, 5, 8, 9), more than the setMaxCategories(5) setting above, so it is treated as continuous and left unchanged.
Feature 1: (4, 5, 6, 7, 9) --> (0, 1, 2, 3, 4)
Feature 2: (2, 7, 9) --> (0, 1, 2)
Reading the output DataFrame (Row=1):
(3, [0,1,2], [2.0,5.0,7.0])  -- 3 features; features 0, 1, 2; values before transformation
(3, [0,1,2], [2.0,1.0,1.0])  -- 3 features; features 0, 1, 2; values after transformation
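The snippet above assumes a pre-built rawData Dataset and the surrounding pipeline stages. Below is a minimal self-contained sketch of just the VectorIndexer step, assuming Spark 2.x+ (the org.apache.spark.ml.linalg package) and an existing SparkSession named spark:

import java.util.Arrays;
import java.util.List;
import org.apache.spark.ml.feature.VectorIndexer;
import org.apache.spark.ml.feature.VectorIndexerModel;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// The ten sample vectors from the table above (printed in dense form here).
List<Row> rows = Arrays.asList(
    RowFactory.create(Vectors.dense(2.0, 5.0, 7.0)),
    RowFactory.create(Vectors.dense(3.0, 5.0, 9.0)),
    RowFactory.create(Vectors.dense(4.0, 7.0, 9.0)),
    RowFactory.create(Vectors.dense(2.0, 4.0, 9.0)),
    RowFactory.create(Vectors.dense(9.0, 5.0, 7.0)),
    RowFactory.create(Vectors.dense(2.0, 5.0, 9.0)),
    RowFactory.create(Vectors.dense(3.0, 4.0, 9.0)),
    RowFactory.create(Vectors.dense(8.0, 4.0, 9.0)),
    RowFactory.create(Vectors.dense(3.0, 6.0, 2.0)),
    RowFactory.create(Vectors.dense(5.0, 9.0, 2.0)));
StructType schema = new StructType(new StructField[]{
    new StructField("features", new VectorUDT(), false, Metadata.empty())});
Dataset<Row> rawData = spark.createDataFrame(rows, schema);

// Feature 0 has 6 distinct values (> 5), so it stays continuous;
// features 1 and 2 are re-indexed to 0-based category indices.
VectorIndexerModel featureIndexerModel = new VectorIndexer()
    .setInputCol("features")
    .setMaxCategories(5)
    .setOutputCol("indexedFeatures")
    .fit(rawData);
featureIndexerModel.transform(rawData).show(10, false);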
StringIndexer
With VectorIndexer understood, StringIndexer, which re-indexes a dataset's labels, is easy to follow: it applies a similar transformation idea. The example below is enough to see it.
// Define a StringIndexerModel that converts label into indexedLabel.
StringIndexerModel labelIndexerModel = new StringIndexer()
    .setInputCol("label")
    .setOutputCol("indexedLabel")
    .fit(rawData);
// Add labelIndexerModel to the Pipeline.
Pipeline pipeline = new Pipeline()
    .setStages(new PipelineStage[]{labelIndexerModel, featureIndexerModel, dtClassifier, converter});
// Inspect the result.
pipeline.fit(rawData).transform(rawData).select("label", "indexedLabel").show(20, false);
Labels are re-indexed by frequency into 0 ~ numLabels-1 (the number of classes); the most frequent label becomes 0, and so on:
label=3 appears most often (4 times), so it is indexed as 0; next is label=2 (3 times), indexed as 1; and so on.
+-----+------------+
|label|indexedLabel|
+-----+------------+
|3.0 |0.0 |
|4.0 |3.0 |
|1.0 |2.0 |
|3.0 |0.0 |
|2.0 |1.0 |
|3.0 |0.0 |
|2.0 |1.0 |
|3.0 |0.0 |
|2.0 |1.0 |
|1.0 |2.0 |
+-----+------------+
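For a self-contained run of StringIndexer on its own, here is a sketch with hypothetical (id, category) data matching the tables used in the caveats below (spark is an existing SparkSession):

import java.util.Arrays;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

Dataset<Row> df = spark.createDataFrame(Arrays.asList(
        RowFactory.create(0, "a"), RowFactory.create(1, "b"), RowFactory.create(2, "c"),
        RowFactory.create(3, "a"), RowFactory.create(4, "a"), RowFactory.create(5, "c")),
    new StructType(new StructField[]{
        new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
        new StructField("category", DataTypes.StringType, false, Metadata.empty())}));

// Frequencies: a=3, c=2, b=1, so a -> 0.0, c -> 1.0, b -> 2.0.
new StringIndexer()
    .setInputCol("category")
    .setOutputCol("categoryIndex")
    .fit(df)
    .transform(df)
    .show(false);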
Two caveats when applying StringIndexer elsewhere:
(1) StringIndexer is fundamentally a String -> index (number) mapping. For numeric -> index, it actually casts the numeric values to string first and then indexes the string values. In other words, both strings and numeric values can be re-indexed (a short sketch of the numeric case follows the handleInvalid example below).
(2) When using the fitted model to transform a new dataset, you may run into exceptions; see the example below.
StringIndexer indexes strings by frequency:
id | category | categoryIndex
----|----------|---------------
0 | a | 0.0
1 | b | 2.0
2 | c | 1.0
3 | a | 0.0
4 | a | 0.0
5 | c | 1.0
If the transformation model (a,b,c) -> (0.0, 2.0, 1.0) was fit on the data above, then using it to transform data whose categories go beyond (a,b,c), say with d and e added, runs into trouble:
id | category | categoryIndex
----|----------|---------------
0 | a | 0.0
1 | b | 2.0
2 | d | ?
3 | e | ?
4 | a | 0.0
5 | c | 1.0
Spark provides two ways to handle this:
StringIndexerModel labelIndexerModel = new StringIndexer()
    .setInputCol("label")
    .setOutputCol("indexedLabel")
    //.setHandleInvalid("error")
    .setHandleInvalid("skip")
    .fit(rawData);
(1) The default setting, .setHandleInvalid("error"), throws an exception:
org.apache.spark.SparkException: Unseen label: d,e
(2) .setHandleInvalid("skip") drops the rows containing those labels and runs normally, producing:
id | category | categoryIndex
----|----------|---------------
0 | a | 0.0
1 | b | 2.0
4 | a | 0.0
5 | c | 1.0
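As noted in caveat (1), numeric columns are cast to string before indexing. A minimal sketch with hypothetical data (spark is an existing SparkSession):

import java.util.Arrays;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

Dataset<Row> numericDf = spark.createDataFrame(Arrays.asList(
        RowFactory.create(3.0), RowFactory.create(3.0), RowFactory.create(3.0),
        RowFactory.create(2.0), RowFactory.create(2.0), RowFactory.create(1.0)),
    new StructType(new StructField[]{
        new StructField("label", DataTypes.DoubleType, false, Metadata.empty())}));

// Although "label" is numeric, StringIndexer casts it to string internally and
// indexes by frequency: 3.0 -> 0.0, 2.0 -> 1.0, 1.0 -> 2.0.
new StringIndexer()
    .setInputCol("label")
    .setOutputCol("indexedLabel")
    .fit(numericDf)
    .transform(numericDf)
    .show(false);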
IndexToString
Correspondingly, where there is a StringIndexer there should be an IndexToString. After StringIndexer re-indexes the labels, a model is trained on those re-indexed labels and then used to predict on other data, so the predicted labels are also in re-indexed form and need to be converted back. In the example below, it is the converted convetedPrediction column that corresponds to the original labels.
Symmetrically to StringIndexer, IndexToString maps a column of label indices back to a column containing the original labels as strings.
A common use case is to produce indices from labels with StringIndexer, train a model with those indices and retrieve the original labels from the column of predicted indices with IndexToString.
IndexToString converter = new IndexToString()
    .setInputCol("prediction")              // Spark's default prediction column
    .setOutputCol("convetedPrediction")     // the predicted labels converted back
    .setLabels(labelIndexerModel.labels()); // use the labels from the previously fitted labelIndexerModel
Pipeline pipeline = new Pipeline()
    .setStages(new PipelineStage[]{labelIndexerModel, featureIndexerModel, dtClassifier, converter});
pipeline.fit(rawData).transform(rawData)
    .select("label", "prediction", "convetedPrediction").show(20, false);
+-----+----------+------------------+
|label|prediction|convetedPrediction|
+-----+----------+------------------+
|3.0 |0.0 |3.0 |
|4.0 |1.0 |2.0 |
|1.0 |2.0 |1.0 |
|3.0 |0.0 |3.0 |
|2.0 |1.0 |2.0 |
|3.0 |0.0 |3.0 |
|2.0 |1.0 |2.0 |
|3.0 |0.0 |3.0 |
|2.0 |1.0 |2.0 |
|1.0 |2.0 |1.0 |
+-----+----------+------------------+
Converting between discrete and continuous features or labels
OneHotEncoder
One-hot encoding maps categorical features (discrete values, already converted to numeric indices) to one-hot vectors. This lets classifiers that expect continuous numeric inputs, such as logistic regression, use categorical (discrete) features as well.
One-hot encoding, also called one-of-N encoding, uses an N-bit status register to encode N states: each state gets its own register bit, and at any time only one bit is set. For example:
Natural binary codes: 000, 001, 010, 011, 100, 101
One-hot codes: 000001, 000010, 000100, 001000, 010000, 100000
Put another way, a feature with m possible values becomes m binary features after one-hot encoding. These features are mutually exclusive, with only one active at a time, so the data becomes sparse. The main benefits: classifiers that cannot handle categorical attributes directly can now use them, and to some extent it also expands the feature space.
One-hot encoding maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features.
// Before OneHotEncoder, convert string -> numeric index.
Dataset<Row> indexedDf = new StringIndexer()
    .setInputCol("category")
    .setOutputCol("indexCategory")
    .fit(df)
    .transform(df);
// One-hot encode the indexed categories; the result can then be fed in as continuous numeric input.
Dataset<Row> coderDf = new OneHotEncoder()
    .setInputCol("indexCategory")
    .setOutputCol("oneHotCategory") // no fit needed
    .transform(indexedDf);
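A self-contained sketch with hypothetical (id, category) data, using the same pre-Spark-3 transformer-style OneHotEncoder as above (in Spark 3.x, OneHotEncoder is an estimator and requires fit()):

import java.util.Arrays;
import org.apache.spark.ml.feature.OneHotEncoder;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

Dataset<Row> df = spark.createDataFrame(Arrays.asList(
        RowFactory.create(0, "a"), RowFactory.create(1, "b"), RowFactory.create(2, "c"),
        RowFactory.create(3, "a"), RowFactory.create(4, "a"), RowFactory.create(5, "c")),
    new StructType(new StructField[]{
        new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
        new StructField("category", DataTypes.StringType, false, Metadata.empty())}));

Dataset<Row> indexedDf = new StringIndexer()
    .setInputCol("category").setOutputCol("indexCategory")
    .fit(df).transform(df);
Dataset<Row> coderDf = new OneHotEncoder()
    .setInputCol("indexCategory").setOutputCol("oneHotCategory")
    .transform(indexedDf);
coderDf.show(false);
// With the default dropLast=true, the vectors have length numCategories-1 and the
// last index maps to the all-zero vector:
// a (0.0) -> (2,[0],[1.0]), c (1.0) -> (2,[1],[1.0]), b (2.0) -> (2,[],[])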
Bucketizer
Bucketing (binning): converts continuous numeric values into discrete categories.
For example, age is a continuous numeric feature; to convert it into discrete categories (minor, young adult, middle-aged, senior), use Bucketizer.
The segmentation boundaries are user-defined; in Spark they are given by the splits parameter, e.g.:
double[] splits = {0, 18, 35, 55, Double.POSITIVE_INFINITY};
This splits numeric ages into four segments: 0-18, 18-35, 35-55, and 55+.
If you are unsure about the lower and upper boundaries, set them to Double.NEGATIVE_INFINITY and Double.POSITIVE_INFINITY; that is always safe.
Bucketizer transforms a column of continuous features to a column of feature buckets, where the buckets are specified by users.
// double[] splits = {0, 18, 35, 55, Double.POSITIVE_INFINITY};
Dataset<Row> bucketDf = new Bucketizer()
    .setInputCol("ages")
    .setOutputCol("bucketCategory")
    .setSplits(splits) // set the segmentation boundaries
    .transform(df);
// Output:
+---+----+--------------+
|id |ages|bucketCategory|
+---+----+--------------+
|0.0|2.0 |0.0           |
|1.0|67.0|3.0           |
|2.0|36.0|2.0           |
|3.0|14.0|0.0           |
|4.0|5.0 |0.0           |
|5.0|98.0|3.0           |
|6.0|65.0|3.0           |
|7.0|23.0|1.0           |
|8.0|37.0|2.0           |
|9.0|76.0|3.0           |
+---+----+--------------+
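A self-contained sketch with hypothetical (id, ages) data (spark is an existing SparkSession). Note that each bucket is a half-open interval [split_i, split_i+1), except the last bucket, which also includes its upper bound:

import java.util.Arrays;
import org.apache.spark.ml.feature.Bucketizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

double[] splits = {Double.NEGATIVE_INFINITY, 18, 35, 55, Double.POSITIVE_INFINITY};
Dataset<Row> df = spark.createDataFrame(Arrays.asList(
        RowFactory.create(0.0, 2.0), RowFactory.create(1.0, 67.0),
        RowFactory.create(2.0, 36.0), RowFactory.create(3.0, 14.0)),
    new StructType(new StructField[]{
        new StructField("id", DataTypes.DoubleType, false, Metadata.empty()),
        new StructField("ages", DataTypes.DoubleType, false, Metadata.empty())}));

// 2.0 and 14.0 fall in (-inf,18) -> 0.0; 36.0 in [35,55) -> 2.0; 67.0 in [55,+inf) -> 3.0
new Bucketizer()
    .setInputCol("ages")
    .setOutputCol("bucketCategory")
    .setSplits(splits)
    .transform(df)
    .show(false);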
QuantileDiscretizer
Quantile discretization. Like Bucketizer (binning), it converts continuous numeric features into discrete categorical features. In fact, QuantileDiscretizer extends Estimator[Bucketizer]: fitting it produces a Bucketizer model.
Parameter 1: the difference from Bucketizer is that you no longer define the splits (boundaries) yourself; you only specify how many buckets (segments) you want. QuantileDiscretizer computes the quantiles itself and performs the discretization.
Parameter 2: the other parameter is the precision. If set to 0, exact quantiles are computed, which is an expensive operation. In addition, the lower and upper boundaries are set to negative and positive infinity, covering the entire real line.
QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features. The number of bins is set by the numBuckets parameter.
The bin ranges are chosen using an approximate algorithm (see the documentation for approxQuantile for a detailed description).
The precision of the approximation can be controlled with the relativeError parameter. When set to zero, exact quantiles are calculated (Note: Computing exact quantiles is an expensive operation).
The lower and upper bin bounds will be -Infinity and +Infinity covering all real values.
new QuantileDiscretizer()
    .setInputCol("ages")
    .setOutputCol("qdCategory")
    .setNumBuckets(4)       // set the number of buckets
    .setRelativeError(0.1)  // set the precision (relative error)
    .fit(df)
    .transform(df)
    .show(10, false);
// Example output:
+---+----+----------+
|id |ages|qdCategory|
+---+----+----------+
|0.0|2.0 |0.0 |
|1.0|67.0|3.0 |
|2.0|36.0|2.0 |
|3.0|14.0|1.0 |
|4.0|5.0 |0.0 |
|5.0|98.0|3.0 |
|6.0|65.0|2.0 |
|7.0|23.0|1.0 |
|8.0|37.0|2.0 |
|9.0|76.0|3.0 |
+---+----+----------+
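The bucket boundaries above come from approximate quantiles; they can also be inspected directly with DataFrame.stat().approxQuantile, whose relativeError argument plays the same role as setRelativeError here. A sketch, assuming the same df with an ages column as above:

// Approximate quartiles of "ages" with a relative error of 0.1;
// passing 0.0 would compute exact quantiles at a higher cost.
double[] quartiles = df.stat().approxQuantile(
    "ages", new double[]{0.25, 0.5, 0.75}, 0.1);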
Reposted from: https://blog.csdn.net/shenxiaoming77/article/details/63715525