Converting Between Labels and Indices in the Spark ML Library

Date: 2021-08-14

StringIndexer

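StringIndexer encodes a string column of labels into a column of label indices. The indices lie in [0, numLabels) and are ordered by label frequency, so the most frequent label gets index 0. Unseen labels are placed at index numLabels if the user chooses to keep them. If the input column is numeric, it is cast to string and the string values are indexed. When a downstream pipeline component (for example an Estimator or a Transformer) uses this string-indexed label, the component's input column must be set to the name of the string-indexed column. In many cases you can set the input column with setInputCol.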

Example 1: suppose we have the following DataFrame with the columns id and category:

Id	Category
0	a
1	b
2	c
3	a
4	a
5	c

Applying StringIndexer to this DataFrame, with category as the input column and categoryIndex as the output column, gives the following values:

Id	Category	CategoryIndex
0	a	0.0
1	b	2.0
2	c	1.0
3	a	0.0
4	a	0.0
5	c	1.0

The label a gets index 0 because it occurs most often, followed by c with index 1 and b with index 2.
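The frequency ordering described above can be reproduced in a few lines of plain Scala. This is only an illustrative sketch, not Spark's implementation: the names counts, orderedLabels and labelToIndex are made up for this example, and breaking ties alphabetically is an assumption of the sketch (the example data contains no ties).

```scala
// Plain-Scala sketch of frequency-based indexing (illustrative, not Spark's code).
val categories = Seq("a", "b", "c", "a", "a", "c")

// Count how often each label occurs.
val counts: Map[String, Int] =
  categories.groupBy(identity).map { case (label, xs) => label -> xs.size }

// Most frequent label first; ties broken alphabetically (an assumption of this sketch).
val orderedLabels: Seq[String] =
  counts.toSeq.sortBy { case (label, n) => (-n, label) }.map(_._1)

// The position in the ordered list becomes the index, stored as a Double like categoryIndex.
val labelToIndex: Map[String, Double] =
  orderedLabels.zipWithIndex.map { case (label, i) => label -> i.toDouble }.toMap

// Print the index assigned to each input row.
categories.map(labelToIndex).foreach(println)
```

Here a occurs three times, c twice and b once, which is why the mapping matches the categoryIndex column in the table above.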

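In addition, StringIndexer offers three strategies for handling unseen labels, that is, labels that appear in the data being transformed but were not present when the indexer was fitted:

1. Throw an exception. This is the default behavior.

2. Skip the rows that contain an unseen label.

3. Put unseen labels in a special additional bucket at index numLabels.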

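Returning to the example above, suppose the StringIndexer fitted on that DataFrame is then used to transform the following dataset, in which d and e are unseen labels:

Id	Category
0	a
1	b
2	c
3	d
4	e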

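If you have not configured how StringIndexer handles unseen labels, or have set the option to "error", an exception is thrown. If you call setHandleInvalid("skip") instead, you get the following result: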

Id	Category	CategoryIndex
0	a	0.0
1	b	2.0
2	c	1.0

Note that the rows containing d and e do not appear.

If you call setHandleInvalid("keep"), you get the following result:

Id	Category	CategoryIndex
0	a	0.0
1	b	2.0
2	c	1.0
3	d	3.0
4	e	3.0

Note that d and e are both mapped to index 3.0, which equals numLabels.
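The "skip" and "keep" strategies can be tried out with a sketch like the one below, which fits the indexer on one DataFrame and transforms a second one that contains the unseen labels d and e. The names trainDf, newDf and indexWith are made up for this example, and the local SparkSession is an assumption so that the snippet stands on its own (in spark-shell, getOrCreate simply returns the existing session).

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.{DataFrame, SparkSession}

// Assumption: a local session is acceptable; in spark-shell this reuses the active one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("HandleInvalidSketch")
  .getOrCreate()

// Data used to fit the indexer: only a, b and c are seen.
val trainDf = spark.createDataFrame(
  Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
).toDF("id", "category")

// Data to transform: d and e were not seen during fitting.
val newDf = spark.createDataFrame(
  Seq((0, "a"), (1, "b"), (2, "c"), (3, "d"), (4, "e"))
).toDF("id", "category")

// Hypothetical helper: fit on trainDf with the given strategy, then transform newDf.
def indexWith(mode: String): DataFrame =
  new StringIndexer()
    .setInputCol("category")
    .setOutputCol("categoryIndex")
    .setHandleInvalid(mode) // "error" (default), "skip" or "keep"
    .fit(trainDf)
    .transform(newDf)

val skipped = indexWith("skip") // rows with unseen labels are dropped
val kept = indexWith("keep")    // unseen labels share the extra bucket at numLabels

skipped.orderBy("id").show()
kept.orderBy("id").show()
```

Calling indexWith("error") and then running an action such as show() on the result should raise an exception, which corresponds to the default behavior described above.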

A code example follows:

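import org.apache.spark.ml.feature.StringIndexer

// Assumes an active SparkSession named spark, as provided by spark-shell.
val df = spark.createDataFrame(
  Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
).toDF("id", "category")

// Read labels from category and write their indices to categoryIndex.
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")

// fit learns the label-to-index mapping; transform appends the index column.
val indexed = indexer.fit(df).transform(df)
indexed.show()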

IndexToString

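Symmetrically to StringIndexer, IndexToString maps a column of label indices back to a column containing the original labels as strings. A common use case is to produce indices from labels with StringIndexer, train a model on those indices, and then retrieve the original labels from the column of predicted indices with IndexToString. You are also free to supply your own labels.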

For example, suppose we have a DataFrame in the following form:

Id	CategoryIndex
0	0.0
1	2.0
2	1.0
3	0.0
4	0.0
5	1.0

Applying IndexToString with categoryIndex as the input column and originalCategory as the output column retrieves the original labels:

Id	originalCategory	CategoryIndex
0	a	0.0
1	b	2.0
2	c	1.0
3	a	0.0
4	a	0.0
5	c	1.0

A code example follows:

import org.apache.spark.ml.attribute.Attribute
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

// Assumes an active SparkSession named spark, as provided by spark-shell.
val df = spark.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, "a"),
  (4, "a"),
  (5, "c")
)).toDF("id", "category")

// Fit the indexer so that the resulting model holds the label-to-index mapping.
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
val indexed = indexer.transform(df)

println(s"Transformed string column '${indexer.getInputCol}' " +
  s"to indexed column '${indexer.getOutputCol}'")
indexed.show()

// The labels are stored in the metadata of the output column.
val inputColSchema = indexed.schema(indexer.getOutputCol)
println(s"StringIndexer will store labels in output column metadata: " +
  s"${Attribute.fromStructField(inputColSchema).toString} ")

// No labels are set here, so IndexToString reads them from the column metadata.
val converter = new IndexToString()
  .setInputCol("categoryIndex")
  .setOutputCol("originalCategory")

val converted = converter.transform(indexed)

println(s"Transformed indexed column '${converter.getInputCol}' back to original string " +
  s"column '${converter.getOutputCol}' using labels in metadata")
converted.select("id", "categoryIndex", "originalCategory").show()
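The example above recovers the labels from the metadata written by StringIndexer. When the index column carries no such metadata, the labels can be supplied explicitly with setLabels, as mentioned earlier. The following is a minimal sketch under that assumption: the DataFrame indexedOnly and the label order Array("a", "c", "b") are made up for this example to match the tables above, and the local SparkSession is there only so the snippet stands on its own.

```scala
import org.apache.spark.ml.feature.IndexToString
import org.apache.spark.sql.SparkSession

// Assumption: a local session is acceptable; in spark-shell this reuses the active one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("IndexToStringLabelsSketch")
  .getOrCreate()

// An index column built by hand, so it carries no label metadata.
val indexedOnly = spark.createDataFrame(
  Seq((0, 0.0), (1, 2.0), (2, 1.0), (3, 0.0), (4, 0.0), (5, 1.0))
).toDF("id", "categoryIndex")

// Position i in the array is the label for index i: 0.0 -> a, 1.0 -> c, 2.0 -> b.
val labelConverter = new IndexToString()
  .setInputCol("categoryIndex")
  .setOutputCol("originalCategory")
  .setLabels(Array("a", "c", "b"))

val restored = labelConverter.transform(indexedOnly)
restored.orderBy("id").show()
```

Explicit labels take the place of the metadata lookup, so the order of the array must match the index assignment used when the indices were produced.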

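This article is mainly translated and compiled from the official Spark website. It is tagged as original only to protect the translation effort; thank you for your understanding.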