Spark Java API: CountVectorizer

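Some of the machine learning algorithms that Spark provides for text processing and analysis take numeric vectors as input rather than raw text, so the text has to be converted first. There are many ways to turn text into numeric vectors; CountVectorizer is one of them.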

A CountVectorizer converts a collection of text documents into a vector representing the word count of text documents.

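Two parameters matter when building the vectors: VocabSize and MinDF. The former is the maximum size of the dictionary. The latter is the minimum number of documents a Term must appear in: a Term that appears in fewer than MinDF documents is left out of the dictionary (it is not a dictionary word).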

Suppose we have two documents: ["w1", "w2", "w4", "w5", "w2"] and ["w1", "w2", "w3"].

CountVectorizer cv = new CountVectorizer().setInputCol("text").setOutputCol("feature")
                .setVocabSize(3).setMinDF(2);

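With the parameters set in the code above, the dictionary size is 3, so it can hold at most three Terms. Across all the documents, "w1" appears in 2 documents and "w2" appears in 2 documents, so both are included in the dictionary. "w3", "w4", and "w5" each appear in only one document, so they are not dictionary words (Terms). The dictionary therefore contains only two Terms.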
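The selection rule described above can be sketched in plain Java, without Spark. This is only an illustration of the idea, not Spark's implementation: the class name VocabularySketch and the method buildVocabulary are hypothetical, and the alphabetical tie-break is an assumption added to keep the result deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

class VocabularySketch {
    /** Keeps up to vocabSize Terms that occur in at least minDF documents, most frequent first. */
    public static List<String> buildVocabulary(List<List<String>> docs, int vocabSize, int minDF) {
        Map<String, Integer> termCount = new HashMap<>(); // total occurrences over all documents
        Map<String, Integer> docFreq = new HashMap<>();   // number of documents containing the Term
        for (List<String> doc : docs) {
            for (String term : doc) {
                termCount.merge(term, 1, Integer::sum);
            }
            // A set removes repeats, so each document counts a Term at most once.
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            if (e.getValue() >= minDF) {
                candidates.add(e.getKey());
            }
        }
        // Most frequent first; ties broken alphabetically (an assumption of this sketch).
        candidates.sort((a, b) -> {
            int byCount = Integer.compare(termCount.get(b), termCount.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(candidates.subList(0, Math.min(vocabSize, candidates.size())));
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("w1", "w2", "w4", "w5", "w2"),
                Arrays.asList("w1", "w2", "w3"));
        System.out.println(buildVocabulary(docs, 3, 2)); // prints [w2, w1]
    }
}
```

Although VocabSize allows three Terms, only two pass the MinDF filter, which is why the learned dictionary ends up with two entries.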

When the dictionary is not defined, CountVectorizer iterates over the dataset twice to prepare the dictionary based on frequency and size.

CountVectorizer first scans the Dataset (the text data) to build the dictionary, then scans it again to produce the vector model (CountVectorizerModel).

When constructing the Dataset, a schema must be specified. The schema describes how to interpret each row of data in the Dataset.

StructType schema = new StructType(new StructField[]{
                new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
        });

A field inside a StructType. param: name The name of this field. param: dataType The data type of this field. param: nullable Indicates if values of this field can be null values. param: metadata The metadata of this field. The metadata should be preserved during transformation if the content of the column is not modified.

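The first parameter is the field name; the second is dataType, the data type of the field; the third indicates whether the field's value may be null; the fourth is the field's metadata.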

The complete example code:

import org.apache.spark.ml.feature.CountVectorizer;
import org.apache.spark.ml.feature.CountVectorizerModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.*;

import java.util.Arrays;
import java.util.List;

public class CounterVectorExample {
    public static void main(String[] args) {
        // Connect to a standalone cluster; adjust the master URL for your environment.
        SparkSession spark = SparkSession.builder()
                .appName("CountVectorizer")
                .master("spark://172.25.129.170:7077")
                .getOrCreate();

        // Each Row is one document, already split into Terms.
        List<Row> data = Arrays.asList(
                RowFactory.create(Arrays.asList("w1", "w2", "w3")),
                RowFactory.create(Arrays.asList("w1", "w2", "w4", "w5", "w2"))
        );
        // A single non-nullable column "text" holding an array of strings.
        StructType schema = new StructType(new StructField[]{
                new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
        });
        Dataset<Row> df = spark.createDataFrame(data, schema);

        // Learn the dictionary from the data: at most 3 Terms, each present in at least 2 documents.
        CountVectorizer cv = new CountVectorizer().setInputCol("text").setOutputCol("feature")
                .setVocabSize(3).setMinDF(2);
        CountVectorizerModel cvModel = cv.fit(df);

        // Convert every document into a count vector and print the full rows.
        cvModel.transform(df).show(false);
        spark.stop();
    }
}

By default, the output is represented as sparse vectors:

A sparse vector represented by an index array and a value array.

param: size size of the vector. param: indices index array, assume to be strictly increasing. param: values value array, must have the same length as the index array.
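To make the three fields concrete, here is a small plain-Java sketch that expands a (size, indices, values) triple into an ordinary dense array. It does not use Spark's vector classes; the class name SparseVectorSketch and the method toDense are hypothetical and exist only for illustration.

```java
import java.util.Arrays;

class SparseVectorSketch {
    /** Expands a sparse (size, indices, values) triple into a dense array. */
    public static double[] toDense(int size, int[] indices, double[] values) {
        if (indices.length != values.length) {
            throw new IllegalArgumentException("indices and values must have the same length");
        }
        double[] dense = new double[size]; // every position starts at 0.0
        for (int i = 0; i < indices.length; i++) {
            dense[indices[i]] = values[i];
        }
        return dense;
    }

    public static void main(String[] args) {
        // (4,[1,3],[2.0,5.0]): length 4, non-zero entries at positions 1 and 3.
        double[] dense = toDense(4, new int[]{1, 3}, new double[]{2.0, 5.0});
        System.out.println(Arrays.toString(dense)); // prints [0.0, 2.0, 0.0, 5.0]
    }
}
```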

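The first field is the vector length; since the dictionary here contains only 2 Terms, the resulting vectors have length 2. The second field holds the indices, and the third holds the vector values at those indices. In the dictionary built above, the Term at position 0 is w2 and the Term at position 1 is w1, so [w1, w2, w3] becomes (2,[0,1],[1.0,1.0]) and [w1, w2, w4, w5, w2] becomes (2,[0,1],[2.0,1.0]).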

Of course, we can also define the dictionary in advance by specifying it when constructing the CountVectorizerModel: ["w1", "w2", "w3"].

// prior dictionary
CountVectorizerModel cvm = new CountVectorizerModel(new String[]{"w1", "w2", "w3"})
        .setInputCol("text").setOutputCol("feature");
cvm.transform(df).show(false);

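For the text [w1, w2, w3], every Term is in the dictionary and appears once, so the sparse feature vector is (3,[0,1,2],[1.0,1.0,1.0]). Here 3 means the vector has 3 dimensions; [0,1,2] are the indices; and [1.0,1.0,1.0] means the value at each of those indices is 1.0 (each Term appears exactly once). For the text [w1, w2, w4, w5, w2], w4 and w5 are not in the dictionary, w1 appears once, and w2 appears twice, so its feature vector is (3,[0,1],[1.0,2.0]).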

As you can see, for a CountVectorizerModel the vector length is simply the size of the dictionary.
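The counting against a predefined dictionary can be sketched in plain Java as follows. This is a hedged illustration rather than what CountVectorizerModel does internally: the class FixedDictionarySketch and the method toSparse are hypothetical names, and the result is formatted as a string only to mirror the (size,[indices],[values]) notation used above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FixedDictionarySketch {
    /** Counts dictionary Terms in doc and formats the result as (size,[indices],[values]). */
    public static String toSparse(List<String> dictionary, List<String> doc) {
        double[] counts = new double[dictionary.size()];
        for (String term : doc) {
            int index = dictionary.indexOf(term);
            if (index >= 0) {          // Terms outside the dictionary are ignored
                counts[index] += 1.0;
            }
        }
        List<Integer> indices = new ArrayList<>();
        List<Double> values = new ArrayList<>();
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] > 0) {       // keep only the non-zero positions
                indices.add(i);
                values.add(counts[i]);
            }
        }
        return "(" + counts.length + ","
                + indices.toString().replace(" ", "") + ","
                + values.toString().replace(" ", "") + ")";
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("w1", "w2", "w3");
        System.out.println(toSparse(dictionary, Arrays.asList("w1", "w2", "w3")));
        // prints (3,[0,1,2],[1.0,1.0,1.0])
        System.out.println(toSparse(dictionary, Arrays.asList("w1", "w2", "w4", "w5", "w2")));
        // prints (3,[0,1],[1.0,2.0])
    }
}
```

The vector length is fixed by the dictionary, so both documents produce vectors of length 3 even though the second one only touches two positions.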

Other articles in this series:

Spark Java development environment setup and remote debugging

Original post: https://www.cnblogs.com/hapjin/p/9899164.html