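This technical column is the author's (Qin Kaixin) summary and distillation of day-to-day work. It draws its cases from real commercial environments and offers tuning advice for production use, cluster capacity planning, and related topics. Please keep following this blog series. Copyright notice: reproduction is prohibited; you are welcome to learn from it. QQ email: 1120746959@qq.com. Feel free to get in touch about any business matters.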
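Feature engineering in Spark falls into the following four areas:
feature extraction, feature transformation, feature selection, and locality sensitive hashing (LSH). The overall Spark ML feature engineering toolkit is outlined below: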
1: Feature Extractors
TF-IDF
Word2Vec
CountVectorizer
FeatureHasher
2:Feature Transformers
Tokenizer
StopWordsRemover
n-gram
Binarizer
PCA
PolynomialExpansion
Discrete Cosine Transform (DCT)
StringIndexer
IndexToString
OneHotEncoder (Deprecated since 2.3.0)
OneHotEncoderEstimator
VectorIndexer
Interaction
Normalizer
StandardScaler
MinMaxScaler
MaxAbsScaler
Bucketizer
ElementwiseProduct
SQLTransformer
VectorAssembler
VectorSizeHint
QuantileDiscretizer
Imputer
3:Feature Selectors
VectorSlicer
RFormula
ChiSqSelector
4: Locality Sensitive Hashing
LSH Operations:
Feature Transformation
Approximate Similarity Join
Approximate Nearest Neighbor Search
LSH Algorithms:
Bucketed Random Projection for Euclidean Distance
MinHash for Jaccard Distance
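Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect how important a term is to a document in the corpus.
Denote a term by t, a document by d, and the corpus by D. Term frequency TF(t, d) is the number of times term t appears in document d, and document frequency DF(t, D) is the number of documents that contain term t. If only term frequency is used to measure importance, it is easy to over-emphasize terms that appear very often but carry little real information, such as "a", "the", and "of". A term that appears often across the corpus does not help much in telling documents apart. Inverse document frequency is a numerical measure of how much information a term provides for distinguishing documents. It is defined as:
$$IDF(t, D) = \log \frac{|D| + 1}{DF(t, D) + 1}$$
Here |D| is the total number of documents in the corpus. Because a logarithm is used, the IDF of a term that appears in every document is 0. The 1 added to numerator and denominator avoids division by zero. The TF-IDF measure is the product of the two:
$$TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D)$$
In the Spark ML library, TF-IDF is split into two parts: TF (+hashing) and IDF.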
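TF: HashingTF is a Transformer. In text processing it takes sets of terms and converts them into fixed-length feature vectors. While hashing, it also counts the frequency of each term.
IDF: IDF is an Estimator; calling its fit() method on a dataset produces an IDFModel. The IDFModel takes feature vectors (such as those produced by HashingTF) and rescales each feature, down-weighting terms that appear frequently in the corpus.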
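spark.mllib implements term frequency counting with the hashing trick: a raw feature is mapped to an index by a hash function, and term frequencies are then computed from the mapped indices. This avoids building a global one-to-one term-to-index map, which becomes expensive for a large corpus. The trade-off is potential hash collisions, where different raw features are mapped to the same index. To lower the chance of a collision, increase the dimension of the feature vector, i.e., the number of buckets in the hash table. The default feature dimension in spark.mllib is 2^20 = 1,048,576 (HashingTF in spark.ml defaults to 2^18 = 262,144).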
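The hashing trick described above can be sketched in a few lines of plain Scala. This is only an illustration under stated assumptions: the object and method names are hypothetical, and String.hashCode stands in for the MurmurHash3 function that Spark's HashingTF uses, so the indices will not match Spark's output.

```scala
// Simplified sketch of the hashing trick. String.hashCode is a stand-in
// (an assumption for illustration); Spark's HashingTF uses MurmurHash3.
object HashingTrickSketch {
  // Map a term to a bucket index in [0, numFeatures).
  def indexOf(term: String, numFeatures: Int): Int =
    java.lang.Math.floorMod(term.hashCode, numFeatures)

  // Count term frequencies per bucket; colliding terms share one bucket.
  def termFrequencies(tokens: Seq[String], numFeatures: Int): Map[Int, Double] =
    tokens.groupBy(indexOf(_, numFeatures)).map { case (idx, ts) => idx -> ts.size.toDouble }

  def main(args: Array[String]): Unit = {
    val tokens = "i i i spark and i i spark".split(" ").toSeq
    // With 2000 buckets the three distinct terms land in separate buckets here.
    println(termFrequencies(tokens, 2000))
    // With a single bucket every term collides and the counts are merged.
    println(termFrequencies(tokens, 1))
  }
}
```

Shrinking numFeatures makes collisions more likely, which is why a larger hash table is the usual remedy.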
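We start with a set of sentences. First, Tokenizer splits each sentence into individual words. For each sentence (bag of words), HashingTF converts the words into a feature vector, and IDF then rescales the feature vectors. This rescaling generally improves performance when text is used as features.
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession
// Assumes an active SparkSession named spark, as provided by spark-shell.
import spark.implicits._
val sentenceData = spark.createDataFrame(Seq(
  (0, "I I I Spark and I I Spark"),
  (0, "I wish wish wish wish wish classes"),
  (1, "Logistic regression regression regression regression regression I love it ")
)).toDF("label", "sentence")
// Tokenizer lowercases each sentence and splits it on whitespace.
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)
wordsData.show(false)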
+-----+--------------------------------------------------------------------------+-----------------------------------------------------------------------------------+
|label|sentence |words |
+-----+--------------------------------------------------------------------------+-----------------------------------------------------------------------------------+
|0 |I I I Spark and I I Spark |[i, i, i, spark, and, i, i, spark] |
|0 |I wish wish wish wish wish classes |[i, wish, wish, wish, wish, wish, classes] |
|1 |Logistic regression regression regression regression regression I love it |[logistic, regression, regression, regression, regression, regression, i, love, it]|
+-----+--------------------------------------------------------------------------+-----------------------------------------------------------------------------------+
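With the tokenized documents in hand, HashingTF's transform() method hashes each sentence into a feature vector. Here the number of hash buckets is set to 2000.
val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(2000)
val featurizedData = hashingTF.transform(wordsData)
Each token sequence is turned into a sparse feature vector: every word is hashed to an index, and the value at that index is the number of times the word occurs in the document. In the output below, index 495 shows up in both the second and the third row even though the only word the two sentences share is i (index 1329), so two different words have collided in the same bucket.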
featurizedData.select("rawFeatures").show(false)
+----------------------------------------------------+
|rawFeatures |
+----------------------------------------------------+
|(2000,[333,1105,1329],[1.0,2.0,5.0]) |
|(2000,[495,1329,1809],[5.0,1.0,1.0]) |
|(2000,[240,495,695,1329,1604],[1.0,1.0,5.0,1.0,1.0])|
+----------------------------------------------------+
featurizedData.rdd.foreach(println)
[0,I I I Spark and I I Spark,WrappedArray(i, i, i, spark, and, i, i, spark),(2000,[333,1105,1329],[1.0,2.0,5.0])]
[0,I wish wish wish wish wish classes,WrappedArray(i, wish, wish, wish, wish, wish, classes),(2000,[495,1329,1809],[5.0,1.0,1.0])]
[1,Logistic regression regression regression regression regression I love it ,WrappedArray(logistic, regression, regression, regression, regression, regression, i, love, it),(2000,[240,495,695,1329,1604],[1.0,1.0,5.0,1.0,1.0])]
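This was hard to figure out until I changed the example. It turns out that i hashes to 1329 and occurs 5 times in "I I I Spark and I I Spark", which gives (2000,[333,1105,1329],[1.0,2.0,5.0]). The same word therefore maps to the same index, 1329, in every document.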
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)
scala> val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
idf: org.apache.spark.ml.feature.IDF = idf_18dec771e2e0
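IDF is used to correct the raw term-frequency vectors so that they better reflect how well each term distinguishes documents. IDF is an Estimator: calling fit() on the term-frequency vectors produces an IDFModel.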
scala> val idfModel = idf.fit(featurizedData)
idfModel: org.apache.spark.ml.feature.IDFModel = idf_18dec771e2e0
IDFModel is a Transformer; calling its transform() method yields the TF-IDF value for each word.
scala> val rescaledData = idfModel.transform(featurizedData)
rescaledData: org.apache.spark.sql.DataFrame = [label: int, sentence: string ... 3 more fields]
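The feature vectors have now been rescaled by each term's document frequency in the corpus. The resulting TF-IDF vectors can then be fed into machine learning methods.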
scala> rescaledData.select("features", "label").take(3).foreach(println)
[(2000,[333,1105,1329],[0.6931471805599453,1.3862943611198906,0.0]),0]
[(2000,[495,1329,1809],[1.4384103622589042,0.0,0.6931471805599453]),0]
[(2000,[240,495,695,1329,1604],[0.6931471805599453,0.28768207245178085,3.4657359027997265,0.0,0.6931471805599453]),1]
Printing the full rows gives a more detailed side-by-side comparison:
scala> rescaledData.rdd.foreach(println)
[0,I I I Spark and I I Spark,WrappedArray(i, i, i, spark, and, i, i, spark),(2000,[333,1105,1329],[1.0,2.0,5.0]),(2000,[333,1105,1329],[0.6931471805599453,1.3862943611198906,0.0])]
[0,I wish wish wish wish wish classes,WrappedArray(i, wish, wish, wish, wish, wish, classes),(2000,[495,1329,1809],[5.0,1.0,1.0]),(2000,[495,1329,1809],[1.4384103622589042,0.0,0.6931471805599453])]
[1,Logistic regression regression regression regression regression I love it ,WrappedArray(logistic, regression, regression, regression, regression, regression, i, love, it),(2000,[240,495,695,1329,1604],[1.0,1.0,5.0,1.0,1.0]),(2000,[240,495,695,1329,1604],[0.6931471805599453,0.28768207245178085,3.4657359027997265,0.0,0.6931471805599453])]
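In the sentence "I I I Spark and I I Spark", the word i occurs in every document, so its IDF, and hence its TF-IDF value, is 0. The word spark occurs twice in this sentence and in no other document, so its TF-IDF value is 2 × ln(4/2) = 1.3862943611198906.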
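The numbers in the output above can be checked by hand against the smoothed IDF formula. The following is a minimal plain-Scala sketch (no Spark required); the object and method names are hypothetical, and the document frequencies are read off the rawFeatures output shown earlier.

```scala
// Re-computes TF-IDF values from IDF(t, D) = ln((|D| + 1) / (DF(t, D) + 1)).
object TfIdfCheck {
  def idf(numDocs: Int, docFreq: Int): Double =
    math.log((numDocs + 1.0) / (docFreq + 1.0))

  def tfIdf(tf: Double, numDocs: Int, docFreq: Int): Double =
    tf * idf(numDocs, docFreq)

  def main(args: Array[String]): Unit = {
    val numDocs = 3
    println(tfIdf(5.0, numDocs, 3)) // "i": present in all 3 documents, so 0.0
    println(tfIdf(2.0, numDocs, 1)) // "spark": TF = 2, present in 1 document, about 1.386
    println(tfIdf(5.0, numDocs, 2)) // index 495: TF = 5, present in 2 rows, about 1.438
  }
}
```

The third case corresponds to index 495, which appears in two rows of the output, so its IDF is ln(4/3) rather than ln(4/2).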
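Word2Vec is an Estimator that takes sequences of words representing documents and trains a Word2VecModel. The model maps each word to a fixed-size vector. The Word2VecModel transforms each document into a vector by averaging the vectors of all words in the document; this vector can then be used as features for prediction, document similarity calculations, and so on.
In the ml library, Word2Vec is implemented with the skip-gram model. The training objective of skip-gram is to learn word vector representations that are good at predicting a word's context: given the center word, it maximizes the average log-likelihood of the surrounding context words.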
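The likelihood itself is not reproduced in this post. For reference, the usual skip-gram formulation, for a training sequence of words w_1, ..., w_T and a context window of size k, can be written as below. The symbols T, k, u_w, v_w, and V (vocabulary size) are notation introduced here for illustration, not taken from the post.

```latex
\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{j=-k \\ j \neq 0}}^{k} \log p(w_{t+j} \mid w_t),
\qquad
p(w_i \mid w_j) = \frac{\exp\left(u_{w_i}^{\top} v_{w_j}\right)}{\sum_{l=1}^{V} \exp\left(u_{l}^{\top} v_{w_j}\right)}
```

Here v_w is the vector of w as a center word and u_w is its vector as a context word.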
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
val documentDF = spark.createDataFrame(Seq(
"Hi I I I Spark Spark".split(" "),
"I wish wish wish wish wish wish".split(" "),
"Logistic regression".split(" ")
).map(Tuple1.apply)).toDF("text")
val word2Vec = new Word2Vec().setInputCol("text").setOutputCol("result").setVectorSize(3).setMinCount(0)
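val model = word2Vec.fit(documentDF)
// Transform the documents; the result column holds one vector per document.
val result = model.transform(documentDF)
Each document is turned into a 3-dimensional feature vector, which can then be fed into machine learning methods.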
scala> result.collect().foreach { case Row(text: Seq[_], features: Vector) =>
| println(s"Text: [${text.mkString(", ")}] => \nVector: $features\n") }
Text: [Hi, I, I, I, Spark, , Spark] =>
Vector: [-0.07306859535830361,-0.02478547128183501,-0.010775725756372723]
Text: [I, wish, wish, wish, wish, wish, wish] =>
Vector: [-0.033820114231535366,-0.13763525443417685,0.14657753705978394]
Text: [Logistic, regression] =>
Vector: [-0.10231713205575943,0.0494652334600687,0.014658251544460654]
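The averaging step that turns word vectors into a document vector can be sketched in plain Scala. This is an illustration only: the names are hypothetical and the word vectors are made-up values, not the output of a trained model.

```scala
// Sketch: derive a document vector by averaging its word vectors.
object DocVectorSketch {
  def average(vectors: Seq[Array[Double]]): Array[Double] = {
    require(vectors.nonEmpty, "need at least one word vector")
    val dim = vectors.head.length
    val sum = new Array[Double](dim)
    // Accumulate component-wise, then divide by the number of words.
    for (v <- vectors; i <- 0 until dim) sum(i) += v(i)
    sum.map(_ / vectors.size)
  }

  def main(args: Array[String]): Unit = {
    // Made-up 3-dimensional word vectors (assumption for illustration).
    val wordVectors = Map(
      "logistic"   -> Array(0.25, -0.5, 1.0),
      "regression" -> Array(0.75, 0.5, 0.0)
    )
    val doc = Seq("logistic", "regression")
    println(average(doc.map(wordVectors)).mkString("[", ",", "]")) // [0.5,0.0,0.5]
  }
}
```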
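CountVectorizer converts a document into a vector of token counts. When no a-priori dictionary is available, CountVectorizer can be used as an Estimator to extract the vocabulary, producing a CountVectorizerModel that stores the vocabulary.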
import spark.implicits._
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
val df = spark.createDataFrame(Seq(
(0, Array("a", "b", "c")),
(1, Array( "b", "b", "c", "a"))
)).toDF("id", "words")
val cvModel: CountVectorizerModel = new CountVectorizer().setInputCol("words").setOutputCol("features").setVocabSize(3).setMinDF(2).fit(df)
After fitting, the vocabulary is available through the CountVectorizerModel's vocabulary member:
scala> cvModel.vocabulary
res46: Array[String] = Array(b, a, c)
cvModel.transform(df).show(false)
+---+------------+-------------------------+
|id |words |features |
+---+------------+-------------------------+
|0 |[a, b, c] |(3,[0,1,2],[1.0,1.0,1.0])|
|1 |[b, b, c, a]|(3,[0,1,2],[2.0,1.0,1.0])|
+---+------------+-------------------------+
Alternatively, a CountVectorizerModel can be defined directly from an a-priori vocabulary. The vector indices then follow the order of that vocabulary (a, b, c):
val cvm = new CountVectorizerModel(Array("a", "b", "c")).setInputCol("words").setOutputCol("features")
cvm.transform(df).show(false)
+---+------------+-------------------------+
|id |words |features |
+---+------------+-------------------------+
|0 |[a, b, c] |(3,[0,1,2],[1.0,1.0,1.0])|
|1 |[b, b, c, a]|(3,[0,1,2],[1.0,2.0,1.0])|
+---+------------+-------------------------+
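The idea behind CountVectorizer can be sketched without Spark: build a vocabulary from the corpus ordered by term frequency, then count each vocabulary term per document. The names below are hypothetical, and the alphabetical tie-break among equally frequent terms is an assumption; Spark's ordering of ties may differ.

```scala
// Sketch of vocabulary fitting and count vectorization in plain Scala.
object CountVectorizerSketch {
  def fitVocabulary(docs: Seq[Seq[String]], vocabSize: Int, minDF: Int): Seq[String] = {
    // Number of documents containing each term.
    val docFreq = docs.flatMap(_.distinct).groupBy(identity).map { case (t, ts) => t -> ts.size }
    // Total number of occurrences of each term across the corpus.
    val termFreq = docs.flatten.groupBy(identity).map { case (t, ts) => t -> ts.size }
    termFreq.toSeq
      .filter { case (t, _) => docFreq(t) >= minDF } // keep terms seen in enough documents
      .sortBy { case (t, n) => (-n, t) }             // most frequent first, ties alphabetical
      .take(vocabSize)
      .map(_._1)
  }

  // One count per vocabulary term, in vocabulary order.
  def transform(doc: Seq[String], vocabulary: Seq[String]): Seq[Double] =
    vocabulary.map(term => doc.count(_ == term).toDouble)

  def main(args: Array[String]): Unit = {
    val docs = Seq(Seq("a", "b", "c"), Seq("b", "b", "c", "a"))
    val vocab = fitVocabulary(docs, vocabSize = 3, minDF = 2)
    println(vocab.mkString(", ")) // b, a, c
    docs.foreach(d => println(transform(d, vocab).mkString(", ")))
  }
}
```

With the two example documents this reproduces the vocabulary order b, a, c and the counts 1, 1, 1 and 2, 1, 1 shown for the fitted model.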
Feature hashing projects a set of categorical or numerical features into a feature vector of specified dimension (typically substantially smaller than that of the original feature space). This is done using the hashing trick to map features to indices in the feature vector.
import org.apache.spark.ml.feature.FeatureHasher
val df = Seq(
(2.0, true, "1", "foo"),
(3.0, false, "2", "bar")
).toDF("real", "bool", "stringNum", "string")
val hasher = new FeatureHasher().setInputCols("real", "bool", "stringNum", "string").setOutputCol("features")
hasher.transform(df).show(false)
+----+-----+---------+------+--------------------------------------------------------+
|real|bool |stringNum|string|features |
+----+-----+---------+------+--------------------------------------------------------+
|2.0 |true |1 |foo |(262144,[174475,247670,257907,262126],[2.0,1.0,1.0,1.0])|
|3.0 |false|2 |bar |(262144,[70644,89673,173866,174475],[1.0,1.0,1.0,3.0]) |
+----+-----+---------+------+--------------------------------------------------------+
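In the output above, index 174475 carries 2.0 in the first row and 3.0 in the second, which is consistent with FeatureHasher hashing a numeric column by its column name and keeping the value, while string and boolean columns are hashed as "column=value" with an indicator of 1.0. A minimal sketch of that convention follows; the names are hypothetical, and String.hashCode stands in for Spark's MurmurHash3, so the indices will not match the output above.

```scala
// Sketch of the FeatureHasher convention in plain Scala.
// String.hashCode is a stand-in for MurmurHash3 (an assumption for illustration).
object FeatureHasherSketch {
  def bucket(key: String, numFeatures: Int): Int =
    java.lang.Math.floorMod(key.hashCode, numFeatures)

  def hashRow(row: Map[String, Any], numFeatures: Int): Map[Int, Double] = {
    val entries = row.toSeq.map {
      // Numeric column: hash the column name, keep the numeric value.
      case (col, value: Double) => bucket(col, numFeatures) -> value
      // String or boolean column: hash "column=value", use 1.0 as indicator.
      case (col, other) => bucket(s"$col=$other", numFeatures) -> 1.0
    }
    // Sum values that land in the same bucket.
    entries.groupBy(_._1).map { case (idx, es) => idx -> es.map(_._2).sum }
  }

  def main(args: Array[String]): Unit = {
    val numFeatures = 262144
    val row1 = Map[String, Any]("real" -> 2.0, "bool" -> true, "stringNum" -> "1", "string" -> "foo")
    val row2 = Map[String, Any]("real" -> 3.0, "bool" -> false, "stringNum" -> "2", "string" -> "bar")
    println(hashRow(row1, numFeatures))
    println(hashRow(row2, numFeatures))
  }
}
```

Because the numeric column is hashed by name only, it lands in the same bucket in every row, whereas each distinct categorical value gets its own bucket.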
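That wraps up this quick tour of feature extraction. Feature transformation is the main event, and there is more to come.
Qin Kaixin, Shenzhen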