Official documentation: https://spark.apache.org/docs/2.2.0/ml-clustering.html

This section covers the clustering algorithms in MLlib.
Contents: K-means, Latent Dirichlet allocation (LDA), Bisecting k-means, Gaussian Mixture Model (GMM).
K-means is one of the most commonly used clustering algorithms; it groups the data points into a predefined number of clusters.

KMeans is implemented as an Estimator and generates a KMeansModel as the base model.
Input Columns:

Param name | Type(s) | Default | Description |
---|---|---|---|
featuresCol | Vector | features | Feature vector |

Output Columns:

Param name | Type(s) | Default | Description |
---|---|---|---|
predictionCol | Int | prediction | Predicted cluster center |
```python
from pyspark.ml.clustering import KMeans
from pyspark.sql import SparkSession

# Create/obtain the SparkSession (the original snippet assumes `spark` already exists).
spark = SparkSession.builder.appName("KMeansExample").getOrCreate()

# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

# Trains a k-means model.
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)

# Evaluate clustering by computing Within Set Sum of Squared Errors.
wssse = model.computeCost(dataset)
print("Within Set Sum of Squared Errors = " + str(wssse))

# Shows the result.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
```
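Continuing from the example above, a minimal sketch of reading back per-point cluster assignments: `transform` appends the column named by predictionCol (`prediction` by default, per the output-columns table above). The column selection below is illustrative and not part of the official example.

```python
# Sketch: per-point cluster assignments; "prediction" is the default
# column name set by predictionCol (see the output-columns table above).
assignments = model.transform(dataset)
assignments.select("features", "prediction").show(truncate=False)
```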
LDA is implemented as an Estimator that supports both EMLDAOptimizer and OnlineLDAOptimizer, and it generates an LDAModel as the base model; expert users can, if needed, convert an LDAModel produced by EMLDAOptimizer into a DistributedLDAModel.
```python
from pyspark.ml.clustering import LDA
from pyspark.sql import SparkSession

# Create/obtain the SparkSession (the original snippet assumes `spark` already exists).
spark = SparkSession.builder.appName("LDAExample").getOrCreate()

# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_lda_libsvm_data.txt")

# Trains an LDA model.
lda = LDA(k=10, maxIter=10)
model = lda.fit(dataset)

ll = model.logLikelihood(dataset)
lp = model.logPerplexity(dataset)
print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
print("The upper bound on perplexity: " + str(lp))

# Describe topics.
topics = model.describeTopics(3)
print("The topics described by their top-weighted terms:")
topics.show(truncate=False)

# Shows the result.
transformed = model.transform(dataset)
transformed.show(truncate=False)
```
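A minimal sketch of the optimizer note above (not part of the official snippet, reusing `dataset` from the example): fitting with `optimizer="em"` yields a DistributedLDAModel, which `toLocal()` collects into a local LDAModel.

```python
# Sketch: choose the EM optimizer explicitly; the fitted model is then
# a DistributedLDAModel rather than a LocalLDAModel.
em_lda = LDA(k=10, maxIter=10, optimizer="em")
em_model = em_lda.fit(dataset)
print(em_model.isDistributed())  # True when trained with the EM optimizer

# Collect the distributed model into a local LDAModel if needed.
local_model = em_model.toLocal()
```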
Bisecting k-means is a hierarchical clustering algorithm that uses a divisive (top-down) approach: all data points start in a single cluster, and the data is split recursively until the specified number of clusters is reached.

Bisecting k-means is generally faster than regular k-means, but it usually produces a different clustering.

BisectingKMeans is implemented as an Estimator and generates a BisectingKMeansModel as the base model.

Unlike regular k-means, the final result of bisecting k-means does not depend on the choice of initial cluster centers, which is why the results of bisecting k-means and regular k-means often differ.
```python
from pyspark.ml.clustering import BisectingKMeans
from pyspark.sql import SparkSession

# Create/obtain the SparkSession (the original snippet assumes `spark` already exists).
spark = SparkSession.builder.appName("BisectingKMeansExample").getOrCreate()

# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

# Trains a bisecting k-means model.
bkm = BisectingKMeans().setK(2).setSeed(1)
model = bkm.fit(dataset)

# Evaluate clustering.
cost = model.computeCost(dataset)
print("Within Set Sum of Squared Errors = " + str(cost))

# Shows the result.
print("Cluster Centers: ")
centers = model.clusterCenters()
for center in centers:
    print(center)
```
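As a small follow-up sketch (reusing `model` and `dataset` from the example above, not part of the official snippet), the assignments produced by the recursive splits can be aggregated to inspect the resulting cluster sizes:

```python
# Sketch: per-point assignments and the resulting cluster sizes.
assignments = model.transform(dataset)
assignments.groupBy("prediction").count().show()
```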
A Gaussian Mixture Model (GMM) represents a composite distribution in which each point is drawn from one of k Gaussian sub-distributions, each with its own probability; spark.ml uses the expectation-maximization (EM) algorithm to induce the maximum-likelihood model from the given data.
Input Columns:

Param name | Type(s) | Default | Description |
---|---|---|---|
featuresCol | Vector | features | Feature vector |

Output Columns:

Param name | Type(s) | Default | Description |
---|---|---|---|
predictionCol | Int | prediction | Predicted cluster center |
probabilityCol | Vector | probability | Probability of each cluster |
```python
from pyspark.ml.clustering import GaussianMixture
from pyspark.sql import SparkSession

# Create/obtain the SparkSession (the original snippet assumes `spark` already exists).
spark = SparkSession.builder.appName("GaussianMixtureExample").getOrCreate()

# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

# Trains a Gaussian mixture model.
gmm = GaussianMixture().setK(2).setSeed(538009335)
model = gmm.fit(dataset)

print("Gaussians shown as a DataFrame: ")
model.gaussiansDF.show(truncate=False)
```
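Continuing from the example above, a brief sketch of the remaining model outputs: the mixture weights and the per-point soft assignments written to probabilityCol (`probability` by default, per the output-columns table above).

```python
# Sketch: mixture weights and soft cluster assignments.
print("Mixture weights: " + str(model.weights))
predictions = model.transform(dataset)
predictions.select("prediction", "probability").show(truncate=False)
```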