Clustering Algorithms in Spark

Spark - Clustering

Official documentation: https://spark.apache.org/docs/2.2.0/ml-clustering.html

This section introduces the clustering algorithms in MLlib.

Contents:

  • K-means:
    • Input columns;
    • Output columns;
  • Latent Dirichlet allocation (LDA);
  • Bisecting k-means;
  • Gaussian Mixture Model (GMM):
    • Input columns;
    • Output columns;

K-means

k-means is one of the most commonly used clustering algorithms; it groups the data points into a preset number of clusters.

KMeans is implemented as an Estimator and produces a KMeansModel as the fitted model.

Input Columns

Param name     Type(s)   Default       Description
featuresCol    Vector    "features"    Feature vector

Output Columns

Param name     Type(s)   Default       Description
predictionCol  Int       "prediction"  Predicted cluster center

Example

from pyspark.ml.clustering import KMeans

# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

# Trains a k-means model.
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)

# Evaluate clustering by computing Within Set Sum of Squared Errors.
wssse = model.computeCost(dataset)
print("Within Set Sum of Squared Errors = " + str(wssse))

# Shows the result.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
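The WSSSE reported by computeCost above is simply the sum, over all points, of the squared Euclidean distance to the nearest cluster center. A minimal pure-Python sketch of that metric (independent of Spark, with made-up data):

```python
def wssse(points, centers):
    """Within Set Sum of Squared Errors: for each point, the squared
    Euclidean distance to its nearest center, summed over all points."""
    total = 0.0
    for p in points:
        total += min(
            sum((pi - ci) ** 2 for pi, ci in zip(p, c))
            for c in centers
        )
    return total

# Two well-separated clusters and their obvious centers (toy data).
points = [(0.0, 0.0), (0.2, 0.0), (9.0, 9.0), (9.2, 9.0)]
centers = [(0.1, 0.0), (9.1, 9.0)]
print(wssse(points, centers))  # 4 * 0.1**2 = 0.04, up to float rounding
```

A lower WSSSE means tighter clusters; it always decreases as k grows, so it cannot pick k by itself.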

LDA

LDA is implemented as an Estimator that supports both EMLDAOptimizer and OnlineLDAOptimizer, and produces an LDAModel as the fitted model; expert users may, if needed, cast the LDAModel produced by EMLDAOptimizer to a DistributedLDAModel.

from pyspark.ml.clustering import LDA

# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_lda_libsvm_data.txt")

# Trains an LDA model.
lda = LDA(k=10, maxIter=10)
model = lda.fit(dataset)

ll = model.logLikelihood(dataset)
lp = model.logPerplexity(dataset)
print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
print("The upper bound on perplexity: " + str(lp))

# Describe topics.
topics = model.describeTopics(3)
print("The topics described by their top-weighted terms:")
topics.show(truncate=False)

# Shows the result
transformed = model.transform(dataset)
transformed.show(truncate=False)
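For intuition on the two quantities printed above: the perplexity bound reported by logPerplexity is, assuming the usual per-token normalization (an assumption here, not quoted from the docs), just the negated log-likelihood bound divided by the corpus's total token count. A minimal sketch with made-up numbers:

```python
def log_perplexity_bound(log_likelihood_bound, total_token_count):
    # Upper bound on per-token log perplexity: the negated lower bound
    # on the corpus log likelihood, normalized by the token count.
    # (Assumed definition for illustration; not taken from Spark source.)
    return -log_likelihood_bound / total_token_count

# Hypothetical corpus of 500 tokens with log-likelihood bound -1000.
print(log_perplexity_bound(-1000.0, 500))  # 2.0
```

To pick the optimizer in PySpark, LDA accepts an `optimizer` parameter ("online" by default, or "em").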

Bisecting k-means

Bisecting k-means is a divisive hierarchical clustering algorithm: all data points start in a single cluster, which is split recursively until the specified number of clusters is reached.

Bisecting k-means is generally faster than k-means, but it produces different clusterings.

BisectingKMeans is implemented as an Estimator and produces a BisectingKMeansModel as the fitted model.

Unlike k-means, the final result of bisecting k-means does not depend on the choice of initial cluster centers, which is also why the two algorithms often produce different results.

from pyspark.ml.clustering import BisectingKMeans

# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

# Trains a bisecting k-means model.
bkm = BisectingKMeans().setK(2).setSeed(1)
model = bkm.fit(dataset)

# Evaluate clustering.
cost = model.computeCost(dataset)
print("Within Set Sum of Squared Errors = " + str(cost))

# Shows the result.
print("Cluster Centers: ")
centers = model.clusterCenters()
for center in centers:
    print(center)
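The divisive procedure described above can be sketched in plain Python. This is an illustrative toy (1-D points; the cluster with the largest SSE is split by a crude 2-means step), not Spark's actual implementation:

```python
def sse(cluster):
    """Sum of squared errors of a 1-D cluster around its mean."""
    m = sum(cluster) / len(cluster)
    return sum((x - m) ** 2 for x in cluster)

def split_in_two(cluster, iters=10):
    """Crude 1-D 2-means: seed centers at min/max, then Lloyd iterations."""
    c1, c2 = min(cluster), max(cluster)
    for _ in range(iters):
        left = [x for x in cluster if abs(x - c1) <= abs(x - c2)]
        right = [x for x in cluster if abs(x - c1) > abs(x - c2)]
        if left:
            c1 = sum(left) / len(left)
        if right:
            c2 = sum(right) / len(right)
    return left, right

def bisecting_kmeans(points, k):
    """Start with one cluster; repeatedly split the worst one until k."""
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(clusters, key=sse)
        clusters.remove(worst)
        left, right = split_in_two(worst)
        clusters.extend(c for c in (left, right) if c)
    return clusters

clusters = bisecting_kmeans([0.0, 0.1, 5.0, 5.1, 9.0, 9.1], 3)
print(sorted(sorted(c) for c in clusters))
```

Because splitting is deterministic here (seeds at min/max rather than random), repeated runs give the same tree, which mirrors the insensitivity to initialization noted above.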

Gaussian Mixture Model(GMM)

A Gaussian mixture model represents a composite distribution in which points are drawn from one of k Gaussian sub-distributions, each with its own probability. spark.ml uses the expectation-maximization (EM) algorithm to induce the maximum-likelihood model for the given data.

Input Columns

Param name     Type(s)   Default        Description
featuresCol    Vector    "features"     Feature vector

Output Columns

Param name     Type(s)   Default        Description
predictionCol  Int       "prediction"   Predicted cluster center
probabilityCol Vector    "probability"  Probability of each cluster

Example

from pyspark.ml.clustering import GaussianMixture

# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

gmm = GaussianMixture().setK(2).setSeed(538009335)
model = gmm.fit(dataset)

print("Gaussians shown as a DataFrame: ")
model.gaussiansDF.show(truncate=False)
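The probabilityCol produced by a fitted GMM holds each point's membership probabilities (responsibilities): the weighted density of the point under each Gaussian component, normalized to sum to 1. A 1-D pure-Python sketch with made-up weights, means, and variances:

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian at x."""
    return exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def responsibilities(x, weights, means, variances):
    """P(component | x): weighted densities normalized to sum to 1."""
    dens = [w * gaussian_pdf(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    total = sum(dens)
    return [d / total for d in dens]

# Two equally weighted components centered at 0 and 10 (toy parameters).
probs = responsibilities(0.1, weights=[0.5, 0.5], means=[0.0, 10.0],
                         variances=[1.0, 1.0])
print(probs)  # a point near 0 puts almost all mass on the first component
```

This soft assignment is exactly the E-step of the EM algorithm mentioned above; the M-step then re-estimates weights, means, and variances from these responsibilities.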