原文連接:http://tecdat.cn/r語言實現:混合正態分佈em最大指望估計法/算法
由於近期在分析數據時用到了EM最大指望估計法這個算法,在參數估計中也用到的比較多。然而,發現國內在R軟件上實現高斯混合分佈的EM的實例並很少,大多數是關於1到2個高斯混合分佈的實現,不易於推廣,所以這裏分享一下本身編寫的k個高斯混合分佈的EM算法實現請大神們多多指教。並結合EMCluster包對結果進行驗算。函數
本文使用的密度函數爲下面格式:大數據
對應的函數原型爲 em.norm(x,means,covariances,mix.prop)ui
x爲原數據,means爲初始均值,covariances爲數據的協方差矩陣,mix.prop爲混合參數初始值。spa
使用的數據爲MASS包裏面的synth.te數據的前兩列orm
x <- synth.te[,-3]blog
首先安裝須要的包,並讀取原數據。rem
install.packages("MASS")get
library(MASS)原型
install.packages("EMCluster")
library(EMCluster)
install.packages("ggplot2")
library(ggplot2)
Y=synth.te[,c(1:2)]
qplot(x=xs, y=ys, data=Y)
而後繪製相應的變量相關圖:
從圖上咱們能夠大概估計出初始的平均點爲(-0.7,0.4) (-0.3,0.8)(0.5,0.6)
固然爲了試驗的嚴謹性,我能夠從兩個初始均值點的狀況開始估計
首先輸入初始參數:
mustart = rbind(c(-0.5,0.3),c(0.4,0.5))
covstart = list(cov(Y), cov(Y))
probs = c(.01, .99)
而後編寫em.norm函數,注意其中的clusters值須要根據不一樣的初始參數進行修改,
em.norm = function(X,mustart,covstart,probs){
params = list(mu=mustart, var=covstart, probs=probs)
clusters = 2
tol=.00001
maxits=100
showits=T
require(mvtnorm)
N = nrow(X)
mu = params$mu
var = params$var
probs = params$probs
ri = matrix(0, ncol=clusters, nrow=N)
ll = 0
it = 0
converged = FALSE
if (showits)
cat(paste("Iterations of EM:", "\n"))
while (!converged & it < maxits) {
probsOld = probs
llOld = ll
riOld = ri
# Compute responsibilities
for (k in 1:clusters){
ri[,k] = probs[k] * dmvnorm(X, mu[k,], sigma = var[[k]], log=F)
}
ri = ri/rowSums(ri)
rk = colSums(ri)
probs = rk/N
for (k in 1:clusters){
varmat = matrix(0, ncol=ncol(X), nrow=ncol(X))
for (i in 1:N){
varmat = varmat + ri[i,k] * X[i,]%*%t(X[i,])
}
mu[k,] = (t(X) %*% ri[,k]) / rk[k]
var[[k]] = varmat/rk[k] - mu[k,]%*%t(mu[k,])
ll[k] = -.5 * sum( ri[,k] * dmvnorm(X, mu[k,], sigma = var[[k]], log=T) )
}
ll = sum(ll)
parmlistold = c(llOld, probsOld)
parmlistcurrent = c(ll, probs)
it = it + 1
if (showits & it == 1 | it%%5 == 0)
cat(paste(format(it), "...", "\n", sep = ""))
converged = min(abs(parmlistold - parmlistcurrent)) <= tol
}
clust = which(round(ri)==1, arr.ind=T)
clust = clust[order(clust[,1]), 2]
out = list(probs=probs, mu=mu, var=var, resp=ri, cluster=clust, ll=ll)
}
結果,能夠用圖像化來表示:
qplot(x=xs, y=ys, data=Y)
ggplot(aes(x=xs, y=ys), data=Y) +
geom_point(aes(color=factor(test$cluster)))
相似的其餘狀況這裏不呈現了,另外r語言提供了EMCluster包能夠比較方便的實現EM進行參數估計和結果的偏差分析。
ret <- init.EM(Y, nclass = 2)
em.aic(x=Y,emobj=list(pi = ret$pi, Mu = ret$Mu, LTSigma = ret$LTSigma))#計算結果的AIC
經過比較不一樣狀況的AIC,咱們能夠篩選出適合的聚類數參數值。