Bisecting K-means Algorithm

Date: 2019-11-10


Bisecting K-means algorithm (good)

http://m.blog.csdn.net/blog/hwwn2009/38312613

Machine Learning Algorithms and Python Practice (6): Bisecting k-means clustering http://blog.csdn.net/zouxy09/article/details/17590137

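The algorithm proceeds in the following steps. First, all of the data is initialized as a single cluster. Second, one cluster is selected from the existing clusters and split into two clusters using the basic k-means algorithm with k set to 2 (at the start there is only one cluster to choose from). The second step, picking one cluster and splitting it in two, is then repeated until there are k clusters, at which point the algorithm terminates.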
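The steps above can be sketched roughly as follows in plain Python. This is only an illustrative sketch, not a reference implementation: the function names (`two_means`, `bisecting_kmeans`), the `seed` parameter, and the representation of points as tuples are all assumptions made for this example. Since the description so far does not say which cluster to pick, the sketch simply splits the largest one.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def two_means(points, rng, max_iter=100):
    """Basic k-means with k = 2; returns a pair of point lists."""
    centers = rng.sample(points, 2)  # two random points as initial centroids
    for _ in range(max_iter):
        halves = ([], [])
        for p in points:
            d0 = squared_distance(p, centers[0])
            d1 = squared_distance(p, centers[1])
            halves[0 if d0 <= d1 else 1].append(p)
        if not halves[0] or not halves[1]:
            break  # degenerate split (e.g. identical initial centroids)
        new_centers = [centroid(h) for h in halves]
        if new_centers == centers:
            break  # centroids stopped moving
        centers = new_centers
    return halves


def bisecting_kmeans(points, k, seed=0):
    """Start with one cluster holding everything; bisect until k clusters exist."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        # Assumption for this sketch: always split the largest cluster.
        target = max(clusters, key=len)
        if len(target) < 2:
            break  # nothing left that can be split
        left, right = two_means(target, rng)
        if not left or not right:
            break  # the split failed; stop rather than loop forever
        clusters.remove(target)
        clusters.extend([left, right])
    return clusters
```

Calling `bisecting_kmeans(points, 3)` on a list of coordinate tuples returns a list of up to three point lists that together contain every input point.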

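As the algorithm shows, every split is done with k-means. The open question is which of the existing clusters should be chosen for splitting. The following algorithm, widely reposted online, is excerpted from Michael Steinbach's paper "A Comparison of Document Clustering Techniques":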

Basic Bisecting K-means Algorithm for finding K clusters.
1. Pick a cluster to split.
2. Find 2 sub-clusters using the basic K-means algorithm. (Bisecting step)
3. Repeat step 2, the bisecting step, for ITER times and take the split that produces the
clustering with the highest overall similarity.
4. Repeat steps 1, 2 and 3 until the desired number of clusters is reached.

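In this paper, the author describes two strategies for choosing the cluster to split. The first computes the sum of squared errors (SSE) for every existing cluster each time a choice is made, and splits the cluster with the largest SSE. The second always splits the cluster containing the most data points. The first strategy is the one generally adopted. In addition, each split does not rely on a single run of k-means: an ITER value is set in advance, and k-means is run ITER times on the chosen cluster. Because k-means starts each run from k randomly chosen centroids, the ITER runs will generally produce different pairs of clusters. Which pair should be kept as the result of the split? In each run, compute the sum of the SSEs of the two clusters produced by that run, and at the end keep the pair whose combined SSE is smallest.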
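The two selection rules and the best-of-ITER choice described above could be written along the following lines. This is a hedged sketch rather than the paper's own code: the names `sse`, `pick_cluster`, and `best_split`, the `strategy` strings, and the `split_fn` callback (any function that takes a cluster and returns two sub-clusters, for example a k-means run with k = 2) are assumptions introduced for illustration.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of points (tuples)."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum(sum((a - b) ** 2 for a, b in zip(p, c)) for p in points)


def pick_cluster(clusters, strategy="sse"):
    """Return the index of the cluster to split next.

    "sse"  -> the cluster with the largest SSE (first strategy)
    "size" -> the cluster with the most points (second strategy)
    """
    if strategy == "sse":
        return max(range(len(clusters)), key=lambda i: sse(clusters[i]))
    if strategy == "size":
        return max(range(len(clusters)), key=lambda i: len(clusters[i]))
    raise ValueError("unknown strategy: " + repr(strategy))


def best_split(cluster, split_fn, n_trials):
    """Run split_fn n_trials times and keep the pair with the lowest combined SSE.

    split_fn is a hypothetical callback that returns (left, right).
    Returns None if every trial produced an empty half.
    """
    best_pair, best_cost = None, math.inf
    for _ in range(n_trials):
        left, right = split_fn(cluster)
        if not left or not right:
            continue  # ignore degenerate splits
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_pair, best_cost = (left, right), cost
    return best_pair
```

Here `n_trials` plays the role of ITER. Passing the split step in as a callback keeps the selection logic independent of how the individual k-means run is implemented.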