[Original] Uncle's Algorithm Sharing (5): The DBSCAN Clustering Algorithm

1 Introduction

DBSCAN (density-based spatial clustering of applications with noise) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed (points with many nearby neighbors) and marks as outliers the points that lie alone in low-density regions (whose nearest neighbors are too far away). DBSCAN is one of the most common clustering algorithms and also one of the most cited in the scientific literature.

 

2 Principle

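DBSCAN is a density-based clustering algorithm, and the procedure is fairly simple: points that lie close together (a center point and its neighbors) are grouped into a cluster, then the neighbors of those neighbors are repeatedly found and added to the same cluster until the cluster can grow no further. The algorithm then moves on to the remaining unvisited points.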

3 Algorithm pseudocode

Sub-procedure pseudocode

DBSCAN requires two parameters: ε (eps) and the minimum number of points required to form a dense region (minPts).

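That is, DBSCAN has two main parameters: the distance Eps and the minimum number of neighbors MinPts. A center point and its neighbors can form a cluster only when the number of points within radius Eps of the center point is at least MinPts. Implementations differ on whether the center point itself is counted; scikit-learn's min_samples includes it.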
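As a rough illustration of the procedure and the two parameters, here is a minimal pure-Python sketch. It is not the pseudocode from the original paper and not any library's implementation: the names `region_query`, `dbscan`, `eps` and `min_pts` are chosen for this example, distance is plain Euclidean, each point is assumed to count itself among its neighbors, and the neighbor search is a brute-force scan.

```python
import math

NOISE = -1
UNVISITED = None


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            # Not a core point; it may still be claimed later as a border point.
            labels[i] = NOISE
            continue
        # points[i] is a core point: start a new cluster and grow it.
        labels[i] = cluster_id
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                # points[j] is also a core point, so the cluster keeps expanding.
                queue.extend(j_neighbours)
        cluster_id += 1
    return labels


if __name__ == "__main__":
    sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
    print(dbscan(sample, eps=1.5, min_pts=2))
```

With this sample data the first three points form one cluster, the next two form another, and the isolated last point is labeled as noise. The brute-force neighbor scan makes the sketch quadratic in the number of points; real implementations typically use a spatial index instead.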

4 Application code

Python

Example code

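import numpy as np
from sklearn.cluster import DBSCAN

KMS_PER_RADIAN = 6371.0088  # mean Earth radius in km

def main_fun():
    # (latitude, longitude) pairs in degrees
    loc_data = [(40.8379295833, -73.70228875), (40.750613794,-73.993434906), (40.6927066969, -73.8085984165), (40.7489736586, -73.9859616017), (40.8379525833, -73.70209875), (40.6997066969, -73.8085234165), (40.7484436586, -73.9857316017)]
    # The haversine metric works in radians, so eps must be given in radians too.
    # Assumption: the value 10 is meant as 10 km; a raw eps of 10 radians would
    # exceed any distance on the sphere and put every point in one cluster.
    epsilon = 10 / KMS_PER_RADIAN
    db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(loc_data))
    labels = db.labels_
    print(labels)
    print(db.core_sample_indices_)
    print(db.components_)
    # Number of clusters, not counting the noise label -1
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    for i in range(n_clusters_):
        print(i)
        # np.where returns a tuple; element 0 holds the matching indices
        for j in np.where(labels == i)[0]:
            print(loc_data[j])

if __name__ == '__main__':
    main_fun()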
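With scikit-learn's haversine metric, the inputs are (latitude, longitude) in radians and the distances are in radians as well, so an eps expressed in kilometers has to be divided by the Earth's radius before it is passed in. The following standard-library sketch shows the conversion and the distance formula. The helper names `haversine_km` and `km_to_radians` and the radius constant are assumptions made for this example, not part of scikit-learn.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; assumed constant for this sketch


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def km_to_radians(km):
    """Convert a distance in km to the angular distance used as eps."""
    return km / EARTH_RADIUS_KM


if __name__ == "__main__":
    a = (40.8379295833, -73.70228875)
    b = (40.8379525833, -73.70209875)
    c = (40.6927066969, -73.8085984165)
    print(haversine_km(a, b))   # two nearby sample points
    print(haversine_km(a, c))   # two sample points farther apart
    print(km_to_radians(10))    # eps for a 10 km radius
```

Checking a few pairwise distances this way is a quick sanity test when choosing eps: the value should be larger than the distances inside a group and smaller than the distances between groups.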

Explanation of the main results

core_sample_indices_  array, shape = [n_core_samples]

Indices of core samples.

components_  array, shape = [n_core_samples, n_features]

Copy of each core sample found by training.

labels_  array, shape = [n_samples]

Cluster labels for each point in the dataset given to fit(). Noisy samples are given the label -1.
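Since labels_ is aligned with the input order, a common follow-up step is to group the original points by label and set the noise points (label -1) aside. Here is a small sketch of that step. The function name `group_by_label` and the label values in the usage example are made up for illustration and are not output from an actual run.

```python
from collections import defaultdict


def group_by_label(points, labels):
    """Split points into {cluster_id: [points]} plus a list of noise points."""
    clusters = defaultdict(list)
    noise = []
    for point, label in zip(points, labels):
        if label == -1:
            noise.append(point)
        else:
            # int() also accepts NumPy integer labels
            clusters[int(label)].append(point)
    return dict(clusters), noise


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (9.0, 9.0)]
    lbls = [0, 0, 1, -1]  # hypothetical labels
    clusters, noise = group_by_label(pts, lbls)
    print(clusters)
    print(noise)
```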

See the official documentation for details: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

Scala

Dependencies

<dependency>
<groupId>org.scalanlp</groupId>
<artifactId>nak_2.11</artifactId>
<version>1.3</version>
</dependency>
<dependency>
<groupId>org.scalanlp</groupId>
<artifactId>breeze_2.11</artifactId>
<version>0.13</version>
</dependency>

Example code

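import breeze.linalg.DenseMatrix
import nak.cluster.{DBSCAN, GDBSCAN, Kmeans}

// The wrapper object name is arbitrary; it only makes the snippet runnable.
object DBSCANExample {
  def main(args: Array[String]): Unit = {
    // Each row is one (latitude, longitude) point in degrees
    val matrix = DenseMatrix(
      (40.8379295833, -73.70228875),
      (40.6927066969, -73.8085984165),
      (40.7489736586, -73.9859616017),
      (40.8379525833, -73.70209875),
      (40.6997066969, -73.8085234165),
      (40.7484436586, -73.9857316017),
      (40.750613794,-73.993434906))

    // epsilon is expressed in the units of the distance function. With Euclidean
    // distance on raw degrees, 1000.0 exceeds every pairwise distance in this data,
    // so all points should fall into one cluster; use a much smaller value to separate them.
    val gdbscan = new GDBSCAN(
      DBSCAN.getNeighbours(epsilon = 1000.0, distance = Kmeans.euclideanDistance),
      DBSCAN.isCorePoint(minPoints = 1)
    )
    val clusters = gdbscan cluster matrix
    // Print each cluster's id and size, then the coordinates of its points
    clusters.foreach { cluster =>
      println(cluster.id + ", " + cluster.points.length)
      cluster.points.foreach(p => p.value.data.foreach(println))
    }
  }
}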

See the official documentation for details: https://github.com/scalanlp/nak

 

For the details of the algorithm, see the reference.

Reference: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise

 

Other:

http://www.cs.fsu.edu/~ackerman/CIS5930/notes/DBSCAN.pdf

https://www.oreilly.com/ideas/clustering-geolocated-data-using-spark-and-dbscan
