Machine Learning Clustering Algorithms: DBSCAN

1. Concepts

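DBSCAN is a density-based clustering algorithm. It takes two parameters: the radius of the neighborhood centered on a point P (eps in the code below), and the minimum number of points that the neighborhood centered on P must contain (min_pts in the code below), which acts as the density threshold.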
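To make the two parameters concrete, here is a minimal sketch of the density test for a single point. The helper names neighborhood and is_core_point and the sample points are illustrative assumptions, not part of the implementation later in this article; the sketch also assumes Python 3.8+ for math.dist.

```python
import math


def neighborhood(points, p, eps):
    """Return every point within distance eps of p (p itself included)."""
    return [q for q in points if math.dist(p, q) <= eps]


def is_core_point(points, p, eps, min_pts):
    """p is dense enough when its eps-neighborhood holds at least min_pts points."""
    return len(neighborhood(points, p, eps)) >= min_pts


# Hypothetical sample data: three points close together and one far away.
points = [(0, 0), (0, 1), (1, 0), (10, 10)]
print(is_core_point(points, (0, 0), eps=1.5, min_pts=3))    # True
print(is_core_point(points, (10, 10), eps=1.5, min_pts=3))  # False
```

A larger eps or a smaller min_pts makes more points pass the test, so clusters grow and fewer points are left as noise.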

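Advantages:

1) The number of clusters does not need to be set in advance, so the resulting grouping is more reasonable;

2) It can effectively filter out noise.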

 

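Disadvantages:

1) It performs poorly on high-dimensional data;

2) Its computational complexity is relatively high, and it consumes more resources than K-means.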
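One way to see where the cost comes from, assuming the brute-force neighborhood search used in this article: every neighborhood query compares the center against all n points, and one query is issued per point, so the total is about n × n distance evaluations. The function name brute_force_queries and the sample data are assumptions made for this illustration.

```python
import math


def brute_force_queries(points, eps):
    """Run one brute-force neighborhood query per point.

    Returns the neighbor count of every point and the total number of
    distance evaluations that were needed.
    """
    evaluations = 0
    neighbor_counts = []
    for p in points:
        count = 0
        for q in points:
            evaluations += 1
            if math.dist(p, q) <= eps:
                count += 1
        neighbor_counts.append(count)
    return neighbor_counts, evaluations


for n in (10, 100, 1000):
    # Hypothetical data: n points spaced one unit apart on a line.
    line = [(float(i), 0.0) for i in range(n)]
    _, evaluations = brute_force_queries(line, eps=1.0)
    print(n, evaluations)  # evaluations equals n * n
```

By comparison, one K-means iteration needs roughly n × k distance evaluations for k clusters, which is much smaller when k is far below n.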

 

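2. Computation

1) By default, the first point is used as the initial center;

2) Compute the Euclidean distance from each point to the center and compare it with the neighborhood radius; a point whose distance does not exceed the radius is a neighbor;

3) Once all points have been checked, count the points in the neighborhood; if the count is below the minimum threshold, the center is marked as noise;

4) Otherwise, repeat the count for each neighbor in turn: as long as a neighbor's own neighborhood still meets the minimum threshold, the cluster keeps expanding outward, and it stops when no such points remain;

5) Once a cluster is fully expanded, repeat the steps above on the remaining points until every point has been visited.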
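The steps above can be sketched compactly with the standard library alone. This is an illustrative sketch under stated assumptions, not the implementation given later in this article: the names dbscan_labels and NOISE, the label convention (-1 for noise, 0, 1, ... for clusters) and the sample points are all assumptions, and math.dist requires Python 3.8+.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (i itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan_labels(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # provisional: may become a border point later
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # former noise becomes a border point
            if labels[j] is not None:
                continue  # already handled
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # keep expanding through dense points
    return labels


# Hypothetical sample: two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8), (8, 9), (9, 8), (20, 20)]
print(dbscan_labels(points, eps=1.5, min_pts=3))
# prints [0, 0, 0, 0, 1, 1, 1, -1]
```

Returning labels instead of plotting keeps the sketch easy to check by hand; the queue plays the role of the outward expansion in step 4.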

 

3. Implementation

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt

cs = ['black', 'blue', 'brown', 'red', 'yellow', 'green']


class NpCluster(object):
    def __init__(self):
        self.key = []
        self.value = []

    def append(self, data):
        if str(data) in self.key:
            return
        self.key.append(str(data))
        self.value.append(data)

    def exist(self, data):
        if str(data) in self.key:
            return True
        return False

    def __len__(self):
        return len(self.value)

    def __iter__(self):
        self.times = 0
        return self

    def __next__(self):
        try:
            ret = self.value[self.times]
            self.times += 1
            return ret
        except IndexError:
            raise StopIteration()


def create_sample():
    np.random.seed(10)  # fixed random seed so every run generates the same samples
    n_dim = 2
    num = 100
    a = 3 + 5 * np.random.randn(num, n_dim)
    b = 30 + 5 * np.random.randn(num, n_dim)
    c = 60 + 10 * np.random.randn(1, n_dim)
    data_mat = np.concatenate((np.concatenate((a, b)), c))
    ay = np.zeros(num)
    by = np.ones(num)
    label = np.concatenate((ay, by))
    return {'data_mat': list(data_mat), 'label': label}


def region_query(dataset, center_point, eps):
    result = NpCluster()
    for point in dataset:
        if np.sqrt(sum(np.power(point - center_point, 2))) <= eps:
            result.append(point)
    return result


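def dbscan(dataset, eps, min_pts):
    noise = NpCluster()
    visited = NpCluster()
    assigned = NpCluster()  # points that already belong to some cluster
    clusters = []
    for point in dataset:
        if visited.exist(point):
            continue
        visited.append(point)
        neighbors = region_query(dataset, point, eps)
        if len(neighbors) < min_pts:
            # Provisional noise; may later be absorbed as a border point.
            noise.append(point)
        else:
            cluster = NpCluster()
            cluster.append(point)
            assigned.append(point)
            expand_cluster(visited, assigned, dataset, neighbors, cluster, eps, min_pts)
            clusters.append(cluster)
    # Provisional noise that ended up inside a cluster is a border point, not noise.
    noise_points = [p for p in noise.value if not assigned.exist(p)]
    for index, cluster in enumerate(clusters):
        print(cluster.value)
        # np.array replaces np.mat, which is no longer available in NumPy 2.0.
        plot_data(np.array(cluster.value), cs[index % len(cs)])
    if noise_points:
        plot_data(np.array(noise_points), 'green')
    plt.show()


def plot_data(samples, color, plot_type='o'):
    plt.plot(samples[:, 0], samples[:, 1], plot_type, markerfacecolor=color, markersize=14)


def expand_cluster(visited, assigned, dataset, neighbors, cluster, eps, min_pts):
    # `neighbors` grows while it is iterated, so it doubles as the work queue.
    for point in neighbors:
        if not visited.exist(point):
            visited.append(point)
            point_neighbors = region_query(dataset, point, eps)
            if len(point_neighbors) >= min_pts:
                # `point` is a core point, so its neighbors are reachable too.
                for expand_point in point_neighbors:
                    if not neighbors.exist(expand_point):
                        neighbors.append(expand_point)
        # Core and border points both join, unless another cluster claimed them.
        if not assigned.exist(point):
            assigned.append(point)
            cluster.append(point)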


init_data = create_sample()
dbscan(init_data['data_mat'], 10, 3)

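Clustering result:

As the plot shows, the points are cleanly grouped into two clusters, and the point in the upper-right corner is noise.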
