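1. Concept
DBSCAN is a density-based clustering algorithm. It takes two parameters: the radius of the neighborhood centered on a point P (eps in the implementation in section 3), and the minimum number of points that the neighborhood of P must contain (min_pts), that is, the density threshold.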
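As a minimal sketch of how the two parameters work together, the snippet below tests whether a point P is dense enough to seed a cluster. The helper names `eps_neighborhood` and `is_core_point` are illustrative and do not come from the original post; the sketch assumes that P counts as a member of its own neighborhood and that the boundary (distance exactly equal to eps) is included.

```python
import numpy as np


def eps_neighborhood(points, p, eps):
    """Return the rows of `points` whose Euclidean distance to `p` is <= eps."""
    distances = np.linalg.norm(points - p, axis=1)
    return points[distances <= eps]


def is_core_point(points, p, eps, min_pts):
    """P is dense enough when its eps-neighborhood (P included) holds at least min_pts points."""
    return len(eps_neighborhood(points, p, eps)) >= min_pts


points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
print(is_core_point(points, points[0], eps=1.5, min_pts=3))  # True: three points within 1.5
print(is_core_point(points, points[3], eps=1.5, min_pts=3))  # False: only the point itself
```

Raising eps or lowering min_pts makes more points qualify, which tends to merge clusters and leave fewer points labeled as noise.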
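Advantages:
(1) The number of clusters does not have to be chosen in advance, so the grouping follows the structure of the data;
(2) Noise points are filtered out effectively.
Disadvantages:
(1) It performs poorly on high-dimensional data;
(2) It is more expensive than K-means: a naive implementation that scans every point for each neighborhood query, like the one in section 3, needs O(n²) distance computations.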
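2. Computation
(1) By default, the first point is used as the initial center;
(2) Compute the Euclidean distance from every point to the center and compare it with the neighborhood radius; a point within the radius is a neighbor;
(3) Once all points have been checked, count the points in the neighborhood; if there are fewer than the minimum number of points, the center is marked as noise;
(4) Otherwise, repeat the count for each neighbor in turn: as long as a neighbor's own neighborhood reaches the minimum number of points, the cluster keeps expanding outward, and expansion stops when no neighbor meets the threshold any more;
(5) When one cluster has finished expanding, repeat the steps above on the remaining points until every point has been visited.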
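The steps above can be sketched compactly as a function that returns one label per point instead of plotting. This is an illustrative variant, not the implementation used in this post: the name `dbscan_labels`, the label values (-1 for noise, -2 for unvisited), and the explicit work queue are all assumptions made for the sketch.

```python
import numpy as np

NOISE = -1
UNVISITED = -2


def dbscan_labels(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    points = np.asarray(points, dtype=float)
    labels = np.full(len(points), UNVISITED)
    cluster_id = -1

    def neighbors_of(i):
        # Indices of all points within eps of point i (i itself included).
        return np.flatnonzero(np.linalg.norm(points - points[i], axis=1) <= eps)

    for i in range(len(points)):
        if labels[i] != UNVISITED:
            continue
        seeds = neighbors_of(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins the cluster, does not expand it
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors_of(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is dense enough: keep expanding outward
    return labels


demo = [[0, 0], [0, 1], [1, 0], [1, 1],
        [10, 10], [10, 11], [11, 10], [11, 11],
        [50, 50]]
print(dbscan_labels(demo, eps=1.5, min_pts=3).tolist())
# [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Returning labels keeps the clustering separate from the plotting, which makes the result easier to inspect or test.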
3. Implementation
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt

cs = ['black', 'blue', 'brown', 'red', 'yellow', 'green']


class NpCluster(object):
    """Ordered set of points, keyed by the string form of each NumPy array."""

    def __init__(self):
        self.key = []
        self.value = []

    def append(self, data):
        if str(data) in self.key:
            return
        self.key.append(str(data))
        self.value.append(data)

    def exist(self, data):
        return str(data) in self.key

    def __len__(self):
        return len(self.value)

    def __iter__(self):
        # Index-based iteration, so points appended during a loop are still visited.
        self.times = 0
        return self

    def __next__(self):
        try:
            ret = self.value[self.times]
            self.times += 1
            return ret
        except IndexError:
            raise StopIteration()


def create_sample():
    np.random.seed(10)  # fix the random seed so every run generates the same samples
    n_dim = 2
    num = 100
    a = 3 + 5 * np.random.randn(num, n_dim)
    b = 30 + 5 * np.random.randn(num, n_dim)
    c = 60 + 10 * np.random.randn(1, n_dim)  # a single far-away point
    data_mat = np.concatenate((np.concatenate((a, b)), c))
    ay = np.zeros(num)
    by = np.ones(num)
    label = np.concatenate((ay, by))  # covers a and b only; not used by dbscan()
    return {'data_mat': list(data_mat), 'label': label}


def region_query(dataset, center_point, eps):
    """Collect every point within eps of center_point (the center itself included)."""
    result = NpCluster()
    for point in dataset:
        if np.linalg.norm(point - center_point) <= eps:
            result.append(point)
    return result


def dbscan(dataset, eps, min_pts):
    noise = NpCluster()
    visited = NpCluster()
    clusters = []
    for point in dataset:
        cluster = NpCluster()
        if not visited.exist(point):
            visited.append(point)
            neighbors = region_query(dataset, point, eps)
            if len(neighbors) < min_pts:
                noise.append(point)
            else:
                cluster.append(point)
                expand_cluster(visited, dataset, neighbors, cluster, eps, min_pts)
                clusters.append(cluster)
    for index, data in enumerate(clusters):
        print(data.value)
        # np.array replaces np.mat, which was removed in NumPy 2.0
        plot_data(np.array(data.value), cs[index % len(cs)])
    if noise.value:
        plot_data(np.array(noise.value), 'green')
    plt.show()


def plot_data(samples, color, plot_type='o'):
    plt.plot(samples[:, 0], samples[:, 1], plot_type, markerfacecolor=color, markersize=14)


def expand_cluster(visited, dataset, neighbors, cluster, eps, min_pts):
    for point in neighbors:
        if not visited.exist(point):
            visited.append(point)
            point_neighbors = region_query(dataset, point, eps)
            if len(point_neighbors) >= min_pts:
                # point is dense enough: its neighbors join the list being scanned
                for expand_point in point_neighbors:
                    if not neighbors.exist(expand_point):
                        neighbors.append(expand_point)
        if not cluster.exist(point):
            cluster.append(point)


init_data = create_sample()
dbscan(init_data['data_mat'], 10, 3)
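Clustering result:
As the plot shows, the points are cleanly grouped into two clusters, and the point in the upper right corner is noise.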