As mentioned before, one of the key tasks in machine learning is classification, and there are many different algorithms for it. As the saying goes, birds of a feather flock together. We know very well that people from the same region tend to be similar in both living habits and customs; in other words, they form one group. Each group forms its own center, and the closer a person is to that center, the more similar they are to the rest of the group. The K-Means algorithm is designed to find these center points and treat them as the key representatives of their clusters: when a new data point needs to be classified, we simply check which center it is closest to, and that is the cluster it belongs to.
Suppose we have a set of data representing people's geographic coordinates:
x coordinate | y coordinate | Province (class label) |
---|---|---|
4.035615117 | 4.920529835 | 0 |
4.665299994 | 4.702897321 | 0 |
1.711128297 | 1.031989236 | 1 |
Plotting these coordinates gives the following figure:
The two blue points are close to each other, so their attributes should be similar; the red point sits at some distance from the two blue points and probably belongs to a different cluster.
Here we load a dataset containing three classes; each class is a group with its own center, giving three centers in total. The data and the resulting plot are shown below:
```python
import numpy as np
import matplotlib.pyplot as plt

def read_clusters(clustersfile):
    # Each line of the file holds: x-coordinate, y-coordinate, class label
    cl = []
    tl = []
    with open(clustersfile, 'r') as f:
        for line in f:
            line = line.strip()
            if line != '':
                line = line.split()
                constraint = [float(line[0]), float(line[1])]
                cl.append(constraint)
                tl.append(int(line[2]))
    return cl, tl

train_data, train_labels = read_clusters('clusters3.txt')
train_data = np.array(train_data)

# Color each point by its class and save the scatter plot
key_name = {0: 'red', 1: 'blue', 2: 'orange'}
for i in range(train_data.shape[0]):
    plt.scatter(train_data[i:i + 1, 0:1], train_data[i:i + 1, 1:2],
                c=key_name[train_labels[i]], marker='o', s=20)
plt.savefig('clusters.png')
```
The general steps of K-Means are as follows:
1. First, randomly generate several centers; how many centers to create is determined by how many clusters you want to form.
```python
def _init_random_centroids(self, data):
    # Pick k random samples from the data as the initial centroids
    n_samples, n_features = np.shape(data)
    centroids = np.zeros((self.k, n_features))
    for i in range(self.k):
        centroid = data[np.random.choice(range(n_samples))]
        centroids[i] = centroid
    return centroids
```
2. Next, compare every data point with these centers: whichever center a point is closest to, that is the cluster the point belongs to.
The distance is computed with the following formula:
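For two points $a = (a_1, \dots, a_n)$ and $b = (b_1, \dots, b_n)$, this is the standard Euclidean distance:

$$d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$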
```python
def euclidean_distance(vec_1, vec_2):
    if len(vec_1) != len(vec_2):
        raise Exception("The two vectors do NOT have equal length")
    distance = 0
    for i in range(len(vec_1)):
        distance += pow((vec_1[i] - vec_2[i]), 2)
    return np.sqrt(distance)
```
Find which center a sample belongs to according to the distance:
```python
def _closest_centroid(self, sample, centroids):
    # Return the index of the centroid nearest to the sample
    closest_i = None
    closest_distance = float("inf")
    for i, centroid in enumerate(centroids):
        distance = euclidean_distance(sample, centroid)
        if distance < closest_distance:
            closest_i = i
            closest_distance = distance
    return closest_i
```
3. Once the centers have determined the clusters, update the centers from the clusters: the new center of a cluster is the mean of all the points in it.
```python
def _calculate_centroids(self, clusters, data):
    # Each new centroid is the mean of the samples assigned to its cluster
    n_features = np.shape(data)[1]
    centroids = np.zeros((self.k, n_features))
    for i, cluster in enumerate(clusters):
        centroid = np.mean(data[cluster], axis=0)
        centroids[i] = centroid
    return centroids
```
4. Repeat this process until the centers no longer change, as sketched below.
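The convergence check itself can be as simple as testing whether any centroid moved between two consecutive iterations; this fragment is taken from the `fit` method in the full listing below:

```python
# Stop iterating once no centroid has moved since the previous round
diff = centroids - prev_centroids
if not diff.any():
    break
```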
The complete implementation is as follows:
```python
import numpy as np

def euclidean_distance(vec_1, vec_2):
    if len(vec_1) != len(vec_2):
        raise Exception("The two vectors do NOT have equal length")
    distance = 0
    for i in range(len(vec_1)):
        distance += pow((vec_1[i] - vec_2[i]), 2)
    return np.sqrt(distance)

def read_clusters(clustersfile):
    cl = []
    tl = []
    with open(clustersfile, 'r') as f:
        for line in f:
            line = line.strip()
            if line != '':
                line = line.split()
                constraint = [float(line[0]), float(line[1])]
                cl.append(constraint)
                tl.append(int(line[2]))
    return cl, tl

class KMeans():
    def __init__(self, k=2, max_iterations=500):
        self.k = k
        self.max_iterations = max_iterations
        self.kmeans_centroids = []

    def _init_random_centroids(self, data):
        # Pick k random samples as the initial centroids
        n_samples, n_features = np.shape(data)
        centroids = np.zeros((self.k, n_features))
        for i in range(self.k):
            centroid = data[np.random.choice(range(n_samples))]
            centroids[i] = centroid
        return centroids

    def _closest_centroid(self, sample, centroids):
        # Return the index of the centroid nearest to the sample
        closest_i = None
        closest_distance = float("inf")
        for i, centroid in enumerate(centroids):
            distance = euclidean_distance(sample, centroid)
            if distance < closest_distance:
                closest_i = i
                closest_distance = distance
        return closest_i

    def _create_clusters(self, centroids, data):
        # Assign every sample to its nearest centroid
        n_samples = np.shape(data)[0]
        clusters = [[] for _ in range(self.k)]
        for sample_i, sample in enumerate(data):
            centroid_i = self._closest_centroid(sample, centroids)
            clusters[centroid_i].append(sample_i)
        return clusters

    def _calculate_centroids(self, clusters, data):
        # Each new centroid is the mean of the samples in its cluster
        n_features = np.shape(data)[1]
        centroids = np.zeros((self.k, n_features))
        for i, cluster in enumerate(clusters):
            centroid = np.mean(data[cluster], axis=0)
            centroids[i] = centroid
        return centroids

    def _get_cluster_labels(self, clusters, data):
        # Turn the cluster membership lists into one label per sample
        y_pred = np.zeros(np.shape(data)[0])
        for cluster_i, cluster in enumerate(clusters):
            for sample_i in cluster:
                y_pred[sample_i] = cluster_i
        return y_pred

    def fit(self, data):
        centroids = self._init_random_centroids(data)
        for iteration in range(self.max_iterations):
            clusters = self._create_clusters(centroids, data)
            prev_centroids = centroids
            centroids = self._calculate_centroids(clusters, data)
            diff = centroids - prev_centroids
            if not diff.any():  # stop once no centroid has moved
                break
        self.kmeans_centroids = centroids
        return centroids

    def predict(self, data):
        if not len(self.kmeans_centroids):
            raise Exception("K-Means centroids have not yet been determined.\nRun the K-Means 'fit' function first.")
        clusters = self._create_clusters(self.kmeans_centroids, data)
        predicted_labels = self._get_cluster_labels(clusters, data)
        return predicted_labels

key_name = {0: 'red', 1: 'blue', 2: 'orange'}

clf = KMeans(k=3, max_iterations=3000)
train_data, train_labels = read_clusters('clusters3.txt')
train_data = np.array(train_data)
centroids = clf.fit(train_data)
print(centroids)
```
The step-by-step updating of the centers looks like this:
A simple way to check how well the algorithm works is to use part of the data for training and the rest for validation, and then see how far the algorithm's results differ from the expected labels.
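As a minimal sketch of such a hold-out split (the 80/20 ratio here is an illustrative assumption, not something from the original data setup):

```python
import numpy as np

# Shuffle the sample indices, then hold out 20% of the samples for validation
indices = np.random.permutation(len(train_data))
split = int(0.8 * len(train_data))
train_idx, test_idx = indices[:split], indices[split:]
X_train, X_test = train_data[train_idx], train_data[test_idx]
y_train = [train_labels[i] for i in train_idx]
y_test = [train_labels[i] for i in test_idx]
```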
Below is an estimate of the algorithm's performance:
```python
# Cluster the training data using the fitted K-Means model
predicted_labels = clf.predict(train_data)

Accuracy = 0
for index in range(len(train_labels)):
    current_label = train_labels[index]
    predicted_label = predicted_labels[index]
    if current_label == int(predicted_label):
        Accuracy += 1
Accuracy /= len(train_labels)
print(Accuracy)
```
The output is:
1.0
The accuracy reaches 100%. Note, though, that K-Means assigns cluster IDs arbitrarily, so this direct comparison only works when the learned IDs happen to line up with the original labels.
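If the learned cluster IDs do not happen to match the original labels, one simple remedy is to remap each cluster ID to the majority true label among its members before scoring; the helper below is a minimal sketch (`remap_labels` is a name introduced here for illustration):

```python
import numpy as np

def remap_labels(predicted_labels, true_labels):
    # Map each arbitrary cluster ID to the majority true label of its members
    predicted_labels = np.asarray(predicted_labels, dtype=int)
    true_labels = np.asarray(true_labels, dtype=int)
    mapping = {}
    for cluster_id in np.unique(predicted_labels):
        members = true_labels[predicted_labels == cluster_id]
        mapping[cluster_id] = np.bincount(members).argmax()
    return np.array([mapping[p] for p in predicted_labels])
```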
When learning an algorithm, understanding the principle and then implementing it in your own code is usually a good way to learn. Once you know how to write the algorithm yourself, you can switch to a ready-made open-source library and use its implementation directly; sklearn is very convenient for this.
```python
from sklearn import cluster

# Fit the same data with scikit-learn's K-Means implementation
clf = cluster.KMeans(n_clusters=3, max_iter=3000, n_init=10)
kmeans = clf.fit(train_data)

Accuracy = 0
for index in range(len(train_labels)):
    # Predict the cluster of each sample one at a time
    current_sample = train_data[index].reshape(1, -1)
    current_label = train_labels[index]
    predicted_label = kmeans.predict(current_sample)[0]
    if current_label == predicted_label:
        Accuracy += 1
Accuracy /= len(train_labels)
```
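For reference, the fitted scikit-learn model also exposes the learned centers directly and can label every sample in one call, so the per-sample loop above is optional:

```python
print(kmeans.cluster_centers_)          # the three learned centroids
predicted = kmeans.predict(train_data)  # cluster IDs for all samples at once
```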
K-Means is used to find center points. The algorithm can also be used for deduplication: points that lie near a center can all be approximated by that center.
Please credit the source when reposting: http://www.bugingcode.com/
More tutorials: 阿貓學編程