As mentioned before, one of the key tasks in machine learning is classification, and there are many different algorithms for it. As the saying goes, birds of a feather flock together. We know very well that people from the same region tend to be similar in both living habits and customs; in other words, they form one group. Each group forms its own center, and the closer a person is to that center, the more similar they are to the rest of the group. The K-Means algorithm is designed to find these center points and treat them as the key representatives of their clusters: when a new data point needs to be classified, we simply check which center it is closest to, and that is the cluster it belongs to.
Suppose we have a set of data representing people's geographic coordinates:
x coordinate | y coordinate | Province (class label) |
---|---|---|
4.035615117 | 4.920529835 | 0 |
4.665299994 | 4.702897321 | 0 |
1.711128297 | 1.031989236 | 1 |
Plotting these coordinates gives the following figure:
The two blue points are close to each other, so their attributes should be similar; the red point sits at some distance from the two blue points and probably belongs to a different cluster.
Here we load a dataset containing three classes; each class is a group with its own center, giving three centers in total. The data and the resulting plot are shown below:
```python
import numpy as np
import matplotlib.pyplot as plt

def read_clusters(clustersfile):
    # Each line of the file holds: x-coordinate, y-coordinate, class label
    cl = []
    tl = []
    with open(clustersfile, 'r') as f:
        for line in f:
            line = line.strip()
            if line != '':
                line = line.split()
                constraint = [float(line[0]), float(line[1])]
                cl.append(constraint)
                tl.append(int(line[2]))
    return cl, tl

train_data, train_labels = read_clusters('clusters3.txt')
train_data = np.array(train_data)

# Color each point by its class and save the scatter plot
key_name = {0: 'red', 1: 'blue', 2: 'orange'}
for i in range(train_data.shape[0]):
    plt.scatter(train_data[i:i + 1, 0:1], train_data[i:i + 1, 1:2],
                c=key_name[train_labels[i]], marker='o', s=20)
plt.savefig('clusters.png')
```
The general steps of K-Means are as follows:
1. First, randomly generate several centers; how many centers to create is determined by how many clusters you want to form.
```python
def _init_random_centroids(self, data):
    # Pick k random samples from the data as the initial centroids
    n_samples, n_features = np.shape(data)
    centroids = np.zeros((self.k, n_features))
    for i in range(self.k):
        centroid = data[np.random.choice(range(n_samples))]
        centroids[i] = centroid
    return centroids
```
2. Next, compare every data point with these centers: whichever center a point is closest to, that is the cluster the point belongs to.
The distance is computed with the following formula:
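For two points $a = (a_1, \dots, a_n)$ and $b = (b_1, \dots, b_n)$, this is the standard Euclidean distance:

$$d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$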
```python
def euclidean_distance(vec_1, vec_2):
    if len(vec_1) != len(vec_2):
        raise Exception("The two vectors do NOT have equal length")
    distance = 0
    for i in range(len(vec_1)):
        distance += pow((vec_1[i] - vec_2[i]), 2)
    return np.sqrt(distance)
```
Find which center a sample belongs to according to the distance:
```python
def _closest_centroid(self, sample, centroids):
    # Return the index of the centroid nearest to the sample
    closest_i = None
    closest_distance = float("inf")
    for i, centroid in enumerate(centroids):
        distance = euclidean_distance(sample, centroid)
        if distance < closest_distance:
            closest_i = i
            closest_distance = distance
    return closest_i
```
3. Once the centers have determined the clusters, update the centers from the clusters: the new center of a cluster is the mean of all the points in it.
```python
def _calculate_centroids(self, clusters, data):
    # Each new centroid is the mean of the samples assigned to its cluster
    n_features = np.shape(data)[1]
    centroids = np.zeros((self.k, n_features))
    for i, cluster in enumerate(clusters):
        centroid = np.mean(data[cluster], axis=0)
        centroids[i] = centroid
    return centroids
```
4. Repeat this process until the centers no longer change, as sketched below.
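The convergence check itself can be as simple as testing whether any centroid moved between two consecutive iterations; this fragment is taken from the `fit` method in the full listing below:

```python
# Stop iterating once no centroid has moved since the previous round
diff = centroids - prev_centroids
if not diff.any():
    break
```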
The complete implementation is as follows:
```python
import numpy as np

def euclidean_distance(vec_1, vec_2):
    if len(vec_1) != len(vec_2):
        raise Exception("The two vectors do NOT have equal length")
    distance = 0
    for i in range(len(vec_1)):
        distance += pow((vec_1[i] - vec_2[i]), 2)
    return np.sqrt(distance)

def read_clusters(clustersfile):
    cl = []
    tl = []
    with open(clustersfile, 'r') as f:
        for line in f:
            line = line.strip()
            if line != '':
                line = line.split()
                constraint = [float(line[0]), float(line[1])]
                cl.append(constraint)
                tl.append(int(line[2]))
    return cl, tl

class KMeans():
    def __init__(self, k=2, max_iterations=500):
        self.k = k
        self.max_iterations = max_iterations
        self.kmeans_centroids = []

    def _init_random_centroids(self, data):
        # Pick k random samples as the initial centroids
        n_samples, n_features = np.shape(data)
        centroids = np.zeros((self.k, n_features))
        for i in range(self.k):
            centroid = data[np.random.choice(range(n_samples))]
            centroids[i] = centroid
        return centroids

    def _closest_centroid(self, sample, centroids):
        # Return the index of the centroid nearest to the sample
        closest_i = None
        closest_distance = float("inf")
        for i, centroid in enumerate(centroids):
            distance = euclidean_distance(sample, centroid)
            if distance < closest_distance:
                closest_i = i
                closest_distance = distance
        return closest_i

    def _create_clusters(self, centroids, data):
        # Assign every sample to its nearest centroid
        n_samples = np.shape(data)[0]
        clusters = [[] for _ in range(self.k)]
        for sample_i, sample in enumerate(data):
            centroid_i = self._closest_centroid(sample, centroids)
            clusters[centroid_i].append(sample_i)
        return clusters

    def _calculate_centroids(self, clusters, data):
        # Each new centroid is the mean of the samples in its cluster
        n_features = np.shape(data)[1]
        centroids = np.zeros((self.k, n_features))
        for i, cluster in enumerate(clusters):
            centroid = np.mean(data[cluster], axis=0)
            centroids[i] = centroid
        return centroids

    def _get_cluster_labels(self, clusters, data):
        # Turn the cluster membership lists into one label per sample
        y_pred = np.zeros(np.shape(data)[0])
        for cluster_i, cluster in enumerate(clusters):
            for sample_i in cluster:
                y_pred[sample_i] = cluster_i
        return y_pred

    def fit(self, data):
        centroids = self._init_random_centroids(data)
        for iteration in range(self.max_iterations):
            clusters = self._create_clusters(centroids, data)
            prev_centroids = centroids
            centroids = self._calculate_centroids(clusters, data)
            diff = centroids - prev_centroids
            if not diff.any():  # stop once no centroid has moved
                break
        self.kmeans_centroids = centroids
        return centroids

    def predict(self, data):
        if not len(self.kmeans_centroids):
            raise Exception("K-Means centroids have not yet been determined.\nRun the K-Means 'fit' function first.")
        clusters = self._create_clusters(self.kmeans_centroids, data)
        predicted_labels = self._get_cluster_labels(clusters, data)
        return predicted_labels

key_name = {0: 'red', 1: 'blue', 2: 'orange'}

clf = KMeans(k=3, max_iterations=3000)
train_data, train_labels = read_clusters('clusters3.txt')
train_data = np.array(train_data)
centroids = clf.fit(train_data)
print(centroids)
```
The step-by-step updating of the centers looks like this:
A simple way to check how well the algorithm works is to use part of the data for training and the rest for validation, and then see how far the algorithm's results differ from the expected labels.
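As a minimal sketch of such a hold-out split (the 80/20 ratio here is an illustrative assumption, not something from the original data setup):

```python
import numpy as np

# Shuffle the sample indices, then hold out 20% of the samples for validation
indices = np.random.permutation(len(train_data))
split = int(0.8 * len(train_data))
train_idx, test_idx = indices[:split], indices[split:]
X_train, X_test = train_data[train_idx], train_data[test_idx]
y_train = [train_labels[i] for i in train_idx]
y_test = [train_labels[i] for i in test_idx]
```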
Below is an estimate of the algorithm's performance:
```python
# Cluster the training data using the fitted K-Means model
predicted_labels = clf.predict(train_data)

Accuracy = 0
for index in range(len(train_labels)):
    current_label = train_labels[index]
    predicted_label = predicted_labels[index]
    if current_label == int(predicted_label):
        Accuracy += 1
Accuracy /= len(train_labels)
print(Accuracy)
```
The output is:
1.0
The accuracy reaches 100%. Note, though, that K-Means assigns cluster IDs arbitrarily, so this direct comparison only works when the learned IDs happen to line up with the original labels.
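If the learned cluster IDs do not happen to match the original labels, one simple remedy is to remap each cluster ID to the majority true label among its members before scoring; the helper below is a minimal sketch (`remap_labels` is a name introduced here for illustration):

```python
import numpy as np

def remap_labels(predicted_labels, true_labels):
    # Map each arbitrary cluster ID to the majority true label of its members
    predicted_labels = np.asarray(predicted_labels, dtype=int)
    true_labels = np.asarray(true_labels, dtype=int)
    mapping = {}
    for cluster_id in np.unique(predicted_labels):
        members = true_labels[predicted_labels == cluster_id]
        mapping[cluster_id] = np.bincount(members).argmax()
    return np.array([mapping[p] for p in predicted_labels])
```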
When learning an algorithm, understanding the principle and then implementing it in your own code is usually a good way to learn. Once you know how to write the algorithm yourself, you can switch to a ready-made open-source library and use its implementation directly; sklearn is very convenient for this.
```python
from sklearn import cluster

# Fit the same data with scikit-learn's K-Means implementation
clf = cluster.KMeans(n_clusters=3, max_iter=3000, n_init=10)
kmeans = clf.fit(train_data)

Accuracy = 0
for index in range(len(train_labels)):
    # Predict the cluster of each sample one at a time
    current_sample = train_data[index].reshape(1, -1)
    current_label = train_labels[index]
    predicted_label = kmeans.predict(current_sample)[0]
    if current_label == predicted_label:
        Accuracy += 1
Accuracy /= len(train_labels)
```
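For reference, the fitted scikit-learn model also exposes the learned centers directly and can label every sample in one call, so the per-sample loop above is optional:

```python
print(kmeans.cluster_centers_)          # the three learned centroids
predicted = kmeans.predict(train_data)  # cluster IDs for all samples at once
```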
K-Means is used to find center points. The algorithm can also be used for deduplication: points that lie near a center can all be approximated by that center.
Please credit the source when reposting: http://www.bugingcode.com/
More tutorials: 阿貓學編程