9，K-近鄰算法（KNN）

時間 2019-12-01

標籤近鄰算法 knn 简体版

原文原文鏈接

導引：html

如何進行電影分類

衆所周知，電影能夠按照題材分類，然而題材自己是如何定義的?由誰來斷定某部電影屬於哪個題材?也就是說同一題材的電影具備哪些公共特徵?這些都是在進行電影分類時必需要考慮的問題。沒有哪一個電影人會說本身製做的電影和之前的某部電影相似，但咱們確實知道每部電影在風格上的確有可能會和同題材的電影相近。那麼動做片具備哪些共有特徵，使得動做片之間很是相似，而與愛情片存在着明顯的差異呢？動做片中也會存在接吻鏡頭，愛情片中也會存在打鬥場景，咱們不能單純依靠是否存在打鬥或者親吻來判斷影片的類型。可是愛情片中的親吻鏡頭更多，動做片中的打鬥場景也更頻繁，基於此類場景在某部電影中出現的次數能夠用來進行電影分類。python

本章介紹第一個機器學習算法：K-近鄰算法，它很是有效並且易於掌握。算法

一、k-近鄰算法原理

簡單地說，K-近鄰算法採用測量不一樣特徵值之間的距離方法進行分類。app

優勢：精度高(計算距離)、對異常值不敏感（單純根據距離進行分類，會忽略特殊狀況）、無數據輸入假定（不會對數據預先進行斷定）。
缺點：時間複雜度高、空間複雜度高。
適用數據範圍：數值型和標稱型。

工做原理

存在一個樣本數據集合，也稱做訓練樣本集，而且樣本集中每一個數據都存在標籤，即咱們知道樣本集中每一數據與所屬分類的對應關係。輸人沒有標籤的新數據後，將新數據的每一個特徵與樣本集中數據對應的特徵進行比較，而後算法提取樣本集中特徵最類似數據（最近鄰）的分類標籤。通常來講，咱們只選擇樣本數據集中前K個最類似的數據，這就是K-近鄰算法中K的出處,一般K是不大於20的整數。最後，選擇K個最類似數據中出現次數最多的分類，做爲新數據的分類。dom

回到前面電影分類的例子，使用K-近鄰算法分類愛情片和動做片。有人曾經統計過不少電影的打鬥鏡頭和接吻鏡頭，下圖顯示了6部電影的打鬥和接吻次數。假若有一部未看過的電影，如何肯定它是愛情片仍是動做片呢？咱們可使用K-近鄰算法來解決這個問題。機器學習

首先咱們須要知道這個未知電影存在多少個打鬥鏡頭和接吻鏡頭，上圖中問號位置是該未知電影出現的鏡頭數圖形化展現，具體數字參見下表。ide

即便不知道未知電影屬於哪一種類型，咱們也能夠經過某種方法計算出來。首先計算未知電影與樣本集中其餘電影的距離，如圖所示。學習

如今咱們獲得了樣本集中全部電影與未知電影的距離，按照距離遞增排序，能夠找到K個距離最近的電影。假定k=3，則三個最靠近的電影依次是California Man、He's Not Really into Dudes、Beautiful Woman。K-近鄰算法按照距離最近的三部電影的類型，決定未知電影的類型，而這三部電影全是愛情片，所以咱們斷定未知電影是愛情片。測試

歐幾里得距離(Euclidean Distance)

歐氏距離是最多見的距離度量，衡量的是多維空間中各個點之間的絕對距離。公式以下：this

二、在scikit-learn庫中使用k-近鄰算法

分類問題：from sklearn.neighbors import KNeighborsClassifier

Type Markdown and LaTeX: $α^{2}$

0）一個最簡單的例子

身高、體重、鞋子尺碼數據對應性別

import numpy as np
import pandas  as pd
from pandas import DataFrame,Series

feature = np.array([[170,75,41],[166,65,38],[177,80,43],[179,80,43],[170,60,40],[160,55,38]])
target = np.array(['男','女','男','男','女','女'])

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)

knn.fit(feature,target)
knn.score(feature,target)  #1.0

knn.predict(np.array([[176,71,38]]))
# array(['男'], dtype='<U1')

查看電影屬於那個類別

df = pd.read_excel('../../my_films.xlsx')
df

feature = df[['Action lens','Love lens']]
target = df['target']

knn = KNeighborsClassifier(n_neighbors=4)
knn.fit(feature,target)
knn.score(feature,target)  #1.0

knn.predict(np.array([[60,42]]))

#  array(['Action'], dtype=object)

1）用於分類

數據藍蝴蝶以及k值的算法

import sklearn.datasets as datasets
iris = datasets.load_iris()
iris

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
        [5.5, 4.2, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5. , 3.2, 1.2, 0.2],
        [5.5, 3.5, 1.3, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [4.4, 3. , 1.3, 0.2],
        [5.1, 3.4, 1.5, 0.2],
        [5. , 3.5, 1.3, 0.3],
        [4.5, 2.3, 1.3, 0.3],
        [4.4, 3.2, 1.3, 0.2],
        [5. , 3.5, 1.6, 0.6],
        [5.1, 3.8, 1.9, 0.4],
        [4.8, 3. , 1.4, 0.3],
        [5.1, 3.8, 1.6, 0.2],
        [4.6, 3.2, 1.4, 0.2],
        [5.3, 3.7, 1.5, 0.2],
        [5. , 3.3, 1.4, 0.2],
        [7. , 3.2, 4.7, 1.4],
        [6.4, 3.2, 4.5, 1.5],
        [6.9, 3.1, 4.9, 1.5],
        [5.5, 2.3, 4. , 1.3],
        [6.5, 2.8, 4.6, 1.5],
        [5.7, 2.8, 4.5, 1.3],
        [6.3, 3.3, 4.7, 1.6],
        [4.9, 2.4, 3.3, 1. ],
        [6.6, 2.9, 4.6, 1.3],
        [5.2, 2.7, 3.9, 1.4],
        [5. , 2. , 3.5, 1. ],
        [5.9, 3. , 4.2, 1.5],
        [6. , 2.2, 4. , 1. ],
        [6.1, 2.9, 4.7, 1.4],
        [5.6, 2.9, 3.6, 1.3],
        [6.7, 3.1, 4.4, 1.4],
        [5.6, 3. , 4.5, 1.5],
        [5.8, 2.7, 4.1, 1. ],
        [6.2, 2.2, 4.5, 1.5],
        [5.6, 2.5, 3.9, 1.1],
        [5.9, 3.2, 4.8, 1.8],
        [6.1, 2.8, 4. , 1.3],
        [6.3, 2.5, 4.9, 1.5],
        [6.1, 2.8, 4.7, 1.2],
        [6.4, 2.9, 4.3, 1.3],
        [6.6, 3. , 4.4, 1.4],
        [6.8, 2.8, 4.8, 1.4],
        [6.7, 3. , 5. , 1.7],
        [6. , 2.9, 4.5, 1.5],
        [5.7, 2.6, 3.5, 1. ],
        [5.5, 2.4, 3.8, 1.1],
        [5.5, 2.4, 3.7, 1. ],
        [5.8, 2.7, 3.9, 1.2],
        [6. , 2.7, 5.1, 1.6],
        [5.4, 3. , 4.5, 1.5],
        [6. , 3.4, 4.5, 1.6],
        [6.7, 3.1, 4.7, 1.5],
        [6.3, 2.3, 4.4, 1.3],
        [5.6, 3. , 4.1, 1.3],
        [5.5, 2.5, 4. , 1.3],
        [5.5, 2.6, 4.4, 1.2],
        [6.1, 3. , 4.6, 1.4],
        [5.8, 2.6, 4. , 1.2],
        [5. , 2.3, 3.3, 1. ],
        [5.6, 2.7, 4.2, 1.3],
        [5.7, 3. , 4.2, 1.2],
        [5.7, 2.9, 4.2, 1.3],
        [6.2, 2.9, 4.3, 1.3],
        [5.1, 2.5, 3. , 1.1],
        [5.7, 2.8, 4.1, 1.3],
        [6.3, 3.3, 6. , 2.5],
        [5.8, 2.7, 5.1, 1.9],
        [7.1, 3. , 5.9, 2.1],
        [6.3, 2.9, 5.6, 1.8],
        [6.5, 3. , 5.8, 2.2],
        [7.6, 3. , 6.6, 2.1],
        [4.9, 2.5, 4.5, 1.7],
        [7.3, 2.9, 6.3, 1.8],
        [6.7, 2.5, 5.8, 1.8],
        [7.2, 3.6, 6.1, 2.5],
        [6.5, 3.2, 5.1, 2. ],
        [6.4, 2.7, 5.3, 1.9],
        [6.8, 3. , 5.5, 2.1],
        [5.7, 2.5, 5. , 2. ],
        [5.8, 2.8, 5.1, 2.4],
        [6.4, 3.2, 5.3, 2.3],
        [6.5, 3. , 5.5, 1.8],
        [7.7, 3.8, 6.7, 2.2],
        [7.7, 2.6, 6.9, 2.3],
        [6. , 2.2, 5. , 1.5],
        [6.9, 3.2, 5.7, 2.3],
        [5.6, 2.8, 4.9, 2. ],
        [7.7, 2.8, 6.7, 2. ],
        [6.3, 2.7, 4.9, 1.8],
        [6.7, 3.3, 5.7, 2.1],
        [7.2, 3.2, 6. , 1.8],
        [6.2, 2.8, 4.8, 1.8],
        [6.1, 3. , 4.9, 1.8],
        [6.4, 2.8, 5.6, 2.1],
        [7.2, 3. , 5.8, 1.6],
        [7.4, 2.8, 6.1, 1.9],
        [7.9, 3.8, 6.4, 2. ],
        [6.4, 2.8, 5.6, 2.2],
        [6.3, 2.8, 5.1, 1.5],
        [6.1, 2.6, 5.6, 1.4],
        [7.7, 3. , 6.1, 2.3],
        [6.3, 3.4, 5.6, 2.4],
        [6.4, 3.1, 5.5, 1.8],
        [6. , 3. , 4.8, 1.8],
        [6.9, 3.1, 5.4, 2.1],
        [6.7, 3.1, 5.6, 2.4],
        [6.9, 3.1, 5.1, 2.3],
        [5.8, 2.7, 5.1, 1.9],
        [6.8, 3.2, 5.9, 2.3],
        [6.7, 3.3, 5.7, 2.5],
        [6.7, 3. , 5.2, 2.3],
        [6.3, 2.5, 5. , 1.9],
        [6.5, 3. , 5.2, 2. ],
        [6.2, 3.4, 5.4, 2.3],
        [5.9, 3. , 5.1, 1.8]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]),
 'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10'),
 'DESCR': 'Iris Plants Database\n====================\n\nNotes\n-----\nData Set Characteristics:\n    :Number of Instances: 150 (50 in each of three classes)\n    :Number of Attributes: 4 numeric, predictive attributes and the class\n    :Attribute Information:\n        - sepal length in cm\n        - sepal width in cm\n        - petal length in cm\n        - petal width in cm\n        - class:\n                - Iris-Setosa\n                - Iris-Versicolour\n                - Iris-Virginica\n    :Summary Statistics:\n\n    ============== ==== ==== ======= ===== ====================\n                    Min  Max   Mean    SD   Class Correlation\n    ============== ==== ==== ======= ===== ====================\n    sepal length:   4.3  7.9   5.84   0.83    0.7826\n    sepal width:    2.0  4.4   3.05   0.43   -0.4194\n    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)\n    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)\n    ============== ==== ==== ======= ===== ====================\n\n    :Missing Attribute Values: None\n    :Class Distribution: 33.3% for each of 3 classes.\n    :Creator: R.A. Fisher\n    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)\n    :Date: July, 1988\n\nThis is a copy of UCI ML iris datasets.\nhttp://archive.ics.uci.edu/ml/datasets/Iris\n\nThe famous Iris database, first used by Sir R.A Fisher\n\nThis is perhaps the best known database to be found in the\npattern recognition literature.  Fisher\'s paper is a classic in the field and\nis referenced frequently to this day.  (See Duda & Hart, for example.)  The\ndata set contains 3 classes of 50 instances each, where each class refers to a\ntype of iris plant.  One class is linearly separable from the other 2; the\nlatter are NOT linearly separable from each other.\n\nReferences\n----------\n   - Fisher,R.A. "The use of multiple measurements in taxonomic problems"\n     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to\n     Mathematical Statistics" (John Wiley, NY, 1950).\n   - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.\n     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.\n   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System\n     Structure and Classification Rule for Recognition in Partially Exposed\n     Environments".  IEEE Transactions on Pattern Analysis and Machine\n     Intelligence, Vol. PAMI-2, No. 1, 67-71.\n   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions\n     on Information Theory, May 1972, 431-433.\n   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II\n     conceptual clustering system finds 3 classes in the data.\n   - Many, many more ...\n',
 'feature_names': ['sepal length (cm)',
  'sepal width (cm)',
  'petal length (cm)',
  'petal width (cm)']}

iris

#提取樣本數據
feature = iris['data']
target = iris['target']

#將樣本數據進行隨機打亂
np.random.seed(1)
np.random.shuffle(feature)
np.random.seed(1)
np.random.shuffle(target)

feature.shape #(150, 4)

#獲取訓練樣本數據和測試樣本數據
#訓練數據
x_train = feature[:140]
y_train = target[:140]
#測試數據
x_test = feature[-10:]
y_test =target[-10:]

#實例化模型對象&訓練模型
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(x_train,y_train)
knn.score(x_train,y_train)   #0.9857142857142858

print('預測分類：',knn.predict(x_test))
print('真實分類：',y_test)
預測分類： [0 2 1 2 0 1 2 1 1 0]
真實分類： [0 2 1 1 0 1 2 1 1 0]

# 選中最優的k值
k_list = []
s_list = []

for k in range(1,60):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train,y_train)
    s_list.append(s)
    k_list.append(k)

import matplot.pyplot as plt
plt.plot(k_list,s_list)

相關標籤/搜索

每日一句

每一个你不满意的现在，都有一个你没有努力的曾经。