1. Principles of the k-nearest neighbor algorithm
a. The k-nearest neighbor (KNN) algorithm is a basic method for classification and regression
Classification: for a new sample, predict its label by majority vote over the labels of its k nearest training samples
Regression: for a new sample, use the mean of the label values of its k nearest training samples as the predicted value
b. KNN has no explicit training phase; it predicts directly from the stored data and is a typical example of lazy learning
c. KNN is a nonparametric learning algorithm with no learned parameters (k is a hyperparameter, not a learned parameter)
KNN has very high capacity; when the number of training samples is large it can achieve high accuracy
Drawbacks: high computational cost, since a naive implementation builds an N*N distance matrix, i.e. O(N^2) computation, where N is the number of training samples
When the dataset has billions of samples, this cost is unacceptable
When the dataset is very small, generalization is poor and overfitting is likely
It cannot judge the importance of individual features
d. The three key elements of KNN
Choice of k
Distance metric
Decision rule
2. Choice of k
a. When k = 1, KNN becomes the nearest neighbor algorithm: the class of the training point closest to the input is taken as the input's class
b. A small k amounts to predicting from training samples in a small neighborhood, so the bias of learning is small
If a nearby point happens to be noise, the prediction will be wrong; a smaller k makes the overall model more complex and prone to overfitting
Advantage: lower bias
Disadvantage: higher variance (larger fluctuations)
c. A large k amounts to predicting from training samples in a larger neighborhood
Training samples far from the input also influence the prediction, which can push it away from the expected result
Advantage: lower variance (smaller fluctuations)
Disadvantage: higher bias
d. In practice, k is usually chosen as a relatively small value, and cross-validation is used to select the optimal k (see the sketch below)
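A minimal sketch of selecting k by cross-validation with scikit-learn; the iris dataset, the candidate range 1..30, and 5-fold splitting are arbitrary choices for illustration:

    # Evaluate each candidate k with 5-fold cross-validation and keep the best one.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    scores = {}
    for k in range(1, 31):
        clf = KNeighborsClassifier(n_neighbors=k)
        scores[k] = cross_val_score(clf, X, y, cv=5).mean()  # mean accuracy over the folds

    best_k = max(scores, key=scores.get)
    print(best_k, scores[best_k])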
3. Distance metric
a. The distance between two samples in feature space reflects how similar the two samples are
In the KNN model the feature space is usually an n-dimensional real vector space, and the Euclidean distance is the usual choice
b. Different distance metrics may determine different nearest neighbors
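The common choices are special cases of the L_p (Minkowski) distance; p = 2 gives the Euclidean distance and p = 1 the Manhattan distance:

    L_p(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^p \right)^{1/p},
    \qquad
    L_2(x_i, x_j) = \sqrt{ \sum_{l=1}^{n} \left( x_i^{(l)} - x_j^{(l)} \right)^2 }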
4. Decision rule
a. Classification decision rule
Classification usually uses majority voting; alternatively, votes can be weighted by distance, with closer samples given larger weights (see the sketch below)
Majority voting is equivalent to empirical risk minimization (under 0-1 loss)
b. Regression decision rule
Regression usually uses the mean of the neighbors' values; alternatively, the average can be weighted by distance, with closer samples given larger weights
Mean regression is equivalent to empirical risk minimization (under squared loss)
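A minimal sketch contrasting uniform majority voting with distance-weighted voting in scikit-learn; the iris dataset and the train/test split are placeholders for illustration:

    # weights='uniform' is plain majority voting; weights='distance' weights each
    # neighbor's vote by the inverse of its distance to the query point.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for w in ('uniform', 'distance'):
        clf = KNeighborsClassifier(n_neighbors=5, weights=w)
        clf.fit(X_train, y_train)
        print(w, clf.score(X_test, y_test))

KNeighborsRegressor accepts the same weights parameter, turning mean regression into a distance-weighted average.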
5. kd-tree: fast k-nearest-neighbor search over the training data
a. When implementing KNN, the main concern is how to search the training data for the k nearest neighbors quickly
b. The simplest method is a linear scan (brute force): compute the distance between the input sample and every training sample
c. A kd-tree is a tree data structure that stores samples in k-dimensional space for fast search; it is a binary tree representing a partition of the k-dimensional space
d. Building a kd-tree amounts to repeatedly splitting the k-dimensional space with hyperplanes perpendicular to the coordinate axes; each node of the kd-tree corresponds to a k-dimensional hyperrectangular region
6. kd-tree construction algorithm
Step 1: Take x1 as the axis and the median x1* of all samples' x1 coordinates as the split point, and split the root node's hyperrectangle into two subregions, producing left and right children at depth 1. The left child corresponds to the region x1 < x1*, the right child to the region x1 > x1*; the point lying on the splitting hyperplane is stored at the root node
Step 2: For a node at depth j, choose x_l as the splitting axis, where l = j (mod k) + 1, and continue splitting; after the split the tree's depth is j + 1
Step 3: Splitting stops when neither subregion of any node contains samples, which yields the region partition of the kd-tree (see the sketch below)
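A minimal Python sketch of this construction (median split, cycling through the axes with a 0-indexed depth % k instead of l = j (mod k) + 1); the Node class and build_kdtree function are illustrative names, not a library API:

    # Build a kd-tree by recursive median splits, cycling through the coordinate axes.
    class Node:
        def __init__(self, point, axis, left=None, right=None):
            self.point = point   # sample stored at this node (median along `axis`)
            self.axis = axis     # splitting dimension at this depth
            self.left = left     # subtree with point[axis] smaller than the median
            self.right = right   # subtree with point[axis] larger than the median

    def build_kdtree(points, depth=0):
        if not points:
            return None
        axis = depth % len(points[0])                 # splitting axis for this depth
        points = sorted(points, key=lambda p: p[axis])
        mid = len(points) // 2                        # index of the median point
        return Node(points[mid], axis,
                    left=build_kdtree(points[:mid], depth + 1),
                    right=build_kdtree(points[mid + 1:], depth + 1))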
7. kd-tree search algorithm
Step 1: Find the leaf node that contains the test point by starting from the root and descending the kd-tree recursively
If the test point's coordinate in the current splitting dimension is less than the split point's coordinate, go to the current node's left child
If the test point's coordinate in the current splitting dimension is greater than the split point's coordinate, go to the current node's right child
During the descent, record the visited nodes in order in a last-in-first-out stack, so that the search can backtrack later
Step 2: Take this leaf node as the current nearest point Xnst; the true nearest point must lie inside the hypersphere centered at the test point whose radius is the distance to Xnst
Step 3: Pop a node from the stack and call it Xnew (each backtracking step moves up to a parent node in the kd-tree)
If Xnew is closer to the test point than Xnst, update Xnst
Check whether the splitting hyperplane at Xnew intersects the hypersphere centered at the test point with radius equal to the distance from the test point to Xnst
If they intersect:
If the search came up from Xnew's left child, enter Xnew's right child, search downward and update the stack, then continue backtracking upward
If the search came up from Xnew's right child, enter Xnew's left child, search downward and update the stack, then continue backtracking upward
If they do not intersect: backtrack directly
Step 4: When backtracking reaches the root node, the search ends; the current nearest point at that moment is the test point's nearest neighbor
The average search complexity of a kd-tree is O(log N), where N is the size of the training set
Usually the nearest-neighbor search only needs to examine a few nearby leaf nodes (see the sketch below)
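A minimal sketch of nearest-neighbor search written recursively, so the descent and the backtracking are handled by the call stack instead of an explicit LIFO stack; it reuses the illustrative Node and build_kdtree from the construction sketch above:

    import math

    def nearest_neighbor(root, target):
        best = [None, float('inf')]   # [best point found so far, its distance to target]

        def search(node):
            if node is None:
                return
            axis = node.axis
            # Descend first into the subtree on the same side as the target.
            near, far = ((node.left, node.right) if target[axis] < node.point[axis]
                         else (node.right, node.left))
            search(near)
            # On the way back up, check whether this node improves the current best.
            d = math.dist(target, node.point)
            if d < best[1]:
                best[0], best[1] = node.point, d
            # Visit the far subtree only if the splitting hyperplane intersects
            # the hypersphere around the target with radius best[1].
            if abs(target[axis] - node.point[axis]) < best[1]:
                search(far)

        search(root)
        return best[0], best[1]

    tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
    print(nearest_neighbor(tree, (6.5, 3)))   # (7, 2) is the nearest of these points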
8. KNN implementation
class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None, **kwargs)
n_neighbors : int, optional (default = 5)
Number of neighbors to use by default for kneighbors queries.
weights : str or callable, optional (default = 'uniform')
Weight function used in prediction. Possible values: 'uniform' (all points weighted equally), 'distance' (weight points by the inverse of their distance), or a user-defined callable.
algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional
Algorithm used to compute the nearest neighbors: 'ball_tree' uses BallTree, 'kd_tree' uses KDTree, 'brute' uses a brute-force search, and 'auto' attempts to decide the most appropriate algorithm based on the values passed to the fit method. Note: fitting on sparse input will override the setting of this parameter, using brute force.
leaf_size : int, optional (default = 30)
Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.
p : integer, optional (default = 2)
Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
metric : string or callable, default 'minkowski'
The distance metric to use for the tree. The default metric is minkowski, and with p=2 is equivalent to the standard Euclidean metric. See the documentation of the DistanceMetric class for a list of available metrics.
metric_params : dict, optional (default = None)
Additional keyword arguments for the metric function.
n_jobs : int or None, optional (default = None)
The number of parallel jobs to run for neighbors search. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details. Doesn't affect the fit method.
Classification example:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('../datasets/Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fitting K-NN to the Training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))
9. KDTree
KDTree(X, leaf_size=40, metric='minkowski', **kwargs)
X : array-like, shape = [n_samples, n_features]
n_samples is the number of points in the data set, and n_features is the dimension of the parameter space. Note: if X is a C-contiguous array of doubles then data will not be copied. Otherwise, an internal copy will be made.
leaf_size : positive integer (default = 40)
Number of points at which to switch to brute-force. Changing leaf_size will not affect the results of a query, but can significantly impact the speed of a query and the memory required to store the constructed tree. The amount of memory needed to store the tree scales as approximately n_samples / leaf_size. For a specified leaf_size, a leaf node is guaranteed to satisfy leaf_size <= n_points <= 2 * leaf_size, except in the case that n_samples < leaf_size.
metric : string or DistanceMetric object
The distance metric to use for the tree. Default='minkowski' with p=2 (that is, a euclidean metric). See the documentation of the DistanceMetric class for a list of available metrics. kd_tree.valid_metrics gives a list of the metrics which are valid for KDTree.
Additional keywords are passed to the distance metric class.
Example:
>>> import numpy as np
>>> from sklearn.neighbors import KDTree
>>> rng = np.random.RandomState(0)
>>> X = rng.random_sample((10, 3))        # 10 points in 3 dimensions
>>> tree = KDTree(X, leaf_size=2)
>>> dist, ind = tree.query(X[:1], k=3)    # the 3 nearest neighbors of the first point
>>> print(ind)   # indices of 3 closest neighbors
[0 3 1]
>>> print(dist)  # distances to 3 closest neighbors
[ 0.          0.19662693  0.29473397]
The tree can also find which points lie within a specified radius of a given point.
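For instance, continuing the session above, query_radius returns the indices of all points within a given radius of the query point; the radius 0.3 is an arbitrary value chosen for illustration:

>>> ind = tree.query_radius(X[:1], r=0.3)   # indices of points within distance 0.3
>>> print(ind)                              # array of index arrays, one per query point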