Machine Learning Algorithms in OpenCV 3 with Python
English original: http://docs.opencv.org/master/d5/d26/tutorial_py_knn_understanding.html
In this chapter, we will understand the concepts of the k-Nearest Neighbour (kNN) algorithm.
kNN is one of the simplest classification algorithms available for supervised learning. The idea is to search for the closest match to the test data in feature space. We will look at it using the image below.
In the image, there are two families, Blue Squares and Red Triangles. We call each family a class. Their houses are shown on their town map, which we call the feature space. (You can consider a feature space as a space onto which all data points are projected. For example, consider a 2D coordinate space. Each data point has two features, its x and y coordinates, so you can represent it in that 2D space. With three features you need a 3D space. With N features you need an N-dimensional space, and this N-dimensional space is the feature space. Our image is the 2D case with two features.)
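To make the idea of a feature space concrete, here is a minimal sketch in plain Python. It assumes Euclidean distance as the measure of closeness (the text does not name a metric), and the helper name `euclidean` is hypothetical. The same function works for 2, 3, or N features.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points with the same number of features."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of features")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two features: a point in a 2D feature space.
d2 = euclidean((0.0, 0.0), (3.0, 4.0))

# Three features: the same idea in a 3D feature space.
d3 = euclidean((0.0, 0.0, 0.0), (1.0, 2.0, 2.0))

print(d2, d3)
```

Only the length of the tuples changes as features are added; the notion of "closest" stays the same.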
Now a new member comes into the town and builds a new home, shown as a green circle. He should be added to one of the Blue or Red families. We call that process classification. What do we do? Since we are dealing with kNN, let us apply this algorithm.
One method is to check who his nearest neighbour is. From the image, it is clearly a member of the Red Triangle family, so he is also added to Red Triangle. This method is simply called Nearest Neighbour, because classification depends only on the single nearest neighbour.
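The Nearest Neighbour rule can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the OpenCV implementation: the `houses` data is a made-up toy town, and Euclidean distance is assumed.

```python
import math

# Hypothetical toy town: ((x, y) position, family label).
houses = [
    ((1.0, 1.0), "Red"),
    ((2.0, 4.0), "Blue"),
    ((5.0, 5.0), "Blue"),
]

def nearest_neighbour(houses, newcomer):
    """Return the label of the single house closest to the newcomer."""
    def distance(house):
        (x, y), _ = house
        return math.hypot(x - newcomer[0], y - newcomer[1])
    _, label = min(houses, key=distance)
    return label

label = nearest_neighbour(houses, (2.0, 2.0))
print(label)  # Red
```

The newcomer at (2, 2) is closest to the Red house at (1, 1), so he is labelled Red regardless of how many Blue houses lie a little further away.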
But there is a problem with that. A Red Triangle may be the nearest, but what if there are a lot of Blue Squares near him? Then Blue Squares have more strength in that locality than Red Triangles, so checking only the nearest one is not sufficient. Instead we check the k nearest families, and the new member joins whichever family holds the majority among them. In our image, let's take k=3, i.e. the 3 nearest families. He has two Red and one Blue (there are two Blues equidistant, but since k=3, we take only one of them), so again he should be added to the Red family. But what if we take k=7? Then he has 5 Blue families and 2 Red families, so now he should be added to the Blue family. The result changes with the value of k. More interesting still, what if k=4? He has 2 Red and 2 Blue neighbours. It is a tie! With two classes, it is therefore better to take k as an odd number. This method is called k-Nearest Neighbour, since classification depends on the k nearest neighbours.
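The effect of k can be reproduced with a small majority-vote sketch in plain Python. The `houses` layout below is hypothetical, chosen so that the vote counts match the ones described in the text (k=3 gives Red, k=7 gives Blue, k=4 is a tie); Euclidean distance is assumed.

```python
import math
from collections import Counter

# Hypothetical town, newcomer at the origin: two Red houses close by,
# five Blue houses further out.
houses = [
    ((1.0, 0.0), "Red"),
    ((0.0, 2.0), "Red"),
    ((3.0, 0.0), "Blue"),
    ((0.0, 4.0), "Blue"),
    ((5.0, 0.0), "Blue"),
    ((0.0, 6.0), "Blue"),
    ((7.0, 0.0), "Blue"),
]

def knn_votes(houses, newcomer, k):
    """Count the family labels among the k houses nearest to the newcomer."""
    def distance(house):
        (x, y), _ = house
        return math.hypot(x - newcomer[0], y - newcomer[1])
    nearest = sorted(houses, key=distance)[:k]
    return Counter(label for _, label in nearest)

def knn_classify(houses, newcomer, k):
    """Return the majority label among the k nearest houses."""
    return knn_votes(houses, newcomer, k).most_common(1)[0][0]

newcomer = (0.0, 0.0)
print(knn_classify(houses, newcomer, 3))  # Red
print(knn_classify(houses, newcomer, 7))  # Blue
print(knn_votes(houses, newcomer, 4))     # 2 Red and 2 Blue: a tie
```

With k=4 the winner depends only on how ties happen to be broken, which is why an odd k is the safer choice for two classes.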
In kNN we consider k neighbours, but we give equal importance to all of them. Is that fair? Take the case of k=4, which we called a tie. The 2 Red families are closer to him than the 2 Blue families, so he has a stronger claim to Red. How do we express that mathematically? We give each family a weight that depends on its distance to the newcomer: those near him get higher weights, while those far away get lower weights. Then we add up the weights of each family separately, and the newcomer goes to the family with the highest total weight. This is called modified kNN.
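A distance-weighted vote can be sketched as follows. The text does not say which weighting function to use, so this example assumes a weight of 1/distance, which is one common choice and not necessarily what any particular library does. The `houses` data and function names are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical town, newcomer at the origin. With k=4 the plain vote is
# a 2-2 tie, but the two Red houses are closer than the two Blue ones.
houses = [
    ((1.0, 0.0), "Red"),
    ((0.0, 2.0), "Red"),
    ((3.0, 0.0), "Blue"),
    ((0.0, 4.0), "Blue"),
    ((5.0, 0.0), "Blue"),
]

def weighted_totals(houses, newcomer, k):
    """Sum assumed 1/distance weights per label over the k nearest houses."""
    def distance(house):
        (x, y), _ = house
        return math.hypot(x - newcomer[0], y - newcomer[1])
    totals = defaultdict(float)
    for house in sorted(houses, key=distance)[:k]:
        # The small constant avoids division by zero for a coincident house.
        totals[house[1]] += 1.0 / (distance(house) + 1e-9)
    return dict(totals)

def weighted_knn_classify(houses, newcomer, k):
    """Return the label with the highest total weight."""
    totals = weighted_totals(houses, newcomer, k)
    return max(totals, key=totals.get)

totals = weighted_totals(houses, (0.0, 0.0), 4)
print(totals)
print(weighted_knn_classify(houses, (0.0, 0.0), 4))  # Red
```

Here Red collects roughly 1/1 + 1/2 = 1.5 while Blue collects roughly 1/3 + 1/4 ≈ 0.58, so the tie from the unweighted vote is resolved in favour of Red.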
So what are some important things you see here?
• You need information about all the houses in town, because we have to check the distance from the newcomer to every existing house to find the nearest neighbours. If there are many houses and families, this takes a lot of memory and more computation time.
• There is almost zero time spent on any kind of training or preparation.
Now let's see it in OpenCV.
We will do a simple example here, with two families (classes), just like above. In the next chapter, we will do a better example.
Here we label the Red family as Class-0 (denoted by 0) and the Blue family as Class-1 (denoted by 1). We create 25 families, i.e. 25 training data points, and label each of them either Class-0 or Class-1. We do all this with the help of the random number generator in Numpy.
Then we plot the data with Matplotlib. Red families are shown as red triangles and Blue families as blue squares.
    # -*- coding: utf-8 -*-
    """
    Created on Tue Jan 28 18:00:18 2014
    @author: duan
    """
    import cv2
    import numpy as np
    import matplotlib.pyplot as plt

    # Feature set containing (x, y) values of 25 known/training data points
    trainData = np.random.randint(0, 100, (25, 2)).astype(np.float32)

    # Label each one either Red or Blue with the numbers 0 and 1
    responses = np.random.randint(0, 2, (25, 1)).astype(np.float32)

    # Take the Red families and plot them as red triangles
    red = trainData[responses.ravel() == 0]
    plt.scatter(red[:, 0], red[:, 1], s=80, c='r', marker='^')

    # Take the Blue families and plot them as blue squares
    blue = trainData[responses.ravel() == 1]
    plt.scatter(blue[:, 0], blue[:, 1], s=80, c='b', marker='s')

    plt.show()
You will get something similar to our first image. Since you are using a random number generator, you will get different data each time you run the code.
Next, initialise the kNN algorithm and pass trainData and responses to train it (it constructs a search tree).
Then we bring in one newcomer and classify him into a family with the help of kNN in OpenCV. Before using kNN, we need to know something about our test data (the data of the newcomers). It should be a floating-point array of size number_of_test_data × number_of_features. Then we find the nearest neighbours of the newcomer, and we can specify how many neighbours we want. The call returns:
1. The label given to the newcomer according to the kNN theory we saw earlier (0 or 1). If you want the Nearest Neighbour algorithm, just specify k=1, where k is the number of neighbours.
2. The labels of the k nearest neighbours.
3. The corresponding distances from the newcomer to each nearest neighbour.
So let's see how it works. The newcomer is marked in green.
    newcomer = np.random.randint(0, 100, (1, 2)).astype(np.float32)
    plt.scatter(newcomer[:, 0], newcomer[:, 1], s=80, c='g', marker='o')

    # OpenCV 3 API: cv2.ml.KNearest_create() replaces the OpenCV 2 cv2.KNearest()
    knn = cv2.ml.KNearest_create()
    knn.train(trainData, cv2.ml.ROW_SAMPLE, responses)
    ret, results, neighbours, dist = knn.findNearest(newcomer, 3)

    print("result: ", results, "\n")
    print("neighbours: ", neighbours, "\n")
    print("distance: ", dist)

    plt.show()
I got the following result:

    result: [[ 1.]]
    neighbours: [[ 1. 1. 1.]]
    distance: [[ 53. 58. 61.]]

It says our newcomer has 3 neighbours, all from the Blue family. Therefore, he is labelled as Blue family, which is also clear from the plot.
If you have a large number of test samples, you can pass them together as one array. The corresponding results are also returned as arrays.
    # 10 newcomers
    newcomers = np.random.randint(0, 100, (10, 2)).astype(np.float32)
    # knn is the trained model from above; findNearest is the OpenCV 3 method name
    ret, results, neighbours, dist = knn.findNearest(newcomers, 3)
    # results now contains 10 labels, one per newcomer.
Additional Resources
1. NPTEL notes on Pattern Recognition, Chapter 11