http://docs.opencv.org/master/d4/db1/tutorial_py_svm_basics.html
Goal

In this chapter, we will get an intuitive understanding of SVM.
Theory

Linearly Separable Data

Consider the image below, which has two types of data, red and blue. In kNN, for a test sample we measured its distance to all the training samples and took the one with the minimum distance. Measuring all those distances takes plenty of time, and storing all the training samples takes plenty of memory. But given the data shown in the image, do we really need that much?
image
Consider another idea. We find a line, $f(x) = a x_1 + b x_2 + c$, which divides the data into two regions. When we get a new test sample X, we just substitute it into $f(x)$. If $f(X) > 0$, it belongs to the blue group, else it belongs to the red group. We can call this line the Decision Boundary. It is very simple and memory-efficient. Data which can be divided into two groups with a straight line (or a hyperplane in higher dimensions) is called linearly separable.
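As a minimal sketch (the coefficients a, b, c below are made-up values, not taken from the tutorial), classifying a point by the sign of the decision function might look like this:

```python
# Hypothetical coefficients of the decision boundary f(x) = a*x1 + b*x2 + c.
a, b, c = 1.0, -2.0, 0.5

def classify(x1, x2):
    """Return 'blue' if f(x) > 0, otherwise 'red'."""
    f = a * x1 + b * x2 + c
    return 'blue' if f > 0 else 'red'

print(classify(3.0, 1.0))   # f = 1.5  -> blue
print(classify(0.0, 1.0))   # f = -1.5 -> red
```

Only the coefficients and the test sample need to be kept around, which is why this is so memory-efficient compared to kNN.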
So in the above image, you can see that plenty of such lines are possible. Which one should we take? Very intuitively, we can say that the line should pass as far as possible from all the points. Why? Because there can be noise in the incoming data, and that noise should not affect the classification accuracy. So taking the farthest line provides more immunity against noise. What SVM does, then, is find the straight line (or hyperplane) with the largest minimum distance to the training samples. See the bold line in the image below passing through the center.
image
So to find this Decision Boundary, you need training data. Do you need all of it? No. The samples that are close to the opposite group are sufficient. In our image, they are the one blue-filled circle and the two red-filled squares. We can call them Support Vectors, and the lines passing through them are called Support Planes. They are adequate for finding our decision boundary; we need not worry about all the data. This helps with data reduction.
What happens is, first two hyperplanes are found which best represent the data. For example, blue data is represented by $w^T x + b_0 > 1$ while red data is represented by $w^T x + b_0 < -1$, where $w$ is the weight vector ($w = [w_1, w_2, \ldots, w_n]$) and $x$ is the feature vector ($x = [x_1, x_2, \ldots, x_n]$). $b_0$ is the bias. The weight vector decides the orientation of the decision boundary, while the bias decides its location. The decision boundary is defined to be midway between these hyperplanes, expressed as $w^T x + b_0 = 0$. The minimum distance from a support vector to the decision boundary is given by $d_{\text{support vectors}} = \frac{1}{||w||}$. The margin is twice this distance, and we need to maximize it. That is, we need to minimize a new function $L(w, b_0)$ with some constraints, expressed below:
$$\min_{w, b_0} L(w, b_0) = \frac{1}{2}||w||^2 \quad \text{subject to} \quad t_i (w^T x_i + b_0) \geq 1 \;\; \forall i$$

where $t_i$ is the label of each class, $t_i \in \{-1, 1\}$.
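As a small numeric sketch (the weight vector, bias, and samples below are made-up values, not the result of any real training), the margin for a given $w$ and $b_0$ is simply $2/||w||$, and the constraint can be checked sample by sample:

```python
import numpy as np

# Hypothetical separating hyperplane: w^T x + b0 = 0.
w = np.array([2.0, 1.0])
b0 = -1.0

# Hypothetical labelled samples, t_i in {-1, +1}.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, 0.0], [0.0, -2.0]])
t = np.array([1, 1, -1, -1])

# The margin is twice the distance from a support plane to the decision boundary.
margin = 2.0 / np.linalg.norm(w)
print("margin:", margin)

# Check the constraints t_i * (w^T x_i + b0) >= 1 for every sample.
print(t * (X @ w + b0) >= 1)
```

The optimization above searches over all $(w, b_0)$ satisfying these constraints for the one with the smallest $||w||$, i.e. the largest margin.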
Non-Linearly Separable Data

Consider some data which cannot be divided into two groups with a straight line. For example, consider one-dimensional data where 'X' is at -3 and +3 and 'O' is at -1 and +1. Clearly it is not linearly separable. But there are methods to solve these kinds of problems. If we map this data set with the function $f(x) = x^2$, we get 'X' at 9 and 'O' at 1, which are linearly separable (see the sketch after the next paragraph).
Alternatively, we can convert this one-dimensional data to two-dimensional data. We can use the function $f(x) = (x, x^2)$ to map it. Then 'X' becomes (-3,9) and (3,9) while 'O' becomes (-1,1) and (1,1). This is also linearly separable. In short, data that is not linearly separable in a lower-dimensional space has a better chance of becoming linearly separable in a higher-dimensional space.
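A minimal sketch of these two mappings on the example points (nothing here beyond the values given above):

```python
import numpy as np

x_class = np.array([-3.0, 3.0])   # 'X' samples
o_class = np.array([-1.0, 1.0])   # 'O' samples

# 1-D mapping f(x) = x^2: 'X' -> 9, 'O' -> 1, separable by a simple threshold.
print(x_class ** 2, o_class ** 2)

# 2-D mapping f(x) = (x, x^2): still linearly separable, e.g. by the line y = 5.
print(np.column_stack([x_class, x_class ** 2]))
print(np.column_stack([o_class, o_class ** 2]))
```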
In general, it is possible to map points in a d-dimensional space to some D-dimensional space (D > d) to check the possibility of linear separability. There is an idea (the kernel trick) which helps to compute the dot product in the high-dimensional (kernel) space by performing computations in the low-dimensional input (feature) space. We can illustrate it with the following example.
Consider two points in two-dimensional space, $p = (p_1, p_2)$ and $q = (q_1, q_2)$. Let $\phi$ be a mapping function which maps a two-dimensional point to three-dimensional space as follows:
$$\phi(p) = (p_1^2, p_2^2, \sqrt{2}\, p_1 p_2), \qquad \phi(q) = (q_1^2, q_2^2, \sqrt{2}\, q_1 q_2)$$
Let us define a kernel function $K(p, q)$ which does a dot product between two points, as shown below:
$$K(p, q) = \phi(p) \cdot \phi(q)$$

$$\begin{aligned}
\phi(p) \cdot \phi(q) &= \phi(p)^T \phi(q) \\
&= (p_1^2, p_2^2, \sqrt{2}\, p_1 p_2) \cdot (q_1^2, q_2^2, \sqrt{2}\, q_1 q_2) \\
&= p_1^2 q_1^2 + p_2^2 q_2^2 + 2 p_1 q_1 p_2 q_2 \\
&= (p_1 q_1 + p_2 q_2)^2 \\
&= (p \cdot q)^2
\end{aligned}$$
This means that a dot product in the three-dimensional space can be obtained using the squared dot product in the two-dimensional space. This can be applied to higher-dimensional spaces as well, so we can compute higher-dimensional features from the lower-dimensional data itself. Once we map the data, we effectively work in a higher-dimensional space.
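A quick numeric check of this identity (the two points below are arbitrary made-up values):

```python
import numpy as np

def phi(v):
    """Map a 2-D point to 3-D: (v1^2, v2^2, sqrt(2)*v1*v2)."""
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

p = np.array([1.0, 2.0])
q = np.array([3.0, -1.0])

lhs = phi(p) @ phi(q)       # dot product in the 3-D kernel space
rhs = (p @ q) ** 2          # squared dot product in the 2-D input space
print(lhs, rhs)             # both evaluate to 1.0
```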
In addition to all these concepts, there is the problem of misclassification. Just finding the decision boundary with maximum margin is not sufficient; we also need to consider misclassification errors. Sometimes it may be possible to find a decision boundary with less margin but with fewer misclassifications. In any case, we need to modify our model so that it finds the decision boundary with maximum margin but with less misclassification. The minimization criterion is modified as:
$$\min ||w||^2 + C \, (\text{distance of misclassified samples to their correct regions})$$
The image below shows this concept. For each sample of the training data a new parameter $\xi_i$ is defined. It is the distance from the corresponding training sample to its correct decision region. Samples that are not misclassified fall on their corresponding support planes, so their distance is zero.
image
So the new optimization problem is:
$$\min_{w, b_0} L(w, b_0) = ||w||^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i (w^T x_i + b_0) \geq 1 - \xi_i \;\text{ and }\; \xi_i \geq 0 \;\; \forall i$$
How should the parameter C be chosen? It is obvious that the answer to this question depends on how the training data is distributed. Although there is no general answer, it is useful to take into account these rules:
• Large values of C give solutions with fewer misclassification errors but a smaller margin. In this case misclassification errors are expensive: since the aim of the optimization is to minimize the objective, few misclassification errors are allowed. (A larger C is a reasonable choice when the data has little noise.)
• Small values of C give solutions with a bigger margin but more classification errors. In this case the minimization pays less attention to the sum term, so it focuses more on finding a hyperplane with a big margin. (A smaller C is worth considering when the data is noisier.)
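As a hedged end-to-end sketch (the toy data and the choice C = 1.0 below are made up; practical usage with real data is covered in the next chapter), training a linear SVM with OpenCV and experimenting with C looks roughly like this:

```python
import cv2
import numpy as np

# Toy 2-D training data: two clusters, labelled -1 and +1 (made-up values).
trainData = np.array([[1, 1], [2, 1], [1, 2],
                      [7, 7], [8, 6], [7, 8]], dtype=np.float32)
labels = np.array([-1, -1, -1, 1, 1, 1], dtype=np.int32)

svm = cv2.ml.SVM_create()
svm.setType(cv2.ml.SVM_C_SVC)       # C-Support Vector Classification
svm.setKernel(cv2.ml.SVM_LINEAR)    # linear kernel: the decision boundary is a hyperplane
svm.setC(1.0)                       # penalty for misclassified samples (assumed value)
svm.setTermCriteria((cv2.TERM_CRITERIA_MAX_ITER, 100, 1e-6))

svm.train(trainData, cv2.ml.ROW_SAMPLE, labels)

# Classify two new points: the sign of the decision function picks the group.
tests = np.array([[2, 2], [8, 8]], dtype=np.float32)
_, results = svm.predict(tests)
print(results.ravel())              # expected: [-1.  1.]
```

Rerunning the sketch with a much smaller or larger value passed to setC is a simple way to observe the margin/misclassification trade-off described above.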