Support Vector Machines (SVM)

Date: 2019-12-05

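Introduction

SVM is short for Support Vector Machine, a binary classification model.

An SVM looks for a boundary that separates two classes of data. This boundary is usually called a hyperplane.

The separating principle is margin maximization, which ultimately reduces to solving a convex quadratic programming problem.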
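For reference, the convex quadratic program mentioned above is commonly written as follows for linearly separable data. The symbols are not from the original text and are introduced here only for illustration: w is the normal vector of the hyperplane, b is its offset, and (x_i, y_i) are the n training samples with labels y_i in {-1, +1}.

```latex
\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,\bigl(w^{\top} x_i + b\bigr) \ge 1, \qquad i = 1, \dots, n
```

Under this scaling the distance between the two margin boundaries is 2 / ||w||, so minimizing ||w|| is the same as maximizing the margin.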

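Basic idea

(1) Linear separation

If a linear function can separate the samples (say, two classes of points), the samples are called linearly separable.

In two dimensions the boundary defined by a linear function is a straight line, and in three dimensions it is a plane. Regardless of the number of dimensions, such a boundary is called a hyperplane.

Figure: the solid points and the hollow points are two classes.

Infinitely many straight lines can separate these two classes.

The goal of an SVM is to pick, among all these lines, the one whose distance to the nearest points of both classes is largest.

This distance is called the margin.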
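The margin can be read off a fitted linear SVM. The sketch below is only an illustration: the four points are chosen for the example, and the very large C is an assumption used to approximate a hard margin. It relies on the `coef_` attribute that scikit-learn's `SVC` exposes for the linear kernel.

```python
import numpy as np
from sklearn.svm import SVC

# Four linearly separable points, two per class.
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])

# A very large C approximates a hard margin (assumption made for this sketch).
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]                          # normal vector of the separating line
margin_width = 2.0 / np.linalg.norm(w)    # distance between the two margin boundaries

print("support vectors:", clf.support_vectors_)
print("margin width:", margin_width)
```

The nearest points of the two classes are (-1, -1) and (1, 1), so the printed width should be close to the distance between them.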

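Sometimes, however, the data does not allow a separating line to be drawn, for example:

Clearly no separating line exists here. In this case the offending point can be treated as an outlier,

that is, ignored, and the remaining points are then separated.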

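(2) Nonlinear separation

Some data cannot be separated by a straight line at all, like this:

In linear separation we give the SVM the sample coordinates (X, Y) and it computes with X and Y.

For the case in the figure, we can introduce a third variable Z, with Z = X^2 + Y^2.

Then plot Z against X:

After this transformation the data can be separated using Z and X.

Another example:

The samples cannot be separated directly using X and Y. Transform the coordinates to |X|, Y and plot again:

Now they can be separated.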
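A minimal sketch of the Z = X^2 + Y^2 transformation, assuming two concentric rings as a stand-in for the data in the figure. The dataset (scikit-learn's `make_circles`) and all variable names are chosen for illustration and are not from the original article.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not separable by a straight line in (X, Y).
points, labels = make_circles(n_samples=200, factor=0.3, random_state=0)
x, y = points[:, 0], points[:, 1]

# Linear SVM on the raw coordinates.
raw_acc = SVC(kernel="linear").fit(points, labels).score(points, labels)

# Add the derived feature Z = X^2 + Y^2 and separate in the (X, Z) plane.
z = x ** 2 + y ** 2
xz = np.column_stack([x, z])
xz_acc = SVC(kernel="linear").fit(xz, labels).score(xz, labels)

print("training accuracy on (X, Y):", raw_acc)
print("training accuracy on (X, Z):", xz_acc)
```

In the (X, Z) plane the inner ring sits at a small Z and the outer ring at a large Z, so a horizontal line separates them.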

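(3) The kernel trick

A kernel function takes a low-dimensional input space (or feature space) and maps it into a higher-dimensional space, so that data which was not linearly separable becomes separable.

After the kernel trick has moved the input from x, y into a larger space, the SVM classifies the data points there, and the solution is then mapped back to the original space.

This yields a nonlinear separation.

There are many kernel functions. Commonly used ones are:

linear, poly (polynomial), rbf (radial basis function), sigmoid

In the SVM algorithm, the kernel function acts as a parameter.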
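To make the idea concrete, here is a small sketch for one specific kernel, the degree-2 polynomial kernel K(a, b) = (a . b)^2. The function names `phi` and `poly2_kernel` are made up for this example. It shows that the kernel value computed in the original 2-D space equals an inner product in a 3-D space, without ever building the 3-D vectors.

```python
import numpy as np

def phi(v):
    """Explicit map from 2-D to 3-D: (x, y) -> (x^2, sqrt(2)*x*y, y^2)."""
    x, y = v
    return np.array([x ** 2, np.sqrt(2) * x * y, y ** 2])

def poly2_kernel(a, b):
    """Degree-2 polynomial kernel, computed in the original 2-D space."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

explicit = np.dot(phi(a), phi(b))   # inner product in the 3-D space
implicit = poly2_kernel(a, b)       # same value without building phi

print(explicit, implicit)
```

Both numbers agree (up to floating-point rounding), which is why an SVM can work in the larger space while only ever evaluating the kernel on the original inputs.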
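In scikit-learn the kernel is passed through the `kernel` argument of `SVC`. The following sketch tries the four kernels listed above on a toy dataset. The dataset (`make_moons`) and the train/test split are assumptions made for the example, so the scores say nothing about the article's own data.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy two-class data with a curved boundary (assumption: not the article's data).
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

scores = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, gamma="auto")
    clf.fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)   # accuracy on held-out points

for kernel, acc in scores.items():
    print(kernel, round(acc, 3))
```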

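(4) Other parameters (C, gamma)

The kernel is not the only parameter that affects the separation. There are others, such as C and gamma. In the examples here C has the most visible effect. Gamma has no effect with the linear kernel, although it does shape the boundary for the rbf, poly, and sigmoid kernels.

Role of C: it trades off a smooth decision boundary against classifying as many training points correctly as possible (the larger C is, the more training points are classified correctly), at the cost of a more complex model.

Figure panels: C=10000, kernel='rbf' and C=10, kernel='rbf'

The figure shows the difference when the kernel is the same and C differs by three orders of magnitude.

Overfitting can be avoided by tuning the kernel, C, and gamma.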
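One common way to tune these three parameters together is a cross-validated grid search. The sketch below uses scikit-learn's `GridSearchCV`. The toy dataset and the candidate values in `param_grid` are assumptions picked for illustration, not recommendations from the original article.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Noisy toy data (assumption: stands in for a real training set).
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

# Candidate values to try; gamma is ignored by the linear kernel.
param_grid = {
    "kernel": ["linear", "rbf"],
    "C": [1, 10, 100],
    "gamma": [0.1, 1, 10],
}

# 5-fold cross-validation scores each combination on held-out folds,
# which penalizes settings that only memorize the training points.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)
```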

Code implementation

Environment: macOS Mojave 10.14.3

Python 3.7.0

Library: scikit-learn 0.19.2

Official documentation for sklearn.svm.SVC: https://scikit-learn.org/dev/modules/generated/sklearn.svm.SVC.html

>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])  # four training points
>>> y = np.array([1, 1, 2, 2])  # the first two points get label 1, the last two get label 2
>>> from sklearn.svm import SVC
>>> clf = SVC(gamma='auto')  # no kernel given, so the default rbf kernel is used
>>> clf.fit(X, y)  # train
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
>>> # the repr above shows the rbf kernel, C=1.0, and gamma='auto'
>>> print(clf.predict([[-0.8, -1]]))  # predict a new point
[1]
>>> # the result is label 1

Example with linear data

Main.py: main program

from class_vis import prettyPicture
from prep_terrain_data import makeTerrainData


features_train, labels_train, features_test, labels_test = makeTerrainData()


########################## SVM #################################
### we handle the import statement and SVC creation for you here
from sklearn.svm import SVC
clf = SVC(kernel="linear",C = 1000)


#### now your job is to fit the classifier
#### using the training features/labels, and to
#### make a set of predictions on the test data
clf.fit(features_train, labels_train)


#### store the predictions for the test set in pred
pred = clf.predict(features_test)

prettyPicture(clf, features_test, labels_test)

from sklearn.metrics import accuracy_score
acc = accuracy_score(labels_test, pred)  # argument order is (y_true, y_pred)
print(acc)

def submitAccuracy():
    return acc

prep_terrain_data.py: generates the training points

#!/usr/bin/python
import random


def makeTerrainData(n_points=1000):
###############################################################################
### make the toy dataset
    random.seed(42)
    grade = [random.random() for ii in range(0,n_points)]
    bumpy = [random.random() for ii in range(0,n_points)]
    error = [random.random() for ii in range(0,n_points)]
    y = [round(grade[ii]*bumpy[ii]+0.3+0.1*error[ii]) for ii in range(0,n_points)]
    for ii in range(0, len(y)):
        if grade[ii]>0.8 or bumpy[ii]>0.8:
            y[ii] = 1.0

### split into train/test sets
    X = [[gg, ss] for gg, ss in zip(grade, bumpy)]
    split = int(0.75*n_points)
    X_train = X[0:split]
    X_test  = X[split:]
    y_train = y[0:split]
    y_test  = y[split:]

    grade_sig = [X_train[ii][0] for ii in range(0, len(X_train)) if y_train[ii]==0]
    bumpy_sig = [X_train[ii][1] for ii in range(0, len(X_train)) if y_train[ii]==0]
    grade_bkg = [X_train[ii][0] for ii in range(0, len(X_train)) if y_train[ii]==1]
    bumpy_bkg = [X_train[ii][1] for ii in range(0, len(X_train)) if y_train[ii]==1]

#    training_data = {"fast":{"grade":grade_sig, "bumpiness":bumpy_sig}
#            , "slow":{"grade":grade_bkg, "bumpiness":bumpy_bkg}}


    grade_sig = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii]==0]
    bumpy_sig = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii]==0]
    grade_bkg = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii]==1]
    bumpy_bkg = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii]==1]

    test_data = {"fast":{"grade":grade_sig, "bumpiness":bumpy_sig}
            , "slow":{"grade":grade_bkg, "bumpiness":bumpy_bkg}}

    return X_train, y_train, X_test, y_test
#    return training_data, test_data

class_vis.py: draws and saves the plot

import warnings
warnings.filterwarnings("ignore")

import matplotlib 
matplotlib.use('agg')

import matplotlib.pyplot as plt
import pylab as pl
import numpy as np

#import numpy as np
#import matplotlib.pyplot as plt
#plt.ioff()

def prettyPicture(clf, X_test, y_test):
    x_min = 0.0; x_max = 1.0
    y_min = 0.0; y_max = 1.0

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    h = .01  # step size in the mesh
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())

    plt.pcolormesh(xx, yy, Z, cmap=pl.cm.seismic)

    # Plot also the test points
    grade_sig = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii]==0]
    bumpy_sig = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii]==0]
    grade_bkg = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii]==1]
    bumpy_bkg = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii]==1]

    plt.scatter(grade_sig, bumpy_sig, color = "b", label="fast")
    plt.scatter(grade_bkg, bumpy_bkg, color = "r", label="slow")
    plt.legend()
    plt.xlabel("grade")      # x axis holds the first feature (grade)
    plt.ylabel("bumpiness")  # y axis holds the second feature (bumpiness)

    plt.savefig("test.png")

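Result: 92% accuracy

Example with a nonlinear kernel

With the same data points, use the rbf kernel and set C to 1000, 10000, and 100000 in turn (the last setting is shown):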

clf = SVC(kernel="rbf",C = 100000)

The accuracies are, respectively: 92.4%, 93.2%, 94.4%

Training time also gets slightly longer.
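A rough way to observe the training time is to wrap `fit` in a timer. The sketch below is only an illustration: it uses a small synthetic dataset as a stand-in for the article's data, and the absolute timings depend on the machine.

```python
import time
import numpy as np
from sklearn.svm import SVC

# Synthetic stand-in for the article's data (assumption): two features in [0, 1),
# labelled by a curved boundary with a little noise.
rng = np.random.RandomState(42)
X = rng.rand(300, 2)
y = (X[:, 0] * X[:, 1] + 0.1 * rng.rand(300) > 0.3).astype(int)

fit_times = {}
for C in [1000, 10000, 100000]:
    clf = SVC(kernel="rbf", C=C, gamma="auto")
    start = time.perf_counter()
    clf.fit(X, y)                                  # only the fit is timed
    fit_times[C] = time.perf_counter() - start

for C, seconds in fit_times.items():
    print("C=%d: fit took %.3f s" % (C, seconds))
```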

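When the training set is very large, training can take a very long time. In that case we can speed it up by shrinking the training set.
Adding these two lines before training the classifier reduces the training set to 1% of its original size: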

# integer division (//) is required: slice indices must be ints in Python 3
features_train = features_train[:len(features_train)//100]
labels_train = labels_train[:len(labels_train)//100]

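Strengths and weaknesses of SVM

SVMs perform very well in complex domains where there is a clear separating boundary. On very large datasets, however, they do not perform well, because training time grows roughly cubically with the size of the dataset (slow). They also do poorly when the data is very noisy (which may lead to overfitting).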
