Machine Learning: Support Vector Machine Principles and scikit-learn Practice

Date: 2019-12-04

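1. Scenario description

Problem: how do we classify the linearly separable dataset and the linearly non-separable dataset shown in the figure below?

Approach:

(1) For the linearly separable dataset, find the optimal separating hyperplane.
(2) Transform the linearly non-separable dataset into a linearly separable one by some method.

The rest of this article summarizes support vector machines with these two questions in mind.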

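2. How to find the optimal separating hyperplane

In general, when the training dataset is linearly separable, infinitely many separating hyperplanes classify the two classes correctly. The perceptron, for example, may return any one of infinitely many separating hyperplanes. To obtain a unique optimal separating hyperplane, we use the margin-maximization framework of the support vector machine.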

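2.1 Classification confidence

In the figure above, the three points A, B, and C represent three instances, all on the positive side of the separating hyperplane. Point A is far from the hyperplane, so if we predict it as positive we can be fairly confident that the prediction is correct. Point C is close to the hyperplane, so predicting it as positive is less certain. Point B lies between A and C, and the confidence of predicting it as positive also lies between those of A and C.

As this description suggests, the farther the training points are from the separating hyperplane, the higher the confidence. With the hyperplane \(w^T x + b = 0\) fixed, the distance from a data point to the separating hyperplane can be measured by the functional margin and the geometric margin.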

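2.2 Functional margin

For a given training dataset T and hyperplane (w, b), the functional margin of the hyperplane (w, b) with respect to a sample \((x_i,y_i)\) is defined as: \[\overline{\gamma}_i = y_i(w\bullet{x_i} + b)\]

The functional margin of the hyperplane (w, b) with respect to the training dataset T is the minimum of the functional margins over all sample points \((x_i,y_i)\) in T: \[\overline{\gamma} = \min\limits_{i=1,...,N}\overline{\gamma}_i\]

The functional margin expresses both the correctness and the confidence of a classification. On its own, however, it is not enough for choosing a separating hyperplane: scaling w and b by the same factor leaves the hyperplane unchanged but scales the functional margin by that factor.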

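2.3 Geometric margin

For a given training dataset T and hyperplane (w, b), the geometric margin of the hyperplane (w, b) with respect to a sample point \((x_i,y_i)\) is defined as: \[\gamma_i = \frac{y_i(w\bullet{x_i} + b)}{||w||}\]

The geometric margin of the hyperplane (w, b) with respect to the training dataset T is the minimum of the geometric margins over all sample points \((x_i,y_i)\) in T: \[\gamma = \min\limits_{i=1,...,N}\gamma_i\]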

2.4 Relationship between the functional and geometric margins

From the definitions above, the functional margin and the geometric margin are related by: \[\gamma_i = \frac{\overline{\gamma}_i}{||w||}\]

\[\gamma = \frac{\overline{\gamma}}{||w||}\]
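The two margins can be compared numerically. The sketch below is only an illustration: the toy points, the hyperplane (w, b), and the helper names `functional_margins` and `geometric_margins` are assumptions made for this example, not part of the original article. It shows that rescaling (w, b) changes the functional margin but not the geometric margin.

```python
import numpy as np

# Hypothetical toy data: two points per class in 2-D, labels in {-1, +1}.
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0], [0.0, 2.0]])
y = np.array([1, 1, -1, -1])

# A hypothetical separating hyperplane w . x + b = 0.
w = np.array([1.0, 1.0])
b = -4.0


def functional_margins(w, b, X, y):
    """Per-sample functional margin y_i (w . x_i + b)."""
    return y * (X @ w + b)


def geometric_margins(w, b, X, y):
    """Per-sample geometric margin: functional margin divided by ||w||."""
    return functional_margins(w, b, X, y) / np.linalg.norm(w)


# Margins with respect to the whole dataset are the minima over the samples.
func = functional_margins(w, b, X, y).min()
geom = geometric_margins(w, b, X, y).min()

# Rescaling (w, b) by 10 multiplies the functional margin by 10 ...
func_scaled = functional_margins(10 * w, 10 * b, X, y).min()
# ... but leaves the geometric margin unchanged.
geom_scaled = geometric_margins(10 * w, 10 * b, X, y).min()

print(func, func_scaled)
print(geom, geom_scaled)
```

This is why the optimization in the next section is stated in terms of the geometric margin.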

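2.5 Hard-margin maximization

The basic idea of support vector machine learning is to find the separating hyperplane that classifies the training dataset correctly and has the largest geometric margin. In other words, it not only separates the positive and negative instances but also separates the hardest instances (those closest to the hyperplane) with sufficient confidence. The term "hard margin" is used in contrast to the soft margin described later.

Finding the separating hyperplane with the maximum geometric margin can be written as the following constrained optimization problem: \[\max\limits_{w,b}\quad\gamma\]

\[s.t.\quad\frac{y_i(w\bullet{x_i}+b)}{||w||}\geq\gamma,\quad{i=1,2,...,N}\]

Using the relationship between the functional and geometric margins, this is equivalent to: \[\max\limits_{w,b}\quad\frac{\overline{\gamma}}{||w||}\]

\[s.t.\quad y_i(w\bullet{x_i}+b)\geq\overline{\gamma},\quad{i=1,2,...,N}\]

Rescaling w and b by a common factor rescales the functional margin \(\overline\gamma\) by the same factor without changing the hyperplane, so we can fix \(\overline\gamma= 1\). Since maximizing \(\frac{1}{||w||}\) is equivalent to minimizing \(\frac{1}{2}{||w||}^2\), we obtain: \[\min\limits_{w,b}\quad\frac{1}{2}{||w||^2}\]

\[s.t.\quad y_i(w\bullet{x_i}+b)\geq 1,\quad{i=1,2,...,N}\]

This yields the separating hyperplane: \[w^{*} \bullet x + b^{*} = 0\]

Classification decision function: \[f(x) = \mathrm{sign}(w^{*} \bullet x + b^{*})\]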
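As a rough numerical check of the hard-margin formulation, the sketch below fits a linear `SVC` on three hand-picked points. Two things are assumptions of this example rather than of the article: the toy data, and the use of a very large `C` to approximate a hard margin (scikit-learn does not expose a dedicated hard-margin solver).

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two positive points and one negative point.
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])

# A very large C leaves almost no room for slack, which approximates
# the hard-margin problem on separable data.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]        # normal vector w* of the separating hyperplane
b = clf.intercept_[0]   # offset b*

# Every constraint y_i (w . x_i + b) >= 1 should hold (up to solver tolerance).
margins = y * (X @ w + b)
margin_width = 2.0 / np.linalg.norm(w)

print("w* =", w, "b* =", b)
print("y_i (w . x_i + b) =", margins)
print("margin width 2/||w|| =", margin_width)
```

For these three points the minimizer can also be worked out by hand as w* = (1/2, 1/2), b* = -2, which the fitted values should reproduce closely.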

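To solve the problem through its Lagrangian dual, introduce multipliers \(a_i \geq 0\) and form the Lagrangian: \[L(w,b,a) = \frac{1}{2}{||w||}^2 - \sum_{i=1}^N a_i[y_i(w\bullet x_i+b)-1] \quad (1)\]
Setting the partial derivative with respect to w to zero: \[\frac{\partial L}{\partial w} = w - \sum_{i=1}^N a_iy_ix_i = 0 \quad (2)\]
Setting the partial derivative with respect to b to zero: \[\frac{\partial L}{\partial b} = -\sum_{i=1}^N a_iy_i = 0 \quad (3)\]
Substituting (2) and (3) into (1) gives the dual problem: \[\max\limits_{a}\quad L(a) = -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N a_ia_jy_iy_j(x_i\bullet x_j) + \sum_{i=1}^N a_i\]

\[s.t. \quad \sum_{i=1}^N a_iy_i = 0\]

\[a_i \geq 0,\quad i=1,2,...,N\]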
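Conditions (2) and (3) can be checked on a fitted model. In scikit-learn's `SVC`, `dual_coef_` stores the products a_i y_i for the support vectors only. The toy data and the large `C` (used to approximate a hard margin) are assumptions of this sketch.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data, linearly separable.
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_[0] holds a_i * y_i for each support vector.
alpha_y = clf.dual_coef_[0]
alpha = np.abs(alpha_y)

# Condition (2): w = sum_i a_i y_i x_i, summed over the support vectors.
w_from_dual = alpha_y @ clf.support_vectors_

# Condition (3): sum_i a_i y_i = 0.
equality_residual = alpha_y.sum()

# At the optimum the dual objective equals the primal objective 1/2 ||w||^2.
dual_value = alpha.sum() - 0.5 * (w_from_dual @ w_from_dual)
primal_value = 0.5 * (clf.coef_[0] @ clf.coef_[0])

print("w from dual:", w_from_dual, "coef_:", clf.coef_[0])
print("sum a_i y_i:", equality_residual)
print("dual value:", dual_value, "primal value:", primal_value)
```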

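2.6 Soft-margin maximization

A linearly separable dataset can be split directly with the hard-margin maximum hyperplane. When the data is not linearly separable, some sample points cannot satisfy the constraint that the functional margin be at least 1. To handle this, introduce a slack variable \(\xi_i \geq 0\) for each sample point \((x_i,y_i)\) so that the functional margin plus the slack variable is at least 1. The constraint becomes: \[y_i(w\bullet x_i + b) \geq 1- \xi_{i}\]

At the same time, a cost \(\xi_{i}\) is paid for each slack variable, so the objective function changes from \(\frac{1}{2}{||w||}^2\) to \(\frac{1}{2}{||w||}^2 + C\sum_{i=1}^N{\xi_i}\)

C > 0 is the penalty coefficient, usually determined by the application. A larger C penalizes misclassification more heavily, and a smaller C penalizes it less.

Learning a linear support vector machine on linearly non-separable data becomes the following convex quadratic programming problem: \[\min\limits_{w,b,\xi}\quad\frac{1}{2}{||w||^2}+ C\sum_{i=1}^N{\xi_i}\]

\[s.t.\quad y_i(w\bullet{x_i}+b)\geq 1 - \xi_{i},\quad{i=1,2,...,N}\]

\[\xi_{i} \geq 0,\quad i = 1,2,...,N\]

This yields the separating hyperplane: \[w^{*} \bullet x + b^{*} = 0\]

Classification decision function: \[f(x) = \mathrm{sign}(w^{*} \bullet x + b^{*})\]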
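The slack variables can be read off a fitted model as \(\xi_i = \max(0, 1 - y_i(w\bullet x_i + b))\). The sketch below uses a hypothetical toy dataset in which one positive point sits inside the negative cluster, so the data is not linearly separable. The data and the choice C = 1 are assumptions made for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: the positive point (0.5, 0.5) lies inside the
# negative cluster, so no hyperplane separates the two classes.
X = np.array([[2.0, 2.0], [3.0, 3.0], [3.0, 2.0], [0.5, 0.5],
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# Smallest slack that satisfies y_i (w . x_i + b) >= 1 - xi_i with xi_i >= 0.
xi = np.maximum(0.0, 1.0 - y * (X @ w + b))

# Soft-margin objective 1/2 ||w||^2 + C * sum(xi) at the fitted solution.
objective = 0.5 * (w @ w) + C * xi.sum()

print("slack per point:", np.round(xi, 3))
print("objective:", round(float(objective), 3))
```

Points with \(\xi_i = 0\) satisfy the original hard-margin constraint; points with \(\xi_i > 1\) are misclassified.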

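Lagrangian dual problem, where \(a_i\) and \(\mu_i\) are the multipliers of the constraints \(y_i(w\bullet x_i+b)\geq 1-\xi_i\) and \(\xi_i \geq 0\):
\[\max\limits_{a}\quad L(a) = -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N a_ia_jy_iy_j(x_i\bullet x_j) + \sum_{i=1}^N a_i\]

\[s.t. \quad \sum_{i=1}^N a_iy_i = 0\]

\[a_i \geq 0\]

\[\mu_i \geq 0\]

\[C-a_i-\mu_i = 0\]

Eliminating \(\mu_i\) from the last three conditions gives the box constraint \(0 \leq a_i \leq C\).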
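The constraints of the soft-margin dual can be inspected on a fitted `SVC`, whose `dual_coef_` attribute stores a_i y_i for the support vectors. The toy data and C = 1 below are assumptions of this sketch. It checks that every multiplier lies in [0, C] and that points needing a clearly positive slack sit at the upper bound C.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical non-separable toy data (one positive point inside the negatives).
X = np.array([[2.0, 2.0], [3.0, 3.0], [3.0, 2.0], [0.5, 0.5],
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

# Multipliers for all samples: zero for non-support vectors,
# |dual_coef_| for the support vectors listed in support_.
alpha = np.zeros(len(X))
alpha[clf.support_] = np.abs(clf.dual_coef_[0])

# Slack of each point at the fitted solution.
w, b = clf.coef_[0], clf.intercept_[0]
xi = np.maximum(0.0, 1.0 - y * (X @ w + b))

print("alpha:", np.round(alpha, 3))
print("sum a_i y_i:", float((alpha * y).sum()))
print("alpha of points with slack > 0.1:", alpha[xi > 0.1])
```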

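2.7 Support vectors and margin boundaries

In the linearly separable case, the training points closest to the separating hyperplane are called support vectors. Support vectors are the points for which the constraint holds with equality, that is \[y_i(w\bullet{x_i}+b) - 1 = 0\] or, in the soft-margin case, \[y_i(w\bullet x_i + b) - (1- \xi_{i}) = 0\] For positive points with \(y_i = +1\), the support vectors lie on the hyperplane \[H_1:w^Tx + b = 1\] and for negative points with \(y_i = -1\), they lie on the hyperplane \[H_2:w^T x + b = -1\] \(H_1\) and \(H_2\) are parallel, and no instance falls between them. They bound a long strip, and the separating hyperplane is parallel to them and lies halfway between them. The distance between \(H_1\) and \(H_2\) is called the margin. It depends on the normal vector \(w\) of the separating hyperplane and equals \(\frac{2}{||w||}\). \(H_1\) and \(H_2\) are called the margin boundaries, as shown in the figure below:

Only the support vectors matter in determining the separating hyperplane; the other instances play no role. Moving a support vector changes the solution, but moving other instances outside the margin boundaries, or even removing them, leaves the solution unchanged. Because the support vectors play the decisive role in determining the separating hyperplane, this classification model is called a support vector machine. The number of support vectors is usually small, so a support vector machine is determined by a few "important" training samples.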
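The claim that only the support vectors determine the solution can be tested directly: fit once on all points, then refit on the support vectors alone and compare the hyperplanes. The toy data and the large `C` (approximating a hard margin) are assumptions of this sketch.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical separable toy data: four positive and three negative points.
X = np.array([[3.0, 3.0], [4.0, 3.0], [5.0, 5.0], [6.0, 4.0],
              [1.0, 1.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, 1, 1, -1, -1, -1])

# Fit on the full dataset.
full = SVC(kernel="linear", C=1e6).fit(X, y)
sv_idx = full.support_  # indices of the support vectors in X

# Refit using only the support vectors; all other points are dropped.
reduced = SVC(kernel="linear", C=1e6).fit(X[sv_idx], y[sv_idx])

print("support vector indices:", sv_idx)
print("full    w, b:", full.coef_[0], full.intercept_[0])
print("reduced w, b:", reduced.coef_[0], reduced.intercept_[0])
```

Here the closest pair of opposite-class points, (3, 3) and (1, 1), should be the only support vectors, and both fits should give the same hyperplane up to solver tolerance.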

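3. How to turn a linearly non-separable dataset into a linearly separable one

3.1 Why data is not linearly separable

(1) The dataset is inherently not linearly separable.

(2) The dataset contains noise, or the class labels were assigned incorrectly by hand, which makes it not linearly separable.

3.2 Common methods

Common ways to handle a linearly non-separable dataset:

For cause (2)

Modify the model by adding the penalty coefficient C. The modified model "tolerates" misclassified samples, and the penalty coefficient keeps those misclassifications as reasonable as possible.

For cause (1)

(1) Add similarity features computed with a similarity function.
(2) Use a kernel function (polynomial kernel, Gaussian RBF kernel) to map the original low-dimensional feature space into a higher-dimensional feature space in which the dataset becomes linearly separable.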
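The idea of mapping to a higher-dimensional space can be illustrated with an explicit feature map. In the sketch below the 1-D dataset and the map x -> (x, x^2) are assumptions chosen for illustration: the points with |x| >= 2 cannot be separated from those with |x| <= 1 by a single threshold, but they can be separated by a line once x^2 is added as a second feature.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 1-D data: class 1 on the outside, class 0 in the middle.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([1, 1, 0, 0, 0, 1, 1])

# Original 1-D features and the explicitly mapped 2-D features (x, x^2).
X_raw = x.reshape(-1, 1)
X_mapped = np.column_stack([x, x ** 2])

# A linear classifier on the raw feature cannot fit every point ...
acc_raw = SVC(kernel="linear", C=100.0).fit(X_raw, y).score(X_raw, y)
# ... but a linear classifier in the mapped space can.
acc_mapped = SVC(kernel="linear", C=100.0).fit(X_mapped, y).score(X_mapped, y)

print("training accuracy, raw features:   ", acc_raw)
print("training accuracy, mapped features:", acc_mapped)
```

A kernel function achieves the same effect without building the mapped features explicitly.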

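3.3 The kernel trick in support vector machines

Note that in the dual problem of the linear support vector machine, both the objective function and the decision function involve only inner products between input instances. The inner product \(x_i\bullet x_j\) in the dual objective can therefore be replaced by a kernel function \[K(x_i,x_j) = \phi (x_i)\bullet \phi(x_j)\] and the dual objective becomes \[\max\limits_{a}\quad L(a) = -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N a_ia_jy_iy_jK(x_i,x_j) + \sum_{i=1}^N a_i\]

Likewise, the inner product in the classification decision function can be replaced by the kernel function: \[f(x) = \mathrm{sign}(\sum_{i=1}^N a_i^*y_iK(x_i,x)+b^*)\]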
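The kernelized decision function can be rebuilt by hand from a fitted `SVC`: `dual_coef_` holds a_i* y_i for the support vectors and `intercept_` holds b*. The dataset (`make_moons`), the RBF kernel, and the values gamma = 0.5 and C = 1 are assumptions of this sketch.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Hypothetical non-linear toy dataset with two classes.
X, y = make_moons(n_samples=100, noise=0.1, random_state=0)

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

# Kernel values K(x_i, x) between every support vector x_i and every sample x.
K = rbf_kernel(clf.support_vectors_, X, gamma=gamma)  # shape (n_SV, n_samples)

# sum_i a_i* y_i K(x_i, x) + b*, evaluated for every sample.
manual_scores = clf.dual_coef_[0] @ K + clf.intercept_[0]

# The sign of the score selects the class: positive -> classes_[1].
manual_pred = clf.classes_[(manual_scores > 0).astype(int)]

print("max abs difference:", np.abs(manual_scores - clf.decision_function(X)).max())
```

Only the support vectors appear in the sum, which is why prediction cost grows with the number of support vectors rather than with the size of the training set.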

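4. Training an SVM with scikit-learn

SVMs are particularly well suited to small, complex datasets (samples < 100k).
Hard-margin classification has two main problems:
- (1) The data must be linearly separable.
- (2) It is very sensitive to outliers, which can lead to poor generalization or make a hard margin impossible to find.
Soft-margin classification addresses both problems by finding a good balance between keeping the street as wide as possible and limiting margin violations (instances that end up inside the street or even on the wrong side).
In scikit-learn's SVM classes this balance is controlled by the hyperparameter C: the smaller C is, the wider the street, but the more margin violations occur. If an SVM model is overfitting, try regularizing it by reducing C.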
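The effect of C on the street width can be seen by fitting the same data with a small and a large value. The overlapping `make_blobs` data and the two values of C below are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical overlapping two-class data.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)
y_signed = np.where(y == 1, 1, -1)  # labels in {-1, +1}

results = {}
for C in (0.01, 10.0):
    clf = SVC(kernel="linear", C=C).fit(X, y_signed)
    w, b = clf.coef_[0], clf.intercept_[0]
    # Street width is the distance 2 / ||w|| between the margin boundaries.
    street_width = 2.0 / np.linalg.norm(w)
    # Margin violations: points strictly inside the street or on the wrong side.
    violations = int(np.sum(y_signed * (X @ w + b) < 1.0 - 1e-3))
    results[C] = (street_width, violations)
    print(f"C={C}: street width={street_width:.3f}, margin violations={violations}")
```

The smaller C should give the wider street, at the price of tolerating more points inside it.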

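4.1 Linear classification: the LinearSVC class

4.1.1 Important LinearSVC parameters

penalty: string, 'l1' or 'l2', default='l2'
loss: string, 'hinge' or 'squared_hinge', default='squared_hinge'; 'hinge' is the standard SVM loss function
dual: bool, default=True; prefer dual=False when n_samples > n_features. The primal and dual SVM problems have the same solution
tol: float, default=1e-4, tolerance for the stopping criterion
C: float, default=1.0, the penalty coefficient on the slack variables
multi_class: defaults to 'ovr'; this parameter normally does not need to be changed
See the source code for further details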
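A minimal sketch of how these parameters are passed in practice, assuming the iris dataset and a scaling pipeline (both chosen for this example). It sets `dual=False` because the dataset has more samples than features, as suggested above.

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features, 3 classes

# Scale the features, then fit a linear SVM in its primal form.
model = make_pipeline(
    StandardScaler(),
    LinearSVC(penalty="l2", loss="squared_hinge", dual=False, tol=1e-4, C=1.0),
)
model.fit(X, y)

# One-vs-rest training produces one weight vector per class.
linear_svc = model.named_steps["linearsvc"]
train_accuracy = model.score(X, y)

print("coef_ shape:", linear_svc.coef_.shape)
print("training accuracy:", train_accuracy)
```

Note that `loss='hinge'` is only available with `dual=True` in LinearSVC, so the primal form here uses the squared hinge loss.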

4.1.2 Hinge loss function

The function max(0, 1-t) equals 0 when t >= 1; when t < 1 its derivative is -1.
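A vectorized sketch of the loss and its (sub)derivative, with t = y * f(x) for labels y in {-1, +1}. The helper names `hinge_vec` and `hinge_subgradient` and the sample scores are assumptions of this example; `sklearn.metrics.hinge_loss` is used only to cross-check the mean loss.

```python
import numpy as np
from sklearn.metrics import hinge_loss


def hinge_vec(t):
    """Element-wise hinge loss max(0, 1 - t)."""
    return np.maximum(0.0, 1.0 - t)


def hinge_subgradient(t):
    """Derivative of the hinge loss: -1 where t < 1, 0 elsewhere."""
    return np.where(t < 1.0, -1.0, 0.0)


# Hypothetical labels and decision-function scores.
y_true = np.array([1, -1, 1, -1])
scores = np.array([2.0, -0.5, 0.3, 0.4])

t = y_true * scores                 # margins y * f(x)
per_sample = hinge_vec(t)           # loss of each sample
mean_loss = per_sample.mean()       # average hinge loss

print("margins:", t)
print("per-sample loss:", per_sample)
print("derivative:", hinge_subgradient(t))
print("mean loss:", mean_loss, "sklearn:", hinge_loss(y_true, scores))
```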

import numpy as np
import matplotlib.pyplot as plt

def hinge(t):
    """Hinge loss max(0, 1 - t) for a scalar t."""
    return max(0.0, 1.0 - t)

# Plot the hinge loss on the interval [-2, 4]
x = np.linspace(-2, 4, 20)
y = [hinge(i) for i in x]
ax = plt.subplot(111)
plt.ylim([-1, 2])
ax.plot(x, y, 'r-')
plt.text(0.5, 1.5, r'f(t) = max(0,1-t)', fontsize=20)
plt.show()

<Figure size 640x480 with 1 Axes>

4.1.3 LinearSVC example

from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()
print(iris.keys())
print('labels:', iris['target_names'])
features, labels = iris['data'], iris['target']
print(features.shape, labels.shape)

# Inspect the dataset
print('-------feature_names:', iris['feature_names'])
iris_df = pd.DataFrame(features)
print('-------info:')
iris_df.info()  # info() prints its report directly and returns None
print('--------describe:', iris_df.describe())

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
labels: ['setosa' 'versicolor' 'virginica']
(150, 4) (150,)
-------feature_names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
-------info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
0    150 non-null float64
1    150 non-null float64
2    150 non-null float64
3    150 non-null float64
dtypes: float64(4)
memory usage: 4.8 KB
--------describe:                 0           1           2           3
count  150.000000  150.000000  150.000000  150.000000
mean     5.843333    3.057333    3.758000    1.199333
std      0.828066    0.435866    1.765298    0.762238
min      4.300000    2.000000    1.000000    0.100000
25%      5.100000    2.800000    1.600000    0.300000
50%      5.800000    3.000000    4.350000    1.300000
75%      6.400000    3.300000    5.100000    1.800000
max      7.900000    4.400000    6.900000    2.500000

# Preprocess the data
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import LinearSVC
from scipy.stats import uniform

# Standardize the features to zero mean and unit variance
scaler = StandardScaler()
X = scaler.fit_transform(features)
print(X.mean(axis=0))
print(X.std(axis=0))
# Encode the labels as integers
encoder = LabelEncoder()
Y = encoder.fit_transform(labels)

# Hyperparameter tuning: sample C uniformly from [0, 10]
svc = LinearSVC(loss='hinge', dual=True)
param_distributions = {'C': uniform(0, 10)}
rscv_clf = RandomizedSearchCV(estimator=svc, param_distributions=param_distributions, cv=3, n_iter=20, verbose=2)
rscv_clf.fit(X, Y)
print(rscv_clf.best_params_)

[-1.69031455e-15 -1.84297022e-15 -1.69864123e-15 -1.40924309e-15]
[1. 1. 1. 1.]
Fitting 3 folds for each of 20 candidates, totalling 60 fits
[CV] C=8.266733168092582 .............................................
[CV] .............................. C=8.266733168092582, total=   0.0s
[CV] C=8.266733168092582 .............................................
[CV] .............................. C=8.266733168092582, total=   0.0s
[CV] C=8.266733168092582 .............................................
[CV] .............................. C=8.266733168092582, total=   0.0s
[CV] C=8.140498369662586 .............................................
[CV] .............................. C=8.140498369662586, total=   0.0s
...
...
...
[CV] .............................. C=9.445168322251103, total=   0.0s
[CV] C=9.445168322251103 .............................................
[CV] .............................. C=9.445168322251103, total=   0.0s
[CV] C=2.100443613273717 .............................................
[CV] .............................. C=2.100443613273717, total=   0.0s
[CV] C=2.100443613273717 .............................................
[CV] .............................. C=2.100443613273717, total=   0.0s
[CV] C=2.100443613273717 .............................................
[CV] .............................. C=2.100443613273717, total=   0.0s
{'C': 3.2357870215300046}

# Model evaluation (on the training data, so the score is optimistic)
y_pred = rscv_clf.predict(X)
result = np.equal(y_pred, Y).astype(np.float32)
print('accuracy:', np.sum(result) / len(result))

accuracy: 0.9466666666666667

from sklearn.metrics import accuracy_score, precision_score, recall_score

# scikit-learn metrics take their arguments in the order (y_true, y_pred)
print('accuracy_score:', accuracy_score(Y, y_pred))
print('precision_score:', precision_score(Y, y_pred, average='micro'))

accuracy_score: 0.9466666666666667
precision_score: 0.9466666666666667
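The evaluation above scores the model on the same samples it was trained on. A sketch of a held-out evaluation follows; the 70/30 stratified split, the `random_state`, and the pipeline are assumptions of this example and are not the author's setup, so the resulting number is not comparable to the figure above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# Hold out 30% of the samples, keeping the class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Putting the scaler inside the pipeline means it is fitted on the
# training split only, so no test-set statistics leak into training.
model = make_pipeline(StandardScaler(), LinearSVC(C=1.0, dual=False))
model.fit(X_train, y_train)

test_accuracy = accuracy_score(y_test, model.predict(X_test))
print("held-out accuracy:", test_accuracy)
```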

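5. Appendix

5.1 Non-linear SVM classification with SVC

The SVC class supports both linear and non-linear classification through its kernel parameter. Its parameters and attributes are described below.

5.1.1 SVC parameters

C: penalty coefficient, float, default=1.0
kernel: string, default='rbf'; the kernel function, which must be one of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed', or a callable
degree: only meaningful when kernel='poly'; the degree of the polynomial kernel
gamma: kernel coefficient; default='auto' in older releases, 'scale' since scikit-learn 0.22
coef0: float, optional (default=0.0); independent term in the kernel function. It is only significant in 'poly' and 'sigmoid', where it controls how strongly the model is influenced by high-degree versus low-degree terms
shrinking: bool, default=True; whether to use the shrinking heuristic
probability: bool, default=False; whether to enable probability estimates
tol: tolerance for the stopping criterion
cache_size: size of the kernel cache (in MB)
class_weight: class label weights
verbose: enables verbose log output
max_iter: maximum number of iterations
decision_function_shape: 'ovo' or 'ovr', default='ovr'
random_state: seed for the pseudo-random number generator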

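5.1.2 SVC attributes

support_: indices of the support vectors
support_vectors_: the support vectors
n_support_: number of support vectors for each class
dual_coef_: coefficients of the support vectors in the decision function (a_i y_i)
coef_: weights assigned to the features; only available with a linear kernel
intercept_: constants in the decision function
fit_status_: 0 if fitted correctly, 1 otherwise
probA_: parameter learned for Platt scaling when probability=True
probB_: parameter learned for Platt scaling when probability=True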
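5.1.3 Choosing a kernel

With so many kernels available, how do you decide which one to use? As a rule of thumb, always try the linear kernel first (LinearSVC is much faster than SVC(kernel='linear')), especially when the training set is very large or has many features. If the training set is not too large, try the Gaussian RBF kernel as well; it works well in most cases. If you have time and computing power to spare, use cross-validation and grid search to try a few other kernels, especially kernels designed for the structure of your dataset.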
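One way to follow this advice is to let a grid search compare kernels under cross-validation. The sketch below assumes the iris dataset, a small hand-picked grid, and 5-fold cross-validation; the step name `svc` comes from `make_pipeline`, which names each step after its lowercased class.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Scale inside the pipeline so each CV fold is scaled on its own training part.
pipe = make_pipeline(StandardScaler(), SVC())

# Hypothetical grid: three kernels crossed with three values of C.
param_grid = {
    "svc__kernel": ["linear", "rbf", "poly"],
    "svc__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

best_kernel = search.best_params_["svc__kernel"]
print("best parameters:", search.best_params_)
print("best mean CV accuracy:", round(search.best_score_, 4))
```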

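5.2 The GridSearchCV class

5.2.1 GridSearchCV parameters

estimator: the estimator, a subclass of BaseEstimator
param_grid: dict; keys are parameter names and values are the lists of candidate values to try
scoring: default=None, in which case the estimator's score method is used
fit_params: parameters passed to the estimator's fit method
n_jobs: number of jobs to run in parallel, None or int; None means 1 job, -1 means all processors; default=None
cv: cross-validation splitting strategy, None or integer; None means the default (3-fold in older releases, 5-fold since scikit-learn 0.22); an integer sets the number of folds in a (Stratified)KFold
verbose: verbosity of the log output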

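5.2.2 GridSearchCV attributes

cv_results_: dict of numpy (masked) ndarrays
best_estimator_: the estimator chosen by the search, refit on the whole dataset (requires refit=True)
best_score_: mean cross-validated score of the best_estimator
best_params_: the parameter setting that gave the best results on the held-out data
best_index_: int, the index (of the ``cv_results_`` arrays) which corresponds to the best candidate parameter setting
scorer_: the scorer function used on the held-out data to choose the best parameters
n_splits_: the number of cross-validation splits (folds/iterations)
refit_time_: float, seconds used for refitting the best model on the whole dataset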
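A short sketch of how these attributes relate to one another after a search. The dataset, the 3 x 3 grid, and `cv=3` are assumptions of this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hypothetical grid: 3 values of C times 3 values of gamma = 9 candidates.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3, n_jobs=1)
search.fit(X, y)

results = search.cv_results_
best = search.best_index_

# best_index_ points into the cv_results_ arrays.
print("number of candidates:", len(results["params"]))
print("best params:", results["params"][best])
print("best mean test score:", results["mean_test_score"][best])
print("n_splits_:", search.n_splits_)
print("refit_time_ (s):", search.refit_time_)
```

`best_params_` and `best_score_` are simply the entries of `cv_results_` at position `best_index_`.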

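5.3 The RandomizedSearchCV class

5.3.1 RandomizedSearchCV parameters

estimator: the estimator, a subclass of BaseEstimator
param_distributions: dict with parameter names (string) as keys and distributions or lists of parameters to try. Distributions must provide a ``rvs`` method for sampling (such as those from scipy.stats.distributions). If a list is given, it is sampled uniformly
n_iter: number of parameter settings that are sampled, default=10
scoring: default=None, in which case the estimator's score method is used
fit_params: parameters passed to the estimator's fit method
n_jobs: number of jobs to run in parallel, None or int; None means 1 job, -1 means all processors; default=None
cv: cross-validation splitting strategy, None or integer; None means the default (3-fold in older releases, 5-fold since scikit-learn 0.22); an integer sets the number of folds in a (Stratified)KFold
verbose: verbosity of the log output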

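5.3.2 RandomizedSearchCV attributes

cv_results_: dict of numpy (masked) ndarrays
best_estimator_: the estimator chosen by the search, refit on the whole dataset (requires refit=True)
best_score_: mean cross-validated score of the best_estimator
best_params_: the parameter setting that gave the best results on the held-out data
best_index_: int, the index (of the ``cv_results_`` arrays) which corresponds to the best candidate parameter setting
scorer_: the scorer function used on the held-out data to choose the best parameters
n_splits_: the number of cross-validation splits (folds/iterations)
refit_time_: float, seconds used for refitting the best model on the whole dataset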
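A sketch of a reproducible randomized search that mixes a continuous distribution with a list. Everything specific here is an assumption of the example: the iris dataset, the log-uniform range for C, `n_iter=10`, and `random_state=0`. `scipy.stats.loguniform` requires SciPy 1.4 or later.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# C is drawn from a log-uniform distribution (it has an rvs method);
# kernel is given as a list and therefore sampled uniformly.
param_distributions = {
    "C": loguniform(1e-2, 1e2),
    "kernel": ["linear", "rbf"],
}

search = RandomizedSearchCV(
    SVC(),
    param_distributions=param_distributions,
    n_iter=10,        # number of sampled parameter settings
    cv=3,
    random_state=0,   # fixes the sampled candidates for reproducibility
)
search.fit(X, y)

sampled = search.cv_results_["params"]
print("number of sampled settings:", len(sampled))
print("best params:", search.best_params_)
print("best mean CV accuracy:", round(search.best_score_, 4))
```

Sampling C on a log scale spreads the candidates evenly across orders of magnitude, which a plain uniform distribution over the same range would not do.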