Official account: 碼農充電站pro
Homepage: https://codeshellme.github.io
The previous post covered the principles of SVM and some basic concepts; this post shows how to use SVM to solve a practical problem.

The SVM algorithm can handle both classification and regression problems.
The svm module of the sklearn library implements the following four SVM estimators: LinearSVC, LinearSVR, SVC, and SVR.
LinearSVC/R uses a linear kernel by default to handle linear problems.

SVC/R lets us choose whether to use a linear kernel or a high-dimensional kernel: with a linear kernel it can handle linear problems; with a high-dimensional kernel it can handle nonlinear problems.

For linear problems, LinearSVC/R is a better choice than SVC/R, because LinearSVC/R is specifically optimized and therefore more efficient.

If you do not know whether the dataset is linearly separable, you can use SVC/R.
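To make the choice concrete, here is a minimal sketch (the toy dataset and variable names are our own, not from the original post) that fits the three options side by side:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

# A small synthetic dataset, for illustration only
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

linear_svc = LinearSVC(max_iter=10000)  # dedicated, optimized linear solver
svc_linear = SVC(kernel='linear')       # same linear decision family, general solver
svc_rbf    = SVC(kernel='rbf')          # high-dimensional (nonlinear) kernel

for clf in (linear_svc, svc_linear, svc_rbf):
    clf.fit(X, y)
    print(type(clf).__name__, clf.score(X, y))
```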
Below we focus on the classifier; using the regressor is much the same.
First, let's look at the prototype of the SVC class:

```python
SVC(C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True,
    probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False,
    max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=None)
```

Among them, a few parameters matter most:

- `kernel`: the kernel function, one of `linear`, `poly`, `rbf` (the default), `sigmoid`, or `precomputed`;
- `C`: the penalty coefficient; a larger C tolerates fewer misclassifications on the training set but risks overfitting;
- `gamma`: the kernel coefficient for the `rbf`, `poly`, and `sigmoid` kernels.
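As a quick illustration (the values below are arbitrary examples, not recommendations), these parameters are set when constructing the classifier:

```python
from sklearn.svm import SVC

clf = SVC(kernel='rbf',   # kernel function; 'rbf' is the default
          C=1.0,          # penalty coefficient: larger C tolerates fewer training errors
          gamma='scale')  # kernel coefficient; 'scale' means 1 / (n_features * X.var())
```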
Next, look at the prototype of the LinearSVC class:

```python
LinearSVC(penalty='l2', loss='squared_hinge', dual=True, tol=0.0001, C=1.0,
          multi_class='ovr', fit_intercept=True, intercept_scaling=1,
          class_weight=None, verbose=0, random_state=None, max_iter=1000)
```

The LinearSVC class has no kernel parameter, because it uses a linear kernel by default.
sklearn ships with a breast cancer dataset; below we use it to build an SVM classifier.

The dataset contains features collected from patients: 569 samples in total, each with 31 comma-separated fields. Of the 569 samples, 357 are benign and 212 are malignant.
Here are 3 randomly drawn samples:
```
16.13,20.68,108.1,798.8,0.117,0.2022,0.1722,0.1028,0.2164,0.07356,0.5692,1.073,3.854,54.18,0.007026,0.02501,0.03188,0.01297,0.01689,0.004142,20.96,31.48,136.8,1315,0.1789,0.4233,0.4784,0.2073,0.3706,0.1142,0
19.81,22.15,130,1260,0.09831,0.1027,0.1479,0.09498,0.1582,0.05395,0.7582,1.017,5.865,112.4,0.006494,0.01893,0.03391,0.01521,0.01356,0.001997,27.32,30.88,186.8,2398,0.1512,0.315,0.5372,0.2388,0.2768,0.07615,0
13.54,14.36,87.46,566.3,0.09779,0.08129,0.06664,0.04781,0.1885,0.05766,0.2699,0.7886,2.058,23.56,0.008462,0.0146,0.02387,0.01315,0.0198,0.0023,15.11,19.26,99.7,711.2,0.144,0.1773,0.239,0.1288,0.2977,0.07259,1
```
The following table shows what each column represents (the names match the dataset's feature_names; "error" denotes the standard error, "worst" the largest value):
Column | Meaning | Column | Meaning | Column | Meaning
---|---|---|---|---|---
1 | mean radius | 11 | radius error | 21 | worst radius
2 | mean texture | 12 | texture error | 22 | worst texture
3 | mean perimeter | 13 | perimeter error | 23 | worst perimeter
4 | mean area | 14 | area error | 24 | worst area
5 | mean smoothness | 15 | smoothness error | 25 | worst smoothness
6 | mean compactness | 16 | compactness error | 26 | worst compactness
7 | mean concavity | 17 | concavity error | 27 | worst concavity
8 | mean concave points | 18 | concave points error | 28 | worst concave points
9 | mean symmetry | 19 | symmetry error | 29 | worst symmetry
10 | mean fractal dimension | 20 | fractal dimension error | 30 | worst fractal dimension
The last column indicates whether the tumor is benign: 0 means malignant, 1 means benign.
We can load the dataset with the load_breast_cancer function:

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
```
The feature_names attribute stores the meaning of each column:

```python
>>> print(data.feature_names)
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
```
The data attribute stores the feature values:

```python
>>> print(data.data)
[[1.799e+01 1.038e+01 1.228e+02 ... 2.654e-01 4.601e-01 1.189e-01]
 [2.057e+01 1.777e+01 1.329e+02 ... 1.860e-01 2.750e-01 8.902e-02]
 [1.969e+01 2.125e+01 1.300e+02 ... 2.430e-01 3.613e-01 8.758e-02]
 ...
 [1.660e+01 2.808e+01 1.083e+02 ... 1.418e-01 2.218e-01 7.820e-02]
 [2.060e+01 2.933e+01 1.401e+02 ... 2.650e-01 4.087e-01 1.240e-01]
 [7.760e+00 2.454e+01 4.792e+01 ... 0.000e+00 2.871e-01 7.039e-02]]
```
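A quick shape check confirms the 569 samples and 30 feature columns described above:

```python
>>> print(data.data.shape)
(569, 30)
```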
The target attribute stores the target values:

```python
>>> print(data.target)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 0 1 0 1 0 0 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 1 0 1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1 1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 1 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 0 1 1 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 0 0 1 0 1 0 1 1 1 1 1 0 1 1 0 1 0 1 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1]
```
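Counting the labels confirms the class balance stated earlier, 212 malignant (0) and 357 benign (1):

```python
>>> import numpy as np
>>> print(np.bincount(data.target))
[212 357]
```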
We know that the first 30 columns of the dataset are features and the last column is the target. Looking at the feature columns, the 30 features cover 10 underlying measurements, each reported in three groups: the mean values (columns 1-10), the standard errors (columns 11-20), and the worst (largest) values (columns 21-30).

Therefore, when training the SVM model, we can select just one of these groups as the training features.
For example, here we choose the first 10 columns as the training features (ignoring the remaining 20):
```python
>>> features = data.data[:, 0:10]  # feature set
>>> labels = data.target           # target set
```
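If you would rather train on one of the other two groups, the slices follow the same pattern (the variable names below are our own):

```python
>>> features_se = data.data[:, 10:20]     # the standard-error columns
>>> features_worst = data.data[:, 20:30]  # the worst-value columns
```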
Split the data into a training set and a test set:

```python
from sklearn.model_selection import train_test_split

train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.33, random_state=0)
```
Standardize the data with Z-Score normalization:

```python
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
# Fit the scaler on the training set only, then apply the same transform to the test set
train_features = ss.fit_transform(train_features)
test_features = ss.transform(test_features)
```
Next, build the SVM classifier with the SVC class:

```python
from sklearn.svm import SVC

svc = SVC()  # use all default parameters
```
Train the model:

```python
svc.fit(train_features, train_labels)
```
Make predictions with the model:

```python
prediction = svc.predict(test_features)
```
Evaluate the model's accuracy:

```python
>>> from sklearn.metrics import accuracy_score
>>> score = accuracy_score(test_labels, prediction)
>>> print(score)
0.9414893617021277
```
The accuracy is about 94%, so the model was trained quite well.
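If you want to see how the remaining errors are distributed between the two classes, a confusion matrix is one option (a sketch; the actual output depends on the split, so it is omitted here):

```python
from sklearn.metrics import confusion_matrix

# Rows are true classes (0 = malignant, 1 = benign), columns are predicted classes
print(confusion_matrix(test_labels, prediction))
```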
The linear classifier LinearSVC can be used in essentially the same way.
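A minimal LinearSVC sketch, assuming the same scaled features and train/test split as above (this is our own illustration, not an official example):

```python
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

lsvc = LinearSVC(max_iter=10000)  # a higher max_iter helps the solver converge
lsvc.fit(train_features, train_labels)
print(accuracy_score(test_labels, lsvc.predict(test_features)))
```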
sklearn implements the SVM algorithm, and this post has shown how to use it to solve a practical problem.

Besides sklearn, the SVM algorithm is also implemented by LIBSVM, another very well-known library that you can explore on your own.

(End of this section.)