Notes : Chapter 5

 
  1. SVMs are capable of performing linear and nonlinear classification, regression, and even outlier detection
  2. they are particularly well suited to classification of complex but small- or medium-sized datasets
 

Linear SVM Classification

 
  1. Instead of just fitting a line that happens to separate the two classes, the goal is to fit the widest possible "street" between them.
  2. Instances outside the street's borders have no influence on the decision boundary; it is determined entirely by the instances located on the edge of the street, which are called support vectors.
  3. SVMs are sensitive to the feature scales; apply feature scaling first (e.g., StandardScaler).
 

Soft Margin Classification

  1. Strictly requiring all instances to be off the street and on the correct side is called hard margin classification: it only works if the data is linearly separable, and it is sensitive to outliers.
  2. In Scikit-Learn's SVM classes you can control the number of margin violations with the C hyperparameter: a smaller C value leads to a wider street (but more margin violations).
In [1]:
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris['data'][:, (2, 3)]                    # petal length, petal width
y = (iris['target'] == 2).astype(np.float64)   # 1.0 if Iris-Virginica, else 0.0

svm_clf = Pipeline([
    ('scaler', StandardScaler()),              # SVMs need feature scaling
    ('linear_svc', LinearSVC(C=1, loss='hinge')),
])

svm_clf.fit(X, y)
Out[1]:
Pipeline(steps=(('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('linear_svc', LinearSVC(C=1, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0))))
In [2]:
svm_clf.predict([[5.5, 1.7]])  # predicts the class; LinearSVC does not output probabilities
Out[2]:
array([ 1.])
In [3]:
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier

svc_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='linear', C=1)),        # same model via the SVC class (slower)
])
m = len(X)
C = 1
sgd_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('sgd', SGDClassifier(loss='hinge', alpha=1/(m*C)))   # regular SGD with hinge loss
])

svc_clf.fit(X, y)
sgd_clf.fit(X, y)
print(svc_clf.predict([[5.5, 1.7]]), sgd_clf.predict([[5.5, 1.7]]))  
 
[ 1.] [ 1.]
 
  1. SGDClassifier converges more slowly than LinearSVC, but it can handle huge datasets and online classification tasks; SVC(kernel='linear') is slower still, especially on large training sets.
  2. The LinearSVC class regularizes the bias term, so you should center the training data first by subtracting the mean (StandardScaler does this automatically).
  3. Set the loss hyperparameter to 'hinge' (it is not the default).
  4. For better performance set dual=False, unless there are more features than training instances.
 

Nonlinear SVM Classification

 
  1. One way to handle nonlinear datasets is to add more features, such as polynomial features; the transformed dataset may become linearly separable, so a linear classifier can then be used.
In [4]:
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=100, noise=0.15, random_state=42)
In [5]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

polynomial_svm_clf = Pipeline([
    ('poly_features', PolynomialFeatures(degree=3)),   # add degree-3 polynomial features
    ('scaler', StandardScaler()),
    ('svm_clf', LinearSVC(C=10, loss='hinge'))
])

polynomial_svm_clf.fit(X, y)
Out[5]:
Pipeline(steps=(('poly_features', PolynomialFeatures(degree=3, include_bias=True, interaction_only=False)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm_clf', LinearSVC(C=10, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0))))
In [21]:
import matplotlib.pyplot as plt

def plot_predictions(clf, axes):
    x0s = np.linspace(axes[0], axes[1], 100)
    x1s = np.linspace(axes[2], axes[3], 100)
    x0, x1 = np.meshgrid(x0s, x1s)
    X = np.c_[x0.ravel(), x1.ravel()]
    y_pred = clf.predict(X).reshape(x0.shape)
    y_decision = clf.decision_function(X).reshape(x0.shape)
    plt.contourf(x0, x1, y_pred, cmap=plt.cm.brg, alpha=0.2)
    plt.contourf(x0, x1, y_decision, cmap=plt.cm.brg, alpha=0.1)

def plot_datasets(X, y, axes):
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], 'bs')
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], 'g^')
    plt.axis(axes)
    plt.grid(True, which='both')
    plt.xlabel(r'$x_1$', fontsize=10)
    plt.ylabel(r'$x_2$', fontsize=10, rotation=0)
    
plot_predictions(polynomial_svm_clf, [-1.5, 2.5, -1, 1.5])
plot_datasets(X, y, [-1.5, 2.5, -1, 1.5])

plt.show()
 
 
  • Use grid search to find good hyperparameter values. It is usually faster to run a coarse search first, then a finer search around the best values found (see the sketch below).
 
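A minimal sketch of such a coarse-then-fine search (the hyperparameter grids below are illustrative choices, not values from the book; it reuses the moons X, y defined in the cells above):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# coarse search over a wide, logarithmically spaced grid
coarse_grid = {'gamma': [0.01, 0.1, 1, 10], 'C': [0.1, 1, 10, 100]}
coarse_search = GridSearchCV(SVC(kernel='rbf'), coarse_grid, cv=3)
coarse_search.fit(X, y)                      # X, y: the moons dataset from above
print(coarse_search.best_params_)

# finer search centred on the best coarse values (suppose gamma=1, C=10 won)
fine_grid = {'gamma': [0.3, 0.5, 1, 2, 3], 'C': [3, 5, 10, 20, 30]}
fine_search = GridSearchCV(SVC(kernel='rbf'), fine_grid, cv=3)
fine_search.fit(X, y)
print(fine_search.best_params_, fine_search.best_score_)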

Adding Similarity Features

  1. Another technique is to add features computed with a similarity function, which measures how much each instance resembles a particular landmark.
  2. $$Gaussian\ Radial\ Basis\ Function:\ \ \phi_{\gamma}(x, \ell)=e^{-\gamma \left \| x-\ell \right \|^2}$$ Here $\left \| x-\ell \right \|$ is the distance from the instance to the landmark; the similarity values become the new features, and the original features may then be dropped.
  3. How to select the landmarks? The simplest approach is to create a landmark at the location of each and every instance in the dataset (see the sketch below).
 
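A minimal sketch (the toy 1D data and the gamma value are illustrative, not from the book) of computing Gaussian RBF similarity features by hand, using every instance as a landmark:

import numpy as np

def rbf_similarity_features(X, landmarks, gamma=0.3):
    # entry (i, j) = exp(-gamma * ||x_i - landmark_j||^2)
    sq_dists = ((X[:, np.newaxis, :] - landmarks[np.newaxis, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

X_toy = np.arange(-4., 5.).reshape(-1, 1)            # 9 one-dimensional instances
X_sim = rbf_similarity_features(X_toy, landmarks=X_toy, gamma=0.3)
print(X_sim.shape)                                   # (9, 9): one similarity feature per landmark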

Gaussian RBF Kernel

  • Computing the additional features becomes very expensive when the training set is large (one landmark per instance means as many new features as instances).
  • The magic of the kernel trick is that the SVM can obtain a result similar to having added all these features, without actually adding them.
In [13]:
rbf_kernel_svm_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('svm_clf', SVC(kernel='rbf', gamma=5, C=0.001))   # Gaussian RBF kernel
])

rbf_kernel_svm_clf.fit(X,y)
Out[13]:
Pipeline(steps=(('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm_clf', SVC(C=0.001, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=5, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))))
In [22]:
%matplotlib inline
from sklearn.svm import SVC

gamma1, gamma2 = 0.1, 5
C1, C2 = 0.001, 1000
hyperparams = (gamma1, C1), (gamma1, C2), (gamma2, C1), (gamma2, C2)

svm_clfs = []
for gamma, C in hyperparams:
    rbf_kernel_svm_clf = Pipeline([
            ("scaler", StandardScaler()),
            ("svm_clf", SVC(kernel="rbf", gamma=gamma, C=C))
        ])
    rbf_kernel_svm_clf.fit(X, y)
    svm_clfs.append(rbf_kernel_svm_clf)

plt.figure(figsize=(14, 9))

for i, svm_clf in enumerate(svm_clfs):
    plt.subplot(221 + i)
    plot_predictions(svm_clf, [-1.5, 2.5, -1, 1.5])
    plot_datasets(X, y, [-1.5, 2.5, -1, 1.5])
    gamma, C = hyperparams[i]
    plt.title(r"$\gamma = {}, C = {}$".format(gamma, C), fontsize=10)

plt.show()
 
 
  • Increasing $\gamma$ makes the bell-shaped curve narrower, so each instance's range of influence shrinks.
  • Decreasing $\gamma$ makes the bell-shaped curve wider, so each instance's range of influence grows.
  • $C$ behaves similarly to $\gamma$ in this respect: both act like regularization hyperparameters (reduce them if the model is overfitting).
 
  1. Other kernels are used much more rarely or only for specific data structures.
  2. String kernels are often used for text documents or DNA sequences, e.g., the string subsequence kernel or kernels based on the Levenshtein distance.
  3. How to choose a kernel:
    1. Try the linear kernel first; LinearSVC is fast.
    2. SVC(kernel='linear') gives the same model but is slower.
    3. If the training set is not too large, try the Gaussian RBF kernel.
    4. If you have spare time and computing power, experiment with other kernels using cross-validation and grid search.
 

Computational Complexity

 
| Class         | Time complexity                        | Out-of-core support | Scaling required | Kernel trick |
|---------------|----------------------------------------|---------------------|------------------|--------------|
| LinearSVC     | $O(m \times n)$                        | No                  | Yes              | No           |
| SGDClassifier | $O(m \times n)$                        | Yes                 | Yes              | No           |
| SVC           | $O(m^2 \times n)$ to $O(m^3 \times n)$ | No                  | Yes              | Yes          |
 

SVM Regression

  1. Unlike the classifier, SVM regression tries to fit as many instances as possible on the street (whose width is controlled by the hyperparameter $\epsilon$) while limiting margin violations, i.e., instances off the street.
In [33]:
from sklearn.svm import LinearSVR
import numpy.random as rnd
rnd.seed(42)
m = 50
X = 2 * rnd.rand(m, 1)
y = (4 + 3 * X + rnd.randn(m, 1)).ravel()

svm_reg = LinearSVR(epsilon=1.5)   # epsilon controls the width of the street
svm_reg.fit(X, y)
Out[33]:
LinearSVR(C=1.0, dual=True, epsilon=1.5, fit_intercept=True,
     intercept_scaling=1.0, loss='epsilon_insensitive', max_iter=1000,
     random_state=None, tol=0.0001, verbose=0)
In [34]:
from sklearn.svm import SVR

svm_poly_reg = SVR(kernel='poly', degree=2, C=100, epsilon=0.1)   # kernelized SVM regression, 2nd-degree polynomial kernel

svm_poly_reg.fit(X,y)
Out[34]:
SVR(C=100, cache_size=200, coef0=0.0, degree=2, epsilon=0.1, gamma='auto',
  kernel='poly', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
In [35]:
svm_reg1 = LinearSVR(epsilon=1.5)
svm_reg2 = LinearSVR(epsilon=0.5)
svm_reg1.fit(X, y)
svm_reg2.fit(X, y)

def find_support_vectors(svm_reg, X, y):
    y_pred = svm_reg.predict(X)
    off_margin = (np.abs(y - y_pred) >= svm_reg.epsilon)
    return np.argwhere(off_margin)  # indices of the instances that lie outside the margin

svm_reg1.support_ = find_support_vectors(svm_reg1, X, y)  # actually the instances outside the margin, stored for plotting
svm_reg2.support_ = find_support_vectors(svm_reg2, X, y)

eps_x1 = 1
eps_y_pred = svm_reg1.predict([[eps_x1]])  # at this x position, draw the epsilon arrow

def plot_svm_regression(svm_reg, X, y, axes):
    x1s = np.linspace(axes[0], axes[1], 100).reshape(100, 1)
    y_pred = svm_reg.predict(x1s)
    plt.plot(x1s, y_pred, "k-", linewidth=2, label=r"$\hat{y}$")
    plt.plot(x1s, y_pred + svm_reg.epsilon, "k--")
    plt.plot(x1s, y_pred - svm_reg.epsilon, "k--")
    plt.scatter(X[svm_reg.support_], y[svm_reg.support_], s=180, facecolors='#FFAAAA')
    plt.plot(X, y, "bo")
    plt.xlabel(r"$x_1$", fontsize=18)
    plt.legend(loc="upper left", fontsize=18)
    plt.axis(axes)

plt.figure(figsize=(9, 4))
plt.subplot(121)
plot_svm_regression(svm_reg1, X, y, [0, 2, 3, 11])
plt.title(r"$\epsilon = {}$".format(svm_reg1.epsilon), fontsize=18)
plt.ylabel(r"$y$", fontsize=18, rotation=0)
#plt.plot([eps_x1, eps_x1], [eps_y_pred, eps_y_pred - svm_reg1.epsilon], "k-", linewidth=2)
plt.annotate(
        '', xy=(eps_x1, eps_y_pred), xycoords='data',
        xytext=(eps_x1, eps_y_pred - svm_reg1.epsilon),
        textcoords='data', arrowprops={'arrowstyle': '<->', 'linewidth': 1.5}
    )
plt.text(0.91, 5.6, r"$\epsilon$", fontsize=20)
plt.subplot(122)
plot_svm_regression(svm_reg2, X, y, [0, 2, 3, 11])
plt.title(r"$\epsilon = {}$".format(svm_reg2.epsilon), fontsize=18)
plt.show()
 
 

Under the Hood

 

Decision Function and Predictions

  1. $$ w^T \cdot x + b = w_1 x_1 + \cdots + w_n x_n + b \ \ ;\ \ bias\ term=b,\ feature\ weights\ vector=w \\ \widehat{y}=\left\{\begin{matrix} 0\ \ if\ w^T \cdot x +b <0\\ 1\ \ if\ w^T \cdot x +b \geq 0 \end{matrix}\right.$$
  2. Training a linear SVM classifier means finding the values of $w$ and $b$ that make this margin as wide as possible while avoiding margin violations (hard margin) or limiting them (soft margin). A quick numerical check of the decision rule appears in the sketch below.
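A minimal sketch (it reuses the same iris petal features and Iris-Virginica target as the cell below; max_iter is raised only to silence convergence warnings) showing that the decision rule above can be evaluated by hand from coef_ and intercept_:

import numpy as np
from sklearn import datasets
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris['data'][:, (2, 3)]                     # petal length, petal width
y = (iris['target'] == 2).astype(np.float64)    # Iris-Virginica?

clf = LinearSVC(C=10, loss='hinge', max_iter=10000).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

x_new = np.array([5.5, 1.7])
h = w.dot(x_new) + b                              # w^T . x + b, computed by hand
print(h, clf.decision_function([x_new])[0])       # the two values match
print(int(h >= 0), int(clf.predict([x_new])[0]))  # predict class 1 iff h >= 0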
In [36]:
iris = datasets.load_iris()
X = iris['data'][:, (2, 3)]
y = (iris['target']==2).astype(np.float64)
In [47]:
from mpl_toolkits.mplot3d import Axes3D

def plot_3D_decision_function(ax, w, b, x1_lim=[4, 6], x2_lim=[0.8, 2.8]):
    x1_in_bounds = (X[:, 0] > x1_lim[0]) & (X[:, 0] < x1_lim[1])
    X_crop = X[x1_in_bounds]
    y_crop = y[x1_in_bounds]
    x1s = np.linspace(x1_lim[0], x1_lim[1], 20)
    x2s = np.linspace(x2_lim[0], x2_lim[1], 20)
    x1, x2 = np.meshgrid(x1s, x2s)
    xs = np.c_[x1.ravel(), x2.ravel()]
    df = (xs.dot(w) + b).reshape(x1.shape)
    m = 1 / np.linalg.norm(w)
    boundary_x2s = -x1s*(w[0]/w[1])-b/w[1]
    margin_x2s_1 = -x1s*(w[0]/w[1])-(b-1)/w[1]
    margin_x2s_2 = -x1s*(w[0]/w[1])-(b+1)/w[1]
    ax.plot_surface(x1s, x2, np.zeros_like(x1), color="b", alpha=0.2, cstride=100, rstride=100)  # the z=0 plane
    ax.plot(x1s, boundary_x2s, 0, "k-", linewidth=2, label=r"$h=0$")
    ax.plot(x1s, margin_x2s_1, 0, "k--", linewidth=2, label=r"$h=\pm 1$")
    ax.plot(x1s, margin_x2s_2, 0, "k--", linewidth=2)
    ax.plot(X_crop[:, 0][y_crop==1], X_crop[:, 1][y_crop==1], 0, "g^")
    ax.plot_wireframe(x1, x2, df, alpha=0.3, color="k")
    ax.plot(X_crop[:, 0][y_crop==0], X_crop[:, 1][y_crop==0], 0, "bs")
    ax.axis(x1_lim + x2_lim)
    ax.text(4.5, 2.5, 3.8, "Decision function $h$", fontsize=15)
    ax.set_xlabel(r"Petal length", fontsize=15)
    ax.set_ylabel(r"Petal width", fontsize=15)
    ax.set_zlabel(r"$h = \mathbf{w}^t \cdot \mathbf{x} + b$", fontsize=18)
    ax.legend(loc="upper left", fontsize=16)

svm_clf2 = LinearSVC(C=10, loss='hinge')
svm_clf2.fit(X, y)

fig = plt.figure(figsize=(11, 6))
ax1 = fig.add_subplot(111, projection='3d')
plot_3D_decision_function(ax1, w=svm_clf2.coef_[0], b=svm_clf2.intercept_[0])

plt.show()
 
 

Training Objective

  1. Hard margin: require that every instance is classified correctly and is off the street, i.e. $$t^{(i)}(w^T \cdot x^{(i)} +b) \geq 1 \ \ \ (t^{(i)}=-1\ for\ negative\ instances,\ +1\ for\ positive\ instances)$$
  2. This turns training into a constrained optimization problem: $$ \underset{w,b}{minimize}\ \ \ \ \frac{1}{2}w^T\cdot w = \frac{1}{2}\left \| w \right \|^2 \\ subject\ to \ \ \ \ t^{(i)}(w^T \cdot x^{(i)} +b) \geq 1,\ \ for\ i=1,2,3,...,m $$ We minimize $\frac{1}{2}\left \| w \right \|^2$ rather than $\left \| w \right \|$ because the latter is not differentiable at $w=0$.
  3. To get a soft margin, introduce slack variables $\varsigma^{(i)} \geq 0$, which measure how much each instance is allowed to violate the margin.
  4. The smaller $\left \| w \right \|$ is, the wider the margin; the smaller the $\varsigma^{(i)}$ are, the smaller the margin violations, but pushing them down tends to force a larger $\left \| w \right \|$ (a narrower margin).
  5. The hyperparameter $C$ sets the trade-off between these two objectives ($w$ and $\varsigma$).
  6. $$ \underset{w,b,\varsigma}{minimize}\ \ \ \ \frac{1}{2}w^T\cdot w + C\sum_{i=1}^{m} \varsigma^{(i)} = \frac{1}{2}\left \| w \right \|^2 + C\sum_{i=1}^{m} \varsigma^{(i)} \\ subject\ to \ \ \ \ t^{(i)}(w^T \cdot x^{(i)} +b) \geq 1-\varsigma^{(i)},\ \ and\ \varsigma^{(i)} \geq 0\ ,\ for\ i=1,2,3,...,m $$
 

Quadratic Programming

  1. $$ \underset{p}{Minimize}\ \ \frac{1}{2} p^T \cdot H \cdot p + f^T \cdot p \\ subject\ to\ \ \ A \cdot p \leq b \\ where\ \left\{\begin{matrix} p\ \ is\ an\ n_p\ dimensional\ vector(=number\ of\ parameters)\\ H\ \ is\ an\ n_p\ \times\ n_p\ matrix\\ f\ \ is\ an\ n_p\ dimensional\ vector\\ A\ \ is\ an\ n_c\ \times\ n_p\ matrix(n_c=number\ of\ constraints)\\ b\ \ is\ an\ n_c\ dimensional\ vector \end{matrix}\right. $$
  2. A hard margin linear SVM can be trained by passing the preceding parameters to an off-the-shelf QP solver (a minimal sketch follows this list).
 
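A minimal sketch (assumptions: the cvxopt package is installed and the toy data below is linearly separable; neither comes from the book) of solving the hard margin problem by handing these QP parameters to an off-the-shelf solver, with the parameter vector $p=(b, w_1, w_2)$:

import numpy as np
from cvxopt import matrix, solvers

X_toy = np.array([[1., 1.], [2., 3.], [3., 3.],          # class t = +1
                  [-1., -1.], [-2., -2.], [-3., -1.]])   # class t = -1
t = np.array([1., 1., 1., -1., -1., -1.])

m, n = X_toy.shape
H = np.zeros((n + 1, n + 1))
H[1:, 1:] = np.eye(n)                                    # quadratic term: ||w||^2 only, no penalty on b
f = np.zeros(n + 1)                                      # no linear term
A = -t[:, None] * np.c_[np.ones(m), X_toy]               # each row encodes -t_i * (b + w.x_i) <= -1
b_vec = -np.ones(m)

solvers.options['show_progress'] = False
sol = solvers.qp(matrix(H), matrix(f), matrix(A), matrix(b_vec))
p = np.array(sol['x']).ravel()
print('b =', p[0], ' w =', p[1:])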

The Dual Problem

  1. Given a constrained optimization problem, known as the primal problem, it is possible to express a different but closely related problem called its dual problem.
  2. For the SVM problem, the primal and the dual problem have the same solution.
  3. Dual form of the linear SVM objective: $$ \underset{\alpha }{minimize}\ \ \frac{1}{2}\sum_{i=1}^{m} \sum_{j=1}^{m} \alpha^{(i)} \alpha^{(j)} t^{(i)} t^{(j)} x^{(i)T} x^{(j)} - \sum_{i=1}^{m} \alpha^{(i)}\\ subject\ to\ \ \alpha^{(i)} \geq 0\ \ for\ i=1,2,...,m $$
  4. From the dual solution back to the primal (see the sketch below): $$ \widehat{w} = \sum_{i=1}^{m} {\widehat{\alpha}}^{(i)} t^{(i)} x^{(i)} \\ \widehat{b} = \frac{1}{n_s} \sum_{i=1,{\widehat{\alpha}}^{(i)}>0}^{m}(1-t^{(i)}({\widehat{w}}^T \cdot x^{(i)})) $$
 
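A minimal sketch of recovering $\widehat{w}$ from the dual solution: for a linear kernel, Scikit-Learn's SVC stores ${\widehat{\alpha}}^{(i)} t^{(i)}$ of the support vectors in dual_coef_ and the support vectors themselves in support_vectors_, so the sum above can be computed directly and compared with coef_ (the iris setup mirrors the earlier cells):

import numpy as np
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
X = iris['data'][:, (2, 3)]
y = (iris['target'] == 2).astype(np.float64)

svc = SVC(kernel='linear', C=1).fit(X, y)
w_from_dual = svc.dual_coef_[0].dot(svc.support_vectors_)   # sum_i alpha_i * t_i * x_i
print(w_from_dual, svc.coef_[0])                            # the two vectors match
print(svc.intercept_[0])                                    # b_hat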

Kernelized SVM
A kernel is a function capable of computing the dot product $\phi (a)^T \cdot \phi (b)$ based only on the original vectors $a$ and $b$, without having to compute (or even to know about) the transformation $\phi$. A small numerical check of this appears after the list below.
$$ \begin{align*} Linear &:\ \ \ K(a,b)=a^T \cdot b \\ Polynomial &:\ \ \ K(a,b)=(\gamma a^T \cdot b + r)^d \\ Gaussian\ RBF &:\ \ \ K(a,b)=\exp(-\gamma \left \| a-b \right \|^2) \\ Sigmoid &:\ \ \ K(a,b)=\tanh(\gamma a^T \cdot b + r) \\ \end{align*} $$

  1. As long as $K$ satisfies Mercer's conditions (continuous and symmetric in its arguments), a mapping $\phi$ is guaranteed to exist such that $K(a,b)=\phi (a)^T \cdot \phi (b)$. We may not know what $\phi$ is, but $K$ can still be used as a kernel.
  2. Computing $\widehat w$ directly involves $\phi (x)$, which may be very hard (or impossible) to compute. Instead, plug the formula for $\widehat w$ into the decision function: $$ \begin{align*} h_{\widehat{w},\widehat{b}}(\phi (x^{(n)})) &= {\widehat{w}}^T \cdot \phi(x^{(n)})+\widehat b \\ &= (\sum_{i=1}^{m} \widehat {\alpha}^{(i)} t^{(i)} \phi(x^{(i)}))^T \cdot \phi(x^{(n)}) +\widehat b \\ &= \sum_{i=1}^{m} \widehat {\alpha}^{(i)} t^{(i)} (\phi(x^{(i)})^T \cdot \phi(x^{(n)})) +\widehat b \\ &= \sum_{i=1,\widehat{\alpha} > 0}^{m}\widehat {\alpha}^{(i)} t^{(i)} K(x^{(i)},x^{(n)}) + \widehat b \\ \widehat{b} &= \frac{1}{n_s} \sum_{i=1,{\widehat{\alpha}}^{(i)}>0}^{m}(1-t^{(i)}({\widehat{w}}^T \cdot \phi(x^{(i)}))) \\ &= \frac{1}{n_s} \sum_{i=1,{\widehat{\alpha}}^{(i)}>0}^{m}(1-t^{(i)}(\sum_{j=1}^{m} \widehat {\alpha}^{(j)} t^{(j)} \phi(x^{(j)}))^T \cdot \phi(x^{(i)}))\\ &= \frac{1}{n_s} \sum_{i=1,{\widehat{\alpha}}^{(i)}>0}^{m}(1-t^{(i)}\sum_{j=1,\widehat{\alpha} > 0}^{m}\widehat {\alpha}^{(j)} t^{(j)} K(x^{(i)},x^{(j)})) \end{align*} $$
 
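A minimal sketch checking the kernel identity numerically for the 2nd-degree polynomial kernel $K(a,b)=(a^T \cdot b)^2$ (i.e. $\gamma=1$, $r=0$, $d=2$), whose explicit mapping is $\phi(x)=(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2)$; the vectors $a$ and $b$ are arbitrary illustrative values:

import numpy as np

def phi(x):
    # explicit 2nd-degree polynomial feature mapping of a 2D vector
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

a = np.array([2., 3.])
b = np.array([4., 5.])

print(phi(a).dot(phi(b)))   # dot product in the transformed 3D space: 529.0
print(a.dot(b) ** 2)        # same value, computed only from the original 2D vectors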

Online SVMs

  1. Online learning means learning incrementally, typically as new instances arrive (see the sketch below).
  2. For a linear SVM classifier, one option is to minimize the following (unconstrained) cost function with Gradient Descent: $$ J(w,b)=\frac{1}{2}w^T\cdot w + C\sum_{i=1}^{m}max(0,1-t^{(i)}(w^T\cdot x^{(i)}+b)) $$
  3. The hinge loss function is $max(0,\ 1-t)$; it equals 0 when $t \geq 1$.
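A minimal sketch (the mini-batch setup and hyperparameters are illustrative, not from the book) of incremental training with SGDClassifier and the hinge loss, feeding the moons data in mini-batches via partial_fit:

import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

X_stream, y_stream = make_moons(n_samples=1000, noise=0.15, random_state=42)
X_stream = StandardScaler().fit_transform(X_stream)

sgd = SGDClassifier(loss='hinge', alpha=0.001, random_state=42)
for start in range(0, len(X_stream), 100):                 # 100 instances per mini-batch
    X_batch = X_stream[start:start + 100]
    y_batch = y_stream[start:start + 100]
    sgd.partial_fit(X_batch, y_batch, classes=[0, 1])      # classes must be given on the first call
print(sgd.score(X_stream, y_stream))                       # training accuracy of the online model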