Machine learning ultimately comes down to finding the optimum of a model's objective function. Because the objective is often not strictly convex, a poor optimization method may fail to reach an extremum at all, stop at a local extremum, or even diverge, so choosing a good optimizer is crucial. We start with the most conventional method, gradient descent. None of the algorithms introduced below is tied to a single model; they can be applied to any differentiable objective.

### Gradient Descent

Gradient descent is commonly used for logistic regression and linear regression. Take logistic regression: setting its gradient to zero yields no closed-form solution, so an iterative method is required. With labels $y^i \in \{0, 1\}$ and $z^i = \theta_1 x_1^i + \theta_2 x_2^i + \theta_3$, the probability of classifying sample $i$ correctly is

$$P(y^i \mid x^i) = \sigma(z^i)^{y^i}\,\big(1-\sigma(z^i)\big)^{1-y^i}, \qquad \sigma(z) = \frac{1}{1+e^{-z}}$$

Maximizing the likelihood over all samples and taking the negative logarithm simplifies to the cross-entropy loss

$$L(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^i\log\sigma(z^i) + (1-y^i)\log\big(1-\sigma(z^i)\big)\Big]$$

This is the target function; the task is to minimize it. Differentiating it gives the gradient at the current point, and the parameters are updated along the negative gradient direction, which is why the gradient is subtracted:

$$\theta_j \leftarrow \theta_j - \alpha\,\frac{\partial L}{\partial \theta_j}$$
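The loop below is a minimal, self-contained sketch of that update for this two-feature logistic regression; the function name, learning rate and iteration count are illustrative choices, not taken from the original code.

```python
import numpy as np

def gradient_descent_logreg(x1, x2, y, lr=0.1, iters=1000):
    # theta1, theta2 are the feature weights, theta3 is the bias
    theta1, theta2, theta3 = 0.0, 0.0, 0.0
    for _ in range(iters):
        z = theta1 * x1 + theta2 * x2 + theta3
        p = 1.0 / (1.0 + np.exp(-z))        # sigmoid
        # gradient of the mean cross-entropy loss w.r.t. each parameter
        g1 = np.mean((p - y) * x1)
        g2 = np.mean((p - y) * x2)
        g3 = np.mean(p - y)
        # step against the gradient
        theta1 -= lr * g1
        theta2 -= lr * g2
        theta3 -= lr * g3
    return theta1, theta2, theta3
```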
import numpy as np

def sigmoid(x1, x2, theta1, theta2, theta3):
    # logistic function of the linear combination of the two features plus the bias
    z = (theta1 * x1 + theta2 * x2 + theta3).astype("float_")
    return 1.0 / (1.0 + np.exp(-z))

'''cross entropy loss function'''
def cost(x1, x2, y, theta_1, theta_2, theta_3):
    sigmoid_probs = sigmoid(x1, x2, theta_1, theta_2, theta_3)
    return -np.mean(y * np.log(sigmoid_probs)
                    + (1 - y) * np.log(1 - sigmoid_probs))
Those are the sigmoid and cost functions. Instead of plain gradient descent, Newton's method is used here to fit the logistic regression.
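For reference, Newton's method replaces the fixed learning rate with the inverse Hessian:

$$\theta_{k+1} = \theta_k - H_k^{-1}\,\nabla L(\theta_k)$$

Note that the `gradient` function below returns $\sum_i (y^i - \sigma(z^i))\,x^i$, i.e. the negative of the loss gradient (up to the $1/m$ factor), which is why the code later adds the computed step rather than subtracting it.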
def gradient(x_1, x_2, y, theta_1, theta_2, theta_3):
    sigmoid_probs = sigmoid(x_1, x_2, theta_1, theta_2, theta_3)
    return np.array([np.sum((y - sigmoid_probs) * x_1),
                     np.sum((y - sigmoid_probs) * x_2),
                     np.sum(y - sigmoid_probs)])
These are the first derivatives: the data is two-dimensional, and with the bias term added there are three partial derivatives in total.
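Written out, the three components computed above are

$$\frac{\partial \ell}{\partial \theta_1} = \sum_{i}\big(y^i-\sigma(z^i)\big)x_1^i,\qquad \frac{\partial \ell}{\partial \theta_2} = \sum_{i}\big(y^i-\sigma(z^i)\big)x_2^i,\qquad \frac{\partial \ell}{\partial \theta_3} = \sum_{i}\big(y^i-\sigma(z^i)\big)$$

i.e. the gradient of the log-likelihood, which is the negative of the cross-entropy gradient.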
'''calculate the Hessian matrix'''
def Hassion(x1, x2, y, theta1, theta2, theta3):
    sigmoid_probs = sigmoid(x1, x2, theta1, theta2, theta3)
    # entries of X^T diag(p(1-p)) X for the design matrix X = [x1, x2, 1]
    d1 = np.sum((sigmoid_probs * (1 - sigmoid_probs)) * x1 * x1)
    d2 = np.sum((sigmoid_probs * (1 - sigmoid_probs)) * x1 * x2)
    d3 = np.sum((sigmoid_probs * (1 - sigmoid_probs)) * x1)
    d4 = np.sum((sigmoid_probs * (1 - sigmoid_probs)) * x2 * x1)
    d5 = np.sum((sigmoid_probs * (1 - sigmoid_probs)) * x2 * x2)
    d6 = np.sum((sigmoid_probs * (1 - sigmoid_probs)) * x2)
    d7 = np.sum((sigmoid_probs * (1 - sigmoid_probs)) * x1)
    d8 = np.sum((sigmoid_probs * (1 - sigmoid_probs)) * x2)
    d9 = np.sum((sigmoid_probs * (1 - sigmoid_probs)))
    H = np.array([[d1, d2, d3], [d4, d5, d6], [d7, d8, d9]])
    return H
This is a rather clumsy way to do it: the nine entries are written out and computed one by one (d3/d7 and d6/d8 are duplicates, since the Hessian is symmetric).
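Since $H = X^T\,\mathrm{diag}\big(\sigma(z)(1-\sigma(z))\big)\,X$ with $X = [x_1, x_2, 1]$, the same matrix can also be built in one shot. The helper below is a vectorized sketch; the name `hessian_vectorized` is mine, not part of the original code.

```python
import numpy as np

def hessian_vectorized(x1, x2, theta1, theta2, theta3):
    # stack the two features and a bias column into the design matrix X
    X = np.column_stack([x1, x2, np.ones_like(x1)])
    z = (theta1 * x1 + theta2 * x2 + theta3).astype("float_")
    p = 1.0 / (1.0 + np.exp(-z))
    w = p * (1 - p)                   # per-sample weights sigma * (1 - sigma)
    return X.T @ (w[:, None] * X)     # X^T diag(w) X
```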
'''update parameter by newton method'''
def newton_method(x1, x2, y):
    # initialize parameters
    theta1 = 0.001
    theta2 = -0.4
    theta3 = 0.6
    delta_loss = np.inf
    loss = cost(x1, x2, y, theta1, theta2, theta3)
    kxi = 0.0000001          # stop once the change in loss falls below this threshold
    max_iteration = 10000
    i = 0
    print('while i = ', i, ' Loss = ', loss, 'delta_loss = ', delta_loss)
    while np.abs(delta_loss) > kxi and i < max_iteration:
        i += 1
        g = gradient(x1, x2, y, theta1, theta2, theta3)
        hassion = Hassion(x1, x2, y, theta1, theta2, theta3)
        H_inverse = np.linalg.inv(hassion)
        delta = H_inverse @ g.T          # Newton step H^{-1} g
        delta_theta_1 = delta[0]
        delta_theta_2 = delta[1]
        delta_theta_3 = delta[2]
        theta1 += delta_theta_1
        theta2 += delta_theta_2
        theta3 += delta_theta_3
        loss_new = cost(x1, x2, y, theta1, theta2, theta3)
        delta_loss = loss - loss_new
        loss = loss_new
        print('while i = ', i, ' Loss = ', loss, 'delta_loss = ', delta_loss)
    return np.array([theta1, theta2, theta3])
The function above runs the Newton iterations until the change in loss drops below the threshold or the iteration limit is hit.
def get_param_target():
    np.random.seed(12)
    num_observations = 5000
    # two Gaussian clusters, one per class
    x1 = np.random.multivariate_normal([0, 0], [[1, .75], [.75, 1]], num_observations)
    x2 = np.random.multivariate_normal([1, 4], [[1, .75], [.75, 1]], num_observations)
    simulated_separableish_features = np.vstack((x1, x2)).astype(np.float32)
    simulated_labels = np.hstack((np.zeros(num_observations),
                                  np.ones(num_observations)))
    return (simulated_separableish_features[:, 0],
            simulated_separableish_features[:, 1],
            simulated_labels)
This generates the data: two Gaussian clusters labelled 0 and 1.
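Putting the pieces together, a run could look like the following; this driver is illustrative and assumes the functions above live in the same module.

```python
x1, x2, y = get_param_target()
theta = newton_method(x1, x2, y)
print('fitted parameters:', theta)
```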
### BFGS

For linear regression, the first derivative of the loss is

$$\frac{\partial\,loss}{\partial w_j} = -\sum_{i=1}^{m}\Big(y^i - \sum_{k=0}^{n-1} w_k x_k^i\Big)x_j^i$$

and the second derivative is

$$\frac{\partial^2 loss}{\partial w_j\,\partial w_k} = \sum_{i=1}^{m} x_j^i x_k^i$$

These correspond one to one with the terms computed in the code below (which additionally adds an L2 penalty weighted by `lam`). The Armijo line search uses only the first of the Wolfe conditions, the sufficient-decrease condition

$$f(x_k+\alpha p_k) \le f(x_k) + c_1\,\alpha\,\nabla f_k^{T} p_k$$

and the backtracking loop can stop as soon as it is satisfied.
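For completeness, the Hessian approximation $B_k$ is refreshed with the standard BFGS update, which is exactly the line applied inside the loop below:

$$B_{k+1} = B_k - \frac{B_k s_k s_k^{T} B_k}{s_k^{T} B_k s_k} + \frac{y_k y_k^{T}}{y_k^{T} s_k}, \qquad s_k = w_{k+1}-w_k,\quad y_k = \nabla f(w_{k+1})-\nabla f(w_k)$$

The update is skipped unless $y_k^{T}s_k > 0$, which keeps $B_k$ positive definite.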
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from DataProcesser import DataProcesser   # helper from the author's repo

def bfgs(feature, label, lam, maxCycle):
    n = np.shape(feature)[1]
    w0 = np.mat(np.zeros((n, 1)))
    rho = 0.55           # backtracking shrink factor for the Armijo search
    sigma = 0.4          # Armijo constant c1
    Bk = np.eye(n)       # initial Hessian approximation
    k = 1
    while (k < maxCycle):
        print('Iterator: ', k, ' error: ', get_error(feature, label, w0))
        gk = get_gradient(feature, label, w0, lam)
        dk = np.mat(-np.linalg.solve(Bk, gk))    # search direction: solve Bk * dk = -gk
        # Armijo backtracking line search for the step length rho**m
        m = 0
        mk = 0
        while (m < 20):
            newf = get_result(feature, label, (w0 + rho**m * dk), lam)
            oldf = get_result(feature, label, w0, lam)
            if (newf < oldf + sigma * (rho ** m) * (gk.T * dk)[0, 0]):
                mk = m
                break
            m += 1
        # BFGS update of the Hessian approximation
        w = w0 + rho**mk * dk
        sk = w - w0
        yk = get_gradient(feature, label, w, lam) - gk
        if (yk.T * sk > 0):   # curvature condition keeps Bk positive definite
            Bk = Bk - (Bk * sk * sk.T * Bk) / (sk.T * Bk * sk) + (yk * yk.T) / (yk.T * sk)
        k = k + 1
        w0 = w
    return w0

def get_error(feature, label, w):
    # squared error (without the regularisation term), used for logging
    return (label - feature * w).T * (label - feature * w) / 2

def get_gradient(feature, label, w, lam):
    # gradient of the regularised squared error
    err = (label - feature * w).T
    left = err * (-1) * feature
    return left.T + lam * w

def get_result(feature, label, w, lam):
    # objective value: squared error plus L2 regularisation
    left = (label - feature * w).T * (label - feature * w)
    right = lam * w.T * w
    return (left + right) / 2
The essential part is just that update formula; the rest needs little explanation.
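A quick way to exercise `bfgs` is to fit it on synthetic data; everything in this driver (the seed, the sample size, `lam`, `maxCycle` and the true weights) is an illustrative placeholder, not taken from the original DataProcesser.

```python
import numpy as np

np.random.seed(0)
m = 200
X = np.mat(np.column_stack([np.random.randn(m, 2), np.ones(m)]))  # two features plus a bias column
true_w = np.mat([[1.5], [-2.0], [0.5]])
y = X * true_w + 0.1 * np.mat(np.random.randn(m, 1))              # noisy linear targets
w = bfgs(X, y, lam=0.01, maxCycle=50)
print(w)
```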
### L-BFGS

L-BFGS also exists to save something, but this time it is memory. Storing a full high-dimensional matrix at every step costs far too much space, so the whole point of L-BFGS is to cut that memory overhead. It no longer stores the entire inverse-Hessian approximation; instead it approximates the BFGS recursion using only the vector pairs $(s_i, y_i)$ that the computation needs. It does not even keep all of those pairs, only the last m, i.e. the m most recent ones. This brings the space complexity down from $O(n^2)$ to $O(mn)$.

L-BFGS is, unsurprisingly, derived directly from BFGS. Start from the inverse-Hessian update formula:

$$H_{k+1}^{-1}=\Big(I-\frac{s_ky_k^T}{y_k^Ts_k}\Big)H_k^{-1}\Big(I-\frac{y_ks_k^T}{y_k^Ts_k}\Big)+\frac{s_ks_k^T}{y_k^Ts_k}$$

Let $v_k = I-\frac{y_ks_k^T}{y_k^Ts_k}$ and $p_k = \frac{1}{y_k^Ts_k}$, which turns the update into

$$H_{k+1}^{-1} = v_k^TH_k^{-1}v_k+p_ks_ks_k^T$$

Writing $D_k = H_k^{-1}$ and unrolling this recursion gives, in general:
$$D_{k+1} = (v_k^Tv_{k-1}^T\cdots v_1^Tv_0^T)D_0(v_0v_1\cdots v_{k-1}v_k)\\ +(v_k^Tv_{k-1}^T\cdots v_2^Tv_1^T)(p_0s_0s_0^T)(v_1v_2\cdots v_{k-1}v_k)\\ +(v_k^Tv_{k-1}^T\cdots v_3^Tv_2^T)(p_1s_1s_1^T)(v_2v_3\cdots v_{k-1}v_k)\\ +\cdots\\ +(v_k^Tv_{k-1}^T)(p_{k-2}s_{k-2}s_{k-2}^T)(v_{k-1}v_{k})\\ +v_k^T(p_{k-1}s_{k-1}s_{k-1}^T)v_k\\ +p_ks_ks_k^T$$

As the expansion shows, the current inverse Hessian is determined entirely by $D_0$ and the stored pairs $\{s_i, y_i\}$, so it is enough to keep those pairs. As mentioned above, to compress memory only the most recent pairs are stored, which turns the formula into

$$D_{k+1} = (v_k^Tv_{k-1}^T\cdots v_{k-m+1}^Tv_{k-m}^T)D_0(v_{k-m}v_{k-m+1}\cdots v_{k-1}v_k)\\ +(v_k^Tv_{k-1}^T\cdots v_{k-m+2}^Tv_{k-m+1}^T)(p_{k-m}s_{k-m}s_{k-m}^T)(v_{k-m+1}v_{k-m+2}\cdots v_{k-1}v_k)\\ +(v_k^Tv_{k-1}^T\cdots v_{k-m+3}^Tv_{k-m+2}^T)(p_{k-m+1}s_{k-m+1}s_{k-m+1}^T)(v_{k-m+2}v_{k-m+3}\cdots v_{k-1}v_k)\\ +\cdots\\ +(v_k^Tv_{k-1}^T)(p_{k-2}s_{k-2}s_{k-2}^T)(v_{k-1}v_{k})\\ +v_k^T(p_{k-1}s_{k-1}s_{k-1}^T)v_k\\ +p_ks_ks_k^T$$

The formula is long, but there is a well-known procedure that evaluates the product $D_{k+1}\nabla f$ efficiently, the two-loop recursion:
q = ∇f(x_k)
for i = newest stored pair, ..., oldest stored pair:
    alpha_i = p_i * (s_i^T q)
    q = q - alpha_i * y_i
end for
r = D_0 * q
for i = oldest stored pair, ..., newest stored pair:
    beta = p_i * (y_i^T r)
    r = r + (alpha_i - beta) * s_i
end for
return r    # r = H_k^{-1} ∇f(x_k)
Does this recursion really reproduce the formula above? Substituting the definitions of $v_i$ and $p_i$ back in and expanding the two loops gives exactly the truncated expression for $D_{k+1}\nabla f$, so it does. The implementation:
def lbfgs(feature, label, lam, maxCycle, m = 10):
    n = np.shape(feature)[1]
    w0 = np.mat(np.zeros((n, 1)))
    rho = 0.55                 # backtracking shrink factor
    sigma = 0.4                # Armijo constant
    H0 = np.eye(n)             # initial inverse-Hessian approximation D_0
    s = []                     # stored s_i = w_{i+1} - w_i
    y = []                     # stored y_i = g_{i+1} - g_i
    k = 1
    gk = get_gradient(feature, label, w0, lam)
    dk = -H0 * gk
    while (k < maxCycle):
        print('iter: ', k, ' error:', get_error(feature, label, w0))
        # Armijo backtracking line search along dk
        m1 = 0
        mk = 0
        gk = get_gradient(feature, label, w0, lam)
        while (m1 < 20):
            newf = get_result(feature, label, (w0 + rho ** m1 * dk), lam)
            oldf = get_result(feature, label, w0, lam)
            if newf < oldf + sigma * (rho ** m1) * (gk.T * dk)[0, 0]:
                mk = m1
                break
            m1 = m1 + 1
        w = w0 + rho ** mk * dk
        # keep only the most recent m pairs
        if k > m:
            s.pop(0)
            y.pop(0)
        sk = w - w0
        qk = get_gradient(feature, label, w, lam)
        yk = qk - gk
        s.append(sk)
        y.append(yk)
        # two-loop recursion: first loop runs from the newest pair to the oldest
        t = len(s)
        a = []
        for i in range(t):
            alpha = (s[t - i - 1].T * qk) / (y[t - i - 1].T * s[t - i - 1])
            qk = qk - alpha[0, 0] * y[t - i - 1]
            a.append(alpha[0, 0])
        # second loop runs from the oldest pair back to the newest
        r = H0 * qk
        for i in range(t):
            beta = (y[i].T * r) / (y[i].T * s[i])
            r = r + s[i] * (a[t - i - 1] - beta[0, 0])
        # only accept the new direction if the curvature condition holds
        if yk.T * sk > 0:
            dk = -r
        k = k + 1
        w0 = w
    return w0
Note the extra check in the middle: the new search direction is only accepted when the curvature condition $y_k^Ts_k > 0$ holds, which keeps the implicit Hessian approximation positive definite and guarantees that $-r$ is a descent direction.
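With the same illustrative synthetic setup used for `bfgs` above, a call would look like this (the hyper-parameters are again placeholder values):

```python
w = lbfgs(X, y, lam=0.01, maxCycle=50, m=10)
print(w)
```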
#### GitHub code: github.com/GreenArrow2…