[Scikit-learn] 1.5 Generalized Linear Models - SGD for Regression

Gradient Descent

I. Implementing gradient descent by hand

The following is essentially a hands-on implementation of simple gradient descent.

These are practical notes on neural networks, mainly based on:

Link: http://peterroelants.github.io/posts/neural_network_implementation_part01/

Chinese translation: http://www.jianshu.com/p/0da9eb3fd06b

 

1. Generate the training data

The data are generated as "target function + random noise".

import numpy as np
import matplotlib.pyplot as plt

# Part 1: create the training data
# Define the vector of input samples as x, with 20 values sampled from a uniform
# distribution between 0 and 1
x = np.random.uniform(0, 1, 20)

# Define a function f that represents the line generating t without noise
def f(x):
    return x * 2

# Create the targets t with some gaussian noise, so the estimation won't be perfect
noise_variance = 0.2  # Variance of the gaussian noise
# x.shape[0] is the size of the first dimension of x, i.e. the number of samples,
# so one noise value is drawn for each target
noise = np.random.randn(x.shape[0]) * noise_variance
t = f(x) + noise

# Part 2: plot the training data
plt.plot(x, t, 'o', label='t')                      # Plot the targets t versus the inputs x
plt.plot([0, 1], [f(0), f(1)], 'b-', label='f(x)')  # Plot the noise-free line
plt.xlabel('$x$', fontsize=15)
plt.ylabel('$t$', fontsize=15)
plt.ylim([0, 2])
plt.title('inputs (x) vs targets (t)')
plt.grid()
plt.legend(loc=2)
plt.show()

 

 

2. The relationship between loss and weight

# Define the "neural network" model
def nn(x, w):
    return x * w

# Define the cost (loss) function
def cost(y, t):
    return ((t - y) ** 2).sum()

# Plot the cost versus the weight w.
# Define a vector of weights for which we want to plot the cost:
# np.linspace(start, stop, num) samples `num` evenly spaced points from `start` to `stop`.
ws = np.linspace(0, 4, num=100)  # weight values
cost_ws = np.vectorize(lambda w: cost(nn(x, w), t))(ws)  # cost for each weight in ws

# Plot
plt.plot(ws, cost_ws, 'r-')
plt.xlabel('$w$', fontsize=15)
plt.ylabel('$\\xi$', fontsize=15)
plt.title('cost vs. weight')
plt.grid()
plt.show()

 

 

3. Simulating the gradient descent updates

# Define the gradient function. Remember that y = nn(x, w) = x * w and
# cost = sum((t - x*w)**2), so d(cost)/dw = sum(2 * x * (x*w - t)).
def gradient(w, x, t):
    return 2 * x * (nn(x, w) - t)

# Define the update delta w
def delta_w(w_k, x, t, learning_rate):
    return learning_rate * gradient(w_k, x, t).sum()

w = 0.1              # Set the initial weight parameter
learning_rate = 0.1  # Set the learning rate

# Start performing the gradient descent updates, and print the weights and costs
nb_of_iterations = 4  # number of gradient descent updates
w_cost = [(w, cost(nn(x, w), t))]  # List to store the (weight, cost) values
for i in range(nb_of_iterations):
    dw = delta_w(w, x, t, learning_rate)   # Get the delta w update
    w = w - dw                             # Update the current weight parameter
    w_cost.append((w, cost(nn(x, w), t)))  # Add (weight, cost) to the list

# Print the weights and costs
for i in range(0, len(w_cost)):
    print('w({}): {:.4f} \t cost: {:.4f}'.format(i, w_cost[i][0], w_cost[i][1]))

# Plot the first gradient descent updates on the cost curve
plt.plot(ws, cost_ws, 'r-')  # Plot the cost curve
for i in range(0, len(w_cost) - 2):
    w1, c1 = w_cost[i]
    w2, c2 = w_cost[i + 1]
    plt.plot(w1, c1, 'bo')
    plt.plot([w1, w2], [c1, c2], 'b-')
    plt.text(w1, c1 + 0.5, '$w({})$'.format(i))
plt.xlabel('$w$', fontsize=15)
plt.ylabel('$\\xi$', fontsize=15)
plt.title('Gradient descent updates plotted on cost function')
plt.grid()
plt.show()

 

w(0): 0.1000 cost: 25.1338
w(1): 2.5774 cost: 2.7926 
w(2): 1.9036 cost: 1.1399 
w(3): 2.0869 cost: 1.0177 
w(4): 2.0370 cost: 1.0086

 

4. The fitted result

w = 0  # Restart from w = 0 and perform more gradient descent updates
nb_of_iterations = 10  # number of gradient descent updates
for i in range(nb_of_iterations):
    dw = delta_w(w, x, t, learning_rate)  # Get the delta w update
    w = w - dw                            # Update the current weight parameter

# Plot the fitted line against the target line
plt.plot(x, t, 'o', label='t')                               # Plot the targets t versus the inputs x
plt.plot([0, 1], [f(0), f(1)], 'b-', label='f(x)')           # Plot the noise-free target line
plt.plot([0, 1], [0 * w, 1 * w], 'r-', label='fitted line')  # Plot the fitted line
plt.xlabel('input x')
plt.ylabel('target t')
plt.ylim([0, 2])
plt.title('input vs. target')
plt.grid()
plt.legend(loc=2)
plt.show()

 

 

II. Wrapped into a concise API

The above are the implementation details; in scikit-learn they are wrapped into a concise API like the one below.

Ref: [Scikit-learn] 1.1. Generalized Linear Models - Neural network models

mlp = MLPClassifier(verbose=0, random_state=0, max_iter=max_iter, **param)
mlp.fit(X, y)
mlps.append(mlp)
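For the toy regression problem from Part 1, a minimal sketch of the same "concise API" idea with SGDRegressor might look as follows (this assumes the x and t arrays defined above are still in scope; scikit-learn expects a 2-D feature matrix, hence the reshape):

from sklearn.linear_model import SGDRegressor

X_col = x.reshape(-1, 1)                           # make x a (n_samples, 1) feature matrix
sgd = SGDRegressor(max_iter=1000, tol=1e-3, random_state=0)
sgd.fit(X_col, t)                                  # replaces the hand-written update loop above
print(sgd.coef_, sgd.intercept_)                   # coef_ should land close to the true slope 2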

 

 

 

Stochastic Gradient Descent

I. Introduction

Ref: 1.5. Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to discriminative learning of linear classifiers under convex loss functions such as (linear) Support Vector Machines and Logistic Regression.

    • Logistic Regression is the model;
    • SGD is the algorithm, i.e. "the solver for weight optimization".

SGD has been successfully applied to large-scale and sparse machine learning problems often encountered in text classification and natural language processing.

Given that the data is sparse, the classifiers in this module easily scale to problems with more than 10^5 training examples and more than 10^5 features.
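To make the model-versus-solver distinction concrete, here is a minimal sketch: a logistic regression model trained with the SGD solver via SGDClassifier. The dataset and parameter values are placeholders, and loss='log_loss' assumes a recent scikit-learn release (older versions spell it 'log').

from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification

# Placeholder data: 10,000 samples, 20 features
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

# Logistic regression (the model) fitted with stochastic gradient descent (the solver)
clf = SGDClassifier(loss='log_loss', alpha=1e-4, max_iter=1000, tol=1e-3, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))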

 

Key characteristics:

    • Loss function: convex
    • Data: large and sparse (high-dimensional)

 

The advantages of Stochastic Gradient Descent are:

  • Efficiency.
  • Ease of implementation (lots of opportunities for code tuning).

The disadvantages of Stochastic Gradient Descent include:

  • SGD requires a number of hyperparameters such as the regularization parameter and the number of iterations.
  • SGD is sensitive to feature scaling.
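Because SGD is sensitive to feature scaling, one common remedy (a minimal sketch, not the only option) is to chain a StandardScaler and the SGD estimator in a pipeline, so the scaling is learned from the training data only and re-applied automatically at prediction time:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor

# The scaler is fit on whatever data is passed to fit(), and the same
# transformation is applied inside predict().
model = make_pipeline(StandardScaler(),
                      SGDRegressor(max_iter=1000, tol=1e-3, random_state=0))
# model.fit(X_train, y_train)   # X_train / y_train are placeholders
# model.predict(X_test)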

 

 

II. The function interface

1.5. Stochastic Gradient Descent

 

The estimator

The SGDRegressor class implements a plain stochastic gradient descent learning routine which supports different loss functions and penalties to fit linear regression models.
    • SGDRegressor is well suited for regression problems with a large number of training samples (> 10,000).
    • For other problems we recommend Ridge, Lasso, or ElasticNet.

Parameters

The concrete loss function is selected via the loss parameter. SGDRegressor supports the following loss functions:

    • loss="squared_loss": Ordinary least squares,  // 能夠用於魯棒迴歸

    • loss="huber": Huber loss for robust regression,  // 能夠用於魯棒迴歸
    • loss="epsilon_insensitive": linear Support Vector Regression.  // insensitive區域的寬度能夠 經過參數 epsilon 指定,該參數由目標變量的規模來決定。

 

 

III. Model comparison

SGDRegressor supports averaged SGD (ASGD) just like SGDClassifier. Averaging can be enabled by setting `average=True`.

For regression with a squared loss and an L2 penalty, another SGD variant with an averaging strategy is available, the Stochastic Average Gradient (SAG) algorithm, provided as a solver in Ridge. # So instead of plugging into the closed-form formula, the parameters can also be estimated iteratively?
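A minimal sketch of both options mentioned above; `average=True` and `solver='sag'` are the relevant switches, the other values are placeholders:

from sklearn.linear_model import SGDRegressor, Ridge

# Averaged SGD: the final coefficients are an average of the SGD iterates
asgd = SGDRegressor(average=True, max_iter=1000, tol=1e-3, random_state=0)

# Ridge regression (squared loss + L2 penalty) solved with Stochastic Average Gradient
ridge_sag = Ridge(alpha=0.3, solver='sag', random_state=0)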

Here comes the question: how do the estimators in `from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet, SGDRegressor` compare?

The first four methods: LinearRegression, Lasso, Ridge, ElasticNet

from sklearn.cross_validation import KFold  # older scikit-learn API; newer releases use sklearn.model_selection
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet, SGDRegressor
import numpy as np
from sklearn.datasets import load_boston

boston = load_boston()
np.set_printoptions(precision=2, linewidth=120, suppress=True, edgeitems=4)

# In order to do multiple regression we need to add a column of 1s for x0
x = np.array([np.concatenate((v,[1])) for v in boston.data])
y = boston.target


a = 0.3
for name,met in [
        ('linear regression', LinearRegression()),
        ('lasso', Lasso(fit_intercept=True, alpha=a)),
        ('ridge', Ridge(fit_intercept=True, alpha=a)),
        ('elastic-net', ElasticNet(fit_intercept=True, alpha=a))
        ]:
    met.fit(x,y)
    # p = np.array([met.predict(xi) for xi in x])
    p = met.predict(x)
    e = p-y
    total_error = np.dot(e,e)
    rmse_train = np.sqrt(total_error/len(p))

    kf = KFold(len(x), n_folds=10)
    err = 0
    for train,test in kf:
        met.fit(x[train],y[train])
        p = met.predict(x[test])
        e = p-y[test]
        err += np.dot(e,e)

    rmse_10cv = np.sqrt(err/len(x))
    print('Method: %s' %name)
    print('RMSE on training: %.4f' %rmse_train)
    print('RMSE on 10-fold CV: %.4f' %rmse_10cv)
    print("\n")
Method: linear regression
RMSE on training: 4.6795
RMSE on 10-fold CV: 5.8819


Method: lasso
RMSE on training: 4.8570
RMSE on 10-fold CV: 5.7675


Method: ridge
RMSE on training: 4.6822
RMSE on 10-fold CV: 5.8535


Method: elastic-net
RMSE on training: 4.9072
RMSE on 10-fold CV: 5.4936

 

Stochastic Gradient Descent in practice

# SGD is very sensitive to varying-sized feature values, so first we need to do feature scaling.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(x)
x_s = scaler.transform(x)

sgdreg = SGDRegressor(penalty='l2', alpha=0.15, n_iter=200)
sgdreg.fit(x_s, y)
p = sgdreg.predict(x_s)
err = p - y
total_error = np.dot(err, err)
rmse_train = np.sqrt(total_error / len(p))

# Compute RMSE using 10-fold cross-validation
kf = KFold(len(x), n_folds=10)
xval_err = 0
for train, test in kf:
    scaler = StandardScaler()
    scaler.fit(x[train])  # Don't cheat - fit only on training data
    xtrain_s = scaler.transform(x[train])
    xtest_s = scaler.transform(x[test])  # apply same transformation to test data
    sgdreg.fit(xtrain_s, y[train])
    p = sgdreg.predict(xtest_s)
    e = p - y[test]
    xval_err += np.dot(e, e)
rmse_10cv = np.sqrt(xval_err / len(x))

method_name = 'Stochastic Gradient Descent Regression'
print('Method: %s' % method_name)
print('RMSE on training: %.4f' % rmse_train)
print('RMSE on 10-fold CV: %.4f' % rmse_10cv)
Method: Stochastic Gradient Descent Regression
RMSE on training: 4.8119
RMSE on 10-fold CV: 5.5741

 

End.
