When learning machine learning, many students skim the theory once and jump straight into coding, which is commendable in spirit. The problem is that they don't actually write the algorithms by hand; they go straight to calling packages like sklearn, and that is not a solid way to learn. I'm not saying calling packages is bad: in real work and research, well-encapsulated, easy-to-use packages bring enormous convenience and greatly improve the efficiency with which we implement machine learning models and algorithms. But that applies only when you are using them, not while you are learning them.
I believe many ambitious students are not satisfied with merely using these packages while knowing nothing of the details of the models and algorithms inside them. So if you are learning machine learning algorithms, it's best not to reach for the prepackaged libraries right away. Instead, work from your own understanding of each algorithm: derive the model's mathematics by hand, then implement it relying only on basic packages such as numpy and pandas. After going through that process once, go learn how to call libraries like sklearn; you'll appreciate the convenience (and fun) of calling packages all the more. After that, find datasets for hands-on practice, enter competitions, and build projects, and I believe you'll become an excellent machine learning engineer.
The two themes of this machine learning series are mathematical derivation plus pure numpy implementation. In this first installment we start with the most basic model: linear regression. I trust everyone is already quite familiar with regression, especially those of you with a statistics background, so let's go straight to the math.
I originally wanted to post my handwritten derivation notes, but the handwriting is too unruly, and typesetting the formulas in Word or markdown would take too long, so I'll borrow the derivation from Professor Zhou Zhihua's machine learning textbook:
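(The textbook pages themselves aren't reproduced here, so what follows is a compact sketch of the standard least-squares derivation in vectorized notation: fold the bias $b$ into $\hat{w}$ by appending a column of ones to $X$, so the model is $\hat{y} = X\hat{w}$.)

$$E_{\hat{w}} = (y - X\hat{w})^\top (y - X\hat{w})$$

Setting the gradient with respect to $\hat{w}$ to zero,

$$\frac{\partial E_{\hat{w}}}{\partial \hat{w}} = 2X^\top (X\hat{w} - y) = 0,$$

and when $X^\top X$ is full rank (invertible), the closed-form estimate is

$$\hat{w}^* = (X^\top X)^{-1} X^\top y.$$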
The above is the derivation of the parameter estimates in the linear regression model.
As usual, before writing the algorithm we should lay out a plan. The body of the regression model is fairly simple; the key is the gradient-descent parameter-update procedure once the MSE loss is specified. First we write the model itself, the loss function, and the parameter derivatives of that loss; then we initialize the parameters; finally we write the gradient-descent update loop. Of course, we can also add cross-validation to obtain more robust parameter estimates.
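Concretely, writing $\hat{y} = Xw + b$ for the predictions over $N$ training samples, the MSE loss and its gradients are:

$$L(w, b) = \frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2, \qquad \frac{\partial L}{\partial w} = \frac{2}{N}X^\top(\hat{y} - y), \qquad \frac{\partial L}{\partial b} = \frac{2}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)$$

The code below drops the constant factor of 2, which only rescales the learning rate. With that, on to the code.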
```python
import numpy as np

def linear_loss(X, y, w, b):
    num_train = X.shape[0]
    num_feature = X.shape[1]
    # Model: y_hat = Xw + b
    y_hat = np.dot(X, w) + b
    # MSE loss
    loss = np.sum((y_hat - y) ** 2) / num_train
    # Partial derivatives with respect to the parameters
    # (the constant factor of 2 is absorbed into the learning rate)
    dw = np.dot(X.T, (y_hat - y)) / num_train
    db = np.sum((y_hat - y)) / num_train
    return y_hat, loss, dw, db
```
Parameter initialization:
```python
def initialize_params(dims):
    w = np.zeros((dims, 1))
    b = 0
    return w, b
```
The training loop (note it calls the two functions defined above):

```python
def linear_train(X, y, learning_rate, epochs):
    w, b = initialize_params(X.shape[1])
    loss_list = []
    for i in range(1, epochs):
        # Current predictions, loss, and parameter gradients
        y_hat, loss, dw, db = linear_loss(X, y, w, b)
        loss_list.append(loss)
        # Gradient-descent parameter update
        w += -learning_rate * dw
        b += -learning_rate * db
        # Print the iteration count and loss
        if i % 10000 == 0:
            print('epoch %d loss %f' % (i, loss))
    # Save the parameters
    params = {
        'w': w,
        'b': b
    }
    # Save the gradients
    grads = {
        'dw': dw,
        'db': db
    }
    return loss_list, loss, params, grads
```
That completes the basic implementation of the linear regression model. Below we run a simple training experiment on the diabetes dataset from sklearn.
```python
from sklearn.datasets import load_diabetes
from sklearn.utils import shuffle

diabetes = load_diabetes()
data = diabetes.data
target = diabetes.target

# Shuffle the data
X, y = shuffle(data, target, random_state=13)
X = X.astype(np.float32)

# Simple train/test split
offset = int(X.shape[0] * 0.9)
X_train, y_train = X[:offset], y[:offset]
X_test, y_test = X[offset:], y[offset:]
y_train = y_train.reshape((-1, 1))
y_test = y_test.reshape((-1, 1))

print('X_train=', X_train.shape)
print('X_test=', X_test.shape)
print('y_train=', y_train.shape)
print('y_test=', y_test.shape)
```
```python
loss_list, loss, params, grads = linear_train(X_train, y_train, 0.001, 100000)
print(params)
```
Next we define a prediction function to generate predictions on the test set:
```python
def predict(X, params):
    w = params['w']
    b = params['b']
    y_pred = np.dot(X, w) + b
    return y_pred

y_pred = predict(X_test, params)
y_pred[:5]
```
```python
import matplotlib.pyplot as plt

# Plot the fitted line over the scattered true test values
f = X_test.dot(params['w']) + params['b']
plt.scatter(range(X_test.shape[0]), y_test)
plt.plot(f, color='darkorange')
plt.xlabel('X')
plt.ylabel('y')
plt.show()
```
As the plot shows, a model fit on all of the variables doesn't fit this data very well: partly because of the distribution of the data itself, and partly because a simple linear model has limited capacity for it. Of course, we're only demonstrating the basic workflow of linear regression here, so don't read too much into the performance.
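To put a rough number on "not very well", we can compute the test-set MSE and R² with numpy and, as a sanity check, compare against sklearn's closed-form LinearRegression. This is a minimal sketch under the split above (the gradient-descent fit won't exactly match sklearn's least-squares solution):

```python
from sklearn.linear_model import LinearRegression

# Test-set MSE and R^2 for the numpy gradient-descent model
mse = np.mean((y_pred - y_test) ** 2)
r2 = 1 - np.sum((y_test - y_pred) ** 2) / np.sum((y_test - np.mean(y_test)) ** 2)
print('numpy model: mse = %.2f, r2 = %.4f' % (mse, r2))

# Sanity check: sklearn's exact least-squares solution on the same split
reg = LinearRegression()
reg.fit(X_train, y_train)
sk_mse = np.mean((reg.predict(X_test) - y_test) ** 2)
print('sklearn model: mse = %.2f' % sk_mse)
```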
We can also plot the training loss to check that gradient descent converged:

```python
plt.plot(loss_list, color='blue')
plt.xlabel('epochs')
plt.ylabel('loss')
plt.show()
```
Finally, I wrap the steps above in a simple class, adding a hand-rolled cross-validation procedure for training:
```python
import numpy as np
from sklearn.utils import shuffle
from sklearn.datasets import load_diabetes


class lr_model():
    def __init__(self):
        pass

    def prepare_data(self):
        data = load_diabetes().data
        target = load_diabetes().target
        X, y = shuffle(data, target, random_state=42)
        X = X.astype(np.float32)
        y = y.reshape((-1, 1))
        data = np.concatenate((X, y), axis=1)
        return data

    def initialize_params(self, dims):
        w = np.zeros((dims, 1))
        b = 0
        return w, b

    def linear_loss(self, X, y, w, b):
        num_train = X.shape[0]
        num_feature = X.shape[1]
        y_hat = np.dot(X, w) + b
        loss = np.sum((y_hat - y) ** 2) / num_train
        dw = np.dot(X.T, (y_hat - y)) / num_train
        db = np.sum((y_hat - y)) / num_train
        return y_hat, loss, dw, db

    def linear_train(self, X, y, learning_rate, epochs):
        w, b = self.initialize_params(X.shape[1])
        for i in range(1, epochs):
            y_hat, loss, dw, db = self.linear_loss(X, y, w, b)
            w += -learning_rate * dw
            b += -learning_rate * db
            if i % 10000 == 0:
                print('epoch %d loss %f' % (i, loss))
        params = {
            'w': w,
            'b': b
        }
        grads = {
            'dw': dw,
            'db': db
        }
        return loss, params, grads

    def predict(self, X, params):
        w = params['w']
        b = params['b']
        y_pred = np.dot(X, w) + b
        return y_pred

    def linear_cross_validation(self, data, k, randomize=True):
        if randomize:
            data = list(data)
            # sklearn's shuffle returns a shuffled copy, so reassign
            data = shuffle(data)
        slices = [data[i::k] for i in range(k)]
        for i in range(k):
            validation = slices[i]
            train = [row for s in slices if s is not validation for row in s]
            train = np.array(train)
            validation = np.array(validation)
            yield train, validation


if __name__ == '__main__':
    lr = lr_model()
    data = lr.prepare_data()
    loss5 = []  # collect the final training loss of each fold
    for train, validation in lr.linear_cross_validation(data, 5):
        X_train = train[:, :10]
        y_train = train[:, -1].reshape((-1, 1))
        X_valid = validation[:, :10]
        y_valid = validation[:, -1].reshape((-1, 1))
        loss, params, grads = lr.linear_train(X_train, y_train, 0.001, 100000)
        loss5.append(loss)
        y_pred = lr.predict(X_valid, params)
        valid_score = np.sum((y_pred - y_valid) ** 2) / len(X_valid)
        print('valid score is', valid_score)
    score = np.mean(loss5)
    print('five fold cross validation score is', score)
```
That's all for this installment: a simple linear regression model implemented by hand with numpy.
Reference: Zhou Zhihua (周志華), Machine Learning.