Linear Regression

Date: 2019-12-13

Tags: linear regression, regression. Category: applied mathematics


Introduction

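When a regression analysis involves only two variables, it is called simple (one-variable) regression analysis. Its main task is to estimate one of two related variables from the other. The variable being estimated is called the dependent variable, denoted Y; the variable used to make the estimate is called the independent variable, denoted X.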

Regression analysis looks for a mathematical model Y=f(X), so that Y can be computed from X with a single function.

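When Y=f(X) takes the form of a straight line, this is called simple linear regression. The equation is usually written as Y=A+BX. The constant term A and the regression coefficient B can be determined from sample data by the method of least squares or by other methods.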
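As a minimal sketch of the least-squares step, the snippet below computes A and B for Y=A+BX from the standard closed-form formulas. The function name `fit_line` and the sample points are made up for illustration and are not part of the original article.

```python
import numpy as np

def fit_line(x, y):
    """Return (A, B) minimising sum((y - (A + B*x))**2). Hypothetical helper."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # B = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
    B = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # The least-squares line always passes through (x_mean, y_mean)
    A = y_mean - B * x_mean
    return A, B

# Illustrative points lying exactly on Y = 1 + 2X
A, B = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(A, B)
```

Because the sample points lie exactly on a line, the fit should recover A close to 1 and B close to 2.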

The linear regression equation

Target: the variable we are trying to predict, i.e. the target variable

Input: the input variable

Slope: the slope

Intercept: the intercept
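Assuming these four terms play the roles of Y, X, B and A in the form Y=A+BX given in the introduction, they would combine as follows (a reconstruction from the definitions, not a formula quoted from the original):

```latex
\text{Target} = \text{Slope} \times \text{Input} + \text{Intercept}
```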

For example, suppose a company has a table of its advertising spend and sales for each month.

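Plotting advertising spend against sales on a two-dimensional coordinate plane gives a scatter plot. To explore the relationship between the two, we can use simple linear regression to draw a fitted line through the points.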

With this fitted line, we can roughly estimate the sales that any given advertising spend would produce.
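The idea can be sketched in a few lines with `numpy.polyfit`. The monthly figures below are invented purely for illustration (they are not the article's data), and `predict_sales` is a hypothetical helper name.

```python
import numpy as np

# Made-up monthly figures for illustration only
ad_spend = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
sales = np.array([25.0, 45.0, 65.0, 85.0, 105.0])  # exactly 2 * ad_spend + 5

# With deg=1, np.polyfit returns [slope, intercept] of the least-squares line
slope, intercept = np.polyfit(ad_spend, sales, deg=1)

def predict_sales(spend):
    """Estimate sales for a given advertising spend from the fitted line."""
    return slope * spend + intercept

print(predict_sales(35.0))
```

Since the invented data follow sales = 2 * ad_spend + 5 exactly, the estimate for a spend of 35 should come out very close to 75.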

Evaluating how well the regression line fits

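The fitted line is only an approximation, since many points will certainly not lie on it. So how well does the line fit, or in other words, how accurately does it represent the scattered points? Statistics has a measure called R^2 (the coefficient of determination, also known as goodness of fit) that is used to judge how well a regression equation fits.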

To compute R^2, we first need the following quantities:

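Total sum of squares (SST, Sum of Squares for Total, also called the total sum of squared deviations): the sum of squared differences between each actual value of the dependent variable (the Y of each given point) and the mean of the dependent variable (the average of the Y values of all given points). It reflects the overall variation in the dependent variable.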

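Regression sum of squares (SSR, Sum of Squares for Regression): the sum of squared differences between the regression values of the dependent variable (the Y values on the line) and the mean (the average Y of the given points). It is the variation in y caused by variation in the independent variable x: the part of the total deviation of y that is due to the linear relationship between x and y, and that the regression line can explain.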

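Residual sum of squares (SSE, Sum of Squares for Error, also called the error sum of squares): the sum of squared differences between each actual observed value of the dependent variable (the Y of the given points) and its regression value (the Y on the regression line). It is the effect on y of all factors other than the linear influence of x, and it cannot be explained by the regression line.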

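SST (total deviation) = SSR (deviation the regression line can explain) + SSE (deviation the regression line cannot explain). This decomposition holds for a least-squares line fitted with an intercept.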
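Written in symbols, the three definitions read as follows. The notation is an assumption introduced here: y_i are the observed values, \hat{y}_i the corresponding values on the regression line, \bar{y} the mean of the observed values, and n the number of points.

```latex
\begin{aligned}
\mathrm{SST} &= \sum_{i=1}^{n} (y_i - \bar{y})^2 \\
\mathrm{SSR} &= \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \\
\mathrm{SSE} &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\end{aligned}
```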

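How well the regression line fits comes down to how much of the variation in Y this line (that is, the linear relationship between X and Y) can reflect, or explain. Define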

R^2=SSR/SST or R^2=1-SSE/SST

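R^2 lies between 0 and 1 when it is computed on the data the least-squares line was fitted to; the closer it is to 1, the better the fit.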
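The two expressions for R^2 can be checked numerically. The sketch below uses invented data and fits the line with `numpy.polyfit`; the variable names are illustrative assumptions, not taken from the article.

```python
import numpy as np

# Illustrative data, not from the article
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares line (with intercept) and its values at the sample points
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept
y_mean = y.mean()

sst = np.sum((y - y_mean) ** 2)      # total sum of squares
ssr = np.sum((y_hat - y_mean) ** 2)  # regression sum of squares
sse = np.sum((y - y_hat) ** 2)       # residual sum of squares

r2_from_ssr = ssr / sst
r2_from_sse = 1 - sse / sst
print(r2_from_ssr, r2_from_sse)
```

For a least-squares line evaluated on its own fitting data, the two values should agree up to floating-point rounding, and SSR + SSE should equal SST.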

Code implementation

Environment: macOS Mojave 10.14.3

Python 3.7.0

Library: scikit-learn 0.19.2

Official documentation for sklearn.linear_model.LinearRegression: https://scikit-learn.org/stable/modules/linear_model.html

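>>> from sklearn import linear_model
>>> reg = linear_model.LinearRegression()
>>> # fit(X, y): X holds one row per sample (two features each here), y holds the targets
>>> reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                 normalize=False)
>>> reg.coef_  # one coefficient per feature; the intercept is stored separately in reg.intercept_
array([0.5, 0.5])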
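For the single-variable case discussed in this article, each sample has just one feature, so `coef_` holds a single slope and `intercept_` holds the intercept. A minimal sketch with made-up points on y = 2x + 1 (the data are an assumption for illustration):

```python
from sklearn import linear_model

# Illustrative points on y = 2x + 1; each inner list is one sample with one feature
X = [[0], [1], [2], [3]]
y = [1, 3, 5, 7]

reg = linear_model.LinearRegression()
reg.fit(X, y)

slope = reg.coef_[0]        # coef_ has one entry per feature
intercept = reg.intercept_  # the intercept is stored separately from coef_
print(slope, intercept)
```

The printed slope and intercept should be close to 2 and 1 respectively.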

As an example, take age and net worth.

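In the plot, the blue points are the training data, used to compute the fitted line; the red points are the test data, used to measure how well the line fits.

Both sets come from the same sample; they were simply split apart at random.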

Main.py: main program and plotting

import numpy
import matplotlib
matplotlib.use('agg')

import matplotlib.pyplot as plt
from studentRegression import studentReg
from class_vis import prettyPicture

from ages_net_worths import ageNetWorthData

ages_train, ages_test, net_worths_train, net_worths_test = ageNetWorthData()



reg = studentReg(ages_train, net_worths_train)


plt.clf()
plt.scatter(ages_train, net_worths_train, color="b", label="train data")
plt.scatter(ages_test, net_worths_test, color="r", label="test data")
plt.plot(ages_test, reg.predict(ages_test), color="black")
plt.legend(loc=2)
plt.xlabel("ages")
plt.ylabel("net worths")

print("katie's net worth prediction: ", reg.predict([[27]]))  # prediction for age 27; predict expects a 2D array
print("r-squared score:", reg.score(ages_test, net_worths_test))
print("slope:", reg.coef_)  # get the slope
print("intercept:", reg.intercept_)  # get the intercept

plt.savefig("test.png")

print("\n ######## stats on test dataset ########\n")
print("r-squared score: ", reg.score(ages_test, net_worths_test))  # scoring on the test set can reveal problems such as overfitting

print("\n ######## stats on training dataset ########\n")
print("r-squared score: ", reg.score(ages_train, net_worths_train))

plt.scatter(ages_train,net_worths_train)
plt.plot(ages_train,reg.predict(ages_train),color='blue',linewidth=3)
plt.xlabel('ages_train')
plt.ylabel('net_worths_train')
plt.show()

class_vis.py: plotting and saving the image

import numpy as np
import matplotlib.pyplot as plt
import pylab as pl

def prettyPicture(clf, X_test, y_test):
    x_min = 0.0; x_max = 1.0
    y_min = 0.0; y_max = 1.0
    
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    h = .01  # step size in the mesh
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())

    plt.pcolormesh(xx, yy, Z, cmap=pl.cm.seismic)

    # Plot also the test points
    grade_sig = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii]==0]
    bumpy_sig = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii]==0]
    grade_bkg = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii]==1]
    bumpy_bkg = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii]==1]

    plt.scatter(grade_sig, bumpy_sig, color = "b", )
    plt.scatter(grade_bkg, bumpy_bkg, color = "r",)
    plt.legend()
    plt.xlabel("bumpiness")
    plt.ylabel("grade")

    plt.savefig("test.png")

ages_net_worths.py: sample data points

import numpy
import random

def ageNetWorthData():

    random.seed(42)
    numpy.random.seed(42)

    ages = []
    for ii in range(100):
        ages.append( random.randint(20,65) )
    net_worths = [ii * 6.25 + numpy.random.normal(scale=40.) for ii in ages]
    # Reshape the lists into 2D numpy arrays so that they work with LinearRegression
    ages       = numpy.reshape( numpy.array(ages), (len(ages), 1))
    net_worths = numpy.reshape( numpy.array(net_worths), (len(net_worths), 1))

    from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
    ages_train, ages_test, net_worths_train, net_worths_test = train_test_split(ages, net_worths)

    return ages_train, ages_test, net_worths_train, net_worths_test

studentRegression.py: linear regression

def studentReg(ages_train, net_worths_train):

    from sklearn import linear_model
    reg = linear_model.LinearRegression()
    reg.fit(ages_train, net_worths_train)
    
    
    return reg

Result:

The program also prints:

R^2: 0.7889037259170789

slope: [[6.30945055]]

intercept: [-7.44716216]

An R^2 of about 0.79 indicates a reasonably good fit.
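As a quick worked example, the reported slope and intercept can be plugged back into the line equation to see what they imply for the age-27 prediction requested in Main.py. The program's actual printed prediction is not shown in the article, so this is only what the reported coefficients imply.

```python
# Coefficients copied from the results reported above
slope = 6.30945055
intercept = -7.44716216

age = 27
# Prediction from the fitted line: net worth = slope * age + intercept
prediction = slope * age + intercept
print(round(prediction, 2))
```

This gives a predicted net worth of roughly 162.9 for age 27.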