Data Analysis and Machine Learning: Linear Regression and Logistic Regression (Part 6)

Date: 2019-12-14


I. Categories of Machine Learning

Supervised learning

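1. Overview: used mainly for decision support. A supervised algorithm is trained on labeled historical data so that it can predict the labels of new data.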

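2. Classification: classification techniques predict discrete targets, for example whether a text message is spam, or whether a user likes an electronic product.

Examples: k-nearest neighbors, naive Bayes, decision trees, SVM.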

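3. Regression: regression techniques predict continuous targets, such as changes in temperature or changes over time. This covers simple and multiple regression, and linear and nonlinear regression. Examples: linear regression, logistic regression (which, despite its name, is normally used for classification), ridge regression.

Unsupervised learning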

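1. Overview: the data is unlabeled. Used mainly for knowledge discovery, that is, finding hidden patterns or intrinsic structure in historical data.

2. Clustering: clustering algorithms look for hidden patterns or groupings in the data. Example: K-means.

Semi-supervised learning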

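1. Overview: in semi-supervised learning, only part of the training data is labeled. The model first has to learn the intrinsic structure of the data so that it can organize the data sensibly for prediction. Algorithmically, this includes extensions of commonly used supervised learning algorithms: they first try to model the unlabeled data and then, on that basis, make predictions for the labeled data.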

II. The Mathematics of Linear Regression

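\[ h_\theta(x) = \theta^T x \]

Here \(\theta^T\) is the transpose of the feature-weight vector. \(\theta\) is a column vector, so its transpose is a row vector. \(x\) is also a column vector, so \(\theta^T x\) is a row-times-column product, which is a single number.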

The error term in regression and the derivation of the formula

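\(\theta^T x^{(i)}\) is the predicted value and \(y^{(i)}\) is the true value. The two differ by an error \(\epsilon^{(i)}\):

\[ y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)} \]

\(L(\theta)\) is the likelihood function. The closer \(\theta^T x^{(i)}\) is to \(y^{(i)}\), the larger the probability \(P\) of the observed data, so we want \(L(\theta)\) to be as large as possible.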

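The matrix-calculus derivation is omitted here: there are too many formulas, and they were all worked out by hand in a paper notebook.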
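For readers who want the outline of the omitted derivation, the following is a sketch under one common assumption that does not appear in the surviving text: the errors \(\epsilon^{(i)}\) are independent and normally distributed with mean 0 and variance \(\sigma^2\). The symbols \(m\) (number of samples), \(X\) (the matrix whose rows are the \(x^{(i)}\)) and \(J(\theta)\) are introduced here for illustration.

\[ p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right) \]

\[ L(\theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) \]

\[ \log L(\theta) = m \log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^{m} (y^{(i)} - \theta^T x^{(i)})^2 \]

Only the last sum depends on \(\theta\), so maximizing \(L(\theta)\) is the same as minimizing the squared error:

\[ J(\theta) = \frac{1}{2} \sum_{i=1}^{m} (\theta^T x^{(i)} - y^{(i)})^2 = \frac{1}{2} (X\theta - y)^T (X\theta - y) \]

\[ \nabla_\theta J(\theta) = X^T X \theta - X^T y = 0 \quad\Longrightarrow\quad \theta = (X^T X)^{-1} X^T y \]

The last step assumes \(X^T X\) is invertible. This closed form is the expression the fit method in the code that follows evaluates.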

Implementing linear regression in code

# Imports
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets  # built-in example datasets
%matplotlib inline

Building the linear regression class

# Define the class
class LinearRegression():
    # Initialization
    def __init__(self):
        self.w = None

    # Training
    def fit(self,X,y):
        print(X.shape)  #(422, 1)
        X = np.insert(X,0,1,axis=1)  # insert a column of ones (x0 = 1) as the first column, for the intercept
        print(X.shape)  #(422, 2)
        print(X)
        X_ = np.linalg.inv(X.T.dot(X)) # inverse of (X^T X)
        self.w = X_.dot(X.T).dot(y) # w = (X^T X)^-1 X^T y

    # Prediction
    def predict(self,X):
        X = np.insert(X,0,1,axis=1)
        y_pred = X.dot(self.w)
        return y_pred
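As a quick sanity check of the closed-form solution used in fit, the sketch below applies the same formula to synthetic data and compares the result with NumPy's least-squares routine. The data (a line y = 3 + 2x with small noise) and the variable names are assumptions made for illustration, not part of the original post. It uses np.linalg.solve rather than an explicit inverse, which computes the same weights and is generally the more numerically stable choice.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 3 + 2*x plus small noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 1))
y = 3.0 + 2.0 * X[:, 0] + rng.normal(scale=0.1, size=50)

# Prepend the constant column x0 = 1, as fit() does.
Xb = np.insert(X, 0, 1, axis=1)

# Normal equation, solved as a linear system instead of forming the inverse.
w_normal = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# Reference answer from NumPy's least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(Xb, y, rcond=None)

print(w_normal)  # intercept and slope, close to 3 and 2
print(w_lstsq)
```

Both approaches should agree to within floating-point precision whenever X^T X is well conditioned.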

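Mean squared error between predictions and test values

# Square the difference between the true and predicted values, then take the mean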
def mean_squared_error(y_true,y_pred):
    mse = np.mean(np.power(y_true-y_pred,2))
    return mse

Running the main function

def main():
    # Build the training and test data
    diabetes = datasets.load_diabetes()
    X = diabetes.data[:,np.newaxis,2]
    print(X.shape)  #(442, 1)
    x_train,x_test = X[:-20],X[-20:]
    y_train,y_test = diabetes.target[:-20],diabetes.target[-20:]
    
    # Linear regression: train, then predict
    clf = LinearRegression()
    clf.fit(x_train,y_train)
    y_pred = clf.predict(x_test)
    print(mean_squared_error(y_test,y_pred))
    
    # Plot the test points and the fitted line
    plt.scatter(x_test[:,0],y_test,color='black')
    plt.plot(x_test[:,0],y_pred,color='blue',linewidth=3)
    plt.show()

main()

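III. Logistic Regression

Code illustrating the principle (the example in this section fits a linear hypothesis with a squared-error cost using gradient descent)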

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
pga = pd.read_csv('../Desktop/pga.csv')

# Standardize the data (z-score: subtract the mean, divide by the standard deviation)
pga.distance = (pga.distance - pga.distance.mean()) / pga.distance.std()
pga.accuracy = (pga.accuracy - pga.accuracy.mean()) / pga.accuracy.std()
pga.head()

plt.scatter(pga.distance,pga.accuracy)
plt.xlabel('distance')
plt.ylabel('accuracy')
plt.show()

Building the objective function
\[ h_\theta(x) = \theta_1 x + \theta_0 \quad \text{(prediction function)} \]

# Objective (loss) function
def cost(theta0,theta1,x,y):
    J=0
    m = len(x)
    for i in range(m):
        h = theta1*x[i] + theta0  # the prediction h(x) from the formula above
        J += (h-y[i])**2  # accumulate the squared error (h(x) - y)**2
    J /= (2*m)
    return J
print(cost(0,1,pga.distance,pga.accuracy))  #1.599438422599817
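Written out, the loop above computes

\[ J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta_1 x^{(i)} + \theta_0 - y^{(i)} \right)^2 \]

The same quantity can be computed without a Python loop. The sketch below is one way to do that with NumPy; the function names cost_loop and cost_vectorized and the four-point toy data are hypothetical and are not taken from pga.csv.

```python
import numpy as np

def cost_loop(theta0, theta1, x, y):
    """Same computation as cost() above, kept here so the block is self-contained."""
    m = len(x)
    total = 0.0
    for i in range(m):
        h = theta1 * x[i] + theta0
        total += (h - y[i]) ** 2
    return total / (2 * m)

def cost_vectorized(theta0, theta1, x, y):
    """J = (1 / 2m) * sum((theta1 * x + theta0 - y)^2), without a Python loop."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    residual = theta1 * x + theta0 - y
    return float(np.mean(residual ** 2) / 2)

# Hypothetical toy data, not the pga.csv values.
x_demo = np.array([-1.0, 0.0, 1.0, 2.0])
y_demo = np.array([0.5, 0.0, -0.5, -1.5])

print(cost_loop(0, 1, x_demo, y_demo))
print(cost_vectorized(0, 1, x_demo, y_demo))
```

On larger inputs the vectorized form is usually much faster, which matters when the cost is evaluated on a 100 by 100 grid of parameter values.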

theta0 = 100
theta1s = np.linspace(-3,2,100)
costs = []
for theta1 in theta1s:
    costs.append(cost(theta0,theta1,pga.distance,pga.accuracy))
print(theta1s.shape)  #(100,)
plt.plot(theta1s,costs)
plt.show()

Next, we use gradient descent to solve this kind of problem.

# Solving the problem with gradient descent
import numpy as np
from mpl_toolkits.mplot3d.axes3d import Axes3D  # 3D plotting support
import matplotlib.pyplot as plt
%matplotlib inline

theta0s = np.linspace(-2,2,100)
theta1s = np.linspace(-2,2,100)
COST = np.empty(shape=(100,100))  # uninitialized (100, 100) array, filled in below
print(COST.shape)  #(100, 100)

TOS,TIS = np.meshgrid(theta0s,theta1s)
print(TOS.shape,TIS.shape)  #  (100, 100) (100, 100)

# Evaluate the cost on the standardized data at every (theta0, theta1) grid point
for i in range(100):
    for j in range(100):
        COST[i,j] = cost(TOS[i,j],TIS[i,j],pga.distance,pga.accuracy)  # COST[i,j] pairs with TOS[i,j], TIS[i,j]
print(COST.shape)  #(100, 100)

fig2 = plt.figure()
ax = fig2.add_subplot(projection='3d')  # create 3D axes
ax.plot_surface(X=TOS,Y=TIS,Z=COST)
plt.show()

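# How gradient descent works: take the partial derivatives of the cost with respect to theta1 and theta0
# Partial derivative with respect to theta1
def partial_cost_theta1(theta0,theta1,x,y):
    h = theta0 + theta1*x  # prediction function
    diff = (h-y) *x   # per-sample term of the theta1 partial: (h(x) - y) * x
    partial = diff.sum()/(x.shape[0])  # sum and divide by the number of samples
    return partial
partial1 = partial_cost_theta1(0,5,pga.distance,pga.accuracy)
print(partial1)

# Partial derivative with respect to theta0
def partial_cost_theta0(theta0,theta1,x,y):
    h = theta0 + theta1*x   # prediction function
    diff = (h-y)      # per-sample term of the theta0 partial: (h(x) - y)
    partial = diff.sum() / (x.shape[0])  # sum and divide by the number of samples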
    return partial

partial0 = partial_cost_theta0(1,1,pga.distance,pga.accuracy)
print(partial0)

# Output
5.5791338540719
1.0000000000000104
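The second printed value is what the formula predicts: after standardization both columns have mean zero, so at theta0 = 1, theta1 = 1 the theta0 partial reduces to theta0 itself, which is 1. More generally, analytic partial derivatives like these can be checked against finite differences of the cost. The sketch below does that on synthetic data; the helper numeric_partial, the shortened function names, and the generated data are assumptions for illustration and do not come from the original post.

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """Squared-error cost J = (1 / 2m) * sum((theta0 + theta1 * x - y)^2)."""
    return float(np.mean((theta0 + theta1 * x - y) ** 2) / 2)

def partial_theta0(theta0, theta1, x, y):
    """Analytic partial derivative of J with respect to theta0."""
    return float(np.mean(theta0 + theta1 * x - y))

def partial_theta1(theta0, theta1, x, y):
    """Analytic partial derivative of J with respect to theta1."""
    return float(np.mean((theta0 + theta1 * x - y) * x))

def numeric_partial(f, theta0, theta1, x, y, which, eps=1e-6):
    """Central finite difference of f with respect to theta0 (which=0) or theta1 (which=1)."""
    if which == 0:
        return (f(theta0 + eps, theta1, x, y) - f(theta0 - eps, theta1, x, y)) / (2 * eps)
    return (f(theta0, theta1 + eps, x, y) - f(theta0, theta1 - eps, x, y)) / (2 * eps)

# Synthetic data (an assumption), roughly standardized like the pga columns.
rng = np.random.default_rng(1)
x_demo = rng.normal(size=200)
y_demo = -0.6 * x_demo + rng.normal(scale=0.8, size=200)

analytic = (partial_theta0(1.0, 1.0, x_demo, y_demo),
            partial_theta1(1.0, 1.0, x_demo, y_demo))
numeric = (numeric_partial(cost, 1.0, 1.0, x_demo, y_demo, which=0),
           numeric_partial(cost, 1.0, 1.0, x_demo, y_demo, which=1))

print(analytic)
print(numeric)
```

If the two tuples disagree by more than rounding error, the analytic derivative (or the cost it was derived from) most likely contains a mistake.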

Iteratively updating the parameters with gradient descent

# Iterative gradient descent updates; alpha=0.1 is the default step size (learning rate)
def gradient_descent(x,y,alpha=0.1,theta0=0,theta1=0):
    max_epochs = 1000   # maximum number of iterations
    counter = 0
    c = cost(theta0,theta1,x,y)
    costs = [c]
    convergence_thres = 0.00001  # stop once the cost changes by less than this between iterations
    cprev = c+ 10
    theta0s = [theta0]
    theta1s = [theta1]
    # Keep iterating while the cost still changes by more than the threshold and the iteration limit has not been reached
    while (np.abs(cprev-c) > convergence_thres) and (counter < max_epochs):
        cprev = c
        update0 = alpha*partial_cost_theta0(theta0,theta1,x,y)  # alpha times the partial derivative with respect to theta0
        update1 = alpha*partial_cost_theta1(theta0,theta1,x,y)  # alpha times the partial derivative with respect to theta1
        # Update the theta values (the gradient descent step)
        theta0 -= update0
        theta1 -= update1
        # Record the parameter history
        theta0s.append(theta0)
        theta1s.append(theta1)
        # Compute the new cost
        c = cost(theta0,theta1,x,y)
        costs.append(c)
        counter += 1
    return {'theta0':theta0,'theta1':theta1,'costs':costs}
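Because the cost here is the same squared error that the closed-form solution in section II minimizes, gradient descent should land close to the least-squares fit. The sketch below checks this on synthetic data. The function gradient_descent_demo, its parameter names, and the generated data are assumptions for illustration; np.polyfit is used only as a reference least-squares fit.

```python
import numpy as np

def gradient_descent_demo(x, y, alpha=0.1, max_epochs=1000, tol=1e-5):
    """Batch gradient descent for h(x) = theta0 + theta1 * x with squared-error cost."""
    theta0, theta1 = 0.0, 0.0
    prev_cost = np.inf
    for _ in range(max_epochs):
        residual = theta0 + theta1 * x - y
        current_cost = np.mean(residual ** 2) / 2
        # Stop once the cost has essentially stopped changing.
        if abs(prev_cost - current_cost) <= tol:
            break
        prev_cost = current_cost
        # Simultaneous update: both gradients use the same residual.
        theta0 -= alpha * np.mean(residual)
        theta1 -= alpha * np.mean(residual * x)
    return theta0, theta1

# Synthetic data (an assumption): intercept 0.3, slope -0.6, plus noise.
rng = np.random.default_rng(2)
x_demo = rng.normal(size=300)
y_demo = 0.3 - 0.6 * x_demo + rng.normal(scale=0.5, size=300)

gd = gradient_descent_demo(x_demo, y_demo)
closed = np.polyfit(x_demo, y_demo, deg=1)[::-1]  # polyfit returns [slope, intercept]

print(gd)      # (theta0, theta1) from gradient descent
print(closed)  # [intercept, slope] from least squares
```

The two results agree only approximately, because the loop stops as soon as the cost changes by less than the tolerance. Tightening the tolerance brings the gradient descent estimate closer to the least-squares values.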

# The more iterations are run, the lower the cost becomes, until it levels off near the minimum of the objective function
print('theta1=',gradient_descent(pga.distance,pga.accuracy)['theta1'])
descend = gradient_descent(pga.distance,pga.accuracy,alpha=0.01)
plt.scatter(range(len(descend['costs'])),descend['costs'])
plt.xlabel('costs_len')
plt.ylabel('costs')
plt.show()

# Output
theta1= -0.5984131176478865
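The walkthrough above uses a linear hypothesis throughout. To connect it to the section title, here is a minimal sketch of how the same gradient descent loop looks for logistic regression, assuming a sigmoid hypothesis h(x) = 1 / (1 + exp(-(theta0 + theta1 * x))) and the mean cross-entropy loss. None of this code is from the original post: the function names, the step size, the epoch count, and the synthetic binary labels are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_descent(x, y, alpha=0.5, epochs=2000):
    """Fit P(y=1 | x) = sigmoid(theta0 + theta1 * x) by batch gradient descent."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(epochs):
        p = sigmoid(theta0 + theta1 * x)
        # The gradient of the mean cross-entropy loss has the same form as in the
        # squared-error case: (prediction - label), times x for theta1.
        theta0 -= alpha * np.mean(p - y)
        theta1 -= alpha * np.mean((p - y) * x)
    return theta0, theta1

# Synthetic binary labels (an assumption): class 1 becomes more likely as x grows.
rng = np.random.default_rng(3)
x_demo = rng.normal(size=500)
y_demo = (rng.uniform(size=500) < sigmoid(2.0 * x_demo)).astype(float)

theta0, theta1 = logistic_gradient_descent(x_demo, y_demo)

# Classify as 1 when the predicted probability is at least 0.5.
pred = (sigmoid(theta0 + theta1 * x_demo) >= 0.5).astype(float)
accuracy = np.mean(pred == y_demo)

print(theta0, theta1, accuracy)
```

The update rule is almost identical to the linear case. The only change is that the prediction passes through the sigmoid, so the output can be read as a class probability rather than a continuous value.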