Although the name Logistic Regression contains the word "regression", it is actually a classification method. "Logistic" is usually transliterated into Chinese as 邏輯 ("logic"), but it has nothing to do with actual logic.
Since logistic regression is a classification method, we again take the simplest case, binary classification, as an example. Unlike the perceptron, the labels here are y ∈ {0, 1}, and we need to find the best hθ(x) to fit the data.
It is natural to think of linear regression here. Linear regression can also be used for classification, but in many cases, especially binary classification, it does not work well, because classification is not a continuous function: the output can only take fixed discrete values. Imagine the fitted curve hθ(x) produced by linear regression; as x → ∞ we may have y → ∞, which cannot be interpreted meaningfully for y ∈ {0, 1}.
For binary classification, the goal of logistic regression is to find a function such that, no matter what value x takes:
The typical function satisfying this condition is the sigmoid function, also called the logistic function:
In the sigmoid function g(z):
Now give hθ(x) the characteristics of the sigmoid function g(z):
where:
The final model function of logistic regression is therefore:
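Putting the pieces above together in symbols (these are the standard forms):

g(z) = \frac{1}{1 + e^{-z}}, \qquad z = \theta^T x, \qquad h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}, \qquad 0 < h_\theta(x) < 1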
Suppose we are given some input and need to predict with the logistic regression model whether a tumor is benign, and we obtain hθ(x) = 0.8. This can be stated as a probability:
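Read as a conditional probability, this is:

P(y = 1 \mid x; \theta) = h_\theta(x) = 0.8, \qquad P(y = 0 \mid x; \theta) = 1 - h_\theta(x) = 0.2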
The expression above says that, for the current input, the probability that y = 1 is 0.8 and the probability that y = 0 is 0.2. Since this is classification, we predict y = 1.
Note that the sigmoid function is not the curve that separates the sample points; it expresses the prediction of the logistic regression model. θᵀx is the separating curve, and it divides the sample points into the two regions θᵀx ≥ 0 and θᵀx < 0:
The separating curve
The sigmoid function, the final model
Seen this way, the linear model of logistic regression likewise looks for the best θ that separates the two classes of sample points, which makes it very similar to the perceptron.
Intuitively, the decision boundary of a linear model is the separating curve that divides the two classes of sample points. We have met it several times before, just without giving it a formal name. Suppose a model is hθ(x) = g(θ0 + θ1x1 + θ2x2) = g(-3 + x1 + x2); then the decision boundary is -3 + x1 + x2 = 0:
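Concretely, since g(z) ≥ 0.5 exactly when z ≥ 0, the prediction rule of this model is:

h_\theta(x) \ge 0.5 \iff -3 + x_1 + x_2 \ge 0 \iff x_1 + x_2 \ge 3

so every point on or above the line x_1 + x_2 = 3 is predicted as y = 1, and everything below it as y = 0.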
In many cases, however, a straight line does not work well as a decision boundary, as shown in the figure below:
In such cases a polynomial model is needed to add more features:
This is equivalent to adding two new features:
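For example, with two squared features and illustrative coefficients θ = (-1, 0, 0, 1, 1)ᵀ (the actual values depend on training), the model

h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2)

has the circular decision boundary x_1^2 + x_2^2 = 1.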
Understanding the characteristics of different functions helps in choosing the right model. The more features are added, the more complex the curve becomes and the better it fits the training samples, but the model also becomes more prone to overfitting and loses generalization:
For overfitting and how to handle it, see 《ML(附錄3)——數據擬合和正則化》.
We now have a model for binary classification, but real problems are often not binary. How does logistic regression handle multi-class classification?
One feasible approach is to reduce the complex problem to simple ones and convert the multi-class problem into several binary ones:
As shown above, the three-class problem is converted into three binary problems, where the superscript indicates the class:
For an input x, the prediction is the class whose hθ(x) has the largest value. For the final decision, take the three-class example above: if we feed in a feature set whose label is 2, then for hθ(0)(x) we get hθ(0)(x) < 0.5:
For hθ(1)(x) we get hθ(1)(x) < 0.5:
And for hθ(2)(x) we get hθ(2)(x) ≥ 0.5:
Therefore, for an input x, the prediction is the class whose hθ(x) has the largest value. The actual label value of each sample no longer matters: before training each hθ(i)(x), the samples with label i are relabeled as 1 and all the others as 0.
As the figures above also show, a multi-class problem with k labels requires training k different logistic regression models.
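A minimal one-vs-rest sketch in NumPy; the function names and the hyper-parameters alpha and iters are illustrative and not taken from the exercise code:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_binary(X, y, alpha=0.1, iters=5000):
    # plain batch gradient descent for one binary logistic regression model
    Xb = np.c_[np.ones(len(X)), X]      # prepend the intercept column x0 = 1
    theta = np.zeros(Xb.shape[1])
    for _ in range(iters):
        h = sigmoid(Xb @ theta)
        theta -= alpha * Xb.T @ (h - y) / len(y)
    return theta

def train_one_vs_rest(X, y, k):
    # train k binary classifiers; for classifier i, relabel y == i as 1 and the rest as 0
    return [train_binary(X, (y == i).astype(float)) for i in range(k)]

def predict_one_vs_rest(thetas, x):
    xb = np.r_[1.0, x]                  # intercept term for a single sample
    scores = [sigmoid(xb @ t) for t in thetas]
    return int(np.argmax(scores))       # pick the class whose h(x) is largest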
In the perceptron, the model function used sign. Because sign is a step function and hard to optimize, we had to go through a series of transformations based on the functional margin to obtain a loss function. Logistic regression has no such problem, since the function itself is a continuous curve; its J(θ) uses the log loss. The general form of the log loss function is:
Applying it to logistic regression:
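Per sample this gives the familiar piecewise cost:

\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log\big(h_\theta(x)\big) & \text{if } y = 1 \\ -\log\big(1 - h_\theta(x)\big) & \text{if } y = 0 \end{cases}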
The logarithm here is base e, i.e. the natural logarithm:
Note that 0 < hθ(x) < 1. The figure above shows the cost function for y = 1: as hθ(x) → 1, Cost(hθ(x), y) → 0, and as hθ(x) → 0, Cost(hθ(x), y) → ∞. In other words, when the true class is 1, the closer the sigmoid output is to 1, the smaller the loss, and the closer it is to 0, the larger the loss. The larger the loss, the closer the sample point is to the decision surface and the more ambiguous its classification. Similarly, the figure below shows the cost function for y = 0:
The cost function can merge the two cases y = 1 and y = 0 into a single expression:
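Because y can only be 0 or 1, the two branches collapse into one line:

\mathrm{Cost}(h_\theta(x), y) = -y\log\big(h_\theta(x)\big) - (1 - y)\log\big(1 - h_\theta(x)\big)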
To see why this form works, split y into its two cases again:
Finally we obtain the final form of J(θ):
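Averaged over the m training samples, this is:

J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y^{(i)}\log h_\theta\big(x^{(i)}\big) + \big(1 - y^{(i)}\big)\log\Big(1 - h_\theta\big(x^{(i)}\big)\Big) \Big]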
Note that hθ(x) here is the sigmoid function:
Written in matrix form:
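One compact way to write it is:

J(\theta) = -\frac{1}{m}\Big[\, Y^T \log h_\theta(X) + (1 - Y)^T \log\big(1 - h_\theta(X)\big) \Big]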
In the expression above, hθ(X) is applied to each element of the matrix, and the same goes for 1 - Y and log hθ(X); the code implementation later in this article may help in understanding this.
As with the previous algorithms, our goal is to find the best θ that minimizes J(θ), which turns solving for θ into an optimization problem:
Gradient descent is a method with a very wide range of application, and it can likewise be used here to solve for θ:
For more on gradient descent, see 《ML(附錄1)——梯度降低》.
Before taking the partial derivatives of J(θ), we do some preparation and compute the derivative of the sigmoid function (for partial derivatives and derivatives of single-variable functions, see the relevant chapters of 《多變量微積分》 and 《單變量微積分》):
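The derivative has a convenient closed form:

g'(z) = \frac{d}{dz}\,\frac{1}{1 + e^{-z}} = \frac{e^{-z}}{\big(1 + e^{-z}\big)^2} = g(z)\big(1 - g(z)\big)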
Now compute the partial derivative of J(θ) for m = 1, in which case the superscripts can be dropped:
Generalizing to m samples:
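The resulting partial derivative and the corresponding gradient descent update are:

\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\Big(h_\theta\big(x^{(i)}\big) - y^{(i)}\Big)x_j^{(i)}, \qquad \theta_j := \theta_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\Big(h_\theta\big(x^{(i)}\big) - y^{(i)}\Big)x_j^{(i)}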
《機器學習實戰》 (Machine Learning in Action) mentions applying gradient ascent to the log-likelihood to find its maximum, which finally gives:
This is the same as using gradient descent to find the minimum of the loss function, because the loss function is the negative log-likelihood, Cost(X, Y) = -log P(Y|X); gradient descent on -log P(Y|X) and gradient ascent on +log P(Y|X) therefore give the same result.
For a polynomial model, the features must be added beforehand so that each θj corresponds to a unique xj:
In matrix notation:
On top of this, L2 regularization is applied (for regularization, see 《ML(附錄3)——過擬合與欠擬合》):
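With the usual convention that θ0 is not penalized, the L2-regularized cost is:

J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y^{(i)}\log h_\theta\big(x^{(i)}\big) + \big(1 - y^{(i)}\big)\log\Big(1 - h_\theta\big(x^{(i)}\big)\Big) \Big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2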
ex2data1.txt:
34.62365962451697,78.0246928153624,0
30.28671076822607,43.89499752400101,0
35.84740876993872,72.90219802708364,0
60.18259938620976,86.30855209546826,1
79.0327360507101,75.3443764369103,1
45.08327747668339,56.3163717815305,0
61.10666453684766,96.51142588489624,1
75.02474556738889,46.55401354116538,1
76.09878670226257,87.42056971926803,1
84.43281996120035,43.53339331072109,1
95.86155507093572,38.22527805795094,0
75.01365838958247,30.60326323428011,0
82.30705337399482,76.48196330235604,1
69.36458875970939,97.71869196188608,1
39.53833914367223,76.03681085115882,0
53.9710521485623,89.20735013750205,1
69.07014406283025,52.74046973016765,1
67.94685547711617,46.67857410673128,0
70.66150955499435,92.92713789364831,1
76.97878372747498,47.57596364975532,1
67.37202754570876,42.83843832029179,0
89.67677575072079,65.79936592745237,1
50.534788289883,48.85581152764205,0
34.21206097786789,44.20952859866288,0
77.9240914545704,68.9723599933059,1
62.27101367004632,69.95445795447587,1
80.1901807509566,44.82162893218353,1
93.114388797442,38.80067033713209,0
61.83020602312595,50.25610789244621,0
38.78580379679423,64.99568095539578,0
61.379289447425,72.80788731317097,1
85.40451939411645,57.05198397627122,1
52.10797973193984,63.12762376881715,0
52.04540476831827,69.43286012045222,1
40.23689373545111,71.16774802184875,0
54.63510555424817,52.21388588061123,0
33.91550010906887,98.86943574220611,0
64.17698887494485,80.90806058670817,1
74.78925295941542,41.57341522824434,0
34.1836400264419,75.2377203360134,0
83.90239366249155,56.30804621605327,1
51.54772026906181,46.85629026349976,0
94.44336776917852,65.56892160559052,1
82.36875375713919,40.61825515970618,0
51.04775177128865,45.82270145776001,0
62.22267576120188,52.06099194836679,0
77.19303492601364,70.45820000180959,1
97.77159928000232,86.7278223300282,1
62.07306379667647,96.76882412413983,1
91.56497449807442,88.69629254546599,1
79.94481794066932,74.16311935043758,1
99.2725269292572,60.99903099844988,1
90.54671411399852,43.39060180650027,1
34.52451385320009,60.39634245837173,0
50.2864961189907,49.80453881323059,0
49.58667721632031,59.80895099453265,0
97.64563396007767,68.86157272420604,1
32.57720016809309,95.59854761387875,0
74.24869136721598,69.82457122657193,1
71.79646205863379,78.45356224515052,1
75.3956114656803,85.75993667331619,1
35.28611281526193,47.02051394723416,0
56.25381749711624,39.26147251058019,0
30.05882244669796,49.59297386723685,0
44.66826172480893,66.45008614558913,0
66.56089447242954,41.09209807936973,0
40.45755098375164,97.53518548909936,1
49.07256321908844,51.88321182073966,0
80.27957401466998,92.11606081344084,1
66.74671856944039,60.99139402740988,1
32.72283304060323,43.30717306430063,0
64.0393204150601,78.03168802018232,1
72.34649422579923,96.22759296761404,1
60.45788573918959,73.09499809758037,1
58.84095621726802,75.85844831279042,1
99.82785779692128,72.36925193383885,1
47.26426910848174,88.47586499559782,1
50.45815980285988,75.80985952982456,1
60.45555629271532,42.50840943572217,0
82.22666157785568,42.71987853716458,0
88.9138964166533,69.80378889835472,1
94.83450672430196,45.69430680250754,1
67.31925746917527,66.58935317747915,1
57.23870631569862,59.51428198012956,1
80.36675600171273,90.96014789746954,1
68.46852178591112,85.59430710452014,1
42.0754545384731,78.84478600148043,0
75.47770200533905,90.42453899753964,1
78.63542434898018,96.64742716885644,1
52.34800398794107,60.76950525602592,0
94.09433112516793,77.15910509073893,1
90.44855097096364,87.50879176484702,1
55.48216114069585,35.57070347228866,0
74.49269241843041,84.84513684930135,1
89.84580670720979,45.35828361091658,1
83.48916274498238,48.38028579728175,1
42.2617008099817,87.10385094025457,1
99.31500880510394,68.77540947206617,1
55.34001756003703,64.9319380069486,1
74.77589300092767,89.52981289513276,1
ex2.m:
%% Machine Learning Online Class - Exercise 2: Logistic Regression
%
%  Instructions
%  ------------
%
%  This file contains code that helps you get started on the logistic
%  regression exercise. You will need to complete the following functions
%  in this exercise:
%
%     sigmoid.m
%     costFunction.m
%     predict.m
%     costFunctionReg.m
%
%  For this exercise, you will not need to change any code in this file,
%  or any other files other than those mentioned above.
%

%% Initialization
clear ; close all; clc

%% Load Data
%  The first two columns contain the exam scores and the third column
%  contains the label.

data = load('ex2data1.txt');
X = data(:, [1, 2]); y = data(:, 3);

%% ==================== Part 1: Plotting ====================
%  We start the exercise by first plotting the data to understand the
%  problem we are working with.

fprintf(['Plotting data with + indicating (y = 1) examples and o ' ...
         'indicating (y = 0) examples.\n']);

plotData(X, y);

% Put some labels
hold on;
% Labels and Legend
xlabel('Exam 1 score')
ylabel('Exam 2 score')

% Specified in plot order
legend('Admitted', 'Not admitted')
hold off;

fprintf('\nProgram paused. Press enter to continue.\n');
pause;

%% ============ Part 2: Compute Cost and Gradient ============
%  In this part of the exercise, you will implement the cost and gradient
%  for logistic regression. You need to complete the code in
%  costFunction.m

%  Setup the data matrix appropriately, and add ones for the intercept term
[m, n] = size(X);

% Add intercept term to x and X_test
X = [ones(m, 1) X];

% Initialize fitting parameters
initial_theta = zeros(n + 1, 1);

% Compute and display initial cost and gradient
[cost, grad] = costFunction(initial_theta, X, y);

fprintf('Cost at initial theta (zeros): %f\n', cost);
fprintf('Expected cost (approx): 0.693\n');
fprintf('Gradient at initial theta (zeros): \n');
fprintf(' %f \n', grad);
fprintf('Expected gradients (approx):\n -0.1000\n -12.0092\n -11.2628\n');

% Compute and display cost and gradient with non-zero theta
test_theta = [-24; 0.2; 0.2];
[cost, grad] = costFunction(test_theta, X, y);

fprintf('\nCost at test theta: %f\n', cost);
fprintf('Expected cost (approx): 0.218\n');
fprintf('Gradient at test theta: \n');
fprintf(' %f \n', grad);
fprintf('Expected gradients (approx):\n 0.043\n 2.566\n 2.647\n');

fprintf('\nProgram paused. Press enter to continue.\n');
pause;

%% ============= Part 3: Optimizing using fminunc =============
%  In this exercise, you will use a built-in function (fminunc) to find the
%  optimal parameters theta.

%  Set options for fminunc
options = optimset('GradObj', 'on', 'MaxIter', 400);

%  Run fminunc to obtain the optimal theta
%  This function will return theta and the cost
[theta, cost] = fminunc(@(t)(costFunction(t, X, y)), initial_theta, options);

% Print theta to screen
fprintf('Cost at theta found by fminunc: %f\n', cost);
fprintf('Expected cost (approx): 0.203\n');
fprintf('theta: \n');
fprintf(' %f \n', theta);
fprintf('Expected theta (approx):\n');
fprintf(' -25.161\n 0.206\n 0.201\n');

% Plot Boundary
plotDecisionBoundary(theta, X, y);

% Put some labels
hold on;
% Labels and Legend
xlabel('Exam 1 score')
ylabel('Exam 2 score')

% Specified in plot order
legend('Admitted', 'Not admitted')
hold off;

fprintf('\nProgram paused. Press enter to continue.\n');
pause;

%% ============== Part 4: Predict and Accuracies ==============
%  After learning the parameters, you'll like to use it to predict the outcomes
%  on unseen data. In this part, you will use the logistic regression model
%  to predict the probability that a student with score 45 on exam 1 and
%  score 85 on exam 2 will be admitted.
%
%  Furthermore, you will compute the training and test set accuracies of
%  our model.
%
%  Your task is to complete the code in predict.m

%  Predict probability for a student with score 45 on exam 1
%  and score 85 on exam 2
prob = sigmoid([1 45 85] * theta);
fprintf(['For a student with scores 45 and 85, we predict an admission ' ...
         'probability of %f\n'], prob);
fprintf('Expected value: 0.775 +/- 0.002\n\n');

% Compute accuracy on our training set
p = predict(theta, X);

fprintf('Train Accuracy: %f\n', mean(double(p == y)) * 100);
fprintf('Expected accuracy (approx): 89.0\n');
fprintf('\n');
plotData.m
function plotData(X, y)
%PLOTDATA Plots the data points X and y into a new figure
%   PLOTDATA(x,y) plots the data points with + for the positive examples
%   and o for the negative examples. X is assumed to be a Mx2 matrix.

% Create New Figure
figure; hold on;

% Instructions: Plot the positive and negative examples on a
%               2D plot, using the option 'k+' for the positive
%               examples and 'ko' for the negative examples.

pos = find(y == 1);
neg = find(y == 0);
plot(X(pos, 1), X(pos, 2), 'k+', 'LineWidth', 2, 'MarkerSize', 7);
plot(X(neg, 1), X(neg, 2), 'ko', 'MarkerFaceColor', 'y', 'MarkerSize', 7);

hold off;

end
sigmoid.m
function g = sigmoid(z)
%SIGMOID Compute sigmoid function
%   g = SIGMOID(z) computes the sigmoid of z.

% You need to return the following variables correctly
g = ones(size(z)) ./ (1 + exp(-1 * z));

end
costFunction.m
function [J, grad] = costFunction(theta, X, y)
%COSTFUNCTION Compute cost and gradient for logistic regression
%   J = COSTFUNCTION(theta, X, y) computes the cost of using theta as the
%   parameter for logistic regression and the gradient of the cost
%   w.r.t. to the parameters.

% Initialize some useful values
m = length(y); % number of training examples

% You need to return the following variables correctly
J = 0;
grad = zeros(size(theta));

% Instructions: Compute the cost of a particular choice of theta.
%               You should set J to the cost.
%               Compute the partial derivatives and set grad to the partial
%               derivatives of the cost w.r.t. each parameter in theta
%
% Note: grad should have the same dimensions as theta

% Use an iterative loop to compute J:
% for i = 1:m
%     theta_X = X(i,:) * theta;
%     h = 1 / (1 + exp(-1 * theta_X));
%     J += y(i) * log(h) + (1 - y(i)) * log(1 - h);
% end
% J /= -1 * m;

% Use the matrix form to compute the cost and gradient
h = sigmoid(X * theta);
J = (y' * log(h) + (1 - y)' * log(1 - h)) / (-1 * m);
grad = X' * (h - y) / m;

end
A Python implementation using gradient descent:

from __future__ import division
import numpy as np
import random
import matplotlib.pyplot as plt

def train(X, Y, iterateNum=10000000, alpha=0.003):
    '''
    :param X: feature set of the training samples
    :param Y: labels of the training samples
    :param iterateNum: number of gradient descent iterations
    :param alpha: learning rate
    :return: theta
    '''
    m, n = np.shape(X)
    theta = np.zeros((n + 1, 1))
    # add x0 as the first column
    X_new = np.c_[np.ones(m), X]

    for i in range(iterateNum):
        m = np.shape(X_new)[0]
        h = h_function(X_new, theta)
        theta -= alpha * (np.dot(X_new.T, h - Y) / m)

        if i % 100000 == 0:
            print('\t---------iter=' + str(i) + ', J(θ)=' + str(J_function(X_new, Y, theta)))

    print(str(J_function(X_new, Y, theta)))
    return theta

def h_function(X, theta):
    return sigmoid(np.dot(X, theta))

def sigmoid(X):
    return 1 / (1 + np.exp(-X))

# compute J(θ)
def J_function(X, Y, theta):
    h = h_function(X, theta)
    J_1 = np.dot(Y.T, np.log(h))
    J_2 = np.dot(1 - Y.T, np.log(1 - h))
    m = np.shape(X)[0]
    J = (-1 / m) * (J_1 + J_2)

    return J

def predict(x, theta):
    if h_function(x, theta) >= 0.5:
        return 1
    else:
        return 0

# min-max normalization
def normalization(X):
    m, n = np.shape(X)
    X_new = np.zeros((m, n))

    for j in range(n):
        max = np.max(X[:, j])
        min = np.min(X[:, j])
        d_value = max - min
        for i in range(m):
            X_new[i, j] = (X[i, j] - min) / d_value

    return X_new

def plot_datas(X, Y, theta):
    plt.figure()

    # plot the separating line g = 0
    x1 = [0, 1]
    x2 = [(-1 / theta[2]) * (theta[0] + theta[1] * x1[0]),
          (-1 / theta[2]) * (theta[0] + theta[1] * x1[1])]
    plt.xlabel('x1')
    plt.ylabel('x2')

    plt.plot(x1, x2, color='b')

    # plot the data points
    admit_x1, admit_x2 = [], []
    not_admit_x1, not_admit_x2 = [], []
    for i in range(len(X)):
        if (Y[i] == 1):
            admit_x1.append(X[i][0])
            admit_x2.append(X[i][1])
        else:
            not_admit_x1.append(X[i][0])
            not_admit_x2.append(X[i][1])

    plt.scatter(admit_x1, admit_x2, color='g')
    plt.scatter(not_admit_x1, not_admit_x2, marker='x', color='r')

    plt.legend(['logistic line', 'Admitted', 'Not admitted'])
    plt.show()

if __name__ == '__main__':
    train_datas = np.loadtxt('ex2data1.txt', delimiter=',')
    X = train_datas[:, [0, 1]]
    X = normalization(X)
    Y = train_datas[:, [2]]
    theta = train(X, Y)

    print(theta)
    plot_datas(X, Y, theta)
For this sample data, the gradient descent above is not very effective: regardless of whether the data are preprocessed (normalization or other methods), the learning rate and the number of iterations have to be tuned repeatedly. If the learning rate is too large, the algorithm will not converge; if it is too small, convergence is very slow and more iterations are needed. With the parameters in the code, the algorithm eventually converges to a cost of about 0.203.
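One way to shorten that tuning loop is a short pilot run over a few candidate learning rates before committing to the full ten-million-iteration run; the sketch below reuses normalization, train and J_function from the listing above, and the candidate values and the 100000-iteration budget are only illustrative:

# reuses normalization, train and J_function from the listing above
import numpy as np

train_datas = np.loadtxt('ex2data1.txt', delimiter=',')
X = normalization(train_datas[:, [0, 1]])
Y = train_datas[:, [2]]

for alpha in (0.0003, 0.003, 0.03, 0.3):    # candidate learning rates (illustrative)
    theta = train(X, Y, iterateNum=100000, alpha=alpha)
    X_b = np.c_[np.ones(len(X)), X]         # J_function expects the intercept column
    print('alpha=%g, J(theta)=%s' % (alpha, J_function(X_b, Y, theta)))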
The same model can also be fit with scikit-learn:

from sklearn.linear_model import LogisticRegression
import numpy as np

if __name__ == '__main__':
    train_datas = np.loadtxt("ex2data1.txt", delimiter=',')
    X_train = train_datas[:, [0, 1]]
    Y_train = train_datas[:, [2]]

    logistic = LogisticRegression()
    logistic.fit(X_train, Y_train)

    theta = [logistic.intercept_[0], logistic.coef_[0]]
    print(theta)
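If a quick accuracy check is wanted, scikit-learn's score method reports the mean training accuracy directly (this line is an addition to the listing above):

print(logistic.score(X_train, Y_train.ravel()))  # mean accuracy on the training set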
References:
Andrew Ng, video lectures: Logistic Regression
周志華, 《機器學習》
《機器學習導論》
Peter Flach, 《機器學習》
Author: 我是8位的
Source: http://www.cnblogs.com/bigmonkey