Although the name Logistic Regression contains the word "regression", it is actually a classification method. "Logistic" is usually transliterated into Chinese as 邏輯 ("logic"), but it has nothing to do with actual logic.
Since logistic regression is a classification method, we again take the simplest case, binary classification, as an example. Unlike the perceptron, the labels here are y ∈ {0, 1}, and we need to find the best hθ(x) to fit the data.
It is natural to think of linear regression here. Linear regression can also be used for classification, but in many cases, especially binary classification, it does not work well, because classification is not a continuous function: the output can only take fixed discrete values. Imagine the fitted curve hθ(x) produced by linear regression; as x → ∞ we may have y → ∞, which cannot be interpreted meaningfully for y ∈ {0, 1}.
For binary classification, the goal of logistic regression is to find a function such that, no matter what value x takes:
The typical function satisfying this condition is the sigmoid function, also called the logistic function:
In the sigmoid function g(z):
Now give hθ(x) the characteristics of the sigmoid function g(z):
where:
The final model function of logistic regression is therefore:
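Putting the pieces above together in symbols (these are the standard forms):

g(z) = \frac{1}{1 + e^{-z}}, \qquad z = \theta^T x, \qquad h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}, \qquad 0 < h_\theta(x) < 1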
Suppose we are given some input and need to predict with the logistic regression model whether a tumor is benign, and we obtain hθ(x) = 0.8. This can be stated as a probability:
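Read as a conditional probability, this is:

P(y = 1 \mid x; \theta) = h_\theta(x) = 0.8, \qquad P(y = 0 \mid x; \theta) = 1 - h_\theta(x) = 0.2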
The expression above says that, for the current input, the probability that y = 1 is 0.8 and the probability that y = 0 is 0.2. Since this is classification, we predict y = 1.
Note that the sigmoid function is not the curve that separates the sample points; it expresses the prediction of the logistic regression model. θᵀx is the separating curve, and it divides the sample points into the two regions θᵀx ≥ 0 and θᵀx < 0:
The separating curve
The sigmoid function, the final model
Seen this way, the linear model of logistic regression likewise looks for the best θ that separates the two classes of sample points, which makes it very similar to the perceptron.
Intuitively, the decision boundary of a linear model is the separating curve that divides the two classes of sample points. We have met it several times before, just without giving it a formal name. Suppose a model is hθ(x) = g(θ0 + θ1x1 + θ2x2) = g(-3 + x1 + x2); then the decision boundary is -3 + x1 + x2 = 0:
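Concretely, since g(z) ≥ 0.5 exactly when z ≥ 0, the prediction rule of this model is:

h_\theta(x) \ge 0.5 \iff -3 + x_1 + x_2 \ge 0 \iff x_1 + x_2 \ge 3

so every point on or above the line x_1 + x_2 = 3 is predicted as y = 1, and everything below it as y = 0.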
In many cases, however, a straight line does not work well as a decision boundary, as shown in the figure below:
In such cases a polynomial model is needed to add more features:
This is equivalent to adding two new features:
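For example, with two squared features and illustrative coefficients θ = (-1, 0, 0, 1, 1)ᵀ (the actual values depend on training), the model

h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2)

has the circular decision boundary x_1^2 + x_2^2 = 1.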
Understanding the characteristics of different functions helps in choosing the right model. The more features are added, the more complex the curve becomes and the better it fits the training samples, but the model also becomes more prone to overfitting and loses generalization:
For overfitting and how to handle it, see 《ML(附錄3)——數據擬合和正則化》.
We now have a model for binary classification, but real problems are often not binary. How does logistic regression handle multi-class classification?
One feasible approach is to reduce the complex problem to simple ones and convert the multi-class problem into several binary ones:
As shown above, the three-class problem is converted into three binary problems, where the superscript indicates the class:
For an input x, the prediction is the class whose hθ(x) has the largest value. For the final decision, take the three-class example above: if we feed in a feature set whose label is 2, then for hθ(0)(x) we get hθ(0)(x) < 0.5:
For hθ(1)(x) we get hθ(1)(x) < 0.5:
And for hθ(2)(x) we get hθ(2)(x) ≥ 0.5:
Therefore, for an input x, the prediction is the class whose hθ(x) has the largest value. The actual label value of each sample no longer matters: before training each hθ(i)(x), the samples with label i are relabeled as 1 and all the others as 0.
As the figures above also show, a multi-class problem with k labels requires training k different logistic regression models.
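A minimal one-vs-rest sketch in NumPy; the function names and the hyper-parameters alpha and iters are illustrative and not taken from the exercise code:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_binary(X, y, alpha=0.1, iters=5000):
    # plain batch gradient descent for one binary logistic regression model
    Xb = np.c_[np.ones(len(X)), X]      # prepend the intercept column x0 = 1
    theta = np.zeros(Xb.shape[1])
    for _ in range(iters):
        h = sigmoid(Xb @ theta)
        theta -= alpha * Xb.T @ (h - y) / len(y)
    return theta

def train_one_vs_rest(X, y, k):
    # train k binary classifiers; for classifier i, relabel y == i as 1 and the rest as 0
    return [train_binary(X, (y == i).astype(float)) for i in range(k)]

def predict_one_vs_rest(thetas, x):
    xb = np.r_[1.0, x]                  # intercept term for a single sample
    scores = [sigmoid(xb @ t) for t in thetas]
    return int(np.argmax(scores))       # pick the class whose h(x) is largest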
In the perceptron, the model function used sign. Because sign is a step function and hard to optimize, we had to go through a series of transformations based on the functional margin to obtain a loss function. Logistic regression has no such problem, since the function itself is a continuous curve; its J(θ) uses the log loss. The general form of the log loss function is:
Applying it to logistic regression:
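Per sample this gives the familiar piecewise cost:

\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log\big(h_\theta(x)\big) & \text{if } y = 1 \\ -\log\big(1 - h_\theta(x)\big) & \text{if } y = 0 \end{cases}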
The logarithm here is base e, i.e. the natural logarithm:
Note that 0 < hθ(x) < 1. The figure above shows the cost function for y = 1: as hθ(x) → 1, Cost(hθ(x), y) → 0, and as hθ(x) → 0, Cost(hθ(x), y) → ∞. In other words, when the true class is 1, the closer the sigmoid output is to 1, the smaller the loss, and the closer it is to 0, the larger the loss. The larger the loss, the closer the sample point is to the decision surface and the more ambiguous its classification. Similarly, the figure below shows the cost function for y = 0:
The cost function can merge the two cases y = 1 and y = 0 into a single expression:
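Because y can only be 0 or 1, the two branches collapse into one line:

\mathrm{Cost}(h_\theta(x), y) = -y\log\big(h_\theta(x)\big) - (1 - y)\log\big(1 - h_\theta(x)\big)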
To see why this form works, split y into its two cases again:
Finally we obtain the final form of J(θ):
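Averaged over the m training samples, this is:

J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y^{(i)}\log h_\theta\big(x^{(i)}\big) + \big(1 - y^{(i)}\big)\log\Big(1 - h_\theta\big(x^{(i)}\big)\Big) \Big]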
Note that hθ(x) here is the sigmoid function:
Written in matrix form:
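One compact way to write it is:

J(\theta) = -\frac{1}{m}\Big[\, Y^T \log h_\theta(X) + (1 - Y)^T \log\big(1 - h_\theta(X)\big) \Big]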
In the expression above, hθ(X) is applied to each element of the matrix, and the same goes for 1 - Y and log hθ(X); the code implementation later in this article may help in understanding this.
As with the previous algorithms, our goal is to find the best θ that minimizes J(θ), which turns solving for θ into an optimization problem:
Gradient descent is a method with a very wide range of application, and it can likewise be used here to solve for θ:
For more on gradient descent, see 《ML(附錄1)——梯度降低》.
Before taking the partial derivatives of J(θ), we do some preparation and compute the derivative of the sigmoid function (for partial derivatives and derivatives of single-variable functions, see the relevant chapters of 《多變量微積分》 and 《單變量微積分》):
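The derivative has a convenient closed form:

g'(z) = \frac{d}{dz}\,\frac{1}{1 + e^{-z}} = \frac{e^{-z}}{\big(1 + e^{-z}\big)^2} = g(z)\big(1 - g(z)\big)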
Now compute the partial derivative of J(θ) for m = 1, in which case the superscripts can be dropped:
Generalizing to m samples:
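The resulting partial derivative and the corresponding gradient descent update are:

\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\Big(h_\theta\big(x^{(i)}\big) - y^{(i)}\Big)x_j^{(i)}, \qquad \theta_j := \theta_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\Big(h_\theta\big(x^{(i)}\big) - y^{(i)}\Big)x_j^{(i)}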
《機器學習實戰》 (Machine Learning in Action) mentions applying gradient ascent to the log-likelihood to find its maximum, which finally gives:
This is the same as using gradient descent to find the minimum of the loss function, because the loss function is the negative log-likelihood, Cost(X, Y) = -log P(Y|X); gradient descent on -log P(Y|X) and gradient ascent on +log P(Y|X) therefore give the same result.
For a polynomial model, the features must be added beforehand so that each θj corresponds to a unique xj:
In matrix notation:
On top of this, L2 regularization is applied (for regularization, see 《ML(附錄3)——過擬合與欠擬合》):
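With the usual convention that θ0 is not penalized, the L2-regularized cost is:

J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y^{(i)}\log h_\theta\big(x^{(i)}\big) + \big(1 - y^{(i)}\big)\log\Big(1 - h_\theta\big(x^{(i)}\big)\Big) \Big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2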
ex2data1.txt:
34.62365962451697,78.0246928153624,0
30.28671076822607,43.89499752400101,0
35.84740876993872,72.90219802708364,0
60.18259938620976,86.30855209546826,1
79.0327360507101,75.3443764369103,1
45.08327747668339,56.3163717815305,0
61.10666453684766,96.51142588489624,1
75.02474556738889,46.55401354116538,1
76.09878670226257,87.42056971926803,1
84.43281996120035,43.53339331072109,1
95.86155507093572,38.22527805795094,0
75.01365838958247,30.60326323428011,0
82.30705337399482,76.48196330235604,1
69.36458875970939,97.71869196188608,1
39.53833914367223,76.03681085115882,0
53.9710521485623,89.20735013750205,1
69.07014406283025,52.74046973016765,1
67.94685547711617,46.67857410673128,0
70.66150955499435,92.92713789364831,1
76.97878372747498,47.57596364975532,1
67.37202754570876,42.83843832029179,0
89.67677575072079,65.79936592745237,1
50.534788289883,48.85581152764205,0
34.21206097786789,44.20952859866288,0
77.9240914545704,68.9723599933059,1
62.27101367004632,69.95445795447587,1
80.1901807509566,44.82162893218353,1
93.114388797442,38.80067033713209,0
61.83020602312595,50.25610789244621,0
38.78580379679423,64.99568095539578,0
61.379289447425,72.80788731317097,1
85.40451939411645,57.05198397627122,1
52.10797973193984,63.12762376881715,0
52.04540476831827,69.43286012045222,1
40.23689373545111,71.16774802184875,0
54.63510555424817,52.21388588061123,0
33.91550010906887,98.86943574220611,0
64.17698887494485,80.90806058670817,1
74.78925295941542,41.57341522824434,0
34.1836400264419,75.2377203360134,0
83.90239366249155,56.30804621605327,1
51.54772026906181,46.85629026349976,0
94.44336776917852,65.56892160559052,1
82.36875375713919,40.61825515970618,0
51.04775177128865,45.82270145776001,0
62.22267576120188,52.06099194836679,0
77.19303492601364,70.45820000180959,1
97.77159928000232,86.7278223300282,1
62.07306379667647,96.76882412413983,1
91.56497449807442,88.69629254546599,1
79.94481794066932,74.16311935043758,1
99.2725269292572,60.99903099844988,1
90.54671411399852,43.39060180650027,1
34.52451385320009,60.39634245837173,0
50.2864961189907,49.80453881323059,0
49.58667721632031,59.80895099453265,0
97.64563396007767,68.86157272420604,1
32.57720016809309,95.59854761387875,0
74.24869136721598,69.82457122657193,1
71.79646205863379,78.45356224515052,1
75.3956114656803,85.75993667331619,1
35.28611281526193,47.02051394723416,0
56.25381749711624,39.26147251058019,0
30.05882244669796,49.59297386723685,0
44.66826172480893,66.45008614558913,0
66.56089447242954,41.09209807936973,0
40.45755098375164,97.53518548909936,1
49.07256321908844,51.88321182073966,0
80.27957401466998,92.11606081344084,1
66.74671856944039,60.99139402740988,1
32.72283304060323,43.30717306430063,0
64.0393204150601,78.03168802018232,1
72.34649422579923,96.22759296761404,1
60.45788573918959,73.09499809758037,1
58.84095621726802,75.85844831279042,1
99.82785779692128,72.36925193383885,1
47.26426910848174,88.47586499559782,1
50.45815980285988,75.80985952982456,1
60.45555629271532,42.50840943572217,0
82.22666157785568,42.71987853716458,0
88.9138964166533,69.80378889835472,1
94.83450672430196,45.69430680250754,1
67.31925746917527,66.58935317747915,1
57.23870631569862,59.51428198012956,1
80.36675600171273,90.96014789746954,1
68.46852178591112,85.59430710452014,1
42.0754545384731,78.84478600148043,0
75.47770200533905,90.42453899753964,1
78.63542434898018,96.64742716885644,1
52.34800398794107,60.76950525602592,0
94.09433112516793,77.15910509073893,1
90.44855097096364,87.50879176484702,1
55.48216114069585,35.57070347228866,0
74.49269241843041,84.84513684930135,1
89.84580670720979,45.35828361091658,1
83.48916274498238,48.38028579728175,1
42.2617008099817,87.10385094025457,1
99.31500880510394,68.77540947206617,1
55.34001756003703,64.9319380069486,1
74.77589300092767,89.52981289513276,1
ex2.m:
%% Machine Learning Online Class - Exercise 2: Logistic Regression
%
%  Instructions
%  ------------
%
%  This file contains code that helps you get started on the logistic
%  regression exercise. You will need to complete the following functions
%  in this exercise:
%
%     sigmoid.m
%     costFunction.m
%     predict.m
%     costFunctionReg.m
%
%  For this exercise, you will not need to change any code in this file,
%  or any other files other than those mentioned above.
%

%% Initialization
clear ; close all; clc

%% Load Data
%  The first two columns contain the exam scores and the third column
%  contains the label.

data = load('ex2data1.txt');
X = data(:, [1, 2]); y = data(:, 3);

%% ==================== Part 1: Plotting ====================
%  We start the exercise by first plotting the data to understand the
%  problem we are working with.

fprintf(['Plotting data with + indicating (y = 1) examples and o ' ...
         'indicating (y = 0) examples.\n']);

plotData(X, y);

% Put some labels
hold on;
% Labels and Legend
xlabel('Exam 1 score')
ylabel('Exam 2 score')

% Specified in plot order
legend('Admitted', 'Not admitted')
hold off;

fprintf('\nProgram paused. Press enter to continue.\n');
pause;

%% ============ Part 2: Compute Cost and Gradient ============
%  In this part of the exercise, you will implement the cost and gradient
%  for logistic regression. You need to complete the code in
%  costFunction.m

%  Setup the data matrix appropriately, and add ones for the intercept term
[m, n] = size(X);

% Add intercept term to x and X_test
X = [ones(m, 1) X];

% Initialize fitting parameters
initial_theta = zeros(n + 1, 1);

% Compute and display initial cost and gradient
[cost, grad] = costFunction(initial_theta, X, y);

fprintf('Cost at initial theta (zeros): %f\n', cost);
fprintf('Expected cost (approx): 0.693\n');
fprintf('Gradient at initial theta (zeros): \n');
fprintf(' %f \n', grad);
fprintf('Expected gradients (approx):\n -0.1000\n -12.0092\n -11.2628\n');

% Compute and display cost and gradient with non-zero theta
test_theta = [-24; 0.2; 0.2];
[cost, grad] = costFunction(test_theta, X, y);

fprintf('\nCost at test theta: %f\n', cost);
fprintf('Expected cost (approx): 0.218\n');
fprintf('Gradient at test theta: \n');
fprintf(' %f \n', grad);
fprintf('Expected gradients (approx):\n 0.043\n 2.566\n 2.647\n');

fprintf('\nProgram paused. Press enter to continue.\n');
pause;

%% ============= Part 3: Optimizing using fminunc =============
%  In this exercise, you will use a built-in function (fminunc) to find the
%  optimal parameters theta.

%  Set options for fminunc
options = optimset('GradObj', 'on', 'MaxIter', 400);

%  Run fminunc to obtain the optimal theta
%  This function will return theta and the cost
[theta, cost] = fminunc(@(t)(costFunction(t, X, y)), initial_theta, options);

% Print theta to screen
fprintf('Cost at theta found by fminunc: %f\n', cost);
fprintf('Expected cost (approx): 0.203\n');
fprintf('theta: \n');
fprintf(' %f \n', theta);
fprintf('Expected theta (approx):\n');
fprintf(' -25.161\n 0.206\n 0.201\n');

% Plot Boundary
plotDecisionBoundary(theta, X, y);

% Put some labels
hold on;
% Labels and Legend
xlabel('Exam 1 score')
ylabel('Exam 2 score')

% Specified in plot order
legend('Admitted', 'Not admitted')
hold off;

fprintf('\nProgram paused. Press enter to continue.\n');
pause;

%% ============== Part 4: Predict and Accuracies ==============
%  After learning the parameters, you'll like to use it to predict the outcomes
%  on unseen data. In this part, you will use the logistic regression model
%  to predict the probability that a student with score 45 on exam 1 and
%  score 85 on exam 2 will be admitted.
%
%  Furthermore, you will compute the training and test set accuracies of
%  our model.
%
%  Your task is to complete the code in predict.m

%  Predict probability for a student with score 45 on exam 1
%  and score 85 on exam 2
prob = sigmoid([1 45 85] * theta);
fprintf(['For a student with scores 45 and 85, we predict an admission ' ...
         'probability of %f\n'], prob);
fprintf('Expected value: 0.775 +/- 0.002\n\n');

% Compute accuracy on our training set
p = predict(theta, X);

fprintf('Train Accuracy: %f\n', mean(double(p == y)) * 100);
fprintf('Expected accuracy (approx): 89.0\n');
fprintf('\n');
plotData.m
function plotData(X, y)
%PLOTDATA Plots the data points X and y into a new figure
%   PLOTDATA(x,y) plots the data points with + for the positive examples
%   and o for the negative examples. X is assumed to be a Mx2 matrix.

% Create New Figure
figure; hold on;

% Instructions: Plot the positive and negative examples on a
%               2D plot, using the option 'k+' for the positive
%               examples and 'ko' for the negative examples.

pos = find(y == 1);
neg = find(y == 0);
plot(X(pos, 1), X(pos, 2), 'k+', 'LineWidth', 2, 'MarkerSize', 7);
plot(X(neg, 1), X(neg, 2), 'ko', 'MarkerFaceColor', 'y', 'MarkerSize', 7);

hold off;

end
sigmoid.m
function g = sigmoid(z)
%SIGMOID Compute sigmoid function
%   g = SIGMOID(z) computes the sigmoid of z.

% You need to return the following variables correctly
g = ones(size(z)) ./ (1 + exp(-1 * z));

end
costFunction.m
function [J, grad] = costFunction(theta, X, y)
%COSTFUNCTION Compute cost and gradient for logistic regression
%   J = COSTFUNCTION(theta, X, y) computes the cost of using theta as the
%   parameter for logistic regression and the gradient of the cost
%   w.r.t. to the parameters.

% Initialize some useful values
m = length(y); % number of training examples

% You need to return the following variables correctly
J = 0;
grad = zeros(size(theta));

% Instructions: Compute the cost of a particular choice of theta.
%               You should set J to the cost.
%               Compute the partial derivatives and set grad to the partial
%               derivatives of the cost w.r.t. each parameter in theta
%
% Note: grad should have the same dimensions as theta

% Use an iterative loop to compute J:
% for i = 1:m
%     theta_X = X(i,:) * theta;
%     h = 1 / (1 + exp(-1 * theta_X));
%     J += y(i) * log(h) + (1 - y(i)) * log(1 - h);
% end
% J /= -1 * m;

% Use the matrix form to compute the cost and gradient
h = sigmoid(X * theta);
J = (y' * log(h) + (1 - y)' * log(1 - h)) / (-1 * m);
grad = X' * (h - y) / m;

end
A Python implementation using gradient descent:

from __future__ import division
import numpy as np
import random
import matplotlib.pyplot as plt

def train(X, Y, iterateNum=10000000, alpha=0.003):
    '''
    :param X: feature set of the training samples
    :param Y: labels of the training samples
    :param iterateNum: number of gradient descent iterations
    :param alpha: learning rate
    :return: theta
    '''
    m, n = np.shape(X)
    theta = np.zeros((n + 1, 1))
    # add x0 as the first column
    X_new = np.c_[np.ones(m), X]

    for i in range(iterateNum):
        m = np.shape(X_new)[0]
        h = h_function(X_new, theta)
        theta -= alpha * (np.dot(X_new.T, h - Y) / m)

        if i % 100000 == 0:
            print('\t---------iter=' + str(i) + ', J(θ)=' + str(J_function(X_new, Y, theta)))

    print(str(J_function(X_new, Y, theta)))
    return theta

def h_function(X, theta):
    return sigmoid(np.dot(X, theta))

def sigmoid(X):
    return 1 / (1 + np.exp(-X))

# compute J(θ)
def J_function(X, Y, theta):
    h = h_function(X, theta)
    J_1 = np.dot(Y.T, np.log(h))
    J_2 = np.dot(1 - Y.T, np.log(1 - h))
    m = np.shape(X)[0]
    J = (-1 / m) * (J_1 + J_2)

    return J

def predict(x, theta):
    if h_function(x, theta) >= 0.5:
        return 1
    else:
        return 0

# min-max normalization
def normalization(X):
    m, n = np.shape(X)
    X_new = np.zeros((m, n))

    for j in range(n):
        max = np.max(X[:, j])
        min = np.min(X[:, j])
        d_value = max - min
        for i in range(m):
            X_new[i, j] = (X[i, j] - min) / d_value

    return X_new

def plot_datas(X, Y, theta):
    plt.figure()

    # plot the separating line g = 0
    x1 = [0, 1]
    x2 = [(-1 / theta[2]) * (theta[0] + theta[1] * x1[0]),
          (-1 / theta[2]) * (theta[0] + theta[1] * x1[1])]
    plt.xlabel('x1')
    plt.ylabel('x2')

    plt.plot(x1, x2, color='b')

    # plot the data points
    admit_x1, admit_x2 = [], []
    not_admit_x1, not_admit_x2 = [], []
    for i in range(len(X)):
        if (Y[i] == 1):
            admit_x1.append(X[i][0])
            admit_x2.append(X[i][1])
        else:
            not_admit_x1.append(X[i][0])
            not_admit_x2.append(X[i][1])

    plt.scatter(admit_x1, admit_x2, color='g')
    plt.scatter(not_admit_x1, not_admit_x2, marker='x', color='r')

    plt.legend(['logistic line', 'Admitted', 'Not admitted'])
    plt.show()

if __name__ == '__main__':
    train_datas = np.loadtxt('ex2data1.txt', delimiter=',')
    X = train_datas[:, [0, 1]]
    X = normalization(X)
    Y = train_datas[:, [2]]
    theta = train(X, Y)

    print(theta)
    plot_datas(X, Y, theta)
For this sample data, the gradient descent above is not very effective: regardless of whether the data are preprocessed (normalization or other methods), the learning rate and the number of iterations have to be tuned repeatedly. If the learning rate is too large, the algorithm will not converge; if it is too small, convergence is very slow and more iterations are needed. With the parameters in the code, the algorithm eventually converges to a cost of about 0.203.
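One way to shorten that tuning loop is a short pilot run over a few candidate learning rates before committing to the full ten-million-iteration run; the sketch below reuses normalization, train and J_function from the listing above, and the candidate values and the 100000-iteration budget are only illustrative:

# reuses normalization, train and J_function from the listing above
import numpy as np

train_datas = np.loadtxt('ex2data1.txt', delimiter=',')
X = normalization(train_datas[:, [0, 1]])
Y = train_datas[:, [2]]

for alpha in (0.0003, 0.003, 0.03, 0.3):    # candidate learning rates (illustrative)
    theta = train(X, Y, iterateNum=100000, alpha=alpha)
    X_b = np.c_[np.ones(len(X)), X]         # J_function expects the intercept column
    print('alpha=%g, J(theta)=%s' % (alpha, J_function(X_b, Y, theta)))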
The same model can also be fit with scikit-learn:

from sklearn.linear_model import LogisticRegression
import numpy as np

if __name__ == '__main__':
    train_datas = np.loadtxt("ex2data1.txt", delimiter=',')
    X_train = train_datas[:, [0, 1]]
    Y_train = train_datas[:, [2]]

    logistic = LogisticRegression()
    logistic.fit(X_train, Y_train)

    theta = [logistic.intercept_[0], logistic.coef_[0]]
    print(theta)
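If a quick accuracy check is wanted, scikit-learn's score method reports the mean training accuracy directly (this line is an addition to the listing above):

print(logistic.score(X_train, Y_train.ravel()))  # mean accuracy on the training set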
References:
Andrew Ng, video lectures: Logistic Regression
周志華, 《機器學習》
《機器學習導論》
Peter Flach, 《機器學習》
Author: 我是8位的
Source: http://www.cnblogs.com/bigmonkey