Logistic Regression (Python & Matlab implementations)

This exercise is based on "Machine Learning in Action"; it reproduces the book's code in order to get familiar with applying the algorithm.

(Note: the Matlab version is reposted from http://blog.csdn.net/llp1992/article/details/45114421 ; thanks to the original author for their work.)

1. Gradient ascent
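For reference, writing w for the weight vector, α for the step size, X for the data matrix (with a constant column x0 = 1), y for the label vector, and σ for the sigmoid, the batch update that the code below implements is

w ← w + α · Xᵀ(y − σ(Xw))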

Create a file named logRegres.py and add the following code to it:

from numpy import *
# load the numpy module
def loadDataSet():
    dataMat = []; labelMat = []
    # with a full path this would be open('D:\\testSet.txt','r'); the mode defaults to read-only
    fr = open('testSet.txt')
    # readlines() reads the whole file at once and splits the text into a list of lines,
    # which can then be traversed with Python's for...in... construct (one line at a time)
    for line in fr.readlines():
        # strip() removes leading/trailing whitespace such as spaces and tabs
        # split() splits the line on whitespace (here, tabs)
        lineArr = line.strip().split()
        # prepend the constant feature x0 = 1 to every row
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
        labelMat.append(int(lineArr[2]))
    return dataMat, labelMat
    
def sigmoid(inX): # define the sigmoid function
    return 1.0/(1 + exp(-inX))

def gradAscent(dataMatIn, classLabels):
    dataMatrix = mat(dataMatIn)             # convert to NumPy's matrix type
    labelMat = mat(classLabels).transpose() # transpose() turns the row vector into a column vector
    m,n = shape(dataMatrix)                 # get the matrix dimensions
    alpha = 0.001                           # step size
    maxCycles = 500                         # number of iterations
    weights = ones((n,1))                   # initialize the weights
    for k in range(maxCycles):              # heavy on matrix operations
        h = sigmoid(dataMatrix*weights)     # the logistic regression hypothesis
        error = (labelMat - h)
        weights = weights + alpha * dataMatrix.transpose()* error # update the weights
    return weights
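For reference, each line of testSet.txt holds two feature values and a 0/1 class label, separated by whitespace; a line looks roughly like this (values shown only to illustrate the format):

-0.017612	14.053064	0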

Enter the following commands in the Python shell:

>>> import logRegres
>>> dataArr,labelMat = logRegres.loadDataSet()
>>> weights = logRegres.gradAscent(dataArr, labelMat) # the book omits "weights =" here

This yields the following result after 500 iterations:

matrix([[ 4.12414349],
        [ 0.48007329],
        [-0.6168482 ]])

Once we have the weights, we can plot them to get an intuitive feel for the result.

Add the following code to the file:

def plotBestFit(weights):
    import matplotlib.pyplot as plt # alias pyplot as plt for convenience
    dataMat,labelMat=loadDataSet()
    dataArr = array(dataMat)
    n = shape(dataArr)[0]
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    for i in range(n):
        if int(labelMat[i])== 1: # samples labeled 1
            xcord1.append(dataArr[i,1]); ycord1.append(dataArr[i,2]) # features 1 and 2 go into the lists xcord1 and ycord1
        else:                    # samples labeled 0
            xcord2.append(dataArr[i,1]); ycord2.append(dataArr[i,2]) # features 1 and 2 go into the lists xcord2 and ycord2
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s') # samples labeled 1 drawn in red
    ax.scatter(xcord2, ycord2, s=30, c='green')           # samples labeled 0 drawn in green
    x = arange(-3.0, 3.0, 0.1) # here x plays the role of x1 and y of x2, while x0 = 1
    y = (-weights[0]-weights[1]*x)/weights[2] # draw the separating line 0 = weights[0]*x0 + weights[1]*x1 + weights[2]*x2 in 2-D
    ax.plot(x, y)
    plt.xlabel('X1'); plt.ylabel('X2');
    plt.show()
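Note that gradAscent returns a NumPy matrix, while plotBestFit indexes the weights like a flat array, so convert the matrix with getA() before plotting:

>>> logRegres.plotBestFit(weights.getA())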

This produces a figure like the following:

Below is the Matlab implementation:

function returnVals = sigmoid(inX)
returnVals = 1.0 ./ (1.0 + exp(-inX));
end

The above is the sigmoid function, which the code below uses.

function weight = gradAscend
%%
clc
close all 
clear
%%
data = load('testSet.txt');
[row, col] = size(data);         % get the number of rows and columns
dataMat = data(:, 1:col-1);      % drop the last column of data
dataMat = [ones(row,1) dataMat]; % and prepend a column of ones (x0 = 1)
labelMat = data(:, col);         % the last column of data is the label vector
alpha = 0.001;                   % step size
maxCycle = 500;                  % maximum number of iterations
weight = ones(col, 1);           % initialize the weights
for i = 1:maxCycle
    h = sigmoid(dataMat * weight); % the logistic regression hypothesis
    error = labelMat - h;
    weight = weight + alpha * dataMat' * error;
end

figure
scatter(dataMat(find(labelMat(:) == 0), 2), dataMat(find(labelMat(:) == 0), 3), 3);
hold on
scatter(dataMat(find(labelMat(:) == 1), 2), dataMat(find(labelMat(:) == 1), 3), 5);
hold on 
x = -3:0.1:3;
y = (-weight(1)-weight(2)*x)/weight(3);
plot(x,y)
hold off
end

The result looks like this:


2. Stochastic gradient ascent

Gradient ascent must sweep over the entire data set every time it updates the regression coefficients, so when the data set is large the time and space costs become hard to bear. An alternative is to update the coefficients with only one sample at a time; this method is called stochastic gradient ascent.
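Concretely, instead of the batch update w ← w + α · Xᵀ(y − σ(Xw)), each step uses a single sample (xᵢ, yᵢ):

w ← w + α · (yᵢ − σ(xᵢ · w)) · xᵢ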

Insert the following code into the file:

def stocGradAscent0(dataMatrix, classLabels):
    m,n = shape(dataMatrix)
    alpha = 0.01        # step size of 0.01
    weights = ones(n)   # initialize the weights
    for i in range(m):  # updating with a single sample at a time saves a lot of computation
        h = sigmoid(sum(dataMatrix[i]*weights))
        error = classLabels[i] - h
        weights = weights + alpha * error * dataMatrix[i]
    return weights

Enter the following commands at the Python prompt:

>>> reload(logRegres)

>>> from numpy import *  # needed for array() below
>>> dataArr,labelMat=logRegres.loadDataSet()
>>> weights=logRegres.stocGradAscent0(array(dataArr),labelMat)
>>> logRegres.plotBestFit(weights)

This produces the following figure:

The Matlab version is as follows:

function stocGradAscent
%%
%
% Description : Logistic regression using stocGradAscent
% Author : Liulongpo
% Time:2015-4-18 10:57:25
%
%%
clc
clear 
close all
%%
data = load('testSet.txt');
[row , col] = size(data);
dataMat = [ones(row,1) data(:,1:col-1)];
alpha = 0.01;
labelMat = data(:,col);
weight = ones(col,1);
for i = 1:row
    h = sigmoid(dataMat(i,:)*weight);
    error = labelMat(i) - h;
    weight = weight + alpha * error * dataMat(i,:)'; % semicolon added so the weights are not echoed on every iteration
end
figure
scatter(dataMat(find(labelMat(:)==0),2),dataMat(find(labelMat(:)==0),3),5);
hold on
scatter(dataMat(find(labelMat(:) == 1),2),dataMat(find(labelMat(:) == 1),3),5);
hold on
x = -3:0.1:3;
y = -(weight(1)+weight(2)*x)/weight(3);
plot(x,y)
hold off
end

The resulting figure is shown below:

The result does not look very good, because the amount of training is small: just one pass over the data. Below we modify the code and also improve a few other issues:

def stocGradAscent1(dataMatrix, classLabels, numIter=150): # the number of passes can be set by the caller; the default is 150
    m,n = shape(dataMatrix)
    weights = ones(n)   # initialize the weights
    for j in range(numIter):        # pass j
        dataIndex = list(range(m))  # mutable list of the sample indices not yet used in this pass
        for i in range(m):          # update i within pass j
            alpha = 4/(1.0+j+i)+0.0001 # alpha shrinks as the number of updates grows
            randIndex = int(random.uniform(0,len(dataIndex))) # a different randIndex on every iteration of i
            sampleIdx = dataIndex[randIndex] # the book indexes dataMatrix[randIndex] directly, which defeats the sampling-without-replacement intent
            h = sigmoid(sum(dataMatrix[sampleIdx]*weights))
            error = classLabels[sampleIdx] - h
            weights = weights + alpha * error * dataMatrix[sampleIdx]
            del(dataIndex[randIndex])
    return weights

An important improvement is that alpha is no longer a fixed value; it shrinks as the number of updates grows, with 0.0001 as its lower bound.

Another improvement is that each pass no longer visits the samples in a fixed order, which avoids periodic oscillations in the weights.
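To get a feel for how quickly the step size decays, here is a tiny check (a sketch; the (j, i) pairs are picked arbitrarily):

for j, i in [(0, 0), (1, 0), (10, 5), (149, 99)]:
    # j is the pass number, i the update within the pass
    print "j=%d, i=%d, alpha=%.4f" % (j, i, 4/(1.0+j+i)+0.0001)

This prints alpha values of roughly 4.0001, 2.0001, 0.2501, and 0.0162, confirming that the schedule decays toward the 0.0001 floor.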

Below is the figure after 150 passes:


As the figures show, stochastic gradient ascent converges faster than batch gradient ascent, reaching a comparable decision boundary with less computation.

Below is the Matlab version:

function ImproveStocGradAscent
%%
clc
clear 
close all
%%
data = load('testSet.txt');
[row , col] = size(data);
dataMat = [ones(row,1) data(:,1:col-1)];
numIter = 150;
labelMat = data(:,col);
weight = ones(col,1);

for j = 1: numIter
    for i = 1:row
        alpha = 4/(1.0+j+i) + 0.0001;
        randIndex = randi(row); % draw a random integer between 1 and row (here, 100)
        h = sigmoid(dataMat(randIndex,:)*weight);
        error = labelMat(randIndex) - h;
        weight = weight + alpha * error * dataMat(randIndex,:)';
    end
end

figure
scatter(dataMat(find(labelMat(:)==0),2),dataMat(find(labelMat(:)==0),3),5);
hold on
scatter(dataMat(find(labelMat(:) == 1),2),dataMat(find(labelMat(:) == 1),3),5);
hold on
x = -3:0.1:3;
y = -(weight(1)+weight(2)*x)/weight(3);
plot(x,y)
hold off
 
end

The result:

3. A practical example: predicting whether a sick horse will survive

Here each horse is described by 21 features:

def classifyVector(inX, weights): # the prediction function
    prob = sigmoid(sum(inX*weights))
    if prob > 0.5: return 1.0
    else: return 0.0

def colicTest():
    frTrain = open('horseColicTraining.txt'); frTest = open('horseColicTest.txt')
    
    trainingSet = []; trainingLabels = []
    for line in frTrain.readlines(): # the training set has 299 rows
        currLine = line.strip().split('\t') # each currLine holds 22 elements
        lineArr =[]
        for i in range(21): # put the first 21 elements of currLine into a list
            lineArr.append(float(currLine[i]))
        trainingSet.append(lineArr) # then append that list to the larger list
        trainingLabels.append(float(currLine[21])) # the last column of the data set is the label column
    
    trainWeights = stocGradAscent1(array(trainingSet), trainingLabels, 1000) # train for 1000 passes
    
    errorCount = 0; numTestVec = 0.0
    for line in frTest.readlines(): # the test set has 67 samples
        numTestVec += 1.0 # starts at 0 and is incremented once per test sample
        currLine = line.strip().split('\t')
        lineArr =[]
        for i in range(21):
            lineArr.append(float(currLine[i])) # build the feature list for each test sample
        if int(classifyVector(array(lineArr), trainWeights))!= int(currLine[21]): # if the prediction differs from the true label
            errorCount += 1 # count one more error
    errorRate = (float(errorCount)/numTestVec)
    print "the error rate of this test is: %f" % errorRate
    return errorRate

def multiTest():
    numTests = 10; errorSum=0.0
    for k in range(numTests): # run the test 10 times and average
        errorSum += colicTest()
    print "after %d iterations the average error rate is: %f" % (numTests, errorSum/float(numTests))

The output looks like this:

>>> logRegres.multiTest()
logRegres.py:19: RuntimeWarning: overflow encountered in exp
  return 1.0/(1+exp(-inX))
the error rate of this test is: 0.328358
the error rate of this test is: 0.268657
the error rate of this test is: 0.313433
the error rate of this test is: 0.388060
the error rate of this test is: 0.268657
the error rate of this test is: 0.358209
the error rate of this test is: 0.343284
the error rate of this test is: 0.268657
the error rate of this test is: 0.432836
the error rate of this test is: 0.313433
after 10 iterations the average error rate is: 0.328358
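The RuntimeWarning above comes from exp(-inX) overflowing when inX is a large negative number; NumPy still returns the saturated value (the sigmoid underflows to 0), so the warning is harmless here. If you want to silence it, a minimal sketch is to clip the argument before exponentiating (the bound of 500 is an arbitrary choice; this helper is not part of the book's code):

def sigmoidClipped(inX): # hypothetical, numerically safer variant of sigmoid
    return 1.0/(1 + exp(-clip(inX, -500, 500)))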
