Naive Bayes (Naive Bayesian)

  • Introduction

The Naive Bayesian algorithm is also called the Naive Bayes algorithm (or, informally, "idiot's Bayes" classification).

Naive ("simple"): the feature conditional independence assumption.

Bayes: based on Bayes' theorem.

The algorithm really is quite naive (simple). It is a supervised learning method, commonly used to find decision surfaces.

 

  • Basic Idea

(1) A patient-classification example

There are six patients, whose records are as follows:

Symptom     Occupation            Diagnosis
sneezing    nurse                 cold
sneezing    farmer                allergy
headache    construction worker   concussion
headache    construction worker   cold
sneezing    teacher               cold
headache    teacher               concussion


According to this table, if a seventh patient arrives who is a construction worker and is sneezing,

what is the probability that he has a cold?

By Bayes' theorem:

P(A|B) = P(B|A) P(A) / P(B)

we can write:

P(cold|sneezing x construction worker) = P(sneezing x construction worker|cold) x P(cold) / P(sneezing x construction worker)

Assuming that "sneezing" and "construction worker" are conditionally independent given the illness, the equation above becomes:

P(cold|sneezing x construction worker) = P(sneezing|cold) x P(construction worker|cold) x P(cold) / ( P(sneezing) x P(construction worker) )
P(cold|sneezing x construction worker) = 2/3 x 1/3 x 1/2 / ( 1/2 x 1/3 ) = 2/3

Therefore, the probability that this sneezing construction worker has a cold is about 66%.
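
To make the arithmetic concrete, here is a minimal sketch (our own illustration, not from the original post) that recomputes these probabilities directly from the six-patient table:

# six records of (symptom, occupation, diagnosis)
records = [
    ("sneezing", "nurse", "cold"),
    ("sneezing", "farmer", "allergy"),
    ("headache", "construction worker", "concussion"),
    ("headache", "construction worker", "cold"),
    ("sneezing", "teacher", "cold"),
    ("headache", "teacher", "concussion"),
]

n = len(records)
colds = [r for r in records if r[2] == "cold"]

p_cold = len(colds) / n                                              # P(cold) = 1/2
p_sneeze_cold = sum(r[0] == "sneezing" for r in colds) / len(colds)  # P(sneezing|cold) = 2/3
p_worker_cold = sum(r[1] == "construction worker" for r in colds) / len(colds)  # P(worker|cold) = 1/3
p_sneeze = sum(r[0] == "sneezing" for r in records) / n              # P(sneezing) = 1/2
p_worker = sum(r[1] == "construction worker" for r in records) / n   # P(worker) = 1/3

print(p_sneeze_cold * p_worker_cold * p_cold / (p_sneeze * p_worker))  # 0.666... ~ 66%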

 

(2) The Naive Bayes classifier formula

Suppose an instance has n features, F1, F2, ..., Fn, and that there are m classes, C1, C2, ..., Cm. The Bayes classifier outputs the class with the highest probability, i.e. it finds the maximum of:

P(C|F1 x F2 ...Fn) = P(F1 x F2 ... Fn|C) x P(C) / P(F1 x F2 ... Fn)

Since P(F1 x F2 ... Fn) is the same for every class, it can be omitted, and the problem reduces to finding the class C that maximizes

P(F1 x F2 ... Fn|C) x P(C)

By the "naive" property of Naive Bayes (the feature conditional independence assumption), this becomes:

P(F1 x F2 ... Fn|C) x P(C) = P(F1|C) x P(F2|C) x ... x P(Fn|C) x P(C)

Every factor on the right-hand side of the equation can be estimated from the training statistics, so the probability of each class can be computed and the class with the largest probability chosen.
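
Written as code, this decision rule is just an argmax over classes. The sketch below (our own illustration, assuming categorical features and no smoothing of zero counts) estimates every factor by counting:

from collections import Counter

def naive_bayes_predict(samples, labels, features):
    # pick the class C maximizing P(F1|C) x P(F2|C) x ... x P(Fn|C) x P(C)
    class_counts = Counter(labels)
    total = len(labels)
    best_class, best_score = None, -1.0
    for c, count in class_counts.items():
        rows = [s for s, l in zip(samples, labels) if l == c]
        score = count / total                              # prior P(C)
        for i, f in enumerate(features):
            score *= sum(r[i] == f for r in rows) / count  # likelihood P(Fi|C)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

Applied to the patient table above, naive_bayes_predict([r[:2] for r in records], [r[2] for r in records], ("sneezing", "construction worker")) returns "cold", matching the hand calculation.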

 

  • Code Implementation

Environment: macOS Mojave 10.14.3

Python  3.7.0

Library: scikit-learn 0.19.2

 

Install scikit-learn by entering the following in a terminal:

pip install scikit-learn

Official scikit-learn documentation: http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html

>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> Y = np.array([1, 1, 1, 2, 2, 2])
#create six training points; the first three belong to label (class) 1, the last three to label (class) 2
>>> from sklearn.naive_bayes import GaussianNB
#import the classifier from the external sklearn module
>>> clf = GaussianNB()  #create a Gaussian Naive Bayes classifier and assign it to clf
>>> clf.fit(X, Y)  #start training
#fit learns the patterns in the data, turning clf into a trained classifier
#we call fit with two arguments: the features X and the labels Y
#finally, we ask the trained classifier to predict a new point, [-0.8, -1]
>>> print(clf.predict([[-0.8, -1]]))
[1]

The workflow above is: create the training points -> create the classifier -> train -> classify new data.

The classifier assigns the new point to label (class) 1.
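
As an aside (not in the original post), GaussianNB also exposes predict_proba, which returns the posterior probability of each class instead of just the predicted label:

>>> clf.predict_proba([[-0.8, -1]])  #array of shape (1, 2): [P(label 1), P(label 2)] for the new point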

 

  • Drawing the Decision Surface

Given a scatter plot in which blue points mark the fast region and red points mark the slow region, how do we draw a boundary that separates the two classes of points?

prep_terrain_data.py

Generates the training points

import random


def makeTerrainData(n_points=1000):
###############################################################################
### make the toy dataset
    random.seed(42)
    grade = [random.random() for ii in range(0,n_points)]
    bumpy = [random.random() for ii in range(0,n_points)]
    error = [random.random() for ii in range(0,n_points)]
    y = [round(grade[ii]*bumpy[ii]+0.3+0.1*error[ii]) for ii in range(0,n_points)]
    for ii in range(0, len(y)):
        if grade[ii]>0.8 or bumpy[ii]>0.8:
            y[ii] = 1.0

### split into train/test sets
    X = [[gg, ss] for gg, ss in zip(grade, bumpy)]
    split = int(0.75*n_points)
    X_train = X[0:split]
    X_test  = X[split:]
    y_train = y[0:split]
    y_test  = y[split:]

    grade_sig = [X_train[ii][0] for ii in range(0, len(X_train)) if y_train[ii]==0]
    bumpy_sig = [X_train[ii][1] for ii in range(0, len(X_train)) if y_train[ii]==0]
    grade_bkg = [X_train[ii][0] for ii in range(0, len(X_train)) if y_train[ii]==1]
    bumpy_bkg = [X_train[ii][1] for ii in range(0, len(X_train)) if y_train[ii]==1]

#    training_data = {"fast":{"grade":grade_sig, "bumpiness":bumpy_sig}
#            , "slow":{"grade":grade_bkg, "bumpiness":bumpy_bkg}}


    grade_sig = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii]==0]
    bumpy_sig = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii]==0]
    grade_bkg = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii]==1]
    bumpy_bkg = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii]==1]

    test_data = {"fast":{"grade":grade_sig, "bumpiness":bumpy_sig}
            , "slow":{"grade":grade_bkg, "bumpiness":bumpy_bkg}}

    return X_train, y_train, X_test, y_test
#    return training_data, test_data

 

ClassifyNB.py

Gaussian Naive Bayes classification

def classify(features_train, labels_train):
    ### import the sklearn module for GaussianNB
    ### create classifier
    ### fit the classifier on the training features and labels
    ### return the fit classifier

    from sklearn.naive_bayes import GaussianNB
    clf = GaussianNB()
    clf.fit(features_train, labels_train)
    return clf

 

class_vis.py

Plotting and saving the image

import warnings
warnings.filterwarnings("ignore")

import matplotlib 
matplotlib.use('agg')

import matplotlib.pyplot as plt
import pylab as pl
import numpy as np


def prettyPicture(clf, X_test, y_test):
    x_min = 0.0; x_max = 1.0
    y_min = 0.0; y_max = 1.0

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    h = .01  # step size in the mesh
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())

    plt.pcolormesh(xx, yy, Z, cmap=pl.cm.seismic)

    # Plot also the test points
    grade_sig = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii]==0]
    bumpy_sig = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii]==0]
    grade_bkg = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii]==1]
    bumpy_bkg = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii]==1]

    plt.scatter(grade_sig, bumpy_sig, color = "b", label="fast")
    plt.scatter(grade_bkg, bumpy_bkg, color = "r", label="slow")
    plt.legend()
    plt.xlabel("grade")
    plt.ylabel("bumpiness")

    plt.savefig("test.png")

 

Main.py

Main program

from prep_terrain_data import makeTerrainData
from class_vis import prettyPicture
from ClassifyNB import classify

import numpy as np
import pylab as pl


features_train, labels_train, features_test, labels_test = makeTerrainData()

### the training data (features_train, labels_train) have both "fast" and "slow" points mixed
### in together--separate them so we can give them different colors in the scatterplot,
### and visually identify them
grade_fast = [features_train[ii][0] for ii in range(0, len(features_train)) if labels_train[ii]==0]
bumpy_fast = [features_train[ii][1] for ii in range(0, len(features_train)) if labels_train[ii]==0]
grade_slow = [features_train[ii][0] for ii in range(0, len(features_train)) if labels_train[ii]==1]
bumpy_slow = [features_train[ii][1] for ii in range(0, len(features_train)) if labels_train[ii]==1]

clf = classify(features_train, labels_train)

### draw the decision boundary with the test points overlaid
prettyPicture(clf, features_test, labels_test)

 

Running this produces the finished classification image:

 

As the image shows, not all points are classified correctly; a small fraction of points are misclassified.

Computing the classification accuracy:

accuracy.py

from class_vis import prettyPicture
from prep_terrain_data import makeTerrainData
from classify import NBAccuracy

import matplotlib.pyplot as plt
import numpy as np
import pylab as pl


features_train, labels_train, features_test, labels_test = makeTerrainData()

def submitAccuracy():
    accuracy = NBAccuracy(features_train, labels_train, features_test, labels_test)
    return accuracy
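
Note that classify.py, which defines NBAccuracy, is not shown in the post. A minimal reconstruction consistent with how it is called above (our assumption, scoring with sklearn's accuracy_score) would be:

# classify.py (hypothetical reconstruction)
def NBAccuracy(features_train, labels_train, features_test, labels_test):
    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import accuracy_score

    # train a Gaussian Naive Bayes classifier, predict on the held-out
    # test set, and return the fraction of correct predictions
    clf = GaussianNB()
    clf.fit(features_train, labels_train)
    pred = clf.predict(features_test)
    return accuracy_score(labels_test, pred)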

 

Add the following snippet at the end of Main.py:

from accuracy import submitAccuracy
print(submitAccuracy())

This yields an accuracy of 0.884.

 

  • Strengths and Weaknesses of Naive Bayes

 Advantages: 1. it is very easy to implement; 2. it handles very large feature spaces well; 3. it is simple and very efficient to run.

 Disadvantage: it is ill-suited to phrases made of multiple words whose combined meaning clearly differs from that of the individual words (e.g. "Chicago Bulls").
