Naive Bayes (Naive Bayesian)

Date: 2019-12-11


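Introduction

The Naive Bayesian algorithm is also called the naive Bayes algorithm (or, jokingly, the "idiot's Bayes" classifier).

Naive: it assumes that the features are conditionally independent given the class.

Bayes: it is based on Bayes' theorem.

The algorithm really is very simple. It is a supervised learning method and is commonly used to find a decision surface (decision boundary).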

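Basic idea

(1) Example: classifying patients

There are six patients, with the following records:

Symptom	Occupation	Diagnosis
Sneezing	Nurse	Cold
Sneezing	Farmer	Allergy
Headache	Construction worker	Concussion
Headache	Construction worker	Cold
Sneezing	Teacher	Cold
Headache	Teacher	Concussion

Now a seventh patient arrives: a construction worker who is sneezing.

Based on the table, what is the probability that he has a cold?

By Bayes' theorem: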

P(A|B) = P(B|A) P(A) / P(B)

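we obtain:

P(cold|sneezing x construction worker) = P(sneezing x construction worker|cold) x P(cold) / P(sneezing x construction worker)

Assume that the two features, "sneezing" and "construction worker", are independent of each other. The equation above then becomes:

P(cold|sneezing x construction worker) = P(sneezing|cold) x P(construction worker|cold) x P(cold) / ( P(sneezing) x P(construction worker) )
P(cold|sneezing x construction worker) = 2/3 x 1/3 x 1/2 / ( 1/2 x 1/3 ) = 2/3

So the probability that this sneezing construction worker has a cold is 2/3, or roughly 67%.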
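
The arithmetic above can be checked by counting rows in the table. The following is a minimal sketch, not part of the original post; the helper names (`prob`, `has_cold`, `is_sneezing`, `is_builder`) are made up for illustration, and exact fractions are used so that the result matches the hand calculation.

```python
from fractions import Fraction

# The six patient records from the table: (symptom, occupation, diagnosis).
records = [
    ("sneezing", "nurse", "cold"),
    ("sneezing", "farmer", "allergy"),
    ("headache", "construction worker", "concussion"),
    ("headache", "construction worker", "cold"),
    ("sneezing", "teacher", "cold"),
    ("headache", "teacher", "concussion"),
]

def has_cold(r):
    return r[2] == "cold"

def is_sneezing(r):
    return r[0] == "sneezing"

def is_builder(r):
    return r[1] == "construction worker"

def prob(match, given=lambda r: True):
    """Empirical P(match | given), estimated by counting rows."""
    pool = [r for r in records if given(r)]
    return Fraction(sum(1 for r in pool if match(r)), len(pool))

# P(sneezing|cold) x P(construction worker|cold) x P(cold)
numerator = prob(is_sneezing, has_cold) * prob(is_builder, has_cold) * prob(has_cold)
# P(sneezing) x P(construction worker)
denominator = prob(is_sneezing) * prob(is_builder)
posterior = numerator / denominator
print(posterior)  # 2/3
```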

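(2) The naive Bayes classifier formula

Suppose an individual has n features, F1, F2, ..., Fn, and there are m classes, C1, C2, ..., Cm. The Bayes classifier picks the class with the highest probability, that is, the class C that maximizes the following expression: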

P(C|F1 x F2 ...Fn) = P(F1 x F2 ... Fn|C) x P(C) / P(F1 x F2 ... Fn)

Because P(F1 x F2 ... Fn) is the same for every class, it can be dropped, and the problem becomes finding the maximum of

P(F1 x F2 ... Fn|C)P(C)

over all classes C.

By the "naive" assumption (the features are conditionally independent given the class), we have:

P(F1 x F2 ... Fn|C)P(C) = P(F1|C) x P(F2|C) ... P(Fn|C)P(C)

Every term on the right-hand side can be estimated from the data, so we can compute the score for each class and pick the class with the largest one.
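
As a rough illustration of this rule, the sketch below estimates P(C) and each P(Fi|C) by counting and returns the class with the largest product. It is an assumption-laden toy, not the post's code: the function names `train` and `predict_class` are hypothetical, and no smoothing is applied, so a feature value never seen with a class drives that class's score to zero.

```python
from collections import Counter, defaultdict
from fractions import Fraction

def train(samples):
    """Count class frequencies and per-class feature values.

    samples is a list of (features, label) pairs, where features is a tuple.
    """
    class_counts = Counter(label for _, label in samples)
    # feature_counts[label][i][value] = rows of that class whose i-th feature is value
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for features, label in samples:
        for i, value in enumerate(features):
            feature_counts[label][i][value] += 1
    return class_counts, feature_counts

def predict_class(features, class_counts, feature_counts):
    """Return (best_class, scores) with score = P(C) x P(F1|C) x ... x P(Fn|C)."""
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = Fraction(n, total)  # P(C)
        for i, value in enumerate(features):
            score *= Fraction(feature_counts[label][i][value], n)  # P(Fi|C)
        scores[label] = score
    return max(scores, key=scores.get), scores

# The patient table from the example: ((symptom, occupation), diagnosis).
samples = [
    (("sneezing", "nurse"), "cold"),
    (("sneezing", "farmer"), "allergy"),
    (("headache", "construction worker"), "concussion"),
    (("headache", "construction worker"), "cold"),
    (("sneezing", "teacher"), "cold"),
    (("headache", "teacher"), "concussion"),
]

class_counts, feature_counts = train(samples)
best, scores = predict_class(("sneezing", "construction worker"),
                             class_counts, feature_counts)
print(best)  # cold
```

Note that the scores are unnormalized: the denominator P(F1 x F2 ... Fn) is skipped, which is enough for choosing the most likely class.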

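Code implementation

Environment: macOS Mojave 10.14.3

Python 3.7.0

Library: scikit-learn 0.19.2

Run the following command in a terminal to install scikit-learn (the package is imported as sklearn, but its name on PyPI is scikit-learn):

pip install scikit-learn

Official scikit-learn documentation for GaussianNB: http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html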

>>> import numpy as np
>>> # Six training points: the first three have label (class) 1, the last three label 2
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> Y = np.array([1, 1, 1, 2, 2, 2])
>>> from sklearn.naive_bayes import GaussianNB  # import the classifier class
>>> clf = GaussianNB()  # create a Gaussian naive Bayes classifier and bind it to clf
>>> # Train by calling fit with two arguments: the features X and the labels Y.
>>> # The classifier learns the patterns in the training data.
>>> clf.fit(X, Y)
>>> # Finally, ask the trained classifier to predict a new point, [-0.8, -1]
>>> print(clf.predict([[-0.8, -1]]))
[1]

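The workflow above is: create the training points -> create the classifier -> train it -> classify new data

The new point above is assigned to label (class) 1, as the output [1] shows.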
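
To make the idea behind a Gaussian classifier more concrete, here is a from-scratch sketch in plain Python. It is not scikit-learn's implementation: the function names `fit_gaussian_nb` and `predict_one` and the small `eps` added to each variance are assumptions of this example. For each class it estimates a prior plus a mean and variance per feature, then picks the class with the largest log-probability.

```python
import math

def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate, per class, the prior and each feature's mean and variance."""
    model = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # eps keeps every variance strictly positive (an assumption of this sketch)
        variances = [sum((v - m) ** 2 for v in col) / n + eps
                     for col, m in zip(columns, means)]
        model[label] = (n / len(y), means, variances)
    return model

def predict_one(model, x):
    """Pick the class maximizing log P(C) + sum of log N(x_i; mean_i, var_i)."""
    def log_score(prior, means, variances):
        score = math.log(prior)
        for v, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(model, key=lambda label: log_score(*model[label]))

# Same training points as in the scikit-learn session above.
X = [[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]
y = [1, 1, 1, 2, 2, 2]
model = fit_gaussian_nb(X, y)
print(predict_one(model, [-0.8, -1]))  # 1
```

Working with logarithms turns the product of per-feature likelihoods into a sum, which avoids numerical underflow when there are many features.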

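Drawing the decision surface

Given a scatter plot in which the blue points are the fast region and the red points are the slow region, how do we draw a line that separates the two groups?

prep_terrain_data.py

Generates the training points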

import random


def makeTerrainData(n_points=1000):
###############################################################################
### make the toy dataset
    random.seed(42)
    grade = [random.random() for ii in range(0,n_points)]
    bumpy = [random.random() for ii in range(0,n_points)]
    error = [random.random() for ii in range(0,n_points)]
    y = [round(grade[ii]*bumpy[ii]+0.3+0.1*error[ii]) for ii in range(0,n_points)]
    for ii in range(0, len(y)):
        if grade[ii]>0.8 or bumpy[ii]>0.8:
            y[ii] = 1.0

### split into train/test sets
    X = [[gg, ss] for gg, ss in zip(grade, bumpy)]
    split = int(0.75*n_points)
    X_train = X[0:split]
    X_test  = X[split:]
    y_train = y[0:split]
    y_test  = y[split:]

    grade_sig = [X_train[ii][0] for ii in range(0, len(X_train)) if y_train[ii]==0]
    bumpy_sig = [X_train[ii][1] for ii in range(0, len(X_train)) if y_train[ii]==0]
    grade_bkg = [X_train[ii][0] for ii in range(0, len(X_train)) if y_train[ii]==1]
    bumpy_bkg = [X_train[ii][1] for ii in range(0, len(X_train)) if y_train[ii]==1]

#    training_data = {"fast":{"grade":grade_sig, "bumpiness":bumpy_sig}
#            , "slow":{"grade":grade_bkg, "bumpiness":bumpy_bkg}}


    grade_sig = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii]==0]
    bumpy_sig = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii]==0]
    grade_bkg = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii]==1]
    bumpy_bkg = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii]==1]

    test_data = {"fast":{"grade":grade_sig, "bumpiness":bumpy_sig}
            , "slow":{"grade":grade_bkg, "bumpiness":bumpy_bkg}}

    return X_train, y_train, X_test, y_test
#    return training_data, test_data

ClassifyNB.py

Gaussian classifier

def classify(features_train, labels_train):
    """Train a Gaussian naive Bayes classifier and return the fitted model."""
    # import the sklearn module for GaussianNB
    from sklearn.naive_bayes import GaussianNB

    # create the classifier
    clf = GaussianNB()
    # fit the classifier on the training features and labels
    clf.fit(features_train, labels_train)
    # return the fitted classifier (prediction is left to the caller)
    return clf

class_vis.py

Plots and saves the figure

import warnings
warnings.filterwarnings("ignore")

import matplotlib 
matplotlib.use('agg')

import matplotlib.pyplot as plt
import pylab as pl
import numpy as np

#import numpy as np
#import matplotlib.pyplot as plt
#plt.ioff()

def prettyPicture(clf, X_test, y_test):
    x_min = 0.0; x_max = 1.0
    y_min = 0.0; y_max = 1.0

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    h = .01  # step size in the mesh
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())

    plt.pcolormesh(xx, yy, Z, cmap=pl.cm.seismic)

    # Plot also the test points
    grade_sig = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii]==0]
    bumpy_sig = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii]==0]
    grade_bkg = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii]==1]
    bumpy_bkg = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii]==1]

    plt.scatter(grade_sig, bumpy_sig, color = "b", label="fast")
    plt.scatter(grade_bkg, bumpy_bkg, color = "r", label="slow")
    plt.legend()
    plt.xlabel("grade")
    plt.ylabel("bumpiness")

    plt.savefig("test.png")

Main.py

Main program

from prep_terrain_data import makeTerrainData
from class_vis import prettyPicture
from ClassifyNB import classify

import numpy as np
import pylab as pl


features_train, labels_train, features_test, labels_test = makeTerrainData()

### the training data (features_train, labels_train) have both "fast" and "slow" points mixed
### in together--separate them so we can give them different colors in the scatterplot,
### and visually identify them
grade_fast = [features_train[ii][0] for ii in range(0, len(features_train)) if labels_train[ii]==0]
bumpy_fast = [features_train[ii][1] for ii in range(0, len(features_train)) if labels_train[ii]==0]
grade_slow = [features_train[ii][0] for ii in range(0, len(features_train)) if labels_train[ii]==1]
bumpy_slow = [features_train[ii][1] for ii in range(0, len(features_train)) if labels_train[ii]==1]

clf = classify(features_train, labels_train)

### draw the decision boundary with the test points overlaid
prettyPicture(clf, features_test, labels_test)

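Running it produces the classification plot (saved as test.png):

You can see that not every point is classified correctly; a small number of points are misclassified.

Computing the classification accuracy: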

accuracy.py

from class_vis import prettyPicture
from prep_terrain_data import makeTerrainData
from classify import NBAccuracy

import matplotlib.pyplot as plt
import numpy as np
import pylab as pl


features_train, labels_train, features_test, labels_test = makeTerrainData()

def submitAccuracy():
    accuracy = NBAccuracy(features_train, labels_train, features_test, labels_test)
    return accuracy

Append the following to the end of the main program, Main.py:

from accuracy import submitAccuracy
print(submitAccuracy())

The resulting accuracy is 0.884
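
The NBAccuracy function imported by accuracy.py is not shown in this post. A plausible sketch, assuming it lives in a classify.py module and simply trains GaussianNB and measures the share of correct test predictions, might look like this. The helper `fraction_correct` is a hypothetical name introduced here, and the body of NBAccuracy is a guess at the missing code rather than the author's version.

```python
def fraction_correct(predictions, labels):
    """Share of predictions that equal the true labels."""
    hits = sum(1 for p, t in zip(predictions, labels) if p == t)
    return hits / len(labels)

def NBAccuracy(features_train, labels_train, features_test, labels_test):
    """Hypothetical reconstruction: train GaussianNB, then score it on the test set."""
    # imported inside the function, in the same way as ClassifyNB.py does
    from sklearn.naive_bayes import GaussianNB

    clf = GaussianNB()
    clf.fit(features_train, labels_train)
    predictions = clf.predict(features_test)
    return fraction_correct(predictions, labels_test)

# Quick check of the helper on made-up labels: 3 of 4 predictions match.
print(fraction_correct([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```

scikit-learn also provides this measure directly: clf.score(features_test, labels_test) returns the mean accuracy on the given test data.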

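Strengths and weaknesses of naive Bayes

Strengths: 1. It is very easy to implement. 2. It copes well with very large feature spaces. 3. It is simple to run and very efficient.

Weakness: it can break down on phrases made of several words whose combined meaning is clearly different from that of the individual words (e.g. "Chicago Bulls"), because each word is treated as an independent feature.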
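
One way to see this weakness: under the independence assumption a phrase's likelihood is just the product of its per-word probabilities, so word order and word combinations carry no information. The sketch below uses made-up toy probabilities (the dictionary `p_word_given_sports` and the function `likelihood` are hypothetical) purely to show that "Chicago Bulls" and "bulls Chicago" get the same score.

```python
from fractions import Fraction

# Hypothetical per-word probabilities P(word | sports); toy values for illustration only.
p_word_given_sports = {"chicago": Fraction(1, 10), "bulls": Fraction(1, 20)}

def likelihood(words, p_word):
    """Naive Bayes likelihood: the product of independent per-word probabilities."""
    result = Fraction(1)
    for w in words:
        result *= p_word[w]
    return result

a = likelihood(["chicago", "bulls"], p_word_given_sports)
b = likelihood(["bulls", "chicago"], p_word_given_sports)
print(a == b)  # True
```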
