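The Naive Bayes algorithm (also called the naive Bayes classifier, or informally "idiot's Bayes")
Naive: it assumes the features are conditionally independent given the class
Bayes: it is based on Bayes' theorem
The algorithm really is quite naive. It is a supervised learning method and is commonly used to find a decision surface.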
(1) Example: classifying patients
There are six patients, with the following records:
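Symptom | Occupation | Disease |
Sneezing | Nurse | Cold |
Sneezing | Farmer | Allergy |
Headache | Construction worker | Concussion |
Headache | Construction worker | Cold |
Sneezing | Teacher | Cold |
Headache | Teacher | Concussion |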
Given this table, suppose a seventh patient arrives: a construction worker who is sneezing.
What is the probability that he has a cold?
By Bayes' theorem:
P(A|B) = P(B|A) P(A) / P(B)
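This gives (writing "sneezing x construction worker" for both features occurring together):
P(cold|sneezing x construction worker) = P(sneezing x construction worker|cold) x P(cold) / P(sneezing x construction worker)
Assume the two features, sneezing and construction worker, are independent of each other, both given the disease and overall. The equation then becomes:
P(cold|sneezing x construction worker) = P(sneezing|cold) x P(construction worker|cold) x P(cold) / ( P(sneezing) x P(construction worker) )
Reading the values off the table: P(sneezing|cold) = 2/3, P(construction worker|cold) = 1/3, P(cold) = 3/6 = 1/2, P(sneezing) = 3/6 = 1/2, P(construction worker) = 2/6 = 1/3, so
P(cold|sneezing x construction worker) = 2/3 x 1/3 x 1/2 / ( 1/2 x 1/3 ) = 2/3
So the probability that this sneezing construction worker has a cold is about 67%.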
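The arithmetic above can be checked by counting rows of the table directly. The following is a minimal sketch, not part of the original post; the helper names (prob, has_cold, is_sneezing, is_builder) are made up for illustration, and exact fractions are used so the result is not blurred by rounding.

```python
from fractions import Fraction

# The six patient records from the table: (symptom, occupation, disease)
patients = [
    ("sneezing", "nurse", "cold"),
    ("sneezing", "farmer", "allergy"),
    ("headache", "construction worker", "concussion"),
    ("headache", "construction worker", "cold"),
    ("sneezing", "teacher", "cold"),
    ("headache", "teacher", "concussion"),
]


def prob(match, given=lambda row: True):
    """Fraction of rows satisfying `match` among the rows satisfying `given`."""
    pool = [row for row in patients if given(row)]
    hits = [row for row in pool if match(row)]
    return Fraction(len(hits), len(pool))


# Hypothetical helper predicates, one per event in the formula
def has_cold(row):
    return row[2] == "cold"


def is_sneezing(row):
    return row[0] == "sneezing"


def is_builder(row):
    return row[1] == "construction worker"


# P(sneezing|cold) x P(construction worker|cold) x P(cold)
numerator = prob(is_sneezing, has_cold) * prob(is_builder, has_cold) * prob(has_cold)
# P(sneezing) x P(construction worker), using the independence assumption
denominator = prob(is_sneezing) * prob(is_builder)
posterior = numerator / denominator

print(posterior)  # 2/3
```

Note that the denominator relies on the extra assumption that the two features are independent overall, exactly as in the hand calculation.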
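(2) The naive Bayes classifier formula
Suppose an individual has n features F1, F2, ..., Fn, and there are m classes C1, C2, ..., Cm. A Bayes classifier picks the class with the highest probability, that is, the class C that maximizes: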
P(C|F1 x F2 ...Fn) = P(F1 x F2 ... Fn|C) x P(C) / P(F1 x F2 ... Fn)
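Since P(F1 x F2 ... Fn) is the same for every class, it can be dropped, and the problem becomes finding the maximum of
P(F1 x F2 ... Fn|C)P(C)
By the naive assumption (the features are conditionally independent given the class):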
P(F1 x F2 ... Fn|C)P(C) = P(F1|C) x P(F2|C) ... P(Fn|C)P(C)
Every term on the right-hand side can be estimated from the data, so we can compute this score for each class and pick the class with the largest one.
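To make the decision rule concrete, here is a small sketch that estimates P(C) and each P(Fi|C) by counting and then picks the class with the largest product. It is an illustration under stated assumptions, not the post's own code: the function names (train, scores, predict_class) are hypothetical, the data is the patient table from part (1), and no smoothing is applied, so a feature value never seen with a class gives that class a score of zero.

```python
from collections import Counter, defaultdict
from fractions import Fraction


def train(rows):
    """rows: list of (features_tuple, label). Count classes and feature values."""
    class_counts = Counter(label for _, label in rows)
    # (feature index, label) -> Counter of the values seen for that feature
    feature_counts = defaultdict(Counter)
    for features, label in rows:
        for i, value in enumerate(features):
            feature_counts[(i, label)][value] += 1
    return class_counts, feature_counts


def scores(features, class_counts, feature_counts):
    """Return P(F1|C) x ... x P(Fn|C) x P(C) for every class C."""
    total = sum(class_counts.values())
    result = {}
    for label, n_label in class_counts.items():
        score = Fraction(n_label, total)  # P(C)
        for i, value in enumerate(features):
            # P(Fi|C), estimated as a relative frequency within class C
            score *= Fraction(feature_counts[(i, label)][value], n_label)
        result[label] = score
    return result


def predict_class(features, class_counts, feature_counts):
    """Pick the class whose score is largest."""
    s = scores(features, class_counts, feature_counts)
    return max(s, key=s.get)


rows = [
    (("sneezing", "nurse"), "cold"),
    (("sneezing", "farmer"), "allergy"),
    (("headache", "construction worker"), "concussion"),
    (("headache", "construction worker"), "cold"),
    (("sneezing", "teacher"), "cold"),
    (("headache", "teacher"), "concussion"),
]
class_counts, feature_counts = train(rows)
new_patient = ("sneezing", "construction worker")

print(scores(new_patient, class_counts, feature_counts)["cold"])  # 1/9
print(predict_class(new_patient, class_counts, feature_counts))   # cold
```

The scores are not probabilities because the shared denominator was dropped; only their ranking matters for choosing the class.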
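Environment: macOS Mojave 10.14.3
Python 3.7.0
Library: scikit-learn 0.19.2
Install scikit-learn by running the following in a terminal (the package is named scikit-learn on PyPI and is imported as sklearn):
pip install scikit-learn
Official scikit-learn documentation for GaussianNB: http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html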
>>> import numpy as np
>>> # Six training points: the first three belong to label (class) 1, the last three to label (class) 2
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> Y = np.array([1, 1, 1, 2, 2, 2])
>>> from sklearn.naive_bayes import GaussianNB  # import the classifier class
>>> clf = GaussianNB()  # create a Gaussian naive Bayes classifier and bind it to clf
>>> # Train: fit takes two arguments, the features X and the labels Y,
>>> # and learns the patterns that link them
>>> clf.fit(X, Y)
>>> # Finally, ask the trained classifier to predict the label of a new point, [-0.8, -1]
>>> print(clf.predict([[-0.8, -1]]))
[1]
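The flow above is: create training points -> create a classifier -> train it -> classify new data
The new point above is assigned label (class) 1, as the printed [1] shows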
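To show roughly what happens inside fit and predict, the sketch below implements a simplified Gaussian naive Bayes in plain Python on the same six points. It is an illustration only, not scikit-learn's implementation: the names fit_gaussian_nb, log_score and predict are hypothetical, and the small eps added to each variance is an assumption made here to avoid division by zero.

```python
import math


def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate a prior plus a per-feature mean and variance for each class."""
    model = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        n_features = len(rows[0])
        means = [sum(r[j] for r in rows) / len(rows) for j in range(n_features)]
        variances = [
            sum((r[j] - means[j]) ** 2 for r in rows) / len(rows) + eps
            for j in range(n_features)
        ]
        model[label] = (len(rows) / len(X), means, variances)
    return model


def log_score(x, prior, means, variances):
    """log P(C) plus the sum over features of the log Gaussian density."""
    total = math.log(prior)
    for value, mean, var in zip(x, means, variances):
        total += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
    return total


def predict(model, x):
    """Return the label with the highest log score for point x."""
    return max(model, key=lambda label: log_score(x, *model[label]))


X = [[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]
Y = [1, 1, 1, 2, 2, 2]
model = fit_gaussian_nb(X, Y)
print(predict(model, [-0.8, -1]))  # 1
```

Working with log scores instead of raw products is a common choice because multiplying many small densities can underflow.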
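Given a scatter plot in which blue points mark the fast region and red points mark the slow region, how do we draw a line that separates the points?
prep_terrain_data.py
Generates the training and test points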
import random


def makeTerrainData(n_points=1000):
    ### make the toy dataset
    random.seed(42)
    grade = [random.random() for ii in range(0, n_points)]
    bumpy = [random.random() for ii in range(0, n_points)]
    error = [random.random() for ii in range(0, n_points)]
    y = [round(grade[ii]*bumpy[ii]+0.3+0.1*error[ii]) for ii in range(0, n_points)]
    for ii in range(0, len(y)):
        if grade[ii] > 0.8 or bumpy[ii] > 0.8:
            y[ii] = 1.0

    ### split into train/test sets
    X = [[gg, ss] for gg, ss in zip(grade, bumpy)]
    split = int(0.75*n_points)
    X_train = X[0:split]
    X_test = X[split:]
    y_train = y[0:split]
    y_test = y[split:]

    grade_sig = [X_train[ii][0] for ii in range(0, len(X_train)) if y_train[ii] == 0]
    bumpy_sig = [X_train[ii][1] for ii in range(0, len(X_train)) if y_train[ii] == 0]
    grade_bkg = [X_train[ii][0] for ii in range(0, len(X_train)) if y_train[ii] == 1]
    bumpy_bkg = [X_train[ii][1] for ii in range(0, len(X_train)) if y_train[ii] == 1]

    # training_data = {"fast": {"grade": grade_sig, "bumpiness": bumpy_sig},
    #                  "slow": {"grade": grade_bkg, "bumpiness": bumpy_bkg}}

    grade_sig = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii] == 0]
    bumpy_sig = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii] == 0]
    grade_bkg = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii] == 1]
    bumpy_bkg = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii] == 1]

    test_data = {"fast": {"grade": grade_sig, "bumpiness": bumpy_sig},
                 "slow": {"grade": grade_bkg, "bumpiness": bumpy_bkg}}

    return X_train, y_train, X_test, y_test
    # return training_data, test_data
ClassifyNB.py
Gaussian classification
def classify(features_train, labels_train):
    ### import the sklearn module for GaussianNB
    from sklearn.naive_bayes import GaussianNB

    ### create classifier
    clf = GaussianNB()

    ### fit the classifier on the training features and labels
    clf.fit(features_train, labels_train)

    ### return the fit classifier; the caller runs clf.predict on the test features
    return clf
class_vis.py
Plotting and saving the figure
import warnings
warnings.filterwarnings("ignore")

import matplotlib
matplotlib.use('agg')  # non-interactive backend: render to a file, not a window

import matplotlib.pyplot as plt
import pylab as pl
import numpy as np


def prettyPicture(clf, X_test, y_test):
    x_min = 0.0; x_max = 1.0
    y_min = 0.0; y_max = 1.0

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    h = .01  # step size in the mesh
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())

    plt.pcolormesh(xx, yy, Z, cmap=pl.cm.seismic)

    # Plot also the test points
    grade_sig = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii] == 0]
    bumpy_sig = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii] == 0]
    grade_bkg = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii] == 1]
    bumpy_bkg = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii] == 1]

    plt.scatter(grade_sig, bumpy_sig, color="b", label="fast")
    plt.scatter(grade_bkg, bumpy_bkg, color="r", label="slow")
    plt.legend()
    # Column 0 of each point is grade and column 1 is bumpiness,
    # so grade goes on the x axis and bumpiness on the y axis
    plt.xlabel("grade")
    plt.ylabel("bumpiness")

    plt.savefig("test.png")
Main.py
Main program
from prep_terrain_data import makeTerrainData
from class_vis import prettyPicture
from ClassifyNB import classify

import numpy as np
import pylab as pl

features_train, labels_train, features_test, labels_test = makeTerrainData()

### the training data (features_train, labels_train) have both "fast" and "slow" points mixed
### in together--separate them so we can give them different colors in the scatterplot,
### and visually identify them
grade_fast = [features_train[ii][0] for ii in range(0, len(features_train)) if labels_train[ii] == 0]
bumpy_fast = [features_train[ii][1] for ii in range(0, len(features_train)) if labels_train[ii] == 0]
grade_slow = [features_train[ii][0] for ii in range(0, len(features_train)) if labels_train[ii] == 1]
bumpy_slow = [features_train[ii][1] for ii in range(0, len(features_train)) if labels_train[ii] == 1]

clf = classify(features_train, labels_train)

### draw the decision boundary with the test points overlaid
prettyPicture(clf, features_test, labels_test)
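Running it produces the plot of the classification result (saved as test.png):
As the plot shows, not every point is classified correctly; a small fraction is misclassified
Computing the classification accuracy: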
accuracy.py
from class_vis import prettyPicture
from prep_terrain_data import makeTerrainData
from classify import NBAccuracy  # NBAccuracy comes from a classify module that this post does not show

import matplotlib.pyplot as plt
import numpy as np
import pylab as pl

features_train, labels_train, features_test, labels_test = makeTerrainData()


def submitAccuracy():
    accuracy = NBAccuracy(features_train, labels_train, features_test, labels_test)
    return accuracy
Append the following to the end of the main program, Main.py:
from accuracy import submitAccuracy
print(submitAccuracy())
This prints the accuracy: 0.884
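Accuracy here means the fraction of test points whose predicted label equals the true label. The NBAccuracy function is not shown in this post, so the sketch below only illustrates that final ratio; the function name fraction_correct and the sample labels are made up for the example.

```python
def fraction_correct(predictions, labels):
    """Fraction of predictions that equal the corresponding true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(1 for p, t in zip(predictions, labels) if p == t)
    return correct / len(labels)


# Hypothetical predictions for five test points, four of which are right
pred = [0, 1, 1, 0, 1]
true = [0, 1, 0, 0, 1]
print(fraction_correct(pred, true))  # 0.8
```

With the 250 test points produced by makeTerrainData (25% of 1000), an accuracy of 0.884 corresponds to 221 correctly classified points. In scikit-learn the same number is available from clf.score(features_test, labels_test) or from sklearn.metrics.accuracy_score(labels_test, pred).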
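Advantages: 1. It is very easy to implement 2. It copes with very large feature spaces 3. It is simple to run and efficient
Disadvantage: it does not handle phrases well when they consist of several words whose combined meaning differs markedly from that of the individual words (e.g. "Chicago Bulls")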
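The "Chicago Bulls" problem can be seen in a tiny sketch, assuming text is turned into one feature per word (a bag-of-words representation, which this post does not spell out). The helper bag_of_words is hypothetical. Because each word is counted on its own, word order is lost and the phrase never appears as a feature.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each word separately."""
    return Counter(text.lower().split())


team = bag_of_words("Chicago Bulls")
shuffled = bag_of_words("Bulls Chicago")

# Word order is lost: both texts produce exactly the same features
print(team == shuffled)  # True
# The phrase itself is not a feature, only its individual words are
print("chicago bulls" in team)  # False
print(sorted(team))  # ['bulls', 'chicago']
```

A classifier that sees only these per-word counts treats "Chicago" and "Bulls" as two independent pieces of evidence, which is why the combined meaning of the phrase is missed.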