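Given a training dataset, naive Bayes first learns the joint probability distribution of inputs and outputs under the assumption that the features are conditionally independent. Then, based on this model, for a given input \(x\) it uses Bayes' theorem to find the output \(y\) with the largest posterior probability.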
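Written out, the prediction step is a maximum a posteriori decision. The derivation below is a sketch: \(y_k\) denotes a candidate class, and \(\hat{y}\) is a symbol introduced here for the predicted class. The denominator \(P(x)\) is the same for every class, so it can be dropped from the maximization:

```latex
\begin{aligned}
\hat{y} &= \arg\max_{y_k} P(y_k \mid x) \\
        &= \arg\max_{y_k} \frac{P(x \mid y_k)\,P(y_k)}{P(x)} \\
        &= \arg\max_{y_k} P(x \mid y_k)\,P(y_k)
\end{aligned}
```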
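Feature independence assumption: to make a prediction with Bayes' theorem we need the conditional probability \(P(x|y_k)=P(x_1,x_2,...,x_n|y_k)\). Its number of parameters grows exponentially with the number of features: if the \(i\)-th feature can take \(T_i\) distinct values and there are \(k\) classes, the number of parameters is \(k\prod_{i=1}^nT_i\). Estimating that many parameters is clearly infeasible, so the naive Bayes algorithm assumes that the features are conditionally independent given the class. The assumption is made to simplify the computation.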
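To get a feel for the difference, the sketch below counts table entries with and without the independence assumption. The function names and the example setting (3 classes, 10 features with 5 values each) are made up for illustration, and the counts ignore the small savings from probabilities having to sum to 1.

```python
from math import prod


def joint_param_count(num_classes, values_per_feature):
    """Entries needed to tabulate P(x_1, ..., x_n | y) with no independence assumption."""
    return num_classes * prod(values_per_feature)


def naive_param_count(num_classes, values_per_feature):
    """Entries needed when each P(x_i | y) is tabulated separately."""
    return num_classes * sum(values_per_feature)


# Hypothetical setting: 3 classes, 10 features with 5 possible values each.
k = 3
T = [5] * 10
print(joint_param_count(k, T))  # 3 * 5**10 = 29296875
print(naive_param_count(k, T))  # 3 * 50 = 150
```

With the assumption, the count grows with the sum of the \(T_i\) rather than their product.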
    import numpy as np
    import math
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from collections import Counter
Load the iris classification dataset from sklearn's bundled datasets:

    iris = load_iris()
    X, Y = iris.data, iris.target
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)
    print('X_train[0]: {}'.format(X_train[0]))
    print('Y_train[0]: {}'.format(Y_train[0]))
    # Show how many training samples each class has
    for l in set(Y_train):
        print('label: %s ,count: %d' % (l, len(Y_train[Y_train==l])))
Output (the split is random, so the numbers vary between runs):

    X_train[0]: [5.2 3.5 1.5 0.2]
    Y_train[0]: 0
    label: 0 ,count: 35
    label: 1 ,count: 32
    label: 2 ,count: 38
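Gaussian naive Bayes:

For continuous-valued features, the approach used for discrete features would give a conditional probability of 0 for many feature values, so we use Gaussian naive Bayes instead. It assumes that each feature dimension follows a Gaussian distribution within each class, that is:

\(P(x_i|y_k)=\frac{1}{\sqrt{2\pi\sigma_{y_k,i}^2}}\exp\left(-\frac{(x_i-\mu_{y_k,i})^2}{2\sigma_{y_k,i}^2}\right)\)

where \(\mu_{y_k,i}\) is the mean of the \(i\)-th feature over the samples of class \(y_k\), and \(\sigma_{y_k,i}^2\) is its variance.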
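As a small illustration of how the per-class mean and variance can be estimated, here is a sketch on a made-up toy array. The helper name `class_gaussian_params` and the data are assumptions for this example only. Note that NumPy's `var` and `std` default to the population form (`ddof=0`).

```python
import numpy as np

# Toy data (made up for illustration): 4 samples, 2 features, 2 classes.
X = np.array([[1.0, 10.0],
              [3.0, 14.0],
              [6.0, 20.0],
              [8.0, 24.0]])
Y = np.array([0, 0, 1, 1])


def class_gaussian_params(X, Y):
    """Return {label: (means, variances)}, one value per feature."""
    params = {}
    for label in np.unique(Y):
        samples = X[Y == label]          # rows belonging to this class
        means = samples.mean(axis=0)     # mu_{y_k, i}
        variances = samples.var(axis=0)  # sigma^2_{y_k, i}, population variance (ddof=0)
        params[int(label)] = (means, variances)
    return params


params = class_gaussian_params(X, Y)
print(params[0])  # means and variances for class 0
print(params[1])  # means and variances for class 1
```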
    class GaussianNaiveBayes:

        def __init__(self):
            self.parameters = {}
            self.prior = {}

        # Training amounts to estimating the priors and the Gaussian parameters
        # X: (number of samples, number of features)  Y: (number of samples,)
        def fit(self, X, Y):
            self._get_prior(Y)  # compute the prior probabilities
            labels = set(Y)
            for label in labels:
                samples = X[Y==label]
                # Gaussian parameters: per-feature mean and standard deviation
                means = np.mean(samples, axis=0)
                stds = np.std(samples, axis=0)
                self.parameters[label] = {
                    'means': means,
                    'stds': stds
                }

        # x: a single sample
        def predict(self, x):
            # Sort the classes by score in ascending order and return the last one
            probs = sorted(self._cal_likelihoods(x).items(), key=lambda item: item[-1])
            return probs[-1][0]

        # Compute the accuracy of the model on a test set
        # X_test: (number of test samples, number of features)
        def evaluate(self, X_test, Y_test):
            true_pred = 0
            for i, x in enumerate(X_test):
                label = self.predict(x)
                if label == Y_test[i]:
                    true_pred += 1
            return true_pred / len(X_test)

        # Compute the prior probability of each class
        def _get_prior(self, Y):
            cnt = Counter(Y)
            for label, count in cnt.items():
                self.prior[label] = count / len(Y)

        # Gaussian probability density
        def _gaussian(self, x, mean, std):
            exponent = math.exp(-(math.pow(x - mean, 2)/(2 * math.pow(std, 2))))
            return (1 / (math.sqrt(2 * math.pi) * std)) * exponent

        # For each class, compute the prior times the likelihood of sample x
        # (the unnormalized posterior)
        def _cal_likelihoods(self, x):
            likelihoods = {}
            for label, params in self.parameters.items():
                means = params['means']
                stds = params['stds']
                prob = self.prior[label]
                # Multiply in the conditional density of each feature, P(xi|yk)
                for i in range(len(means)):
                    prob *= self._gaussian(x[i], means[i], stds[i])
                likelihoods[label] = prob
            return likelihoods
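With only four features the product of densities is well behaved, but with many features it can underflow to 0.0 in floating point. A common workaround, which the class above does not use, is to add logarithms instead of multiplying. The following is a minimal standalone sketch of that variant. The names `gaussian_log_pdf` and `log_scores` and the example parameters are hypothetical, and it assumes every standard deviation is strictly positive.

```python
import math


def gaussian_log_pdf(x, mean, std):
    """Log of the Gaussian density with the given mean and standard deviation."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def log_scores(x, priors, parameters):
    """Return {label: log P(y) + sum_i log P(x_i | y)} for a single sample x."""
    scores = {}
    for label, params in parameters.items():
        score = math.log(priors[label])
        for xi, mean, std in zip(x, params['means'], params['stds']):
            score += gaussian_log_pdf(xi, mean, std)
        scores[label] = score
    return scores


# Made-up priors and parameters: two classes, two features.
priors = {0: 0.5, 1: 0.5}
parameters = {
    0: {'means': [0.0, 0.0], 'stds': [1.0, 1.0]},
    1: {'means': [5.0, 5.0], 'stds': [1.0, 1.0]},
}
scores = log_scores([0.2, -0.1], priors, parameters)
print(max(scores, key=scores.get))  # label with the highest log score
```

Because the logarithm is monotonic, the class with the highest log score is the same class that maximizes the original product.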
Evaluate the classifier on the test set:

    gaussian_nb = GaussianNaiveBayes()
    gaussian_nb.fit(X_train, Y_train)
    print('Prediction for sample [4.4, 3.2, 1.3, 0.2]: %d' % gaussian_nb.predict([4.4, 3.2, 1.3, 0.2]))
    print('Test set accuracy: %f' % gaussian_nb.evaluate(X_test, Y_test))

Output:

    Prediction for sample [4.4, 3.2, 1.3, 0.2]: 0
    Test set accuracy: 0.955556
The same experiment with sklearn's GaussianNB, for comparison:

    from sklearn.naive_bayes import GaussianNB

    clf = GaussianNB()
    clf.fit(X_train, Y_train)
    print('(sklearn) Prediction for sample [4.4, 3.2, 1.3, 0.2]: %d' % clf.predict([[4.4, 3.2, 1.3, 0.2]])[0])
    print('(sklearn) Test set accuracy: %f' % clf.score(X_test, Y_test))

Output:

    (sklearn) Prediction for sample [4.4, 3.2, 1.3, 0.2]: 0
    (sklearn) Test set accuracy: 0.955556