Naive Bayes Classifier
1. Basic formula:
$P(A|B)P(B) = P(B|A)P(A)$ (1)
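Let the input sample data be $D=\{(x_{1},y_{1}),(x_{2},y_{2}),..,(x_{N},y_{N})\}$, where $x_{i} \in X$, $y_{i} \in Y$, and $x_{i}^{(j)}$ denotes the $j$-th feature of the $i$-th sample.
The $j$-th feature $x^{(j)}$ takes values in $\{a_{j1},a_{j2},..,a_{jS_{j}}\}$ for $j=1,2,..,n$, and each label $y_{i}$ takes values in $\{c_{1},c_{2},..,c_{K}\}$.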
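Applied to the sample set $D$, formula (1) becomes:
$P(Y=c_{k}|x) = \frac{P(x|Y=c_{k})P(Y=c_{k})}{P(x)}$ (2)
That is, given an input sample $x$, compute the probability of each class in $Y$ and take the class with the highest probability as the final prediction. Because $P(x)$ in formula (2) is the same for every class, it does not affect which class scores highest, so formula (2) can be simplified to:
$P(Y=c_{k}|x) \propto P(x|Y=c_{k})P(Y=c_{k})$
The final prediction is:
$\arg\max_{c_{k}} P(Y=c_{k}|x),\ c_{k} \in \{c_{1},c_{2},..,c_{K}\}$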
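As a small numeric check of why dropping $P(x)$ is harmless, the sketch below uses made-up likelihood and prior values (assumptions for illustration, not taken from any data set in this post). It shows that dividing by $P(x)$ rescales the scores without changing the arg max.

```python
# Hypothetical values for two classes, chosen only for illustration.
likelihood = [0.10, 0.30]  # P(x | Y=c_k)
prior = [0.60, 0.40]       # P(Y=c_k)

# Numerator of formula (2) for each class.
unnormalized = [l * p for l, p in zip(likelihood, prior)]

# P(x) is the sum of the numerators over all classes.
evidence = sum(unnormalized)
posterior = [u / evidence for u in unnormalized]  # sums to 1

# Index of the winning class before and after normalization.
best_unnormalized = max(range(len(unnormalized)), key=unnormalized.__getitem__)
best_posterior = max(range(len(posterior)), key=posterior.__getitem__)
print(best_unnormalized, best_posterior)  # both indices are the same
```

Normalizing is only needed when calibrated probabilities are wanted; for classification alone the unnormalized product is enough.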
Algorithm steps:
1). Compute the prior and conditional probabilities:
$P(Y=c_{k}) = \frac{\sum_{i=1}^{N}I(y_{i} = c_{k})}{N},\ k = 1,2,..,K$
$P(X^{(j)}=a_{jl}|Y = c_{k}) = \frac{\sum_{i=1}^{N}I(x_{i}^{(j)} = a_{jl},y_{i}=c_{k})}{\sum_{i=1}^{N}I(y_{i}=c_{k})},\ j=1,2,..,n;\ l=1,2,..,S_{j}$
2). For a given instance $x=(x^{(1)},x^{(2)},..,x^{(n)})^T$, compute:
$P(Y=c_{k}) \prod_{j=1}^{n}P(X^{(j)} = x^{(j)}|Y=c_{k}),\ k=1,2,..,K$
3). Determine the class of instance $x$:
$y = \arg\max_{c_{k}} P(Y=c_{k}) \prod_{j=1}^{n}P(X^{(j)} = x^{(j)}|Y = c_{k})$
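The three steps above can be sketched directly with counting. This is a minimal illustration under stated assumptions: the function names `fit_counts` and `predict_one` and the toy data are invented here, features are raw categorical values rather than one-hot vectors, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def fit_counts(samples, labels):
    """Step 1: estimate P(Y=c_k) and P(X^(j)=a | Y=c_k) by counting."""
    n_total = len(labels)
    class_counts = Counter(labels)
    prior = {c: class_counts[c] / n_total for c in class_counts}
    # feature_counts[(j, c)][a] = number of class-c samples with x^(j) == a
    feature_counts = defaultdict(Counter)
    for x, y in zip(samples, labels):
        for j, value in enumerate(x):
            feature_counts[(j, y)][value] += 1
    cond = {
        key: {a: cnt / class_counts[key[1]] for a, cnt in counter.items()}
        for key, counter in feature_counts.items()
    }
    return prior, cond


def predict_one(x, prior, cond):
    """Steps 2 and 3: score every class, then return the arg max."""
    scores = {}
    for c, p_c in prior.items():
        score = p_c
        for j, value in enumerate(x):
            # A value never seen together with class c gets probability 0.
            score *= cond.get((j, c), {}).get(value, 0.0)
        scores[c] = score
    best = max(scores, key=scores.get)
    return best, scores


# Toy data invented for this sketch: one feature, two classes.
samples = [("a",), ("a",), ("b",), ("b",), ("b",)]
labels = ["pos", "pos", "pos", "neg", "neg"]
prior, cond = fit_counts(samples, labels)
label, scores = predict_one(("a",), prior, cond)
print(label, scores)
```

Here the value `"a"` never occurs with class `"neg"`, so that class scores exactly zero. That is the behavior of the plain counting estimate in step 1.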
    import numpy as np


    class NormBayes:
        """Naive Bayes classifier for one-hot (0/1) encoded categorical features."""

        def __init__(self):
            self.__label = []    # distinct class labels
            self.__Prob_yi = []  # prior P(Y=c_k), one entry per class
            self.__Prob_xi = []  # P(column is set | Y=c_k), shape (classes, columns)
            self.__lamda = 1     # not used by the methods below

        def fit(self, X, Y):
            '''
            @X - input numpy array as features (one-hot encoded, 0/1)
            @Y - input label
            '''
            self.__calc_prob_yi(Y)
            self.__calc_prob_xi(X, Y)

        def __calc_prob_yi(self, Y):
            # prior probability: fraction of samples in each class
            self.__label = list(set(Y))
            N = Y.shape[0]
            self.__Prob_yi = np.zeros(len(self.__label))
            for k, l in enumerate(self.__label):
                count = 0
                for n in range(N):
                    if Y[n] == l:
                        count += 1
                self.__Prob_yi[k] = float(count) / N

        def __calc_prob_xi(self, X, Y):
            # conditional probability that each column is set, per class
            y = self.__label  # same class order as the priors
            num_cls = len(y)
            feat_dim = X.shape[1]
            self.__Prob_xi = np.zeros((num_cls, feat_dim))
            for c in range(num_cls):
                count_yi = self.__count_label(Y, y[c])
                yi_idx = self.__get_data_idx(Y, y[c])
                subX = self.__get_sub_data(X, yi_idx)
                for f in range(feat_dim):
                    count_xi = np.count_nonzero(subX[:, f])
                    self.__Prob_xi[c][f] = float(count_xi) / count_yi

        def __count_label(self, Y, y):
            # number of samples whose label equals y
            return list(Y).count(y)

        def __get_data_idx(self, Y, y):
            # indices of the samples whose label equals y
            return [i for i, a in enumerate(Y) if a == y]

        def __get_sub_data(self, X, idx):
            # rows of X selected by idx
            data = np.zeros((len(idx), X.shape[1]))
            for i in range(len(idx)):
                data[i] = X[idx[i]]
            return data

        def predict(self, X):
            '''
            @X - 2-D array; one row gives a single prediction,
                 several rows give one prediction per row
            @return list of (label, score) tuples, where score is
                    P(Y=c_k) times the product of the conditional probabilities
            '''
            rows, cols = X.shape
            num_cls = len(self.__label)
            rsp = []
            for r in range(rows):
                prob_y = np.zeros(num_cls)
                for n in range(num_cls):
                    prod = 1.
                    for c in range(cols):
                        # only the columns that are set in the sample contribute
                        if X[r][c] != 0:
                            prod *= self.__Prob_xi[n][c]
                    prob_y[n] = prod * self.__Prob_yi[n]
                maxIdx = prob_y.argmax()
                rsp.append((self.__label[maxIdx], prob_y[maxIdx]))
            return rsp
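Example 1. Train a naive Bayes classifier on the table below and determine the class label $y$ of $x=(2,S)^T$. In the table, $X^{(1)}$ and $X^{(2)}$ are the features, with value sets $A_{1} = \{1,2,3\}$ and $A_{2} = \{S,M,L\}$.
$Y$ is the class label, $Y \in \{-1,1\}$.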
Sample | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
$X^{(1)}$ | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 |
$X^{(2)}$ | S | M | M | S | S | S | M | M | L | L | L | M | M | L | L |
$Y$ | -1 | -1 | 1 | 1 | -1 | -1 | -1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | -1 |
    # One row per sample. Columns 0-2 one-hot encode X1 in (1, 2, 3),
    # columns 3-5 one-hot encode X2 in (S, M, L).
    X = np.array([[1,0,0,1,0,0],[1,0,0,0,1,0],[1,0,0,0,1,0],
                  [1,0,0,1,0,0],[1,0,0,1,0,0],[0,1,0,1,0,0],
                  [0,1,0,0,1,0],[0,1,0,0,1,0],[0,1,0,0,0,1],
                  [0,1,0,0,0,1],[0,0,1,0,0,1],[0,0,1,0,1,0],
                  [0,0,1,0,1,0],[0,0,1,0,0,1],[0,0,1,0,0,1]])
    Y = np.array([-1,-1,1,1,-1,-1,-1,1,1,1,1,1,1,1,-1])
    #Y = np.array([0,0,1,1,0,0,0,1,1,1,1,1,1,1,0])
    #single-predict
    #x = np.array([[0,1,0,1,0,0]])
    #multi-predict: (2,S), (1,M), (3,M)
    x = np.array([[0,1,0,1,0,0],[1,0,0,0,1,0],[0,0,1,0,1,0]])
    clf = NormBayes()
    clf.fit(X,Y)
    print(clf.predict(x))

Output (print formatting varies with the Python and NumPy version):

    [(-1, 0.066666666666666666), (-1, 0.066666666666666666), (1, 0.11851851851851851)]
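The first result can be checked by hand from the table. The sketch below recomputes the scores for $x=(2,S)^T$ with exact fractions; the helper name `score` is hypothetical, and the three lists are the table columns copied in order.

```python
from fractions import Fraction

# Columns of the Example 1 table, samples 1 to 15.
X1 = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
X2 = list("SMMSSSMMLLLMMLL")
Y = [-1, -1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1]


def score(x1, x2, c):
    """Return P(Y=c) * P(X1=x1 | Y=c) * P(X2=x2 | Y=c) as an exact fraction."""
    n_c = sum(1 for y in Y if y == c)
    n_x1 = sum(1 for a, y in zip(X1, Y) if a == x1 and y == c)
    n_x2 = sum(1 for b, y in zip(X2, Y) if b == x2 and y == c)
    return Fraction(n_c, len(Y)) * Fraction(n_x1, n_c) * Fraction(n_x2, n_c)


scores = {c: score(2, "S", c) for c in (-1, 1)}
print(scores)  # class -1 scores 1/15, class 1 scores 1/45
```

Since $1/15 \approx 0.0667$ is larger than $1/45 \approx 0.0222$, the predicted label is $-1$, which agrees with the first tuple in the output.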
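Example 2. Loan application sample data
ID | Age | Has job | Owns house | Credit status | Class |
---|---|---|---|---|---|
1 | Young | No | No | Fair | No |
2 | Young | No | No | Good | No |
3 | Young | Yes | No | Good | Yes |
4 | Young | Yes | Yes | Fair | Yes |
5 | Young | No | No | Fair | No |
6 | Middle-aged | No | No | Fair | No |
7 | Middle-aged | No | No | Good | No |
8 | Middle-aged | Yes | Yes | Good | No |
9 | Middle-aged | No | Yes | Very good | Yes |
10 | Middle-aged | No | Yes | Very good | Yes |
11 | Elderly | No | Yes | Very good | Yes |
12 | Elderly | No | Yes | Good | Yes |
13 | Elderly | Yes | No | Good | Yes |
14 | Elderly | Yes | No | Very good | Yes |
15 | Elderly | No | No | Fair | No |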
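Predict whether a loan is granted for $x = (\text{Elderly}, \text{Yes}, \text{No}, \text{Fair}, \text{Yes})$. The code below encodes each row as nine 0/1 columns: three for age, one each for job and house, three for credit status, and one for the last entry of $x$, which corresponds to the class column of the table.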
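    # One row per sample. Columns 0-2: age (young, middle-aged, elderly),
    # column 3: has job, column 4: owns house,
    # columns 5-7: credit status (fair, good, very good),
    # column 8: class column of the table.
    X = np.array([[1,0,0,0,0,1,0,0,0],[1,0,0,0,0,0,1,0,0],[1,0,0,1,0,0,1,0,1],
                  [1,0,0,1,1,1,0,0,1],[1,0,0,0,0,1,0,0,0],[0,1,0,0,0,1,0,0,0],
                  [0,1,0,0,0,0,1,0,0],[0,1,0,1,1,0,1,0,0],[0,1,0,0,1,0,0,1,1],
                  [0,1,0,0,1,0,0,1,1],[0,0,1,0,1,0,0,1,1],[0,0,1,0,1,0,1,0,1],
                  [0,0,1,1,0,0,1,0,1],[0,0,1,1,0,0,0,1,1],[0,0,1,0,0,1,0,0,0]])
    # Sample 8 is labeled 1 here although the table lists its class as No.
    Y = np.array([0,0,1,1,0,0,0,1,1,1,1,1,1,1,0])
    x = np.array([[0,0,1,1,0,1,0,0,1]])
    clf = NormBayes()
    clf.fit(X,Y)
    print(clf.predict(x))

Output (print formatting varies with the Python and NumPy version):

    [(1, 0.014631915866483762)]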
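The 0/1 vectors in Example 2 follow a fixed column layout. The sketch below shows one way such an encoding could be produced. The helper `encode_row`, the English category labels, and the column order are assumptions inferred from the encoded rows, not code from the original.

```python
# Category order per one-hot group, inferred from the encoded rows above.
AGE = ["Young", "Middle-aged", "Elderly"]
CREDIT = ["Fair", "Good", "Very good"]
YES_NO = {"No": 0, "Yes": 1}


def encode_row(age, has_job, owns_house, credit, last_flag):
    """Hypothetical helper: map one table row to the 9-column 0/1 vector."""
    age_bits = [int(age == a) for a in AGE]            # columns 0-2
    credit_bits = [int(credit == c) for c in CREDIT]   # columns 5-7
    return (age_bits
            + [YES_NO[has_job], YES_NO[owns_house]]    # columns 3-4
            + credit_bits
            + [YES_NO[last_flag]])                     # column 8


query = encode_row("Elderly", "Yes", "No", "Fair", "Yes")
first_row = encode_row("Young", "No", "No", "Fair", "No")
print(query)      # [0, 0, 1, 1, 0, 1, 0, 0, 1]
print(first_row)  # [1, 0, 0, 0, 0, 1, 0, 0, 0]
```

The first vector matches the query `x` used in the example, and the second matches the first row of `X`.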