Naive Bayes Classifier

Date: 2019-11-13

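1. Basic formula:
$P(A|B)P(B) = P(B|A)P(A)$ (1)
Let the training data be $D=\{(x_{1},y_{1}),(x_{2},y_{2}),..,(x_{N},y_{N})\}$, where $x_{i} \in X$, $y_{i} \in Y$, and $x_{i}^{(j)}$ denotes the $j$-th feature of the $i$-th sample.
The $j$-th feature $x^{(j)}$ takes values in $\{a_{j1},a_{j2},..,a_{jS_{j}}\}$, $j=1,2,..,n$, and each label takes a value $y_{i} \in \{c_{1},c_{2},..,c_{K}\}$.
Applied to the sample space of $D$, formula (1) becomes:
$P(Y=c_{k}|x) = \frac{P(x|Y=c_{k})P(Y=c_{k})}{P(x)}$ (2)
That is, for an input sample $x$ we compute the probability of each class and take the class with the highest probability as the prediction. Because $P(x)$ in formula (2) is the same for every class, it can be dropped when comparing classes:
$P(Y=c_{k}|x) \propto P(x|Y=c_{k})P(Y=c_{k})$
The final prediction is:
$\arg\max_{c_{k}} P(Y=c_{k}|x),\ c_{k} \in \{c_{1},c_{2},..,c_{K}\}$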
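To make the simplification concrete, the following is a minimal sketch showing that dividing by $P(x)$ rescales the scores but does not change which class wins. The prior and likelihood values are hypothetical numbers chosen only for illustration; they do not come from the examples in this article.

```python
import numpy as np

# Hypothetical numbers for a two-class problem (illustration only).
prior = np.array([0.6, 0.4])          # P(Y = c_k)
likelihood = np.array([0.05, 0.20])   # P(x | Y = c_k) for one fixed x

unnormalized = likelihood * prior     # numerator of formula (2)
evidence = unnormalized.sum()         # P(x), by the law of total probability
posterior = unnormalized / evidence   # full formula (2)

# Dividing every entry by the same positive constant keeps the argmax.
print(unnormalized.argmax() == posterior.argmax())
print(posterior)
```

The normalized values are only needed when calibrated probabilities are wanted; for classification alone the unnormalized products are enough.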

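Algorithm steps:
1). Compute the prior and conditional probabilities:
$P(Y=c_{k}) = \frac{\sum_{i=1}^{N}I(y_{i} = c_{k})}{N},\ k = 1,2,..,K$
$P(X^{(j)}=a_{jl}|Y = c_{k}) = \frac{\sum_{i=1}^{N}I(x_{i}^{(j)} = a_{jl},y_{i}=c_{k})}{\sum_{i=1}^{N}I(y_{i}=c_{k})},\ j=1,2,..,n,\ l=1,2,..,S_{j},\ k=1,2,..,K$
2). For a given instance $x=(x^{(1)},x^{(2)},..,x^{(n)})^T$, compute the following. The "naive" assumption that the features are conditionally independent given the class is what lets $P(x|Y=c_{k})$ factor into a product:
$P(Y=c_{k}) \prod_{j=1}^{n}P(X^{(j)} = x^{(j)}|Y=c_{k}),\ k=1,2,..,K$
3). Determine the class of instance $x$:
$y = \arg\max_{c_{k}} P(Y=c_{k}) \prod_{j=1}^{n}P(X^{(j)} = x^{(j)}|Y = c_{k})$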
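The three steps can be sketched directly with counts over raw categorical values, without one-hot encoding. This is an illustrative sketch rather than the implementation used in this article: the function names `fit_counts` and `predict` are hypothetical, no smoothing is applied, and the data is the 15-row table from Example 1.

```python
from collections import Counter

def fit_counts(samples, labels):
    """Step 1: estimate P(Y=c_k) and P(X^(j)=a|Y=c_k) by counting."""
    n = len(labels)
    class_count = Counter(labels)
    prior = {c: class_count[c] / n for c in class_count}
    # pair_count[(j, value, c)] = samples of class c whose j-th feature equals value
    pair_count = Counter()
    for x, y in zip(samples, labels):
        for j, value in enumerate(x):
            pair_count[(j, value, y)] += 1
    cond = {key: cnt / class_count[key[2]] for key, cnt in pair_count.items()}
    return prior, cond

def predict(prior, cond, x):
    """Steps 2 and 3: score every class, then take the argmax."""
    scores = {}
    for c, p in prior.items():
        score = p
        for j, value in enumerate(x):
            # A (feature, value, class) combination never seen in training gets 0.
            score *= cond.get((j, value, c), 0.0)
        scores[c] = score
    best = max(scores, key=scores.get)
    return best, scores

# Data from the table in Example 1.
x1 = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
x2 = list("SMMSSSMMLLLMMLL")
labels = [-1, -1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1]
samples = list(zip(x1, x2))

prior, cond = fit_counts(samples, labels)
best, scores = predict(prior, cond, (2, "S"))
print(best, scores)
```

Keeping the counts in dictionaries keyed by `(feature index, value, class)` mirrors the indicator sums in step 1 one to one, at the cost of speed on large datasets.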

import numpy as np

class NormBayes:
    def __init__(self):
        self.__label = []
        self.__Prob_yi=[]
        self.__Prob_xi=[]
        self.__lamda = 1  # smoothing parameter; not used anywhere below
    def fit(self,X,Y):
        '''
        @X - input numpy array as features
        @Y - input label
        '''
        self.__calc_prob_yi(Y)
        self.__calc_prob_xi(X,Y)        

    def __calc_prob_yi(self,Y):
        # calculate the prior probability P(Y = c_k) for each label
        self.__label = list(set(Y))
        N = Y.shape[0];k = 0
        self.__Prob_yi = np.zeros((len(self.__label)))
        for l in self.__label:
            count = 0
            for n in range(N):
                if(Y[n] == l):
                    count += 1
            self.__Prob_yi[k] = float(count) / N
            #print "(yi = ",l,")= ",self.__Prob_yi[k]
            k += 1
    def __calc_prob_xi(self,X,Y):
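        # conditional probability P(feature column is nonzero | Y = c_k)
        y = self.__label  # reuse the label order from __calc_prob_yi so rows align with __Prob_yi
        num_cls = len(y)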
        feat_dim = X.shape[1]
        self.__Prob_xi = np.zeros((num_cls,feat_dim))
        for c in range(num_cls): 
            count_yi = self.__count_label(Y,y[c])
            #print "count_yi=",count_yi
            yi_idx = self.__get_data_idx(Y,y[c])
            subX = self.__get_sub_data(X,yi_idx)
            for f in range(feat_dim):
                count_xi = np.count_nonzero(subX[:,f])                
                #print "count_x",f,"= ",count_xi
                self.__Prob_xi[c][f] = float(count_xi) / count_yi
                #print "(ck=",c,"xi=",f,")= ",self.__Prob_xi[c][f]

    def __count_label(self,Y,y):     
        return list(Y).count(y)

    def __get_data_idx(self,Y,y):
        return [i for i,a in enumerate(Y) if a == y]

    def __get_sub_data(self,X,idx):
        # copy the rows of X listed in idx into a new float array
        data = np.zeros((len(idx),X.shape[1]))
        for i in range(len(idx)):
            data[i] = X[idx[i]]
        return data
    def predict(self,X):
        '''
        @X - 2-D numpy array; one row gives a single prediction,
            several rows give one prediction per row
        @return list of (label, unnormalized score) tuples, one per row
        '''
        rows,cols = X.shape
        num_cls = len(self.__label)
        rsp = []
        for r in range(rows):            
            prob_y = np.zeros((num_cls))
            for n in range(num_cls):
                prod = 1.
                for c in range(cols):
                    if(X[r][c] != 0):
                        prod *= self.__Prob_xi[n][c]
                prob_y[n] = prod * self.__Prob_yi[n]
            maxIdx = prob_y.argmax()
            rsp.append((self.__label[maxIdx],prob_y[maxIdx]))
        return rsp
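
The constructor stores `self.__lamda = 1`, but the class never uses it, so a feature value that never occurs together with a class gets conditional probability 0 and zeroes the whole product. One common remedy is additive (Laplace) smoothing; that this is what `__lamda` was meant for is an assumption. A minimal sketch, using a hypothetical helper `smoothed_conditional` that is not part of the class above:

```python
def smoothed_conditional(count_xy, count_y, num_values, lam=1.0):
    """Smoothed estimate of P(X^(j)=a_jl | Y=c_k).

    count_xy   -- samples of class c_k whose j-th feature equals a_jl
    count_y    -- samples of class c_k
    num_values -- S_j, the number of distinct values feature j can take
    lam        -- smoothing strength; lam = 0 gives the plain count ratio
    """
    return (count_xy + lam) / (count_y + num_values * lam)

# Hypothetical counts: a feature with 3 possible values, 6 samples in the class.
unseen = smoothed_conditional(0, 6, 3)   # value never seen with this class
seen = smoothed_conditional(3, 6, 3)     # value seen 3 times with this class
print(unseen, seen)
```

With `lam > 0` the estimate for an unseen value is small but nonzero, and the estimates over all `num_values` values of a feature still sum to 1.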

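Example 1. Train a naive Bayes classifier and determine the class label $y$ of $x=(2,S)^T$. In the table, $X^{(1)}$ and $X^{(2)}$ are the features, with value sets $A_{1} = \{1,2,3\}$ and $A_{2} = \{S,M,L\}$.
$Y$ is the class label, $Y \in \{-1,1\}$.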

	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
$X^{(1)}$	1	1	1	1	1	2	2	2	2	2	3	3	3	3	3
$X^{(2)}$	S	M	M	S	S	S	M	M	L	L	L	M	M	L	L
$Y$	-1	-1	1	1	-1	-1	-1	1	1	1	1	1	1	1	-1

# one-hot encoding: columns 1-3 are X^(1) = 1, 2, 3; columns 4-6 are X^(2) = S, M, L
X = np.array([[1,0,0,1,0,0],[1,0,0,0,1,0],[1,0,0,0,1,0],
             [1,0,0,1,0,0],[1,0,0,1,0,0],[0,1,0,1,0,0],
              [0,1,0,0,1,0],[0,1,0,0,1,0],[0,1,0,0,0,1],
              [0,1,0,0,0,1],[0,0,1,0,0,1],[0,0,1,0,1,0],
              [0,0,1,0,1,0],[0,0,1,0,0,1],[0,0,1,0,0,1]])
Y = np.array([-1,-1,1,1,-1,-1,-1,1,1,1,1,1,1,1,-1])
#Y = np.array([0,0,1,1,0,0,0,1,1,1,1,1,1,1,0])

#single-predict
#x = np.array([[0,1,0,1,0,0]])
#multi-predict
x = np.array([[0,1,0,1,0,0],[1,0,0,0,1,0],[0,0,1,0,1,0]])

clf = NormBayes()
clf.fit(X,Y)
print(clf.predict(x))

[(-1, 0.066666666666666666), (-1, 0.066666666666666666), (1, 0.11851851851851851)]
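
The first tuple can be checked by hand from the table in Example 1. For $x=(2,S)^T$, counting rows gives:
$P(Y=1)P(X^{(1)}=2|Y=1)P(X^{(2)}=S|Y=1) = \frac{9}{15}\cdot\frac{3}{9}\cdot\frac{1}{9} = \frac{1}{45} \approx 0.0222$
$P(Y=-1)P(X^{(1)}=2|Y=-1)P(X^{(2)}=S|Y=-1) = \frac{6}{15}\cdot\frac{2}{6}\cdot\frac{3}{6} = \frac{1}{15} \approx 0.0667$
The larger value belongs to $Y=-1$, which agrees with the label and score in the first tuple of the output.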

Example 2. Loan application sample data

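ID	Age	Has job	Owns house	Credit rating	Class
1	Young	No	No	Fair	No
2	Young	No	No	Good	No
3	Young	Yes	No	Good	Yes
4	Young	Yes	Yes	Fair	Yes
5	Young	No	No	Fair	No
6	Middle-aged	No	No	Fair	No
7	Middle-aged	No	No	Good	No
8	Middle-aged	Yes	Yes	Good	No
9	Middle-aged	No	Yes	Very good	Yes
10	Middle-aged	No	Yes	Very good	Yes
11	Elderly	No	Yes	Very good	Yes
12	Elderly	No	Yes	Good	Yes
13	Elderly	Yes	No	Good	Yes
14	Elderly	Yes	No	Very good	Yes
15	Elderly	No	No	Fair	No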

Predict whether to grant a loan for: $x = (\text{elderly},\ \text{yes},\ \text{no},\ \text{fair},\ \text{yes})$

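# one-hot columns: age (young, middle-aged, elderly), has job, owns house,
# credit rating (fair, good, very good), class
X = np.array([[1,0,0,0,0,1,0,0,0],[1,0,0,0,0,0,1,0,0],[1,0,0,1,0,0,1,0,1],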
             [1,0,0,1,1,1,0,0,1],[1,0,0,0,0,1,0,0,0],[0,1,0,0,0,1,0,0,0],
             [0,1,0,0,0,0,1,0,0],[0,1,0,1,1,0,1,0,0],[0,1,0,0,1,0,0,1,1],
             [0,1,0,0,1,0,0,1,1],[0,0,1,0,1,0,0,1,1],[0,0,1,0,1,0,1,0,1],
             [0,0,1,1,0,0,1,0,1],[0,0,1,1,0,0,0,1,1],[0,0,1,0,0,1,0,0,0]])
Y = np.array([0,0,1,1,0,0,0,1,1,1,1,1,1,1,0])
x = np.array([[0,0,1,1,0,1,0,0,1]])

clf = NormBayes()
clf.fit(X,Y)
print(clf.predict(x))

[(1, 0.014631915866483762)]
