From http://www.ruanyifeng.com/blog/2013/03/tf-idf.html
For "terms" (TF): the denominator can also be the document's total term count, i.e. TF = (number of times the term appears in the document) / (total number of terms in the document).
If a term is relatively rare yet appears many times in a given document, it very likely reflects what that document is about, and is exactly the kind of keyword we are looking for.
"Inverse Document Frequency" (IDF): its value is inversely related to how common a term is.
IDF considers "documents": it is computed from how many documents in the corpus contain the term.
Once we know the term frequency (TF) and the inverse document frequency (IDF), multiplying the two gives the term's TF-IDF value. The more important a term is to a document, the larger its TF-IDF value.
Advantage: it is simple and fast, and the results match reality reasonably well.
Disadvantage: measuring a term's importance purely by term frequency is not comprehensive enough.
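To make the definitions above concrete, here is a minimal sketch (not from the original post; the toy corpus and tokenization are made up) that computes TF-IDF with the same TF and IDF formulas:

import math
from collections import Counter

def tf_idf(tokenized_docs):
    """Return one {word: tf-idf} dict per tokenized document."""
    n_docs = len(tokenized_docs)
    # document frequency: how many documents contain each word
    df = Counter(w for doc in tokenized_docs for w in set(doc))
    result = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        result.append({
            # TF  = occurrences of the word / total words in the document
            # IDF = log(total documents / (documents containing the word + 1))
            w: (c / len(doc)) * math.log(n_docs / (df[w] + 1))
            for w, c in counts.items()
        })
    return result

docs = [["food", "meat", "brain", "food"],
        ["food", "kitchen", "job"],
        ["meat", "job", "food"]]
print(tf_idf(docs))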
From http://www.ruanyifeng.com/blog/2013/03/cosine_similarity.html
Let's start with simple sentences.
Sentence A: 我喜歡看電視,不喜歡看電影。 ("I like watching TV; I don't like watching movies.")
Sentence B: 我不喜歡看電視,也不喜歡看電影。 ("I don't like watching TV, and I don't like watching movies either.")
The basic idea is: the more similar the wording of these two sentences, the more similar their content should be. So we can start from term frequencies and compute their degree of similarity.
Step 1: word segmentation.
Sentence A: 我/喜歡/看/電視,不/喜歡/看/電影。
Sentence B: 我/不/喜歡/看/電視,也/不/喜歡/看/電影。
Step 2: list all the words.
我, 喜歡, 看, 電視, 電影, 不, 也
Step 3: count term frequencies.
Sentence A: 我 1, 喜歡 2, 看 2, 電視 1, 電影 1, 不 1, 也 0
Sentence B: 我 1, 喜歡 2, 看 2, 電視 1, 電影 1, 不 2, 也 1
Step 4: write out the term-frequency vectors.
Sentence A: [1, 2, 2, 1, 1, 1, 0]
Sentence B: [1, 2, 2, 1, 1, 2, 1]
At this point, the question becomes how to measure the similarity of these two vectors.
The two vectors (viewed as line segments from the origin) form an angle, and we can judge their similarity by the size of that angle: the smaller the angle, the more similar the vectors.
Taking two-dimensional space as an example, a and b in the original post's figure are two vectors, and we want the angle θ between them. By the law of cosines, cos θ = a·b / (|a||b|).
The closer the cosine is to 1, the closer the angle is to 0 degrees, and the more similar the two vectors; this is "cosine similarity". So sentence A and sentence B above are quite similar; in fact their angle is about 20.3 degrees.
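A quick numeric check of the example above (a throwaway sketch; NumPy is assumed to be available):

import numpy as np

a = np.array([1, 2, 2, 1, 1, 1, 0])   # sentence A
b = np.array([1, 2, 2, 1, 1, 2, 1])   # sentence B
cos = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos)                           # ~0.938
print(np.degrees(np.arccos(cos)))    # ~20.3 degrees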
This gives us an algorithm for "finding similar articles":
(1) Use TF-IDF to extract the keywords of the two articles;
(2) Take a number of keywords from each article (say 20), merge them into a single set, and compute each article's term frequency over the words of this set (to neutralize differences in article length, relative term frequencies can be used);
(3) Build the two articles' term-frequency vectors;
(4) Compute the cosine similarity of the two vectors; the larger the value, the more similar the articles.
"Cosine similarity" is a very useful technique: it can be applied whenever you need to measure how similar two vectors are.
From http://blog.csdn.net/ehomeshasha/article/details/35988111
The multinomial likelihood of a document d with word-count vector (f_1, ..., f_n) (n is the size of the dictionary of all words) under class c is expressed as:

p(d \mid \theta_c) = \frac{(\sum_i f_i)!}{\prod_i f_i!} \prod_{i=1}^{n} \theta_{ci}^{f_i}

Here \theta_{ci} represents the probability that word i occurs in class c, so for each class the parameters must sum to 1:

\sum_{i=1}^{n} \theta_{ci} = 1

Applying the MAP estimate and taking logs gives the decision rule:

l(d) = \arg\max_c \left[ \log p(\theta_c) + \sum_{i=1}^{n} f_i \, w_{ci} \right]

In the formula above, w_{ci} = \log \theta_{ci} is the weight of word i in class c. How to choose \theta_{ci} and w_{ci} is exactly what researchers focus on, since it determines the performance of the Naive Bayes classifier.
/* See the original link for details. */
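As a minimal sketch of the scoring rule reconstructed above (the toy counts and the Laplace smoothing constant are my own assumptions, not from the linked post):

import numpy as np

def mnb_log_scores(f, class_word_counts, class_priors, smooth=1.0):
    """f: word-count vector of one document (length n); class_word_counts:
    C-by-n matrix of per-class word counts from training.
    Returns log p(theta_c) + sum_i f_i * w_ci for each class c."""
    theta = (class_word_counts + smooth) / \
            (class_word_counts.sum(axis=1, keepdims=True) + smooth * class_word_counts.shape[1])
    w = np.log(theta)                        # w_ci = log(theta_ci)
    return np.log(class_priors) + f @ w.T    # predict with argmax over classes

counts = np.array([[4., 2., 1., 0., 1.],     # class 0 word counts
                   [1., 1., 0., 9., 5.]])    # class 1 word counts
print(mnb_log_scores(np.array([1., 1., 0., 0., 1.]), counts, np.array([2/3, 1/3])))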
From http://stats.stackexchange.com/questions/126009/complement-naive-bayes
Let's say we have three documents with the following words:
// training data
Doc 1: "Food" occurs 2 times, "Meat" occurs 1 time, "Brain" occurs 1 time → 4 words. Class of Doc 1: "Health"
Doc 2: "Food" occurs 1 time, "Meat" occurs 1 time, "Kitchen" occurs 9 times, "Job" occurs 5 times → 16 words. Class of Doc 2: "Butcher"
Doc 3: "Food" occurs 2 times, "Meat" occurs 1 time, "Job" occurs 1 time → 4 words. Class of Doc 3: "Health"
Total word count in class 'Health' - (2+1+1)+(2+1+1) = 8
Total word count in class 'Butcher' - (1+1+9+5) = 16
So we have two possible y classes: (y=Health) and (y=Butcher) , with prior probabilities thus:
p(y=Health) = 2/3 (2 out of 3 docs are about Health)
p(y=Butcher) = 1/3
Now, for Complement Naive Bayes, instead of calculating the likelihood of a word occurring in a class,
we calculate the likelihood that it occurs in other classes. So, we would proceed to calculate the word-class dependencies thus:
Complement Probability of word 'Food' with class 'Health':
p( w=Food | ŷ=Health ) = 1/16
See? 'Food' occurs 1 time in total for all classes NOT health, and the number of words in class NOT health is 16.
Complement Probability of word 'Food' with class 'Butcher':
p( w=Food | ŷ=Butcher ) = (2+2)/8 = 0.5
For others,
p( w=Kitchen | ŷ=Health )  = 9/16
p( w=Kitchen | ŷ=Butcher ) = 0/8 = 0
p( w=Meat | ŷ=Health )     = 1/16
p( w=Meat | ŷ=Butcher )    = 2/8
...and so forth
Then, say we had a new document containing the following:
New doc: "Food" - 1, "Job" - 1, "Meat" - 1
We would predict the class of this new doc by computing, for each class y, the prior times the complement likelihoods of its words, which gives us
p(y) × p( w=Food | ŷ ) × p( w=Job | ŷ ) × p( w=Meat | ŷ )
Let's work it out for Health - this will give us:
p(y=Health) × p( w=Food | ŷ=Health ) × p( w=Job | ŷ=Health ) × p( w=Meat | ŷ=Health ) = (2/3) × (1/16) × (5/16) × (1/16) ≈ 8.1 × 10^-4
...and likewise for the other classes.
So, the one with the lower probability (minimum value) is said to be the class it belongs to - in this case,
our new doc will be classified as belonging to Health.
We DON'T use the one with the maximum probability, because in the Complement Naive Bayes algorithm a higher value means it is highly likely that a document with these words does NOT belong to that class.
Obviously, this example is, again, highly contrived, and we should really also apply Laplace smoothing. But I hope this gives you a working idea you can build on!
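Translating the worked example above directly into Python (a sketch only; smoothing is deliberately left out so the numbers match the ones computed by hand):

import numpy as np

vocab = ["Food", "Meat", "Brain", "Kitchen", "Job"]
X = np.array([[2, 1, 1, 0, 0],     # Doc 1 (Health)
              [1, 1, 0, 9, 5],     # Doc 2 (Butcher)
              [2, 1, 0, 0, 1]])    # Doc 3 (Health)
labels = np.array(["Health", "Butcher", "Health"])

new_doc = np.array([1, 1, 0, 0, 1])   # "Food" 1, "Meat" 1, "Job" 1

scores = {}
for c in ["Health", "Butcher"]:
    complement = X[labels != c]                       # all training docs NOT in class c
    p_w = complement.sum(axis=0) / complement.sum()   # p(w | y-hat = c)
    prior = np.mean(labels == c)                      # p(y = c)
    scores[c] = prior * np.prod(p_w ** new_doc)       # complement score of the new doc
print(scores)   # {'Health': ~8.1e-4, 'Butcher': ~5.2e-3} -> lower wins, so 'Health'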
This section takes the prior distribution into account.
Ref: [Bayesian] 「我是bayesian我怕誰」 series - Naive Bayes with Prior
The conjugate prior of the multinomial distribution is the Dirichlet distribution; the conjugate prior of the binomial (Bernoulli) distribution is the Beta distribution.
Then we obtain the posteriors of the two parameters:

\pi \mid y \sim \mathrm{Dir}(\alpha + N_1, \ldots, \alpha + N_C), \qquad \theta_{jc} \mid x, y \sim \mathrm{Beta}(\beta + n_{jc}, \; \beta + N_c - n_{jc})

(N_c is the number of training documents in class c; n_{jc} is the number of class-c documents containing feature j.)

The expectation of each parameter's posterior distribution is used as the point estimate:

\bar{\pi}_c = \frac{N_c + \alpha}{N + C\alpha}, \qquad \bar{\theta}_{jc} = \frac{n_{jc} + \beta}{N_c + 2\beta}
import numpy as np

def naive_bayes_posterior_mean(x, y, alpha=1, beta=1):
    n_class = y.shape[1]
    n_feat = x.shape[1]
    # alpha does not need to be a full-length vector, because the prior over classes is symmetric;
    # but for beta, we must be explicit
    beta = np.ones(2) * beta
    # plug in the posterior result for pi directly
    pi_counts = np.sum(y, axis=0) + alpha
    pi = pi_counts / np.sum(pi_counts)
    # plug in the posterior result for theta directly
    theta = np.zeros((n_feat, n_class))
    for cls in range(n_class):
        docs_in_class = (y[:, cls] == 1)
        class_feat_count = x[docs_in_class, :].sum(axis=0)
        theta[:, cls] = (class_feat_count + beta[1]) / (docs_in_class.sum() + beta.sum())
    return pi, theta
pi_bar, theta_bar = naive_bayes_posterior_mean(xtrain, ytrain, alpha=1, beta=1)
print(pi_bar) # Cat(y|pi)
print(theta_bar) # Ber(xj|thetajc)
Result:
[ 0.23479491 0.14144272 0.4893918 0.08910891 0.04526167]
[[ 0.0239521   0.04950495  0.00576369  0.015625    0.03030303]   # generative probability: how likely each class's topic is to contain this word
 [ 0.00598802  0.01980198  0.01440922  0.015625    0.06060606]
 [ 0.00598802  0.03960396  0.01152738  0.03125     0.03030303]
 ...,
 [ 0.01796407  0.00990099  0.01440922  0.03125     0.06060606]
 [ 0.39520958  0.45544554  0.5389049   0.4375      0.42424242]   # a stop word tends to look like this: present with fairly high probability in every topic
 [ 0.00598802  0.00990099  0.00288184  0.015625    0.03030303]]
from scipy.special import logsumexp   # scipy.misc.logsumexp in older SciPy

def predict_class_prob(x, pi, theta):
    # whichever class has the highest probability is the predicted class
    class_feat_l = np.zeros_like(theta)
    # calculations in log space to avoid underflow
    class_feat_l[x == 1, :] = np.log(theta[x == 1, :])
    class_feat_l[x == 0, :] = np.log(1 - theta[x == 0, :])
    class_l = class_feat_l.sum(axis=0) + np.log(pi)   # class_l: log of the unnormalized prediction
    # logsumexp is equivalent to np.log(np.sum(np.exp(a)))
    return np.exp(class_l - logsumexp(class_l))       # returns the normalized class probabilities
# this one returns the decision
def predict_class(x, pi, theta):
    """
    Given a feature vector `x`, class probabilities `pi` and
    class-conditional feature probabilities `theta`, return a one-hot
    encoded MAP class-membership prediction.
    """
    probs = predict_class_prob(x, pi, theta)
    prediction = np.zeros_like(probs)
    prediction[np.argmax(probs)] = 1
    return prediction
def predictive_accuracy(xdata, ydata, predictor, *args):
    """
    Given an N-by-D array of features `xdata`, an N-by-C array of
    one-hot-encoded true classes `ydata` and a predictor function `predictor`,
    return the proportion of correct predictions.
    We accept an additional argument list `args` that will be passed to
    the predictor function.
    """
    correct = np.zeros(xdata.shape[0])
    for i, x in enumerate(xdata):
        prediction = predictor(x, *args)
        # compare the corresponding row of ydata with the prediction;
        # it only counts as correct if every element matches
        correct[i] = np.all(ydata[i, :] == prediction)
    return correct.mean()
# predict sample 48
categorical_bar(predict_class_prob(xtest[48, :], pi_bar, theta_bar), alpha=0.5, color='orange')
categorical_bar(ytest[48, :], alpha=0.5, color='blue')
# predict all samples to see the overall performance
train_correct_bayes = predictive_accuracy(xtrain, ytrain, lambda x: predict_class(x, pi_bar, theta_bar))
print("Full Bayes In-sample proportion correct: {:.3}".format(train_correct_bayes))
test_correct_bayes = predictive_accuracy(xtest, ytest, lambda x: predict_class(x, pi_bar, theta_bar))
print("Full Bayes Out-of-sample proportion correct: {:.3}".format(test_correct_bayes))
For example, enlarge the prior: α = (100, 100, …, 100).
pi_bar_10, theta_bar_10 = naive_bayes_posterior_mean(xtrain, ytrain, alpha=100, beta=1)
categorical_bar(pi_bar_10, color='red', alpha=0.5, label=r"$\bar{\pi}'$")
categorical_bar(pi_hat, color='magenta', alpha=0.5, label=r'$\hat{\pi}$')
pl.legend()
Result: the estimate of pi looks much more uniform - the prior has become the dominant part of the posterior expectation of pi.
Dataset: http://qwone.com/~jason/20Newsgroups/
This is only a test report on the Naive Bayes algorithm applied to email classification, which should help you further understand Naive Bayes.
The goal is to implement a version of the Naive Bayes classifier and apply it to the text documents in the 20 Newsgroups data set, a collection of approximately 20,000 newsgroup documents partitioned (nearly) evenly across 20 different newsgroups.
Here, three Impact Factors are considered to improve the accuracy of classification.
(1) Adopt the same approach as described in Chapter 6 of Mitchell's book (see the estimate quoted after this list).
(2) Filter out high-frequency words that are useless for recognition, such as 'the', 'is', etc. The more groups a word appears in, the smaller its weight should be. Thus the TF-IDF (term frequency-inverse document frequency) method is considered.
(3) The probability of a word occurring depends on its position within the text, so next, a positional weight for each word is considered.
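For the 1st Impact Factor, the estimate referred to is presumably the Laplace-smoothed word probability from Mitchell's LEARN_NAIVE_BAYES_TEXT procedure (Chapter 6):

P(w_k \mid v_j) = \frac{n_k + 1}{n + |\mathrm{Vocabulary}|}

where n is the total number of word positions in all training documents of class v_j, n_k is the number of times word w_k occurs in those positions, and |Vocabulary| is the number of distinct words in the training data.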
In order to keep the implementation simple, it is simply assumed that:
(a) the importance of words in an email decreases from beginning to end,
(b) but key words tend to be repeated at the end of the email.
Gaussian distributions are used to describe this importance: the horizontal axis is a word's position in the body of the email, and the vertical axis is its importance. Two Gaussian distributions represent the two assumptions respectively, and the importance of a word is the sum of the two corresponding values. Suitable parameters for the two Gaussian distributions were determined through extensive testing.
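One possible reading of this positional weighting as code (the means, standard deviations and equal mixing of the two Gaussians below are placeholder assumptions; the report only states that the parameters were tuned by testing):

import numpy as np

def position_weight(pos, n_words, mu1=0.0, s1=0.35, mu2=1.0, s2=0.1):
    """Importance of the word at index `pos` in an email body of `n_words` words:
    one Gaussian peaked at the start (importance decreasing through the email)
    plus one peaked at the end (key words repeated at the end)."""
    t = pos / max(n_words - 1, 1)                        # normalize position to [0, 1]
    g = lambda x, mu, s: np.exp(-(x - mu) ** 2 / (2 * s ** 2))
    return g(t, mu1, s1) + g(t, mu2, s2)

print([round(float(position_weight(p, 100)), 2) for p in (0, 50, 99)])   # high, low, high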
In addition, we use the log rule below to turn division into subtraction, avoiding the arithmetic underflow caused by tiny probabilities.
log(a/b) = log(a) - log(b)
Next, applying the log rule, the three Impact Factors determine the word probability in each group in the following form:
Word Probability = log(「impact factor 1」) + log(「impact factor 2」) + log(「impact factor 3」)
Considering that these Impact Factors differ in importance, we need to find a proper weight for each of them to maximize the classification accuracy. So the form becomes:
Word Probability = weight1×log(「impact factor 1」) + weight2×log(「impact factor 2」) + weight3×log(「impact factor 3」)
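In code, the combined per-word score would look something like the sketch below; the three factor values and the weights w1, w2, w3 are placeholders for whatever the report actually estimated and tuned:

import math

def word_score(p_word_given_class, idf, pos_weight, w1=1.0, w2=1.0, w3=1.0):
    """Weighted sum of the logs of the three Impact Factors for one word;
    a document's score for a class would be the sum of this over its words
    (plus the class log-prior), and the highest-scoring class is chosen."""
    return (w1 * math.log(p_word_given_class)
            + w2 * math.log(idf)
            + w3 * math.log(pos_weight))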
After implementing the 1st Impact Factor, the classification accuracy is about 73.81%.
After implementing the 2nd Impact Factor, surprisingly, there is no improvement in accuracy.
For the 3rd Impact Factor, there are three groups of undetermined values that have to be tested to find suitable settings.
Finally, with suitable parameters and weight values, the classification accuracy reaches 81.12%.
TF-IDF does not work here, which possibly shows that the high-frequency words are distributed fairly evenly across the groups.
The assumption about the importance of word position is only a simple one, yet this Impact Factor gives a good improvement in accuracy. There are probably better assumptions with even better performance, for example: "the first paragraph, and the first sentence of each paragraph, are important." However, such an assumption would take much more time to implement because email bodies do not have a unified format.
The Naive Bayes classifier performs well on email classification, and the analysis of word position further improves the accuracy. Guessing the group at random would be correct 5% of the time (1/20), while the algorithm above achieves 81.12%.
Mitchell, T. "Machine Learning". McGraw Hill, 1997.