Please credit the source when reposting: EClab, University of Electronic Science and Technology of China (落葉花開): http://www.cnblogs.com/nlp-yekai/p/3816532.html
Perplexity is commonly used in natural language processing to measure how good a trained language model is. When using LDA for topic and word clustering, the original author, D. Blei, used perplexity to determine the number of topics. The formula in his paper is:
perplexity = exp( -(∑ log p(w)) / N )
Here p(w) is the probability of each word that appears in the test set. In the LDA model specifically, p(w) = ∑_z p(z|d) · p(w|z), where z and d denote the trained topics and the individual documents of the test set, respectively. The denominator N is the total number of words appearing in the test set, i.e., the total length of the test set without deduplication.
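As a quick sanity check on the formula, here is a minimal sketch with made-up per-word probabilities (illustrative numbers only, not from any real model):

import math

# Hypothetical per-token probabilities p(w) for a 4-word test set.
word_probs = [0.05, 0.20, 0.01, 0.10]
N = len(word_probs)  # total number of tokens, no deduplication

log_sum = sum(math.log(p) for p in word_probs)
perplexity = math.exp(-log_sum / N)
print(perplexity)  # about 17.8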
The Python program therefore needs to cover the following steps:
1. Convert the trained LDA model's topic-word distribution file into a dictionary, so that word probabilities can be looked up quickly, i.e., compute the numerator of the perplexity.
2. Count the length of the test set, i.e., compute the denominator of the perplexity.
3. Compute the perplexity.
4. Compute the perplexity for models with different numbers of topics, and draw a line chart.
The Python code is as follows:
# -*- coding: utf-8 -*-
import math
import matplotlib.pyplot as plt


def dictionary_found(wordlist):
    '''Turn the trained topic-word list into a {word: probability} dictionary
    for fast lookups, used for the numerator of the perplexity.'''
    word_dictionary1 = {}
    for i in xrange(len(wordlist)):
        if i % 2 == 0:  # even index: a word; odd index: its probability
            if wordlist[i] in word_dictionary1:
                word_probability = float(word_dictionary1[wordlist[i]]) + float(wordlist[i + 1])
                word_dictionary1[wordlist[i]] = word_probability
            else:
                word_dictionary1[wordlist[i]] = float(wordlist[i + 1])
    return word_dictionary1


def look_into_dic(dictionary, testset):
    '''Sum log p(w) over every token of the test set (the numerator of the
    perplexity formula). Tokens missing from the dictionary are skipped.'''
    log_prob_sum = 0.0
    for word in testset.split():
        probability = dictionary.get(word)
        if probability is not None:
            log_prob_sum += math.log(float(probability))
    return log_prob_sum


def f_testset_word_count(testset):
    '''Return the total number of tokens in the test set, without
    deduplication (the denominator N of the perplexity formula).'''
    return len(testset.split())


def f_perplexity(log_prob_sum, word_count):
    '''Compute perplexity = exp(-sum(log p(w)) / N) for one model.'''
    return math.exp(-log_prob_sum / word_count)


def graph_draw(topic, perplexity):
    '''Plot perplexity against the number of topics.'''
    plt.plot(topic, perplexity, color="red", linewidth=2)
    plt.xlabel("Number of Topics")
    plt.ylabel("Perplexity")
    plt.show()


topic = []
perplexity_list = []
f1 = open('/home/alber/lda/GibbsLDA/jd/test.txt', 'r')  # path to the test set
testset = f1.read()
f1.close()
testset_word_count = f_testset_word_count(testset)  # the denominator N
for i in xrange(14):
    topic.append(5 * (i + 1))  # topic counts tried: 5, 10, ..., 70
    trace = "/home/alber/lda/GibbsLDA/jd/stats/model-final-" + str(5 * (i + 1)) + ".txt"  # model file for this topic count
    f = open(trace, 'r')
    word_list = []
    for line in f:
        if "Topic" not in line:  # skip the "Topic k" header lines
            word_list.extend(line.split())
    f.close()
    word_dictionary = dictionary_found(word_list)
    log_prob_sum = look_into_dic(word_dictionary, testset)
    perplexity = f_perplexity(log_prob_sum, testset_word_count)
    perplexity_list.append(perplexity)
graph_draw(topic, perplexity_list)
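For reference, the parsing above assumes that each model-final-N.txt file consists of "Topic" header lines followed by lines of alternating words and probabilities, roughly like the hypothetical sample below (the exact layout of your GibbsLDA output may differ):

Topic 0
phone 0.032 screen 0.027 battery 0.019
Topic 1
delivery 0.041 package 0.025 service 0.018

One caveat: dictionary_found sums p(w|z) over all topics, which effectively treats every topic as equally likely (up to a constant factor) rather than weighting by the document-specific p(z|d) in the formula above. If the trained document-topic distribution (the theta file) is available, weighting by it would follow the formula more closely.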
Below is the resulting line chart. Adjust the parameter around the inflection point (this of course depends on the test set, as the chart shows) to search for the optimal number of topics. Experiments show that as long as the number of topics is chosen near that point, topic extraction is generally quite good.
I am also a newcomer just starting out in research; if there are errors in the program or elsewhere, I hope readers will point them out.