First, let's look at the output format of a language model:
\data\
ngram 1=64000
ngram 2=522530
ngram 3=173445

\1-grams:
-5.24036 'cause -0.2084827
-4.675221 'em -0.221857
-4.989297 'n -0.05809768
-5.365303 'til -0.1855581
-2.111539 </s> 0.0
-99 <s> -0.7736475
-1.128404 <unk> -0.8049794
-2.271447 a -0.6163939
-5.174762 a's -0.03869072
-3.384722 a. -0.1877073
-5.789208 a.'s 0.0
-6.000091 aachen 0.0
-4.707208 aaron -0.2046838
-5.580914 aaron's -0.06230035
-5.789208 aarons -0.07077657
-5.881973 aaronson -0.2173971

(Note: all values above are base-10 logarithms.)
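ARPA is a commonly used storage format for language models. It has two main parts: the file header and the file body.
The listing above is part of a language model. The general format of a trigram language model is as follows: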
\data\
ngram 1=nr # number of 1-grams (unigrams)
ngram 2=nr # number of 2-grams (bigrams)
ngram 3=nr # number of 3-grams (trigrams)

\1-grams:
pro_1 word1 back_pro1

\2-grams:
pro_2 word1 word2 back_pro2

\3-grams:
pro_3 word1 word2 word3

\end\
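The first field is the conditional probability of the n-gram, that is, P(wordN | word1, word2, ..., wordN-1).
The following fields are the words of the n-gram.
The last field is the backoff weight.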
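To make the field layout concrete, here is a minimal Python sketch of a reader for this format. The function name `parse_arpa` and the returned dictionary layout are my own choices for illustration, and the bigram entry `-1.5 <s> a` in the sample is made up; only the unigram values come from the listing above. It treats a missing backoff weight as 0.0 in log10, which corresponds to a weight of 1.

```python
from collections import defaultdict

def parse_arpa(text):
    """Parse ARPA text into {order: {ngram_tuple: (log10_prob, log10_backoff)}}."""
    model = defaultdict(dict)
    order = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "\\data\\" or line.startswith("ngram "):
            continue  # blank line or header line
        if line == "\\end\\":
            break
        if line.startswith("\\") and line.endswith("-grams:"):
            # Section marker such as "\2-grams:" sets the current order.
            order = int(line[1:line.index("-")])
            continue
        if order is None:
            continue
        fields = line.split()
        logprob = float(fields[0])
        words = tuple(fields[1:1 + order])
        # The backoff weight is optional (the highest order has none).
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else 0.0
        model[order][words] = (logprob, backoff)
    return model

# Tiny sample; the bigram line is a hypothetical entry for illustration only.
sample = r"""\data\
ngram 1=3
ngram 2=1

\1-grams:
-2.111539 </s> 0.0
-99 <s> -0.7736475
-2.271447 a -0.6163939

\2-grams:
-1.5 <s> a

\end\
"""

model = parse_arpa(sample)
print(model[1][("a",)])        # (log10 probability, log10 backoff weight)
print(model[2][("<s>", "a")])  # backoff defaults to 0.0 when absent
```

A real model file is much larger, so an actual tool would stream the file line by line instead of holding the text in one string.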
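For example, for three consecutive words, we compute the conditional probability of the third word given the first two:
P(word3|word1,word2)
This is the probability that word3 appears given that word1 and word2 have already appeared. For example, P(錘|王,大) is the probability that the next character is "錘" given that "王大" has already appeared. It is computed like this: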
if (a trigram entry for (word1, word2, word3) exists) {
    return pro_3(word1, word2, word3);
} else if (a bigram entry for (word1, word2) exists) {
    return back_pro2(word1, word2) * P(word3 | word2);  # in practice these are logs, so they are simply added
} else {
    return P(word3 | word2);
}
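The calculation above comes down to computing P(word3 | word2): if there is no trigram entry for 王大錘, then whichever branch is taken, P(word3 | word2) has to be computed. It is computed as follows: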
if (a bigram entry for (word2, word3) exists) {
    return pro_2(word2, word3);
} else {
    return back_pro1(word2) * pro_1(word3);  # back_pro1 is the backoff weight stored with the unigram word2
}
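The two pseudocode fragments above can be folded into one recursive lookup. The following is a minimal Python sketch, not SRILM's actual implementation; the function name `log10_prob`, the two-dictionary layout and all the toy values are assumptions made for illustration. Everything is kept in log10, so the multiplication by the backoff weight becomes an addition.

```python
def log10_prob(word, context, probs, backoffs):
    """Return log10 P(word | context) using ARPA-style backoff.

    probs maps n-gram tuples to log10 probabilities; backoffs maps context
    tuples to log10 backoff weights (a missing weight counts as 0.0, i.e. 1).
    """
    ngram = context + (word,)
    if ngram in probs:
        return probs[ngram]
    if not context:
        raise KeyError(f"{word!r} is not in the model")
    # Back off: add the context's weight, then drop the oldest context word.
    return backoffs.get(context, 0.0) + log10_prob(word, context[1:], probs, backoffs)

# Hypothetical toy model, values chosen only to make the arithmetic easy to follow.
probs = {
    ("c",): -1.0,
    ("d",): -2.0,
    ("b", "c"): -0.5,
    ("a", "b", "c"): -0.2,
}
backoffs = {
    ("a", "b"): -0.3,  # backoff weight stored with the bigram "a b"
    ("b",): -0.4,      # backoff weight stored with the unigram "b"
}

print(log10_prob("c", ("a", "b"), probs, backoffs))  # trigram found directly
print(log10_prob("d", ("a", "b"), probs, backoffs))  # backs off twice
print(log10_prob("d", ("x", "b"), probs, backoffs))  # context "x b" unseen
```

In the second call the result is the sum of the bigram context's weight, the unigram context's weight and the unigram probability. In the third call the unseen context "x b" contributes nothing, which matches the final `else` branch of the first pseudocode fragment.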
Let's walk through this calculation with a concrete example.
Suppose this is the 3-gram PPL output for a sentence we tested:
放 一首 音樂 好 嗎
p( 放 | <s> ) = [2gram] 0.00584747 [ -2.23303 ]
p( 一首 | 放 ...) = [3gram] 0.00935384 [ -2.02901 ]
p( 音樂 | 一首 ...) = [3gram] 0.610533 [ -0.214291 ]
p( 好 | 音樂 ...) = [2gram] 2.31318e-06 [ -5.63579 ]
p( 嗎 | 好 ...) = [3gram] 0.999717 [ -0.000122777 ]
p( </s> | 嗎 ...) = [3gram] 0.999976 [ -1.04858e-05 ]
1 sentences, 5 words, 0 OOVs
0 zeroprobs, logprob= -10.1123 ppl= 48.4592 ppl1= 105.306
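Below are the entries I extracted from the language model. From the explanation above, we know the left column is the probability and the right column is the backoff weight, both expressed as log10 values: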
-2.233032 <s> 放 -2.999944
-2.02901 <s> 放 一首
-0.7478155 一首 音樂 -3.733402
-1.902389 音樂 好 -3.254402
-0.2142911 放 一首 音樂
Comparing the two:

1. p( 放 | <s> ) = p(<s> 放) = -2.233032 OK
2. p( 一首 | 放 ...) = p( 一首 | <s>, 放 ) = p(<s> 放 一首) = -2.02901 OK
3. p( 音樂 | 一首 ...) = p( 音樂 | 放, 一首 ) = p(放 一首 音樂) = -0.2142911 OK

The hardest one is p( 好 | 音樂 ...), because the output shows 2-gram here while we are actually evaluating a 3-gram model, so we need the formula above:

4. p( 好 | 音樂 ...) = p( 好 | 一首, 音樂 ) = p(一首 音樂 好) # note: there is no trigram entry for (一首 音樂 好), so we have to back off
= p(音樂 好) x back_p(一首 音樂) = -1.902389 + -3.733402 = -5.635791 OK
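As a quick sanity check of step 4, the two log10 values can be added and converted back to a plain probability, which should agree with the 2.31318e-06 that the PPL output prints for this word. The variable names here are just for illustration.

```python
log_p_bigram = -1.902389   # model entry for "音樂 好"
log_backoff = -3.733402    # backoff weight stored with "一首 音樂"

# In the log domain the product becomes a sum.
log_p = log_p_bigram + log_backoff
p = 10 ** log_p

print(round(log_p, 6))  # the log10 value shown in brackets in the PPL output
print(p)                # roughly 2.3e-06, the linear probability
```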
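I won't go through the remaining words one by one. This shows how each step of the PPL output is computed. Also, don't assume that a line labeled 2-gram in the PPL output depends only on the previous word. Since you are evaluating a 3-gram model, every word depends on the two words before it; it is just that some of the probabilities are obtained by backing off.
So what is the formula behind this backoff probability?
Let's run an experiment.
Corpus welcome.corpus.pat: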
歡迎你
歡迎加入你們庭
歡迎加入小組
Generate the vocabulary small.wlist:
#!/bin/bash
./tools/wrdmrgseg_ggl-v3.sh ./118k-kuwomusic.new.vocab.dict2 welcome.corpus.pat corpus.pat.wseg # word segmentation
rm small.wlist
echo '</s>' >> small.wlist
echo '<s>' >> small.wlist
LANG=C;LC_ALL=C;awk '{for(i=1;i<=NF;i++){print$i}}' corpus.pat.wseg | sort -u >> small.wlist
Vocabulary contents (龜速 is a word I added arbitrarily; ignore it):
</s>
<s>
你
加入
你們庭
小組
歡迎
龜速
Generate the language model welcome.lm1:
ngram-count -order 3 -debug 1 -text corpus.pat.wseg -vocab small.wlist -gt3min 1 -lm lm/welcome.lm1
Then we switch to a different corpus while keeping the vocabulary unchanged:
歡迎你
歡迎加入你們庭
歡迎加入小組
加入大胡歡迎
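Generate the language model again as welcome.lm2.
Compare the two:
It is easy to see that the right side has two extra entries, marked with red rectangles. They come from the extra sentence we added to the corpus. Since 大胡 does not appear in the vocabulary, the sentence is split at that point. Note that this is not the same as a line break: for `<s> 加入` there is only a sentence start and no sentence end.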
ngram -ppl devtest2006.en -order 3 -lm europarl.en.lm > europarl.en.lm.ppl
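Here the test set is devtest2006.en, the WMT08 machine translation test set, with 2000 sentences;
the -ppl option scores the test-set sentences (logP(T), where P(T) is the product of the probabilities of all sentences) and computes the perplexity of the test set;
europarl.en.lm.ppl is the output file; the other options are the same as above. The output file looks like this: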
file devtest2006.en: 2000 sentences, 52388 words, 249 OOVs
0 zeroprobs, logprob= -105980 ppl= 90.6875 ppl1= 107.805
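The first line gives basic information about the file devtest2006.en: 2000 sentences, 52388 words, 249 out-of-vocabulary (OOV) words;
The second line gives the scoring summary: no zero probabilities; logP(T)=-105980, ppl=90.6875, ppl1=107.805. Both ppl and ppl1 are perplexities, computed with slightly different formulas, as follows: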
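ppl = 10^{-logP(T) / (Sen + Word)}; ppl1 = 10^{-logP(T) / Word}
Here Sen is the number of sentences and Word is the number of words that were actually scored, that is, excluding OOVs and zero-probability words (52388 - 249 = 52139 in this example).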
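The two formulas are easy to check against the numbers in the output. The sketch below assumes, as SRILM does, that OOVs and zero-probability words are left out of the word count; the function name `perplexities` and its parameter names are mine. Because the printed logprob is rounded, the results only match the reported values to a couple of decimal places.

```python
def perplexities(logprob, sentences, words, oovs=0, zeroprobs=0):
    """Return (ppl, ppl1) from a total log10 probability and corpus counts."""
    scored = words - oovs - zeroprobs   # words that actually received a probability
    # ppl also counts one end-of-sentence token per sentence.
    ppl = 10 ** (-logprob / (scored + sentences))
    # ppl1 leaves the end-of-sentence tokens out of the normalization.
    ppl1 = 10 ** (-logprob / scored)
    return ppl, ppl1

ppl, ppl1 = perplexities(-105980, sentences=2000, words=52388, oovs=249)
print(ppl, ppl1)  # close to the reported 90.6875 and 107.805
```

Without subtracting the 249 OOVs the result drifts to about 88.8, which is how I confirmed that they are excluded.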
Let's try it ourselves:
我 要 去 上海 明珠路 五百 五 十五 弄
p( 我 | <s> ) = [2gram] 0.126626 [ -0.897477 ]
p( 要 | 我 ...) = [2gram] 0.194285 [ -0.71156 ]
p( 去 | 要 ...) = [2gram] 0.205612 [ -0.686952 ]
p( 上海 | 去 ...) = [2gram] 0.00419823 [ -2.37693 ]
p( 明珠路 | 上海 ...) = [2gram] 6.65196e-06 [ -5.17705 ]
p( 五百 | 明珠路 ...) = [2gram] 0.00264877 [ -2.57696 ]
p( 五 | 五百 ...) = [2gram] 0.0768465 [ -1.11438 ]
p( 十五 | 五 ...) = [2gram] 0.0159186 [ -1.79809 ]
p( 弄 | 十五 ...) = [2gram] 0.0543947 [ -1.26444 ]
p( </s> | 弄 ...) = [2gram] 0.0667069 [ -1.17583 ]
1 sentences, 9 words, 0 OOVs
0 zeroprobs, logprob= -17.7797 ppl= 59.9746 ppl1= 94.519
>>> pow(10,-1.0/10*(-17.7797))
59.97496455867574
>>> pow(10,-1.0/9*(-17.7797))
94.51967555580339
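The logprob value fed into `pow` can itself be recovered from the output: it is the sum of the bracketed per-word values, including the one for `</s>`. A short check, with the values copied from the listing above:

```python
# Bracketed log10 values from the PPL output, one per word plus </s>.
word_logprobs = [
    -0.897477, -0.71156, -0.686952, -2.37693, -5.17705,
    -2.57696, -1.11438, -1.79809, -1.26444, -1.17583,
]

logprob = sum(word_logprobs)
print(len(word_logprobs))  # 9 words plus the end-of-sentence token
print(logprob)             # agrees with logprob= -17.7797 up to rounding
```

The list has ten entries for a nine-word sentence, which is why ppl divides by 10 and ppl1 divides by 9.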
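Here is the formula in more detail:
logprob is the sum of the log probabilities of the individual n-grams. In the example above, the sum of the last column is indeed the logprob value.
S stands for the sentence, N is the sentence length, and p(wi) is the probability of the i-th word. Multiplying N terms and then taking the N-th root normalizes the result for sentence length.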
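The formula itself does not appear in the text here. The standard definition of perplexity that matches the description of S, N and p(wi) is, I assume, the following, where p(w_i) is understood as the probability of the i-th word given its history:

```latex
PP(S) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}}
      = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{p(w_i)}}
      = 10^{-\frac{1}{N} \sum_{i=1}^{N} \log_{10} p(w_i)}
```

The last form is the one used in the `pow` calls above: the sum of the log10 values is logprob, and N is 10 for ppl (counting `</s>`) or 9 for ppl1.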
Text:
$ head l1.wseg l2.wseg
==> l1.wseg <==
導航 去 上海
導航 去 蘇州
導航 去 北京

==> l2.wseg <==
聽 周杰倫 的 歌曲
聽 汪峯 的 歌曲
聽 劉德華 的 歌曲
l1.lm:

\data\
ngram 1=7
ngram 2=8
ngram 3=7

\1-grams:
-0.60206 </s>
-99 <s> -0.4771213
-1.079181 上海 -0.1760913
-1.079181 北京 -0.1760913
-0.60206 去 -0.4771211
-0.60206 導航 -0.4771213
-1.079181 蘇州 -0.1760913

\2-grams:
-0.1249387 <s> 導航 0
-0.30103 上海 </s>
-0.30103 北京 </s>
-0.60206 去 上海 0
-0.60206 去 北京 0
-0.60206 去 蘇州 0
-0.1249387 導航 去 0
-0.30103 蘇州 </s>

\3-grams:
-0.30103 去 上海 </s>
-0.30103 去 北京 </s>
-0.60206 導航 去 上海
-0.60206 導航 去 北京
-0.60206 導航 去 蘇州
-0.1249387 <s> 導航 去
-0.30103 去 蘇州 </s>

\end\

l2.lm:

\data\
ngram 1=8
ngram 2=9
ngram 3=10

\1-grams:
-0.69897 </s>
-99 <s> -0.50515
-1.176091 劉德華 -0.50515
-0.69897 聽 -0.9030898
-1.176091 周杰倫 -0.50515
-0.69897 歌曲 -0.50515
-1.176091 汪峯 -0.50515
-0.69897 的 -0.50515

\2-grams:
-0.1249387 <s> 聽 0.3979399
-0.1249387 劉德華 的 0.30103
-0.5228788 聽 劉德華 0.30103
-0.5228788 聽 周杰倫 0.30103
-0.5228788 聽 汪峯 0.30103
-0.1249387 周杰倫 的 0.30103
-0.1249387 歌曲 </s>
-0.1249387 汪峯 的 0.30103
-0.1249387 的 歌曲 0

\3-grams:
-0.30103 聽 劉德華 的
-0.60206 <s> 聽 劉德華
-0.60206 <s> 聽 周杰倫
-0.60206 <s> 聽 汪峯
-0.30103 聽 周杰倫 的
-0.1249387 的 歌曲 </s>
-0.30103 聽 汪峯 的
-0.30103 劉德華 的 歌曲
-0.30103 周杰倫 的 歌曲
-0.30103 汪峯 的 歌曲

\end\

l3.lm (the view starts at the last 1-gram entry):

-1.380211 蘇州 -0.07638834

\2-grams:
-0.4259687 <s> 聽 0.05551729
-0.4259687 <s> 導航 0
-0.455932 上海 </s>
-0.4259687 劉德華 的 0.07918127
-0.455932 北京 </s>
-0.90309 去 上海 0
-0.90309 去 北京 0
-0.90309 去 蘇州 0
-0.8239088 聽 劉德華 0.07918127
-0.8239088 聽 周杰倫 0.07918127
-0.8239088 聽 汪峯 0.07918127
-0.4259687 周杰倫 的 0.07918127
-0.4259687 導航 去 0
-0.30103 歌曲 </s>
-0.4259687 汪峯 的 0.07918127
-0.4259687 的 歌曲 0
-0.455932 蘇州 </s>

\3-grams:
-0.455932 去 上海 </s>
-0.60206 聽 劉德華 的
-0.455932 去 北京 </s>
-0.90309 導航 去 上海
-0.90309 導航 去 北京
-0.90309 導航 去 蘇州
-0.90309 <s> 聽 劉德華
-0.90309 <s> 聽 周杰倫
-0.90309 <s> 聽 汪峯
-0.60206 聽 周杰倫 的
-0.4259687 <s> 導航 去
-0.30103 的 歌曲 </s>
-0.60206 聽 汪峯 的
-0.60206 劉德華 的 歌曲
-0.60206 周杰倫 的 歌曲
-0.60206 汪峯 的 歌曲
-0.455932 去 蘇州 </s>

\end\
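This simple example shows that in the interpolated model (l3.lm) the n-gram probabilities get lower, which matches intuition. For example, 導航 去 上海 drops from -0.60206 in l1.lm to -0.90309 in l3.lm.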
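The numbers in l3.lm are consistent with equal-weight linear interpolation of l1.lm and l2.lm in the probability domain. The weight 0.5 and the `mix` helper below are my own assumptions inferred from the values, not something stated in the listings. The sketch also assumes that an n-gram whose predicted word is missing from a model's vocabulary contributes probability 0 from that model, and that l2.lm scores `</s>` after the unseen context 去 上海 with its plain unigram value.

```python
import math

def mix(log_p1, log_p2, lam=0.5):
    """Linearly interpolate two log10 probabilities; None means probability 0."""
    p1 = 0.0 if log_p1 is None else 10 ** log_p1
    p2 = 0.0 if log_p2 is None else 10 ** log_p2
    return math.log10(lam * p1 + (1 - lam) * p2)

# 導航 去 上海: -0.60206 in l1.lm; 上海 is not in l2.lm's vocabulary.
a = mix(-0.60206, None)

# 去 上海 </s>: -0.30103 in l1.lm; l2.lm only has the unigram </s> at -0.69897.
b = mix(-0.30103, -0.69897)

print(a)  # compare with -0.90309 in l3.lm
print(b)  # compare with -0.455932 in l3.lm
```

Each model only vouches for the n-grams it has seen, so after mixing with a model that knows nothing about them, their probabilities are diluted.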