Understanding the SRILM Language Model Format

First, take a look at what a language model file looks like:

\data\  
ngram 1=64000  
ngram 2=522530  
ngram 3=173445  
  
\1-grams:  
-5.24036        'cause  -0.2084827  
-4.675221       'em     -0.221857  
-4.989297       'n      -0.05809768  
-5.365303       'til    -0.1855581  
-2.111539       </s>    0.0  
-99     <s>     -0.7736475  
-1.128404       <unk>   -0.8049794  
-2.271447       a       -0.6163939  
-5.174762       a's     -0.03869072  
-3.384722       a.      -0.1877073  
-5.789208       a.'s    0.0  
-6.000091       aachen  0.0  
-4.707208       aaron   -0.2046838  
-5.580914       aaron's -0.06230035  
-5.789208       aarons  -0.07077657  
-5.881973       aaronson        -0.2173971  
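(Note: all values above are base-10 logarithms.)

ARPA is a commonly used storage format for language models. It consists of two main parts: a file header and a file body.

The listing above is part of a language model. The general format of a trigram language model is as follows: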

\data\
ngram 1=nr # number of unigrams
ngram 2=nr # number of bigrams
ngram 3=nr # number of trigrams
 
\1-grams:
pro_1 word1 back_pro1
 
\2-grams:
pro_2 word1 word2 back_pro2
 
\3-grams:
pro_3 word1 word2 word3
 
\end\

 

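The first field is the conditional probability of the n-gram, i.e. P(wordN | word1, word2, ..., wordN-1).

The middle fields are the words of the n-gram.

The last field is the backoff weight.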
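To make the field layout concrete, here is a minimal Python sketch of a reader for this format. The function name `parse_arpa` and the tiny sample are my own illustration (the three unigram lines are copied from the listing above; the header counts and the bigram line are made up), and it is not how SRILM itself reads the file.

```python
from collections import defaultdict

def parse_arpa(lines):
    """Parse ARPA-format lines into (counts, {order: {words: (logprob, backoff)}})."""
    counts = {}                  # n-gram counts declared in the header
    ngrams = defaultdict(dict)   # order -> {word tuple: (log10 prob, log10 backoff or None)}
    order = None                 # stays None until a "\N-grams:" section starts
    for raw in lines:
        line = raw.strip()
        if not line or line == "\\data\\":
            continue
        if line == "\\end\\":
            break
        if line.startswith("ngram "):
            n, count = line[len("ngram "):].split("=")
            counts[int(n)] = int(count)
        elif line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:line.index("-")])
        elif order is not None:
            fields = line.split()
            logprob = float(fields[0])
            words = tuple(fields[1:1 + order])
            # The backoff weight is optional; the highest order never has one.
            backoff = float(fields[1 + order]) if len(fields) > 1 + order else None
            ngrams[order][words] = (logprob, backoff)
    return counts, dict(ngrams)


# Illustrative sample: header counts and the bigram line are invented.
sample = """\\data\\
ngram 1=3
ngram 2=1

\\1-grams:
-2.111539 </s> 0.0
-99 <s> -0.7736475
-2.271447 a -0.6163939

\\2-grams:
-0.30103 <s> a

\\end\\
""".splitlines()

counts, ngrams = parse_arpa(sample)
print(counts)
print(ngrams[1][("a",)])
print(ngrams[2][("<s>", "a")])
```

Keying each entry by a tuple of words keeps the lookup identical for every order, which is convenient for the backoff computation discussed next.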

 

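For example, given three consecutive words, we compute the probability of the third word given the first two:

P(word3|word1,word2)

This is the probability that word3 occurs given that word1 and word2 have occurred. For instance, P(錘|王,大) is the probability that the next character is "錘" given that the two characters "王大" have already appeared. It is computed as follows: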

if (the trigram (word1,word2,word3) exists) {

    return pro_3(word1,word2,word3);

} else if (the bigram (word1,word2) exists) {

    return back_pro2(word1,word2)*P(word3|word2);  # in practice these are log values, so they are simply added

} else {

    return P(word3|word2);

}

 

Both fallback branches come down to computing P(word3 | word2): if the trigram 王大錘 does not exist, then whichever path is taken, P(word3 | word2) has to be computed, as follows:

if (the bigram (word2,word3) exists) {

    return pro_2(word2,word3);

} else {

    return back_pro1(word2)*pro_1(word3);  # back_pro1 is the backoff weight of the unigram word2

}

Let's walk through this calculation with a concrete example.

Suppose this is the 3-gram PPL output for one test sentence:

放 一首 音樂 好 嗎
    p( 放 | <s> )   = [2gram] 0.00584747 [ -2.23303 ]
    p( 一首 | 放 ...)   = [3gram] 0.00935384 [ -2.02901 ]
    p( 音樂 | 一首 ...)     = [3gram] 0.610533 [ -0.214291 ]
    p( 好 | 音樂 ...)   = [2gram] 2.31318e-06 [ -5.63579 ]
    p( 嗎 | 好 ...)     = [3gram] 0.999717 [ -0.000122777 ]
    p( </s> | 嗎 ...)   = [3gram] 0.999976 [ -1.04858e-05 ]
1 sentences, 5 words, 0 OOVs
0 zeroprobs, logprob= -10.1123 ppl= 48.4592 ppl1= 105.306

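These are the entries I extracted from the language model. Following the explanation above, the left column is the probability and the right column is the backoff weight, both given as log10 values: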

-2.233032   <s> 放  -2.999944
-2.02901    <s> 放 一首
-0.7478155  一首 音樂   -3.733402
-1.902389   音樂 好 -3.254402
-0.2142911  放 一首 音樂

Comparing the two:

1.p( 放 | <s> )=p(<s> 放)=  -2.233032   OK

2.p( 一首 | 放 ...)=p( 一首 | <s>, 放) = p(<s> 放 一首)=-2.02901   OK

3.p( 音樂 | 一首 ...)=p( 音樂 | 放 , 一首 )=p(放 一首 音樂) = -0.2142911  OK

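The hardest one is p( 好 | 音樂 ...), because the output shows 2-gram here even though we are evaluating with a 3-gram model, so the formula above comes into play:

4.p( 好 | 音樂 ...)=p( 好 | 一首,音樂 )=p(一首 音樂 好)   # note: there is no trigram entry for (一首 音樂 好), so we have to back off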

=p(音樂 好) x back_p(一首 音樂)= -1.902389 + -3.733402 = -5.635791   OK

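I won't walk through the remaining steps one by one. This shows how each step of the PPL output is computed. Also, don't assume that a line labeled 2-gram in the PPL output depends only on the previous word: when you evaluate with a 3-gram model, every word is conditioned on the two preceding words, and some of the probabilities are simply obtained through backoff.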
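The lookup-with-backoff procedure can be sketched in a few lines of Python and checked against the entries listed above. The dictionaries hold only the five entries quoted in this post, and the function name `log_prob` is my own. Treating a missing context as a backoff weight of 0.0 in the log domain corresponds to the final `else` branch of the pseudocode.

```python
# log10 probabilities and backoff weights copied from the entries listed above.
LOGPROB = {
    ("<s>", "放"): -2.233032,
    ("<s>", "放", "一首"): -2.02901,
    ("一首", "音樂"): -0.7478155,
    ("音樂", "好"): -1.902389,
    ("放", "一首", "音樂"): -0.2142911,
}
BACKOFF = {
    ("<s>", "放"): -2.999944,
    ("一首", "音樂"): -3.733402,
    ("音樂", "好"): -3.254402,
}

def log_prob(word, context):
    """Return log10 P(word | context), backing off to shorter contexts as needed."""
    ngram = context + (word,)
    if ngram in LOGPROB:
        return LOGPROB[ngram]
    if not context:
        raise KeyError(f"no unigram entry for {word!r}")
    # Back off: add the context's backoff weight (0.0 if the context is unseen)
    # and retry with the oldest context word dropped.
    return BACKOFF.get(context, 0.0) + log_prob(word, context[1:])

print(log_prob("放", ("<s>",)))          # step 1: direct bigram hit
print(log_prob("一首", ("<s>", "放")))    # step 2: direct trigram hit
print(log_prob("音樂", ("放", "一首")))   # step 3: direct trigram hit
print(log_prob("好", ("一首", "音樂")))   # step 4: backs off to the bigram 音樂 好
```

Because everything is stored as log10 values, the multiplication in the pseudocode becomes an addition here.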

So what is the formula for this backoff weight?
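In a backoff model the weight for a context h is normally chosen so that the probabilities sum to 1: bow(h) = (1 - sum of P(w|h)) / (1 - sum of P(w|h')), where both sums run over the words w actually seen after h, and h' is h with its oldest word dropped. The sketch below checks this standard normalization against the l1.lm listing shown later in this post, using the context 去. The function name `backoff_weight` is my own, and SRILM's actual computation may differ in detail.

```python
import math

# log10 values copied from the l1.lm listing in the interpolation section.
unigram = {"上海": -1.079181, "北京": -1.079181, "蘇州": -1.079181}   # P(w)
after_qu = {"上海": -0.60206, "北京": -0.60206, "蘇州": -0.60206}    # P(w | 去)

def backoff_weight(higher, lower):
    """log10 backoff weight of a context, given the log10 probabilities of the
    words seen after it (higher order) and the same words one order lower."""
    seen_mass_high = sum(10 ** lp for lp in higher.values())
    seen_mass_low = sum(10 ** lower[w] for w in higher)
    return math.log10((1 - seen_mass_high) / (1 - seen_mass_low))

print(backoff_weight(after_qu, unigram))   # the listing gives -0.4771211 for 去
```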

What if a word in the corpus is not in the word list? How does the language model change?

Let's run an experiment.

Corpus: welcome.corpus.pat

歡迎你
歡迎加入你們庭
歡迎加入小組

Generate the word list small.wlist:

#!/bin/bash
 # Segment the corpus into words
 ./tools/wrdmrgseg_ggl-v3.sh ./118k-kuwomusic.new.vocab.dict2 welcome.corpus.pat corpus.pat.wseg
 rm -f small.wlist
 echo '</s>' >> small.wlist
 echo '<s>' >> small.wlist
 # Append the unique tokens, sorted in byte order
 awk '{for(i=1;i<=NF;i++){print $i}}' corpus.pat.wseg | LC_ALL=C sort -u >> small.wlist

Word list contents (龜速 is an arbitrary extra word; ignore it):

</s>
<s>

加入
你們庭
小組
歡迎
龜速

Generate the language model welcome.lm1:

ngram-count -order 3 -debug 1 -text corpus.pat.wseg  -vocab small.wlist -gt3min 1  -lm lm/welcome.lm1

Then we switch to a different corpus while keeping the word list unchanged:

歡迎你
歡迎加入你們庭
歡迎加入小組
加入大胡歡迎

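Generate the language model again as welcome.lm2.

Compare the two:

It is easy to see that the model on the right has two extra entries (marked with red rectangles). They come from the sentence we added. Because 大胡 is not in the word list, the sentence is split at that word. Note that this is a split, not a line break: for <s> 加入 there is only a sentence start and no sentence end.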
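A rough way to picture this is to list the bigrams that survive when every n-gram containing an out-of-vocabulary word is dropped. The sketch below is a simplified model of that effect, not SRILM's implementation. The segmentation of each sentence and the presence of 你 in the word list are my assumptions; the tokens themselves follow the listings above.

```python
def known_bigrams(sentences, vocab):
    """Collect the bigrams whose words are all in vocab, padding with <s> and </s>."""
    found = set()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        for pair in zip(padded, padded[1:]):
            if all(w in vocab for w in pair):
                found.add(pair)
    return found

# Assumed word list and segmentation.
vocab = {"<s>", "</s>", "你", "加入", "你們庭", "小組", "歡迎", "龜速"}
corpus1 = [["歡迎", "你"], ["歡迎", "加入", "你們庭"], ["歡迎", "加入", "小組"]]
corpus2 = corpus1 + [["加入", "大胡", "歡迎"]]

# Bigrams that appear only after adding the fourth sentence.
new_bigrams = known_bigrams(corpus2, vocab) - known_bigrams(corpus1, vocab)
print(sorted(new_bigrams))
```

Under these assumptions the only new bigrams are `<s> 加入` and `歡迎 </s>`, which is consistent with the two extra entries described above: the unknown word 大胡 cuts the sentence into a start fragment and an end fragment.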

Language Model Perplexity

ngram -ppl devtest2006.en -order 3 -lm europarl.en.lm > europarl.en.lm.ppl


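  The test set is devtest2006.en, the WMT08 machine translation test set, with 2000 sentences.

The -ppl option scores the sentences in the test set (log P(T), where P(T) is the product of the probabilities of all sentences) and computes the test-set perplexity.

europarl.en.lm.ppl is the output file; the other options are the same as above. The output looks like this: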

 file devtest2006.en: 2000 sentences, 52388 words, 249 OOVs
 0 zeroprobs, logprob= -105980 ppl= 90.6875 ppl1= 107.805


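  The first line gives basic information about the file devtest2006.en: 2000 sentences, 52388 words, and 249 out-of-vocabulary (OOV) words.
  The second line gives the scoring results: no zero-probability words; logP(T)=-105980, ppl=90.6875, ppl1=107.805. Both ppl and ppl1 are perplexities; their formulas differ slightly:

ppl=10^{-{logP(T)}/{Sen+Word}}; ppl1=10^{-{logP(T)}/Word}

  Here Sen is the number of sentences and Word is the number of scored words, that is, the word count minus OOVs and zero-probability words (52388 - 249 = 52139 in this example).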
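As a quick check, the two formulas can be wrapped in a small Python helper and applied to the summary line above. The function name `perplexities` and its parameter names are my own. Note that OOVs and zero-probability words are subtracted from the word count before dividing, since SRILM does not score them.

```python
def perplexities(logprob, sentences, words, oovs=0, zeroprobs=0):
    """Return (ppl, ppl1) from the summary statistics printed by `ngram -ppl`."""
    scored = words - oovs - zeroprobs               # tokens that received a probability
    ppl = 10 ** (-logprob / (scored + sentences))   # counts one </s> per sentence
    ppl1 = 10 ** (-logprob / scored)                # leaves the </s> tokens out
    return ppl, ppl1

ppl, ppl1 = perplexities(-105980, sentences=2000, words=52388, oovs=249)
print(round(ppl, 2), round(ppl1, 2))
```

The results agree with the reported 90.6875 and 107.805 up to the rounding of logprob in the output.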

Let's work through an example ourselves:

我 要 去 上海 明珠路 五百 五 十五 弄
        p( 我 | <s> )   = [2gram] 0.126626 [ -0.897477 ]
        p( 要 | 我 ...)         = [2gram] 0.194285 [ -0.71156 ]
        p( 去 | 要 ...)         = [2gram] 0.205612 [ -0.686952 ]
        p( 上海 | 去 ...)       = [2gram] 0.00419823 [ -2.37693 ]
        p( 明珠路 | 上海 ...)   = [2gram] 6.65196e-06 [ -5.17705 ]
        p( 五百 | 明珠路 ...)   = [2gram] 0.00264877 [ -2.57696 ]
        p( 五 | 五百 ...)       = [2gram] 0.0768465 [ -1.11438 ]
        p( 十五 | 五 ...)       = [2gram] 0.0159186 [ -1.79809 ]
        p( 弄 | 十五 ...)       = [2gram] 0.0543947 [ -1.26444 ]
        p( </s> | 弄 ...)       = [2gram] 0.0667069 [ -1.17583 ]
1 sentences, 9 words, 0 OOVs
0 zeroprobs, logprob= -17.7797 ppl= 59.9746 ppl1= 94.519
>>> pow(10,-1.0/10*(-17.7797))
59.97496455867574
>>> pow(10,-1.0/9*(-17.7797))
94.51967555580339

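The formula in detail:

PP(S)=P(w1 w2 ... wN)^{-1/N}=(\prod_{i=1}^{N} 1/p(wi))^{1/N}

logprob is the sum of the log probabilities of the individual n-grams; in the example above, the sum of the last column is indeed logprob.

S stands for the sentence, N is the sentence length, and p(wi) is the probability of the i-th word. Multiplying the N terms and then taking the N-th root normalizes for sentence length.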
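The following sketch checks both statements on the example sentence: the bracketed column sums to logprob, and the N-th root of the inverse product gives the same perplexity as the log form. The probabilities are copied from the output above. N is taken to be 10 here (9 words plus `</s>`), which corresponds to ppl rather than ppl1.

```python
import math

# Per-word probabilities copied from the ngram -ppl output above (9 words plus </s>).
probs = [0.126626, 0.194285, 0.205612, 0.00419823, 6.65196e-06,
         0.00264877, 0.0768465, 0.0159186, 0.0543947, 0.0667069]

n = len(probs)
logprob = sum(math.log10(p) for p in probs)       # sum of the bracketed column
ppl_from_logs = 10 ** (-logprob / n)              # log form used by SRILM
ppl_from_product = math.prod(probs) ** (-1 / n)   # N-th root of the inverse product

print(round(logprob, 4), round(ppl_from_logs, 2), round(ppl_from_product, 2))
```

In practice the log form is preferable, because the raw product underflows quickly for long texts.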

 

How the Weights Change After Model Interpolation

Text:

$ head l1.wseg  l2.wseg
==> l1.wseg <==
導航 去 上海
導航 去 蘇州
導航 去 北京

==> l2.wseg <==
聽 周杰倫 的 歌曲
聽 汪峯 的 歌曲
聽 劉德華 的 歌曲

 

\data\                                                 |\data\                                                 |-1.380211   蘇州    -0.07638834                        
ngram 1=7                                              |ngram 1=8                                              |                                                       
ngram 2=8                                              |ngram 2=9                                              |\2-grams:                                              
ngram 3=7                                              |ngram 3=10                                             |-0.4259687  <s> 聽  0.05551729                         
                                                       |                                                       |-0.4259687  <s> 導航    0                              
\1-grams:                                              |\1-grams:                                              |-0.455932   上海 </s>                                  
-0.60206    </s>                                       |-0.69897    </s>                                       |-0.4259687  劉德華 的   0.07918127                     
-99 <s> -0.4771213                                     |-99 <s> -0.50515                                       |-0.455932   北京 </s>                                  
-1.079181   上海    -0.1760913                         |-1.176091   劉德華  -0.50515                           |-0.90309    去 上海 0                                  
-1.079181   北京    -0.1760913                         |-0.69897    聽  -0.9030898                             |-0.90309    去 北京 0                                  
-0.60206    去  -0.4771211                             |-1.176091   周杰倫  -0.50515                           |-0.90309    去 蘇州 0                                  
-0.60206    導航    -0.4771213                         |-0.69897    歌曲    -0.50515                           |-0.8239088  聽 劉德華   0.07918127                     
-1.079181   蘇州    -0.1760913                         |-1.176091   汪峯    -0.50515                           |-0.8239088  聽 周杰倫   0.07918127                     
                                                       |-0.69897    的  -0.50515                               |-0.8239088  聽 汪峯 0.07918127                         
\2-grams:                                              |                                                       |-0.4259687  周杰倫 的   0.07918127                     
-0.1249387  <s> 導航    0                              |\2-grams:                                              |-0.4259687  導航 去 0                                  
-0.30103    上海 </s>                                  |-0.1249387  <s> 聽  0.3979399                          |-0.30103    歌曲 </s>                                  
-0.30103    北京 </s>                                  |-0.1249387  劉德華 的   0.30103                        |-0.4259687  汪峯 的 0.07918127                         
-0.60206    去 上海 0                                  |-0.5228788  聽 劉德華   0.30103                        |-0.4259687  的 歌曲 0                                  
-0.60206    去 北京 0                                  |-0.5228788  聽 周杰倫   0.30103                        |-0.455932   蘇州 </s>                                  
-0.60206    去 蘇州 0                                  |-0.5228788  聽 汪峯 0.30103                            |                                                       
-0.1249387  導航 去 0                                  |-0.1249387  周杰倫 的   0.30103                        |\3-grams:                                              
-0.30103    蘇州 </s>                                  |-0.1249387  歌曲 </s>                                  |-0.455932   去 上海 </s>                               
                                                       |-0.1249387  汪峯 的 0.30103                            |-0.60206    聽 劉德華 的                               
\3-grams:                                              |-0.1249387  的 歌曲 0                                  |-0.455932   去 北京 </s>                               
-0.30103    去 上海 </s>                               |                                                       |-0.90309    導航 去 上海                               
-0.30103    去 北京 </s>                               |\3-grams:                                              |-0.90309    導航 去 北京                               
-0.60206    導航 去 上海                               |-0.30103    聽 劉德華 的                               |-0.90309    導航 去 蘇州                               
-0.60206    導航 去 北京                               |-0.60206    <s> 聽 劉德華                              |-0.90309    <s> 聽 劉德華                              
-0.60206    導航 去 蘇州                               |-0.60206    <s> 聽 周杰倫                              |-0.90309    <s> 聽 周杰倫                              
-0.1249387  <s> 導航 去                                |-0.60206    <s> 聽 汪峯                                |-0.90309    <s> 聽 汪峯                                
-0.30103    去 蘇州 </s>                               |-0.30103    聽 周杰倫 的                               |-0.60206    聽 周杰倫 的                               
                                                       |-0.1249387  的 歌曲 </s>                               |-0.4259687  <s> 導航 去                                
\end\                                                  |-0.30103    聽 汪峯 的                                 |-0.30103    的 歌曲 </s>                               
~                                                      |-0.30103    劉德華 的 歌曲                             |-0.60206    聽 汪峯 的                                 
~                                                      |-0.30103    周杰倫 的 歌曲                             |-0.60206    劉德華 的 歌曲                             
~                                                      |-0.30103    汪峯 的 歌曲                               |-0.60206    周杰倫 的 歌曲                             
~                                                      |                                                       |-0.60206    汪峯 的 歌曲                               
~                                                      |\end\                                                  |-0.455932   去 蘇州 </s>                               
~                                                      |~                                                      |                                                       
~                                                      |~                                                      |\end\                                                  
~                                                      |~                                                      |~                                                      
~                                                      |~                                                      |~                                                      
~                                                      |~                                                      |~                                                      
l1.lm                                                   l2.lm                                                   l3.lm

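This simple example shows that the n-gram probabilities become lower in the interpolated model, which matches intuition. For instance, 去 上海 drops from -0.60206 in l1.lm to -0.90309 in l3.lm, that is, from 0.25 to 0.125.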
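The numbers in the three listings are consistent with plain linear interpolation in the probability domain with a weight of 0.5. That weight is inferred from the listings; the interpolation command is not shown in this post. Here is a minimal sketch, with `interpolate` as a hypothetical helper name and assuming that a model which does not know a word contributes probability 0:

```python
import math

def interpolate(logp1, logp2, lam=0.5):
    """Mix two log10 probabilities linearly in the probability domain."""
    p1 = 0.0 if logp1 is None else 10 ** logp1   # None: the model assigns no mass
    p2 = 0.0 if logp2 is None else 10 ** logp2
    return math.log10(lam * p1 + (1 - lam) * p2)

# 去 上海: -0.60206 in l1.lm; l2.lm has neither word, so it contributes 0.
print(interpolate(-0.60206, None))        # l3.lm lists -0.90309
# 上海 </s>: -0.30103 in l1.lm; l2.lm falls back to its unigram </s> (-0.69897).
print(interpolate(-0.30103, -0.69897))    # l3.lm lists -0.455932
```

Entries known to only one model are halved, while entries such as 上海 </s> lose less because the other model still contributes its unigram estimate.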
