MIT Natural Language Processing, Lecture 3: Probabilistic Language Models

Date: 2019-11-25


1. Introduction

a) Predicting the probability of a string

　i. Which string is more likely, or more grammatical?

　　1. Grill doctoral candidates.

　　2. Grill doctoral updates.

　　(example from Lee 1997)

　ii. Methods for assigning probabilities to strings are called language models.

b) Motivation

　i. Speech recognition, spelling correction, optical character recognition, and other applications

　ii. Let E be the physical evidence; we need to determine whether the string W is the message encoded by E

　iii. Use Bayes' rule:

　　　　$$P(W \mid E) = \frac{P_{LM}(W)\,P(E \mid W)}{P(E)}$$

　where $P_{LM}(W)$ is the language model probability

　iv. $P_{LM}(W)$ provides the information necessary for disambiguation (especially when the physical evidence is not sufficient for disambiguation)
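The Bayes-rule decision above can be sketched in a few lines of Python. This is only an illustration: the function name `best_hypothesis` is hypothetical and the probabilities are invented numbers, not values from the lecture. Because P(E) is the same for every candidate W, it is dropped from the comparison.

```python
def best_hypothesis(candidates):
    """Return the string W that maximizes P_LM(W) * P(E | W).

    `candidates` maps each string W to a pair (p_lm, p_evidence_given_w).
    P(E) is constant across candidates, so it does not change the argmax.
    """
    return max(candidates, key=lambda w: candidates[w][0] * candidates[w][1])


# Invented numbers: the evidence fits both strings about equally well,
# so the language model probability decides between them.
candidates = {
    "Grill doctoral candidates.": (1e-9, 0.40),
    "Grill doctoral updates.": (1e-12, 0.45),
}
print(best_hypothesis(candidates))
```

With these made-up values the evidence slightly favors the second string, but the language model term outweighs it.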

c) How to compute it?

　i. Naive approach:

　　1. Use the maximum likelihood estimate: the number of times the string occurs in the corpus S, normalized by the corpus size:

　　　　$$P_{MLE}(\text{Grill doctorate candidates}) = \frac{count(\text{Grill doctorate candidates})}{|S|}$$

　　2. For unseen events, $P_{MLE} = 0$

　　- Dreadful behavior in the presence of data sparseness

d) Two famous sentences

　i. "It is fair to assume that neither sentence

　　'Colorless green ideas sleep furiously'

　　nor

　　'Furiously sleep ideas green colorless'

　　... has ever occurred ... Hence, in any statistical model ... these sentences will be ruled out on identical grounds as equally 'remote' from English. Yet (1), though nonsensical, is grammatical, while (2) is not." [Chomsky 1957]

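　ii. Note: this passage is from page 9 of Chomsky's "Syntactic Structures". The two sentences below have never occurred in English discourse and are, statistically speaking, equally "remote" from English, yet only sentence 1 is grammatical:

　　1) Colorless green ideas sleep furiously.

　　2) Furiously sleep ideas green colorless.

　　Whether these sentences "have never occurred in English discourse" and are "statistically equally remote from English" depends on the point of view. If we set aside the specific words and look at word classes instead, the pattern of sentence 1 is probably more frequent than that of sentence 2, and it has occurred in English.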

2. Building a Language Model

a) The language modeling problem

　i. Start with some vocabulary:

　　ν = {the, a, doctorate, candidate, Professors, grill, cook, ask, ...}

　ii. Get a training sample over the vocabulary ν:

　　Grill doctorate candidate.

　　Cook Professors.

　　Ask Professors.

　　…...

　iii. Assumption: the training sample is drawn from some hidden distribution P

　iv. Goal: learn a probability distribution P′ that is as close to P as possible

　　　　　$$\sum_{x \in \nu} P'(x) = 1, \quad P'(x) \ge 0$$

　　　　　$$P'(\text{candidates}) = 10^{-5}$$

　　　　　$$P'(\text{ask candidates}) = 10^{-8}$$

b) Deriving a language model

　i. Assign a probability to a word sequence $w_1 w_2 \ldots w_n$

　ii. Apply the chain rule:

　　1. $P(w_1 w_2 \ldots w_n) = P(w_1 \mid S) \cdot P(w_2 \mid S, w_1) \cdot P(w_3 \mid S, w_1, w_2) \cdots P(E \mid S, w_1, w_2, \ldots, w_n)$, where S and E mark the start and end of the sequence

　　2. History-based model: we predict future events from past events

　　3. How much context do we need to take into account?

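c) Markov assumption

　i. For arbitrarily long contexts, $P(w_i \mid w_{i-n} \ldots w_{i-1})$ is difficult to estimate

　ii. Markov assumption: the i-th word $w_i$ depends only on the previous n words

　iii. Trigram model (also called a second-order Markov model):

　　1. $P(w_i \mid START, w_1, w_2, \ldots, w_{i-1}) = P(w_i \mid w_{i-2}, w_{i-1})$

　　2. $P(w_1 w_2 \ldots w_n) = P(w_1 \mid S) \cdot P(w_2 \mid S, w_1) \cdot P(w_3 \mid w_1, w_2) \cdots P(E \mid w_{n-1}, w_n)$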
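As a rough sketch of the trigram factorization above, the following Python code estimates each factor by relative frequency and multiplies the factors together. The helper names and the toy corpus are assumptions made for illustration. One further assumption: the sentence is padded with two start symbols so that every factor has a two-word context, which is equivalent to writing P(w1|S) and P(w2|S,w1).

```python
from collections import Counter

START, END = "<S>", "<E>"  # hypothetical boundary symbols


def train_trigram_counts(sentences):
    """Count trigrams and their two-word contexts in tokenized sentences."""
    tri, ctx = Counter(), Counter()
    for words in sentences:
        padded = [START, START] + words + [END]
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            tri[context + (padded[i],)] += 1
            ctx[context] += 1
    return tri, ctx


def sentence_probability(words, tri, ctx):
    """Multiply the unsmoothed trigram estimates along the sentence."""
    padded = [START, START] + words + [END]
    prob = 1.0
    for i in range(2, len(padded)):
        context = (padded[i - 2], padded[i - 1])
        if ctx[context] == 0:
            return 0.0  # unseen context: the unsmoothed estimate is zero
        prob *= tri[context + (padded[i],)] / ctx[context]
    return prob


corpus = [
    ["grill", "doctorate", "candidates"],
    ["cook", "professors"],
    ["ask", "professors"],
]
tri, ctx = train_trigram_counts(corpus)
print(sentence_probability(["ask", "professors"], tri, ctx))
print(sentence_probability(["grill", "professors"], tri, ctx))
```

The second sentence uses only known words but contains an unseen trigram, so the unsmoothed model assigns it probability zero.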

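d) A computational model of language

　i. A useful conceptual and practical device: the "coin-flipping" model

　　1. Sentences are generated by a randomized algorithm

　　- The generator can be in one of several "states"

　　- Flip coins to choose the next state

　　- Flip other coins to decide which letter or word to output

　ii. Shannon: "The states will correspond to the 'residue of influence' from preceding letters"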
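A minimal sketch of such a generator, assuming a bigram model in which the state is simply the previous word, so choosing the next state and choosing the output word collapse into one weighted draw. This simplifies the two-coin description above; the function names and the toy corpus are hypothetical.

```python
import random
from collections import Counter, defaultdict

START, END = "<S>", "<E>"  # hypothetical boundary symbols


def train_bigram_counts(sentences):
    """For each word (state), count the words that follow it."""
    successors = defaultdict(Counter)
    for words in sentences:
        padded = [START] + words + [END]
        for prev, nxt in zip(padded, padded[1:]):
            successors[prev][nxt] += 1
    return successors


def generate(successors, rng, max_len=20):
    """Walk the chain from START, drawing each next word by weighted chance."""
    state, output = START, []
    while len(output) < max_len:
        options = successors[state]
        words = list(options)
        weights = [options[w] for w in words]
        nxt = rng.choices(words, weights=weights)[0]  # the "coin flip"
        if nxt == END:
            break
        output.append(nxt)
        state = nxt
    return output


corpus = [
    ["grill", "doctorate", "candidates"],
    ["cook", "professors"],
    ["ask", "professors"],
]
model = train_bigram_counts(corpus)
print(" ".join(generate(model, random.Random(0))))
```

Every adjacent pair in the generated output was seen in training, which is why higher-order models produce locally more fluent text.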

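e) Word-based approximations

　Note: the following sentences were generated at random after training on the works of Shakespeare; see "Speech and Language Processing" (Jurafsky and Martin)

　i. Unigram approximation (the MIT slides are wrong here; this is not a first-order approximation)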

　　1. To him swallowed confess hear both. which. OF save

　　2. on trail for are ay device and rote life have

　　3. Every enter now severally so, let

　　4. Hill he late speaks; or! a more to leg less first you

　　5. enter

　ii. Trigram approximation (the slides are wrong here; this is not a third-order approximation)

　　1. King Henry. What! I will go seek the traitor Gloucester.

　　2. Exeunt some of the watch. A great banquet serv'd in;

　　3. Will you tell me how I am?

　　4. It cannot be but so.

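3. Evaluating Language Models

a) Evaluating a language model

　i. We have n test strings:

　　　　　$$S_1, S_2, \ldots, S_n$$

　ii. Consider the probability of these strings under our model:

　　　　　$$\prod_{i=1}^{n} P(S_i)$$

　　or the log probability:

　　　　　$$\log_2 \prod_{i=1}^{n} P(S_i) = \sum_{i=1}^{n} \log_2 P(S_i)$$

　iii. Perplexity:

　　　　　$$\text{Perplexity} = 2^{-x}$$

　　where $x = \frac{1}{W} \sum_{i=1}^{n} \log_2 P(S_i)$ (logarithms are base 2, matching the $2^{-x}$ above)

　　W is the total number of words in the test data.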

　iv. Perplexity is a measure of the effective "branching factor"

　　1. We have a vocabulary ν of size N, and the model predicts:

　　$P(w) = 1/N$ for all the words in ν.

　v. What about perplexity?

　　　　　$$\text{Perplexity} = 2^{-x}$$

　　　where $x = \log_2 \frac{1}{N}$

　　　hence Perplexity = N

　vi. Estimate of human performance (Shannon, 1951)

　　1. Shannon game: humans guess the next letter in a text

　　2. PP = 142 (1.3 bits/letter), uncased, open vocabulary

　vii. Estimate of a trigram language model (Brown et al., 1992)

　　PP = 790 (1.75 bits/letter), cased, open vocabulary
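The perplexity definition in this section can be checked numerically. The sketch below assumes, for simplicity, a model that scores each word independently, so P(S_i) is the product of its word probabilities. The function name `perplexity` and the toy data are illustrative. It confirms the branching-factor claim: a uniform model over N words has perplexity N.

```python
import math


def perplexity(sentences, word_prob):
    """Compute 2^(-x), where x is the average log2 probability per word."""
    total_log2 = 0.0
    total_words = 0
    for words in sentences:
        for w in words:
            total_log2 += math.log2(word_prob(w))
        total_words += len(words)
    x = total_log2 / total_words
    return 2 ** (-x)


vocab = ["the", "a", "doctorate", "candidate", "professors", "grill", "cook", "ask"]
N = len(vocab)


def uniform(word):
    """Uniform model: every word in the vocabulary has probability 1/N."""
    return 1.0 / N


test_data = [["grill", "the", "candidate"], ["ask", "a", "professors"]]
print(perplexity(test_data, uniform))  # equals N for the uniform model
```

Any model that concentrates probability on the words that actually occur in the test data will score a perplexity below N.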

4. Smoothing

a) Maximum likelihood estimate

　i. MLE makes the training data as probable as possible:

　　　$$P_{ML}(w_i \mid w_{i-2}, w_{i-1}) = \frac{Count(w_{i-2}, w_{i-1}, w_i)}{Count(w_{i-2}, w_{i-1})}$$

　　1. For a vocabulary of size N, the model has $N^3$ parameters

　　2. For N = 1,000, we have to estimate $1{,}000^3 = 10^9$ parameters

　　3. Problem: how do we deal with unseen words?

　ii. Sparsity

　　1. The aggregate probability of unseen events constitutes a large fraction of the test data

　　2. Brown et al. (1992): in a 350-million-word corpus of English, 14% of the trigrams are unseen

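　iii. Note: a brief supplement on MLE

　　1. Maximum likelihood estimation is a statistical method for finding the parameters of the probability density function behind a sample. It was first used by the geneticist and statistician Sir Ronald Fisher between 1912 and 1922.

　　2. The Chinese term 「似然」 is a rather classical rendering of "likelihood"; in modern Chinese it simply means 「可能性」 ("possibility"), so "maximum possibility estimation" would be a more accessible name.

　　3. MLE chooses the parameters that give the training corpus the highest probability; it wastes no probability mass on events that do not occur in the training corpus.

　　4. However, an MLE model is usually unsuitable as a statistical language model for NLP, because it assigns zero probability to unseen events, which is unacceptable.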

b) How do we estimate the probability of unseen elements?

　i. Discounting

　　1. Laplace (add-one) smoothing

　　2. Good-Turing discounting

　ii. Linear interpolation

　iii. Katz back-off

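c) Add-one (Laplace) smoothing

　i. Simplest discounting technique:

　　　$$P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + V}$$

　　where V is the vocabulary size, i.e. the number of word types in the corpus

　　Note: the MIT slides appear to contain an error here, which I have corrected

　ii. A Bayesian estimator that assumes a uniform prior over events

　iii. Problem: too much probability mass is given to unseen events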

　iv. Example:

　　Assume |ν| = 10,000 (word types) and S = 1,000,000 (word tokens):

　　　$$P_{MLE}(\text{ball} \mid \text{kick a}) = \frac{Count(\text{kick a ball})}{Count(\text{kick a})} = \frac{9}{10} = 0.9$$

　　　$$P_{+1}(\text{ball} \mid \text{kick a}) = \frac{Count(\text{kick a ball}) + 1}{Count(\text{kick a}) + V} = \frac{9 + 1}{10 + 10000} \approx 9.99 \times 10^{-4}$$

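　v. Weaknesses of Laplace

　　1. For sparse distributions, Laplace's law gives too much of the probability space to unseen events

　　2. It is much worse at predicting the actual probabilities of bigrams than other smoothing methods

　　3. Add-epsilon smoothing is more reasonable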
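The arithmetic in the "kick a ball" example can be reproduced with a short sketch. The function names `mle` and `add_k` are hypothetical; `add_k` with k = 1 is add-one smoothing, and a smaller k stands in for add-epsilon smoothing. The value 0.01 is an arbitrary illustrative choice, not taken from the lecture.

```python
def mle(count_ngram, count_context):
    """Unsmoothed relative-frequency estimate."""
    return count_ngram / count_context


def add_k(count_ngram, count_context, vocab_size, k=1.0):
    """Add-k estimate: (C + k) / (C_context + k * V)."""
    return (count_ngram + k) / (count_context + k * vocab_size)


V = 10_000
print(mle(9, 10))               # 0.9
print(add_k(9, 10, V))          # about 0.000999 with add-one
print(add_k(9, 10, V, k=0.01))  # add-epsilon keeps more mass on the seen event
print(add_k(0, 10, V))          # each unseen continuation also gets about 0.0001
```

With add-one smoothing, a trigram seen in 9 of 10 cases ends up with only about ten times the probability of a trigram that was never seen at all, which illustrates the weakness listed above.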

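5. Good-Turing Discounting

a) How likely are you to see a new word in the future? Use the events you have seen to estimate the probability of unseen events

　i. $n_r$: the number of elements (n-grams) with frequency r, for r > 0

　ii. $n_0$: the total vocabulary size minus the observed vocabulary size, i.e. the number of n-grams that occur zero times

　iii. For an element with frequency r, the modified count is:

　　　　　　　　$$r^* = (r + 1) \cdot \frac{n_{r+1}}{n_r}$$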

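b) Supplementary notes on Good-Turing discounting:

　i. Good (1953) first described the Good-Turing algorithm, whose original idea came from Turing.

　ii. The basic idea of Good-Turing smoothing is to use the number of n-grams observed with higher counts to re-estimate the amount of probability mass assigned to n-grams with zero or low counts.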

c) Good-Turing discounting: intuition

　i. Goal: estimate how often a word with r counts in the training data occurs in a test set of equal size.

　ii. We use deleted estimation:

　　1. Delete one word at a time

　　2. If the "test" word occurs r + 1 times in the complete data set:

　　- it occurs r times in the "training" set

　　- add one count to the words with r counts

　iii. The total count placed in the bucket for r-count words is:

　　　　　　　　　$$n_{r+1} \cdot (r + 1)$$

　iv. The average count is:

　　　　　　$$(\text{avg-count of } r\text{-count words}) = \frac{n_{r+1} \cdot (r + 1)}{n_r}$$
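The deleted-estimation argument above can be checked by brute force on a toy corpus. In this sketch (the function name and data are illustrative assumptions), each token is held out in turn, its word's count in the remaining data is r, and one count is added to bucket r. The bucket totals then match n_{r+1}·(r+1).

```python
from collections import Counter


def deleted_estimation(tokens):
    """Hold out each token once and tally it by its count in the remaining data."""
    counts = Counter(tokens)
    n = Counter(counts.values())  # n[r] = number of word types with count r
    bucket = Counter()
    for tok in tokens:
        r = counts[tok] - 1       # count of this word once the token is deleted
        bucket[r] += 1            # add one count to the r-count bucket
    return n, bucket


tokens = "a a a b b c c d e f".split()
n, bucket = deleted_estimation(tokens)
for r in sorted(bucket):
    print(r, bucket[r], (r + 1) * n[r + 1])  # the last two columns agree
print(bucket[1] / n[1])  # average count for r = 1, i.e. r* = 2 * n_2 / n_1
```

Dividing a bucket total by n_r gives the Good-Turing adjusted count r*. For r = 0 the divisor n_0 requires knowing the size of the full vocabulary, which this toy example does not model.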

d) Good-Turing discounting (cont.):

　i. In Good-Turing, the total probability assigned to all the unobserved events is equal to $n_1/N$, where N is the size of the training set. It is the same as a relative frequency formula would assign to singleton events.

e) Example: Good-Turing

Training sample of 22,000,000 (Church & Gale, 1991)

r    N_r               held-out    r^*
0    74,671,100,000    0.000027    0.000027
1    2,018,046         0.448       0.446
2    449,721           1.25        1.26
3    188,933           2.24        2.24
4    105,668           3.23        3.24
5    68,379            4.21        4.22
6    48,190            5.23        5.19
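The r^* column can be recomputed from the N_r column with the formula r^* = (r+1)·N_{r+1}/N_r. The sketch below uses only the counts listed in the table, so the last row (r = 6) cannot be recomputed because N_7 is not given. The function name is hypothetical, and small differences in the last digit come from rounding.

```python
def good_turing_counts(n):
    """Map r to r* = (r + 1) * n[r + 1] / n[r] wherever n[r + 1] is known."""
    return {r: (r + 1) * n[r + 1] / n[r] for r in sorted(n) if r + 1 in n}


# Counts of counts N_r copied from the table above.
N_r = {
    0: 74_671_100_000,
    1: 2_018_046,
    2: 449_721,
    3: 188_933,
    4: 105_668,
    5: 68_379,
    6: 48_190,
}
r_star = good_turing_counts(N_r)
for r, value in r_star.items():
    print(r, round(value, 6))

# Total probability reserved for unseen events: n_1 / N.
unseen_mass = N_r[1] / 22_000_000
print(round(unseen_mass, 4))
```

For r = 0 the formula gives about 0.000027, and roughly 9% of the probability mass is reserved for bigrams never seen in training.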

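f) Supplementary notes:

　i. By Zipf's law, $N_r$ is large for small r and small for large r; for the most frequent n-gram, r* = 0 (because $N_{r+1} = 0$)!

　ii. So for n-grams that occur many times the GT estimate is unreliable, while the MLE estimate is fairly accurate and can be used directly. The GT estimate is generally applied to n-grams that occur k times, with k < 10.

　iii. Seen this way, if smoothing "robs the rich to help the poor", the "rich" here turn out to be the "middle class"! The truly rich get off lightly (not that losing a little would hurt them); even discounting dares not touch the rich. So much for the sayings that "the rich are heartless" and that they "love money as life itself".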

6. Interpolation and Back-Off

a) The Bias-Variance Trade-Off

　i. Unsmoothed trigram estimate:

　　$$P_{ML}(w_i \mid w_{i-2}, w_{i-1}) = \frac{Count(w_{i-2}, w_{i-1}, w_i)}{Count(w_{i-2}, w_{i-1})}$$

　ii. Unsmoothed bigram estimate:

　　　$$P_{ML}(w_i \mid w_{i-1}) = \frac{Count(w_{i-1}, w_i)}{Count(w_{i-1})}$$

　iii. Unsmoothed unigram estimate:

　　　$$P_{ML}(w_i) = \frac{Count(w_i)}{\sum_j Count(w_j)}$$

　iv. How close are these different estimates to the "true" probability $P(w_i \mid w_{i-2}, w_{i-1})$?

b) Interpolation

　i. One way of solving the sparseness in a trigram model is to mix that model with bigram and unigram models that suffer less from data sparseness.

　ii. The weights can be set using the Expectation-Maximization (EM) algorithm or another numerical optimization technique

　iii. Linear interpolation

　　$$\hat{P}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1 P_{ML}(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2 P_{ML}(w_i \mid w_{i-1}) + \lambda_3 P_{ML}(w_i)$$

　　where $\lambda_1 + \lambda_2 + \lambda_3 = 1$ and $\lambda_i \ge 0$ for all i

　iv. Parameter estimation

　　1. Hold out part of the training set as "validation" data

　　2. Define $Count_2(w_1, w_2, w_3)$ to be the number of times the trigram $w_1, w_2, w_3$ is seen in the validation set

　　3. Choose $\lambda_i$ to maximize:

　　　$$L(\lambda_1, \lambda_2, \lambda_3) = \sum_{w_1, w_2, w_3 \in \nu} Count_2(w_1, w_2, w_3) \log \hat{P}(w_3 \mid w_1, w_2)$$

　　where $\lambda_1 + \lambda_2 + \lambda_3 = 1$ and $\lambda_i \ge 0$ for all i

　　Note: the rest of the material on parameter estimation involves too many formulas and is omitted here; please refer to the original slides
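The interpolation and parameter-estimation steps above can be sketched as follows. Note the substitution: instead of EM, this sketch uses a coarse grid search over the λ values, which is a simpler stand-in and not the method from the slides. All function names, the boundary symbols, and the toy training and validation data are assumptions for illustration.

```python
import math
from collections import Counter
from itertools import product

START, END = "<S>", "<E>"  # hypothetical boundary symbols


def train(sentences):
    """Collect the counts needed for trigram, bigram and unigram estimates."""
    c = {k: Counter() for k in ("tri", "ctx2", "bi", "ctx1", "uni")}
    for words in sentences:
        padded = [START, START] + words + [END]
        for i in range(2, len(padded)):
            u, v, w = padded[i - 2], padded[i - 1], padded[i]
            c["tri"][(u, v, w)] += 1
            c["ctx2"][(u, v)] += 1
            c["bi"][(v, w)] += 1
            c["ctx1"][v] += 1
            c["uni"][w] += 1
    c["total"] = sum(c["uni"].values())
    return c


def ratio(num, den):
    return num / den if den else 0.0


def interpolated(u, v, w, c, lambdas):
    """Linear interpolation of the three unsmoothed estimates."""
    l1, l2, l3 = lambdas
    return (l1 * ratio(c["tri"][(u, v, w)], c["ctx2"][(u, v)])
            + l2 * ratio(c["bi"][(v, w)], c["ctx1"][v])
            + l3 * ratio(c["uni"][w], c["total"]))


def log_likelihood(validation, c, lambdas):
    """L(lambda): sum of log P-hat over every trigram token in the validation set."""
    ll = 0.0
    for words in validation:
        padded = [START, START] + words + [END]
        for i in range(2, len(padded)):
            p = interpolated(padded[i - 2], padded[i - 1], padded[i], c, lambdas)
            if p == 0.0:
                return float("-inf")
            ll += math.log(p)
    return ll


def grid_search(validation, c, steps=10):
    """Try every lambda triple on a grid with l1 + l2 + l3 = 1; keep the best."""
    best = None
    for a, b in product(range(steps + 1), repeat=2):
        if a + b > steps:
            continue
        lambdas = (a / steps, b / steps, (steps - a - b) / steps)
        score = log_likelihood(validation, c, lambdas)
        if best is None or score > best[0]:
            best = (score, lambdas)
    return best


training = [["grill", "doctorate", "candidates"], ["cook", "professors"], ["ask", "professors"]]
validation = [["grill", "professors"], ["ask", "candidates"]]
counts = train(training)
print(grid_search(validation, counts))
```

Summing over the trigram tokens of the validation set is the same as weighting each distinct trigram by Count_2. Because the validation data contains trigrams and bigrams unseen in training, any setting with λ3 = 0 scores negative infinity, so the search is forced to keep some weight on the unigram model.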

c) Katz back-off models (bigrams):

　i. Define two sets:

　　$$A(w_{i-1}) = \{w : Count(w_{i-1}, w) > 0\}$$

　　$$B(w_{i-1}) = \{w : Count(w_{i-1}, w) = 0\}$$

　ii. A bigram model:

　　$$P_K(w_i \mid w_{i-1}) = \begin{cases} \dfrac{Count^*(w_{i-1}, w_i)}{Count(w_{i-1})} & \text{if } w_i \in A(w_{i-1}) \\[2ex] \alpha(w_{i-1}) \dfrac{P_{ML}(w_i)}{\sum_{w \in B(w_{i-1})} P_{ML}(w)} & \text{if } w_i \in B(w_{i-1}) \end{cases}$$

　　$$\alpha(w_{i-1}) = 1 - \sum_{w \in A(w_{i-1})} \frac{Count^*(w_{i-1}, w)}{Count(w_{i-1})}$$

　iii. Count^* definitions

　　1. Katz uses the Good-Turing method for Count(x) < 5, and Count^*(x) = Count(x) for Count(x) ≥ 5

　　2. "Kneser-Ney" method:

　　　$Count^*(x) = Count(x) - D$, where $D = \frac{n_1}{n_1 + n_2}$

　　　$n_1$ is the number of elements with frequency 1

　　　$n_2$ is the number of elements with frequency 2
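A minimal sketch of the bigram back-off model defined in this section, using the simpler discounted count Count^*(x) = Count(x) - D with a fixed D rather than Good-Turing. The function names, the value D = 0.5 and the toy corpus are assumptions for illustration. Seen bigrams use their discounted relative frequency, and the left-over mass α(w_{i-1}) is shared among unseen continuations in proportion to their unigram probabilities.

```python
from collections import Counter

START, END = "<S>", "<E>"  # hypothetical boundary symbols


def train_bigrams(sentences):
    """Count bigrams, their one-word contexts, and unigrams."""
    bigram, context, unigram = Counter(), Counter(), Counter()
    for words in sentences:
        padded = [START] + words + [END]
        for prev, w in zip(padded, padded[1:]):
            bigram[(prev, w)] += 1  # Count(w_{i-1}, w_i)
            context[prev] += 1      # Count(w_{i-1})
            unigram[w] += 1         # numerator of P_ML(w)
    return bigram, context, unigram


def katz_bigram(w, prev, bigram, context, unigram, discount=0.5):
    """P_K(w | prev) with Count* = Count - discount (an assumed fixed D)."""
    if bigram[(prev, w)] > 0:  # w is in A(prev)
        return (bigram[(prev, w)] - discount) / context[prev]
    # w is in B(prev): compute the mass alpha left over by discounting.
    seen = [v for v in unigram if bigram[(prev, v)] > 0]
    alpha = 1.0 - sum((bigram[(prev, v)] - discount) / context[prev] for v in seen)
    # P_ML(w) / sum of P_ML over B(prev); the corpus size cancels out.
    unseen_total = sum(unigram[v] for v in unigram if bigram[(prev, v)] == 0)
    if unseen_total == 0 or unigram[w] == 0:
        return 0.0
    return alpha * unigram[w] / unseen_total


corpus = [["grill", "doctorate", "candidates"], ["cook", "professors"], ["ask", "professors"]]
bigram, context, unigram = train_bigrams(corpus)
print(katz_bigram("professors", "ask", bigram, context, unigram))  # seen bigram
print(katz_bigram("candidates", "ask", bigram, context, unigram))  # backed off
```

In this toy corpus the probabilities over the whole vocabulary still sum to one for a given context, which is the purpose of the normalizing term α.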

7. Summary

a) Weaknesses of n-gram models

　i. Any ideas?

　　Short-range

　　Mid-range

　　Long-range

b) More refined models

　i. Class-based models