A Neural Probabilistic Language Model (2003): Key Points of the Paper

Date: 2019-11-09

Tags: neural probabilistic language model, paper key points


Paper link: http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf

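To address the combinatorial explosion of n-gram language models (for example, anything beyond trigrams), the paper introduces distributed representations of words.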
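A rough sense of the scale involved, as a sketch: the number of distinct n-word sequences grows as |V|^n, while the paper's count of free parameters for the neural model, |V|(1 + nm + h) + h(1 + (n-1)m), grows only linearly in |V| and n. Here m is the feature vector dimension and h the number of hidden units; the values m=60 and h=50 and the function names are illustrative assumptions.

```python
def ngram_table_size(vocab_size, n):
    """Number of distinct n-word sequences an n-gram table may have to cover."""
    return vocab_size ** n

def nplm_param_count(vocab_size, n, m, h):
    """Free parameters of the neural model: |V|(1 + nm + h) + h(1 + (n-1)m)."""
    return vocab_size * (1 + n * m + h) + h * (1 + (n - 1) * m)

V = 17000  # vocabulary size of the order mentioned in these notes
for n in (2, 3, 5):
    # m and h are illustrative choices, not prescribed values
    print(n, ngram_table_size(V, n), nplm_param_count(V, n, m=60, h=50))
```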

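Generalization comes from making the vectors of words that appear in similar contexts and similar sentences close to one another.

Compared with n-grams, the model takes into account longer contexts and the similarity between words, neither of which n-grams capture.

A shallow network (for example, one hidden layer) is trained on a large corpus.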

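The feature vector dimension is typically no more than 100, whereas the vocabulary size is typically 17,000 or more.

C is a globally shared array of word vectors: the same C is used for every position in the context.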

Training objective: maximize the regularized log-likelihood of the training corpus.
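For reference, the objective in the paper's notation, where T is the number of training words, θ is the set of all parameters, f gives the model's estimate of P(w_t | w_{t-1}, ..., w_{t-n+1}), and R(θ) is a regularization term (weight decay in the paper's experiments):

```latex
\[
L = \frac{1}{T} \sum_{t} \log f(w_t, w_{t-1}, \ldots, w_{t-n+1}; \theta) + R(\theta)
\]
```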

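The network outputs an unnormalized log-probability for each word in the vocabulary; a softmax then normalizes these into a distribution.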
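In the paper's notation, using the parameters listed below, x is the concatenation of the n-1 context word feature vectors, y holds one unnormalized log-probability per vocabulary word, and the softmax turns y into probabilities:

```latex
\[
\begin{aligned}
x &= \big(C(w_{t-1}), C(w_{t-2}), \ldots, C(w_{t-n+1})\big) \\
y &= b + W x + U \tanh(d + H x) \\
\hat{P}(w_t \mid w_{t-1}, \ldots, w_{t-n+1}) &= \frac{e^{y_{w_t}}}{\sum_{i} e^{y_i}}
\end{aligned}
\]
```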

number of hidden units = h

word feature vector dimension = m

context window width = n

output biases b: |V|

hidden layer biases d: h

hidden to output weights U: |V|*h

word feature vector to output weights W: |V|*(n-1)*m

hidden layer weights H: h*(n-1)*m

word feature vectors C: |V|*m
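The shapes above can be checked with a small forward pass. This is a minimal NumPy sketch under assumptions: the function names, the toy sizes, and the random initialization are illustrative and not taken from the paper.

```python
import numpy as np

def init_params(V, m, n, h, seed=0):
    """Randomly initialised parameters with the shapes from the list above."""
    rng = np.random.default_rng(seed)
    ctx = (n - 1) * m  # length of the concatenated context vector x
    return {
        "C": rng.normal(0.0, 0.1, (V, m)),    # word feature vectors, shared by all positions
        "H": rng.normal(0.0, 0.1, (h, ctx)),  # hidden layer weights
        "d": np.zeros(h),                     # hidden layer biases
        "U": rng.normal(0.0, 0.1, (V, h)),    # hidden-to-output weights
        "W": rng.normal(0.0, 0.1, (V, ctx)),  # direct feature-to-output weights
        "b": np.zeros(V),                     # output biases
    }

def forward(params, context):
    """Return P(w | context) for every word w; context holds n-1 word indices."""
    x = params["C"][context].reshape(-1)         # look up and concatenate features
    a = np.tanh(params["d"] + params["H"] @ x)   # hidden activations
    y = params["b"] + params["W"] @ x + params["U"] @ a  # unnormalized log-probabilities
    e = np.exp(y - y.max())                      # shift for numerical stability
    return e / e.sum()                           # softmax over the vocabulary

params = init_params(V=50, m=4, n=3, h=8)  # toy sizes, for illustration only
probs = forward(params, [3, 7])            # context (w_{t-1}, w_{t-2}) as word indices
print(probs.shape, probs.sum())
```

The output layer dominates the cost: every one of the |V| entries of y has to be computed before the softmax can normalize.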

Note that in theory, if there is a weight decay on the weights W and H but not on C, then W and H could converge towards zero while C would blow up. In practice we did not observe such behavior when training with stochastic gradient ascent.

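In each training step a large part of the parameters needs no update: the feature vectors C(j) of all words j that do not occur in the input window receive no gradient.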

Training algorithm: stochastic gradient ascent on the log-likelihood, one training example at a time.
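A single update can be sketched as follows. This is a simplified illustration under assumptions, not the paper's exact procedure: it omits weight decay, computes all gradients before applying them, and uses hypothetical names, toy sizes, and an arbitrary learning rate.

```python
import numpy as np

def sga_step(params, context, target, lr=0.1):
    """One stochastic gradient ascent step on log P(target | context).

    Updates params in place and returns the log-probability before the update.
    """
    C, H, d, U, W, b = (params[k] for k in ("C", "H", "d", "U", "W", "b"))
    m = C.shape[1]

    # Forward pass
    x = C[context].reshape(-1)
    a = np.tanh(d + H @ x)
    y = b + W @ x + U @ a
    p = np.exp(y - y.max())
    p /= p.sum()

    # Backward pass: gradient of log p[target] with respect to y is onehot - p
    dy = -p
    dy[target] += 1.0
    da = U.T @ dy
    do = (1.0 - a ** 2) * da       # back through tanh
    dx = W.T @ dy + H.T @ do       # gradient reaching the word features

    # Ascent updates (in place)
    b += lr * dy
    W += lr * np.outer(dy, x)
    U += lr * np.outer(dy, a)
    d += lr * do
    H += lr * np.outer(do, x)
    # Only the rows of C for the context words are touched;
    # np.add.at accumulates correctly if a word repeats in the context.
    np.add.at(C, context, lr * dx.reshape(len(context), m))
    return float(np.log(p[target]))

rng = np.random.default_rng(0)
V, m, n, h = 20, 3, 3, 5           # toy sizes, for illustration only
params = {
    "C": rng.normal(0.0, 0.1, (V, m)),
    "H": rng.normal(0.0, 0.1, (h, (n - 1) * m)),
    "d": np.zeros(h),
    "U": rng.normal(0.0, 0.1, (V, h)),
    "W": rng.normal(0.0, 0.1, (V, (n - 1) * m)),
    "b": np.zeros(V),
}
C_before = params["C"].copy()
lp_first = sga_step(params, [2, 7], target=4)
changed_rows = np.flatnonzero(np.any(params["C"] != C_before, axis=1))
lp_second = sga_step(params, [2, 7], target=4)
print(changed_rows, lp_first, lp_second)
```

Comparing C before and after the step shows which rows were modified, which is the sparsity the notes refer to; the output-side parameters b, W, and U are still updated in full.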

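Possible improvements:

1. Split the model into sub-networks and train them in parallel

2. Restructure the output vocabulary |V| as a tree and predict a conditional probability at each level: the computation drops from |V| to log|V|

3. Focus the gradient on particular examples, for example those containing ambiguous words

4. Incorporate prior knowledge (part of speech, etc.)

5. Interpretability

6. Polysemy (one word having multiple word vectors)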
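Improvement 2 above can be illustrated with a binary-tree output layer. This is a minimal sketch of one common way to realize the idea, not something specified in these notes: the balanced tree indexed by the bits of the word id, the per-node weight vectors, and all names are assumptions. Each word's probability is a product of log2|V| binary decisions, so only log2|V| nodes are evaluated per word.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tree_word_prob(node_vecs, hidden, word, depth):
    """P(word | hidden) as a product of `depth` binary left/right decisions."""
    node = 1     # root in heap layout: children of node i are 2i and 2i+1
    prob = 1.0
    for level in range(depth - 1, -1, -1):
        go_right = (word >> level) & 1            # next bit of the word id
        p_right = sigmoid(node_vecs[node] @ hidden)
        prob *= p_right if go_right else 1.0 - p_right
        node = 2 * node + go_right
    return prob

depth = 4
V = 2 ** depth                                # 16 words, 15 internal nodes
rng = np.random.default_rng(0)
node_vecs = rng.normal(size=(V, 5))           # row 0 unused, rows 1..V-1 are internal nodes
hidden = rng.normal(size=5)                   # stand-in for the hidden layer output
probs = [tree_word_prob(node_vecs, hidden, w, depth) for w in range(V)]
total = sum(probs)
print(total)
```

Because every internal node splits its probability mass between its two children, the leaf probabilities form a valid distribution without a |V|-way normalization.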
