Mathematical Analysis of word2vec

    This article skips the pleasantries and consists of pure mathematical derivation. It is recommended to read 《word2vec中的數學原理詳解》 first.

1. Logistic Regression

    You can read 《邏輯迴歸算法分析》 for background on logistic regression.

    The sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$

         $\sigma'(x) = \sigma(x)[1 - \sigma(x)]$

         $[\log\sigma(x)]' = \frac{\sigma'(x)}{\sigma(x)} = 1 - \sigma(x)$

         $[\log(1 - \sigma(x))]' = \frac{-\sigma'(x)}{1 - \sigma(x)} = -\sigma(x)$

    Logistic regression solves binary classification problems: one defines a maximum log-likelihood objective and optimizes it by gradient ascent. In fact, the algorithmic core of word2vec is exactly logistic regression.
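
    The three derivative identities above are used throughout the derivations below. As a quick sanity check, here is a minimal Python/NumPy sketch that verifies them numerically at an arbitrary point (the function name and the test point are illustrative only):

    import numpy as np

    def sigmoid(x):
        """sigma(x) = 1 / (1 + exp(-x))"""
        return 1.0 / (1.0 + np.exp(-x))

    x, eps = 1.3, 1e-6
    # derivative of sigma(x) should equal sigma(x) * (1 - sigma(x))
    num = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
    print(np.isclose(num, sigmoid(x) * (1 - sigmoid(x))))        # True
    # derivative of log(sigma(x)) should equal 1 - sigma(x)
    num = (np.log(sigmoid(x + eps)) - np.log(sigmoid(x - eps))) / (2 * eps)
    print(np.isclose(num, 1 - sigmoid(x)))                       # True
    # derivative of log(1 - sigma(x)) should equal -sigma(x)
    num = (np.log(1 - sigmoid(x + eps)) - np.log(1 - sigmoid(x - eps))) / (2 * eps)
    print(np.isclose(num, -sigmoid(x)))                          # True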


2. CBOW

    CBOW predicts the current word from its context words and propagates the prediction error back onto each context word, so that later predictions become more accurate.

    Notation:

    (1) $p^w$: the path from the root node to $w$;

    (2) $l^w$: the number of nodes on path $p^w$;

    (3) $p_1^w, p_2^w, \cdots, p_{l^w}^w$: the nodes on path $p^w$, where $p_1^w$ is the root node and $p_{l^w}^w$ is the node corresponding to $w$;

    (4) $d_2^w, d_3^w, \cdots, d_{l^w}^w \in \{0, 1\}$: the Huffman code of word $w$, consisting of $l^w - 1$ bits, where $d_j^w$ is the code of the $j$-th node on path $p^w$ (the root node carries no code);

    (5) $\theta_1^w, \theta_2^w, \cdots, \theta_{l^w - 1}^w \in \mathbb{R}^m$: the vectors attached to the non-leaf nodes on path $p^w$, where $\theta_j^w$ is the vector of the $j$-th non-leaf node on path $p^w$;

    (6) $Label(p_j^w) = 1 - d_j^w,\ j = 2, 3, \cdots, l^w$: the classification label of the $j$-th node on path $p^w$ (the root node is not classified). A worked example of these quantities is sketched in the code below.
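
    To make the notation concrete, the following Python sketch (illustrative only; the toy word counts and the left-child-gets-0 convention are assumptions, not taken from the word2vec source) builds a Huffman tree over a small vocabulary and prints each word's code $d^w$ and path length $l^w$:

    import heapq, itertools

    counts = {"the": 50, "cat": 20, "sat": 15, "mat": 10, "on": 5}

    counter = itertools.count()          # tie-breaker so heapq never compares dicts
    heap = [(c, next(counter), {"word": w}) for w, c in counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, next(counter), {"left": left, "right": right}))
    root = heap[0][2]

    def paths(node, path=(), code=()):
        """Yield (word, internal-node path, Huffman code) for every leaf."""
        if "word" in node:
            yield node["word"], path, code
            return
        # assumed convention: going left emits code 0, going right emits code 1
        yield from paths(node["left"], path + (id(node),), code + (0,))
        yield from paths(node["right"], path + (id(node),), code + (1,))

    for w, p, d in paths(root):
        # len(d) == l^w - 1 bits; each internal node in p owns a vector theta
        print(w, "code:", d, "path length l^w:", len(p) + 1)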

(1) Hierarchical Softmax

    Likelihood to maximize: $\prod_{w \in C} p(w|Context(w))$

    Log-likelihood: $\mathcal{L} = \sum_{w \in C} \log p(w|Context(w))$

    Conditional probability: $p(w|Context(w)) = \prod_{j=2}^{l^w} p(d_j^w|X_w, \theta_{j-1}^w)$, where:

         $p(d_j^w|X_w, \theta_{j-1}^w) = \begin{cases} \sigma(X_w^T \theta_{j-1}^w), & d_j^w = 0 \\ 1 - \sigma(X_w^T \theta_{j-1}^w), & d_j^w = 1 \end{cases}$

         Note: in the Huffman tree used by word2vec, code 0 denotes the positive class and code 1 the negative class.

         $X_w = \frac{\sum_{u \in Context(w)} v(u)}{|Context(w)|}$


    Written as a single expression: $p(d_j^w|X_w, \theta_{j-1}^w) = [\sigma(X_w^T \theta_{j-1}^w)]^{1 - d_j^w} \cdot [1 - \sigma(X_w^T \theta_{j-1}^w)]^{d_j^w}$. Substituting into the log-likelihood gives:

     $\mathcal{L} = \sum_{w \in C} \log \prod_{j=2}^{l^w} p(d_j^w|X_w, \theta_{j-1}^w)$

        $= \sum_{w \in C} \log \prod_{j=2}^{l^w} \left\{ [\sigma(X_w^T \theta_{j-1}^w)]^{1 - d_j^w} \cdot [1 - \sigma(X_w^T \theta_{j-1}^w)]^{d_j^w} \right\}$

        $= \sum_{w \in C} \sum_{j=2}^{l^w} \left\{ (1 - d_j^w) \cdot \log[\sigma(X_w^T \theta_{j-1}^w)] + d_j^w \cdot \log[1 - \sigma(X_w^T \theta_{j-1}^w)] \right\}$


    For convenience in differentiating, write: $\mathcal{L}(w, j) = (1 - d_j^w) \cdot \log[\sigma(X_w^T \theta_{j-1}^w)] + d_j^w \cdot \log[1 - \sigma(X_w^T \theta_{j-1}^w)]$

     Gradient of $\mathcal{L}(w, j)$ with respect to $\theta_{j-1}^w$:

         $\frac{\partial \mathcal{L}(w, j)}{\partial \theta_{j-1}^w} = \frac{\partial}{\partial \theta_{j-1}^w} \left\{ (1 - d_j^w) \cdot \log[\sigma(X_w^T \theta_{j-1}^w)] + d_j^w \cdot \log[1 - \sigma(X_w^T \theta_{j-1}^w)] \right\}$

                 $= (1 - d_j^w)[1 - \sigma(X_w^T \theta_{j-1}^w)]X_w - d_j^w [\sigma(X_w^T \theta_{j-1}^w)]X_w$

                 $= \left\{ (1 - d_j^w)[1 - \sigma(X_w^T \theta_{j-1}^w)] - d_j^w [\sigma(X_w^T \theta_{j-1}^w)] \right\} X_w$

                 $= [1 - d_j^w - \sigma(X_w^T \theta_{j-1}^w)] X_w$

    The update for $\theta_{j-1}^w$ can therefore be written as:

         $\theta_{j-1}^w := \theta_{j-1}^w + \eta [1 - d_j^w - \sigma(X_w^T \theta_{j-1}^w)] X_w$


    Since $\theta_{j-1}^w$ and $X_w$ appear symmetrically in $\mathcal{L}(w, j)$, the gradient of $\mathcal{L}(w, j)$ with respect to $X_w$ is:

         $\frac{\partial \mathcal{L}(w, j)}{\partial X_w} = [1 - d_j^w - \sigma(X_w^T \theta_{j-1}^w)] \theta_{j-1}^w$

     $\frac{\partial \mathcal{L}(w, j)}{\partial X_w}$ is then used to update the context word vectors $v(u),\ u \in Context(w)$:

         $v(u) := v(u) + \eta \sum_{j=2}^{l^w} \frac{\partial \mathcal{L}(w, j)}{\partial X_w}$


    Taking one training sample $(Context(w), w)$ as an example, the training pseudocode is as follows:

     $e = 0$

     $X_w = \frac{\sum_{u \in Context(w)} v(u)}{|Context(w)|}$

     FOR $j = 2 : l^w$ DO
     {
          $q = \sigma(X_w^T \theta_{j-1}^w)$

          $g = \eta [1 - d_j^w - q]$

          $e := e + g \theta_{j-1}^w$

          $\theta_{j-1}^w := \theta_{j-1}^w + g X_w$
     }

     FOR $u \in Context(w)$ DO
     {
          $v(u) := v(u) + e$
     }
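
    The same update, written as a minimal NumPy sketch (a simplification for illustration: `syn0` holds the word vectors $v(\cdot)$, `syn1` the node vectors $\theta$, and `path`/`code` are the Huffman path and code of $w$, as produced for instance by the tree-building sketch above; none of these names come from any particular implementation):

    import numpy as np

    def cbow_hs_step(syn0, syn1, context_ids, path, code, eta=0.025):
        """One CBOW + hierarchical-softmax update for a single sample (Context(w), w)."""
        x_w = syn0[context_ids].mean(axis=0)              # X_w: average of context vectors
        e = np.zeros_like(x_w)                            # accumulated error for the context words
        for node, d in zip(path, code):                   # walk the Huffman path (root carries no code)
            q = 1.0 / (1.0 + np.exp(-x_w @ syn1[node]))   # sigma(X_w^T theta)
            g = eta * (1.0 - d - q)                       # label error scaled by the learning rate
            e += g * syn1[node]                           # accumulate the update for X_w
            syn1[node] += g * x_w                         # update the node vector theta
        syn0[context_ids] += e                            # push the error back onto every context word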


    A few remarks on the meaning of the pseudocode above are worthwhile. One can of course understand it purely through the derivative derivation, but the derivation by itself does not convey the underlying intuition. Similar points come up again below and will not be repeated.

    (1) $\sigma(X_w^T \theta_{j-1}^w)$

        Given the context, this performs a classification prediction along the Huffman path of the current word: from the parent node's vector $\theta_{j-1}^w$ on the path, it predicts the classification label of the child node $p_j^w$. The label obtained here is of course a real number in [0, 1] rather than a hard {0, 1} decision, and the gap between this value and 0 or 1 is the prediction error. It is equally valid to read $\sigma(X_w^T \theta_{j-1}^w)$ as the probability that the child node $p_j^w$ belongs to the positive class.

    (2) $1 - d_j^w - q$

         $1 - d_j^w$ is the true classification label of the child node $p_j^w$, so $1 - d_j^w - q$ is the error between the true label and the predicted label.

    (3) $e := e + g \theta_{j-1}^w$

        This is a key point. Going back to the original objective, we want the maximum log-likelihood $\mathcal{L} = \sum_{w \in C} \log p(w|Context(w))$, i.e. a maximization problem, so gradient ascent is used (in machine learning one usually performs gradient descent), which is why $e$ is updated by addition (gradient descent would subtract).

        When the gradient is positive, $g \theta_{j-1}^w > 0$, so $e := e + g \theta_{j-1}^w$ grows; after $e$ is added to $v(u)$, $X_w$ grows, and with a larger $X_w$ the value $\sigma(X_w^T \theta_{j-1}^w)$ grows as well, i.e. the predicted label moves towards 1 (the positive class), or equivalently the predicted probability of the positive class increases. Why should $\sigma(X_w^T \theta_{j-1}^w)$ increase? Reason backwards: when the gradient is positive, $(1 - d_j^w - q) > 0$, which is only possible when $d_j^w = 0$, and $d_j^w = 0$ denotes the positive class with label 1, so the optimization should push $\sigma(X_w^T \theta_{j-1}^w)$ towards 1.

        Likewise, when the gradient is negative, $g \theta_{j-1}^w < 0$, so $e := e + g \theta_{j-1}^w$ shrinks; after $e$ is added to $v(u)$, $X_w$ shrinks, and with a smaller $X_w$ the value $\sigma(X_w^T \theta_{j-1}^w)$ shrinks as well, i.e. the predicted label moves towards 0 (the negative class), or equivalently the predicted probability of the positive class decreases. Again, reasoning backwards: when the gradient is negative, $(1 - d_j^w - q) < 0$, which is only possible when $d_j^w = 1$, and $d_j^w = 1$ denotes the negative class with label 0, so the optimization should push $\sigma(X_w^T \theta_{j-1}^w)$ towards 0.

    (4) $\theta_{j-1}^w := \theta_{j-1}^w + g X_w$

        Same reasoning as above.

    (5) FOR $j = 2 : l^w$ DO

        This loop reflects the fact that the word being predicted from the context sits at a leaf node, which is reached by walking the word's Huffman path, so the classification error of every node on the path (except the root) must be accumulated.

    Summary: given the context words, walk the Huffman path of the current word, accumulate the binary-classification error at every node (except the root), and propagate this error back onto each context word (the auxiliary vectors of the nodes on the path are updated at the same time).


(2) Negative Sampling

    For $w$, draw a set of negative samples $NEG(w)$; for each sample define the label:

         $L^w(\tilde{w}) = \begin{cases} 1, & w = \tilde{w} \\ 0, & w \neq \tilde{w} \end{cases}$
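
    As a small illustration (the sampling distribution here is a plain uniform draw purely for the sketch; the actual distribution used by a given implementation may differ), negative samples and their labels can be produced as follows:

    import random

    def draw_samples(w, vocab, k=5):
        """Return [(sample, label L^w(sample))]: w itself with label 1, plus k negatives with label 0."""
        negatives = []
        while len(negatives) < k:
            cand = random.choice(vocab)       # assumed sampler: uniform over the vocabulary
            if cand != w:
                negatives.append(cand)
        return [(w, 1)] + [(u, 0) for u in negatives]

    vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
    print(draw_samples("cat", vocab, k=3))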

    Likelihood to maximize: $\prod_{w \in C} p(w|Context(w))$

    Log-likelihood: $\mathcal{L} = \sum_{w \in C} \log p(w|Context(w))$

    Conditional probability: $p(w|Context(w)) = \prod_{u \in \{w\} \cup NEG(w)} p(u|Context(w))$, where:

         $p(u|Context(w)) = \begin{cases} \sigma(X_w^T \theta^u), & L^w(u) = 1 \\ 1 - \sigma(X_w^T \theta^u), & L^w(u) = 0 \end{cases}$

         $X_w = \frac{\sum_{u \in Context(w)} v(u)}{|Context(w)|}$


    Written as a single expression: $p(u|Context(w)) = [\sigma(X_w^T \theta^u)]^{L^w(u)} \cdot [1 - \sigma(X_w^T \theta^u)]^{1 - L^w(u)}$. Substituting into the log-likelihood gives:

     $\mathcal{L} = \sum_{w \in C} \log \prod_{u \in \{w\} \cup NEG(w)} p(u|Context(w))$

        $= \sum_{w \in C} \log \prod_{u \in \{w\} \cup NEG(w)} [\sigma(X_w^T \theta^u)]^{L^w(u)} \cdot [1 - \sigma(X_w^T \theta^u)]^{1 - L^w(u)}$

        $= \sum_{w \in C} \sum_{u \in \{w\} \cup NEG(w)} \left\{ L^w(u) \cdot \log[\sigma(X_w^T \theta^u)] + [1 - L^w(u)] \cdot \log[1 - \sigma(X_w^T \theta^u)] \right\}$


    For convenience in differentiating, write: $\mathcal{L}(w, u) = L^w(u) \cdot \log[\sigma(X_w^T \theta^u)] + [1 - L^w(u)] \cdot \log[1 - \sigma(X_w^T \theta^u)]$

     Gradient of $\mathcal{L}(w, u)$ with respect to $\theta^u$:

         $\frac{\partial \mathcal{L}(w, u)}{\partial \theta^u} = \frac{\partial}{\partial \theta^u} \left\{ L^w(u) \cdot \log[\sigma(X_w^T \theta^u)] + [1 - L^w(u)] \cdot \log[1 - \sigma(X_w^T \theta^u)] \right\}$

                 $= L^w(u) \cdot [1 - \sigma(X_w^T \theta^u)]X_w - [1 - L^w(u)] \cdot [\sigma(X_w^T \theta^u)]X_w$

                 $= \left\{ L^w(u) \cdot [1 - \sigma(X_w^T \theta^u)] - [1 - L^w(u)] \cdot [\sigma(X_w^T \theta^u)] \right\} X_w$

                 $= [L^w(u) - \sigma(X_w^T \theta^u)] X_w$

    The update for $\theta^u$ can therefore be written as:

         $\theta^u := \theta^u + \eta [L^w(u) - \sigma(X_w^T \theta^u)] X_w$


    Since $\theta^u$ and $X_w$ appear symmetrically in $\mathcal{L}(w, u)$, the gradient of $\mathcal{L}(w, u)$ with respect to $X_w$ is:

         $\frac{\partial \mathcal{L}(w, u)}{\partial X_w} = [L^w(u) - \sigma(X_w^T \theta^u)] \theta^u$

     $\frac{\partial \mathcal{L}(w, u)}{\partial X_w}$ is then used to update the context word vectors $v(u),\ u \in Context(w)$:

         $v(u) := v(u) + \eta \sum_{u \in \{w\} \cup NEG(w)} \frac{\partial \mathcal{L}(w, u)}{\partial X_w}$


    Taking one training sample $(Context(w), w)$ as an example, the training pseudocode is as follows:

     $e = 0$

     $X_w = \frac{\sum_{u \in Context(w)} v(u)}{|Context(w)|}$

     FOR $u \in \{w\} \cup NEG(w)$ DO
     {
          $q = \sigma(X_w^T \theta^u)$

          $g = \eta [L^w(u) - q]$

          $e := e + g \theta^u$

          $\theta^u := \theta^u + g X_w$
     }

     FOR $u \in Context(w)$ DO
     {
          $v(u) := v(u) + e$
     }
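
    The corresponding NumPy sketch (illustrative names only: `syn0` holds the word vectors, `syn1neg` the per-word auxiliary vectors $\theta^u$, `neg_ids` the drawn negative samples):

    import numpy as np

    def cbow_ns_step(syn0, syn1neg, context_ids, w_id, neg_ids, eta=0.025):
        """One CBOW + negative-sampling update for a single sample (Context(w), w)."""
        x_w = syn0[context_ids].mean(axis=0)                  # X_w: average of context vectors
        e = np.zeros_like(x_w)
        # first the positive sample w (label 1), then the negatives (label 0)
        for u_id, label in [(w_id, 1.0)] + [(n, 0.0) for n in neg_ids]:
            q = 1.0 / (1.0 + np.exp(-x_w @ syn1neg[u_id]))    # sigma(X_w^T theta^u)
            g = eta * (label - q)                             # L^w(u) - q, scaled by the learning rate
            e += g * syn1neg[u_id]
            syn1neg[u_id] += g * x_w                          # update the sample's auxiliary vector
        syn0[context_ids] += e                                # propagate the error to every context word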


    Summary: given the context words, draw negative samples for the current word (the current word itself is included as the positive sample), iterate over the samples, accumulate the prediction error of the context against each sample, and propagate this error back onto each context word (the sample vectors are updated at the same time).


3. Skip-gram

    Skip-gram predicts the context words from the current word and propagates the prediction error back onto the current word, so that later predictions become more accurate. However, word2vec does not train it this way; instead, it still follows the CBOW idea and uses each individual context word (note the difference: CBOW merges the context, i.e. $\sum_{u \in Context(w)} v(u)$) to predict the current word, then propagates the prediction error back onto that context word.

(1) Hierarchical Softmax

    Likelihood to maximize: $\prod_{w \in C} p(Context(w)|w)$

    Log-likelihood: $\mathcal{L} = \sum_{w \in C} \log p(Context(w)|w)$

    Conditional probability: $p(Context(w)|w) = \prod_{u \in Context(w)} p(u|w)$, where:

         $p(u|w) = \prod_{j=2}^{l^u} p(d_j^u|v(w), \theta_{j-1}^u)$

         $p(d_j^u|v(w), \theta_{j-1}^u) = \begin{cases} \sigma(v(w)^T \theta_{j-1}^u), & d_j^u = 0 \\ 1 - \sigma(v(w)^T \theta_{j-1}^u), & d_j^u = 1 \end{cases}$


    Written as a single expression: $p(d_j^u|v(w), \theta_{j-1}^u) = [\sigma(v(w)^T \theta_{j-1}^u)]^{1 - d_j^u} \cdot [1 - \sigma(v(w)^T \theta_{j-1}^u)]^{d_j^u}$. Substituting into the log-likelihood gives:

     $\mathcal{L} = \sum_{w \in C} \log \prod_{u \in Context(w)} \prod_{j=2}^{l^u} p(d_j^u|v(w), \theta_{j-1}^u)$

        $= \sum_{w \in C} \log \prod_{u \in Context(w)} \prod_{j=2}^{l^u} [\sigma(v(w)^T \theta_{j-1}^u)]^{1 - d_j^u} \cdot [1 - \sigma(v(w)^T \theta_{j-1}^u)]^{d_j^u}$

        $= \sum_{w \in C} \sum_{u \in Context(w)} \sum_{j=2}^{l^u} \left\{ (1 - d_j^u) \cdot \log[\sigma(v(w)^T \theta_{j-1}^u)] + d_j^u \cdot \log[1 - \sigma(v(w)^T \theta_{j-1}^u)] \right\}$

    For convenience in differentiating, write: $\mathcal{L}(w, u, j) = (1 - d_j^u) \cdot \log[\sigma(v(w)^T \theta_{j-1}^u)] + d_j^u \cdot \log[1 - \sigma(v(w)^T \theta_{j-1}^u)]$

     Gradient of $\mathcal{L}(w, u, j)$ with respect to $\theta_{j-1}^u$:

         $\frac{\partial \mathcal{L}(w, u, j)}{\partial \theta_{j-1}^u} = \frac{\partial}{\partial \theta_{j-1}^u} \left\{ (1 - d_j^u) \cdot \log[\sigma(v(w)^T \theta_{j-1}^u)] + d_j^u \cdot \log[1 - \sigma(v(w)^T \theta_{j-1}^u)] \right\}$

                 $= (1 - d_j^u)[1 - \sigma(v(w)^T \theta_{j-1}^u)]v(w) - d_j^u [\sigma(v(w)^T \theta_{j-1}^u)]v(w)$

                 $= \left\{ (1 - d_j^u)[1 - \sigma(v(w)^T \theta_{j-1}^u)] - d_j^u [\sigma(v(w)^T \theta_{j-1}^u)] \right\} v(w)$

                 $= [1 - d_j^u - \sigma(v(w)^T \theta_{j-1}^u)] v(w)$

    The update for $\theta_{j-1}^u$ can therefore be written as:

         $\theta_{j-1}^u := \theta_{j-1}^u + \eta [1 - d_j^u - \sigma(v(w)^T \theta_{j-1}^u)] v(w)$

    Since $\theta_{j-1}^u$ and $v(w)$ appear symmetrically in $\mathcal{L}(w, u, j)$, the gradient of $\mathcal{L}(w, u, j)$ with respect to $v(w)$ is:

         $\frac{\partial \mathcal{L}(w, u, j)}{\partial v(w)} = [1 - d_j^u - \sigma(v(w)^T \theta_{j-1}^u)] \theta_{j-1}^u$

     $\frac{\partial \mathcal{L}(w, u, j)}{\partial v(w)}$ is then used to update the current word's vector $v(w)$:

         $v(w) := v(w) + \eta \sum_{u \in Context(w)} \sum_{j=2}^{l^u} \frac{\partial \mathcal{L}(w, u, j)}{\partial v(w)}$


    Taking one training sample $(w, Context(w))$ as an example, the training pseudocode is as follows:

     $e = 0$

     FOR $u \in Context(w)$ DO
     {
          FOR $j = 2 : l^u$ DO
          {
               $q = \sigma(v(w)^T \theta_{j-1}^u)$

               $g = \eta [1 - d_j^u - q]$

               $e := e + g \theta_{j-1}^u$

               $\theta_{j-1}^u := \theta_{j-1}^u + g v(w)$
          }
     }

     $v(w) := v(w) + e$
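
    In code form, this version of the update accumulates the error over every context word's Huffman path and applies it once to $v(w)$ (a sketch for illustration; `paths`/`codes` map each context word to its Huffman path and code as in the earlier sketches, and the array names are assumptions):

    import numpy as np

    def skipgram_hs_step_theoretical(syn0, syn1, w_id, context_ids, paths, codes, eta=0.025):
        """Skip-gram + hierarchical softmax as derived above: predict each context word from v(w)."""
        v_w = syn0[w_id]
        e = np.zeros_like(v_w)
        for u in context_ids:                              # every context word u
            for node, d in zip(paths[u], codes[u]):        # walk u's Huffman path
                q = 1.0 / (1.0 + np.exp(-v_w @ syn1[node]))
                g = eta * (1.0 - d - q)
                e += g * syn1[node]
                syn1[node] += g * v_w
        syn0[w_id] += e                                    # single update to the current word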


    It is worth noting that word2vec does not actually train according to the flow above. Instead, it still follows the CBOW idea: each individual context word is used to predict the current word. The analysis is as follows:

    Likelihood to maximize: $\prod_{w \in C} \prod_{u \in Context(w)} p(w|u)$

    Log-likelihood: $\mathcal{L} = \sum_{w \in C} \sum_{u \in Context(w)} \log p(w|u)$

    Conditional probability: $p(w|u) = \prod_{j=2}^{l^w} p(d_j^w|v(u), \theta_{j-1}^w)$, where:

         $p(d_j^w|v(u), \theta_{j-1}^w) = \begin{cases} \sigma(v(u)^T \theta_{j-1}^w), & d_j^w = 0 \\ 1 - \sigma(v(u)^T \theta_{j-1}^w), & d_j^w = 1 \end{cases}$


    Written as a single expression: $p(d_j^w|v(u), \theta_{j-1}^w) = [\sigma(v(u)^T \theta_{j-1}^w)]^{1 - d_j^w} \cdot [1 - \sigma(v(u)^T \theta_{j-1}^w)]^{d_j^w}$. Substituting into the log-likelihood gives:

     $\mathcal{L} = \sum_{w \in C} \sum_{u \in Context(w)} \log \prod_{j=2}^{l^w} p(d_j^w|v(u), \theta_{j-1}^w)$

        $= \sum_{w \in C} \sum_{u \in Context(w)} \log \prod_{j=2}^{l^w} [\sigma(v(u)^T \theta_{j-1}^w)]^{1 - d_j^w} \cdot [1 - \sigma(v(u)^T \theta_{j-1}^w)]^{d_j^w}$

        $= \sum_{w \in C} \sum_{u \in Context(w)} \sum_{j=2}^{l^w} \left\{ (1 - d_j^w) \cdot \log[\sigma(v(u)^T \theta_{j-1}^w)] + d_j^w \cdot \log[1 - \sigma(v(u)^T \theta_{j-1}^w)] \right\}$


    For convenience in differentiating, write: $\mathcal{L}(w, u, j) = (1 - d_j^w) \cdot \log[\sigma(v(u)^T \theta_{j-1}^w)] + d_j^w \cdot \log[1 - \sigma(v(u)^T \theta_{j-1}^w)]$

     Gradient of $\mathcal{L}(w, u, j)$ with respect to $\theta_{j-1}^w$:

         $\frac{\partial \mathcal{L}(w, u, j)}{\partial \theta_{j-1}^w} = \frac{\partial}{\partial \theta_{j-1}^w} \left\{ (1 - d_j^w) \cdot \log[\sigma(v(u)^T \theta_{j-1}^w)] + d_j^w \cdot \log[1 - \sigma(v(u)^T \theta_{j-1}^w)] \right\}$

                 $= (1 - d_j^w)[1 - \sigma(v(u)^T \theta_{j-1}^w)]v(u) - d_j^w [\sigma(v(u)^T \theta_{j-1}^w)]v(u)$

                 $= \left\{ (1 - d_j^w)[1 - \sigma(v(u)^T \theta_{j-1}^w)] - d_j^w [\sigma(v(u)^T \theta_{j-1}^w)] \right\} v(u)$

                 $= [1 - d_j^w - \sigma(v(u)^T \theta_{j-1}^w)] v(u)$

    The update for $\theta_{j-1}^w$ can therefore be written as:

         $\theta_{j-1}^w := \theta_{j-1}^w + \eta [1 - d_j^w - \sigma(v(u)^T \theta_{j-1}^w)] v(u)$


    Since $\theta_{j-1}^w$ and $v(u)$ appear symmetrically in $\mathcal{L}(w, u, j)$, the gradient of $\mathcal{L}(w, u, j)$ with respect to $v(u)$ is:

         $\frac{\partial \mathcal{L}(w, u, j)}{\partial v(u)} = [1 - d_j^w - \sigma(v(u)^T \theta_{j-1}^w)] \theta_{j-1}^w$

     $\frac{\partial \mathcal{L}(w, u, j)}{\partial v(u)}$ is then used to update the context word vector $v(u),\ u \in Context(w)$:

         $v(u) := v(u) + \eta \sum_{j=2}^{l^w} \frac{\partial \mathcal{L}(w, u, j)}{\partial v(u)}$


    Taking one training sample $(w, Context(w))$ as an example, the training pseudocode is as follows:

     FOR $u \in Context(w)$ DO
     {
          $e = 0$

          FOR $j = 2 : l^w$ DO
          {
               $q = \sigma(v(u)^T \theta_{j-1}^w)$

               $g = \eta [1 - d_j^w - q]$

               $e := e + g \theta_{j-1}^w$

               $\theta_{j-1}^w := \theta_{j-1}^w + g v(u)$
          }

          $v(u) := v(u) + e$
     }
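
    The same word2vec-style loop as a minimal NumPy sketch (again illustrative: `path` and `code` are the Huffman path and code of the current word $w$, and the array names are assumptions):

    import numpy as np

    def skipgram_hs_step(syn0, syn1, w_id, context_ids, path, code, eta=0.025):
        """Skip-gram + hierarchical softmax as word2vec trains it: each context word predicts w."""
        for u in context_ids:                              # one update per context word
            v_u = syn0[u]
            e = np.zeros_like(v_u)
            for node, d in zip(path, code):                # walk w's Huffman path
                q = 1.0 / (1.0 + np.exp(-v_u @ syn1[node]))
                g = eta * (1.0 - d - q)
                e += g * syn1[node]
                syn1[node] += g * v_u
            syn0[u] += e                                   # error goes back onto this context word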


    Summary: for each context word (that context word is used to predict the current word), walk the Huffman path of the current word, accumulate the binary-classification error at every node (except the root), and propagate this error back onto that context word (the auxiliary vectors of the nodes on the path are updated at the same time).


(2) Negative Sampling

    For $w$, draw a set of negative samples $NEG(w)$; for each sample define the label:

         $L^w(\tilde{w}) = \begin{cases} 1, & w = \tilde{w} \\ 0, & w \neq \tilde{w} \end{cases}$

    Likelihood to maximize: $\prod_{w \in C} p(Context(w)|w)$

    Log-likelihood: $\mathcal{L} = \sum_{w \in C} \log p(Context(w)|w)$

    Conditional probability: $p(Context(w)|w) = \prod_{u \in Context(w)} p(u|w)$, where:

         $p(u|w) = \prod_{z \in \{u\} \cup NEG(u)} p(z|w)$

         $p(z|w) = \begin{cases} \sigma(v(w)^T \theta^z), & L^u(z) = 1 \\ 1 - \sigma(v(w)^T \theta^z), & L^u(z) = 0 \end{cases}$


    Written as a single expression: $p(z|w) = [\sigma(v(w)^T \theta^z)]^{L^u(z)} \cdot [1 - \sigma(v(w)^T \theta^z)]^{1 - L^u(z)}$. Substituting into the log-likelihood gives:

     $\mathcal{L} = \sum_{w \in C} \log \prod_{u \in Context(w)} \prod_{z \in \{u\} \cup NEG(u)} p(z|w)$

        $= \sum_{w \in C} \log \prod_{u \in Context(w)} \prod_{z \in \{u\} \cup NEG(u)} [\sigma(v(w)^T \theta^z)]^{L^u(z)} \cdot [1 - \sigma(v(w)^T \theta^z)]^{1 - L^u(z)}$

        $= \sum_{w \in C} \sum_{u \in Context(w)} \sum_{z \in \{u\} \cup NEG(u)} \left\{ L^u(z) \cdot \log[\sigma(v(w)^T \theta^z)] + [1 - L^u(z)] \cdot \log[1 - \sigma(v(w)^T \theta^z)] \right\}$


    For convenience in differentiating, write: $\mathcal{L}(w, u, z) = L^u(z) \cdot \log[\sigma(v(w)^T \theta^z)] + [1 - L^u(z)] \cdot \log[1 - \sigma(v(w)^T \theta^z)]$

     Gradient of $\mathcal{L}(w, u, z)$ with respect to $\theta^z$:

         $\frac{\partial \mathcal{L}(w, u, z)}{\partial \theta^z} = \frac{\partial}{\partial \theta^z} \left\{ L^u(z) \cdot \log[\sigma(v(w)^T \theta^z)] + [1 - L^u(z)] \cdot \log[1 - \sigma(v(w)^T \theta^z)] \right\}$

                 $= L^u(z)[1 - \sigma(v(w)^T \theta^z)]v(w) - [1 - L^u(z)][\sigma(v(w)^T \theta^z)]v(w) = [L^u(z) - \sigma(v(w)^T \theta^z)]v(w)$
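
    The derivation continues exactly as in the CBOW negative-sampling case: $\theta^z$ is updated by $\eta[L^u(z) - \sigma(v(w)^T \theta^z)]v(w)$, and by the same symmetry argument $v(w)$ receives the accumulated $\eta[L^u(z) - \sigma(v(w)^T \theta^z)]\theta^z$. A minimal NumPy sketch of one such training step (illustrative names and a simplification; negatives are assumed to be drawn per context word, as in the formulas above):

    import numpy as np

    def skipgram_ns_step(syn0, syn1neg, w_id, context_ids, neg_sampler, eta=0.025):
        """Skip-gram + negative sampling per the formulas above: v(w) scored against {u} ∪ NEG(u)."""
        v_w = syn0[w_id]
        e = np.zeros_like(v_w)
        for u in context_ids:
            # positive sample u (label 1) plus negatives drawn for u (label 0)
            for z, label in [(u, 1.0)] + [(n, 0.0) for n in neg_sampler(u)]:
                q = 1.0 / (1.0 + np.exp(-v_w @ syn1neg[z]))   # sigma(v(w)^T theta^z)
                g = eta * (label - q)                         # L^u(z) - q, scaled by the learning rate
                e += g * syn1neg[z]
                syn1neg[z] += g * v_w                         # update the sample's auxiliary vector
        syn0[w_id] += e                                       # accumulated error goes back onto v(w)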
