TensorFlow seq2seq attention_decoder explained

Date: 2019-12-05


TensorFlow's implementation is based on the paper Grammar as a Foreign Language, which states the formulas fairly clearly.

This post focuses on the seq2seq.attention_decoder function.

Its main inputs are:

decoder_inputs,

initial_state,

attention_states

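A good reference for how these are used is models/textsum, which uses a multi-layer bidirectional LSTM.

Assuming a single layer, textsum takes the final state of the forward direction and passes it to attention_decoder as initial_state

(although many papers suggest that the final state of the backward direction may work better).

decoder_inputs is the sequence of embeddings looked up from the token ids of the reference summary.

attention_states is the full set of encoder outputs, with the forward and backward outputs concatenated (note that "output" and "hidden" refer to the same thing here).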
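The shapes of the three inputs can be sketched with NumPy stand-ins. This is only an illustration under assumed sizes: the names fw_outputs, bw_outputs, embedding, and summary_ids are hypothetical, and the final forward output is used as a simple stand-in for the final forward state.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
batch_size, enc_len, dec_len = 2, 5, 3
hidden_size, emb_size, vocab_size = 4, 6, 10

rng = np.random.default_rng(0)

# Stand-ins for the per-step outputs of a forward and a backward encoder.
fw_outputs = rng.normal(size=(batch_size, enc_len, hidden_size))
bw_outputs = rng.normal(size=(batch_size, enc_len, hidden_size))

# attention_states: forward and backward outputs concatenated on the
# feature axis, giving [batch_size, attn_length, attn_size].
attention_states = np.concatenate([fw_outputs, bw_outputs], axis=2)

# initial_state: final forward state (approximated by the last forward output).
initial_state = fw_outputs[:, -1, :]

# decoder_inputs: one [batch_size, emb_size] embedding per decoder step,
# looked up from the ids of the reference summary.
embedding = rng.normal(size=(vocab_size, emb_size))
summary_ids = rng.integers(0, vocab_size, size=(batch_size, dec_len))
decoder_inputs = [embedding[summary_ids[:, t]] for t in range(dec_len)]

print(attention_states.shape, initial_state.shape, decoder_inputs[0].shape)
```

Here attn_length equals the encoder length and attn_size is twice the per-direction hidden size, because of the concatenation.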

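About linear

First, note that attention_decoder uses a function called linear, which is defined as rnn_cell._linear.

Its input is a list. A possible input looks like

[[batch_size, length1], [batch_size, length2]]

that is, a list of two arrays.

Internally it defines a weight matrix, which for this example has shape [length1 + length2, output_size].

In other words, it maps the inputs [batch_size, length1] and [batch_size, length2] to an output of shape [batch_size, output_size].

This comes up again at the end of the attention mechanism.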
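The behavior described above can be sketched in NumPy as follows. This is not the TensorFlow code itself: the function below is a hypothetical re-implementation that draws random weights on each call, whereas the real function creates trainable variables.

```python
import numpy as np

def linear(args, output_size, bias=True, rng=None):
    """Concatenate 2-D inputs along axis 1 and project them to output_size.

    Illustrative stand-in: weights are random here, not trained variables.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    total_length = sum(a.shape[1] for a in args)
    # One weight matrix of shape [length1 + length2 + ..., output_size].
    weights = rng.normal(size=(total_length, output_size))
    out = np.concatenate(args, axis=1) @ weights
    if bias:
        out = out + np.zeros(output_size)  # bias term, zero-initialized
    return out

batch_size, length1, length2, output_size = 2, 3, 5, 4
a = np.ones((batch_size, length1))
b = np.ones((batch_size, length2))
y = linear([a, b], output_size)
print(y.shape)
```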

First, the attention formulas.

The encoder hidden states are denoted

$(h_1, \ldots, h_{T_A})$

and the decoder hidden states are denoted

$(d_1, \ldots, d_{T_B}) := (h_{T_A+1}, \ldots, h_{T_A+T_B})$.
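For reference, the three attention formulas of Grammar as a Foreign Language, which the text calls the first, second, and third formula, can be written as below. The notation follows the definitions above; $W_1$, $W_2$, and $v$ are the learned parameters.

```latex
\begin{align}
u_i^t &= v^\top \tanh\left(W_1 h_i + W_2 d_t\right) \\
a_i^t &= \operatorname{softmax}\left(u_i^t\right)
       = \frac{\exp(u_i^t)}{\sum_{j=1}^{T_A} \exp(u_j^t)} \\
d'_t  &= \sum_{i=1}^{T_A} a_i^t \, h_i
\end{align}
```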

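The quantity computed at the end, $d'_t$, is the attention result. For a single sample it is a vector of length attn_size (the linear combination of all attention input vectors given by the third formula), so for a batch of batch_size inputs the result has shape [batch_size, attn_size].

The paper notes that this attention result is used later on: it is concatenated with the original hidden state. How is that done in practice? TensorFlow does the following: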

x = linear([inp] + attns, input_size, True)

# Run the RNN.

cell_output, state = cell(x, state)

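That is, inp is [batch_size, input_size], each entry of attns is [batch_size, attn_size], and the input_size passed to linear is its output size.

Inside linear, inp and attns are concatenated and projected to [batch_size, input_size], so that x can be fed to the cell as input and the RNN can continue.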
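One decoder step of this kind can be sketched in NumPy. The sizes, the random weights, and the toy tanh cell are all assumptions made for illustration. They stand in for linear and for the real RNN cell, which would normally be an LSTM or GRU.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, input_size, attn_size, num_units = 2, 3, 4, 5

inp = rng.normal(size=(batch_size, input_size))
attns = [rng.normal(size=(batch_size, attn_size))]  # one attention head
state = np.zeros((batch_size, num_units))

# Stand-in for linear([inp] + attns, input_size, True):
# concatenate along axis 1, then project back to input_size.
W_in = rng.normal(size=(input_size + attn_size, input_size))
b_in = np.zeros(input_size)
x = np.concatenate([inp] + attns, axis=1) @ W_in + b_in

# Toy tanh RNN cell standing in for cell(x, state).
W_cell = rng.normal(size=(input_size + num_units, num_units))

def cell(x, state):
    new_state = np.tanh(np.concatenate([x, state], axis=1) @ W_cell)
    return new_state, new_state  # (output, state) are identical in this toy cell

cell_output, state = cell(x, state)
print(x.shape, cell_output.shape)
```

Because x has the same shape as inp, the cell does not need to know that attention was mixed into its input.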

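The attention formulas
Continuing with the attention formulas, ignore batch_size and consider a single sample.
The first formula involves three parameter arrays. W1 and W2 are both square [attn_size, attn_size] matrices, and h and d are [attn_size, 1] vectors.
v is an [attn_size, 1] matrix.
So the two linear terms are added and passed through the tanh nonlinearity, tanh([attn_size, 1]) -> [attn_size, 1], and the dot product with v then yields a single scalar $u_i^t$,
which is the weight the i-th attention vector should receive at decoding step t.
The second formula normalizes these scores with softmax to obtain a vector of weights that sum to 1.
The third formula has already been analyzed above.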
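For a single sample, the three formulas can be sketched in NumPy as follows. The function name attend and the random parameters are assumptions for illustration. All encoder vectors are stacked as the rows of h, so each score $u_i^t$ is computed in one vectorized expression.

```python
import numpy as np

def attend(h, d_t, W1, W2, v):
    """Single-sample attention.

    h:   [attn_length, attn_size] encoder hidden states, one per row
    d_t: [attn_size] decoder hidden state at step t
    """
    # First formula: u_i = v^T tanh(W1 h_i + W2 d_t) for every i.
    u = np.tanh(h @ W1.T + W2 @ d_t) @ v   # [attn_length]
    # Second formula: softmax over the encoder positions.
    a = np.exp(u - u.max())
    a = a / a.sum()
    # Third formula: weighted sum of the encoder states.
    d_prime = a @ h                        # [attn_size]
    return a, d_prime

rng = np.random.default_rng(0)
attn_length, attn_size = 5, 4
h = rng.normal(size=(attn_length, attn_size))
d_t = rng.normal(size=attn_size)
W1 = rng.normal(size=(attn_size, attn_size))
W2 = rng.normal(size=(attn_size, attn_size))
v = rng.normal(size=attn_size)

a, d_prime = attend(h, d_t, W1, W2, v)
print(a.shape, d_prime.shape)
```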
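Attention implementation in TensorFlow
- Step 1
The first issue is that we operate on batches, so what we process is not a single sample but a batch of batch_size samples.
The operation above therefore cannot be done with tf.matmul as is, because a [batch_size, x, y] tensor cannot be multiplied by a [y, 1] matrix this way.
TensorFlow's approach is to use a 1-by-1 convolution, relying on the combination of a 1-by-1 kernel, num_channels, and num_filters.
How conv2d works, in particular together with 1-by-1 kernels, num_channels, and num_filters, is explained very clearly here:
http://stackoverflow.com/questions/34619177/what-does-tf-nn-conv2d-do-in-tensorflow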

# To calculate W1 * h_t we use a 1-by-1 convolution, need to reshape before.
hidden = array_ops.reshape(
    attention_states, [-1, attn_length, 1, attn_size])
hidden_features = []
v = []
attention_vec_size = attn_size  # Size of query vectors for attention.
for a in xrange(num_heads):
  k = variable_scope.get_variable("AttnW_%d" % a,
                                  [1, 1, attn_size, attention_vec_size])
  hidden_features.append(nn_ops.conv2d(hidden, k, [1, 1, 1, 1], "SAME"))
  v.append(
      variable_scope.get_variable("AttnV_%d" % a, [attention_vec_size]))
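To see why a 1-by-1 convolution computes W1 * h for every position of every sample, here is a NumPy sketch under assumed sizes. The naive conv2d_1x1 helper is hypothetical and only mimics a stride-1, NHWC 1-by-1 convolution; it is compared against a plain per-position matrix product.

```python
import numpy as np

def conv2d_1x1(x, k):
    """Naive 1x1 convolution, stride 1. x: [B, H, W, C], k: [1, 1, C, F]."""
    B, H, W, C = x.shape
    F = k.shape[3]
    out = np.zeros((B, H, W, F))
    for b in range(B):
        for i in range(H):
            for j in range(W):
                # The 1x1 window covers one position, so the convolution
                # reduces to a vector-matrix product over the channels.
                out[b, i, j] = x[b, i, j] @ k[0, 0]
    return out

rng = np.random.default_rng(0)
batch_size, attn_length, attn_size = 2, 5, 4
attention_vec_size = attn_size

attention_states = rng.normal(size=(batch_size, attn_length, attn_size))
k = rng.normal(size=(1, 1, attn_size, attention_vec_size))

# Same reshape as in the excerpt: attn_size plays the role of num_channels.
hidden = attention_states.reshape(-1, attn_length, 1, attn_size)
hidden_features = conv2d_1x1(hidden, k)

# Per-position product with the [attn_size, attention_vec_size] matrix k[0, 0].
per_position = attention_states @ k[0, 0]
print(np.allclose(hidden_features[:, :, 0, :], per_position))
```

In this view attn_size is the number of input channels and attention_vec_size is the number of filters. The kernel slice k[0, 0] plays the role of W1 transposed, since each row vector h is multiplied from the left.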