Attention Explained, with a Source Code Walkthrough of TensorFlow's AttentionWrapper

Date: 2021-01-24

This section explains in detail a very useful mechanism in Seq2Seq models, Attention, and then walks through how it is implemented in code, using TensorFlow's AttentionWrapper.

Seq2Seq

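Let's start with a quick introduction to Seq2Seq. If you have worked with deep learning you have almost certainly heard of it. Seq2Seq stands for Sequence to Sequence, is often abbreviated S2S, and is also known as the Encoder-Decoder model, because at its core it consists of an encoder and a decoder. The architecture first took shape in 2014 in the paper Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (Cho et al.). Sequence to Sequence Learning with Neural Networks (Sutskever et al.) then set out the Sequence to Sequence architecture more formally, and Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau et al.) added the Attention mechanism. With Attention, Seq2Seq models went on to dominate a wide range of tasks, and today they are widely used for machine translation, dialogue generation, text summarization and more, with very good results.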

The figure below shows the basic architecture of a Seq2Seq model:

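In the figure there is an intermediate state vector c. Everything to the left of c is the encoder. The encoder is drawn here as a plain RNN sequence, but the RNN units can also be variants such as LSTM or GRU. The inputs $x_1, x_2, x_3, x_4$ fed in below the encoder are the model's input; in a translation model, for example, they could stand for the four characters of 「我愛中國」. After processing the sequence the encoder produces its final output, which we denote as the vector c, and the encoder's job is done. Everything to the right of c is the decoder: it takes the c vector produced by the encoder and decodes it step by step to produce the output, so the input 「我愛中國」 is decoded into "I love China", which completes the translation task. That is the basic principle of the Seq2Seq model.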
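To make the encoder half concrete, here is a minimal sketch in plain Python of a vanilla RNN folding an input sequence into a single vector c. The function name `encode`, the weight matrices and the toy inputs are all assumptions made up for illustration; a real model would learn its weights and would usually use LSTM or GRU cells.

```python
import math

def encode(inputs, w_x, w_h):
    """Run a vanilla RNN over `inputs` and return the final hidden state c."""
    hidden = [0.0] * len(w_h)
    for x in inputs:
        # New hidden state from the current input and the previous hidden state.
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_x[k], x))
                + sum(w * hi for w, hi in zip(w_h[k], hidden))
            )
            for k in range(len(hidden))
        ]
    return hidden

# Hypothetical weights: 2 hidden units, 3-dimensional inputs.
w_x = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
w_h = [[0.1, 0.0], [0.0, 0.1]]

short_input = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
long_input = short_input * 10

c_short = encode(short_input, w_x, w_h)
c_long = encode(long_input, w_x, w_h)
print(len(c_short), len(c_long))  # same fixed size for both inputs
```

However long the input is, c keeps the same fixed size, because it is just the last hidden state.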

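There is also a variant in which the c vector is fed to the decoder as an input at every decoding step. The principle is much the same, as shown in the figure: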

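This architecture is general, so it applies to a very wide range of scenarios: machine translation, dialogue generation, text summarization, reading comprehension and speech recognition, as well as more playful uses such as generating poems, couplets, code or reviews. It works well on all of them.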

Attention

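As the figure above shows, the Encoder encodes the entire input sequence into a single c vector, and decoding then relies on that c vector alone. The c vector therefore has to contain all the information in the original sequence, which puts a lot of pressure on it. On top of that, an RNN tends to "forget" earlier information. So a basic Seq2Seq model gives acceptable results on shorter inputs, but when the input sequence is long the c vector cannot hold that much information, and generation quality drops sharply.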

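The Attention mechanism solves this problem: with it, accuracy no longer drops noticeably on long inputs. How? If one c vector cannot hold everything, introduce several, $c_1, c_2, \ldots, c_i$, where i is the decoder's decoding position, and decode each step with its corresponding $c_i$ vector, as shown in the figure: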

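Each $c_i$ vector carries information about how important each part of the input sequence is to the output currently being produced. Different $c_i$ vectors weight the parts of the input differently. Here is a schematic first: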

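Take the same example again: the input is 「我愛中國」 and the ideal output is "I love China". Decoding should first produce the word "I". This step uses the vector $c_1$, in which the character 「我」 carries the most weight, so the decoder tends to output "I". Decoding the second word uses $c_2$, in which 「愛」 carries the most weight, so the output is "love". Decoding the third word uses $c_3$, in which the two characters 「中國」 both carry large weights, so the output is "China". In other words, the $c_i$ vectors of the Attention mechanism record which part of the input each decoding step should focus on, and thereby align the encoding and decoding processes. Experiments show that this mechanism effectively addresses the poor decoding quality caused by overly long inputs, and that it improves generation quality in general.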

Let's now use the Attention proposed by Bahdanau as the example and look at the mechanism in detail.

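Before Attention is introduced, what the Decoder produces at a given step depends on three things. In an RNN, each output depends on the hidden state and the input, and in a Seq2Seq model it additionally depends on the c vector. So at step i, let $y_i$ be what the decoder produces, $y_{i-1}$ the previous decoding result and $s_i$ the hidden state. They satisfy:

$$y_i = g(y_{i-1}, s_i, c)$$

The hidden states $s_i$ and $s_{i-1}$ are in turn related by:

$$s_i = f(s_{i-1}, y_{i-1}, c)$$

That is, each hidden state is computed from the previous hidden state, the previous output and the c vector.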

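As mentioned above, though, this causes problems: a single c vector cannot hold all the information in the input, especially when the input sequence is very long. So instead of one c vector we use a separate vector $c_i$ for each decoding step, and the formula becomes:

$$y_i = g(y_{i-1}, s_i, c_i)$$

The computation of $s_i$ changes in the same way:

$$s_i = f(s_{i-1}, y_{i-1}, c_i)$$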

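So every time a $y_i$ is decoded there is a corresponding $c_i$ vector. Where does $c_i$ come from? It is the weighted average of the encoder's hidden states at every time step. Let the encoder-side sequence length be $T_x$, let j index positions in that sequence, and let $h_j$ be the encoder's hidden state at position j. For decoder step i, $c_i$ is:

$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$$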

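Among the encoder's outputs, $h_j$ contains information about the j-th word of the input sequence and some of what came before it; with a bidirectional RNN it contains information about the j-th word and the words on both sides of it. $\alpha_{ij}$ is the weight assigned to $h_j$, which means that when the i-th result is generated, attention is distributed differently over the different parts of the input. The higher $\alpha_{ij}$ is, the more attention the i-th output puts on the j-th input, and the more strongly the j-th input influences the i-th output.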

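So where does $\alpha_{ij}$ come from? It depends on the hidden state $s_{i-1}$ of output step i-1 and on each hidden state $h_j$ of the input:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$$

where $e_{ij}$ is:

$$e_{ij} = a(s_{i-1}, h_j)$$

In other words, the weight is obtained by computing a score from $s_{i-1}$ and each $h_j$ and then passing the scores through a softmax function; the result is $\alpha_{ij}$.

So $c_i$ can be written as:

$$c_i = \sum_{j=1}^{T_x} \mathrm{softmax}\big(a(s_{i-1}, h_j)\big)\, h_j$$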

That completes the derivation of the Attention mechanism.
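The derivation can be traced end to end with a small sketch for a single decoder step. It assumes the additive scoring form from the Bahdanau paper, $e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$; the function names and all toy values below are made up for illustration.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(s_prev, encoder_states, w_a, u_a, v_a):
    """Return (alpha_i, c_i) for one decoder step."""
    projected_state = matvec(w_a, s_prev)
    scores = []
    for h_j in encoder_states:
        projected_h = matvec(u_a, h_j)
        # e_ij = v_a . tanh(W_a s_{i-1} + U_a h_j)
        e_ij = sum(v * math.tanh(a + b)
                   for v, a, b in zip(v_a, projected_state, projected_h))
        scores.append(e_ij)
    alphas = softmax(scores)
    dim = len(encoder_states[0])
    # c_i = sum_j alpha_ij * h_j
    context = [sum(a * h[k] for a, h in zip(alphas, encoder_states))
               for k in range(dim)]
    return alphas, context

# Toy values: identity projections and three encoder states of size 2.
identity = [[1.0, 0.0], [0.0, 1.0]]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
alphas, context = additive_attention(
    s_prev=[0.0, 0.0], encoder_states=encoder_states,
    w_a=identity, u_a=identity, v_a=[1.0, 1.0])
print(alphas, context)
```

The first two encoder states get the same score and therefore the same weight, while the all-zero state scores 0 and receives less attention.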

TensorFlow AttentionWrapper

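We now understand the basic principle, but there is still a large gap between that and a working implementation. So next we will look at how Attention is implemented in the TensorFlow framework.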

In TensorFlow, the Attention code lives in tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py. This file implements two Attention mechanisms, BahdanauAttention and LuongAttention, which come from the following papers:

Neural Machine Translation by Jointly Learning to Align and Translate, Bahdanau, et al
Effective Approaches to Attention-based Neural Machine Translation, Luong, et al

attention_wrapper.py contains a number of classes. We will focus on these:

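AttentionMechanism, _BaseAttentionMechanism, LuongAttention and BahdanauAttention implement the logic of the Attention mechanism.
- AttentionMechanism is the parent class of the Attention classes. It inherits from object and has no implementation of its own.
- _BaseAttentionMechanism inherits from AttentionMechanism and defines the methods and properties shared by the Attention mechanisms.
- LuongAttention and BahdanauAttention both inherit from _BaseAttentionMechanism and implement the Attention mechanisms of the two papers above.
AttentionWrapperState stores the state of the whole computation. It is similar to the state of an RNN, except that it additionally stores attention, time and related information.
AttentionWrapper wraps an RNNCell. It inherits from RNNCell, so the wrapped object is still an RNNCell instance, and it can be used to build a Decoder with an Attention mechanism.
There are also some utility functions, such as hardmax and safe_cumprod.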

Below we use BahdanauAttention as the example to explain how the Attention mechanism and AttentionWrapper are implemented.

BahdanauAttention

We begin with how the BahdanauAttention class works.

First, its initializer:

def __init__(self,
             num_units,
             memory,
             memory_sequence_length=None,
             normalize=False,
             probability_fn=None,
             score_mask_value=None,
             dtype=None,
             name="BahdanauAttention"):

It takes eight parameters (not counting self), explained one by one below:

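num_units: the number of units. As we saw, computing $e_{ij}$ uses $s_{i-1}$ and $h_j$, whose dimensions may differ and therefore have to be transformed to a common size. This is where the two coefficient matrices Wa and Ua come in. In the code, num_units is used to declare fully connected Dense layers that bring both to the same dimension for the next step of the computation: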

query_layer=layers_core.Dense(num_units, name="query_layer", use_bias=False, dtype=dtype)
memory_layer=layers_core.Dense(num_units, name="memory_layer", use_bias=False, dtype=dtype)

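Here a query_layer and a memory_layer are declared. They apply a fully connected transformation to $s_{i-1}$ and $h_j$ respectively, so that both end up with the same dimension.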

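memory: The memory to query; usually the output of an RNN encoder. This is the source-side information used during decoding, and its shape must be [batch_size, max_time, context_dim]. At this point it is worth looking at the initializer of the parent class _BaseAttentionMechanism, which contains: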

with ops.name_scope(
        name, "BaseAttentionMechanismInit", nest.flatten(memory)):
    self._values = _prepare_memory(
        memory, memory_sequence_length,
        check_inner_dims_defined=check_inner_dims_defined)
    self._keys = (
        self.memory_layer(self._values) if self.memory_layer
        else self._values)

Here _prepare_memory() processes memory, and memory_layer then applies the fully connected transformation to it, giving a tensor of shape [batch_size, max_time, num_units].

memory_sequence_length: Sequence lengths for the batch entries in memory. This is the length information for memory, similar to sequence_length in dynamic_rnn. _prepare_memory() uses it to mask the memory:

seq_len_mask = array_ops.sequence_mask(
    memory_sequence_length,
    maxlen=array_ops.shape(nest.flatten(memory)[0])[1],
    dtype=nest.flatten(memory)[0].dtype)
seq_len_batch_size = (
    memory_sequence_length.shape[0].value
    or array_ops.shape(memory_sequence_length)[0])

normalize: Whether to normalize the energy term. Whether to apply weight normalization, a technique from the paper Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks, Salimans, et al.
probability_fn: A callable function which converts the score to probabilities. The function used to turn scores into probabilities. It must be callable; the default is softmax(), and other functions such as hardmax() can be given instead.
score_mask_value: The mask value for score before passing into probability_fn. The default is -inf. Only used if memory_sequence_length is not None. The value used to mask score before probability_fn is applied. It defaults to negative infinity and only takes effect when memory_sequence_length is given.
dtype: The data type for the query and memory layers of the attention mechanism. The data type; the default is float32.
name: Name to use when creating ops. A custom name.
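The interplay of memory_sequence_length, score_mask_value and probability_fn can be sketched for a single example in plain Python. The helper names and the toy scores are illustrative assumptions; the code imitates the behaviour described above and is not TensorFlow's implementation.

```python
import math

def sequence_mask(length, maxlen):
    """True for real time steps, False for padding."""
    return [t < length for t in range(maxlen)]

def mask_scores(scores, length, score_mask_value=float("-inf")):
    """Replace the scores of padded positions with score_mask_value."""
    mask = sequence_mask(length, len(scores))
    return [s if keep else score_mask_value for s, keep in zip(scores, mask)]

def softmax(scores):
    """Soft weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hardmax(scores):
    """One-hot weights: all attention on the highest score."""
    best = max(range(len(scores)), key=lambda t: scores[t])
    return [1.0 if t == best else 0.0 for t in range(len(scores))]

scores = [0.2, 1.5, 3.0, 0.7]   # made-up scores; the last two steps are padding
masked = mask_scores(scores, length=2)
soft = softmax(masked)
hard = hardmax(masked)
print(soft)
print(hard)
```

Because the padded positions are set to negative infinity before the probability function runs, they receive exactly zero weight, even though one of them had the highest raw score.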

Next, the class defines a __call__() method:

def __call__(self, query, previous_alignments):
    with variable_scope.variable_scope(None, "bahdanau_attention", [query]):
        processed_query = self.query_layer(query) if self.query_layer else query
        score = _bahdanau_score(processed_query, self._keys, self._normalize)
    alignments = self._probability_fn(score, previous_alignments)
    return alignments

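It first computes processed_query by passing query through the fully connected query_layer, so that its last dimension becomes num_units. It then calls _bahdanau_score(), which is the important part: it computes $e_{ij}$ from the formulas above. Its arguments are processed_query and the _keys variable mentioned earlier; the former corresponds to $s_{i-1}$ and the latter to $h_j$. _bahdanau_score() is implemented as follows: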

def _bahdanau_score(processed_query, keys, normalize):
    dtype = processed_query.dtype
    # Get the number of hidden units from the trailing dimension of keys
    num_units = keys.shape[2].value or array_ops.shape(keys)[2]
    # Reshape from [batch_size, ...] to [batch_size, 1, ...] for broadcasting.
    processed_query = array_ops.expand_dims(processed_query, 1)
    v = variable_scope.get_variable(
        "attention_v", [num_units], dtype=dtype)
    if normalize:
        # Scalar used in weight normalization
        g = variable_scope.get_variable(
            "attention_g", dtype=dtype,
            initializer=math.sqrt((1. / num_units)))
        # Bias added prior to the nonlinearity
        b = variable_scope.get_variable(
            "attention_b", [num_units], dtype=dtype,
            initializer=init_ops.zeros_initializer())
        # normed_v = g * v / ||v||
        normed_v = g * v * math_ops.rsqrt(
            math_ops.reduce_sum(math_ops.square(v)))
        return math_ops.reduce_sum(
            normed_v * math_ops.tanh(keys + processed_query + b), [2])
    else:
        return math_ops.reduce_sum(
            v * math_ops.tanh(keys + processed_query), [2])

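What this does is add keys and processed_query, pass the sum through tanh, and reduce the result against the learned vector v. If normalize is set, v is additionally weight-normalized and a bias is added. The result is the $e_{ij}$ of the formulas, which in TensorFlow is usually held in a variable named score.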
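For a single query, the two branches of the score function can be mimicked in plain Python. This is a sketch under assumptions: `bahdanau_score` here is a hypothetical stand-in without a batch dimension, and the values of v, the query and the keys are made up.

```python
import math

def bahdanau_score(processed_query, keys, v, normalize=False, g=1.0, b=None):
    """Score one projected query against each projected key.

    processed_query: [num_units]; keys: [max_time][num_units]; v: [num_units].
    """
    if normalize:
        norm = math.sqrt(sum(x * x for x in v))
        v = [g * x / norm for x in v]           # normed_v = g * v / ||v||
        bias = b if b is not None else [0.0] * len(v)
    else:
        bias = [0.0] * len(v)                   # no bias without normalization
    return [
        sum(v_k * math.tanh(k + q + b_k)
            for v_k, k, q, b_k in zip(v, key, processed_query, bias))
        for key in keys
    ]

v = [3.0, 4.0]                        # made-up values; ||v|| = 5
query = [0.0, 0.0]
keys = [[0.0, 0.0], [100.0, 100.0]]   # tanh saturates to 1 for the second key
g = math.sqrt(1.0 / len(v))           # same initial value as attention_g above

plain = bahdanau_score(query, keys, v)
normed = bahdanau_score(query, keys, v, normalize=True, g=g)
print(plain, normed)
```

With normalization, the overall scale of the score is controlled by the scalar g rather than by the length of v.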

Back in __call__(), we now have the score variable. The next step is to apply softmax() to it, which gives $\alpha_{ij}$:

alignments = self._probability_fn(score, previous_alignments)

This is the weight that the Decoder, at step i, assigns to each $h_j$ produced by the Encoder. In TensorFlow it is usually held in a variable named alignments.

To sum up, BahdanauAttention is initialized with num_units and the Encoder outputs, and calling it with a query then returns the weights, alignments.
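That lifecycle (project the memory once at construction, then call the object once per decoding step) can be sketched as a small class. `ToyBahdanauAttention` and its parameters are hypothetical names invented for illustration; the kernels are passed in explicitly instead of being created from num_units, and there is no batch dimension.

```python
import math

class ToyBahdanauAttention:
    """Illustrative stand-in for the pattern, not the TensorFlow class."""

    def __init__(self, memory, memory_kernel, query_kernel, v):
        self.values = memory
        # Keys are computed once here and reused on every call.
        self.keys = [self._project(h, memory_kernel) for h in memory]
        self.query_kernel = query_kernel
        self.v = v

    @staticmethod
    def _project(vector, kernel):
        """Bias-free dense layer: vector [in] times kernel [in][out]."""
        return [sum(vector[i] * kernel[i][o] for i in range(len(vector)))
                for o in range(len(kernel[0]))]

    def __call__(self, query):
        """Return the alignments (attention weights) for one query."""
        q = self._project(query, self.query_kernel)
        scores = [sum(v * math.tanh(k + qk) for v, k, qk in zip(self.v, key, q))
                  for key in self.keys]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

identity = [[1.0, 0.0], [0.0, 1.0]]
mechanism = ToyBahdanauAttention(
    memory=[[2.0, 0.0], [0.0, 0.0], [-2.0, 0.0]],   # made-up encoder outputs
    memory_kernel=identity, query_kernel=identity, v=[1.0, 1.0])
alignments = mechanism([0.0, 0.0])
print(alignments)
```

Computing the keys in the constructor means the encoder outputs are projected only once, however many decoding steps follow.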

AttentionWrapperState

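Next, the AttentionWrapperState class. It is quite simple: it defines the variables that may need to be kept during the Attention computation, such as cell_state, attention, time and alignments, which also makes later visualization easier. The code is: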

class AttentionWrapperState(
        collections.namedtuple("AttentionWrapperState",
                               ("cell_state", "attention", "time", "alignments",
                                "alignment_history"))):

As you can see, it inherits from the namedtuple data structure. AttentionWrapperState is essentially a struct declaration: you pass in the required fields to create an instance.
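A namedtuple-based state can be sketched as follows. `ToyAttentionState` is a hypothetical name and the field values are made up; the `clone` helper is modelled on the usual `_replace` idiom for namedtuples.

```python
import collections

class ToyAttentionState(
        collections.namedtuple(
            "ToyAttentionState",
            ("cell_state", "attention", "time", "alignments",
             "alignment_history"))):
    """Immutable record of everything carried from one step to the next."""

    def clone(self, **kwargs):
        """Return a copy with the given fields replaced."""
        return self._replace(**kwargs)

state = ToyAttentionState(
    cell_state=[0.0, 0.0], attention=[0.0], time=0,
    alignments=[1.0, 0.0], alignment_history=())

# Fields are read by name; updates produce a new object.
next_state = state.clone(time=state.time + 1, attention=[0.5])
print(next_state.time, next_state.attention)
```

Because a namedtuple is immutable, every step creates a new state instead of modifying the old one.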

AttentionWrapper

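Now that we understand the Attention mechanism and how BahdanauAttention works, we can finally look at AttentionWrapper. You may have used other wrappers such as DropoutWrapper or ResidualWrapper. They are all RNNCell instances, and AttentionWrapper is no exception: it wraps an RNNCell, and the result is still an RNNCell instance. To add Attention to an ordinary RNN model, you only need to wrap the RNNCell in an AttentionWrapper and pass it an AttentionMechanism instance. To switch to a different AttentionMechanism, you only change the argument to AttentionWrapper. The Attention implementation is thus fully decoupled and very flexible to configure. Nicely done, TensorFlow!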

Let's look at its initializer first. Its parameters are:

def __init__(self,
             cell,
             attention_mechanism,
             attention_layer_size=None,
             alignment_history=False,
             cell_input_fn=None,
             output_attention=True,
             initial_cell_state=None,
             name=None):

The parameters are explained one by one below:

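cell: An instance of RNNCell. This can be a single RNNCell or a MultiRNNCell made up of several RNNCells.
attention_mechanism: an AttentionMechanism instance, such as a BahdanauAttention object. It can also be a list of AttentionMechanism instances.
attention_layer_size: a number or a list of numbers. If it is None (the default), the weighted Attention result is used directly as the output. If it is not None, the Attention result is concatenated with the cell output and passed through a linear transformation before being output. The relevant code is: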

if attention_layer_size is not None:
    attention_layer_sizes = tuple(
        attention_layer_size if isinstance(attention_layer_size, (list, tuple))
        else (attention_layer_size,))
    if len(attention_layer_sizes) != len(attention_mechanisms):
        raise ValueError(
            "If provided, attention_layer_size must contain exactly one "
            "integer per attention_mechanism, saw: %d vs %d"
            % (len(attention_layer_sizes), len(attention_mechanisms)))
    self._attention_layers = tuple(
        layers_core.Dense(attention_layer_size, name="attention_layer",
                          use_bias=False, dtype=attention_mechanisms[i].dtype)
        for i, attention_layer_size in enumerate(attention_layer_sizes))
    self._attention_layer_size = sum(attention_layer_sizes)
else:
    self._attention_layers = None
    self._attention_layer_size = sum(
        attention_mechanism.values.get_shape()[-1].value
        for attention_mechanism in attention_mechanisms)

for i, attention_mechanism in enumerate(self._attention_mechanisms):
    attention, alignments = _compute_attention(
        attention_mechanism, cell_output, previous_alignments[i],
        self._attention_layers[i] if self._attention_layers else None)
    alignment_history = previous_alignment_history[i].write(
        state.time, alignments) if self._alignment_history else ()

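alignment_history: whether to store the alignments of every step in the state, so that they can be visualized later.
cell_input_fn: how the input is processed. By default the previous step's Attention is concatenated to it, so that the model does not keep attending to the same content. It is called like this: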

cell_inputs = self._cell_input_fn(inputs, state.attention)

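output_attention: whether to return the Attention. If False, the cell output is returned; otherwise the Attention is returned. The default is True.
initial_cell_state: the initial state for the computation.
name: a custom name.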
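The way attention_layer_size is normalized and validated can be tried out in isolation. The two helper functions below are hypothetical names that restate the logic of the initializer snippet above in plain Python, with the context depths passed in as plain integers.

```python
def normalize_layer_sizes(attention_layer_size, num_mechanisms):
    """Turn an int or a list into a tuple with one size per mechanism."""
    if attention_layer_size is None:
        return None
    sizes = tuple(
        attention_layer_size
        if isinstance(attention_layer_size, (list, tuple))
        else (attention_layer_size,))
    if len(sizes) != num_mechanisms:
        raise ValueError(
            "attention_layer_size must contain exactly one integer per "
            "attention mechanism, saw: %d vs %d" % (len(sizes), num_mechanisms))
    return sizes

def attention_output_size(sizes, context_depths):
    """Width of the attention vector emitted at each step."""
    if sizes is not None:
        return sum(sizes)            # one projected vector per mechanism
    return sum(context_depths)       # raw context vectors, concatenated

print(normalize_layer_sizes(128, 1))
print(attention_output_size(normalize_layer_sizes([64, 32], 2), [256, 256]))
print(attention_output_size(None, [256, 512]))
```

With two mechanisms and no attention layers, the output is as wide as both context vectors together; with attention layers, it is the sum of the layer sizes.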

The core of AttentionWrapper is its call() method, the counterpart of RNNCell's call() method, which the AttentionWrapper class overrides. The code is:

def call(self, inputs, state):
    # Step 1
    cell_inputs = self._cell_input_fn(inputs, state.attention)
    # Step 2
    cell_state = state.cell_state
    cell_output, next_cell_state = self._cell(cell_inputs, cell_state)
    # Step 3
    if self._is_multi:
        previous_alignments = state.alignments
        previous_alignment_history = state.alignment_history
    else:
        previous_alignments = [state.alignments]
        previous_alignment_history = [state.alignment_history]
    all_alignments = []
    all_attentions = []
    all_histories = []
    for i, attention_mechanism in enumerate(self._attention_mechanisms):
        attention, alignments = _compute_attention(
            attention_mechanism, cell_output, previous_alignments[i],
            self._attention_layers[i] if self._attention_layers else None)
        alignment_history = previous_alignment_history[i].write(
            state.time, alignments) if self._alignment_history else ()
        all_alignments.append(alignments)
        all_histories.append(alignment_history)
        all_attentions.append(attention)
    # Step 4
    attention = array_ops.concat(all_attentions, 1)
    # Step 5
    next_state = AttentionWrapperState(
        time=state.time + 1,
        cell_state=next_cell_state,
        attention=attention,
        alignments=self._item_or_tuple(all_alignments),
        alignment_history=self._item_or_tuple(all_histories))
    # Step 6
    if self._output_attention:
        return attention, next_state
    else:
        return cell_output, next_state

Some error-checking code has been removed here so that the structure is easier to see.

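In step one, _cell_input_fn() is called on inputs and state.attention. By default it concatenates them with concat() to form the input for the current time step. The reason is that the previous step's Attention can help the current one, for example by keeping the model from putting its attention in the same place twice in a row.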
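The default behaviour of step one, and how a custom function could replace it, can be sketched with plain lists standing in for tensors. Both function names and the toy vectors are assumptions for illustration.

```python
def default_cell_input_fn(inputs, attention):
    """Mirror of the default: concatenate along the feature axis."""
    return list(inputs) + list(attention)

def inputs_only_fn(inputs, attention):
    """Hypothetical alternative that ignores the previous attention."""
    return list(inputs)

inputs = [0.1, 0.2, 0.3]            # made-up embedding of the current token
previous_attention = [0.9, 0.8]     # made-up attention from the previous step

cell_inputs = default_cell_input_fn(inputs, previous_attention)
print(cell_inputs)                  # 3 values from inputs, then 2 from attention
```

With the default, the wrapped cell sees an input that is wider than the original by the size of the attention vector.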

Step two simply calls the call() method of the ordinary RNNCell, which gives the output and the next state.

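In step three, the output obtained so far has not used the alignments from the AttentionMechanism, which means it has not yet attended over the Encoder's information. So _compute_attention() is called to compute the weights. It is implemented as follows: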

def _compute_attention(attention_mechanism, cell_output, previous_alignments,
                       attention_layer):
    alignments = attention_mechanism(
        cell_output, previous_alignments=previous_alignments)
    expanded_alignments = array_ops.expand_dims(alignments, 1)
    context = math_ops.matmul(expanded_alignments, attention_mechanism.values)
    context = array_ops.squeeze(context, [1])
    if attention_layer is not None:
        attention = attention_layer(array_ops.concat([cell_output, context], 1))
    else:
        attention = context
    return attention, alignments

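This method takes four arguments: attention_mechanism is the AttentionMechanism instance, cell_output is the current output, previous_alignments is the alignments from the previous step, and attention_layer is the optional projection layer. Calling attention_mechanism gives the alignments for the current step, that is, $\alpha_{ij}$. The alignments are then used to compute the weighted sum, which gives the attention, that is, $c_i$ (concatenated with cell_output and passed through attention_layer when one is configured). Finally both are returned.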
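Both branches of this method can be followed with a single-example sketch in plain Python, where list concatenation stands in for concat and there is no batch dimension. `compute_attention`, `make_dense` and the toy numbers are assumptions for illustration.

```python
def compute_attention(alignments, values, cell_output, attention_layer=None):
    """Weighted sum of the values, then an optional projection."""
    dim = len(values[0])
    context = [sum(a * row[k] for a, row in zip(alignments, values))
               for k in range(dim)]
    if attention_layer is None:
        return context
    # Concatenate [cell_output, context] and project it.
    return attention_layer(list(cell_output) + context)

def make_dense(kernel):
    """Hypothetical bias-free dense layer with kernel [in][out]."""
    def layer(vector):
        return [sum(vector[i] * kernel[i][o] for i in range(len(vector)))
                for o in range(len(kernel[0]))]
    return layer

alignments = [0.5, 0.5, 0.0]                      # made-up weights
values = [[2.0, 0.0], [0.0, 4.0], [9.0, 9.0]]     # made-up encoder outputs
cell_output = [1.0]

context_only = compute_attention(alignments, values, cell_output)
# This kernel maps the 3 concatenated values to 1 value by summing them.
projected = compute_attention(
    alignments, values, cell_output,
    attention_layer=make_dense([[1.0], [1.0], [1.0]]))
print(context_only, projected)
```

The third encoder output has weight 0 and therefore does not contribute to the context at all.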

In step four, the attention results of all attention mechanisms are concatenated into a single attention vector.

In step five, a new AttentionWrapperState is created as the state for the next step.

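In step six, the method checks whether to output the Attention. If so, it returns the Attention and the next state; otherwise it returns the cell output and the next state.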
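The six steps can be strung together in a compact toy version. This is a sketch under several assumptions: `toy_cell` is a made-up stand-in for an RNNCell, a plain dot-product score replaces the Bahdanau score to keep the code short, there is a single attention mechanism (so step four has nothing to concatenate), and alignment_history is left out.

```python
import collections
import math

State = collections.namedtuple(
    "State", ("cell_state", "attention", "time", "alignments"))

def toy_cell(cell_inputs, cell_state):
    """Stand-in RNN cell: returns (output, next_state)."""
    shift = sum(cell_inputs) / len(cell_inputs)
    new_state = [math.tanh(s + shift) for s in cell_state]
    return new_state, new_state

def dot_attention(query, memory):
    """Softmax over dot-product scores (swapped in for brevity)."""
    scores = [sum(q * m for q, m in zip(query, row)) for row in memory]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def wrapper_step(inputs, state, memory, output_attention=True):
    # Step 1: build the cell input from inputs and the previous attention.
    cell_inputs = list(inputs) + list(state.attention)
    # Step 2: run the wrapped cell.
    cell_output, next_cell_state = toy_cell(cell_inputs, state.cell_state)
    # Step 3: compute alignments and the attention (context) vector.
    alignments = dot_attention(cell_output, memory)
    attention = [sum(a * row[k] for a, row in zip(alignments, memory))
                 for k in range(len(memory[0]))]
    # Step 4 is skipped: only one mechanism, nothing to concatenate.
    # Step 5: assemble the next state.
    next_state = State(next_cell_state, attention, state.time + 1, alignments)
    # Step 6: choose what to emit.
    return (attention if output_attention else cell_output), next_state

memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # made-up encoder outputs
state0 = State([0.0, 0.0], [0.0, 0.0], 0, [0.0, 0.0, 0.0])
out1, state1 = wrapper_step([1.0, -1.0], state0, memory)
out2, state2 = wrapper_step([0.5, 0.5], state1, memory)
print(state2.time, out2)
```

The attention computed in one step is stored in the state and fed back into the cell input of the next step, which is how the steps are chained together.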

That wraps up the walkthrough of the AttentionWrapper source code. Once you understand the source, tuning and optimizing your models becomes much easier.