Understanding the Attention Mechanism in NLP in One Article (with a Detailed Code Walkthrough)

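Produced by Machine Learning Algorithms and Natural Language Processing
@Original column author of the official account: Don.hub
Affiliation | Algorithm engineer, JD.com
School | Imperial College London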


Outline
Intuition
Analysis
Pros
Cons
From Seq2Seq To Attention Model
seq2seq is important, but its flaws are obvious
attention was born
Write the encoder and decoder model
Taxonomy of attention
number of sequence
distinctive
co-attention
self
number of abstraction
single-level
multi-level
number of positions
soft/global
hard
local
number of representations
multi-representational
multi-dimensional
summary
Networks with Attention
encoder-decoder
CNN/RNN + RNN
Pointer Networks
Transformer
Memory Networks
Applications
NLG
Classification
Recommendation Systems
ref
1. Outline

2. Intuition

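The word "eye-catching" captures the idea of attention well. When we look at a picture, we are easily drawn to whatever is more important or more salient, so we place more of our attention on local regions. In computer vision (CV), this can be viewed as local regions of the image carrying larger weights; in image captioning, for example, each word of the caption focuses mainly on a local region.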


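In NLP, think of reading comprehension: we usually read the passage with the question in mind, looking for the answer, so different parts of the passage need different amounts of attention. Likewise, in sentiment analysis of reviews, certain sentiment words such as "amazing" deserve special attention, because they are important sentiment words and often determine the reviewer's sentiment. See the figure below (Yang et al., Prof. He's team, HAN).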

Put plainly, attention is just a vector of weights.

3. Analysis

3.1 Pros

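The main benefits of attention are good interpretability and a large improvement in model performance. It has become a required module in many SOTA models; in particular, the arrival of the Transformer (which uses self / global / multi-level / multi-head attention) dramatically changed the NLP landscape.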

3.2 Cons

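Attention cannot capture positional information by itself, so position information has to be added. Different attention mechanisms of course have different drawbacks. For the Transformer, the biggest drawback is memory consumption: the attention scores have dimension N*N, so the sequence length N cannot be too long, and as a result separate segments have no connection to each other. (See XLNet and its solution for details.)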
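To make the N*N cost concrete, here is a small back-of-the-envelope sketch. It assumes float32 scores (4 bytes each) and counts a single score matrix only; the function name `attention_score_bytes` is hypothetical, and a real model multiplies this by the number of heads, layers, and examples in the batch.

```python
def attention_score_bytes(seq_len, bytes_per_value=4):
    """Bytes needed for one N x N attention score matrix (float32 assumed)."""
    return seq_len * seq_len * bytes_per_value


for n in (512, 1024, 4096):
    mib = attention_score_bytes(n) / 2**20
    print(f"N={n}: {mib:.0f} MiB per score matrix")
```

Doubling the sequence length quadruples the memory for the scores, which is why long inputs are usually cut into segments.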

3.3 From Seq2Seq To Attention Model

Why does attention exist? It was born for the translation task (though it ended up not being limited to translation). Let's look at how it evolved.

3.3.1 seq2seq is important, but its flaws are obvious

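A Seq2Seq model consists of an encoder and a decoder; its main purpose is to translate input text into target text. Both the encoder and the decoder are RNNs (a vanilla RNN, an LSTM, a GRU, or a bidirectional RNN). The model encodes the source text into a fixed-length context vector, and the decoder then uses this encoding to decode the target output. This kind of transformation applies to sequence-to-sequence tasks such as translation, speech conversion, and dialogue generation.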

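But the drawbacks of this model are also obvious:
- First, all of the input is encoded into one fixed-length context vector. What length is appropriate? It is hard to give a definite answer. A fixed-length vector cannot encode all of the contextual information, so much of the long-range dependency information is lost.
- Second, when generating output, the decoder has no mechanism for matching against the encoder's input, that is, for attending to different inputs with different weights. In the original English: "Second, it is unable to model alignment between input and output sequences, which is an essential aspect of structured output tasks such as translation or summarization [Young et al., 2018]. Intuitively, in sequence-to-sequence tasks, each output token is expected to be more influenced by some specific parts of the input sequence. However, decoder lacks any mechanism to selectively focus on relevant input tokens while generating each output token."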

3.3.2 attention was born

NMT [paper] [code] was the first to add an attention block between the encoder and the decoder, mainly to solve the matching problem between the two.

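Using the usual notation for this model: $s_0$ is the decoder's initial hidden state, which is randomly initialized (in contrast to seq2seq, which uses the context vector to initialize the decoder's hidden state), and $s_i$ are the decoder's hidden states.
$h_j$ is the output hidden state at the j-th encoder position.
$\alpha_{ij}$ is the weight that the i-th decoder position assigns to the j-th encoder position.
$y_i$ is the output at the i-th decoder position, that is, the hidden state's output passed through a fully connected layer.
$c_i$ is the context vector for the i-th decoder position, which is the weighted sum of the encoder's output hidden states.
The decoder's input is built from its own hidden state together with the concatenation of the previous output's embedding and $c_i$.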
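For reference, additive (Bahdanau-style) attention is commonly written as below. The symbols are an assumption, since the article's own formulas were not preserved: $s_{i-1}$ is the previous decoder hidden state, $h_j$ the j-th encoder hidden state, $T_x$ the source length, and $v_a$, $W_a$, $U_a$ trainable parameters.

```latex
\begin{align}
e_{ij} &= v_a^{\top} \tanh\left(W_a s_{i-1} + U_a h_j\right) \\
\alpha_{ij} &= \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} \\
c_i &= \sum_{j=1}^{T_x} \alpha_{ij} h_j
\end{align}
```

The first line is the score, the second the softmax over encoder positions, and the third the weighted sum that forms the context vector.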

3.3.3 Write the encoder and decoder model

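For the detailed implementation, see the TensorFlow repo, which uses tf1.x: Neural Machine Translation (seq2seq) tutorial. The code here uses the newer 2.x code: code.

After the input passes through the encoder, the resulting hidden states have shape (batch_size, max_length, hidden_size), and the decoder's hidden state has shape (batch_size, hidden_size).

The equations being implemented are: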

This tutorial uses Bahdanau attention for the encoder. Let's decide on notation before writing the simplified form:

FC = Fully connected (dense) layer
EO = Encoder output
H = hidden state
X = input to the decoder

And the pseudo-code:

score = FC(tanh(FC(EO) + FC(H)))
attention weights = softmax(score, axis = 1). Softmax by default is applied on the last axis but here we want to apply it on the 1st axis, since the shape of score is (batch_size, max_length, 1). Max_length is the length of our input. Since we are trying to assign a weight to each input, softmax should be applied on that axis.
context vector = sum(attention weights * EO, axis = 1). Same reason as above for choosing axis as 1.
embedding output = The input to the decoder X is passed through an embedding layer.
merged vector = concat(embedding output, context vector)
This merged vector is then given to the GRU

import tensorflow as tf


class BahdanauAttention(tf.keras.layers.Layer):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)

  def call(self, query, values):
    # hidden shape == (batch_size, hidden size)
    # hidden_with_time_axis shape == (batch_size, 1, hidden size)
    # we are doing this to perform addition to calculate the score
    hidden_with_time_axis = tf.expand_dims(query, 1)

    # score shape == (batch_size, max_length, 1)
    # we get 1 at the last axis because we are applying score to self.V
    # the shape of the tensor before applying self.V is (batch_size, max_length, units)
    score = self.V(tf.nn.tanh(
        self.W1(values) + self.W2(hidden_with_time_axis)))

    # attention_weights shape == (batch_size, max_length, 1)
    attention_weights = tf.nn.softmax(score, axis=1)

    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * values
    context_vector = tf.reduce_sum(context_vector, axis=1)

    return context_vector, attention_weights


class Encoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
    super(Encoder, self).__init__()
    self.batch_sz = batch_sz
    self.enc_units = enc_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.enc_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')

  def call(self, x, hidden):
    x = self.embedding(x)
    output, state = self.gru(x, initial_state = hidden)
    return output, state

  def initialize_hidden_state(self):
    return tf.zeros((self.batch_sz, self.enc_units))


class Decoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
    super(Decoder, self).__init__()
    self.batch_sz = batch_sz
    self.dec_units = dec_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.dec_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
    self.fc = tf.keras.layers.Dense(vocab_size)

    # used for attention
    self.attention = BahdanauAttention(self.dec_units)

  def call(self, x, hidden, enc_output):
    # enc_output shape == (batch_size, max_length, hidden_size)
    context_vector, attention_weights = self.attention(hidden, enc_output)

    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

    # passing the concatenated vector to the GRU
    output, state = self.gru(x)

    # output shape == (batch_size * 1, hidden_size)
    output = tf.reshape(output, (-1, output.shape[2]))

    # output shape == (batch_size, vocab)
    x = self.fc(output)

    return x, state, attention_weights

4. Taxonomy of attention

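Attention can be divided into many categories according to different criteria, but concretely they are all interactions among q (query), k (key), and v (value). A score is computed from q and k; the scoring functions differ, as the table below shows, and the scores are then normalized with softmax. Finally, the normalized scores are multiplied with v and summed (or the argmax is taken; see pointer networks).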
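The query/key/value pipeline described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular library's API: the helper names `softmax` and `attend` are hypothetical, and the plain dot product is just one of several possible score functions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Score the query against each key, normalize, then mix the values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return context, weights


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
context, weights = attend([2.0, 0.0], keys, values)
print(weights)   # the first and third keys match the query best
print(context)
```

Replacing the weighted sum with an argmax over `weights` gives the pointer-style variant mentioned in the text.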

Below is a summary table of several popular attention mechanisms and corresponding alignment score functions:

(*) Referred to as "concat" in Luong, et al., 2015 and as "additive attention" in Vaswani, et al., 2017. (^) It adds a scaling factor $1/\sqrt{n}$, motivated by the concern that when the input is large, the softmax function may have an extremely small gradient, which makes efficient learning hard.
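As a rough sketch of how such alignment score functions differ, the snippet below implements four common ones for a decoder state `s` and an encoder state `h`: dot product, scaled dot product, the bilinear "general" form, and the additive ("concat") form. The function and parameter names are illustrative assumptions, and the weights `W` and `v` would be trained in a real model.

```python
import math


def dot(s, h):
    """Dot-product score: s^T h."""
    return sum(a * b for a, b in zip(s, h))


def scaled_dot(s, h):
    """Dot product scaled by 1/sqrt(n), where n is the vector length."""
    return dot(s, h) / math.sqrt(len(s))


def general(s, h, W):
    """Bilinear score s^T W h, with W of shape (len(s) x len(h))."""
    Wh = [dot(row, h) for row in W]
    return dot(s, Wh)


def additive(s, h, W, v):
    """Additive ("concat") score v^T tanh(W [s; h])."""
    concat = s + h
    hidden = [math.tanh(dot(row, concat)) for row in W]
    return dot(v, hidden)


s, h = [1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
print(dot(s, h), scaled_dot(s, h), general(s, h, identity))
```

With n = 4 the scaled score is half the raw dot product; for large n the scaling keeps the softmax input from growing with the dimension.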

The categories below are not mutually exclusive; HAN, for example, is a multi-level, soft attention model (AM).

4.1 number of sequence

This classifies attention by which sequences the query and the value come from.

4.1.1 distinctive

The query and the value come from two different sequences, the input sequence and the output sequence. In the NMT model above, for example, the query comes from the decoder's hidden state and the value comes from the encoder's hidden states.

4.1.2 co-attention

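A co-attention model jointly learns weights over multiple input sequences and captures the interactions between them. In visual question answering, for example, the authors argue that attention over the image is important but attention over the question text is just as important, so they learn both jointly, using attention so that the model captures the important parts of the question and the corresponding image information at the same time.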

4.1.3 self

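In tasks such as text classification or recommendation, the input is a sequence but the output is not. In this setting, each word in the text attends to the other words of its own sequence to measure how relevant they are. See the figure below.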

We can look at the function documentation of BERT's self-attention implementation: if from_tensor and to_tensor are the same, it is self-attention.

def attention_layer(from_tensor,
                    to_tensor,
                    attention_mask=None,
                    num_attention_heads=1,
                    size_per_head=512,
                    query_act=None,
                    key_act=None,
                    value_act=None,
                    attention_probs_dropout_prob=0.0,
                    initializer_range=0.02,
                    do_return_2d_tensor=False,
                    batch_size=None,
                    from_seq_length=None,
                    to_seq_length=None):
  """Performs multi-headed attention from `from_tensor` to `to_tensor`.
  This is an implementation of multi-headed attention based on "Attention
  is all you Need". If `from_tensor` and `to_tensor` are the **same**, then
  this is self-attention. Each timestep in `from_tensor` attends to the
  corresponding sequence in `to_tensor`, and returns a fixed-width vector.

"""

4.2 number of abstraction

This classifies attention by the level of abstraction at which the weights are computed.

4.2.1 single-level

In the most common case, attention is computed only over the input sequence; this is ordinary single-level attention.

4.2.2 multi-level

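But many models, HAN for example, have the structure shown below. The model is hierarchical, and its attention is applied at multiple levels of that hierarchy. HAN targets document classification. Its premise is that a document is made of sentences and a sentence is made of words, so it builds a two-level encoder (bidirectional GRUs): the lower encoder encodes words and the upper encoder encodes sentences. Between the two encoders sits an attention layer that computes word-level attention. At the output, a sentence-level attention is applied as well, and its result is fed to a Dense layer for classification. Note that the two queries here (the word-level and sentence-level context vectors) are randomly initialized and trained together with the model, and the score function is also Dense-based; unlike NMT, however, this is self-attention.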
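The two-level idea can be sketched as follows. This is a simplified illustration rather than the HAN implementation: the bidirectional GRU encoders and the Dense + tanh projection before scoring are omitted, the word vectors are assumed to be already encoded, and `word_context` / `sentence_context` stand in for the trained query vectors.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, context):
    """Weight each vector by its similarity to a context (query) vector, then sum."""
    weights = softmax([sum(a * b for a, b in zip(v, context)) for v in vectors])
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


# A document is a list of sentences; a sentence is a list of word vectors.
document = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 1.0], [0.0, 2.0]],
]
word_context = [0.3, -0.2]      # placeholder for the trained word-level query
sentence_context = [0.1, 0.4]   # placeholder for the trained sentence-level query

# Level 1: pool the words of each sentence into one sentence vector.
sentence_vectors = [attention_pool(words, word_context) for words in document]
# Level 2: pool the sentence vectors into one document vector.
document_vector = attention_pool(sentence_vectors, sentence_context)
print(document_vector)
```

The document vector would then go to a classifier; the same pooling function is reused at both levels, only the query differs.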

4.3 number of positions

Depending on which positions the attention layer looks at, attention falls into three classes: global/soft (these two are nearly identical), local, and hard attention. Effective Approaches to Attention-based Neural Machine Translation proposed local and global attention; Show, Attend and Tell: Neural Image Caption Generation with Visual Attention proposed hard and soft attention.

4.3.1 soft/global

Global/soft attention attends over all positions of the input sequence. The advantage is that it is smooth and differentiable; the drawback is the heavy computation.

4.3.2 hard

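In hard attention, the context vector is computed from hidden states sampled from the input sequence, which amounts to randomly selecting hidden states and then computing attention. This reduces computation, but the downside is that the computation is not differentiable, so it requires reinforcement learning or other techniques such as variational learning methods.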
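A minimal sketch of the sampling step, assuming the scores have already been computed: instead of averaging all hidden states, one position is drawn from the attention distribution and its hidden state is used as the context. The function name `hard_attention` is hypothetical, and the fixed seed is only there to make the run repeatable.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def hard_attention(scores, hidden_states, rng):
    """Sample ONE position from softmax(scores) and return its hidden state."""
    probs = softmax(scores)
    index = rng.choices(range(len(hidden_states)), weights=probs, k=1)[0]
    return hidden_states[index], index


hidden_states = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
rng = random.Random(0)
context, picked = hard_attention([0.1, 2.0, 0.3], hidden_states, rng)
print(picked, context)
```

The discrete draw is what blocks ordinary backpropagation, which is why the text mentions reinforcement learning or variational methods.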

4.3.3 local

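Local attention is a compromise between hard and soft:
- First, find a point or position in the input sequence that needs attention.
- Then choose a window size and create a local soft attention around it.
The benefit is that the computation is differentiable and the amount of computation is reduced.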
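The two steps can be sketched as below, assuming the center position is already known (in practice it is predicted or set by alignment) and leaving out the Gaussian reweighting that some variants add. The names `local_attention`, `center`, and `half_width` are illustrative.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def local_attention(scores, hidden_states, center, half_width):
    """Soft attention restricted to the window [center - half_width, center + half_width]."""
    lo = max(0, center - half_width)
    hi = min(len(scores), center + half_width + 1)
    weights = softmax(scores[lo:hi])
    window = hidden_states[lo:hi]
    dim = len(hidden_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, window)) for d in range(dim)]
    return context, (lo, hi), weights


scores = [0.1, 0.5, 2.0, 0.3, 0.0, 1.0]
hidden_states = [[float(i), 1.0] for i in range(6)]
context, window, weights = local_attention(scores, hidden_states, center=2, half_width=1)
print(window, weights, context)
```

Only three of the six positions are scored and mixed here, which is where the saving over global attention comes from.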

4.4 number of representations

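Single-representation is the most common case, meaning each input has only one feature representation. In other scenarios an input may have several representations, so here we classify by how the input is represented.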

4.4.1 multi-representational

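In some scenarios a single feature representation cannot fully capture all the information in the input, and the input can be given several feature representations. For example, Show, attend and tell: Neural image caption generation with visual attention. In that line of work, the text input is given several word embedding representations, which are then combined by an attention-weighted sum. As another example, a text input can be given embeddings along lexical, syntactic, visual, and category dimensions, which are likewise combined by an attention-weighted sum.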

4.4.2 multi-dimensional

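As the name suggests, this kind of attention is about dimensions. Its weights determine the relevance of the individual dimensions of the input embedding vector. The dimensions of an embedding can be seen as implicit feature representations (less intuitive than an explicit one-hot representation and lacking interpretability, but still implicit representations of features), so computing the relevance of different dimensions identifies the feature dimensions that matter most. This is especially effective for resolving polysemy, so the approach is useful for sentence-level embedding representations and for NLU.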
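One way to picture this is to give every embedding dimension its own set of attention weights over the tokens, instead of one scalar weight per token. The sketch below assumes the per-dimension scores are already available (in a real model they would come from a trained layer); the function name `multi_dim_attention` is hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def multi_dim_attention(vectors, score_vectors):
    """score_vectors[t][d] scores dimension d of token t; softmax runs over tokens per dimension."""
    dim = len(vectors[0])
    output = []
    for d in range(dim):
        weights = softmax([s[d] for s in score_vectors])
        output.append(sum(w * v[d] for w, v in zip(weights, vectors)))
    return output


vectors = [[1.0, 10.0], [3.0, 20.0]]
uniform = multi_dim_attention(vectors, [[0.0, 0.0], [0.0, 0.0]])
# Dimension 0 prefers token 0, dimension 1 prefers token 1.
selective = multi_dim_attention(vectors, [[5.0, 0.0], [0.0, 5.0]])
print(uniform, selective)
```

With uniform scores each dimension is a plain average; with the second set of scores the two dimensions draw on different tokens, which a single scalar weight per token cannot express.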

5. summary

6. Networks with Attention

Having introduced so many kinds of attention, which networks is attention usually applied to? We summarize two kinds here: encoder-decoder based networks and memory networks.

6.1 encoder-decoder

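Encoder-decoder networks with attention are the most common kind of attention network, and NMT was the first network to propose the idea of attention. The encoder and decoder can be changed flexibly; they need not both be RNNs.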

6.1.1 CNN/RNN + RNN

For image-to-text tasks the encoder can be replaced with a CNN; text-to-text tasks can use RNN + RNN.

6.1.2 Pointer Networks

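Not every problem with sequence input and sequence output can be solved with an encoder-decoder model (e.g. sorting or the travelling salesman problem). Consider the following problem: we want to find a set of points that encloses all the points in the figure. The desired result is that, given all the points as input, the output is the set of enclosing points.

If we train an encoder-decoder directly, as in the figure below: the coordinates of 4 data points are fed in, producing the red vector; the vector is passed to the decoder, which produces a distribution; we then sample from it (for example, take the argmax to decide to output token 1, ...). Does it work? It does not. For example, if training uses 50 points numbered 1-50 but testing uses 100 points, the model can still only choose points numbered 1-50 and cannot select the rest.

Improvement: with attention, the network can decide dynamically how large the output set is.

x0, y0 stands for the END token. Every input gets an attention weight, and these weights are the output distribution.

The model terminates when the END point has the highest probability.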
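The pointing step can be sketched as follows: the attention weights over the input positions are used directly as the output distribution, so the number of possible outputs grows with the input. This sketch uses a dot-product score for brevity (pointer networks are usually described with an additive score), treats position 0 as the END slot, and uses made-up vectors in place of trained encoder and decoder states.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def point(decoder_state, encoder_states):
    """Return the input position with the highest attention weight and the full distribution."""
    scores = [sum(a * b for a, b in zip(decoder_state, h)) for h in encoder_states]
    probs = softmax(scores)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best, probs


END = [0.0, 0.0]  # slot 0 plays the role of the END token (x0, y0)
states_small = [END, [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
states_large = states_small + [[2.0, 2.0]]

idx_small, probs_small = point([1.0, 1.0], states_small)
idx_large, probs_large = point([1.0, 1.0], states_large)
idx_end, _ = point([-1.0, -1.0], states_small)
print(idx_small, idx_large, idx_end)
```

The same function handles 4 or 5 input slots without any change to its parameters, and decoding would stop once slot 0 is selected.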

6.1.3 Transformer

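The Transformer is an encoder + decoder network. It mainly addresses the slow computation of RNNs: its parallel self-attention mechanism improves computational efficiency. At the same time it brings heavy computation and excessive memory consumption, so the sequence length cannot be too long; see Transformer-XL for a solution. (A separate article on the Transformer will follow.)
- The role of multi-head: somewhat like the kernels of a CNN, the heads capture different feature information.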
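A rough sketch of the multi-head idea: each head runs scaled dot-product self-attention on its own slice of the feature vector, and the head outputs are concatenated. This is a simplification under stated assumptions: the learned query, key, value, and output projections of the real Transformer are omitted, so each head simply sees a fixed slice of the raw input.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_self_attention(sequence, num_heads):
    """Split each vector into num_heads slices and attend within each slice independently."""
    dim = len(sequence[0])
    if dim % num_heads != 0:
        raise ValueError("dim must be divisible by num_heads")
    head_dim = dim // num_heads
    outputs = [[] for _ in sequence]
    for h in range(num_heads):
        heads = [x[h * head_dim:(h + 1) * head_dim] for x in sequence]
        for i, q in enumerate(heads):
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(head_dim) for k in heads]
            w = softmax(scores)
            outputs[i].extend(
                sum(wj * v[d] for wj, v in zip(w, heads)) for d in range(head_dim)
            )
    return outputs


sequence = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
out = multi_head_self_attention(sequence, num_heads=2)
print(out)
```

Because every position attends to every other position independently, the inner loops have no sequential dependency, which is what makes the computation easy to parallelize.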

6.2 Memory Networks

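Applications such as question answering or chatbots need both a query and a knowledge database as input. End-to-end memory networks store the knowledge database in an array of memory blocks and then use attention to match the query to the answer. A memory network has four parts: the vector of the query (input), a series of trainable mapping matrices, the attention-weighted sum, and multi-hop reasoning. This makes it possible to reason using facts in the KB, key information in the history, and key information in the query, which is crucial in QA and dialogue. (This part needs to be expanded.)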
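A minimal sketch of multi-hop reading from memory, assuming the query and the memory entries have already been embedded (the trainable mapping matrices are left out) and using the common update in which the read-out is added to the query before the next hop. The names `memory_hop`, `input_memory`, and `output_memory` are illustrative.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def memory_hop(query, input_memory, output_memory):
    """One hop: match the query against memory keys, read a weighted sum, update the query."""
    probs = softmax([sum(a * b for a, b in zip(query, m)) for m in input_memory])
    read = [sum(p * c[d] for p, c in zip(probs, output_memory)) for d in range(len(query))]
    return [u + o for u, o in zip(query, read)], probs


input_memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # used for matching
output_memory = [[0.5, 0.0], [0.0, 0.5], [0.2, 0.2]]   # used for the read-out

u = [1.0, 0.0]
for _ in range(2):  # two hops of reasoning
    u, probs = memory_hop(u, input_memory, output_memory)
print(u, probs)
```

Each hop lets the updated query focus on different memory entries, which is how facts retrieved in one step can guide the next lookup.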

7. Applications

7.1 NLG

MT: machine translation
QA: problems have made use of attention to (i) better understand questions by focusing on relevant parts of the question [Hermann et al., 2015], (ii) store large amount of information using memory networks to help find answers [Sukhbaatar et al., 2015], and (iii) improve performance in visual QA task by modeling multi-modality in input using co-attention [Lu et al., 2016].
Multimedia Description (MD): is the task of generating a natural language text description of a multimedia input sequence which can be speech, image and video [Cho et al., 2015]. Similar to QA, here attention performs the function of finding relevant acoustic signals in speech input [Chorowski et al., 2015] or relevant parts of the input image [Xu et al., 2015] to predict the next word in caption. Further, Li et al. [2017] exploit the temporal and spatial structures of videos using multi-level attention for video captioning task. The lower abstraction level extracts specific regions within a frame and higher abstraction level focuses on small subset of frames selectively.

7.2 Classification

Document classification: HAN
Sentiment Analysis:
Similarly, in the sentiment analysis task, self attention helps to focus on the words that are important for determining the sentiment of input. A couple of approaches for aspect based sentiment classification by Wang et al. [2016] and Ma et al. [2018] incorporate additional knowledge of aspect related concepts into the model and use attention to appropriately weigh the concepts apart from the content itself. Sentiment analysis application has also seen multiple architectures being used with attention such as memory networks [Tang et al., 2016] and Transformer [Ambartsoumian and Popowich, 2018; Song et al., 2019].

7.3 Recommendation Systems

Multiple papers use self attention mechanism for finding the most relevant items in user's history to improve item recommendations either with collaborative filtering framework [He et al., 2018; Shuai Yu, 2019], or within an encoder-decoder architecture for sequential recommendations [Kang and McAuley, 2018; Zhou et al., 2018].

Recently attention has been used in novel ways which has opened new avenues for research. Some interesting directions include smoother incorporation of external knowledge bases, pre-training embeddings and multi-task learning, unsupervised representational learning, sparsity learning and prototypical learning i.e. sample selection.

8. ref

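Well written; its final section on models could be incorporated into this article
An excellent survey: An Attentive Survey of Attention Models
wildml.com/2016/01/atte
Illustrated walkthrough of NMT (the decoder part has a small mistake: the decoder's initial embedding is probably defined differently; the encoder's hidden output is used as the key for the attention score, and the actual input is the concatenation of the context and the embedding)
NMT code
pointer network
pointer slides
All Attention You Need (not finished reading yet)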