[NLP] Relative Position Encoding (1): Relative Position Representations (RPR) - Transformer

Date: 2019-11-29


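For the positional encoding of the Transformer model, the original "Attention Is All You Need" paper proposed absolute position encoding. Shaw et al. (2018) later proposed relative position encoding, which is the RPR algorithm covered in this post. In 2019, Transformer-XL adapted the idea to its segment-level design by introducing global bias terms, improving on the relative position encoding algorithm; that variant is covered in the post Relative Position Encoding (2).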

References for this post:

1. Self-Attention with Relative Position Representations (Shaw et al. 2018): https://arxiv.org/pdf/1803.02155.pdf

2. Attention Is All You Need (Vaswani et al. 2017): https://arxiv.org/pdf/1706.03762.pdf

3. How Self-Attention with Relative Position Representations works: https://medium.com/@_init_/how-self-attention-with-relative-position-representations-works-28173b8c245a

4. [NLP] Relative Position Encoding (2): Relative Positional Encodings - Transformer-XL: http://www.javashuo.com/article/p-urbnffxy-dn.html

Motivation

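In an RNN, the output representations of the first "I" and the second "I" differ, because the hidden states used to generate them differ. For the first "I", the hidden state is the initial state; for the second "I", it is the hidden state that has already encoded "I think therefore". The RNN's hidden state therefore guarantees that identical words at different positions of the same input sequence get different output representations.

In self-attention with no positional information, the outputs for the first "I" and the second "I" are exactly the same, because the inputs used to produce them are exactly the same. In other words, identical words at different positions of the same input sequence get identical output representations, so the order of the words is not reflected. This is why the positions of the words have to be encoded.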

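Overview

The authors propose adding trainable embeddings to the Transformer so that the output representation can capture the order of the inputs. These embedding vectors are added when computing the attention weight and the value for any two words $i, j$ of the input sequence. Each embedding vector represents the distance between words $i$ and $j$ (how many positions apart they are, and in which direction), hence the name Relative Position Representation (RPR).

For example, a sequence of length 5 requires learning 9 embeddings: 1 for the current word, 4 for the words to its left, and 4 for the words to its right.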

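The following example shows how these embeddings are used:

(1) The figure shows how the output representation of the first "I" is computed. The number under each arrow indicates which RPR is used when computing attention. In this example we compute the output of the first "I", written 'I_1', which attends pairwise to every word in the sequence. 'I_1' with 'I_1' uses the RPR at index 4; 'I_1' with 'think' uses index 5, because it is the first word to the right; 'I_1' with 'therefore' uses index 6, because it is the second word to the right; and so on.

(2) The same reasoning as in (1) applies.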
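The index assignment described in (1) can be sketched in a few lines of Python. This is only an illustration: the helper name `rpr_indices` is hypothetical, and it assumes the unclipped case where a sequence of n words uses 2n - 1 embeddings and the offset j - i is shifted by n - 1 so that indices start at 0.

```python
def rpr_indices(words):
    """Return, for each query position i, the RPR index used for each key position j.

    Hypothetical helper. Assumes no clipping: n words use 2n - 1 embeddings,
    and the offset j - i is shifted by n - 1 so that indices start at 0.
    """
    n = len(words)
    return [[(j - i) + (n - 1) for j in range(n)] for i in range(n)]


words = ["I", "think", "therefore", "I", "am"]
table = rpr_indices(words)
num_embeddings = 2 * len(words) - 1  # one per possible offset in [-(n-1), n-1]

for word, row in zip(words, table):
    print(f"{word:>9}: {row}")
```

The first row reproduces the indices quoted above for 'I_1' (4 for itself, 5 for 'think', 6 for 'therefore'), while the row for the second "I" starts at a lower index because most of its neighbours are to its left.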

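Notation

Two points to note:

1. There are two RPR representations: a separate RPR embedding is introduced when computing $z_i$ and when computing $e_{ij}$. The RPR vector used for $z_i$ is $a_{ij}^V$, and the RPR vector used for $e_{ij}$ is $a_{ij}^K$. Unlike the linear projection matrices $W$ of multi-head attention, which differ for every head, the RPR embeddings are shared across the attention heads of a layer, although they may differ between layers.

2. The relative distance is clipped to a maximum absolute value of k. Offsets of up to k positions to the left and k positions to the right each get their own embedding; everything further left shares index 0 and everything further right shares index 2k, so the lookup index clip(j - i, k) + k covers 2k + 1 values:

$$a_{ij}^K = w^K_{\mathrm{clip}(j-i,\,k)}, \quad a_{ij}^V = w^V_{\mathrm{clip}(j-i,\,k)}, \quad \mathrm{clip}(x, k) = \max(-k, \min(k, x))$$

Example: the RPR embedding lookup table for 10 words with k = 3.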
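The lookup table for this example can be generated with a short sketch. The function names `clipped_rpr_index` and `rpr_lookup_table` are hypothetical; the only assumption is the clipping rule max(-k, min(k, j - i)), shifted by k so that indices run from 0 to 2k.

```python
def clipped_rpr_index(i, j, k):
    """Map the offset j - i to a lookup index in [0, 2k] by clipping it to [-k, k]."""
    return max(-k, min(k, j - i)) + k


def rpr_lookup_table(seq_length, k):
    """Hypothetical helper: entry [i][j] is the embedding index used for the pair (i, j)."""
    return [[clipped_rpr_index(i, j, k) for j in range(seq_length)]
            for i in range(seq_length)]


lookup = rpr_lookup_table(seq_length=10, k=3)
for row in lookup:
    print(" ".join(str(idx) for idx in row))
```

Only 2k + 1 = 7 distinct indices appear, however long the sequence is, which is why the embedding table does not have to grow with the sequence length.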

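Why clip at k:

1. The authors hypothesize that precise relative position information is not useful beyond a certain distance.

2. Clipping the maximum distance helps the model generalize to sequence lengths that were not seen during training.

The relative position representations for keys and values are then learned separately: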

$$w^{K} = (w_{-k}^K, ..., w_{k} ^K), w^{V} = (w_{-k}^V, ..., w_{k} ^V)$$

where $w_i^K, w_i^V \in \mathbb{R}^{d_a}$.

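Implementation

1. Without RPR, $z_i$ is computed as:

$$z_i = \sum_{j=1}^{n} \alpha_{ij} (x_j W^V) \tag{1}$$

$$\alpha_{ij} = \frac{\exp e_{ij}}{\sum_{m=1}^{n} \exp e_{im}}, \qquad e_{ij} = \frac{(x_i W^Q)(x_j W^K)^T}{\sqrt{d_z}} \tag{2}$$

2. With RPR, $z_i$ is computed as:

$$z_i = \sum_{j=1}^{n} \alpha_{ij} (x_j W^V + a_{ij}^V) \tag{3}$$

$$e_{ij} = \frac{x_i W^Q (x_j W^K + a_{ij}^K)^T}{\sqrt{d_z}} \tag{4}$$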

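Equation (3) says that when computing the output representation of word i, the value vector of word j is modified by adding the relative position representation between words i and j.

Equation (4) says that when computing the dot product of query i and key j, the key vector is modified by adding the relative position representation between words i and j.

Introducing the RPR information through simple addition is what allows an efficient implementation.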
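To make the two modifications concrete, here is a minimal single-head NumPy sketch of attention with RPR. It is an illustration rather than the authors' implementation: the function name `rpr_self_attention`, the random weights, and the choice d_a = d_z are all assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def rpr_self_attention(x, w_q, w_k, w_v, rpr_k, rpr_v, k):
    """Single-head self-attention with relative position representations (sketch).

    x:            (n, d_x) input sequence
    w_q/w_k/w_v:  (d_x, d_z) projection matrices
    rpr_k, rpr_v: (2k + 1, d_z) lookup tables for w^K and w^V (assumes d_a = d_z)
    """
    n, d_z = x.shape[0], w_q.shape[1]
    q, key, v = x @ w_q, x @ w_k, x @ w_v

    # idx[i, j] = clip(j - i, -k, k) + k selects the table row for the pair (i, j).
    offsets = np.arange(n)[None, :] - np.arange(n)[:, None]
    idx = np.clip(offsets, -k, k) + k
    a_k, a_v = rpr_k[idx], rpr_v[idx]  # both (n, n, d_z)

    # Key side: e_ij = q_i . (k_j + a_ij^K) / sqrt(d_z)
    e = np.einsum("id,ijd->ij", q, key[None, :, :] + a_k) / np.sqrt(d_z)
    alpha = softmax(e, axis=-1)

    # Value side: z_i = sum_j alpha_ij (v_j + a_ij^V)
    z = np.einsum("ij,ijd->id", alpha, v[None, :, :] + a_v)
    return z, alpha


rng = np.random.default_rng(0)
n, d_x, d_z, k = 5, 8, 4, 2
x = rng.normal(size=(n, d_x))
x[3] = x[0]  # the same word ("I") at positions 0 and 3
w_q, w_k, w_v = (rng.normal(size=(d_x, d_z)) for _ in range(3))

# All-zero tables reduce the computation to plain self-attention.
zero = np.zeros((2 * k + 1, d_z))
z_plain, _ = rpr_self_attention(x, w_q, w_k, w_v, zero, zero, k)

rpr_k, rpr_v = rng.normal(size=(2, 2 * k + 1, d_z))
z_rpr, alpha = rpr_self_attention(x, w_q, w_k, w_v, rpr_k, rpr_v, k)

print("plain outputs identical for repeated word:", np.allclose(z_plain[0], z_plain[3]))
print("RPR outputs identical for repeated word:  ", np.allclose(z_rpr[0], z_rpr[3]))
```

With all-zero tables the two occurrences of the repeated word get the same output, which is the problem described in the Motivation section; with non-zero tables their outputs differ because each position sees a different set of relative offsets.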

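Efficient implementation

Without RPR, the Transformer computes $e_{ij}$ using batch_size * h parallel matrix multiplications.

Each multiplication computes $e_{ij}$ for all pairs of elements in one sequence and one head, $$\frac{(XW^Q)(XW^K)^T}{\sqrt{d_z}}$$ where $X$ is the row-wise concatenation of the elements of the given input sequence.

Equation (4) can be rewritten as:

$$e_{ij} = \frac{x_i W^Q (x_j W^K)^T + x_i W^Q (a_{ij}^K)^T}{\sqrt{d_z}}$$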

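(1) First consider the first term, $$x_iW^Q(x_jW^K)^T$$

Take one head of one batch element, where the input has shape (seq_length, dx), and assume seq_length = 1 to simplify the derivation. Assume $W^Q, W^K$ both have shape (dx, dz). The first term then has shape [(1, dx) * (dx, dz)] * [(dz, dx) * (dx, 1)] = (1, 1).

That is the case of one batch element, one head, and seq_length = 1. Extended to the real case, the shape is (batch_size, h, seq_length, seq_length).

So the goal is to produce another tensor of the same shape whose entries are the dot products of the query of word i, $x_i W^Q$, with the RPR embedding for the pair of words i, j.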

(2) Let A be the tensor that stacks all $a_{ij}^K$. A.shape: (seq_length, seq_length, d_a)

transpose $\rightarrow$ $A^T$.shape: (seq_length, d_a, seq_length)

(3) In the second term, $x_i W^Q$ has shape (batch_size, h, seq_length, d_z)

transpose $\rightarrow$ (seq_length, batch_size, h, d_z)

reshape $\rightarrow$ (seq_length, batch_size * h, d_z)

This can now be multiplied by $A^T$, which can be viewed as seq_length parallel matmuls of (batch_size * h, d_z) by (d_a, seq_length). Since $d_z = d_a$, each parallel product has shape (batch_size * h, seq_length), and the full result has shape (seq_length, batch_size * h, seq_length).

reshape $\rightarrow$ (seq_length, batch_size, h, seq_length)

transpose $\rightarrow$ (batch_size, h, seq_length, seq_length)

This matches the shape of the first term, so the two can be added.

The derivation for equation (3) is analogous.
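The transpose / reshape / matmul sequence above can be checked numerically. The NumPy sketch below compares it against a direct einsum; the function names `relative_logits_naive` and `relative_logits_batched` and the random test tensors are assumptions made for illustration, and `q` stands in for $x_i W^Q$ with $d_a = d_z$.

```python
import numpy as np


def relative_logits_naive(q, a_k):
    """out[b, h, i, j] = q[b, h, i, :] . a_k[i, j, :], written directly with einsum."""
    return np.einsum("bhid,ijd->bhij", q, a_k)


def relative_logits_batched(q, a_k):
    """Same result via the transpose / reshape / batched-matmul route described above."""
    batch_size, h, seq_length, d_z = q.shape
    a_t = np.transpose(a_k, (0, 2, 1))                    # (seq_length, d_a, seq_length)
    q_t = np.transpose(q, (2, 0, 1, 3))                   # (seq_length, batch_size, h, d_z)
    q_t_r = q_t.reshape(seq_length, batch_size * h, d_z)  # (seq_length, batch_size * h, d_z)
    out = np.matmul(q_t_r, a_t)                           # (seq_length, batch_size * h, seq_length)
    out = out.reshape(seq_length, batch_size, h, seq_length)
    return np.transpose(out, (1, 2, 0, 3))                # (batch_size, h, seq_length, seq_length)


rng = np.random.default_rng(0)
batch_size, h, seq_length, d_z = 2, 3, 5, 4
q = rng.normal(size=(batch_size, h, seq_length, d_z))  # plays the role of x_i W^Q
a_k = rng.normal(size=(seq_length, seq_length, d_z))   # plays the role of A, with d_a = d_z

naive = relative_logits_naive(q, a_k)
batched = relative_logits_batched(q, a_k)
print(batched.shape, np.allclose(naive, batched))
```

The batched route avoids materialising a (batch_size, h, seq_length, seq_length, d_a) tensor, which is what adding $a_{ij}^K$ to every key for every batch element and head would require.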

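Below is the tensor2tensor code for relative position encoding: https://github.com/tensorflow/tensor2tensor/blob/9e0a894034d8090892c238df1bd9bd3180c2b9a3/tensor2tensor/layers/common_attention.py#L1556-L1587

When computing the logits (transpose is true, so the last dimension of x is depth), x corresponds to $x_i W^Q$ in the derivation above, y corresponds to $x_j W^K$, and z corresponds to a.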

import tensorflow as tf


def _relative_attention_inner(x, y, z, transpose):
  """Relative position-aware dot-product attention inner calculation.
  This batches matrix multiply calculations to avoid unnecessary broadcasting.
  Args:
    x: Tensor with shape [batch_size, heads, length or 1, length or depth].
    y: Tensor with shape [batch_size, heads, length or 1, depth].
    z: Tensor with shape [length or 1, length, depth].
    transpose: Whether to transpose inner matrices of y and z. Should be true if
        last dimension of x is depth, not length.
  Returns:
    A Tensor with shape [batch_size, heads, length, length or depth].
  """
  batch_size = tf.shape(x)[0]
  heads = x.get_shape().as_list()[1]
  length = tf.shape(x)[2]

  # xy_matmul is [batch_size, heads, length or 1, length or depth]
  xy_matmul = tf.matmul(x, y, transpose_b=transpose)
  # x_t is [length or 1, batch_size, heads, length or depth]
  x_t = tf.transpose(x, [2, 0, 1, 3])
  # x_t_r is [length or 1, batch_size * heads, length or depth]
  x_t_r = tf.reshape(x_t, [length, heads * batch_size, -1])
  # x_tz_matmul is [length or 1, batch_size * heads, length or depth]
  x_tz_matmul = tf.matmul(x_t_r, z, transpose_b=transpose)
  # x_tz_matmul_r is [length or 1, batch_size, heads, length or depth]
  x_tz_matmul_r = tf.reshape(x_tz_matmul, [length, batch_size, heads, -1])
  # x_tz_matmul_r_t is [batch_size, heads, length or 1, length or depth]
  x_tz_matmul_r_t = tf.transpose(x_tz_matmul_r, [1, 2, 0, 3])
  return xy_matmul + x_tz_matmul_r_t