TensorFlow Notes 13: Understanding Machine Translation, Google NMT, and Attention

Date: 2019-11-06

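1. About Attention and NMT

(To be continued...)

Let's use Google's NMT code as a starting point for discussing end-to-end models.

Project: https://github.com/tensorflow/nmt

Machine translation is one of the most successful applications of deep learning in a vertical domain. Deep learning does solve many previously tedious problems in such domains, but the resulting models still generalize poorly, and that is a problem the major companies are still working on.

Recently open-sourced models:

Lingvo: a new framework focused on sequence-to-sequence models.

BERT: a language-model pre-training strategy based on deep bidirectional Transformers.

End-to-end solutions remain a commonly used model framework for many NLP tasks today.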

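2. Attention in TensorFlow

The code lives mainly in: https://github.com/tensorflow/tensorflow/blob/r1.12/tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py

TensorFlow provides two main kinds of attention:

(1) Bahdanau attention

(2) Luong attention

Their computations are shown in the figure further down.

They come from two NMT papers, which are also the two most classic papers in the field (for a deep dive, read the papers themselves):

(1) Bahdanau attention

Neural Machine Translation by Jointly Learning to Align and Translate

https://arxiv.org/pdf/1409.0473.pdf

(2) Luong attention

Effective Approaches to Attention-based Neural Machine Translation

https://arxiv.org/pdf/1508.04025.pdf

The figure shows how the two papers use attention:

Figure: the two attention mechanisms

The difference between the two:

The main difference lies in how they score the similarity between the current decoder state and each encoder output. Bahdanau attention uses an additive score, while Luong attention uses a multiplicative one.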
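To make the difference concrete, here is a minimal NumPy sketch of the two score functions for a single decoder state. The weight matrices are random stand-ins for learned parameters, and the names `additive_score` and `multiplicative_score` are illustrative only; they are not part of TensorFlow.

```python
import numpy as np

rng = np.random.default_rng(0)
T, H = 5, 4                               # source length and hidden size (arbitrary)
enc_outputs = rng.normal(size=(T, H))     # encoder outputs h_s, one row per source token
dec_state = rng.normal(size=(H,))         # current decoder state h_t

# Random stand-ins for parameters that a real model would learn (assumption).
W1 = rng.normal(size=(H, H))
W2 = rng.normal(size=(H, H))
v = rng.normal(size=(H,))
W = rng.normal(size=(H, H))

def additive_score(query, keys):
    """Bahdanau style: v^T tanh(W1 h_s + W2 h_t), one score per source position."""
    return np.tanh(keys @ W1.T + query @ W2.T) @ v

def multiplicative_score(query, keys):
    """Luong "general" style: a bilinear form between h_s and h_t."""
    return keys @ (W @ query)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for score_fn in (additive_score, multiplicative_score):
    weights = softmax(score_fn(dec_state, enc_outputs))
    context = weights @ enc_outputs       # weighted sum of the encoder outputs
    print(score_fn.__name__, weights.round(3), context.shape)
```

After the score is computed, both variants proceed identically: a softmax over the source positions, then a weighted sum of the encoder outputs.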

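TensorFlow ships four ready-made attention classes:

(1) Bahdanau-style monotonic attention, which adds a learnable score bias

    class BahdanauMonotonicAttention

(2) Bahdanau attention without the score bias

    class BahdanauAttention

(3) Luong-style monotonic attention, which adds a learnable score bias

    class LuongMonotonicAttention

(4) Luong attention without the score bias

    class LuongAttention

Below is a decoder built directly on these wrappers; the full encoder and decoder code will follow later.

The main functions used (attention + beam search):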

tf.contrib.seq2seq.tile_batch

tf.contrib.seq2seq.LuongAttention

tf.contrib.seq2seq.BahdanauAttention

tf.contrib.seq2seq.AttentionWrapper

tf.contrib.seq2seq.TrainingHelper
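
Of these, `tile_batch` is the least obvious. Beam search keeps `beam_width` hypotheses per sentence, so every encoder-side tensor has to be repeated `beam_width` times along the batch axis, with the copies of each sentence kept next to each other. The NumPy sketch below imitates that layout; `tile_batch_np` is a hypothetical helper written for illustration, not the TensorFlow implementation.

```python
import numpy as np

def tile_batch_np(x, multiplier):
    """Repeat each batch entry `multiplier` times, keeping the copies adjacent."""
    return np.repeat(x, multiplier, axis=0)

# Two "sentences" with sequence lengths 3 and 5, and a beam width of 2.
lengths = np.array([3, 5])
memory = np.arange(2 * 4 * 1).reshape(2, 4, 1)   # (batch, time, hidden)

tiled_lengths = tile_batch_np(lengths, 2)
tiled_memory = tile_batch_np(memory, 2)

print(tiled_lengths)        # [3 3 5 5]
print(tiled_memory.shape)   # (4, 4, 1)
```

This is why the decoder tiles the memory, the source lengths, and the encoder state together in inference mode: all three must share the same `batch_size * beam_width` leading dimension.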

Code snippet:

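import tensorflow as tf  # TensorFlow 1.x; tf.contrib was removed in 2.x

# `hp` (hyperparameters) and `multi_cells` are project helpers that are not shown here.
def decoder(mode,encoder_outputs,encoder_state,X_len,word2id_tar,embeddings_Y,embedded_Y):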
    k_initializer = tf.contrib.layers.xavier_initializer()
    with tf.variable_scope('decoder'):
        net_mode  = hp.dec_mode
        beam_width = hp.beam_size
        batch_size = hp.batch_size
        memory = encoder_outputs
        num_layers = hp.dec_num_layers
            
        if mode == 'infer':
            memory = tf.contrib.seq2seq.tile_batch(memory, beam_width)
            X_len = tf.contrib.seq2seq.tile_batch(X_len, beam_width)
            encoder_state = tf.contrib.seq2seq.tile_batch(encoder_state, beam_width)
            bs = batch_size * beam_width
        else:
            bs = batch_size
            
        # Luong (multiplicative) attention over the encoder outputs; X_len masks the padded positions
        attention = tf.contrib.seq2seq.LuongAttention(hp.dec_hidden_size, memory, X_len, scale=True)
        # Bahdanau (additive) alternative:
        # attention = tf.contrib.seq2seq.BahdanauAttention(hp.dec_hidden_size, memory, X_len, normalize=True)
        cell = multi_cells(num_layers * 2,mode,net_mode)
        # the third argument is attention_layer_size
        cell = tf.contrib.seq2seq.AttentionWrapper(cell, attention, hp.dec_hidden_size, name='attention')
        decoder_initial_state = cell.zero_state(bs, tf.float32).clone(cell_state=encoder_state)
            
        with tf.variable_scope('projected'):
            output_layer = tf.layers.Dense(len(word2id_tar), use_bias=False, kernel_initializer=k_initializer)
            
        if mode == 'infer':
            start = tf.fill([batch_size], word2id_tar['<s>'])
            decoder = tf.contrib.seq2seq.BeamSearchDecoder(cell, embeddings_Y, start, word2id_tar['</s>'],
                                                           decoder_initial_state, beam_width, output_layer)
            outputs, final_context_state, _ = tf.contrib.seq2seq.dynamic_decode(decoder,
                                                                                output_time_major=True,
                                                                                maximum_iterations=1 * tf.reduce_max(X_len))
            sample_id = outputs.predicted_ids
            print ("sample_id shape")
            print (sample_id.get_shape())
            return "",sample_id
        else:
            helper = tf.contrib.seq2seq.TrainingHelper(embedded_Y, [hp.maxlen - 1 for b in range(batch_size)])
            decoder = tf.contrib.seq2seq.BasicDecoder(cell, helper, decoder_initial_state, output_layer)
                
            outputs, final_context_state, _ = tf.contrib.seq2seq.dynamic_decode(decoder, 
                                                                                output_time_major=True)
            logits = outputs.rnn_output
            logits = tf.transpose(logits, (1, 0, 2)) 
            print(logits)
            return logits,""
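
In inference mode the function above hands decoding over to `BeamSearchDecoder`. As a rough illustration of why a beam can beat greedy decoding, here is a toy beam search in plain Python. The conditional probability table is made up, and the sketch leaves out end-of-sentence handling and length normalization, so it is not a description of what the TensorFlow decoder does internally.

```python
import math

# Toy conditional model (made up): P(next token | previous token).
NEXT = {
    '<s>': {'A': 0.6, 'B': 0.4},
    'A': {'x': 0.5, 'y': 0.5},
    'B': {'x': 0.9, 'y': 0.1},
}

def beam_search(beam_width, steps=2):
    """Return the best (tokens, log-probability) found with the given beam width."""
    beams = [(['<s>'], 0.0)]   # (tokens so far, summed log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for tok, p in NEXT[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(p)))
        # keep only the `beam_width` highest-scoring hypotheses
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

greedy_tokens, greedy_score = beam_search(beam_width=1)
beam_tokens, beam_score = beam_search(beam_width=2)
print(greedy_tokens, round(math.exp(greedy_score), 2))  # ['<s>', 'A', 'x'] 0.3
print(beam_tokens, round(math.exp(beam_score), 2))      # ['<s>', 'B', 'x'] 0.36
```

With a beam width of 1 the search commits to `A` because it looks best at the first step, while a width of 2 keeps `B` alive long enough to find the more probable sequence.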

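Here is the code from Google's NMT tutorial, which explains everything in detail:

https://www.tensorflow.org/alpha/tutorials/sequences/nmt_with_attention#write_the_encoder_and_decoder_model

There are three main parts: attention, encoder, and decoder. The computation follows the figure above, and the flow is described by the code below.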

# Code for the two attention mechanisms, following the figure above

import tensorflow as tf  # TensorFlow 2.x (tf.keras, eager execution)

class LuongAttention(tf.keras.Model):
  def __init__(self, units):
    super(LuongAttention, self).__init__()
    # units must equal the hidden size of `values` for the product below to be defined
    self.W = tf.keras.layers.Dense(units)
  
  def call(self, query, values):
    # query shape == (batch_size, hidden size)
    # hidden_with_time_axis shape == (batch_size, 1, hidden size)
    # the time axis is added so the query can be multiplied against every encoder step
    hidden_with_time_axis = tf.expand_dims(query, 1)

    # multiplicative score: values x transpose(W(query))
    # (batch_size, max_length, hidden_size) x (batch_size, hidden_size, 1)
    # score shape == (batch_size, max_length, 1)
    score = tf.matmul(values, self.W(hidden_with_time_axis), transpose_b=True)

    # attention_weights shape == (batch_size, max_length, 1)
    # softmax over the time axis, so the weights of each sentence sum to 1
    attention_weights = tf.nn.softmax(score, axis=1)

    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * values
    context_vector = tf.reduce_sum(context_vector, axis=1)
    
    return context_vector, attention_weights

# BahdanauAttention: computes the attention

class BahdanauAttention(tf.keras.Model):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)
  
  def call(self, query, values):
    # hidden shape == (batch_size, hidden size)
    # hidden_with_time_axis shape == (batch_size, 1, hidden size)
    # we are doing this to perform addition to calculate the score
    hidden_with_time_axis = tf.expand_dims(query, 1)

    # score shape == (batch_size, max_length, 1)
    # (before self.V is applied, the tensor has shape (batch_size, max_length, units))
    score = self.V(tf.nn.tanh(
        self.W1(values) + self.W2(hidden_with_time_axis)))

    # attention_weights shape == (batch_size, max_length, 1)
    # we get 1 at the last axis because we are applying score to self.V
    attention_weights = tf.nn.softmax(score, axis=1)

    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * values
    context_vector = tf.reduce_sum(context_vector, axis=1)
    
    return context_vector, attention_weights
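
The shape comments in `call` are easy to get wrong, so here is a NumPy walk-through that mirrors it step by step. This is only a sketch: the three Dense layers are replaced by random matrices without biases, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, max_length, hidden_size, units = 2, 6, 8, 10

values = rng.normal(size=(batch_size, max_length, hidden_size))  # encoder outputs
query = rng.normal(size=(batch_size, hidden_size))               # decoder hidden state

# Random stand-ins for the kernels of self.W1, self.W2 and self.V (biases omitted).
W1 = rng.normal(size=(hidden_size, units))
W2 = rng.normal(size=(hidden_size, units))
V = rng.normal(size=(units, 1))

hidden_with_time_axis = query[:, None, :]                        # (batch, 1, hidden)
# The (batch, 1, units) term broadcasts over the max_length axis during the addition.
score = np.tanh(values @ W1 + hidden_with_time_axis @ W2) @ V    # (batch, max_length, 1)

# Softmax over axis=1 (time), so each sentence gets its own distribution.
exp = np.exp(score - score.max(axis=1, keepdims=True))
attention_weights = exp / exp.sum(axis=1, keepdims=True)         # (batch, max_length, 1)

context_vector = (attention_weights * values).sum(axis=1)        # (batch, hidden)

print(score.shape, attention_weights.shape, context_vector.shape)
print(attention_weights.sum(axis=1).ravel())   # each entry is 1 up to rounding
```

The key detail is `axis=1`: normalizing over any other axis would still run, but the weights would no longer form a distribution over source positions.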


Part of the decoder code:
class Decoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
    super(Decoder, self).__init__()
    self.batch_sz = batch_sz
    self.dec_units = dec_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    
    # GRU settings: return the output at every time step as well as the final state
    self.gru = tf.keras.layers.GRU(self.dec_units, 
                                   return_sequences=True, 
                                   return_state=True, 
                                   recurrent_initializer='glorot_uniform')
    self.fc = tf.keras.layers.Dense(vocab_size)

    # used for attention
    self.attention = BahdanauAttention(self.dec_units)

  def call(self, x, hidden, enc_output):
    # enc_output shape == (batch_size, max_length, hidden_size)
    # call the attention layer with the previous hidden state and the encoder outputs
    # context_vector is the weighted sum c_i from the paper; attention_weights are the weights
    context_vector, attention_weights = self.attention(hidden, enc_output)

    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
    
    # concatenate context_vector with the embedded x
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

    # passing the concatenated vector to the GRU
    # with a single time step, output (after dropping the time axis) should equal state
    output, state = self.gru(x)

    # output shape == (batch_size * 1, hidden_size)
    output = tf.reshape(output, (-1, output.shape[2]))

    # output shape == (batch_size, vocab)
    x = self.fc(output)
    
    # returns x (the output after the fully connected layer), the hidden state, and the attention weights; x goes straight into the loss during training

    return x, state, attention_weights


# Part of the training code:
def train_step(inp, targ, enc_hidden):
  loss = 0
        
  with tf.GradientTape() as tape:
    # run the encoder; take all of its outputs and its final hidden state
    enc_output, enc_hidden = encoder(inp, enc_hidden)

    dec_hidden = enc_hidden

    dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)       

    # Teacher forcing - feeding the target as the next input
    # feed the target sentence one token at a time
    for t in range(1, targ.shape[1]):
      # passing enc_output to the decoder
      # get the decoder's prediction and hidden state at each time step
      predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)

      loss += loss_function(targ[:, t], predictions)

      # using teacher forcing
      dec_input = tf.expand_dims(targ[:, t], 1)

  batch_loss = (loss / int(targ.shape[1]))

  variables = encoder.trainable_variables + decoder.trainable_variables

  gradients = tape.gradient(loss, variables)

  optimizer.apply_gradients(zip(gradients, variables))
  
  return batch_loss
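
The loop in `train_step` implements teacher forcing: at step t the decoder receives the ground-truth token from step t-1 rather than its own previous prediction. The toy sketch below only lists the (input, target) pairs that such a loop produces for one sentence; the tokens and the helper name `teacher_forcing_pairs` are made up for illustration.

```python
def teacher_forcing_pairs(target_tokens):
    """Return the (decoder input, expected output) pair for each time step."""
    pairs = []
    dec_input = target_tokens[0]              # always the start token
    for t in range(1, len(target_tokens)):
        pairs.append((dec_input, target_tokens[t]))
        dec_input = target_tokens[t]          # feed the ground truth, not the prediction
    return pairs

sentence = ['<start>', 'how', 'are', 'you', '<end>']
for dec_input, expected in teacher_forcing_pairs(sentence):
    print(f'{dec_input:>8} -> {expected}')
```

At inference time no ground truth is available, so the decoder's own prediction takes the place of `target_tokens[t]` as the next input.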
