1. About Attention and NMT
To be continued...
Using Google's NMT code as a starting point, let's look at end-to-end models:
Project: https://github.com/tensorflow/nmt
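Machine translation is one of the most successful applications of deep learning in a vertical domain. Deep learning in vertical domains does solve many problems that used to be tedious, but its generalization is still weak, and this is something the major companies keep working on.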
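Recently open-sourced models:
lingvo: a new framework focused on sequence-to-sequence models;
BERT: a language-model pre-training strategy based on deep bidirectional Transformers;
End-to-end solutions are still a commonly used model framework in many NLP tasks today.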
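2. Attention in TensorFlow
TensorFlow has two main kinds of attention:
(1) Bahdanau attention
(2) Luong attention
The two are computed as shown below.
They come from two NMT papers, which are also the two most classic papers on NMT (for a deep dive, read the papers themselves):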
(1) Bahdanau attention
Neural Machine Translation by Jointly Learning to Align and Translate
https://arxiv.org/pdf/1409.0473.pdf
(2) Luong attention
Effective Approaches to Attention-based Neural Machine Translation
https://arxiv.org/pdf/1508.04025.pdf
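This is how attention is used in the two papers:
Figure: the two attention mechanisms
Difference between the two:
The main difference is how each one scores the similarity between the current decoder input and the encoder outputs.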
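To make that difference concrete, here is a minimal NumPy sketch of the two scoring functions as the papers describe them: the additive score v^T tanh(W1 h_s + W2 h_t) from Bahdanau and the multiplicative ("general") score h_t^T W h_s from Luong. The weights are random, and the names `additive_score`, `multiplicative_score` and `attend` are made up for illustration; this is not TensorFlow's implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def additive_score(query, keys, W1, W2, v):
    # Bahdanau style: v^T tanh(W1 h_s + W2 h_t), one score per source position
    return np.tanh(keys @ W1 + query @ W2) @ v

def multiplicative_score(query, keys, W):
    # Luong "general" style: h_t^T W h_s, one score per source position
    return (keys @ W.T) @ query

def attend(scores, values):
    # Normalize the scores and take the weighted sum of the encoder outputs
    weights = softmax(scores)
    context = weights @ values
    return context, weights

rng = np.random.default_rng(0)
src_len, hidden, units = 5, 4, 3
keys = rng.normal(size=(src_len, hidden))   # encoder outputs h_s
query = rng.normal(size=hidden)             # current decoder state h_t

W1 = rng.normal(size=(hidden, units))
W2 = rng.normal(size=(hidden, units))
v = rng.normal(size=units)
W = rng.normal(size=(hidden, hidden))

add_ctx, add_w = attend(additive_score(query, keys, W1, W2, v), keys)
mul_ctx, mul_w = attend(multiplicative_score(query, keys, W), keys)
```

Everything after the score is identical in both cases: softmax over source positions, then a weighted sum of the encoder outputs. Only the scoring step changes.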
TensorFlow ships four ready-made attention classes:
(1) Bahdanau attention with a score bias (the monotonic variant)
class BahdanauMonotonicAttention()
(2) Bahdanau attention without a score bias
class BahdanauAttention()
(3) Luong attention with a score bias (the monotonic variant)
class LuongMonotonicAttention()
(4) Luong attention without a score bias
class LuongAttention()
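For orientation, my understanding is that the monotonic variants follow the monotonic alignment scheme of Raffel et al.: the score plus a scalar bias goes through a sigmoid to give a per-position "choose" probability, and the new alignment is derived from the previous one by a left-to-right recursion. The pure-Python sketch below only illustrates that recursion under this assumption. The function names and the bias value are hypothetical, and it is not the TensorFlow code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def monotonic_attention(p_choose, previous_attention):
    """Recursion: q[j] = (1 - p[j-1]) * q[j-1] + prev[j]; attention[j] = p[j] * q[j]."""
    attention = []
    q = 0.0
    for j, (p, prev) in enumerate(zip(p_choose, previous_attention)):
        if j == 0:
            q = prev
        else:
            q = (1.0 - p_choose[j - 1]) * q + prev
        attention.append(p * q)
    return attention

scores = [0.0, 0.0, 0.0]      # unnormalized attention scores (toy values)
score_bias = 0.0              # scalar bias added to every score (toy value)
p_choose = [sigmoid(s + score_bias) for s in scores]

# Previous step attended fully to source position 0
previous_attention = [1.0, 0.0, 0.0]
attention = monotonic_attention(p_choose, previous_attention)
```

A more negative bias lowers every choose probability, so the attention mass moves further to the right before it settles.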
Here is a ready-wrapped piece of encoder/decoder code (the full code will be added later).
It mainly uses the following functions (attention + beam search):
tf.contrib.seq2seq.tile_batch
tf.contrib.seq2seq.LuongAttention
tf.contrib.seq2seq.BahdanauAttention
tf.contrib.seq2seq.AttentionWrapper
tf.contrib.seq2seq.TrainingHelper
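A note on tile_batch, since it is the least obvious of these: for beam search every batch entry has to be repeated beam_width times, so that each hypothesis gets its own copy of the encoder memory. As far as I know the copies are placed consecutively (t[0], t[0], ..., t[1], t[1], ...), which matches `np.repeat` along axis 0. The helper `tile_batch_sketch` below is a hypothetical NumPy stand-in, not the TensorFlow function.

```python
import numpy as np

def tile_batch_sketch(t, multiplier):
    # [batch_size, ...] -> [batch_size * multiplier, ...], each entry repeated in place
    return np.repeat(t, multiplier, axis=0)

batch_size, max_len, hidden = 2, 3, 2
memory = np.arange(batch_size * max_len * hidden).reshape(batch_size, max_len, hidden)

beam_width = 3
tiled = tile_batch_sketch(memory, beam_width)
```

This is also why the batch size passed to zero_state in infer mode is batch_size * beam_width.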
Code snippet:
import tensorflow as tf  # TensorFlow 1.x (tf.contrib is required)

# `hp` (hyperparameters) and `multi_cells` (RNN cell builder) are defined elsewhere in the project.
def decoder(mode, encoder_outputs, encoder_state, X_len, word2id_tar, embeddings_Y, embedded_Y):
    k_initializer = tf.contrib.layers.xavier_initializer()
    with tf.variable_scope('decoder'):
        net_mode = hp.dec_mode
        beam_width = hp.beam_size
        batch_size = hp.batch_size
        memory = encoder_outputs
        num_layers = hp.dec_num_layers

        if mode == 'infer':
            # Beam search: repeat every batch entry beam_width times
            memory = tf.contrib.seq2seq.tile_batch(memory, beam_width)
            X_len = tf.contrib.seq2seq.tile_batch(X_len, beam_width)
            encoder_state = tf.contrib.seq2seq.tile_batch(encoder_state, beam_width)
            bs = batch_size * beam_width
        else:
            bs = batch_size

        attention = tf.contrib.seq2seq.LuongAttention(hp.dec_hidden_size, memory, X_len, scale=True)  # multiplicative
        # attention = tf.contrib.seq2seq.BahdanauAttention(hp.dec_hidden_size, memory, X_len, normalize=True)  # additive
        cell = multi_cells(num_layers * 2, mode, net_mode)
        cell = tf.contrib.seq2seq.AttentionWrapper(cell, attention, hp.dec_hidden_size, name='attention')
        # Start from the encoder's final state
        decoder_initial_state = cell.zero_state(bs, tf.float32).clone(cell_state=encoder_state)

        with tf.variable_scope('projected'):
            # Projection from the decoder output to the target vocabulary
            output_layer = tf.layers.Dense(len(word2id_tar), use_bias=False, kernel_initializer=k_initializer)

        if mode == 'infer':
            start = tf.fill([batch_size], word2id_tar['<s>'])
            decoder = tf.contrib.seq2seq.BeamSearchDecoder(cell, embeddings_Y, start, word2id_tar['</s>'],
                                                           decoder_initial_state, beam_width, output_layer)
            outputs, final_context_state, _ = tf.contrib.seq2seq.dynamic_decode(
                decoder, output_time_major=True, maximum_iterations=1 * tf.reduce_max(X_len))
            sample_id = outputs.predicted_ids
            print("sample_id shape")
            print(sample_id.get_shape())
            return "", sample_id
        else:
            # Teacher forcing: feed the embedded target sequence
            helper = tf.contrib.seq2seq.TrainingHelper(embedded_Y, [hp.maxlen - 1 for b in range(batch_size)])
            decoder = tf.contrib.seq2seq.BasicDecoder(cell, helper, decoder_initial_state, output_layer)
            outputs, final_context_state, _ = tf.contrib.seq2seq.dynamic_decode(decoder, output_time_major=True)
            logits = outputs.rnn_output
            # time-major -> batch-major: (batch_size, time, vocab)
            logits = tf.transpose(logits, (1, 0, 2))
            print(logits)
            return logits, ""
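What the beam search decoder does in infer mode can be sketched in a few lines of plain Python: at every step each live hypothesis is extended by every vocabulary item, and only the beam_width highest-scoring candidates are kept. This is a simplified illustration with a made-up toy model (`toy_step`) and no length normalization, not what BeamSearchDecoder does internally.

```python
import math

def beam_search(step_log_probs, start_id, end_id, beam_width, max_steps):
    """step_log_probs(prefix) returns a list of log-probabilities over the vocabulary."""
    beams = [([start_id], 0.0)]
    for _ in range(max_steps):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == end_id:
                # Finished hypotheses are carried over unchanged
                candidates.append((tokens, score))
                continue
            for token_id, log_p in enumerate(step_log_probs(tokens)):
                candidates.append((tokens + [token_id], score + log_p))
        # Keep only the beam_width best candidates
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
        if all(tokens[-1] == end_id for tokens, _ in beams):
            break
    return beams

# Toy model over the vocabulary {0: <s>, 1: </s>, 2: "a", 3: "b"}
def toy_step(prefix):
    if len(prefix) == 1:
        probs = [0.0, 0.1, 0.6, 0.3]
    else:
        probs = [0.0, 0.7, 0.2, 0.1]
    return [math.log(p) if p > 0 else float("-inf") for p in probs]

best = beam_search(toy_step, start_id=0, end_id=1, beam_width=2, max_steps=5)
```

With beam_width=1 this reduces to greedy decoding.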
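Here is the Google NMT code; Google's own version is also documented in detail.
It has three main parts: attention, encoder and decoder. The computation follows the figure above, and the flow is as described in the code below.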
import tensorflow as tf  # TensorFlow 2.x (tf.keras)

# The two attention classes, following the figure above (the two attention mechanisms)
class LuongAttention(tf.keras.Model):
    def __init__(self, units):
        super(LuongAttention, self).__init__()
        # units must equal the encoder hidden size so the product below is defined
        self.W = tf.keras.layers.Dense(units)

    def call(self, query, values):
        # query shape == (batch_size, hidden_size)
        # hidden_with_time_axis shape == (batch_size, 1, hidden_size)
        hidden_with_time_axis = tf.expand_dims(query, 1)
        # Multiplicative score: values times the transposed W(query)
        # score shape == (batch_size, max_length, 1)
        score = tf.matmul(values, self.W(hidden_with_time_axis), transpose_b=True)
        # attention_weights shape == (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)
        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, attention_weights

# BahdanauAttention: computes the attention
class BahdanauAttention(tf.keras.Model):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # query shape == (batch_size, hidden_size)
        # hidden_with_time_axis shape == (batch_size, 1, hidden_size)
        # the time axis is added so the query broadcasts in the addition below
        hidden_with_time_axis = tf.expand_dims(query, 1)
        # score shape == (batch_size, max_length, 1)
        # the last axis is 1 because the score goes through self.V
        score = self.V(tf.nn.tanh(self.W1(values) + self.W2(hidden_with_time_axis)))
        # attention_weights shape == (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)
        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, attention_weights

# Part of the decoder code
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.dec_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc = tf.keras.layers.Dense(vocab_size)
        # used for attention
        self.attention = BahdanauAttention(self.dec_units)

    def call(self, x, hidden, enc_output):
        # enc_output shape == (batch_size, max_length, hidden_size)
        # Call attention with the previous hidden state and the encoder outputs.
        # context_vector is the weighted average c_i from the paper; attention_weights are the weights.
        context_vector, attention_weights = self.attention(hidden, enc_output)
        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)
        # Combine context_vector with the embedded x
        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        # Pass the concatenated vector to the GRU.
        # With a single time step, output should equal state.
        output, state = self.gru(x)
        # output shape == (batch_size * 1, hidden_size)
        output = tf.reshape(output, (-1, output.shape[2]))
        # x shape == (batch_size, vocab)
        x = self.fc(output)
        # Return the projected output x, the hidden state and the attention weights;
        # during training x goes straight into the loss.
        return x, state, attention_weights

# Part of the training code.
# encoder, decoder, optimizer, loss_function, targ_lang and BATCH_SIZE are defined elsewhere.
def train_step(inp, targ, enc_hidden):
    loss = 0
    with tf.GradientTape() as tape:
        # Encoder: take all of its outputs and its final hidden state
        enc_output, enc_hidden = encoder(inp, enc_hidden)
        dec_hidden = enc_hidden
        dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)
        # Teacher forcing: feed the target as the next input, one time step at a time
        for t in range(1, targ.shape[1]):
            # Decoder output and hidden state at each time step
            predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
            loss += loss_function(targ[:, t], predictions)
            # using teacher forcing
            dec_input = tf.expand_dims(targ[:, t], 1)
    batch_loss = (loss / int(targ.shape[1]))
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return batch_loss
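The loss_function called in train_step is not shown above. A common choice for padded sequences is a masked cross-entropy, where padded target positions contribute zero loss. Below is a minimal NumPy sketch of that idea. It assumes padding uses token id 0 and that the function receives raw logits; both are my assumptions, not something the code above confirms.

```python
import numpy as np

def masked_cross_entropy(targets, logits, pad_id=0):
    # Numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of the target token for each example
    per_example = -log_probs[np.arange(len(targets)), targets]
    # Zero out the loss wherever the target is padding
    mask = (targets != pad_id).astype(logits.dtype)
    # Mean over all positions, padded ones included (they contribute 0)
    return (per_example * mask).mean()

targets = np.array([2, 0, 1])   # the second entry is padding
logits = np.zeros((3, 4))       # uniform prediction over 4 tokens
loss = masked_cross_entropy(targets, logits)
```

Without the mask, short sentences in a batch would be penalized for whatever the decoder predicts at padded positions.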