Selected from Medium. Author: Susan Li. Compiled by 機器之心 (Synced).
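Machine translation (MT) is a highly challenging task that studies how to use computers to translate text or speech from one language into another. Using Keras, this article starts from basic text loading and data preprocessing, then discusses how to build an acceptable neural translation system with recurrent neural networks and the encoder-decoder framework. All code for this tutorial is open source on GitHub.
Traditionally, machine translation relied on large statistical models built with highly sophisticated linguistic knowledge. More recently, many studies model the translation process directly with deep models, which learn the necessary linguistic knowledge automatically from nothing but source-language and target-language data. Translation models based on deep neural networks currently achieve the best results.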
Project repository: github.com/susanli2016…
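Next, we will use deep neural networks to tackle machine translation. We will show how to develop a neural machine translation model that translates English into French: the model takes English text as input and returns the French translation. More precisely, we will build 4 models: a simple RNN, an RNN with word embeddings, a bidirectional RNN, and an encoder-decoder model.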
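Training and evaluating deep neural networks is computationally intensive. The author ran all of the code on an AWS EC2 instance; if you plan to follow along, you will want access to a GPU instance.
Loading the libraries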
import collections
import helper
import numpy as np
import project_tests as tests
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.layers import GRU, Input, Dense, TimeDistributed, Activation, RepeatVector, Bidirectional, Embedding
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy
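The author uses helper.py to load the data and project_tests.py to test the functions.
Data
The dataset has a relatively small vocabulary. The file small_vocab_en contains English sentences, and small_vocab_fr contains the corresponding French translations.
Dataset download: github.com/susanli2016…
Loading the data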
english_sentences = helper.load_data('data/small_vocab_en')
french_sentences = helper.load_data('data/small_vocab_fr')
print('Dataset Loaded')
Each line of small_vocab_en contains one English sentence, and its French translation is on the corresponding line of small_vocab_fr.
for sample_i in range(2):
    print('small_vocab_en Line {}: {}'.format(sample_i + 1, english_sentences[sample_i]))
    print('small_vocab_fr Line {}: {}'.format(sample_i + 1, french_sentences[sample_i]))
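The complexity of the problem depends on the complexity of the vocabulary: a more complex vocabulary means a more complex problem. Let's look at the complexity of the dataset we will be working with.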
english_words_counter = collections.Counter([word for sentence in english_sentences for word in sentence.split()])
french_words_counter = collections.Counter([word for sentence in french_sentences for word in sentence.split()])
print('{} English words.'.format(len([word for sentence in english_sentences for word in sentence.split()])))
print('{} unique English words.'.format(len(english_words_counter)))
print('10 Most common words in the English dataset:')
print('"' + '" "'.join(list(zip(*english_words_counter.most_common(10)))[0]) + '"')
print()
print('{} French words.'.format(len([word for sentence in french_sentences for word in sentence.split()])))
print('{} unique French words.'.format(len(french_words_counter)))
print('10 Most common words in the French dataset:')
print('"' + '" "'.join(list(zip(*french_words_counter.most_common(10)))[0]) + '"')
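Preprocessing
We will convert the text into sequences of integers with the following preprocessing steps:
1. Convert each word into an id;
2. Add padding so that every sequence has the same length.
Tokenize (splitting strings into tokens)
Use Keras's Tokenizer class to turn each sentence into a sequence of word ids. The same function is used to tokenize both the English and the French sentences.
The tokenize function returns the tokenized input together with the fitted Tokenizer object.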
def tokenize(x):
    # Fit a word-level tokenizer on x and convert each sentence to a list of word ids
    x_tk = Tokenizer(char_level = False)
    x_tk.fit_on_texts(x)
    return x_tk.texts_to_sequences(x), x_tk

text_sentences = [
    'The quick brown fox jumps over the lazy dog .',
    'By Jove , my quick study of lexicography won a prize .',
    'This is a short sentence .']
text_tokenized, text_tokenizer = tokenize(text_sentences)
print(text_tokenizer.word_index)
print()
for sample_i, (sent, token_sent) in enumerate(zip(text_sentences, text_tokenized)):
    print('Sequence {} in x'.format(sample_i + 1))
    print('  Input:  {}'.format(sent))
    print('  Output: {}'.format(token_sent))
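To make the word-to-id step less opaque, here is a minimal standard-library sketch of roughly what a word-level tokenizer does: count words, then give the most frequent word id 1, the next id 2, and so on. It is an approximation for illustration, not Keras's implementation; the names split_words, build_word_index and to_ids are hypothetical, and the punctuation handling (deleting every ASCII punctuation character) is an assumption that differs slightly from Keras's default filter.

```python
import collections
import string

# Assumption: lowercase the text and delete ASCII punctuation before splitting on whitespace.
_STRIP_PUNCTUATION = str.maketrans('', '', string.punctuation)

def split_words(sentence):
    """Lowercase a sentence, drop punctuation, and split it into words."""
    return sentence.lower().translate(_STRIP_PUNCTUATION).split()

def build_word_index(sentences):
    """Map each word to an id starting at 1, most frequent word first."""
    counts = collections.Counter()
    for sentence in sentences:
        counts.update(split_words(sentence))
    # most_common() sorts by count; words with equal counts keep first-seen order.
    return {word: rank + 1 for rank, (word, _) in enumerate(counts.most_common())}

def to_ids(sentence, word_index):
    """Convert a sentence to a list of ids, skipping words that are not in the index."""
    return [word_index[w] for w in split_words(sentence) if w in word_index]

sentences = [
    'The quick brown fox jumps over the lazy dog .',
    'This is a short sentence .']
word_index = build_word_index(sentences)
ids = to_ids('The lazy dog .', word_index)
print(word_index)
print(ids)
```

Id 0 is deliberately left unused so that it can serve as the padding value in the next step.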
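Padding
Use Keras's pad_sequences function to append zeros to the end of each sequence so that all English sequences have the same length and all French sequences have the same length.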
def pad(x, length=None):
    # Pad every sequence with trailing zeros up to `length` (default: the longest sequence in x)
    if length is None:
        length = max([len(sentence) for sentence in x])
    return pad_sequences(x, maxlen = length, padding = 'post')
tests.test_pad(pad)
# Pad Tokenized output
test_pad = pad(text_tokenized)
for sample_i, (token_sent, pad_sent) in enumerate(zip(text_tokenized, test_pad)):
    print('Sequence {} in x'.format(sample_i + 1))
    print('  Input:  {}'.format(np.array(token_sent)))
    print('  Output: {}'.format(pad_sent))
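The effect of post-padding can be shown without Keras. The sketch below is a simplified stand-in written for this explanation (pad_post is a hypothetical name): it appends zeros to the end of each sequence. One simplification to be aware of: sequences longer than the target length are cut at the end here, whereas Keras's pad_sequences truncates from the front by default.

```python
def pad_post(sequences, length=None, pad_id=0):
    """Right-pad each sequence with pad_id so that all have the same length."""
    if length is None:
        length = max(len(seq) for seq in sequences)
    # Slicing keeps at most `length` ids; a negative repeat count yields an empty list.
    return [list(seq[:length]) + [pad_id] * (length - len(seq)) for seq in sequences]

padded = pad_post([[1, 2, 3], [4]])
padded_longer = pad_post([[1, 2, 3]], length=5)
print(padded)
print(padded_longer)
```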
Preprocessing pipeline
Implement the preprocessing function:
def preprocess(x, y):
    preprocess_x, x_tk = tokenize(x)
    preprocess_y, y_tk = tokenize(y)
    preprocess_x = pad(preprocess_x)
    preprocess_y = pad(preprocess_y)
    # Keras's sparse_categorical_crossentropy function requires the labels to be in 3 dimensions
    preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)
    return preprocess_x, preprocess_y, x_tk, y_tk

preproc_english_sentences, preproc_french_sentences, english_tokenizer, french_tokenizer =\
    preprocess(english_sentences, french_sentences)
max_english_sequence_length = preproc_english_sentences.shape[1]
max_french_sequence_length = preproc_french_sentences.shape[1]
english_vocab_size = len(english_tokenizer.word_index)
french_vocab_size = len(french_tokenizer.word_index)
print('Data Preprocessed')
print("Max English sentence length:", max_english_sequence_length)
print("Max French sentence length:", max_french_sequence_length)
print("English vocabulary size:", english_vocab_size)
print("French vocabulary size:", french_vocab_size)
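Two details of the preprocessing step are easy to miss: the labels gain a trailing axis of size 1, and the ids run from 1 to the vocabulary size with 0 reserved for padding. The sketch below illustrates both with plain nested lists; add_label_axis and num_classes are hypothetical helpers, and the tiny word index is made up for the example.

```python
def add_label_axis(padded):
    """Turn a (samples, steps) nested list into (samples, steps, 1)."""
    return [[[token_id] for token_id in sequence] for sequence in padded]

def num_classes(word_index):
    """Number of distinct ids a model can be asked to predict."""
    # Words use ids 1..len(word_index); id 0 is the padding value.
    return len(word_index) + 1

word_index = {'new': 1, 'jersey': 2, 'est': 3}  # made-up example index
labels = add_label_axis([[1, 2, 0], [3, 0, 0]])
classes = num_classes(word_index)
print(labels)
print(classes)
```

This is why the later models in this article pass len(tokenizer.word_index)+1 as the vocabulary size: an output layer sized to len(word_index) alone would have no slot for the highest word id.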
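Models
In this section we will try out several neural network architectures. We start by training 4 relatively simple ones: a simple RNN, an RNN with word embeddings, a bidirectional RNN, and an encoder-decoder model.
After trying these 4 simple architectures, we will build a deeper model that outperforms all 4.
Converting ids back to text
The neural network maps its input to word ids, which is not the final form we want; we want the French translation. The logits_to_text function bridges the gap between the logits output by the network and the French translation, and we will use it to better understand the network's output.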
def logits_to_text(logits, tokenizer):
    # Invert the word index so ids map back to words; id 0 is the padding token
    index_to_words = {id: word for word, id in tokenizer.word_index.items()}
    index_to_words[0] = '<PAD>'
    return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])
print('`logits_to_text` function loaded.')
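The decoding idea is simply "take the highest-scoring id at every time step and look up its word". A dependency-free sketch of that idea follows; the decode function, the two-word index, and the score values are all made up for illustration.

```python
def decode(step_scores, index_to_word):
    """Pick the highest-scoring id at each time step and join the words."""
    words = []
    for scores in step_scores:
        best_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        words.append(index_to_word[best_id])
    return ' '.join(words)

word_index = {'new': 1, 'jersey': 2}          # made-up word index
index_to_word = {i: w for w, i in word_index.items()}
index_to_word[0] = '<PAD>'                    # id 0 is padding

# One row of scores per time step, one column per id (0, 1, 2).
scores = [[0.1, 0.7, 0.2],
          [0.2, 0.1, 0.7],
          [0.9, 0.05, 0.05]]
decoded = decode(scores, index_to_word)
print(decoded)
```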
Model 1: RNN
We build a basic RNN model, which is a good baseline for translating English sequences into French.
def simple_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    learning_rate = 1e-3
    input_seq = Input(input_shape[1:])
    rnn = GRU(64, return_sequences = True)(input_seq)
    # One score per French id at every time step
    logits = TimeDistributed(Dense(french_vocab_size))(rnn)
    model = Model(input_seq, Activation('softmax')(logits))
    model.compile(loss = sparse_categorical_crossentropy,
                  optimizer = Adam(learning_rate),
                  metrics = ['accuracy'])
    return model
tests.test_simple_model(simple_model)
# Pad the English input to the French length so input and output have the same number of time steps
tmp_x = pad(preproc_english_sentences, max_french_sequence_length)
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))
# Train the neural network
# +1 on the vocabulary sizes because id 0 is reserved for padding
simple_rnn_model = simple_model(
    tmp_x.shape,
    max_french_sequence_length,
    english_vocab_size + 1,
    french_vocab_size + 1)
simple_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=10, validation_split=0.2)
# Print prediction(s)
print(logits_to_text(simple_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))
The basic RNN model reaches a validation accuracy of 0.6039.
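When reading these accuracy figures, it helps to remember that accuracy is counted per time step. None of the models in this article uses a masking layer, so padded positions appear to count as well, and predicting padding correctly is easy. The sketch below is an illustration with made-up ids (token_accuracy is a hypothetical helper, not the Keras metric itself) showing how much the figure can change when padding is excluded.

```python
def token_accuracy(y_true, y_pred, ignore_id=None):
    """Fraction of time steps predicted correctly, optionally skipping ignore_id labels."""
    correct = 0
    total = 0
    for true_seq, pred_seq in zip(y_true, y_pred):
        for true_id, pred_id in zip(true_seq, pred_seq):
            if ignore_id is not None and true_id == ignore_id:
                continue
            total += 1
            correct += int(true_id == pred_id)
    return correct / total if total else 0.0

# Made-up example: three real words followed by three padding positions.
y_true = [[4, 7, 9, 0, 0, 0]]
y_pred = [[4, 2, 3, 0, 0, 0]]
acc_all = token_accuracy(y_true, y_pred)                # padding included
acc_words = token_accuracy(y_true, y_pred, ignore_id=0)  # padding excluded
print(acc_all, acc_words)
```

In this toy case only one of three words is right, yet the all-positions accuracy is two thirds, so the validation accuracies in this article are best compared with each other rather than read as word-level translation quality.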
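Model 2: Word embeddings
A word embedding is a vector representation of a word in n-dimensional space in which words with similar meanings lie close together, where n is the size of the embedding vector. We will build an RNN model that uses word embeddings.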
from keras.models import Sequential
def embed_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    learning_rate = 1e-3
    rnn = GRU(64, return_sequences=True, activation="tanh")
    # The embedding looks up English input ids, so its input_dim is the English vocabulary size
    embedding = Embedding(english_vocab_size, 64, input_length=input_shape[1])
    logits = TimeDistributed(Dense(french_vocab_size, activation="softmax"))
    model = Sequential()
    # Embedding can only be used as the first layer of a model (per the Keras documentation)
    model.add(embedding)
    model.add(rnn)
    model.add(logits)
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    return model
tests.test_embed_model(embed_model)
tmp_x = pad(preproc_english_sentences, max_french_sequence_length)
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2]))
# +1 on the vocabulary sizes because id 0 is reserved for padding
embeded_model = embed_model(
    tmp_x.shape,
    max_french_sequence_length,
    english_vocab_size + 1,
    french_vocab_size + 1)
embeded_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=10, validation_split=0.2)
print(logits_to_text(embeded_model.predict(tmp_x[:1])[0], french_tokenizer))
The embedding model reaches a validation accuracy of 0.8401.
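Conceptually, an embedding layer is a lookup table with one row per id. The following sketch uses only the standard library to show the lookup; the table values are random placeholders (in a real model they are trained), and make_embedding_table and embed are hypothetical names.

```python
import random

def make_embedding_table(vocab_size, dim, seed=0):
    """Create a vocab_size x dim table of small random values (placeholders for trained weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]

def embed(sequence, table):
    """Replace each id in the sequence by its row of the table."""
    return [table[token_id] for token_id in sequence]

table = make_embedding_table(vocab_size=5, dim=3)
vectors = embed([1, 4, 0], table)  # a sequence of 3 ids becomes 3 vectors of size 3
print(vectors)
```

The table needs a row for every id that can occur in the input, which is why the input dimension of the embedding should cover the source (English) vocabulary plus the padding id.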
Model 3: Bidirectional RNN
def bd_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    learning_rate = 1e-3
    model = Sequential()
    # Wrap the GRU so the sequence is read both forwards and backwards
    model.add(Bidirectional(GRU(128, return_sequences = True, dropout = 0.1),
                            input_shape = input_shape[1:]))
    model.add(TimeDistributed(Dense(french_vocab_size, activation = 'softmax')))
    model.compile(loss = sparse_categorical_crossentropy,
                  optimizer = Adam(learning_rate),
                  metrics = ['accuracy'])
    return model
tests.test_bd_model(bd_model)
tmp_x = pad(preproc_english_sentences, preproc_french_sentences.shape[1])
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))
bidi_model = bd_model(
    tmp_x.shape,
    preproc_french_sentences.shape[1],
    len(english_tokenizer.word_index)+1,
    len(french_tokenizer.word_index)+1)
bidi_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=20, validation_split=0.2)
# Print prediction(s)
print(logits_to_text(bidi_model.predict(tmp_x[:1])[0], french_tokenizer))
The bidirectional RNN model reaches a validation accuracy of 0.5992.
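The idea behind a bidirectional RNN is to read the sequence once left to right and once right to left, then pair up the two states at each position. The toy sketch below uses a scalar tanh recurrence rather than a GRU, and for brevity both directions share the same made-up weights (w_in, w_rec); in a real bidirectional layer each direction has its own trainable weights.

```python
import math

def run_rnn(inputs, w_in=0.5, w_rec=0.8):
    """Toy scalar recurrence: state = tanh(w_in * x + w_rec * previous_state)."""
    state = 0.0
    states = []
    for x in inputs:
        state = math.tanh(w_in * x + w_rec * state)
        states.append(state)
    return states

def run_bidirectional(inputs):
    """Pair the forward state with the backward state at every position."""
    forward = run_rnn(inputs)
    # Run over the reversed input, then flip the states back into input order.
    backward = run_rnn(inputs[::-1])[::-1]
    return list(zip(forward, backward))

outputs = run_bidirectional([1.0, 2.0, 3.0])
for position, (fwd, bwd) in enumerate(outputs):
    print(position, fwd, bwd)
```

At the first position the forward state has seen only the first input while the backward state has seen the whole sequence, so every position gets context from both sides.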
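Model 4: Encoder-decoder framework
The encoder builds a matrix representation of the sentences, and the decoder takes that matrix as input and outputs the predicted translation.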
def encdec_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    learning_rate = 1e-3
    model = Sequential()
    # Encoder: return_sequences=False keeps only the final state as the sentence summary
    model.add(GRU(128, input_shape = input_shape[1:], return_sequences = False))
    # Feed the same summary vector to the decoder at every output time step
    model.add(RepeatVector(output_sequence_length))
    # Decoder
    model.add(GRU(128, return_sequences = True))
    model.add(TimeDistributed(Dense(french_vocab_size, activation = 'softmax')))
    model.compile(loss = sparse_categorical_crossentropy,
                  optimizer = Adam(learning_rate),
                  metrics = ['accuracy'])
    return model
tests.test_encdec_model(encdec_model)
tmp_x = pad(preproc_english_sentences)
tmp_x = tmp_x.reshape((-1, preproc_english_sentences.shape[1], 1))
encodeco_model = encdec_model(
    tmp_x.shape,
    preproc_french_sentences.shape[1],
    len(english_tokenizer.word_index)+1,
    len(french_tokenizer.word_index)+1)
encodeco_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=20, validation_split=0.2)
print(logits_to_text(encodeco_model.predict(tmp_x[:1])[0], french_tokenizer))
The encoder-decoder model reaches a validation accuracy of 0.6406.
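The link between encoder and decoder in this model is the RepeatVector layer: the encoder emits a single summary vector per sentence, and that vector is copied once for every output time step so the decoder has an input at each step. A minimal sketch of the copying, with a made-up two-number summary vector and a hypothetical repeat_vector helper:

```python
def repeat_vector(vector, n):
    """Copy one summary vector n times: shape (features,) becomes (n, features)."""
    return [list(vector) for _ in range(n)]

encoder_summary = [0.2, -0.1]   # made-up final encoder state
decoder_inputs = repeat_vector(encoder_summary, 4)  # 4 output time steps
print(decoder_inputs)
```

Because every decoder step receives the identical vector, the whole sentence has to be squeezed into that one fixed-size summary, which is one plausible reason this simple variant does not score much higher than the baseline here.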
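Model 5: Custom deep model
Build model_final, which combines word embeddings and a bidirectional RNN in a single model.
At this point we need to run some experiments, for example changing the number of GRU units to 256, changing the learning rate to 0.005, or training for more (or fewer) than 20 epochs.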
def model_final(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    model = Sequential()
    model.add(Embedding(input_dim=english_vocab_size,output_dim=128,input_length=input_shape[1]))
    # Bidirectional encoder that returns only its final state
    model.add(Bidirectional(GRU(256,return_sequences=False)))
    model.add(RepeatVector(output_sequence_length))
    # Bidirectional decoder that returns one output per time step
    model.add(Bidirectional(GRU(256,return_sequences=True)))
    model.add(TimeDistributed(Dense(french_vocab_size,activation='softmax')))
    learning_rate = 0.005
    model.compile(loss = sparse_categorical_crossentropy,
                  optimizer = Adam(learning_rate),
                  metrics = ['accuracy'])
    return model
tests.test_model_final(model_final)
print('Final Model Loaded')
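One simple way to organise the experiments mentioned above is to enumerate the combinations of settings up front. The sketch below only builds the list of configurations; it does not train anything. The values 256, 0.005 and 20 come from the article, while the other values and the experiment_grid helper are assumptions added for illustration.

```python
import itertools

def experiment_grid(gru_units, learning_rates, epochs):
    """Return one settings dictionary per combination of the given options."""
    return [
        {'gru_units': units, 'learning_rate': rate, 'epochs': n_epochs}
        for units, rate, n_epochs in itertools.product(gru_units, learning_rates, epochs)
    ]

grid = experiment_grid(gru_units=[128, 256],
                       learning_rates=[0.001, 0.005],
                       epochs=[10, 20])
for config in grid:
    print(config)
```

Each dictionary could then be used to parameterise a variant of model_final, keeping the validation accuracy of every run for comparison.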
Prediction
def final_predictions(x, y, x_tk, y_tk):
    # Train the final model on the preprocessed data, then translate two sample sentences
    tmp_X = pad(x)
    model = model_final(tmp_X.shape,
                        y.shape[1],
                        len(x_tk.word_index)+1,
                        len(y_tk.word_index)+1)
    model.fit(tmp_X, y, batch_size = 1024, epochs = 17, validation_split = 0.2)
    y_id_to_word = {value: key for key, value in y_tk.word_index.items()}
    y_id_to_word[0] = '<PAD>'
    sentence = 'he saw a old yellow truck'
    sentence = [x_tk.word_index[word] for word in sentence.split()]
    sentence = pad_sequences([sentence], maxlen=x.shape[-1], padding='post')
    sentences = np.array([sentence[0], x[0]])
    predictions = model.predict(sentences, batch_size=len(sentences))
    print('Sample 1:')
    print(' '.join([y_id_to_word[np.argmax(step)] for step in predictions[0]]))
    print('Il a vu un vieux camion jaune')
    print('Sample 2:')
    print(' '.join([y_id_to_word[np.argmax(step)] for step in predictions[1]]))
    # Reference translation: each label is a length-1 array holding the word id
    print(' '.join([y_id_to_word[np.max(label)] for label in y[0]]))
final_predictions(preproc_english_sentences, preproc_french_sentences, english_tokenizer, french_tokenizer)
We obtain a perfect translation of the sentence, with a validation accuracy of 0.9776!
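One practical caveat when translating a new sentence: the lookup x_tk.word_index[word] in final_predictions raises a KeyError for any word that was not in the training vocabulary. A more defensive encoding step might look like the sketch below. The encode_sentence helper and the tiny word index are hypothetical, and rejecting unknown words with an explicit error is just one possible policy.

```python
def encode_sentence(sentence, word_index, max_length, pad_id=0):
    """Convert a sentence to a fixed-length list of ids, rejecting unknown words."""
    words = sentence.lower().split()
    unknown = [word for word in words if word not in word_index]
    if unknown:
        raise ValueError('Words not in vocabulary: {}'.format(', '.join(unknown)))
    ids = [word_index[word] for word in words][:max_length]
    # Post-pad with pad_id, matching the padding used during training.
    return ids + [pad_id] * (max_length - len(ids))

word_index = {'he': 1, 'saw': 2, 'a': 3, 'truck': 4}  # made-up vocabulary
encoded = encode_sentence('He saw a truck', word_index, max_length=6)
print(encoded)
```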