Text classification with an RNN: classifying movie reviews (a translation of the official TensorFlow 2 tutorial)

This tutorial trains a recurrent neural network on the IMDB large movie review dataset for sentiment classification.

from __future__ import absolute_import, division, print_function, unicode_literals
# !pip install tensorflow-gpu==2.0.0-alpha0
import tensorflow_datasets as tfds
import tensorflow as tf

Import matplotlib and create a helper function for plotting training curves:

import matplotlib.pyplot as plt

def plot_graphs(history, string):
    plt.plot(history.history[string])
    plt.plot(history.history['val_' + string])
    plt.xlabel("Epochs")
    plt.ylabel(string)
    plt.legend([string, 'val_' + string])
    plt.show()

1. Set up the input pipeline

The IMDB large movie review dataset is a binary classification dataset: every review carries either a positive or a negative sentiment label.

Download the dataset with TFDS; it comes with a built-in subword tokenizer.

dataset, info = tfds.load('imdb_reviews/subwords8k', with_info=True,
                          as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']
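
Before batching, it can help to peek at one raw example; a minimal sketch (the exact sequence length will vary per review):

# Each element is a (text, label) pair; the text is already a sequence of
# subword ids, and the label is 0 (negative) or 1 (positive).
for encoded_text, label in train_dataset.take(1):
    print('Encoded text shape:', encoded_text.shape)  # e.g. (163,)
    print('Label:', label.numpy())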

Because this is a subword tokenizer, you can pass it any string and it will tokenize it.

tokenizer = info.features['text'].encoder
print('Vocabulary size: {}'.format(tokenizer.vocab_size))
Vocabulary size: 8185

sample_string = 'TensorFlow is cool.'
tokenized_string = tokenizer.encode(sample_string)
print('Tokenized string is {}'.format(tokenized_string))
original_string = tokenizer.decode(tokenized_string)
print('The original string: {}'.format(original_string))
assert original_string == sample_string
Tokenized string is [6307, 2327, 4043, 4265, 9, 2724, 7975]
The original string: TensorFlow is cool.

If a word is not in its vocabulary, the tokenizer encodes it by breaking it into subwords.

for ts in tokenized_string:
    print('{} ----> {}'.format(ts, tokenizer.decode([ts])))
6307 ----> Ten
2327 ----> sor
4043 ----> Fl
4265 ----> ow
9 ----> is
2724 ----> cool
7975 ----> .

BUFFER_SIZE = 10000
BATCH_SIZE = 64

train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE, train_dataset.output_shapes)
test_dataset = test_dataset.padded_batch(BATCH_SIZE, test_dataset.output_shapes)
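
Each batch is padded to the length of the longest review it contains, so the padded length differs from batch to batch. A quick check (a sketch; the second dimension will vary):

for batch_text, batch_label in train_dataset.take(1):
    print(batch_text.shape)   # (64, longest_sequence_in_this_batch)
    print(batch_label.shape)  # (64,)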

2. Build the model

Build a tf.keras.Sequential model, starting with an embedding layer. An embedding layer stores one vector per word; when called, it converts sequences of word indices into sequences of vectors. These vectors are trainable, and after training on enough data, words with similar meanings often end up with similar vectors.

This index lookup is much more efficient than the equivalent operation of passing a one-hot encoded vector through a tf.keras.layers.Dense layer.
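
To make the equivalence concrete, here is a minimal standalone sketch (toy vocabulary size and dimensions, not part of the tutorial model, using the tf imported above): looking up rows directly gives the same values as multiplying a one-hot matrix by the embedding table, with far less work.

vocab_size, embed_dim = 8, 4
emb = tf.keras.layers.Embedding(vocab_size, embed_dim)
ids = tf.constant([1, 3, 5])                      # three token ids
looked_up = emb(ids)                              # (3, 4): direct row lookup
one_hot = tf.one_hot(ids, vocab_size)             # (3, 8)
via_matmul = tf.matmul(one_hot, emb.embeddings)   # (3, 4): same values
print(tf.reduce_max(tf.abs(looked_up - via_matmul)).numpy())  # ~0.0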

A recurrent neural network (RNN) processes sequence input by iterating over its elements, passing the output of one time step along as input to the next time step.
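
Conceptually, each time step combines the current input with the state carried over from the previous step. A minimal NumPy sketch of the recurrence (illustrative only; an LSTM adds gating on top of this):

import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # The new state depends on the current input and the previous state.
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

rng = np.random.default_rng(0)   # toy weights: input size 3, hidden size 2
W_x, W_h, b = rng.normal(size=(3, 2)), rng.normal(size=(2, 2)), np.zeros(2)
h = np.zeros(2)
for x_t in rng.normal(size=(5, 3)):   # a sequence of 5 time steps
    h = rnn_step(x_t, h, W_x, W_h, b)
print(h)  # the final state summarizes the whole sequence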

The tf.keras.layers.Bidirectional wrapper can also be used with an RNN layer. It propagates the input forward and backward through the RNN layer and then concatenates the two outputs, which helps the RNN learn long-range dependencies.
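
Because the forward and backward outputs are concatenated, a Bidirectional wrapper doubles the layer's output width. A quick shape check (toy input, separate from the tutorial model):

layer = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))
print(layer(tf.zeros([1, 10, 8])).shape)  # (1, 128): 64 forward + 64 backward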

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile the Keras model to configure the training process:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

3. Train the model

history = model.fit(train_dataset, epochs=10,
                    validation_data=test_dataset)
...
Epoch 10/10
391/391 [==============================] - 70s 180ms/step - loss: 0.3074 - accuracy: 0.8692 - val_loss: 0.5533 - val_accuracy: 0.7873
test_loss, test_acc = model.evaluate(test_dataset)
print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))
391/Unknown - 19s 47ms/step - loss: 0.5533 - accuracy: 0.7873
Test Loss: 0.553319326714
Test Accuracy: 0.787320017815

The model above does not mask the padding applied to the sequences, so it is trained on padded sequences and tested on unpadded ones, which can skew the results. Ideally the model would learn to ignore the padding, but as you can see below, the effect on the output is indeed small.
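
One way to make the model ignore padding (not done in this tutorial, but supported by Keras) is to enable masking in the embedding layer; the mask is then propagated automatically through downstream layers such as the LSTM. A sketch:

# mask_zero=True treats id 0 (the padding value) as "missing".
masked_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, 64, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])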

If the prediction is >= 0.5 the review is classified as positive, otherwise as negative (see the small helper after sample_predict below).

def pad_to_size(vec, size):
    zeros = [0] * (size - len(vec))
    vec.extend(zeros)
    return vec

def sample_predict(sentence, pad):
    # Encode the given sentence into subword ids.
    tokenized_sample_pred_text = tokenizer.encode(sentence)
    if pad:
        tokenized_sample_pred_text = pad_to_size(tokenized_sample_pred_text, 64)
    predictions = model.predict(tf.expand_dims(tokenized_sample_pred_text, 0))
    return predictions
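
Since the last layer is a sigmoid, the model outputs a probability of the positive class; turning it into a label is a simple threshold. A small hypothetical helper (not in the original tutorial):

def to_label(prediction, threshold=0.5):
    # prediction is the raw model output, e.g. [[0.689]]
    return 'positive' if float(prediction[0][0]) >= threshold else 'negative'
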
# Predict on a sample text without padding.
sample_pred_text = ('The movie was cool. The animation and the graphics '
                    'were out of this world. I would recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=False)
print(predictions)
[[ 0.68914342]]
# Predict on a sample text with padding.
sample_pred_text = ('The movie was cool. The animation and the graphics '
                    'were out of this world. I would recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=True)
print(predictions)
[[ 0.68634349]]
plot_graphs(history, 'accuracy')
[Plot: training and validation accuracy over epochs]

plot_graphs(history, 'loss')
[Plot: training and validation loss over epochs]

4. Stack two or more LSTM layers

Keras recurrent layers have two available modes, controlled by the return_sequences constructor argument (a shape check follows the list):

  • Return the full sequence of successive outputs for every time step (a 3D tensor of shape (batch_size, timesteps, output_features)).
  • Return only the last output of each input sequence (a 2D tensor of shape (batch_size, output_features)).
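
A quick shape check makes the difference concrete (toy dimensions, separate from the tutorial model):

x = tf.zeros([2, 10, 8])  # (batch_size, timesteps, features)
print(tf.keras.layers.LSTM(4, return_sequences=True)(x).shape)  # (2, 10, 4)
print(tf.keras.layers.LSTM(4)(x).shape)                         # (2, 4)
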
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(
        64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

history = model.fit(train_dataset, epochs=10,
                    validation_data=test_dataset)
...
Epoch 10/10
391/391 [==============================] - 154s 394ms/step - loss: 0.1120 - accuracy: 0.9643 - val_loss: 0.5646 - val_accuracy: 0.8070
test_loss, test_acc = model.evaluate(test_dataset)
print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))
391/Unknown - 45s 115ms/step - loss: 0.5646 - accuracy: 0.8070
Test Loss: 0.564571284348
Test Accuracy: 0.80703997612
# Predict on a sample text without padding.
sample_pred_text = ('The movie was not good. The animation and the graphics '
                    'were terrible. I would not recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=False)
print(predictions)
[[ 0.00393916]]
# Predict on a sample text with padding.
sample_pred_text = ('The movie was not good. The animation and the graphics '
                    'were terrible. I would not recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=True)
print(predictions)
[[ 0.01098633]]
plot_graphs(history, 'accuracy')
[Plot: training and validation accuracy over epochs]

plot_graphs(history, 'loss')
[Plot: training and validation loss over epochs]

You can check out the other recurrent layers Keras offers, such as the GRU layer.
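
For example, swapping the LSTM for a GRU changes only one line; a sketch (the training setup would stay the same):

gru_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])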
