Multi-Class Text Classification with an LSTM Network Using TensorFlow 2.0

Date: 2021-04-06


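In natural language processing (NLP), many innovations concern how to add context to word vectors. One common approach is to use recurrent neural networks (RNNs). The key ideas behind recurrent neural networks are:

They make use of sequential information.
They have a memory that captures what has been computed so far; in other words, what I said last affects what I say next.
RNNs are well suited to text and speech analysis.
The most commonly used RNNs are long short-term memory networks (LSTMs).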

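The figure above shows the architecture of a recurrent neural network.

"A" is one layer of a feedforward neural network.
If we look only at the right side, it passes recurrently through the elements of each sequence.
If we unroll the left side, it looks exactly like the right.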

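Translator's note: The feedforward neural network is the earliest and simplest type of artificial neural network. Inside it, information propagates in one direction only, from the input layer to the output layer. Unlike a recurrent neural network, its connections do not form a directed cycle.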

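Suppose we are solving a document classification problem on a data set of news articles.

We feed in each word, and the words relate to one another in some way.
We make the prediction at the end of the article, once we have seen all of its words.
By passing the output of the previous step in as input to the next one, a recurrent neural network retains information and can use all of it to make the prediction at the end.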
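
The recurrence described above can be sketched in a few lines. This is a minimal illustration, not the Keras implementation: it assumes a scalar hidden state, made-up weights (w_x, w_h, b), and a hypothetical helper named run_rnn. The point it shows is that every step mixes the current input with the state carried over from the previous step, so the final state depends on the whole sequence and on its order.

```python
import math

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a toy scalar RNN and return the hidden state after each step."""
    h = 0.0  # initial state: nothing has been seen yet
    states = []
    for x in inputs:
        # New state = squashed mix of the current input and the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

forward = run_rnn([1.0, 0.0, -1.0])
backward = run_rnn([-1.0, 0.0, 1.0])
# Same values in a different order lead to a different final state.
print(forward[-1], backward[-1])
```

A real layer uses vectors and weight matrices instead of scalars, but the loop has the same shape.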

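This works well for short sentences, but when we deal with a long article, we run into a long-term dependency problem.

Therefore, we generally do not use vanilla RNNs; we use LSTMs instead. An LSTM is a type of RNN that can solve this long-term dependency problem.

Translator's note: Long short-term memory (LSTM) is a type of recurrent neural network suited to processing and predicting important events that are separated by relatively long intervals and delays in a time series. Systems based on LSTMs can perform tasks such as machine translation, video analysis, document summarization, speech recognition, image recognition, handwriting recognition, controlling chatbots, and composing music.

Our news article classification example has this many-to-one relationship: the input is a sequence of words, and the output is a single class or label.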

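Now we will use TensorFlow 2.0 and Keras to solve a BBC news document classification problem with an LSTM. The data set is available at this link (https://raw.githubusercontent.com/susanli2016/PyCon-Canada-2019-NLP-Tutorial/master/bbc-text.csv).

First, we import the libraries and make sure that TensorFlow is the right version.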

import csv
import tensorflow as tf
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from nltk.corpus import stopwords
STOPWORDS = set(stopwords.words('english'))
print(tf.__version__)

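We put the hyperparameters at the top, as shown below, so that they are easy to change and edit.

We will explain how each hyperparameter works when we get to it.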

vocab_size = 5000
embedding_dim = 64
max_length = 200
trunc_type = 'post'
padding_type = 'post'
oov_tok = '<OOV>'
training_portion = .8

Define two lists, one for the articles and one for the labels. Along the way, we also remove the stopwords.

articles = []
labels = []
with open("bbc-text.csv", 'r') as csvfile:
  reader = csv.reader(csvfile, delimiter=',')
  next(reader)
  for row in reader:
      labels.append(row[0])
      article = row[1]
      for word in STOPWORDS:
          token = ' ' + word + ' '
          article = article.replace(token, ' ')
          article = article.replace('  ', ' ')
      articles.append(article)
print(len(labels))
print(len(articles))
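
The stopword loop above can be tried on a single sentence. The sketch below assumes a tiny hypothetical stopword set (TOY_STOPWORDS) and a hypothetical helper (remove_stopwords); the article itself uses NLTK's full English list. It also makes one limitation visible: a stopword is only removed when it has a space on both sides, so one at the very start of the text survives.

```python
# Hypothetical mini stopword set; the article uses NLTK's full English list.
TOY_STOPWORDS = {"the", "is", "on"}

def remove_stopwords(text, stopwords):
    """Remove stopwords that are surrounded by spaces, as in the loop above."""
    for word in stopwords:
        text = text.replace(' ' + word + ' ', ' ')
    return text

cleaned = remove_stopwords("the cat is on the mat", TOY_STOPWORDS)
print(cleaned)  # the cat mat
```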

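The data contains 2,225 news articles. We split them into a training set and a validation set; following the parameter we set earlier, 80% is used for training and 20% for validation.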

train_size = int(len(articles) * training_portion)
train_articles = articles[0: train_size]
train_labels = labels[0: train_size]
validation_articles = articles[train_size:]
validation_labels = labels[train_size:]
print(train_size)
print(len(train_articles))
print(len(train_labels))
print(len(validation_articles))
print(len(validation_labels))
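
As a quick check of the arithmetic behind the split, the following sketch recomputes the two set sizes from the figures given in the text (2,225 articles and a training portion of 0.8). The variable names total_articles and validation_size are introduced here for illustration only.

```python
total_articles = 2225      # size of the BBC data set, as stated above
training_portion = 0.8

train_size = int(total_articles * training_portion)
validation_size = total_articles - train_size
print(train_size, validation_size)  # 1780 445
```

Because the slices are taken by position, this kind of split relies on the rows of the CSV file not being grouped by label.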

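The Tokenizer does all the heavy lifting for us. It tokenizes our articles and keeps the 5,000 most common words. oov_token puts a special value in place of any word the tokenizer has not seen; that is, we want <OOV> to be used for words that are not in word_index. fit_on_texts goes through all the text and creates a dictionary like the following: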

tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(train_articles)
word_index = tokenizer.word_index
dict(list(word_index.items())[0:10])

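Translator's note: In computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens. The program or function that performs it is called a lexical analyzer (lexer for short), also known as a scanner. A lexical analyzer usually takes the form of a function that is called by the parser.

We can see that "<OOV>" is the most common token in our corpus, followed by "said", "mr", and so on.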
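
To make the shape of that dictionary concrete, here is a simplified pure-Python sketch of how such a word index can be built. It is an illustration under stated assumptions, not the Keras code: the helper build_word_index and the three toy sentences are hypothetical, and the real Tokenizer also strips punctuation. Index 1 goes to the out-of-vocabulary token, the remaining words are ranked by frequency, and index 0 is left free for padding.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Toy version of fitting a tokenizer: rank words by frequency."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}  # index 1 is reserved for unseen words
    # most_common() sorts by descending count; ties keep first-seen order.
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

toy_texts = ["mr smith said hello", "she said goodbye", "said mr jones"]
toy_index = build_word_index(toy_texts)
print(toy_index)
```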

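After tokenization, the next step is to turn those tokens into lists of sequences. Below is the 11th article in the training data, converted into a sequence.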

train_sequences = tokenizer.texts_to_sequences(train_articles)
print(train_sequences[10])

Figure 1
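
The conversion from words to integers can be sketched as follows. This is a simplified stand-in for the tokenizer's behaviour, with a hypothetical helper (to_sequence) and a hypothetical five-entry word index: a word is replaced by the out-of-vocabulary index either when it was never seen or when its rank falls outside the vocabulary size that is kept.

```python
def to_sequence(text, word_index, num_words, oov_token="<OOV>"):
    """Map each word to its index; unknown or too-rare words become OOV."""
    oov_index = word_index[oov_token]
    sequence = []
    for word in text.lower().split():
        index = word_index.get(word)
        if index is None or index >= num_words:
            index = oov_index  # unseen, or ranked outside the kept vocabulary
        sequence.append(index)
    return sequence

# Hypothetical word index, most frequent word first.
toy_index = {"<OOV>": 1, "said": 2, "mr": 3, "smith": 4, "hello": 5}
print(to_sequence("mr smith said hi", toy_index, num_words=4))  # [3, 1, 2, 1]
```

Here "smith" is known but ranked too low for a vocabulary of 4, and "hi" was never seen, so both map to 1.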

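When we train a neural network for NLP, the sequences need to be the same size, which is why we use padding. Our max_length is 200, so we use pad_sequences to make every article 200 tokens long. As a result, the first article, which was 426 tokens long, becomes 200; the second, which was 192, also becomes 200; and so on.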

train_padded = pad_sequences(train_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
print(len(train_sequences[0]))
print(len(train_padded[0]))
print(len(train_sequences[1]))
print(len(train_padded[1]))
print(len(train_sequences[10]))
print(len(train_padded[10]))

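In addition, there are padding_type and trunc_type, both of which are set to 'post'. For example, the 11th article has a length of 186 and has to be padded to 200; we pad at the end, which means adding 14 zeros.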

print(train_padded[10])

Figure 2

The first article has a length of 426 and has to be truncated to 200; we truncate at the end as well.
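
The two 'post' settings can be imitated with plain list operations. The helper pad_post below is a hypothetical simplification of what pad_sequences does for a single sequence, and the two inputs are stand-ins with the lengths mentioned in the text (186 and 426), not the actual articles.

```python
def pad_post(sequence, maxlen, value=0):
    """Truncate at the end, then pad at the end, to exactly maxlen items."""
    truncated = sequence[:maxlen]                  # 'post' truncation keeps the start
    padding = [value] * (maxlen - len(truncated))  # 'post' padding appends zeros
    return truncated + padding

short = pad_post(list(range(1, 187)), maxlen=200)  # stand-in for a 186-token article
long = pad_post(list(range(1, 427)), maxlen=200)   # stand-in for a 426-token article
print(len(short), short[-14:].count(0), len(long), long[-1])  # 200 14 200 200
```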

Then we do the same for the validation sequences.

validation_sequences = tokenizer.texts_to_sequences(validation_articles)
validation_padded = pad_sequences(validation_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
print(len(validation_sequences))
print(validation_padded.shape)

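Now let's look at the labels. Because our labels are text, we tokenize them as well. For training, the labels are expected to be numpy arrays, so we convert the label lists to numpy arrays as follows: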

label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(labels)
training_label_seq = np.array(label_tokenizer.texts_to_sequences(train_labels))
validation_label_seq = np.array(label_tokenizer.texts_to_sequences(validation_labels))
print(training_label_seq[0])
print(training_label_seq[1])
print(training_label_seq[2])
print(training_label_seq.shape)
print(validation_label_seq[0])
print(validation_label_seq[1])
print(validation_label_seq[2])
print(validation_label_seq.shape)
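
The shape of the encoded labels is easier to see on a toy example. The sketch below is a simplified imitation of the label tokenizer, using a hypothetical helper (encode_labels) and made-up labels. Each label is a single word, so each one becomes a one-item list, which is why the arrays above have one column, and the indices start at 1 rather than 0.

```python
from collections import Counter

def encode_labels(all_labels, subset):
    """Give each label a 1-based index by frequency, one-item list per label."""
    counts = Counter(all_labels)
    label_index = {label: i
                   for i, (label, _) in enumerate(counts.most_common(), start=1)}
    return label_index, [[label_index[label]] for label in subset]

# Hypothetical labels for illustration; the real ones come from the CSV file.
toy_labels = ["tech", "sport", "sport", "business", "tech", "sport"]
label_index, encoded = encode_labels(toy_labels, toy_labels[:3])
print(label_index)  # {'sport': 1, 'tech': 2, 'business': 3}
print(encoded)      # [[2], [1], [1]]
```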

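Before training the deep neural network, we should explore what our original articles and the padded articles look like. Running the code below, we look at the 11th article and can see that some words have become "<OOV>", because they did not make it into the top 5,000.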

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
def decode_article(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
print(decode_article(train_padded[10]))
print('---')
print(train_articles[10])

Figure 3
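
The same decoding step can be tried on a small example. The reverse index below is hypothetical and has only three entries; the sketch shows that out-of-vocabulary positions come back as the "<OOV>" token, while the padding value 0, which has no entry in the index, falls back to "?".

```python
def decode(sequence, reverse_index):
    """Turn indices back into words; indices with no entry (such as padding 0) become '?'."""
    return ' '.join(reverse_index.get(i, '?') for i in sequence)

# Hypothetical reverse index for illustration.
toy_reverse_index = {1: "<OOV>", 2: "said", 3: "mr"}
print(decode([3, 1, 2, 0, 0], toy_reverse_index))  # mr <OOV> said ? ?
```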

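Now it is time to implement the LSTM.

We build a tf.keras.Sequential model, starting with an embedding layer. An embedding layer stores one vector per word. When called, it converts sequences of word indices into sequences of vectors. After training, words with similar meanings often have similar vectors.
The Bidirectional wrapper is used with an LSTM layer; it propagates the input forwards and backwards through the LSTM layer and then concatenates the outputs. This helps the LSTM learn long-term dependencies. We then feed the result into a dense neural network to do the classification.
We use relu in place of tanh, since the two functions are good alternatives to each other.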
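
Conceptually, the embedding layer is a lookup table. The following sketch assumes tiny hypothetical sizes (10 words, 4 dimensions, where the article uses 5000 and 64) and random values in place of learned ones; the names embedding_table and embed are introduced here for illustration only.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 10, 4  # tiny hypothetical sizes; the article uses 5000 and 64

# One row of embedding_dim numbers per word index (randomly initialised here;
# in the real layer these rows are learned during training).
embedding_table = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                   for _ in range(vocab_size)]

def embed(sequence):
    """Replace each word index with its row from the table."""
    return [embedding_table[i] for i in sequence]

vectors = embed([3, 1, 2, 0])
print(len(vectors), len(vectors[0]))  # 4 4
```

A sequence of 4 indices becomes a sequence of 4 vectors, which is the form of input the LSTM layer then consumes.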

We add a Dense layer with 6 units and softmax activation. When there are multiple outputs, softmax converts the output layer into a probability distribution.
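
What softmax does to the six outputs can be shown with a short pure-Python sketch. The six scores below are made up for illustration; the function exponentiates each score and divides by the total, so the results are positive and sum to 1.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1, -1.0, 0.5, 0.0])  # six made-up scores, one per output unit
print(probs, sum(probs))
```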

model = tf.keras.Sequential([
  # Add an Embedding layer expecting input vocab of size 5000, and output embedding dimension of size 64 we set at the top
  tf.keras.layers.Embedding(vocab_size, embedding_dim),
  tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(embedding_dim)),
#    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
  # use ReLU in place of tanh function since they are very good alternatives of each other.
  tf.keras.layers.Dense(embedding_dim, activation='relu'),
  # Add a Dense layer with 6 units and softmax activation.
  # When we have multiple outputs, softmax convert outputs layers into a probability distribution.
  tf.keras.layers.Dense(6, activation='softmax')
])
model.summary()

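Figure 4

In our model summary, we have the embedding layer, then the Bidirectional wrapper containing the LSTM, followed by two dense layers. The output of the Bidirectional layer is 128, because it is double the 64 units we gave the LSTM. We could also stack LSTM layers, but we found that the results were worse.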
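
The sizes in the summary can be reproduced by hand. The sketch below assumes the standard parameterisation of each layer (an LSTM has four weight blocks, namely three gates plus the candidate cell state, each with input weights, recurrent weights, and a bias) and uses the hyperparameters set at the top of the article; the variable names are introduced here for illustration.

```python
vocab_size, embedding_dim, lstm_units, num_outputs = 5000, 64, 64, 6

embedding_params = vocab_size * embedding_dim
# Four weight blocks, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * ((embedding_dim + lstm_units + 1) * lstm_units)
bidirectional_params = 2 * lstm_params  # one LSTM per direction
bidirectional_output = 2 * lstm_units   # the two outputs are concatenated
dense_params = bidirectional_output * embedding_dim + embedding_dim
output_params = embedding_dim * num_outputs + num_outputs

total = embedding_params + bidirectional_params + dense_params + output_params
print(bidirectional_output, total)  # 128 394694
```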

print(set(labels))

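We have 5 labels in total, but because we did not one-hot encode them, we have to use sparse_categorical_crossentropy as the loss function. It seems to treat 0 as a possible label as well, whereas the tokenizer object assigns integers starting from 1, not 0. As a result, the last dense layer needs outputs for the labels 0, 1, 2, 3, 4, and 5, even though 0 is never used.

If you want the last dense layer to have 5 units, you need to subtract 1 from the training and validation labels. I decided to leave it as it is.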
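
The two options can be compared on a single example. This is a minimal sketch with made-up probabilities and a hand-written version of the loss for one example (the negative log of the probability assigned to the true class), not the Keras implementation. With labels 1 to 5 the output needs 6 units, where unit 0 is never the target; shifting the labels down by 1 allows a 5-unit output instead.

```python
import math

def sparse_categorical_crossentropy(label, probs):
    """Loss for one example: negative log of the probability given to the true class."""
    return -math.log(probs[label])

# Made-up softmax output over 6 units; unit 0 exists but no label ever maps to it.
probs = [0.01, 0.70, 0.10, 0.09, 0.05, 0.05]
loss_as_is = sparse_categorical_crossentropy(1, probs)  # labels 1..5, 6 units

# Alternative: shift labels to 0..4 and use a 5-unit output layer instead.
probs_five = [0.70, 0.10, 0.10, 0.05, 0.05]
loss_shifted = sparse_categorical_crossentropy(1 - 1, probs_five)
print(loss_as_is, loss_shifted)
```

In both cases the integer label is used directly as an index into the output, which is why no one-hot encoding is needed.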

I decided to train for 10 epochs, which, as you will see, is plenty.

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
num_epochs = 10
history = model.fit(train_padded, training_label_seq, epochs=num_epochs, validation_data=(validation_padded, validation_label_seq), verbose=2)

Figure 5

import matplotlib.pyplot as plt
def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()
plot_graphs(history, "accuracy")
plot_graphs(history, "loss")

Figure 6

We probably need only 3 to 4 epochs. By the end of training, we can see a bit of overfitting.
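
One simple way to read such curves is to look for the epoch with the lowest validation loss. The helper best_epoch below is hypothetical, and the numbers are made up purely to show the pattern of a loss that falls and then rises again; they are not results from the run above.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=lambda i: val_losses[i]) + 1

# Made-up validation losses, NOT the results of the run above.
toy_val_losses = [0.90, 0.45, 0.30, 0.28, 0.31, 0.35, 0.40]
print(best_epoch(toy_val_losses))  # 4
```

With a real run, the list would come from history.history['val_loss'].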

In a follow-up article, we will work on improving this model.

You can find the Jupyter notebook for this article on GitHub.