Deep Learning in Practice: Training a Recurrent Neural Network with TensorFlow to Write in the Style of Shakespeare

 


AI writing in the style of Shakespeare: training a recurrent neural network to imitate Shakespeare

FLORIZEL:
Should she kneel be?
In shall not weep received; unleased me
And unrespective greeting than dwell in, thee,
look’d on me, son in heavenly properly.

Who wrote this: Shakespeare, or a machine learning model?

The answer is the latter! The passage above was produced by a recurrent neural network trained with TensorFlow for 30 epochs, given the seed "FLORIZEL:". In this article, I'll explain and provide the code needed to train a neural network to write Shakespearean plays, or anything else you'd like it to write!


Imports and Data

First, import a few basic libraries:

import tensorflow as tf
import numpy as np
import os
import time

TensorFlow ships with Shakespeare's works built in. If you are working in an online environment such as Kaggle, make sure you are connected to the internet.

path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

The data needs to be decoded as UTF-8.

text = open(path_to_file, 'rb').read().decode(encoding='utf-8')
# length of text is the number of characters in it
print ('Length of text: {} characters'.format(len(text)))

[Output]:

Length of text: 1115394 characters

That is plenty of data to work with!

Let's look at the first 250 characters:

print(text[:250])

Vectorization

First, let's see how many distinct characters the file contains:

vocab = sorted(set(text))
print ('{} unique characters'.format(len(vocab)))

[Output]:

65 unique characters

Before training, the strings need to be mapped to a numerical representation.
Below we create two lookup tables: one that maps characters to numbers, and another that maps numbers back to characters.

char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)
text_as_int = np.array([char2idx[c] for c in text])

Let's look at the vocabulary dictionary:

print('{')
for char,_ in zip(char2idx, range(20)):
    print('  {:4s}: {:3d},'.format(repr(char), char2idx[char]))
print('  ...\n}')

[Output]:

{
'\n': 0,
' ' : 1,
'!' : 2,
'$' : 3,
'&' : 4,
"'" : 5,
',' : 6,
'-' : 7,
'.' : 8,
'3' : 9,
':' : 10,
...
}

Every distinct character now has its own index.

Let's see how the vectorizer handles the first two words of the text, 'First Citizen':

print ('{} ---- characters mapped to int ---- > {}'.format(repr(text[:13]), text_as_int[:13]))

These words are converted into a vector of numbers, which can easily be converted back into text with the integer-to-character table.
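
For example, the first 13 indices map straight back to the original text. A minimal round-trip sketch using the tables defined above:

# Map the first 13 integers back to characters and join them into a string
print(''.join(idx2char[text_as_int[:13]]))  # prints: First Citizen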

Creating the Training Data

Given a sequence of characters, the model should ideally learn the most likely next character.
The text will be split into sequences, and each input sequence will contain seq_length characters from the text.
The target for any input sequence is the same sequence shifted one character to the right.

For example, given the input "Hell", the target would be "ello", together forming the word "Hello".

First, we can use TensorFlow's .from_tensor_slices function to turn the text vector into a stream of character indices.

# The maximum length sentence we want for a single input in characters
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

for i in char_dataset.take(5):
  print(idx2char[i.numpy()])

[Output]:

F
i
r
s
t

The batch method groups these individual characters into sequences of a fixed length, forming snippets of the text.

sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

for item in sequences.take(5):
  print(repr(''.join(idx2char[item.numpy()])))

[Output]:

'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou ' 'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k' "now Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us ki" "ll him, and we'll have corn at our own price.\nIs't a verdict?\n\nAll:\nNo more talking on't; let it be d" 'one: away, away!\n\nSecond Citizen:\nOne word, good citizens.\n\nFirst Citizen:\nWe are accounted poor citi'

For each sequence, we duplicate it and shift it with the map method to form an input and a target.

def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

The dataset has now been transformed into the inputs and targets we want.
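
One way to print the first input/target pair is to take a single example from the dataset; a minimal sketch (this also defines the input_example and target_example used in the step-by-step printout further down):

# Take one (input, target) pair and decode it back to text
for input_example, target_example in dataset.take(1):
  print('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
  print('Target data:', repr(''.join(idx2char[target_example.numpy()])))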

Input data:  'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou' 
Target data: 'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '

Each index of these vectors is processed one time step at a time. For the input at step 0, the model receives the numerical index for "F" and tries to predict "i" as the next character. At the next time step it does the same thing, but the RNN considers the context of the previous steps in addition to the current input character.

for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print("Step {:4d}".format(i))
    print("  input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print("  expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))

[Output]:

Step 0
input: 18 ('F')
expected output: 47 ('i')
Step 1
input: 47 ('i')
expected output: 56 ('r')
Step 2
input: 56 ('r')
expected output: 57 ('s')
Step 3
input: 57 ('s')
expected output: 58 ('t')
Step 4
input: 58 ('t')
expected output: 1 (' ')

TensorFlow's tf.data can be used to split the text into these more manageable sequences, but first the data needs to be shuffled and packed into batches.

# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset

[Output]:

<BatchDataset shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>

Building the Model

Finally, we can build the model. Let's first define a few important variables:

# Length of the vocabulary in chars
vocab_size = len(vocab)

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

The model will have an embedding (input) layer that maps each character's index to a vector of embedding_dim dimensions. It will have a GRU layer (which can be replaced with an LSTM layer) of size units = rnn_units. Finally, the output layer will be a standard fully connected layer with vocab_size outputs.

The following function lets us create this model quickly and cleanly.

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]),
    tf.keras.layers.GRU(rnn_units,
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform'),
    tf.keras.layers.Dense(vocab_size)
  ])
  return model
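
As noted above, the GRU can be swapped for an LSTM. A minimal sketch of that variant, where build_model_lstm is a hypothetical helper and not part of the original code:

def build_model_lstm(vocab_size, embedding_dim, rnn_units, batch_size):
  # Identical to build_model, with the GRU replaced by an LSTM
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]),
    tf.keras.layers.LSTM(rnn_units,
                         return_sequences=True,
                         stateful=True,
                         recurrent_initializer='glorot_uniform'),
    tf.keras.layers.Dense(vocab_size)
  ])
  return model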

Calling build_model assembles the model architecture.

model = build_model(
  vocab_size = len(vocab),
  embedding_dim=embedding_dim,
  rnn_units=rnn_units,
  batch_size=BATCH_SIZE)

Let's print a summary of the model to see how many parameters it has.
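
The table below comes from the standard Keras summary call:

model.summary()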

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (64, None, 256)           16640     
_________________________________________________________________
gru (GRU)                    (64, None, 1024)          3938304   
_________________________________________________________________
dense (Dense)                (64, None, 65)            66625     
=================================================================
Total params: 4,021,569
Trainable params: 4,021,569
Non-trainable params: 0
_________________________________________________________________

Four million parameters! We will want to train it for a while.

Compiling the Model

The problem can now be treated as a classification problem:
given the previous RNN state and the input at this time step, predict the class of the next character.
Therefore, we attach a sparse categorical cross-entropy loss function and the Adam optimizer.

def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

# Run one batch through the untrained model to check the output shape and loss
for input_example_batch, target_example_batch in dataset.take(1):
  example_batch_predictions = model(input_example_batch)

example_batch_loss = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss:      ", example_batch_loss.numpy().mean())
model.compile(optimizer='adam', loss=loss)

[Output]:

Prediction shape: (64, 100, 65) # (batch_size, sequence_length, vocab_size)
scalar_loss: 4.1746616
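
As a sanity check, this is about what we should expect from an untrained model: with 65 characters in the vocabulary, a uniform guess has a cross-entropy of ln(65) ≈ 4.17, which matches the scalar_loss above.

# Cross-entropy of a uniform guess over the 65-character vocabulary
print(np.log(65))  # ~4.174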

Configuring Checkpoints

Model training, especially on a dataset as large as Shakespeare's plays, takes a long time. Ideally, we do not want to retrain the model every time we need a prediction. The tf.keras.callbacks.ModelCheckpoint callback saves the weights at checkpoints to files during training, and those files can later be loaded back into a blank model. This also comes in handy if training is interrupted for any reason.

# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

Finally, run the training:

EPOCHS=30
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

This should take about six hours. For less impressive but faster results, EPOCHS can be lowered to 10 (anything less than 5 turns out complete garbage).
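
If a long run does get interrupted, one option (not part of the original code) is to reload the most recent checkpoint before calling fit again; a minimal sketch using the callback configured above:

# Hypothetical resume: restore the latest saved weights, then keep training
latest = tf.train.latest_checkpoint(checkpoint_dir)
if latest is not None:
  model.load_weights(latest)
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])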


Generating Text

Restore the weights from the latest checkpoint:

tf.train.latest_checkpoint(checkpoint_dir)

With those weights we can rebuild the model, this time with a batch size of 1:

model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))

Steps for generating text:

  1. Choose a seed string, initialize the RNN state, and set the number of characters to generate.
  2. Use the start string and the RNN state to get the prediction distribution over the next character.
  3. Sample a categorical distribution to compute the index of the predicted character, and use that character as the model's next input.
  4. The RNN state returned by the model is fed back into the model.
  5. Repeat steps 2 to 4 until the text is generated.

The generate_text function below implements these steps.

def generate_text(model, start_string):
  # Evaluation step (generating text using the learned model)

  # Number of characters to generate
  num_generate = 1000

  # Converting our start string to numbers (vectorizing)
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty string to store our results
  text_generated = []

  # Low temperatures results in more predictable text.
  # Higher temperatures results in more surprising text.
  # Experiment to find the best setting.
  temperature = 1.0

  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
      predictions = model(input_eval)
      # remove the batch dimension
      predictions = tf.squeeze(predictions, 0)

      # using a categorical distribution to predict the character returned by the model
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

      # We pass the predicted character as the next input to the model
      # along with the previous hidden state
      input_eval = tf.expand_dims([predicted_id], 0)

      text_generated.append(idx2char[predicted_id])

  return (start_string + ''.join(text_generated))

Finally, given a start string, we can generate some interesting text.

Now, enjoy two scripts from the RNN: one trained for 10 epochs and one trained for 30.

Here is the 10-epoch version:

print(generate_text(model, start_string=u"ROMEO: "))

ROMEO: how I, away too put That you shall have thieffort, are but love.

JULIET: Go, fight, sir: we say ‘Ay,’ and alack to stand and not to go to; And washt us him to-domm. Ay, my ows young; a man hear from his monsher to thee.

KING RICHARD III: Come, cease. O broteld the costime’s deforment! Thou wilt was quite.

PAULINA: I would you say the hour! Ah, hole for your company: But, good my lord; we have a king, of peace?

BALTHASAR: Cadul and washee could he ha! To curit her I may wench.

GLOUCESTER: Had you here shall such a pierce to temper; Or might his noble offery owe and speed Which seemest thy trims in a weaky amidude By this to the dother, dods citizens.

Third Citizen:
Madam sweet give reward, rebeire them With news gone! Pluck yielding: ’tis sign out things Within risess in strifes all ten times, To dish his finmers for briefily.

JULIET:
Gentlemen, God eveI come approbouting his wife as it, — triumphrous night change you gods, thou goest:
To which will dispersed and France.

Wow! After only 10 epochs, the grasp of the form is already impressive. The spelling accuracy is dubious, but there is a clear sense of dramatic conflict. The writing can certainly improve; hopefully the 30-epoch model will do better.

And here is the 30-epoch version.

Enjoy this piece, composed entirely by the RNN, one character at a time!

BRUTUS:
Could you be atherveshed him, our two,
But much a tale lendly fear;
For which we in thy shade of Naples.
Here’s no increase False to’t, offorit is the war of white give again.
This is the queen, whose vanoar’s head is worthly.
But cere it be a witch, some comfort.
What, nurse, I say!
Go Hamell.

FLORIZEL:
Should she kneel be?
In shall not weep received; unleased me
And unrespective greeting than dwell in, thee,
look’d on me, son in heavenly properly,
That ever you are my father is but straing;
Unless you would repossess him, hath always louded up,
You provokest. Good faith, o’erlar I can repart the heavens like deeds dills
For temper as soon as another maiden here, and he is bann’d upon which springs;
O’er most upon your voysus, I have no thunder; and my good villain!
Alest each other’s sleepings.
A fool; if this business prating duty
Does these traitors other sorrow.

LUCENTIO:
Tell me, they’s honourably.

Shepherd:
I know, my lord, to London, and you my moved join under him,
Great Apollo’s stan to make a book,
Both yet my father away towards Covent. Tut, And thou still’d by the earthmen lord r sensible your mother?

Servant:
Go, vill! We muster yet, for you’ll not: you are took good mad within your company in rage, I would you fight it so, his eye for every days,
To swear the beam of such a detects,
To Clarence dead to call upon you all I thank your grace, my father and my father, and yourself prevails
My father, hath a sword for hither;
Nor when thy heart is grown grave done.

QUEEN MARGARET: *
*Thou art a lodging very good and give thanks

With him.
But There is now in hand:
Therefore it be possish’d with Romeo dead.

MENENIUS:
Ha! little very welcome to my daughter’s sword,
Which haply my prayer’s legs, such as he does.
I am banks, sir, I’ll make you say ‘nough; for hither so better now to be so, sent it: it is stranger.

Wow! Interestingly, the model has even learned to rhyme in certain places (especially in Florizel's lines). Imagine what the RNN could write after 50 or even 100 epochs!

So, I guess AI will put writers out of work

Not exactly, but I can imagine a future where AI publishes plenty of articles engineered to go viral. Here's a challenge: collect the top articles on a topic, say from Human Parts or a similar publication, then train an AI to write popular pieces. Publish the RNN's output, word for word, and see how it does! A caveat: I would not recommend training an RNN on more specialized publications like Towards Data Science or Better Programming, because they require technical knowledge that an RNN cannot learn in a reasonable amount of time. More philosophical and non-technical writing, however, is well within an RNN's current abilities.

As text generation becomes more advanced, it will have the potential to write better than humans, because it will have an eye for which content goes viral, which phrasing makes readers feel good, and so on. It is startling to think that one day machines could beat humans at the thing humans do best: writing. Granted, it will not truly understand what it is writing, but it will master the way humans communicate.

I suppose if you can't beat them, join them!

Original article: https://imba.deephub.ai/p/051053806a5211ea90cd05de3860c663
