【1】Declaration
This article is based on the official TensorFlow tutorial (https://tensorflow.google.cn/tutorials/sequences/text_generation), with some additional details added.
【2】Overview
1. tf.keras differs from standalone keras in three major ways (a minimal sketch illustrating all three follows this list):
1) The optimizer must come from the tf.train module; an optimizer from keras cannot be used.
2) The default save format of a tf.keras model is the TensorFlow checkpoint, not h5.
3) When training or running inference with tf.keras, a tf.data.Dataset can be passed directly as the input data.
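A minimal sketch of all three points, assuming the same TF 1.x eager-execution setup as the tutorial below; the toy model, data, and file path are placeholders, not part of the original article:

# Toy sketch of the three tf.keras differences (TF 1.x, eager execution assumed).
import numpy as np
import tensorflow as tf

tf.enable_eager_execution()

# Point 3): a tf.data.Dataset can be fed to fit() directly.
features = np.random.rand(100, 4).astype(np.float32)
labels = np.random.randint(0, 2, size=(100,))
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(10)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation='relu', input_shape=(4,)),
    tf.keras.layers.Dense(2),
])

# Point 1): the optimizer comes from tf.train, not keras.optimizers.
model.compile(optimizer=tf.train.AdamOptimizer(),
              loss=tf.losses.sparse_softmax_cross_entropy)
model.fit(dataset, epochs=1, steps_per_epoch=10)

# Point 2): save_weights defaults to the TensorFlow checkpoint format;
# pass save_format='h5' to get a Keras-style HDF5 file instead.
model.save_weights('./toy_ckpt')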
2. The official TensorFlow example generates text character by character: given a sequence, the model predicts the next character. It therefore has no notion of how words are spelled or how characters combine into words; being character-level, all it can do is predict the next character, so it may generate words or phrases that do not exist (see the sketch below).
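To make the character-level setup concrete, here is a tiny sketch (the word "hello" is an arbitrary example) of how a chunk of text becomes an input/target pair shifted by one character, mirroring the split_input_target function in the full code:

text = "hello"
input_text = text[:-1]   # "hell" -> what the model sees
target_text = text[1:]   # "ello" -> the next character at each position
for inp, tgt in zip(input_text, target_text):
    print("input: %s -> predict: %s" % (inp, tgt))
# input: h -> predict: e
# input: e -> predict: l
# input: l -> predict: l
# input: l -> predict: o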
3. The model has only three layers (char-embedding, GRU, FC), yet its parameter count is large and training is very slow (roughly half an hour per epoch on an i7 CPU). Note also that the char-embedding here is trained from scratch, rather than pre-trained with fasttext or gensim and then fine-tuned. A rough parameter count follows.
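A back-of-the-envelope count, assuming the Shakespeare corpus (65 distinct characters) and the TF 1.x GRU layout (no reset_after bias); the exact total may differ slightly by version, but the order of magnitude holds:

# Embedding: vocab_size * embedding_dim = 65 * 256                =    16,640
# GRU:       3 * (units * (input_dim + units) + units)
#          = 3 * (1024 * (256 + 1024) + 1024)                     = 3,935,232
# Dense:     units * vocab_size + vocab_size = 1024 * 65 + 65     =    66,625
# Total: roughly 4.0M parameters, which explains the slow CPU training.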
【3】The code:
# -*- coding:utf-8 -*-
import tensorflow as tf
import numpy as np
import os
import time

tf.enable_eager_execution()

# 1. Download the data
path = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

# 2. Preprocess: text is a single string
with open(path) as f:
    text = f.read()

# 3. Extract every distinct character in the text; note that vocab is a list
vocab = sorted(set(text))

# 4. Build the char --> int mapping
char2idx = {u: i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)
text_as_int = np.array([char2idx[c] for c in text])

# 5. Use Dataset.batch to cut the text into fixed-length sequences
seq_length = 100
examples_per_epoch = len(text) // seq_length
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
# seq_length + 1 because each chunk is split below into inputs and labels,
# which are shifted against each other by one character
sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)

# 6. Split each sequence into inputs and labels. E.g. "hello": inputs = "hell", labels = "ello"
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

# 7. Shuffle the sequences and group them into batches
BATCH_SIZE = 64
steps_per_epoch = examples_per_epoch // BATCH_SIZE
BUFFER_SIZE = 10000
# drop_remainder should generally be True: when the last group of examples
# is too small to fill a batch, it is discarded
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

# 8. Build the model
vocab_size = len(vocab)   # length of the vocabulary in chars
embedding_dim = 256       # the embedding dimension
rnn_units = 1024          # number of RNN units

def build_model(batch_size):
    model = tf.keras.Sequential()
    # Character embedding, hence vocab_size * embedding_dim parameters
    model.add(tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim,
                                        batch_input_shape=[batch_size, None]))
    model.add(tf.keras.layers.GRU(units=rnn_units,
                                  return_sequences=True,
                                  recurrent_initializer='glorot_uniform',
                                  stateful=True))
    model.add(tf.keras.layers.Dense(units=vocab_size))
    return model

model = build_model(BATCH_SIZE)
model.summary()

# 9. Configure the model
# The optimizer must come from tf.train, not from keras
model.compile(optimizer=tf.train.AdamOptimizer(), loss=tf.losses.sparse_softmax_cross_entropy)

# 10. Set up the callbacks
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_prefix,
                                                         save_weights_only=True)
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir='./logs/')

# 11. Train. repeat() loops the dataset indefinitely; without it there may not
# be enough data for 30 epochs
model.fit(dataset.repeat(), epochs=30, steps_per_epoch=steps_per_epoch,
          callbacks=[checkpoint_callback, tensorboard_callback])

# 12. Save the model
os.makedirs('./models', exist_ok=True)
# Keras h5 format
model.save_weights(filepath='./models/gen_text_with_char_on_rnn.h5', save_format='h5')
# TensorFlow checkpoint format
model.save_weights(filepath='./models/gen_text_with_char_on_rnn_check_point')

# 13. Generate text with the model
# The training model is stateful with batch_input_shape=[64, None], so it cannot
# accept a single sequence. Rebuild it with batch size 1 and load the trained weights.
model = build_model(batch_size=1)
model.load_weights('./models/gen_text_with_char_on_rnn.h5')
model.build(tf.TensorShape([1, None]))

def generate_text(model, start_string):
    # Evaluation step (generating text using the learned model)

    # Number of characters to generate
    num_generate = 1000

    # Converting our start string to numbers (vectorizing)
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    # Empty list to store our results
    text_generated = []

    # Low temperatures result in more predictable text,
    # higher temperatures in more surprising text.
    # Experiment to find the best setting.
    temperature = 1.0

    # Here batch size == 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # use a multinomial distribution to sample the character returned by the model
        predictions = predictions / temperature
        predicted_id = tf.multinomial(predictions, num_samples=1)[-1, 0].numpy()

        # pass the predicted character as the next input to the model,
        # along with the previous hidden state
        input_eval = tf.expand_dims([predicted_id], 0)
        text_generated.append(idx2char[predicted_id])

    return start_string + ''.join(text_generated)

print(generate_text(model, start_string="ROMEO: "))
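As a side note on the temperature parameter used above, here is a standalone NumPy sketch with made-up logits (not part of the tutorial code) showing how dividing the logits by the temperature sharpens or flattens the sampling distribution:

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
for temperature in (0.5, 1.0, 2.0):
    print(temperature, softmax(logits / temperature))
# Lower temperature -> probability mass concentrates on the top logit (predictable text);
# higher temperature -> the distribution flattens (more surprising text).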
【4】Summary
1. For more on tf.keras, see the official guide (https://tensorflow.google.cn/guide/keras).
2. For more on tf.data, see the official guide (https://tensorflow.google.cn/guide/datasets) and another blog post (https://my.oschina.net/u/3800567/blog/1637798).
3. You can switch to tf.keras entirely and stop using standalone keras: the two offer the same functionality and interfaces, and tf.keras integrates better with the rest of TensorFlow. In most code the switch is just the import line, as sketched below.
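A minimal illustration of the switch; the toy layer is a placeholder, and only the import changes in the common case:

# Before: standalone Keras
# import keras
# After: the Keras API bundled with TensorFlow
from tensorflow import keras
model = keras.Sequential([keras.layers.Dense(10)])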