Author | Dr. Vaibhav Kumar
Compiled by | VK
Source | Analytics India Magazine
Natural language processing (NLP) has many interesting applications, and text generation is one of them.
When a machine learning model is built on a sequence architecture such as a recurrent neural network (RNN), LSTM, or GRU, it can generate the next sequence for an input text.
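To make the idea concrete, here is a minimal sketch of next-character prediction with a recurrent layer. This snippet is not from the original article; vocab_size and hidden_size are arbitrary assumed values, and the input is a dummy one-hot sequence.

import torch
import torch.nn as nn

vocab_size, hidden_size = 27, 64
gru = nn.GRU(vocab_size, hidden_size)      # expects input of shape (seq_len, batch, vocab_size)
head = nn.Linear(hidden_size, vocab_size)  # maps the hidden state to next-character scores

seq = torch.zeros(5, 1, vocab_size)        # dummy one-hot encoding of a 5-character input
out, h = gru(seq)                          # out: (5, 1, hidden_size)
logits = head(out[-1])                     # scores for the character that follows the sequence
print(logits.shape)                        # torch.Size([1, 27])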
PyTorch provides a powerful set of tools and libraries that power such NLP-based tasks. It not only requires less preprocessing, but also speeds up the training process.
In this article, we will train a recurrent neural network (RNN) in PyTorch on names from several languages. After successful training, the RNN model will predict names that belong to a given language and start with a given input letter.
PyTorch Implementation
This implementation was done in Google Colab, with the dataset fetched from Google Drive. So first, we will mount Google Drive from the Colab notebook.
from google.colab import drive
drive.mount('/content/gdrive')
Now we will import all the required libraries.
from __future__ import unicode_literals, print_function, division
from io import open
import glob
import os
import unicodedata
import string
import torch
import torch.nn as nn
import random
import time
import math
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
The following code snippet reads the dataset.
all_let = string.ascii_letters + " .,;'-"
n_let = len(all_let) + 1  # plus one for the EOS marker

def getFiles(path):
    return glob.glob(path)

# Turn a Unicode string into plain ASCII
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn' and c in all_let
    )

# Read a file and split it into lines
def getLines(filename):
    lines = open(filename, encoding='utf-8').read().strip().split('\n')
    return [unicodeToAscii(line) for line in lines]

# Build the cat_lin dictionary, mapping each category to a list of lines
cat_lin = {}
all_ctg = []
for filename in getFiles('gdrive/My Drive/Dataset/data/data/names/*.txt'):
    categ = os.path.splitext(os.path.basename(filename))[0]
    all_ctg.append(categ)
    lines = getLines(filename)
    cat_lin[categ] = lines

n_ctg = len(all_ctg)
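As a quick sanity check (an illustrative snippet, assuming the dataset is available at the path above; 'Ślusàrski' is the standard example from the PyTorch tutorial cited in the references), we can verify the Unicode normalization and the discovered categories:

print(unicodeToAscii('Ślusàrski'))  # -> Slusarski
print(n_ctg, all_ctg)               # number of language categories and their names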
In the next step, we will define the module class that generates names. This module will be a recurrent neural network.
class NameGeneratorModule(nn.Module):
    def __init__(self, inp_size, hid_size, op_size):
        super(NameGeneratorModule, self).__init__()
        self.hid_size = hid_size
        self.i2h = nn.Linear(n_ctg + inp_size + hid_size, hid_size)
        self.i2o = nn.Linear(n_ctg + inp_size + hid_size, op_size)
        self.o2o = nn.Linear(hid_size + op_size, op_size)
        self.dropout = nn.Dropout(0.1)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, category, input, hidden):
        inp_comb = torch.cat((category, input, hidden), 1)
        hidden = self.i2h(inp_comb)
        output = self.i2o(inp_comb)
        op_comb = torch.cat((hidden, output), 1)
        output = self.o2o(op_comb)
        output = self.dropout(output)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hid_size)
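Before wiring this into training, a quick dummy forward pass confirms the expected tensor shapes. This is an illustrative check, not part of the original article; it assumes the data-loading cell above has run, so that n_ctg and n_let are defined.

net = NameGeneratorModule(n_let, 128, n_let)  # throwaway instance just for the shape check
categ = torch.zeros(1, n_ctg)                 # placeholder one-hot category
inp = torch.zeros(1, n_let)                   # placeholder one-hot letter
hid = net.initHidden()
out, hid = net(categ, inp, hid)
print(out.shape, hid.shape)                   # torch.Size([1, n_let]) torch.Size([1, 128])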
The following functions are used to pick a random item from a list and a random line from a category.
def randChoice(l):
    return l[random.randint(0, len(l) - 1)]

def randTrainPair():
    category = randChoice(all_ctg)
    line = randChoice(cat_lin[category])
    return category, line
The following functions convert the data into a format compatible with the RNN module.
def categ_Tensor(categ):
    li = all_ctg.index(categ)
    tensor = torch.zeros(1, n_ctg)
    tensor[0][li] = 1
    return tensor

def inp_Tensor(line):
    tensor = torch.zeros(len(line), 1, n_let)
    for li in range(len(line)):
        letter = line[li]
        tensor[li][0][all_let.find(letter)] = 1
    return tensor

def tgt_Tensor(line):
    letter_id = [all_let.find(line[li]) for li in range(1, len(line))]
    letter_id.append(n_let - 1)  # EOS
    return torch.LongTensor(letter_id)
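As an illustrative check (using 'Kim' as an arbitrary example name, not from the original article), we can inspect the shapes and indices these helpers produce:

print(categ_Tensor(all_ctg[0]).shape)  # torch.Size([1, n_ctg]) : one-hot category
print(inp_Tensor('Kim').shape)         # torch.Size([3, 1, n_let]) : one one-hot vector per letter
print(tgt_Tensor('Kim'))               # indices of 'i', 'm', then the EOS index (n_let - 1)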
The following code builds a random training example (category, input, and target tensors) and defines the loss criterion and the training function.
# Build a random training example: category, input, and target tensors
def rand_train_exp():
    category, line = randTrainPair()
    category_tensor = categ_Tensor(category)
    input_line_tensor = inp_Tensor(line)
    target_line_tensor = tgt_Tensor(line)
    return category_tensor, input_line_tensor, target_line_tensor

# Loss function
criterion = nn.NLLLoss()

# Learning rate
lr_rate = 0.0005

def train(category_tensor, input_line_tensor, target_line_tensor):
    target_line_tensor.unsqueeze_(-1)
    hidden = model.initHidden()
    model.zero_grad()
    loss = 0

    for i in range(input_line_tensor.size(0)):
        output, hidden = model(category_tensor, input_line_tensor[i], hidden)
        l = criterion(output, target_line_tensor[i])
        loss += l

    loss.backward()

    # Manual SGD update: move each parameter against its gradient
    for p in model.parameters():
        p.data.add_(p.grad.data, alpha=-lr_rate)

    return output, loss.item() / input_line_tensor.size(0)
To display the elapsed time during training, the following function is defined.
def time_taken(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)
In the next step, we instantiate the RNN model.
model = NameGeneratorModule(n_let, 128, n_let)
We can print the model to inspect its layers and parameters.
print(model)
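Before launching the full loop, a single training step can be sanity-checked. This is an illustrative snippet, not from the original article; the printed loss will vary with the randomly drawn example.

out, loss = train(*rand_train_exp())
print(loss)  # average per-character NLL for one random (category, name) pair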
Next, the model will be trained for 100,000 iterations, where each iteration draws one random training example.
epochs = 100000
print_every = 5000
plot_every = 500
all_losses = []
total_loss = 0  # Reset every plot_every iterations

start = time.time()

for iter in range(1, epochs + 1):
    output, loss = train(*rand_train_exp())
    total_loss += loss

    if iter % print_every == 0:
        print('Time: %s, Epoch: (%d - Total Iterations: %d%%), Loss: %.4f' % (time_taken(start), iter, iter / epochs * 100, loss))

    if iter % plot_every == 0:
        all_losses.append(total_loss / plot_every)
        total_loss = 0
We will visualize the training loss.
plt.figure(figsize=(7,7))
plt.title("Loss")
plt.plot(all_losses)
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.show()
Finally, we will test our model: given a starting letter, it should generate names belonging to the specified language.
max_length = 20

# Sample from a category and a starting letter
def sample_model(category, start_letter='A'):
    with torch.no_grad():  # no need to track history while sampling
        category_tensor = categ_Tensor(category)
        input = inp_Tensor(start_letter)
        hidden = model.initHidden()

        output_name = start_letter

        for i in range(max_length):
            output, hidden = model(category_tensor, input[0], hidden)
            topv, topi = output.topk(1)
            topi = topi[0][0]
            if topi == n_let - 1:
                break
            else:
                letter = all_let[topi]
                output_name += letter
            input = inp_Tensor(letter)

        return output_name

# Get multiple samples from one category and several starting letters
def sample_names(category, start_letters='XYZ'):
    for start_letter in start_letters:
        print(sample_model(category, start_letter))
Now we will test the sampling functions by generating names for several languages and starting letters.
print("Italian:-") sample_names('Italian', 'BPRT') print("\nKorean:-") sample_names('Korean', 'CMRS') print("\nRussian:-") sample_names('Russian', 'AJLN') print("\nVietnamese:-") sample_names('Vietnamese', 'LMT')
As we can see above, our model has generated names that belong to the specified language categories and start from the given input letters.
References:
- Trung Tran, "Text Generation with PyTorch".
- "NLP From Scratch: Generating Names with a Character-Level RNN", PyTorch Tutorials.
- Francesca Paulin, "Character-Level LSTM in PyTorch", Kaggle.
Original article: https://analyticsindiamag.com/recurrent-neural-network-in-pytorch-for-text-generation/