Using Pretrained Word Vectors for NLP Tasks in PyTorch

Date: 2019-11-30

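When handling NLP tasks with a neural network framework such as PyTorch or TensorFlow, word vectors can be learned through the framework's Embedding layer. More often, though, starting from pretrained word vectors gives better performance. This post covers two ways to load pretrained word vectors: with gensim and with torchtext.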

1. Loading pretrained word vectors with gensim
Take the following corpus as an example:

test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
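Build the vocabulary. This step can also be simplified with Keras or torchtext; the complete code is in the repository linked at the end of this section.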

# Encode each word as an integer so it can be passed to the embedding layer to look up its word vector.
vocab = set(test_sentence)  # set() removes duplicate words
word_to_idx = {word: i+1 for i, word in enumerate(vocab)}
# Reserve index 0 for '<unk>': any word that does not appear in the training set maps to it, and its vector is all zeros.
word_to_idx['<unk>'] = 0
idx_to_word = {i+1: word for i, word in enumerate(vocab)}
idx_to_word[0] = '<unk>'
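The mapping above can be wrapped in two small helpers. This is only a sketch: build_vocab and encode are hypothetical names, not part of the original code, and sorted() is added here as an assumption so that the indices are the same on every run (iterating a raw set gives no such guarantee).

```python
def build_vocab(tokens):
    # sorted() makes the word -> index mapping reproducible across runs
    vocab = sorted(set(tokens))
    word_to_idx = {word: i + 1 for i, word in enumerate(vocab)}
    word_to_idx['<unk>'] = 0  # index 0 is reserved for unknown words
    idx_to_word = {i: w for w, i in word_to_idx.items()}
    return word_to_idx, idx_to_word


def encode(tokens, word_to_idx):
    # Words missing from the vocabulary fall back to index 0 ('<unk>')
    return [word_to_idx.get(tok, 0) for tok in tokens]


tokens = "When forty winters shall besiege thy brow".split()
word_to_idx, idx_to_word = build_vocab(tokens)
ids = encode(["thy", "brow", "summer"], word_to_idx)
print(ids)  # "summer" is not in the vocabulary, so it is encoded as 0
```

Deriving idx_to_word by inverting word_to_idx keeps the two dictionaries consistent by construction.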
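Next, load pretrained vectors with gensim. The vectors used here are pretrained GloVe vectors (download link: https://pan.baidu.com/s/1i5XmTA9). Because the GloVe file format differs slightly from the word2vec format, first convert the GloVe file to word2vec format with gensim's scripts.glove2word2vec. The conversion is simple: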

from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
# Existing GloVe vector file
glove_file = datapath('test_glove.txt')
# Where to write the file converted to word2vec format
tmp_file = get_tmpfile("test_word2vec.txt")
glove2word2vec(glove_file, tmp_file)
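The format difference mentioned above is small: a word2vec text file starts with a header line holding the number of vectors and their dimension, while a GloVe text file has no header. The following is a minimal sketch of that idea in plain Python, not gensim's implementation; add_word2vec_header and the toy file names are hypothetical, and the whole file is read into memory for simplicity.

```python
import os
import tempfile


def add_word2vec_header(glove_path, out_path):
    # Read all non-empty lines; each one is "<word> <v1> <v2> ..."
    with open(glove_path, encoding="utf-8") as f:
        lines = [line for line in f if line.strip()]
    # Dimension = number of fields on a line minus the word itself
    dim = len(lines[0].rstrip().split(" ")) - 1
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(f"{len(lines)} {dim}\n")  # the header that word2vec format expects
        f.writelines(lines)
    return len(lines), dim


# Toy GloVe-style file with two 3-dimensional vectors
tmp_dir = tempfile.mkdtemp()
glove_path = os.path.join(tmp_dir, "toy_glove.txt")
w2v_path = os.path.join(tmp_dir, "toy_word2vec.txt")
with open(glove_path, "w", encoding="utf-8") as f:
    f.write("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")

count, dim = add_word2vec_header(glove_path, w2v_path)
with open(w2v_path, encoding="utf-8") as f:
    header = f.readline().strip()
print(header)
```

For real files, gensim's own conversion shown earlier is the safer choice.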
Look up each vocabulary word in the vector file to build the weight matrix weight. Words with no match in the vector file keep their all-zero vector.

import gensim
import torch

# Load the word2vec-format vectors with gensim
wvmodel = gensim.models.KeyedVectors.load_word2vec_format('/Users/wyw/Documents/vectors/word2vec/word2vec.6B.100d.txt', binary=False, encoding='utf-8')
vocab_size = len(vocab) + 1  # +1 for '<unk>'
embed_size = 100
weight = torch.zeros(vocab_size, embed_size)

# Copy the pretrained vector of every vocabulary word found in the model;
# words that are not found keep their all-zero row
for word, index in word_to_idx.items():
    if word in wvmodel:
        weight[index, :] = torch.from_numpy(wvmodel.get_vector(word))
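Since unmatched words silently keep an all-zero vector, it can help to check how much of the vocabulary the pretrained file actually covers. The sketch below is one possible way to do that; coverage is a hypothetical helper, and a plain set stands in for the gensim model (with a real KeyedVectors object, the same membership test `word in wvmodel` applies).

```python
def coverage(word_to_idx, pretrained_words, skip=('<unk>',)):
    # Ignore special tokens that are not expected to have a pretrained vector
    words = [w for w in word_to_idx if w not in skip]
    if not words:
        return 1.0, []
    missing = sorted(w for w in words if w not in pretrained_words)
    covered = len(words) - len(missing)
    return covered / len(words), missing


# Toy data: a set stands in for the pretrained model's vocabulary
word_to_idx = {'<unk>': 0, 'thy': 1, 'beauty': 2, "totter'd": 3}
pretrained_words = {'thy', 'beauty', 'the'}

ratio, missing = coverage(word_to_idx, pretrained_words)
print(f"coverage: {ratio:.2%}, missing: {missing}")
```

A low ratio usually points to a tokenization or casing mismatch between the corpus and the vector file. The sonnet above, for instance, contains capitalized words and tokens with attached punctuation.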
Once weight is built, the pretrained vectors can be passed to PyTorch's Embedding layer.

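import torch.nn as nn
# from_pretrained freezes the weights by default (freeze=True)
embedding = nn.Embedding.from_pretrained(weight)
# requires_grad controls whether the word vectors are fine-tuned during training
embedding.weight.requires_grad = True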
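As a small illustration of the fine-tuning switch, the sketch below builds two Embedding layers from the same toy matrix, one frozen and one trainable, using the freeze argument of nn.Embedding.from_pretrained. The 3x3 weight matrix is made-up data, not real word vectors.

```python
import torch
import torch.nn as nn

# Toy pretrained matrix: 3 words, 3 dimensions; row 0 plays the role of '<unk>'
weight = torch.tensor([[0.0, 0.0, 0.0],
                       [0.1, 0.2, 0.3],
                       [0.4, 0.5, 0.6]])

# freeze=True is the default, so these vectors are not updated during training
frozen = nn.Embedding.from_pretrained(weight)
# freeze=False keeps the pretrained values as a starting point but allows fine-tuning
tunable = nn.Embedding.from_pretrained(weight, freeze=False)

ids = torch.tensor([2, 0])
out = frozen(ids)  # looks up rows 2 and 0 of the matrix
print(out.shape, frozen.weight.requires_grad, tunable.weight.requires_grad)
```

Passing freeze=False at construction has the same effect as setting requires_grad to True afterwards.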
The complete code is in my GitHub repository, https://github.com/atnlp/torchtext-summary, in the file Language-Model.ipynb.

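2. Loading pretrained word vectors with torchtext
This section shows how to use pretrained word vectors in torchtext and pass them on to a neural network model for training. For a fuller treatment of torchtext, see my other post: TorchText Usage Examples and Complete Code.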

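Using the pretrained vectors torchtext supports by default
By default, torchtext automatically downloads the corresponding pretrained vector file into the .vector_cache directory under the current folder. .vector_cache is the default directory for both the vector files and the cache files.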

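from torchtext.vocab import GloVe
from torchtext import data  # classic Field API; torchtext 0.9 moved it to torchtext.legacy
TEXT = data.Field(sequential=True)
# The following two ways of specifying pretrained vectors are equivalent
# TEXT.build_vocab(train, vectors="glove.6B.200d")
TEXT.build_vocab(train, vectors=GloVe(name='6B', dim=200))
# In this case torchtext downloads glove.6B.zip by default and extracts four files from it: glove.6B.50d.txt, glove.6B.100d.txt, glove.6B.200d.txt and glove.6B.300d.txt. You can therefore place glove.6B.zip or glove.6B.200d.txt in the .vector_cache folder ahead of time (create the folder manually if it does not exist).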
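Specifying the directories for the pretrained vectors and the cache
The approach above has one major drawback: every new NLP project downloads the pretrained vector file again into its own .vector_cache folder when the vocabulary is built. To avoid this, use the name and cache parameters of torchtext.vocab.Vectors to specify the pretrained vector file and the cache directory. This also means you can use vectors you trained yourself with tools such as word2vec: just point name at that file.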
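For self-trained vectors, the file that name points to is typically plain text in the same layout as the GloVe files: one word per line, followed by its space-separated components. The sketch below writes and re-reads such a file in plain Python. save_vectors_txt and load_vectors_txt are hypothetical helpers for illustration, not torchtext functions, and the toy vectors are made-up values.

```python
import os
import tempfile


def save_vectors_txt(vectors, path):
    # One line per word: "<word> <v1> <v2> ... <vn>"
    with open(path, "w", encoding="utf-8") as f:
        for word, values in vectors.items():
            f.write(word + " " + " ".join(f"{v:.6f}" for v in values) + "\n")


def load_vectors_txt(path):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) <= 2:
                # Blank line or a "<count> <dim>" header (this also skips 1-d vectors)
                continue
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


toy = {"king": [0.5, -0.25, 1.0], "queen": [0.5, 0.25, 1.0]}
path = os.path.join(tempfile.mkdtemp(), "my_vectors.txt")
save_vectors_txt(toy, path)
loaded = load_vectors_txt(path)
print(loaded)
```

A file written this way is the kind of input the name parameter is meant to receive.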

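Specifying the pretrained vector file with the name parameter
By default, both the pretrained vector files and the cache files live in .vector_cache under the current directory. Even when name points to a vector file elsewhere, the cache directory has not been changed, so a .vector_cache directory must still exist under the current directory.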

import os
from torchtext.vocab import Vectors
# glove.6B.200d.txt is a pretrained vector file downloaded in advance
if not os.path.exists('.vector_cache'):
    os.mkdir('.vector_cache')
vectors = Vectors(name='myvector/glove/glove.6B.200d.txt')
TEXT.build_vocab(train, vectors=vectors)
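Specifying the cache directory with the cache parameter
# Going one step further, pass cache together with name to choose the cache directory explicitly instead of relying on the default .vector_cache
cache = '.vector_cache'
if not os.path.exists(cache):
    os.mkdir(cache)
vectors = Vectors(name='myvector/glove/glove.6B.200d.txt', cache=cache)
TEXT.build_vocab(train, vectors=vectors)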
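Setting the Embedding layer weights in the model
When using pretrained word vectors, the initial weights of the embedding matrix must be passed explicitly to the model's Embedding layer. The weights are stored in the vectors attribute of the vocabulary. Taking an Embedding layer built with PyTorch as an example: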

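# Embedding layer created with PyTorch; its shape must match the pretrained matrix
weight_matrix = TEXT.vocab.vectors
embedding = nn.Embedding(weight_matrix.size(0), weight_matrix.size(1))
# Set the initial weights of the embedding matrix
embedding.weight.data.copy_(weight_matrix)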
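Inside a model, the same copy step usually lives in the constructor. The following is a minimal sketch under stated assumptions: TextClassifier is a hypothetical model, the mean-pooling classifier is chosen only to keep the example short, and a random tensor stands in for TEXT.vocab.vectors.

```python
import torch
import torch.nn as nn


class TextClassifier(nn.Module):
    def __init__(self, vectors, num_classes, freeze=False):
        super().__init__()
        # Take the layer size from the pretrained matrix so the shapes always match
        vocab_size, embed_dim = vectors.shape
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.embedding.weight.data.copy_(vectors)
        self.embedding.weight.requires_grad = not freeze
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        pooled = embedded.mean(dim=1)          # average over the sequence
        return self.fc(pooled)                 # (batch, num_classes)


vectors = torch.randn(10, 4)  # stand-in for TEXT.vocab.vectors
model = TextClassifier(vectors, num_classes=2)
logits = model(torch.randint(0, 10, (3, 5)))  # batch of 3 sequences, 5 tokens each
print(logits.shape)
```

Reading the sizes from the matrix avoids the shape mismatch that copy_ would raise if the layer were created with hard-coded dimensions.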
A more complete example
import torch
from torchtext import data
from torchtext import datasets
from torchtext.vocab import GloVe
import numpy as np

def load_data(opt):
    # use torchtext to load data, no need to download dataset
    print("loading {} dataset".format(opt.dataset))
    # set up fields
    text = data.Field(lower=True, include_lengths=True, batch_first=True, fix_length=opt.max_seq_len)
    label = data.Field(sequential=False)

    # make splits for data
    train, test = datasets.IMDB.splits(text, label)

    # build the vocabulary
    text.build_vocab(train, vectors=GloVe(name='6B', dim=300))
    label.build_vocab(train)

    # print vocab information
    print('len(TEXT.vocab)', len(text.vocab))
    print('TEXT.vocab.vectors.size()', text.vocab.vectors.size())

The complete code is in my GitHub repository: https://github.com/atnlp/torchtext-summary
For other torchtext usage, see my blog: http://www.nlpuser.com/pytorch/2018/10/30/useTorchText/
Original work; do not repost without permission.
Author: nlpuser. Source: CSDN. Original post: https://blog.csdn.net/nlpuser/article/details/83627709
Copyright notice: this is an original article by the author; please include a link to the original post when reposting.
