[NLP-CNN] Convolutional Neural Networks for Sentence Classification -2014-EMNLP

Date: 2019-12-13

1. Overview

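This paper applies CNNs to sentence classification.

(1) Static vectors + CNN already achieve very good results. => This suggests that pre-trained vectors are universal feature extractors that can be used across a variety of classification tasks.

(2) Vectors fine-tuned for the specific task + CNN achieve even better results.

(3) The model architecture is modified so that both task-specific and static vectors can be used.

(4) State-of-the-art results on 4 of the 7 tasks.

Thoughts: the core idea of a convolutional neural network is to capture local features. In images, pixels are locally correlated, so a CNN is a well-suited feature extractor. In NLP, an n-gram of a text can be viewed as a window, a fragment whose words share related features, and the hope is that a CNN captures the local features inside this sliding window. The strength of a CNN is that it can combine and filter such n-gram features to obtain semantic information at different levels of abstraction.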

2. Model

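Three points about the model deserve attention:

1. How the CNN is applied, i.e. how convolution is used on text

2. How static and fine-tuned vectors are combined in one architecture

3. The regularization strategy

The approach of the paper is fairly simple.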

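2.1 Applying the CNN

<1> Obtaining the feature map

Each word vector is k-dimensional and the sentence length is n (padded where necessary). The sentence is represented as the simple concatenation of its word vectors, $x_{1:n} = x_1 \oplus x_2 \oplus \dots \oplus x_n$, which forms the leftmost rectangle in Figure 1.

A convolution filter is applied to a window of h words. The words inside a window of size h form an h * k dimensional representation $x_{i:i+h-1}$, so a filter $w$ of the same h * k dimensionality is defined and convolved with it.

A bias $b$ is then added and a nonlinearity $f$ applied, which gives the feature extracted by the CNN: $c_i = f(w \cdot x_{i:i+h-1} + b)$.

This $c_i$ is the feature produced by one filter on one window. For a sentence of length n, the window can slide over n - h + 1 positions, which yields a feature map $c = [c_1, c_2, \dots, c_{n-h+1}]$.

Clearly, $c$ has n - h + 1 dimensions. This is the feature map of a single filter. To extract several kinds of features, multiple filters can be used, and their sizes may differ, i.e. h may differ.

This process corresponds to the two leftmost shapes in Figure 1.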
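The windowed convolution described above can be sketched in plain Python. This is only a minimal illustration, not the paper's implementation: the function name `feature_map`, the toy 2-dimensional word vectors and the filter values are assumptions made for the example, and ReLU is used as the nonlinearity $f$.

```python
def relu(v):
    return max(0.0, v)

def feature_map(sentence, w, b):
    """Slide one filter over a sentence and return its feature map.

    sentence: list of n word vectors, each of length k
    w:        filter weights, a list of h rows of length k
    b:        scalar bias
    """
    n, h = len(sentence), len(w)
    c = []
    for i in range(n - h + 1):  # n - h + 1 window positions
        window = sentence[i:i + h]
        s = sum(w[r][d] * window[r][d]
                for r in range(h) for d in range(len(w[r])))
        c.append(relu(s + b))
    return c

# Toy sentence: n = 5 words, k = 2 dimensions (made-up values).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [-1.0, 0.5]]
w = [[1.0, -1.0], [0.5, 0.5]]  # one filter with window size h = 2
b = 0.0

c = feature_map(sentence, w, b)
print(len(c))  # 4, i.e. n - h + 1
print(c)       # [1.5, 0.0, 0.0, 1.25]
```

Using several filters with different h simply means calling `feature_map` once per filter, each call returning a feature map of its own length.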

<2> max pooling

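The max pooling used here is called max-over-time pooling. The "over time" refers to the following: as the figure shows, the largest value of each feature map is selected and placed in the pooled vector. Each feature map was obtained by convolving along the sliding window, i.e. along the text sequence. So for each filter, max pooling returns the response of one particular window (one subsequence of the sentence) whose value is larger than that of every other window. The sentence is ordered, and so are the sliding windows, hence "over time".

This is written $\hat{c} = \max\{c\}$, the maximum value for the corresponding filter.

<3> Fully connected layer

A fully connected softmax layer then maps the information extracted by the previous layers onto the classes, yielding a probability distribution over them.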
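Pooling and the final layer can be sketched together as follows. This is a rough illustration under stated assumptions: the helper names `max_over_time`, `linear` and `softmax`, the three feature maps and the weight values are all hypothetical.

```python
import math

def max_over_time(feature_maps):
    """Keep the single largest activation of each feature map."""
    return [max(c) for c in feature_maps]

def linear(z, weights, bias):
    """One logit per class: weights[j] . z + bias[j]."""
    return [sum(wj * zj for wj, zj in zip(row, z)) + bj
            for row, bj in zip(weights, bias)]

def softmax(logits):
    """Turn logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Feature maps of three filters with h = 3, 4, 5 on a sentence of n = 7
# words, so their lengths are 5, 4 and 3 (made-up values).
feature_maps = [
    [0.1, 0.7, 0.0, 0.3, 0.2],
    [0.0, 0.4, 0.9, 0.1],
    [0.5, 0.2, 0.6],
]
z = max_over_time(feature_maps)
print(z)  # [0.7, 0.9, 0.6]

weights = [[1.0, 0.0, -1.0],  # class 0
           [0.0, 1.0, 1.0]]   # class 1
bias = [0.0, 0.0]
probs = softmax(linear(z, weights, bias))
print(probs)
```

Because each filter contributes exactly one value, the pooled vector has a fixed size no matter how long the sentence is, which is what lets a fixed-size fully connected layer follow it.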

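2.2 Combining static and fine-tuned vectors

In the paper, the sentence is represented with static and with fine-tuned vectors as two channels. As the leftmost part of Figure 1 shows, there are two matrices, which are the sentence representations built from the static and the fine-tuned vectors respectively. For example, the first row of the front matrix is the static vector of the word "wait", and the first row of the back matrix is the fine-tuned vector of "wait".

How is the information from the two combined?

The strategy in the paper is simple: the same filter is applied to both channels, and the two resulting values are added together and placed in the feature map. The feature maps in Figure 1 therefore combine the information extracted from both kinds of vectors.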
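A minimal sketch of this two-channel combination, assuming the two responses are summed before the bias and the nonlinearity are applied. The function names and the toy vectors are hypothetical.

```python
def window_response(window, w):
    """Dot product between an h x k window and an h x k filter."""
    return sum(wv * xv
               for w_row, x_row in zip(w, window)
               for wv, xv in zip(w_row, x_row))

def multichannel_feature_map(static_sent, tuned_sent, w, b):
    """Apply one filter to both channels and add the responses."""
    n, h = len(static_sent), len(w)
    c = []
    for i in range(n - h + 1):
        s = (window_response(static_sent[i:i + h], w)
             + window_response(tuned_sent[i:i + h], w))
        c.append(max(0.0, s + b))  # ReLU nonlinearity
    return c

# The same 3-word sentence in both channels (made-up 2-d vectors).
static_sent = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tuned_sent = [[0.5, 0.5], [0.0, 2.0], [1.0, 0.0]]
w = [[1.0, 1.0], [1.0, -1.0]]  # shared filter, h = 2

c = multichannel_feature_map(static_sent, tuned_sent, w, b=0.0)
print(c)  # [0.0, 4.0]
```

The feature map has the same length as in the single-channel case, so pooling and the output layer need no changes.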

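2.3 Regularization

To prevent the co-adaptation problem, Hinton et al. proposed dropout. In this paper, this regularization is applied to the penultimate layer, i.e. the output of max pooling.

Suppose there are m feature maps, and write $z = [\hat{c}_1, \dots, \hat{c}_m]$.

Without dropout, the output unit after the linear mapping is: $y = w \cdot z + b$

With dropout, it becomes: $y = w \cdot (z \circ r) + b$

Here $\circ$ is element-wise multiplication and $r$ is an m-dimensional mask vector whose entries are either 0 or 1, each drawn from a Bernoulli distribution with probability p of being 1.

During backpropagation, gradients flow only through the units whose mask is 1.

At test time, the learned weights w are scaled by p, i.e. $\hat{w} = pw$, and $\hat{w}$ is used without dropout to score data not seen during training.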

In addition, an $l_2$ constraint can be placed on $w$: whenever $||w||_2 > s$ after a gradient descent step, $w$ is rescaled so that $||w||_2 = s$.
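The norm constraint amounts to a rescaling step after each update. A minimal sketch, with `clip_l2_norm` as a hypothetical helper name:

```python
import math

def clip_l2_norm(w, s):
    """Rescale w so that its l2 norm is at most s."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm > s:
        return [wi * s / norm for wi in w]
    return list(w)

w = [3.0, 4.0]                        # l2 norm is 5
clipped = clip_l2_norm(w, s=3.0)      # norm exceeds s: rescaled to norm 3
unchanged = clip_l2_norm(w, s=10.0)   # norm below s: left as is
print(clipped)
print(unchanged)
```

Only the length of the vector changes; its direction is preserved.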

3. Datasets and Experimental Setup

3.1 Datasets:

1. MR: Movie reviews with one sentence per review. The task is to classify reviews as positive or negative (Pang and Lee, 2005).

2. SST-1: Stanford Sentiment Treebank, an extension of MR but with train/dev/test splits provided and fine-grained labels (very positive, positive, neutral, negative, very negative), re-labeled by Socher et al. (2013).

3. SST-2: Same as SST-1 but with neutral reviews removed and binary labels.

4. Subj: Subjectivity dataset where the task is to classify a sentence as being subjective or objective (Pang and Lee, 2004)

5. TREC: TREC question dataset—task involves classifying a question into 6 question types (whether the question is about person, location, numeric information, etc.) (Li and Roth, 2002)

6. CR: Customer reviews of various products (cameras, MP3s etc.). Task is to predict positive/negative reviews (Hu and Liu, 2004)

7. MPQA: Opinion polarity detection subtask of the MPQA dataset (Wiebe et al., 2005).

3.2 Hyperparameters and Training

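Activation function: ReLU

window (h): 3, 4, 5, with 100 feature maps each

dropout p = 0.5

l2 constraint s = 3

mini-batch size = 50

The hyperparameters above were chosen by grid search on the SST-2 dev set.

Training uses stochastic gradient descent over shuffled mini-batches with the Adadelta update rule.

For datasets without a standard dev set, 10% of the training data is randomly selected as the dev set.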
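Holding out 10% of the training data can be sketched as below. The helper `split_train_dev`, its `seed` parameter and the toy examples are assumptions for illustration, not taken from the paper's code.

```python
import random

def split_train_dev(examples, dev_fraction=0.1, seed=0):
    """Randomly hold out a fraction of the training data as a dev set."""
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    n_dev = int(len(examples) * dev_fraction)
    dev = [examples[i] for i in indices[:n_dev]]
    train = [examples[i] for i in indices[n_dev:]]
    return train, dev

# 50 made-up (sentence, label) pairs.
examples = [("sentence %d" % i, i % 2) for i in range(50)]
train, dev = split_train_dev(examples)
print(len(train), len(dev))  # 45 5
```

Fixing the seed keeps the split reproducible across runs, which matters when comparing hyperparameter settings on the dev set.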

3.3 Pre-trained Word Vectors

word2vec vectors that were trained on 100 billion words from Google News

3.4 Model Variations

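The model variants in the paper mainly test how the initialization of the word vectors affects performance.

CNN-rand: all word vectors are randomly initialized and then modified during training

CNN-static: initialized with pre-trained word2vec vectors, which are kept fixed

CNN-non-static: initialized with pre-trained vectors, which are fine-tuned for each task

CNN-multichannel: combines static and fine-tuned vectors, each as one channel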
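One way to summarize the four variants is as a small configuration table recording, for each embedding channel, whether it starts from pre-trained vectors and whether it is updated during training. The `Channel` class and the `VARIANTS` dictionary are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Channel:
    pretrained: bool  # initialized from word2vec rather than randomly
    trainable: bool   # updated by backpropagation during training

VARIANTS = {
    "CNN-rand": [Channel(pretrained=False, trainable=True)],
    "CNN-static": [Channel(pretrained=True, trainable=False)],
    "CNN-non-static": [Channel(pretrained=True, trainable=True)],
    "CNN-multichannel": [
        Channel(pretrained=True, trainable=False),  # static channel
        Channel(pretrained=True, trainable=True),   # fine-tuned channel
    ],
}

for name, channels in VARIANTS.items():
    print(name, channels)
```

Read this way, CNN-multichannel is simply CNN-static and CNN-non-static side by side, sharing the same filters.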

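Results: compared with the fully random CNN-rand, the other three variants improve on all 7 datasets.

Moreover, the simple CNN model proposed in the paper performs very close to more complex models, for example those that rely on parse trees. It reaches state-of-the-art results on SST-2 and CR, among others.

The multichannel architecture was proposed in the hope of improving results by preventing overfitting, but the experiments do not show a clear advantage: on some datasets it performs worse than the other variants.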

4. Code

Theano: 1. Implementation from the paper: yoonkim/CNN_sentence: https://github.com/yoonkim/CNN_sentence

Tensorflow: 2. dennybritz/cnn-text-classification-tf: https://github.com/dennybritz/cnn-text-classification-tf

Keras: 3. alexander-rakhlin/CNN-for-Sentence-Classification-in-Keras: https://github.com/alexander-rakhlin/CNN-for-Sentence-Classification-in-Keras

Pytorch: 4. Shawn1993/cnn-text-classification-pytorch: https://github.com/Shawn1993/cnn-text-classification-pytorch

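On MR, the best eval accuracy I obtained was 73%, lower than the 77.5% reported on GitHub and the 76.1% in the paper.

On SST, the best eval accuracy I obtained was 37%, lower than the 37.2% reported on GitHub and the 45.0% in the paper.

The code of model.py is shown here: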

import torch
import torch.nn as nn
import torch.nn.functional as F


class CNN_Text(nn.Module):

    def __init__(self, args):
        super(CNN_Text, self).__init__()
        self.args = args

        V = args.embed_num      # vocabulary size
        D = args.embed_dim      # embedding dimension (k in the paper)
        C = args.class_num      # number of classes
        Ci = 1                  # number of input channels
        Co = args.kernel_num    # feature maps per window size
        Ks = args.kernel_sizes  # window sizes h, e.g. [3, 4, 5]

        self.embed = nn.Embedding(V, D)
        # nn.ModuleList (not a plain list) so that the conv parameters are registered
        self.convs1 = nn.ModuleList([nn.Conv2d(Ci, Co, (K, D)) for K in Ks])
        self.dropout = nn.Dropout(args.dropout)
        self.fc1 = nn.Linear(len(Ks) * Co, C)

    def conv_and_pool(self, x, conv):
        x = F.relu(conv(x)).squeeze(3)  # (N, Co, W - K + 1)
        x = F.max_pool1d(x, x.size(2)).squeeze(2)  # max over time: (N, Co)
        return x

    def forward(self, x):
        x = self.embed(x)  # (N, W, D)

        if self.args.static:
            x = x.detach()  # static vectors: no gradient reaches the embedding

        x = x.unsqueeze(1)  # (N, Ci, W, D)

        # one pooled vector per window size: [(N, Co), ...] * len(Ks)
        x = [self.conv_and_pool(x, conv) for conv in self.convs1]

        x = torch.cat(x, 1)  # (N, len(Ks) * Co)

        x = self.dropout(x)  # (N, len(Ks) * Co)
        logit = self.fc1(x)  # (N, C)
        return logit

Pytorch 5. prakashpandey9/Text-Classification-Pytorch: https://github.com/prakashpandey9/Text-Classification-Pytorch

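Note that the CNN part under models in this repository is a simple implementation of the paper, but its main.py needs to be modified.

The dataset used is IMDB, whose labels are 1 and 2, whereas PyTorch requires the target to satisfy 0 <= t < n_classes when computing the loss. The labels (1, 2) therefore have to be mapped into {0, 1}; otherwise the following error is raised: "Assertion `t >= 0 && t < n_classes` failed."

This can be done by changing label 2 to 0:

target = (target != 2)
target = target.long()

Because the loss function used in this code is cross_entropy, the target must be converted to the long type.

For convenience, the complete modified main.py is shown here; the hyperparameters in it can be changed as needed.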

import load_data
import numpy as np
import torch
import torch.nn.functional as F
from models.LSTM import LSTMClassifier
from models.CNN import CNN

TEXT, vocab_size, word_embeddings, train_iter, valid_iter, test_iter = load_data.load_dataset()

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')


def clip_gradient(model, clip_value):
    # Clamp every gradient element to [-clip_value, clip_value].
    params = list(filter(lambda p: p.grad is not None, model.parameters()))
    for p in params:
        p.grad.data.clamp_(-clip_value, clip_value)


def train_model(model, train_iter, epoch):
    total_epoch_loss = 0
    total_epoch_acc = 0

    optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=learning_rate)
    steps = 0
    model.train()
    for idx, batch in enumerate(train_iter):
        text = batch.text[0]
        target = batch.label
        # Map the labels (1, 2) into {0, 1}; otherwise cross_entropy fails with
        # "Assertion `t >= 0 && t < n_classes` failed."
        target = (target != 2)
        target = target.long()

        text = text.to(device)
        target = target.to(device)

        # One of the batches returned by BucketIterator has a length different from batch_size.
        if text.size()[0] != batch_size:
            continue
        optimizer.zero_grad()
        prediction = model(text)
        loss = loss_fn(prediction, target)

        num_corrects = (torch.max(prediction, 1)[1].view(target.size()).data == target.data).float().sum()
        acc = 100.0 * num_corrects / len(batch)

        loss.backward()
        clip_gradient(model, 1e-1)
        optimizer.step()
        steps += 1

        if steps % 100 == 0:
            print(f'Epoch: {epoch+1}, Idx: {idx+1}, Training Loss: {loss.item():.4f}, Training Accuracy: {acc.item(): .2f}%')

        total_epoch_loss += loss.item()
        total_epoch_acc += acc.item()

    return total_epoch_loss / len(train_iter), total_epoch_acc / len(train_iter)


def eval_model(model, val_iter):
    total_epoch_loss = 0
    total_epoch_acc = 0
    model.eval()
    with torch.no_grad():
        for idx, batch in enumerate(val_iter):
            text = batch.text[0]
            if text.size()[0] != batch_size:
                continue
            target = batch.label
            # Same label remapping as in training.
            target = (target != 2)
            target = target.long()

            text = text.to(device)
            target = target.to(device)

            prediction = model(text)
            loss = loss_fn(prediction, target)
            num_corrects = (torch.max(prediction, 1)[1].view(target.size()).data == target.data).float().sum()
            acc = 100.0 * num_corrects / len(batch)
            total_epoch_loss += loss.item()
            total_epoch_acc += acc.item()

    return total_epoch_loss / len(val_iter), total_epoch_acc / len(val_iter)


learning_rate = 1e-3
batch_size = 32
output_size = 2
# hidden_size = 256
embedding_length = 300

# model = LSTMClassifier(batch_size, output_size, hidden_size, vocab_size, embedding_length, word_embeddings)

model = CNN(batch_size=batch_size, output_size=output_size, in_channels=1, out_channels=100, kernel_heights=[3, 4, 5], stride=1, padding=0, keep_probab=0.5, vocab_size=vocab_size, embedding_length=embedding_length, weights=word_embeddings)
model.to(device)

loss_fn = F.cross_entropy

for epoch in range(1):
    train_loss, train_acc = train_model(model, train_iter, epoch)
    val_loss, val_acc = eval_model(model, valid_iter)

    print(f'Epoch: {epoch+1:02}, Train Loss: {train_loss:.3f}, Train Acc: {train_acc:.2f}%, Val. Loss: {val_loss:.3f}, Val. Acc: {val_acc:.2f}%')

test_loss, test_acc = eval_model(model, test_iter)
print(f'Test Loss: {test_loss:.3f}, Test Acc: {test_acc:.2f}%')

# Let us now predict the sentiment on a single sentence just for the testing purpose.
test_sen1 = "This is one of the best creation of Nolan. I can say, it's his magnum opus. Loved the soundtrack and especially those creative dialogues."
test_sen2 = "Ohh, such a ridiculous movie. Not gonna recommend it to anyone. Complete waste of time and money."

test_sen1 = TEXT.preprocess(test_sen1)
test_sen1 = [[TEXT.vocab.stoi[x] for x in test_sen1]]

test_sen2 = TEXT.preprocess(test_sen2)
test_sen2 = [[TEXT.vocab.stoi[x] for x in test_sen2]]

test_sen = np.asarray(test_sen2)
test_tensor = torch.LongTensor(test_sen).to(device)

model.eval()
with torch.no_grad():
    output = model(test_tensor, 1)
out = F.softmax(output, 1)

# After the remapping above, class 0 is the original label 2.
if torch.argmax(out[0]) == 0:
    print("Sentiment: Positive")
else:
    print("Sentiment: Negative")
