Tuning CNN and LSTM Neural Network Models in PyTorch: A Summary

Date: 2019-11-12

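(Demo)

This is a short summary of my work over the past two months. The demo code is on GitHub and includes implementations of CNN, LSTM, BiLSTM, GRU, combinations of CNN with LSTM and BiLSTM, and multi-layer, multi-channel CNN, LSTM, and BiLSTM models. This post covers the problems I ran into recently, how I handled them, the strategies involved, and a bit of experience (not that I have much; I am still a beginner).
Demo Site: https://github.com/bamtercelboo/cnn-lstm-bilstm-deepcnn-clstm-in-pytorch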

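(1) A Brief Introduction to PyTorch

PyTorch is a relatively new, Python-first deep learning framework that provides tensors and dynamic neural networks with strong GPU acceleration.
Beginners who have not used PyTorch can start with the official 60-minute tutorial: http://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html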

(2) CNN and LSTM

For an introduction to convolutional neural networks (CNN), see https://www.zybuluo.com/hanbingtao/note/485480
For an introduction to long short-term memory networks (LSTM), see https://zybuluo.com/hanbingtao/note/581764

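(3) Data Preprocessing

1. The corpus I use is already fairly clean (see the example below), but loading it still needs some preprocessing: letter case, digits, and characters such as "\n" and "\t". I currently use the third-party torchtext library to load and preprocess the data.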

You Should Pay Nine Bucks for This : Because you can hear about suffering Afghan refugees on the news and still be unaffected . ||| 2
Dramas like this make it human . ||| 4
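
The two sample lines above follow a "sentence ||| label" layout. As a rough illustration of reading that layout, the sketch below splits a line on the separator and converts the label to an integer. The helper name parse_line is hypothetical and is not part of the demo repository.

```python
def parse_line(line, sep="|||"):
    """Split 'sentence ||| label' into (sentence, int label).

    Hypothetical helper; assumes the label is the text after the last separator.
    """
    text, _, label = line.rpartition(sep)
    if not text.strip():
        raise ValueError("missing separator or empty sentence: {!r}".format(line))
    return text.strip(), int(label)


sample = "Dramas like this make it human . ||| 4"
sentence, label = parse_line(sample)
print(sentence, label)
```

Splitting on the last separator keeps the sentence intact even if the text itself happens to contain "|||".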

2. Build the vocabulary with torchtext and lowercase the corpus:

import torchtext.data as data
# lower word
text_field = data.Field(lower=True)

3. Handle digits and other special characters in the corpus:

import re

from torchtext import data


def clean_str(string):
    # replace every character outside the allowed set with a space
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
    # split contractions into separate tokens
    string = re.sub(r"\'s", " 's", string)
    string = re.sub(r"\'ve", " 've", string)
    string = re.sub(r"n\'t", " n't", string)
    string = re.sub(r"\'re", " 're", string)
    string = re.sub(r"\'d", " 'd", string)
    string = re.sub(r"\'ll", " 'll", string)
    # put spaces around punctuation
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\(", " ( ", string)
    string = re.sub(r"\)", " ) ", string)
    string = re.sub(r"\?", " ? ", string)
    # collapse repeated whitespace
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip()


text_field.preprocessing = data.Pipeline(clean_str)

4. Points to note:

The data can be shuffled with random when loading the dataset:

import random

if shuffle:
    random.shuffle(examples_train)
    random.shuffle(examples_dev)
    random.shuffle(examples_test)

When torchtext builds the iterators for the training, development, and test sets, you can choose whether the data is shuffled between epochs:

class Iterator(object):
    """Defines an iterator that loads batches of data from a Dataset.

    Attributes:
        dataset: The Dataset object to load Examples from.
        batch_size: Batch size.
        sort_key: A key to use for sorting examples in order to batch together
            examples with similar lengths and minimize padding. The sort_key
            provided to the Iterator constructor overrides the sort_key
            attribute of the Dataset, or defers to it if None.
        train: Whether the iterator represents a train set.
        repeat: Whether to repeat the iterator for multiple epochs.
        shuffle: Whether to shuffle examples between epochs.
        sort: Whether to sort examples according to self.sort_key.
            Note that repeat, shuffle, and sort default to train, train, and
            (not train).
        device: Device to create batches on. Use -1 for CPU and None for the
            currently active GPU device.
    """
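
The idea behind sort_key in the docstring above, grouping examples of similar length so that each batch needs little padding, can be sketched in plain Python. This illustrates the concept only and is not torchtext's implementation; make_batches and its parameters are hypothetical names.

```python
import random


def make_batches(examples, batch_size, sort_key=len, shuffle=True, seed=0):
    """Group examples of similar length, then optionally shuffle batch order."""
    ordered = sorted(examples, key=sort_key)
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    if shuffle:
        # Shuffle whole batches so that similar lengths stay together.
        random.Random(seed).shuffle(batches)
    return batches


sentences = [s.split() for s in [
    "a b c d e f",
    "a",
    "a b c",
    "a b",
    "a b c d e",
    "a b c d",
]]
for batch in make_batches(sentences, batch_size=2):
    print([len(s) for s in batch])
```

Shuffling batches rather than individual examples keeps the padding benefit while still varying the order in which the model sees the data.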

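(4) Word Embedding

1. Put simply, a word embedding maps each word in the corpus to its word vector. The most widely used way to train word vectors today is probably word2vec (see http://www.cnblogs.com/bamtercelboo/p/7181899.html).

2. The vocabulary has already been built with torchtext above. There are two ways to obtain word vectors: load external pretrained vectors, or initialize them randomly. Of the two, pretrained vectors usually work much better. Vectors you train yourself will not necessarily work well, because your corpus may be too small. Among published vectors, the Google News vectors are widely recognized in the field, but the file is very large, so if your hardware (GPU) is not up to it, do not attempt it.

3. Some places to download pretrained word vectors:

word2vec-GoogleNews-vectors (https://github.com/mmihaltz/word2vec-GoogleNews-vectors)
glove-vectors (https://nlp.stanford.edu/projects/glove/)

4. Loading external word vectors

Load the vectors for the vocabulary words that can be found in the pretrained file: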

# load word embedding
def load_my_vecs(path, vocab, freqs):
    word_vecs = {}
    with open(path, encoding="utf-8") as f:
        next(f, None)  # skip the first line (a header in word2vec text format)
        for line in f:
            values = line.rstrip().split(" ")
            word = values[0]
            # word = word.lower()
            if word in vocab:  # keep only words that appear in the vocabulary
                word_vecs[word] = [float(val) for val in values[1:]]
    return word_vecs

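Words in the vocabulary that cannot be found in the pretrained vectors are commonly called OOV (out of vocabulary). The more OOV words there are, the larger their likely impact on the results, so handling them well matters. Several strategies are worth considering:
Average the vectors that were found: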

# solve unknown by avg word embedding
def add_unknown_words_by_avg(word_vecs, vocab, k=100):
    # collect the vectors of vocabulary words that have a pretrained vector
    word_vecs_numpy = []
    for word in vocab:
        if word in word_vecs:
            word_vecs_numpy.append(word_vecs[word])
    print(len(word_vecs_numpy))
    # average each of the k dimensions over the vectors that were found
    avg_vec = []
    for i in range(k):
        total = 0.0
        for vec in word_vecs_numpy:
            total += vec[i]
        avg_vec.append(round(total / len(word_vecs_numpy), 6))

    list_word2vec = []
    oov = 0
    iov = 0
    for word in vocab:
        if word not in word_vecs:
            # word_vecs[word] = np.random.uniform(-0.25, 0.25, k).tolist()
            # word_vecs[word] = [0.0] * k
            oov += 1
            word_vecs[word] = avg_vec
            list_word2vec.append(word_vecs[word])
        else:
            iov += 1
            list_word2vec.append(word_vecs[word])
    print("oov count", oov)
    print("iov count", iov)
    return list_word2vec
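
The averaging step can also be written more compactly. The sketch below is one possible reformulation using only the standard library; mean_vector and fill_oov_with_mean are hypothetical names, and unlike the function above this version leaves word_vecs unmodified and does no rounding.

```python
def mean_vector(vectors):
    """Return the element-wise mean of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def fill_oov_with_mean(word_vecs, vocab):
    """Return one vector per vocab word; OOV words get a copy of the mean."""
    avg = mean_vector([word_vecs[w] for w in vocab if w in word_vecs])
    return [list(word_vecs.get(w, avg)) for w in vocab]


word_vecs = {"good": [1.0, 2.0], "bad": [3.0, 4.0]}
vocab = ["good", "unseen", "bad"]
print(fill_oov_with_mean(word_vecs, vocab))
```

Here zip(*vectors) walks the vectors dimension by dimension, which replaces the explicit double loop over indices.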

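Random initialization or all zeros: either a single vector is shared by every OOV word, or each OOV word gets its own random vector. Which works better depends on your own results.
For the range of the random values, I have seen a few papers that initialize in (-0.25, 0.25) or (-0.1, 0.1). You can test this yourself; the effect probably differs across datasets and pretrained vectors. In my tests, 0.25 worked better than 0.1.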

import numpy as np


# solve unknown word by uniform(-0.25,0.25)
def add_unknown_words_by_uniform(word_vecs, vocab, k=100):
    list_word2vec = []
    oov = 0
    iov = 0
    # uniform = np.random.uniform(-0.25, 0.25, k).round(6).tolist()
    for word in vocab:
        if word not in word_vecs:
            oov += 1
            # each OOV word gets its own random vector
            word_vecs[word] = np.random.uniform(-0.25, 0.25, k).round(6).tolist()
            # word_vecs[word] = np.random.uniform(-0.1, 0.1, k).round(6).tolist()
            # word_vecs[word] = uniform
            list_word2vec.append(word_vecs[word])
        else:
            iov += 1
            list_word2vec.append(word_vecs[word])
    print("oov count", oov)
    print("iov count", iov)
    return list_word2vec

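Pay particular attention to whether the OOV vectors fall within a reasonable range after processing. Always check this afterwards, by hand or in code. If values above 15 or 30 show up, the processing method is probably at fault, or the demo has a bug, and the effect on the results is large.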
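
A check like this is easy to automate. The following is a minimal sketch, assuming the vectors are plain lists of floats as returned by the helpers above; check_vector_range is a hypothetical name and the default limit of 5.0 is an arbitrary threshold chosen for illustration.

```python
def check_vector_range(vectors, limit=5.0):
    """Return (min, max, indices of vectors with any |value| > limit)."""
    lo = min(min(v) for v in vectors)
    hi = max(max(v) for v in vectors)
    suspicious = [i for i, v in enumerate(vectors)
                  if any(abs(x) > limit for x in v)]
    return lo, hi, suspicious


vectors = [[0.1, -0.2], [0.05, 0.24], [17.3, -0.1]]
lo, hi, suspicious = check_vector_range(vectors)
print("min", lo, "max", hi, "suspicious rows", suspicious)
```

Running it once after building the embedding list makes out-of-range rows visible before training starts.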

5. Using external word vectors in the model

if args.word_Embedding:
    pretrained_weight = np.array(args.pretrained_weight)
    self.embed.weight.data.copy_(torch.from_numpy(pretrained_weight))
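
For readers who want to try this outside the demo's model class, here is a small self-contained sketch of the same copy step. The random matrix stands in for real pretrained vectors, and the fine_tune flag is an assumption added for illustration; it is not taken from the demo.

```python
import numpy as np
import torch
import torch.nn as nn

# Stand-in for the list returned by the OOV helpers: one row per vocab word.
pretrained_weight = np.random.uniform(-0.25, 0.25, (5, 4)).astype(np.float32)

V, D = pretrained_weight.shape
embed = nn.Embedding(V, D)
embed.weight.data.copy_(torch.from_numpy(pretrained_weight))

# Assumed flag: set to False to keep the vectors fixed during training.
fine_tune = True
embed.weight.requires_grad = fine_tune

ids = torch.tensor([[0, 2, 4]])
print(embed(ids).shape)  # torch.Size([1, 3, 4])
```

The array must have the same shape and dtype family as the embedding weight, which is why the stand-in matrix is created as float32 with V rows and D columns.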

(5) Parameter Initialization

nn.Conv2d() in PyTorch has a weight and a bias. Initializing the weight is well worth doing; skipping it can slow convergence and hurt the final result.
To initialize the weight you can typically use torch.nn.init.uniform(), torch.nn.init.normal(), or torch.nn.init.xavier_uniform(); see http://pytorch.org/docs/master/nn.html#torch-nn-init

init.xavier_normal(conv.weight.data, gain=np.sqrt(args.init_weight_value))
init.uniform(conv.bias, 0, 0)

nn.LSTM() in PyTorch has an all_weights attribute that holds the weights and biases as a nested list, with one inner list per layer and direction:

if args.init_weight:
    print("Initing W .......")
    init.xavier_normal(self.bilstm.all_weights[0][0], gain=np.sqrt(args.init_weight_value))
    init.xavier_normal(self.bilstm.all_weights[0][1], gain=np.sqrt(args.init_weight_value))
    init.xavier_normal(self.bilstm.all_weights[1][0], gain=np.sqrt(args.init_weight_value))
    init.xavier_normal(self.bilstm.all_weights[1][1], gain=np.sqrt(args.init_weight_value))
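
Indexing all_weights by hand gets tedious with more layers. A possible alternative, sketched below, loops over named_parameters() instead. It uses the trailing-underscore initializers (xavier_normal_, constant_), which are the current spelling in PyTorch; the gain value and the choice to zero the biases are assumptions for illustration, not settings from the demo.

```python
import math

import torch.nn as nn
from torch.nn import init

bilstm = nn.LSTM(input_size=8, hidden_size=6, num_layers=1, bidirectional=True)

gain = math.sqrt(2.0)  # assumed value; the post reads it from args.init_weight_value
for name, param in bilstm.named_parameters():
    if "weight" in name:
        # input-to-hidden and hidden-to-hidden matrices, both directions
        init.xavier_normal_(param, gain=gain)
    elif "bias" in name:
        init.constant_(param, 0.0)

print(sorted(name for name, _ in bilstm.named_parameters()))
```

The printed names show which tensors exist, which helps when deciding whether to treat input and recurrent weights differently.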

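(6) Tuning and Strategies

Neural network hyperparameter settings
Kernel size in CNN: the paper "A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification" tests different kernel sizes. Based on its results, most settings are combinations of sizes between 1 and 10; the best choice still depends on your own task.
Kernel num in CNN: the number of feature maps per convolution window, usually set between 100 and 600. I generally use 200 or 300.
Dropout: most papers set dropout to 0.5, which is said to work well against overfitting. Different tasks still need the value adjusted, though. Besides the value, where dropout is placed in the model also matters a great deal; trying different positions may give surprisingly good results.
Batch size: this needs some tuning. Judging by related blog posts, it is usually no more than 128 and can be quite small. In my current task, batch size = 16 works well.
Learning rate: the usual initial value differs between optimizers. There are said to be some classic settings, such as lr = 0.001 for Adam.
Number of epochs: set according to your task, model, convergence speed, and how well the model fits.
Hidden size in LSTM: the hidden dimension of the LSTM also affects the results. With 300-dimensional external word vectors, consider hidden size = 150 or 300. The largest I have used is 600, which was already very slow to train on my hardware. If your hardware allows, try larger values, but keep the relationship between hidden size and the word vector dimension in mind (I believe the two are related).
L2-norm constraint: the max_norm and norm_type parameters of PyTorch's Embedding implement this constraint.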

if args.max_norm is not None:
    print("max_norm = {} ".format(args.max_norm))
    self.embed = nn.Embedding(V, D, max_norm=args.max_norm)

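PyTorch implements L2 regularization, also called weight decay, in the optimizer through the weight_decay parameter (L1 regularization in PyTorch has been deprecated, but you can implement it yourself). It is usually set to 1e-8.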

if args.Adam is True:
    print("Adam Training......")
    optimizer = torch.optim.Adam(model.parameters(), lr=args.lr, weight_decay=args.init_weight_decay)
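
To make the effect of weight_decay concrete, the sketch below works through the arithmetic for a single update: the L2 penalty adds weight_decay * w to each gradient before the step. Plain SGD is used here for simplicity (Adam applies its own scaling to the combined gradient afterwards), and sgd_step_with_weight_decay is a hypothetical name.

```python
def sgd_step_with_weight_decay(w, grad, lr, weight_decay):
    """One plain SGD step where the L2 penalty is folded into the gradient."""
    return [wi - lr * (gi + weight_decay * wi) for wi, gi in zip(w, grad)]


w = [1.0, -2.0]
grad = [0.0, 0.0]  # zero data gradient isolates the effect of the decay term
print(sgd_step_with_weight_decay(w, grad, lr=0.1, weight_decay=0.5))
```

With a zero data gradient, each weight simply shrinks toward zero in proportion to its size, which is where the name weight decay comes from.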

Vanishing and exploding gradients (exploding gradients are handled here by clipping the gradient norm)

import torch.nn.utils as utils

if args.init_clip_max_norm is not None:
    utils.clip_grad_norm(model.parameters(),
                         max_norm=args.init_clip_max_norm)
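
What clipping by norm computes can be written out in a few lines of plain Python: if the joint L2 norm of all gradients exceeds max_norm, every gradient is scaled by the same factor. This is a sketch of the arithmetic under that assumption, not PyTorch's code; clip_by_global_norm and the eps default are hypothetical.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale a list of gradient vectors so their joint L2 norm is <= max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = max_norm / (total_norm + eps)
    if scale < 1.0:
        grads = [[g * scale for g in vec] for vec in grads]
    return grads, total_norm


grads = [[3.0, 4.0], [12.0]]  # joint norm is 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=6.5)
print(norm_before, clipped)
```

Because all gradients share one scale factor, the update direction is preserved and only its length is limited.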

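Strategies for improving accuracy
Data preprocessing: when building the vocabulary you can drop words with frequency 1. This is itself a hyperparameter. If accuracy falls after dropping them, try dropping them with some probability, or handling their vectors differently with some probability.
Dynamic learning rate: PyTorch 0.2, the latest version at the time of writing, already supports adjusting the learning rate dynamically; see http://pytorch.org/docs/master/optim.html#how-to-adjust-learning-rate
Batch normalization: PyTorch provides BatchNorm1d() and BatchNorm2d() for direct use. Their momentum parameter can be tuned as a hyperparameter.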

if args.batch_normalizations is True:
    print("using batch_normalizations in the model......")
    self.convs1_bn = nn.BatchNorm2d(num_features=Co, momentum=args.bath_norm_momentum,
                                    affine=args.batch_norm_affine)
    self.fc1_bn = nn.BatchNorm1d(num_features=in_fea//2, momentum=args.bath_norm_momentum,
                                 affine=args.batch_norm_affine)
    self.fc2_bn = nn.BatchNorm1d(num_features=C, momentum=args.bath_norm_momentum,
                                 affine=args.batch_norm_affine)

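Wide versus narrow convolution: deep convolutional models should use wide convolution, because narrow convolution leads to dimension problems. With my current data, even a two-layer CNN runs into them, so this also depends on the data.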

if args.wide_conv is True:
    print("using wide convolution")
    self.convs1 = [nn.Conv2d(in_channels=Ci, out_channels=Co, kernel_size=(K, D), stride=(1, 1),
                             padding=(K//2, 0), dilation=1, bias=True) for K in Ks]
else:
    print("using narrow convolution")
    self.convs1 = [nn.Conv2d(in_channels=Ci, out_channels=Co, kernel_size=(K, D), bias=True) for K in Ks]
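
The dimension problem follows from the standard output-length formula for a convolution, out = (L + 2p - K) // stride + 1. The sketch below applies it along the sentence axis for stacked layers; conv_out_len and stacked_lengths are hypothetical helpers, and the sentence length of 6 with two kernels of size 5 is a made-up example.

```python
def conv_out_len(length, kernel, padding=0, stride=1):
    """Output length of a convolution along the sentence axis."""
    return (length + 2 * padding - kernel) // stride + 1


def stacked_lengths(length, kernels, wide):
    """Sentence length after each layer; None marks a layer that cannot be applied."""
    lengths = []
    for k in kernels:
        pad = k // 2 if wide else 0
        if length + 2 * pad < k:
            lengths.append(None)  # kernel is larger than the (padded) input
            break
        length = conv_out_len(length, k, pad)
        lengths.append(length)
    return lengths


print(stacked_lengths(6, [5, 5], wide=False))  # [2, None]
print(stacked_lengths(6, [5, 5], wide=True))   # [6, 6]
```

With narrow convolution the sentence shrinks by K - 1 per layer, so short sentences run out of positions by the second layer, while padding of K // 2 keeps the length roughly constant.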

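Character-level processing: the initial approach processes text at the word level. You can also split text into characters, whose vectors can be initialized randomly. This is another strategy; I tried it, and it did not improve results on my current task.
Optimizer: PyTorch provides several optimizers. The one we use most is Adam, which works quite well; see http://pytorch.org/docs/master/optim.html#algorithms
Fine-tune or no fine-tune: this is an important choice. In general, fine-tuning gives better results than not fine-tuning.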
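
The first strategy in this list, dropping words that occur only once or keeping them with some probability, can be sketched with the standard library alone. The function name build_vocab and the parameters min_freq, keep_rare_prob, and seed are hypothetical; torchtext has its own vocabulary builder, which this sketch does not reproduce.

```python
import random
from collections import Counter


def build_vocab(sentences, min_freq=2, keep_rare_prob=0.0, seed=0):
    """Keep words seen at least min_freq times; keep rarer words with keep_rare_prob."""
    rng = random.Random(seed)
    counts = Counter(word for sent in sentences for word in sent)
    vocab = []
    for word, freq in sorted(counts.items()):
        if freq >= min_freq or rng.random() < keep_rare_prob:
            vocab.append(word)
    return vocab


sentences = [["a", "good", "film"], ["a", "bad", "film"], ["a", "gem"]]
print(build_vocab(sentences))                      # ['a', 'film']
print(build_vocab(sentences, keep_rare_prob=1.0))  # every word is kept
```

Treating min_freq and keep_rare_prob as hyperparameters makes it straightforward to compare accuracy with and without the rare words.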

(7) References and Acknowledgements

(END) Reposting is welcome, but please credit the source: bamtercelboo
