Sentiment analysis (SA), also known as opinion extraction, opinion mining, sentiment mining, or subjectivity analysis, is a branch of natural language processing (NLP). It is the process of analyzing, processing, summarizing, and drawing inferences from subjective text that carries emotional color.
In what follows we build a concrete sentiment-analysis example in TensorFlow based on an LSTM network. The main contents are as follows:
Natural language processing is about teaching machines to process or understand human language. Currently popular directions include:
The essence of sentiment analysis is to infer, from the given text and its sentiment cues, whether the text is positive or negative.
Sentiment analysis involves the following difficulties:
Whether we use machine learning or deep learning, the input data must be converted into numbers the computer can work with. Convolutional neural networks take pixels as input, logistic regression takes quantifiable feature values as input, and reinforcement-learning models update from reward signals. For NLP tasks we likewise have to turn text into numbers.
from IPython.display import Image
Image(filename="./data/17_01.png", width=500)
So we cannot feed raw strings to the model: each word in a sentence has to be converted into a number or a vector. How? There are many options. The first idea is usually to represent each word by an integer. This is simple, but it cannot capture dependencies between words, nor can it express synonyms or near-synonyms. To overcome this, the integers can be turned into one-hot encodings, which avoid the problem of larger values implying larger weights, but the resulting matrix is usually huge and still cannot reflect contextual dependencies within a sentence. A better approach is to use a Word2Vec-style algorithm or model to map each word in a sentence to a vector of modest dimensionality (say 50 to 300). This keeps the vector dimension under control while still capturing the contextual relationships between words in a sentence or document.
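To make the contrast concrete, here is a small illustrative sketch (toy vocabulary, not part of the tutorial's dataset) of the three representations just described:

import numpy as np

vocab = ["movie", "was", "great"]                 # toy vocabulary
word_to_id = {w: i for i, w in enumerate(vocab)}
ids = [word_to_id[w] for w in ["movie", "was", "great"]]

# 1. Plain integer IDs: compact, but larger IDs falsely suggest larger weight
print(ids)                                        # [0, 1, 2]

# 2. One-hot vectors: no spurious ordering, but dimension equals vocabulary size
one_hot = np.eye(len(vocab))[ids]                 # (3, 3) here; 400000 columns for the GloVe vocabulary
print(one_hot.shape)

# 3. Dense word vectors: a low-dimensional matrix, e.g. 50 values per word
embedding = np.random.randn(len(vocab), 50)       # stands in for a learned Word2Vec/GloVe matrix
print(embedding[ids].shape)                       # (3, 50)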
Image(filename="./data/17_02.png",width=500)
Image(filename="./data/17_03.png",width=500)
A Word2Vec model is trained on every sentence in the dataset: a fixed-size window slides over each sentence, and the vector of the word in the middle of the window is predicted from its surrounding context.
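For intuition, here is a minimal sketch of training such vectors yourself with gensim on a hypothetical toy corpus (the tutorial itself uses pretrained GloVe vectors instead; the parameter is named size rather than vector_size in older gensim releases):

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (hypothetical example data)
sentences = [["the", "movie", "was", "great"],
             ["the", "film", "was", "boring"],
             ["i", "loved", "the", "movie"]]

# window=2 is the sliding context window; vector_size=50 matches the GloVe vectors used below
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)
print(model.wv["movie"].shape)    # (50,)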
With the network's input data (the word vectors) in hand, the next step is to decide which network to build. A key characteristic of NLP data is that it is sequential: each word depends on the word before it and the word after it. Because of this dependency, a recurrent neural network is well suited to processing the data.
In a recurrent neural network, each word in the sentence occupies one time step; in practice the number of time steps equals the maximum sequence length.
Image(filename="./data/17_04.png",width=500)
Transfer learning is a machine-learning technique in which, simply put, a model developed for task A is reused as the starting point for task B. Used sensibly, it avoids training a separate model from scratch for every target task and saves a great deal of computation.
In both computer vision and natural language processing, starting a new model from a pretrained one is common practice, because pretraining such models usually consumes a great deal of time and enormous computational resources. Transfer learning simply carries a pretrained model over to the new task.
We use two pretrained word-vector artifacts: a Python list (loaded from a NumPy file) containing 400,000 words, and a 400,000 x 50 embedding matrix of word vectors.
The task of sentiment analysis is to decide whether the sentiment of an input word or sentence is positive or negative. We break this specific task into five steps:
We use an off-the-shelf, comparatively small matrix trained with GloVe. It contains 400,000 word vectors, each of dimension 50.
Taking the transfer-learning approach, we directly load these two pretrained data structures: a Python list of 400,000 words and a 400,000 x 50 embedding matrix holding all of the word vectors.
import numpy as np

wordsList = np.load("./imdb/wordsList.npy")
print("Load the word list!")
# Originally loaded as a NumPy array
wordsList = wordsList.tolist()
# Decode each word as a UTF-8 string
wordsList = [word.decode("UTF-8") for word in wordsList]
wordVectors = np.load("./imdb/wordVectors.npy")
print("Loaded the word vectors!")
Load the word list! Loaded the word vectors!
Check the size of the word list and the dimensions of the embedding matrix:
print(len(wordsList))
print(wordVectors.shape)
400000 (400000, 50)
We can also search the vocabulary for a word, say "baseball", and then index into the embedding matrix to get the corresponding vector:
baseballIndex = wordsList.index("baseball")
wordVectors[baseballIndex]
array([-1.9327 , 1.0421 , -0.78515 , 0.91033 , 0.22711 , -0.62158 , -1.6493 , 0.07686 , -0.5868 , 0.058831, 0.35628 , 0.68916 , -0.50598 , 0.70473 , 1.2664 , -0.40031 , -0.020687, 0.80863 , -0.90566 , -0.074054, -0.87675 , -0.6291 , -0.12685 , 0.11524 , -0.55685 , -1.6826 , -0.26291 , 0.22632 , 0.713 , -1.0828 , 2.1231 , 0.49869 , 0.066711, -0.48226 , -0.17897 , 0.47699 , 0.16384 , 0.16537 , -0.11506 , -0.15962 , -0.94926 , -0.42833 , -0.59457 , 1.3566 , -0.27506 , 0.19918 , -0.36008 , 0.55667 , -0.70315 , 0.17157 ], dtype=float32)
With the vectors in hand, the first step is to take an input sentence and build its vector representation. Suppose the input sentence is "I thought the movie was incredible and inspiring". To obtain the word vectors we can use TensorFlow's embedding-lookup function, which takes two arguments: the embedding matrix (here, the word-vector matrix) and the index of each word. A concrete example:
import tensorflow as tf

# Maximum sentence length
maxSeqLength = 10
# Word-vector dimensionality
numDimensions = 300
firstSentence = np.zeros((maxSeqLength), dtype="int32")
firstSentence[0] = wordsList.index("i")
firstSentence[1] = wordsList.index("thought")
firstSentence[2] = wordsList.index("the")
firstSentence[3] = wordsList.index("movie")
firstSentence[4] = wordsList.index("was")
firstSentence[5] = wordsList.index("incredible")
firstSentence[6] = wordsList.index("and")
firstSentence[7] = wordsList.index("inspiring")
# firstSentence[8] and firstSentence[9] stay 0
print(firstSentence.shape)
print(firstSentence)
(10,) [ 41 804 201534 1005 15 7446 5 13767 0 0]
Image(filename="./data/17_07.png",width=500)
with tf.Session() as session:
    print(tf.nn.embedding_lookup(wordVectors, firstSentence).eval().shape)
WARNING:tensorflow:From E:\Anaconda\envs\mytensorflow\lib\site-packages\tensorflow\python\ops\embedding_ops.py:132: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version. Instructions for updating: Colocations handled automatically by placer. (10, 50)
The output is a 10 x 50 word matrix: 10 words, each represented by a 50-dimensional vector.
The new training set is the IMDB dataset, which contains 25,000 movie reviews: 12,500 positive and 12,500 negative. Each review is stored in its own text file, so the first thing to do is parse these files; the positive reviews sit in one directory and the negative reviews in another.
from os import listdir
from os.path import isfile, join

positiveFiles = ["./imdb/positiveReviews/" + f for f in listdir("./imdb/positiveReviews/")
                 if isfile(join("./imdb/positiveReviews/", f))]
negativeFiles = ["./imdb/negativeReviews/" + f for f in listdir("./imdb/negativeReviews/")
                 if isfile(join("./imdb/negativeReviews/", f))]
numWords = []
for pf in positiveFiles:
    with open(pf, "r", encoding="utf-8") as f:
        line = f.readline()
        counter = len(line.split())
        numWords.append(counter)
print("Positive files finished")

for nf in negativeFiles:
    with open(nf, "r", encoding="utf-8") as f:
        line = f.readline()
        counter = len(line.split())
        numWords.append(counter)
print("Negative files finished")

numFiles = len(numWords)
print("The total number of files is", numFiles)
print("The total number of words in the files is", sum(numWords))
print("The average number of words in the files is", sum(numWords)/len(numWords))
Positive files finished Negative files finished The total number of files is 25000 The total number of words in the files is 5844680 The average number of words in the files is 233.7872
Visualize the distribution of review lengths with Matplotlib:
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm

# Chinese font so that the axis labels render correctly
myfont = fm.FontProperties(fname="E:/Anaconda/envs/mytensorflow/Lib/site-packages/matplotlib/mpl-data/fonts/ttf/Simhei.ttf")
%matplotlib inline

plt.hist(numWords, 50)
plt.xlabel("序列長度", fontproperties=myfont)   # sequence length
plt.ylabel("頻率", fontproperties=myfont)       # frequency
plt.axis([0, 1200, 0, 8000])
plt.show()
From the histogram and the average word count per review, setting the maximum sentence length to 250 is a reasonable choice. Next we convert the text of a single file into an index matrix. The following code prints one review from the corpus:
maxSeqLength = 250
fname = positiveFiles[3]
with open(fname) as f:
    for lines in f:
        print(lines)
        break  # each review file contains a single line
This is easily the most underrated film inn the Brooks cannon. Sure, its flawed. It does not give a realistic view of homelessness (unlike, say, how Citizen Kane gave a realistic view of lounge singers, or Titanic gave a realistic view of Italians YOU IDIOTS). Many of the jokes fall flat. But still, this film is very lovable in a way many comedies are not, and to pull that off in a story about some of the most traditionally reviled members of society is truly impressive. Its not The Fisher King, but its not crap, either. My only complaint is that Brooks should have cast someone else in the lead (I love Mel as a Director and Writer, not so much as a lead).
Convert it into an index matrix:
import re

# Remove punctuation, parentheses, question marks, etc., leaving only alphanumeric characters
strip_special_chars = re.compile("[^A-Za-z0-9 ]+")

def cleanSentences(string):
    string = string.lower().replace("<br />", " ")
    return re.sub(strip_special_chars, "", string.lower())

firstFile = np.zeros((maxSeqLength), dtype="int32")
with open(fname) as f:
    indexCounter = 0
    line = f.readline()
    cleanedLine = cleanSentences(line)
    split = cleanedLine.split()
    for word in split:
        try:
            firstFile[indexCounter] = wordsList.index(word)
        except ValueError:
            firstFile[indexCounter] = 399999  # placeholder index for unknown words
        indexCounter = indexCounter + 1
firstFile
array([ 37, 14, 2407, 201534, 96, 37314, 319, 7158, 201534, 6469, 8828, 1085, 47, 9703, 20, 260, 36, 455, 7, 7284, 1139, 3, 26494, 2633, 203, 197, 3941, 12739, 646, 7, 7284, 1139, 3, 11990, 7792, 46, 12608, 646, 7, 7284, 1139, 3, 8593, 81, 36381, 109, 3, 201534, 8735, 807, 2983, 34, 149, 37, 319, 14, 191, 31906, 6, 7, 179, 109, 15402, 32, 36, 5, 4, 2933, 12, 138, 6, 7, 523, 59, 77, 3, 201534, 96, 4246, 30006, 235, 3, 908, 14, 4702, 4571, 47, 36, 201534, 6429, 691, 34, 47, 36, 35404, 900, 192, 91, 4499, 14, 12, 6469, 189, 33, 1784, 1318, 1726, 6, 201534, 410, 41, 835, 10464, 19, 7, 369, 5, 1541, 36, 100, 181, 19, 7, 410, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
Now we process all 25,000 reviews in the same way: load the movie-review training set and obtain a 25,000 x 250 matrix. Because this is computationally very expensive, we again take the transfer-learning shortcut and load a precomputed index-matrix file directly.
ids=np.load("./imdb/idsMatrix.npy")
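For reference, the precomputed idsMatrix.npy could be rebuilt from the raw reviews with a loop like the sketch below (it reuses the cleanSentences function defined above; the main cost is that wordsList.index() is a linear search over 400,000 words):

# Sketch only: rebuilding the 25000 x 250 index matrix from the raw reviews
ids = np.zeros((numFiles, maxSeqLength), dtype="int32")
fileCounter = 0
for reviewFile in positiveFiles + negativeFiles:
    with open(reviewFile, "r", encoding="utf-8") as f:
        # Each review occupies a single line; clean it and map words to indices
        words = cleanSentences(f.readline()).split()
        for wordIndex, word in enumerate(words[:maxSeqLength]):
            try:
                ids[fileCounter][wordIndex] = wordsList.index(word)
            except ValueError:
                ids[fileCounter][wordIndex] = 399999  # unknown word
    fileCounter += 1
np.save("./imdb/idsMatrix.npy", ids)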
The helper functions for drawing batches are defined as follows:
from random import randint

def getTrainBatch():
    labels = []
    arr = np.zeros([batchSize, maxSeqLength])
    for i in range(batchSize):
        if (i % 2 == 0):
            num = randint(1, 11499)
            labels.append([1, 0])
        else:
            num = randint(13499, 24999)
            labels.append([0, 1])
        arr[i] = ids[num-1:num]
    return arr, labels

def getTestBatch():
    labels = []
    arr = np.zeros([batchSize, maxSeqLength])
    for i in range(batchSize):
        num = randint(11499, 13499)
        if (num <= 12499):
            labels.append([1, 0])
        else:
            labels.append([0, 1])
        arr[i] = ids[num-1:num]
    return arr, labels
First define a few hyperparameters, such as the batch size, the number of LSTM units, the number of classes, and the number of training iterations:
batchSize = 24
lstmUnits = 64
numClasses = 2
iterations = 20000
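As a quick sanity check of the batch helpers (assuming ids has been loaded as above), each training batch should be a 24 x 250 index matrix together with 24 one-hot labels that alternate between positive and negative rows:

batchArr, batchLabels = getTrainBatch()
print(batchArr.shape)     # (24, 250)
print(len(batchLabels))   # 24
print(batchLabels[:2])    # [[1, 0], [0, 1]]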
We need two placeholders, one for the input data and one for the labels. The most important thing when defining a placeholder is to get its shape right.
import tensorflow as tf

tf.reset_default_graph()
labels = tf.placeholder(tf.float32, [batchSize, numClasses])
input_data = tf.placeholder(tf.int32, [batchSize, maxSeqLength])
The labels placeholder holds a set of values, each of which is either [1,0] or [0,1] depending on whether the example is positive or negative. The input placeholder holds an array of integer word indices.
from IPython.display import Image
Image(filename="./data/17_06.png", width=500)
With the input-data placeholder in place, we call tf.nn.embedding_lookup() to obtain the word vectors. The function returns a three-dimensional tensor whose first dimension is the batch size, second dimension is the sentence length, and third dimension is the word-vector dimensionality.
data=tf.nn.embedding_lookup(wordVectors,input_data)
Image(filename="./data/17_05.png",width=500)
How do we feed this data into our LSTM network? First we call tf.nn.rnn_cell.BasicLSTMCell, which takes an integer argument specifying how many LSTM units we want. This is a hyperparameter that needs tuning to find a good value. We then set a dropout parameter to help guard against overfitting. Finally we pass the LSTM cell and the three-dimensional input data to tf.nn.dynamic_rnn, which unrolls the whole network and builds the RNN model.
lstmCell = tf.contrib.rnn.BasicLSTMCell(lstmUnits)
lstmCell = tf.contrib.rnn.DropoutWrapper(cell=lstmCell, output_keep_prob=0.25)
value, _ = tf.nn.dynamic_rnn(lstmCell, data, dtype=tf.float32)
WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0. For more information, please see: * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md * https://github.com/tensorflow/addons If you depend on functionality not listed there, please file an issue. WARNING:tensorflow:From <ipython-input-16-56feceb9201e>:1: BasicLSTMCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version. Instructions for updating: This class is equivalent as tf.keras.layers.LSTMCell, and will be replaced by that in Tensorflow 2.0. WARNING:tensorflow:From <ipython-input-16-56feceb9201e>:3: dynamic_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version. Instructions for updating: Please use `keras.layers.RNN(cell)`, which is equivalent to this API WARNING:tensorflow:From E:\Anaconda\envs\mytensorflow\lib\site-packages\tensorflow\python\ops\rnn_cell_impl.py:1259: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version. Instructions for updating: Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
An LSTM helps the model remember more contextual information, but the downside is that it adds many more trainable parameters, lengthens training time considerably, and increases the risk of overfitting.
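To see where the extra parameters come from: a standard LSTM cell has four gates, each with a weight matrix over the concatenated input and hidden state plus a bias. With the 50-dimensional GloVe vectors and the 64 units used here, the cell alone therefore holds roughly 29k trainable weights (the final fully connected layer is extra):

inputDim = 50     # GloVe word-vector dimension
lstmUnits = 64
# 4 gates, each with weights over [input; hidden state] plus a bias vector
lstmParams = 4 * ((inputDim + lstmUnits) * lstmUnits + lstmUnits)
print(lstmParams)  # 29440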
The first output of dynamic_rnn can be regarded as the final hidden-state vector. This vector is reshaped, then multiplied by a final weight matrix and added to a bias term to produce the final output values.
weight = tf.Variable(tf.truncated_normal([lstmUnits, numClasses]))
bias = tf.Variable(tf.constant(0.1, shape=[numClasses]))
# Move the time axis first, then take the output at the last time step
value = tf.transpose(value, [1, 0, 2])
last = tf.gather(value, int(value.get_shape()[0]) - 1)
prediction = (tf.matmul(last, weight) + bias)
Next we define the prediction check and the accuracy metric. A prediction is correct when the final output 0-1 vector matches the labeled 0-1 vector.
correctPred = tf.equal(tf.argmax(prediction, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correctPred, tf.float32))
We then use standard cross entropy as the loss function. For the optimizer we choose Adam with its default learning rate:
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=prediction, labels=labels))
optimizer = tf.train.AdamOptimizer().minimize(loss)
If you want to visualize the loss and accuracy with TensorBoard, modify and run the following code:
import datetime

tf.summary.scalar("Loss", loss)
tf.summary.scalar("Accuracy", accuracy)
merged = tf.summary.merge_all()
logdir = "./tensorboard/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S") + "/"
sess = tf.Session()
writer = tf.summary.FileWriter(logdir, sess.graph)
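Once training is running, the dashboards can be viewed by launching TensorBoard from a terminal and opening the local URL it reports in a browser:

tensorboard --logdir=./tensorboard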
Choosing suitable hyperparameters is crucial for training your neural network.
The basic training loop is: first create a TensorFlow session; then load a batch of reviews and their labels; then call the session's run function. run takes two arguments here. The first, fetches, specifies the values we are interested in; we want the optimizer to run so that it minimizes the loss. The second, feed_dict, supplies values for the placeholders: we feed a batch of reviews and labels into the model, and we keep looping over batches of training data.
To get good model performance the number of iterations should be fairly large. Here we train for only 20,000 iterations (iterations=20000), using the GPU.
sess = tf.InteractiveSession()
saver = tf.train.Saver()
sess.run(tf.global_variables_initializer())

with tf.device("/gpu:0"):
    for i in range(iterations):
        # Get the next batch of data
        nextBatch, nextBatchLabels = getTrainBatch()
        sess.run(optimizer, {input_data: nextBatch, labels: nextBatchLabels})
        # Write summary information to TensorBoard
        if (i % 50 == 0):
            summary = sess.run(merged, {input_data: nextBatch, labels: nextBatchLabels})
            writer.add_summary(summary, i)
        # Save the model every 1000 iterations
        if (i % 1000 == 0 and i != 0):
            save_path = saver.save(sess, "./models/pretrained_lstm.ckpt", global_step=i)
            print("saved to %s" % save_path)
writer.close()
saved to ./models/pretrained_lstm.ckpt-1000 saved to ./models/pretrained_lstm.ckpt-2000 saved to ./models/pretrained_lstm.ckpt-3000 saved to ./models/pretrained_lstm.ckpt-4000 saved to ./models/pretrained_lstm.ckpt-5000 WARNING:tensorflow:From E:\Anaconda\envs\mytensorflow\lib\site-packages\tensorflow\python\training\saver.py:966: remove_checkpoint (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version. Instructions for updating: Use standard file APIs to delete files with this prefix. saved to ./models/pretrained_lstm.ckpt-6000 saved to ./models/pretrained_lstm.ckpt-7000 saved to ./models/pretrained_lstm.ckpt-8000 saved to ./models/pretrained_lstm.ckpt-9000 saved to ./models/pretrained_lstm.ckpt-10000 saved to ./models/pretrained_lstm.ckpt-11000 saved to ./models/pretrained_lstm.ckpt-12000 saved to ./models/pretrained_lstm.ckpt-13000 saved to ./models/pretrained_lstm.ckpt-14000 saved to ./models/pretrained_lstm.ckpt-15000 saved to ./models/pretrained_lstm.ckpt-16000 saved to ./models/pretrained_lstm.ckpt-17000 saved to ./models/pretrained_lstm.ckpt-18000 saved to ./models/pretrained_lstm.ckpt-19000
The code above saved the model. To restore a pretrained model we create another TensorFlow Saver and, within a session, call its restore function. This function takes two arguments: the current session and the path of the saved model.
sess = tf.InteractiveSession()
saver = tf.train.Saver()
saver.restore(sess, tf.train.latest_checkpoint("./models"))
WARNING:tensorflow:From E:\Anaconda\envs\mytensorflow\lib\site-packages\tensorflow\python\training\saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version. Instructions for updating: Use standard file APIs to check for files with this prefix. INFO:tensorflow:Restoring parameters from ./models\pretrained_lstm.ckpt-19000
Then load some movie reviews from our test set. The following code reports the accuracy for each test batch:
iterations = 10
for i in range(iterations):
    nextBatch, nextBatchLabels = getTestBatch()
    print("Accuracy for this batch:", (sess.run(accuracy, {input_data: nextBatch, labels: nextBatchLabels})) * 100)
Accuracy for this batch: 41.66666567325592 Accuracy for this batch: 41.66666567325592 Accuracy for this batch: 33.33333432674408 Accuracy for this batch: 37.5 Accuracy for this batch: 54.16666865348816 Accuracy for this batch: 41.66666567325592 Accuracy for this batch: 41.66666567325592 Accuracy for this batch: 54.16666865348816 Accuracy for this batch: 45.83333432674408 Accuracy for this batch: 50.0
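To score a new review with the restored model, we can reuse cleanSentences and wordsList to build an index matrix of the right shape and run the prediction op. A minimal sketch follows (the helper name getSentenceMatrix and the sample sentence are illustrative, not part of the original code):

def getSentenceMatrix(sentence):
    # Map a raw sentence onto a (batchSize, maxSeqLength) index matrix;
    # only the first row is used, and unknown words fall back to index 399999
    arr = np.zeros([batchSize, maxSeqLength], dtype="int32")
    words = cleanSentences(sentence).split()
    for idx, word in enumerate(words[:maxSeqLength]):
        try:
            arr[0, idx] = wordsList.index(word)
        except ValueError:
            arr[0, idx] = 399999
    return arr

inputMatrix = getSentenceMatrix("That movie was terrible.")
predictedSentiment = sess.run(prediction, {input_data: inputMatrix})[0]
# Index 0 is the positive logit, index 1 the negative logit
print("positive" if predictedSentiment[0] > predictedSentiment[1] else "negative")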