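Sentiment analysis (SA) is a branch of natural language processing (NLP). It is also known as orientation analysis, opinion extraction, opinion mining, sentiment mining, or subjectivity analysis. It is the process of analyzing, processing, summarizing, and reasoning about subjective text that carries emotional coloring.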
import numpy as np

wordsList=np.load("./imdb/wordsList.npy")
print("Load the word list!")
# Originally loaded as a NumPy array
wordsList=wordsList.tolist()
# Decode each word from bytes into a UTF-8 string
wordsList=[word.decode("UTF-8") for word in wordsList]
wordVectors=np.load("./imdb/wordVectors.npy")
print("Loaded the word vectors!")
Load the word list!
Loaded the word vectors!
print(len(wordsList))
print(wordVectors.shape)
400000
(400000, 50)
baseballIndex=wordsList.index("baseball")
wordVectors[baseballIndex]
array([-1.9327 , 1.0421 , -0.78515 , 0.91033 , 0.22711 , -0.62158 , -1.6493 , 0.07686 , -0.5868 , 0.058831, 0.35628 , 0.68916 , -0.50598 , 0.70473 , 1.2664 , -0.40031 , -0.020687, 0.80863 , -0.90566 , -0.074054, -0.87675 , -0.6291 , -0.12685 , 0.11524 , -0.55685 , -1.6826 , -0.26291 , 0.22632 , 0.713 , -1.0828 , 2.1231 , 0.49869 , 0.066711, -0.48226 , -0.17897 , 0.47699 , 0.16384 , 0.16537 , -0.11506 , -0.15962 , -0.94926 , -0.42833 , -0.59457 , 1.3566 , -0.27506 , 0.19918 , -0.36008 , 0.55667 , -0.70315 , 0.17157 ], dtype=float32)
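With the vectors loaded, the first step is to take an input sentence and build its vector representation. Suppose the input sentence is "I thought the movie was incredible and inspiring". To get the word vectors, we can use TensorFlow's embedding lookup function. It takes two arguments: the embedding matrix (here, the word-vector matrix) and the index of each word. A concrete example follows: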
import tensorflow as tf

# Maximum sentence length
maxSeqLength=10
# Word-vector dimensionality (matches wordVectors.shape[1], which is 50 here)
numDimensions=50
firstSentence=np.zeros((maxSeqLength),dtype="int32")
firstSentence[0]=wordsList.index("i")
firstSentence[1]=wordsList.index("thought")
firstSentence[2]=wordsList.index("the")
firstSentence[3]=wordsList.index("movie")
firstSentence[4]=wordsList.index("was")
firstSentence[5]=wordsList.index("incredible")
firstSentence[6]=wordsList.index("and")
firstSentence[7]=wordsList.index("inspiring")
# firstSentence[8] and firstSentence[9] stay 0
print(firstSentence.shape)
print(firstSentence)
(10,)
[ 41 804 201534 1005 15 7446 5 13767 0 0]
with tf.Session() as session:
    print(tf.nn.embedding_lookup(wordVectors,firstSentence).eval().shape)
WARNING:tensorflow:From E:\Anaconda\envs\mytensorflow\lib\site-packages\tensorflow\python\ops\embedding_ops.py:132: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
(10, 50)
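Conceptually, the embedding lookup above selects one row of the embedding matrix per word index. The following is a minimal pure-Python sketch of that idea, not TensorFlow's implementation; `embedding_lookup`, `toy_vectors`, and `toy_sentence` are hypothetical names with made-up values.

```python
def embedding_lookup(matrix, indices):
    """Return one row of `matrix` per entry of `indices` (pure-Python stand-in)."""
    return [matrix[i] for i in indices]

# Hypothetical 4-word vocabulary with 3-dimensional vectors (values are made up)
toy_vectors = [
    [0.0, 0.1, 0.2],   # index 0 (unused trailing slots also point here)
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
]
toy_sentence = [2, 3, 1, 0, 0]  # three words followed by two unused slots

embedded = embedding_lookup(toy_vectors, toy_sentence)
print(len(embedded), len(embedded[0]))  # 5 3
```

A sentence of length 5 with 3-dimensional vectors yields a 5 x 3 result, in the same way that the 10-word sentence above yields a (10, 50) result.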
from os import listdir
from os.path import isfile,join

positiveFiles=["./imdb/positiveReviews/"+f for f in listdir("./imdb/positiveReviews/")
               if isfile(join("./imdb/positiveReviews/",f))]
negativeFiles=["./imdb/negativeReviews/"+f for f in listdir("./imdb/negativeReviews/")
               if isfile(join("./imdb/negativeReviews/",f))]
numWords=[]
for pf in positiveFiles:
    with open(pf,"r",encoding="utf-8") as f:
        line=f.readline()
        counter=len(line.split())
        numWords.append(counter)
print("Positive files finished")

for nf in negativeFiles:
    with open(nf,"r",encoding="utf-8") as f:
        line=f.readline()
        counter=len(line.split())
        numWords.append(counter)
print("Negative files finished")

numFiles=len(numWords)
print("The total number of files is",numFiles)
print("The total number of words in the files is",sum(numWords))
print("The average number of words in the files is",sum(numWords)/len(numWords))
Positive files finished
Negative files finished
The total number of files is 25000
The total number of words in the files is 5844680
The average number of words in the files is 233.7872
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm

# Font that can render the Chinese axis labels
myfont=fm.FontProperties(fname="E:/Anaconda/envs/mytensorflow/Lib/site-packages/matplotlib/mpl-data/fonts/ttf/Simhei.ttf")
%matplotlib inline
# Histogram of review lengths with 50 bins
plt.hist(numWords,50)
plt.xlabel("序列長度",fontproperties=myfont)  # "sequence length"
plt.ylabel("頻率",fontproperties=myfont)  # "frequency"
plt.axis([0,1200,0,8000])
plt.show()
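One way to sanity-check a maximum sequence length against the length distribution is to measure what fraction of reviews fits entirely within a candidate cut-off. The sketch below assumes a hypothetical helper `coverage` and made-up lengths; with the real data, `numWords` would take the place of `toy_lengths`.

```python
def coverage(lengths, max_len):
    """Fraction of sequences whose length is at most `max_len`."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    return sum(1 for n in lengths if n <= max_len) / len(lengths)

toy_lengths = [120, 180, 230, 250, 300, 410, 90, 1000]  # made-up review lengths
for candidate in (100, 250, 500):
    print(candidate, coverage(toy_lengths, candidate))
```

Reviews longer than the cut-off are truncated and shorter ones are padded, so a larger cut-off loses less text at the cost of longer input sequences.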
maxSeqLength=250

fname=positiveFiles[3]
with open(fname,encoding="utf-8") as f:
    for lines in f:
        print(lines)
This is easily the most underrated film inn the Brooks cannon. Sure, its flawed. It does not give a realistic view of homelessness (unlike, say, how Citizen Kane gave a realistic view of lounge singers, or Titanic gave a realistic view of Italians YOU IDIOTS). Many of the jokes fall flat. But still, this film is very lovable in a way many comedies are not, and to pull that off in a story about some of the most traditionally reviled members of society is truly impressive. Its not The Fisher King, but its not crap, either. My only complaint is that Brooks should have cast someone else in the lead (I love Mel as a Director and Writer, not so much as a lead).
import re

# Remove punctuation, brackets, question marks, etc.; keep only letters, digits, and spaces
strip_special_chars=re.compile("[^A-Za-z0-9 ]+")

def cleanSentences(string):
    string=string.lower().replace("<br />"," ")
    return re.sub(strip_special_chars,"",string)

firstFile=np.zeros((maxSeqLength),dtype="int32")
with open(fname,encoding="utf-8") as f:
    indexCounter=0
    line=f.readline()
    cleanedLine=cleanSentences(line)
    split=cleanedLine.split()
    for word in split:
        # Stop once the fixed-length vector is full
        if indexCounter>=maxSeqLength:
            break
        try:
            firstFile[indexCounter]=wordsList.index(word)
        except ValueError:
            # Index used for words that are missing from the vocabulary
            firstFile[indexCounter]=399999
        indexCounter=indexCounter+1
firstFile
array([ 37, 14, 2407, 201534, 96, 37314, 319, 7158, 201534, 6469, 8828, 1085, 47, 9703, 20, 260, 36, 455, 7, 7284, 1139, 3, 26494, 2633, 203, 197, 3941, 12739, 646, 7, 7284, 1139, 3, 11990, 7792, 46, 12608, 646, 7, 7284, 1139, 3, 8593, 81, 36381, 109, 3, 201534, 8735, 807, 2983, 34, 149, 37, 319, 14, 191, 31906, 6, 7, 179, 109, 15402, 32, 36, 5, 4, 2933, 12, 138, 6, 7, 523, 59, 77, 3, 201534, 96, 4246, 30006, 235, 3, 908, 14, 4702, 4571, 47, 36, 201534, 6429, 691, 34, 47, 36, 35404, 900, 192, 91, 4499, 14, 12, 6469, 189, 33, 1784, 1318, 1726, 6, 201534, 410, 41, 835, 10464, 19, 7, 369, 5, 1541, 36, 100, 181, 19, 7, 410, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
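The conversion above (clean the text, map each word to an index, then fill a fixed-length vector) can be sketched without NumPy as follows. This is only an illustration under assumptions: `encode`, `toy_vocab`, and the unknown-word id 99 are hypothetical, whereas the code above uses `wordsList` and 399999.

```python
import re

STRIP_SPECIAL_CHARS = re.compile("[^A-Za-z0-9 ]+")

def encode(text, vocab, max_len, unknown_id):
    """Lowercase and clean `text`, map words to ids, then truncate or zero-pad."""
    cleaned = STRIP_SPECIAL_CHARS.sub("", text.lower().replace("<br />", " "))
    ids = [vocab.get(word, unknown_id) for word in cleaned.split()][:max_len]
    return ids + [0] * (max_len - len(ids))

toy_vocab = {"the": 1, "movie": 2, "was": 3, "great": 4}  # hypothetical vocabulary
print(encode("The movie was GREAT!<br />Truly.", toy_vocab, 8, unknown_id=99))
# [1, 2, 3, 4, 99, 0, 0, 0]
```

The sketch uses a dictionary rather than `list.index`, because `list.index` scans the list on every call, which becomes slow for a 400,000-word vocabulary and 25,000 reviews.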
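from random import randint

# `ids` is assumed to be an integer matrix with one row per review and
# maxSeqLength columns, holding word indices built the same way as firstFile.

def getTrainBatch():
    labels=[]
    arr=np.zeros([batchSize,maxSeqLength])
    for i in range(batchSize):
        if (i%2==0):
            # Even slots draw a positive review
            num=randint(1,11499)
            labels.append([1,0])
        else:
            # Odd slots draw a negative review
            num=randint(13499,24999)
            labels.append([0,1])
        arr[i]=ids[num-1:num]
    return arr,labels

def getTestBatch():
    labels=[]
    arr=np.zeros([batchSize,maxSeqLength])
    for i in range(batchSize):
        # Test reviews come from the middle block of rows
        num=randint(11499,13499)
        if (num<=12499):
            labels.append([1,0])
        else:
            labels.append([0,1])
        arr[i]=ids[num-1:num]
    return arr,labels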
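The index ranges used by the two batch functions can be checked with a few lines of plain Python. Because `randint(a, b)` includes both endpoints, the sketch below builds each range up to `b + 1`; the variable names are illustrative only.

```python
# randint(a, b) includes both endpoints, so each range ends at b + 1 here.
train_positive = range(1, 11499 + 1)
test_pool = range(11499, 13499 + 1)
train_negative = range(13499, 24999 + 1)

# Values of num that can be drawn for both training and testing
shared = (set(train_positive) | set(train_negative)) & set(test_pool)
print(len(train_positive), len(test_pool), len(train_negative))  # 11499 2001 11501
print(sorted(shared))  # [11499, 13499]
```

As written, the boundary values 11499 and 13499 can be drawn by both the training and the test sampler, so two reviews are shared between the splits. Shifting either boundary by one would make the ranges disjoint.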
batchSize=24
lstmUnits=64
numClasses=2
iterations=20000
import tensorflow as tf

tf.reset_default_graph()
labels=tf.placeholder(tf.float32,[batchSize,numClasses])
input_data=tf.placeholder(tf.int32,[batchSize,maxSeqLength])
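The LSTM cell is created with the BasicLSTMCell function, which takes an integer argument giving the number of LSTM units. This is a hyperparameter that we set ourselves and need to tune to find the best value. We then add a dropout wrapper to reduce overfitting. Finally, we feed the LSTM cell and the three-dimensional input data into tf.nn.dynamic_rnn.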
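# Assumed definition of `data`, which is not shown in this cell: the word vectors
# of every index in the batch, with shape [batchSize, maxSeqLength, 50]
data=tf.nn.embedding_lookup(wordVectors,input_data)

lstmCell=tf.contrib.rnn.BasicLSTMCell(lstmUnits)
lstmCell=tf.contrib.rnn.DropoutWrapper(cell=lstmCell,output_keep_prob=0.25)
value,_=tf.nn.dynamic_rnn(lstmCell,data,dtype=tf.float32)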
WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

WARNING:tensorflow:From <ipython-input-16-56feceb9201e>:1: BasicLSTMCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This class is equivalent as tf.keras.layers.LSTMCell, and will be replaced by that in Tensorflow 2.0.
WARNING:tensorflow:From <ipython-input-16-56feceb9201e>:3: dynamic_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API
WARNING:tensorflow:From E:\Anaconda\envs\mytensorflow\lib\site-packages\tensorflow\python\ops\rnn_cell_impl.py:1259: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
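To make the effect of `output_keep_prob=0.25` concrete, here is a minimal pure-Python sketch of dropout with a keep probability: each value survives with probability `keep_prob` and survivors are scaled by `1/keep_prob`, so a keep probability of 0.25 zeroes roughly three quarters of the outputs. The function `dropout` is a hypothetical stand-in, not the TensorFlow implementation.

```python
import random

def dropout(values, keep_prob, rng):
    """Keep each value with probability `keep_prob`; rescale survivors by 1/keep_prob."""
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

rng = random.Random(0)  # fixed seed so the run is repeatable
outputs = [1.0] * 8     # made-up LSTM outputs
print(dropout(outputs, 0.25, rng))
```

The keep probability is itself a hyperparameter: lower values regularize more strongly but pass less of the LSTM output to the next layer.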
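weight=tf.Variable(tf.truncated_normal([lstmUnits,numClasses]))
bias=tf.Variable(tf.constant(0.1,shape=[numClasses]))
# Reorder to [time, batch, units] and take the output of the last time step
value=tf.transpose(value,[1,0,2])
last=tf.gather(value,int(value.get_shape()[0])-1)
# Linear layer that maps the last output to one score per class
prediction=(tf.matmul(last,weight)+bias)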
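The step above (take the output of the final time step, then apply a weight matrix and a bias) can be sketched with nested lists. The helpers `last_step` and `linear` and all toy values are hypothetical; the sketch only illustrates the shapes involved.

```python
def last_step(outputs):
    """outputs[b][t] is the unit vector of example b at time t; keep only the final step."""
    return [sequence[-1] for sequence in outputs]

def linear(rows, weight, bias):
    """Compute rows @ weight + bias for nested lists."""
    return [
        [sum(x * w for x, w in zip(row, column)) + b
         for column, b in zip(zip(*weight), bias)]
        for row in rows
    ]

# Toy sizes: batch of 2, 3 time steps, 2 units, 2 classes (all values made up)
toy_outputs = [
    [[0.0, 0.0], [0.5, 0.5], [1.0, 2.0]],
    [[0.1, 0.1], [0.2, 0.2], [3.0, 4.0]],
]
toy_weight = [[1.0, 0.0], [0.0, 1.0]]  # shape [units, classes]
toy_bias = [0.1, 0.1]

print(linear(last_step(toy_outputs), toy_weight, toy_bias))
```

With a batch of 2 and 2 classes the result has shape 2 x 2, one score per class for each example.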
correctPred=tf.equal(tf.argmax(prediction,1),tf.argmax(labels,1))
accuracy=tf.reduce_mean(tf.cast(correctPred,tf.float32))
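The accuracy computed above is the fraction of examples whose highest-scoring class matches the one-hot label. A small pure-Python sketch of the same calculation, with hypothetical helper names and made-up scores:

```python
def argmax(values):
    """Index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

def batch_accuracy(predictions, labels):
    """Fraction of rows whose predicted class matches the one-hot label."""
    hits = [argmax(p) == argmax(l) for p, l in zip(predictions, labels)]
    return sum(hits) / len(hits)

toy_predictions = [[2.0, -1.0], [0.3, 0.9], [1.5, 1.6], [0.2, 0.1]]  # made-up scores
toy_labels = [[1, 0], [0, 1], [1, 0], [1, 0]]
print(batch_accuracy(toy_predictions, toy_labels))  # 0.75
```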
loss=tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=prediction,labels=labels))
optimizer=tf.train.AdamOptimizer().minimize(loss)
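For reference, the loss above is the batch mean of the softmax cross-entropy, which for logits $z$ and a one-hot label $y$ equals $-\sum_k y_k \log \mathrm{softmax}(z)_k$. The following is a minimal sketch in plain Python; the function names are hypothetical and it is not the TensorFlow implementation.

```python
import math

def softmax_cross_entropy(logits, label):
    """Cross-entropy between softmax(logits) and a one-hot (or soft) label."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return -sum(y * (z - log_norm) for z, y in zip(logits, label))

def mean_loss(batch_logits, batch_labels):
    """Average the per-example losses over the batch."""
    losses = [softmax_cross_entropy(z, y) for z, y in zip(batch_logits, batch_labels)]
    return sum(losses) / len(losses)

print(softmax_cross_entropy([0.0, 0.0], [1, 0]))  # ln 2, about 0.693
print(mean_loss([[2.0, -1.0], [0.3, 0.9]], [[1, 0], [0, 1]]))
```

Equal logits for two classes give a loss of ln 2, the value expected from a classifier that has not yet learned anything.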
import datetime

tf.summary.scalar("Loss",loss)
tf.summary.scalar("Accuracy",accuracy)
merged=tf.summary.merge_all()
logdir="./tensorboard/"+datetime.datetime.now().strftime("%Y%m%d-%H%M%S")+"/"
sess=tf.Session()
writer=tf.summary.FileWriter(logdir,sess.graph)
sess=tf.InteractiveSession()
saver=tf.train.Saver()
sess.run(tf.global_variables_initializer())

with tf.device("/gpu:0"):
    for i in range(iterations):
        # Fetch the next batch of data
        nextBatch,nextBatchLabels=getTrainBatch()
        sess.run(optimizer,{input_data:nextBatch,labels:nextBatchLabels})
        # Write the summaries to TensorBoard
        if (i%50==0):
            summary=sess.run(merged,{input_data:nextBatch,labels:nextBatchLabels})
            writer.add_summary(summary,i)
        # Save a checkpoint every 1000 iterations
        if (i%1000==0 and i!=0):
            save_path=saver.save(sess,"./models/pretrained_lstm.ckpt",global_step=i)
            print("saved to %s"%save_path)
writer.close()
saved to ./models/pretrained_lstm.ckpt-1000
saved to ./models/pretrained_lstm.ckpt-2000
saved to ./models/pretrained_lstm.ckpt-3000
saved to ./models/pretrained_lstm.ckpt-4000
saved to ./models/pretrained_lstm.ckpt-5000
WARNING:tensorflow:From E:\Anaconda\envs\mytensorflow\lib\site-packages\tensorflow\python\training\saver.py:966: remove_checkpoint (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to delete files with this prefix.
saved to ./models/pretrained_lstm.ckpt-6000
saved to ./models/pretrained_lstm.ckpt-7000
saved to ./models/pretrained_lstm.ckpt-8000
saved to ./models/pretrained_lstm.ckpt-9000
saved to ./models/pretrained_lstm.ckpt-10000
saved to ./models/pretrained_lstm.ckpt-11000
saved to ./models/pretrained_lstm.ckpt-12000
saved to ./models/pretrained_lstm.ckpt-13000
saved to ./models/pretrained_lstm.ckpt-14000
saved to ./models/pretrained_lstm.ckpt-15000
saved to ./models/pretrained_lstm.ckpt-16000
saved to ./models/pretrained_lstm.ckpt-17000
saved to ./models/pretrained_lstm.ckpt-18000
saved to ./models/pretrained_lstm.ckpt-19000
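The schedule implied by the two conditions in the training loop can be checked without running any training. The sketch below only reproduces the loop arithmetic and explains why the last checkpoint is numbered 19000 rather than 20000.

```python
iterations = 20000

# Steps at which a summary is written and at which a checkpoint is saved
summary_steps = [i for i in range(iterations) if i % 50 == 0]
checkpoint_steps = [i for i in range(iterations) if i % 1000 == 0 and i != 0]

print(len(summary_steps))  # 400
print(checkpoint_steps[0], checkpoint_steps[-1], len(checkpoint_steps))  # 1000 19000 19
```

Since `range(iterations)` stops at 19999, step 20000 is never reached and the final 999 updates are not saved. The `remove_checkpoint` warning that appears after the fifth save is presumably related to `tf.train.Saver` keeping only the five most recent checkpoints by default.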
sess=tf.InteractiveSession()
saver=tf.train.Saver()
saver.restore(sess,tf.train.latest_checkpoint("./models"))
WARNING:tensorflow:From E:\Anaconda\envs\mytensorflow\lib\site-packages\tensorflow\python\training\saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from ./models\pretrained_lstm.ckpt-19000
iterations=10
for i in range(iterations):
    # Use the labels returned with this test batch, not labels left over from training
    nextBatch,nextBatchLabels=getTestBatch()
    print("Accuracy for this batch:",(sess.run(accuracy,{input_data:nextBatch,labels:nextBatchLabels}))*100)
Accuracy for this batch: 41.66666567325592
Accuracy for this batch: 41.66666567325592
Accuracy for this batch: 33.33333432674408
Accuracy for this batch: 37.5
Accuracy for this batch: 54.16666865348816
Accuracy for this batch: 41.66666567325592
Accuracy for this batch: 41.66666567325592
Accuracy for this batch: 54.16666865348816
Accuracy for this batch: 45.83333432674408
Accuracy for this batch: 50.0
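Individual batches of 24 reviews give noisy estimates, so it helps to average them. The sketch below simply averages the ten values printed above; because every batch has the same size, this mean equals the accuracy over all 240 sampled reviews.

```python
batch_accuracies = [
    41.66666567325592, 41.66666567325592, 33.33333432674408, 37.5,
    54.16666865348816, 41.66666567325592, 41.66666567325592,
    54.16666865348816, 45.83333432674408, 50.0,
]  # the ten values printed above, in percent

mean_accuracy = sum(batch_accuracies) / len(batch_accuracies)
print(round(mean_accuracy, 2))  # 44.17
```

For a two-class problem with balanced classes, random guessing would score about 50%, which is a useful baseline when reading this number.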