To keep the code approachable, I will show the result of each step as it runs, so readers get an intuitive feel for what every piece of code does. I will start with just a few data samples so each snippet's output is easy to follow; at the end of the article I run the same pipeline on a larger dataset and report the results for reference.
1. First, my data lives in two Excel files: one holds the positive (pos) reviews and the other the negative (neg) reviews, named poss.xlsx and negg.xlsx respectively. Their contents are shown below:
Contents of poss.xlsx:
Contents of negg.xlsx:
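If you don't have the original spreadsheets, a minimal stand-in with the same layout (one review per row, single column, no header) can be generated as below. The example reviews are hypothetical, not the author's data:

import pandas as pd

# hypothetical stand-in data: one review per row, no header, single column
pd.DataFrame(['这个手机很好用', '物流很快,非常满意']).to_excel('poss.xlsx', header=False, index=False)
pd.DataFrame(['质量太差了', '用了一天屏幕就坏了']).to_excel('negg.xlsx', header=False, index=False)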
2. Next, read in the data. The code is as follows:
import numpy as np
import pandas as pd

pos = pd.read_excel('poss.xlsx', header=None)  # read the positive reviews into a DataFrame
pos['label'] = 1                               # label column: 1 = positive
neg = pd.read_excel('negg.xlsx', header=None)
neg['label'] = 0                               # label column: 0 = negative
all = pos.append(neg, ignore_index=True)       # merge the corpora (on pandas >= 2.0 use pd.concat instead)
print(all)

Running this code produces the following output:
Next comes word segmentation:
import jieba

cw = lambda s: list(jieba.cut(s))  # tokenize a sentence with jieba
all['words'] = all[0].apply(cw)    # segment every review in column 0
print(all['words'])

Running this code produces the following output:
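If jieba is new to you, here is a quick standalone illustration of what cut returns (the exact segmentation may vary with your jieba version and dictionary):

import jieba

# segment a hypothetical review into a list of tokens
print(list(jieba.cut('这个手机很好用')))
# typical output: ['这个', '手机', '很', '好用']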
Now collect all the words into one big vocabulary:
content = []
for i in all['words']:
    content.extend(i)  # flatten the per-review token lists into one list
abc = pd.Series(content).value_counts()  # word frequencies, sorted descending
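For readers unfamiliar with value_counts(): it returns a Series indexed by the unique items and sorted by count in descending order, which is what makes the frequency-rank numbering in the next step work. A toy example:

import pandas as pd

print(pd.Series(['好', '好', '差']).value_counts())
# 好    2
# 差    1
# dtype: int64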
Assign each word a fixed index:
abc[:] = range(1, len(abc) + 1)  # number the words from 1 by frequency rank
abc[''] = 0                      # reserve index 0 for the padding token

maxlen = 10  # for this small demo, every review is truncated/padded to 10 tokens

def doc2num(s, maxlen):
    s = [i for i in s if i in abc.index]             # drop words not in the vocabulary
    s = s[:maxlen] + [''] * max(0, maxlen - len(s))  # truncate or right-pad to maxlen
    return list(abc[s])                              # map words to their indices

all['doc2num'] = all['words'].apply(lambda s: doc2num(s, maxlen))
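To make doc2num concrete, here is a hypothetical trace; the actual indices depend on the word frequencies in your own corpus:

# suppose, hypothetically, that abc['好'] == 1 and abc['用'] == 25
print(doc2num(['好', '用'], 10))
# -> [1, 25, 0, 0, 0, 0, 0, 0, 0, 0]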
The result is as follows:
Shuffle the data and build the input arrays for Keras:
idx = np.arange(len(all))  # use an array: range() cannot be shuffled in Python 3
np.random.shuffle(idx)
all = all.loc[idx]
x = np.array(list(all['doc2num']))
y = np.array(list(all['label']))
y = y.reshape((-1, 1))     # Keras expects a 2-D target array
First, let's look at the form of the data inside x, shown in the figure below:
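You can also inspect x directly; each row is one review encoded as a fixed-length vector of word indices, zero-padded on the right:

print(x.shape)  # (number_of_reviews, maxlen)
print(x[0])     # the first review as a row of word indices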
Next, build the convolutional neural network with Keras:
model = Sequential()
model.add(Embedding(len(abc), embedding_vecor_length, input_length=maxlen))
model.add(Convolution1D(nb_filter=nb_filter,
                        filter_length=filter_length,
                        border_mode='valid',
                        activation='relu'))
model.add(GlobalMaxPooling1D())   # max over the sequence dimension
model.add(Dense(128))
model.add(Dropout(0.2))
model.add(Activation('relu'))
model.add(Dense(1))
model.add(Activation('sigmoid'))  # binary sentiment output
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=batch_size, nb_epoch=nb_epoch,
          validation_data=(X_test, y_test))
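A note for readers on newer versions: this snippet uses the Keras 1 API (Convolution1D, nb_filter, filter_length, border_mode, nb_epoch). On Keras 2 you would replace the convolution layer and the fit call with the equivalent spellings below; the behavior is the same:

from keras.layers import Conv1D

# Keras 2 spelling of the same convolution layer
model.add(Conv1D(filters=nb_filter, kernel_size=filter_length,
                 padding='valid', activation='relu'))
# and nb_epoch was renamed to epochs in fit()
model.fit(X_train, y_train, batch_size=batch_size, epochs=nb_epoch,
          validation_data=(X_test, y_test))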
Finally, here is the complete code for classifying the sentiment of 1000 positive and 1000 negative reviews:
from __future__ import print_function
import jieba
import pandas as pd
import numpy as np
np.random.seed(1337)  # for reproducibility

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import Convolution1D, GlobalMaxPooling1D

# hyperparameters
embedding_vecor_length = 32
maxlen = 200        # pad/truncate every review to 200 tokens
min_count = 5       # (declared but not used below)
batch_size = 32
nb_epoch = 10
nb_filter = 128
filter_length = 3

# load the corpora and label them
pos = pd.read_excel('poss.xlsx', header=None)
pos['label'] = 1
neg = pd.read_excel('negg.xlsx', header=None)
neg['label'] = 0
all = pos.append(neg, ignore_index=True)

# segment every review with jieba
cw = lambda s: list(jieba.cut(s))
all['words'] = all[0].apply(cw)

# build the vocabulary and index words by frequency rank
content = []
for i in all['words']:
    content.extend(i)
abc = pd.Series(content).value_counts()
abc[:] = range(1, len(abc) + 1)
abc[''] = 0  # index 0 is the padding token

def doc2num(s, maxlen):
    s = [i for i in s if i in abc.index]
    s = s[:maxlen] + [''] * max(0, maxlen - len(s))
    return list(abc[s])

all['doc2num'] = all['words'].apply(lambda s: doc2num(s, maxlen))

# shuffle and split into train/test sets
idx = np.arange(len(all))  # range() cannot be shuffled in Python 3
np.random.shuffle(idx)
all = all.loc[idx]
x = np.array(list(all['doc2num']))
y = np.array(list(all['label']))
y = y.reshape((-1, 1))

train_num = 1600
X_train, y_train = x[:train_num], y[:train_num]
X_test, y_test = x[train_num:], y[train_num:]

# build and train the CNN
model = Sequential()
model.add(Embedding(len(abc), embedding_vecor_length, input_length=maxlen))
model.add(Convolution1D(nb_filter=nb_filter, filter_length=filter_length,
                        border_mode='valid', activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(128))
model.add(Dropout(0.2))
model.add(Activation('relu'))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print('Train...')
model.fit(X_train, y_train, batch_size=batch_size, nb_epoch=nb_epoch,
          validation_data=(X_test, y_test))
score, acc = model.evaluate(X_test, y_test, verbose=0)
print('Test score:', score)
print('Test accuracy:', acc)
The result is as follows:
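After training, the same doc2num pipeline can be reused to score a new, unseen review. A minimal sketch (the helper predict_sentiment below is mine, not part of the author's script; a sigmoid output above 0.5 reads as positive):

def predict_sentiment(text):
    # encode the new review exactly like the training data
    vec = np.array([doc2num(list(jieba.cut(text)), maxlen)])
    return model.predict(vec)[0][0]  # sigmoid score in [0, 1]

print(predict_sentiment('这个产品非常好用'))  # close to 1 => predicted positive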