To document the process of getting into deep learning through competitions, I'm starting a new series: "From Traditional Methods to Deep Learning".
The Kaggle competition Bag of Words Meets Bags of Popcorn is sentiment analysis on movie reviews, which can be treated as binary classification of short texts (positive vs. negative). The labeled dataset looks like this:
```
id	sentiment	review
"2381_9"	1	"\"The Classic War of the Worlds\" by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H. G. Wells' classic book. Mr. Hines succeeds in doing so. ..."
"2486_3"	0	"What happens when an army of wetbacks, towelheads, and Godless Eastern European commies gather their forces south of the border? Gary Busey kicks their butts, of course. Another laughable example of Reagan-era cultural fallout, Bulletproof wastes a decent supporting cast headed by L Q Jones and Thalmus Rasulala."
```
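For reference, a minimal loading sketch with pandas: `labeledTrainData.tsv` is the competition's tab-separated label file, and `quoting=3` (i.e. `csv.QUOTE_NONE`) keeps the double quotes embedded in reviews from confusing the parser.

```python
import pandas as pd

# tab-separated; quoting=3 == csv.QUOTE_NONE, so double quotes
# inside reviews are passed through verbatim
train = pd.read_csv("labeledTrainData.tsv", header=0,
                    delimiter="\t", quoting=3)
print(train.shape)    # (25000, 3): id, sentiment, review
```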
The evaluation metric is AUC. Therefore, on the test set we should output probabilities rather than class labels; that is, use `predict_proba` instead of `predict`:
```python
# random forest: take the probability of the positive class
result = forest.predict_proba(test_data_features)[:, 1]

# not `predict`, which returns hard 0/1 labels
# result = forest.predict(test_data_features)
```
With BoW features and an RF (random forest) classifier, predicting class labels gives an AUC of 0.84436, while predicting probabilities gives 0.92154.
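To make the difference concrete, here is a minimal sketch of scoring both outputs with scikit-learn's `roc_auc_score` on a held-out split (`x_val` and `y_val` are hypothetical validation arrays, not from the original post):

```python
from sklearn.metrics import roc_auc_score

# AUC measures how well predictions *rank* positives above negatives,
# so continuous scores preserve information that hard 0/1 labels destroy
auc_proba = roc_auc_score(y_val, forest.predict_proba(x_val)[:, 1])
auc_label = roc_auc_score(y_val, forest.predict(x_val))
```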
Traditional methods generally use two kinds of features: BoW (bag of words) and n-grams. BoW ignores word order and simply counts words, while n-grams take word order into account; for example, the bigrams "dog run" and "run dog" are two distinct features. BoW can be vectorized with `CountVectorizer`:
```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)
train_data_features = vectorizer.fit_transform(clean_train_reviews)
```
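The RF classifier behind the 0.92154 figure can then be fit on these features; a sketch, assuming the `train` dataframe loaded earlier (`n_estimators=100` follows the competition tutorial's setting):

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100)
forest.fit(train_data_features, train["sentiment"])
```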
Within a sentence, different words differ in importance, so TF-IDF is needed to weight them. n-gram features can be vectorized with `TfidfVectorizer`:
```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=40000,
                             ngram_range=(1, 3),
                             sublinear_tf=True)
train_x = vectorizer.fit_transform(clean_train_reviews)
```
Using unigram, bigram, and trigram features with an RF classifier gives an AUC of 0.93058; switching to an LR (logistic regression) classifier raises it to 0.96330.
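A sketch of the LR run, assuming `clean_test_reviews` holds the cleaned test text; the key point is that the test set must be transformed with the vocabulary fitted on the training set:

```python
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(train_x, train["sentiment"])

test_x = vectorizer.transform(clean_test_reviews)  # reuse the fitted vocabulary
result = lr.predict_proba(test_x)[:, 1]
```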
The competition tutorial shows how to classify with word2vec word-vector features, and gives two ideas for generating them: averaging the word vectors of all words in a review, or clustering the word vectors and counting cluster occurrences (a bag of centroids). The generated features are then fed to a classifier. However, the AUC of this approach is underwhelming (around 0.91): whether averaging or clustering, it both blurs the information in the individual word vectors and ignores word order and word importance, so it classifies worse than TF-IDF-weighted n-grams.
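For the averaging idea, a minimal sketch written against gensim's `model.wv` interface (my paraphrase of the tutorial's approach, with hypothetical names):

```python
import numpy as np

def average_feature_vec(words, model, num_features):
    """Average the word2vec vectors of all in-vocabulary words in one review."""
    feature_vec = np.zeros(num_features, dtype="float32")
    nwords = 0
    for word in words:
        if word in model.wv:              # skip out-of-vocabulary words
            feature_vec += model.wv[word]
            nwords += 1
    if nwords > 0:
        feature_vec /= nwords             # reviews of different lengths stay comparable
    return feature_vec
```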
After releasing word2vec, Mikolov went on to develop doc2vec (implemented in gensim). Simply put, it turns a piece of text into a single vector. Unlike word2vec, the input includes not only each doc's word list but also a tag (`TaggedDocument`). As it turned out, doc2vec performed even worse than word2vec-derived features, with an AUC of only 0.87915.
```python
from gensim.models import Doc2Vec

# note: gensim >= 4.0 renames `size` to `vector_size`
doc2vec = Doc2Vec(sentences, workers=8, size=300, min_count=40,
                  window=10, sample=1e-4)
```
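The `sentences` above are `TaggedDocument` objects: each pairs a review's word list with a unique tag. A construction sketch, assuming `clean_train_reviews` holds tokenized reviews:

```python
from gensim.models.doc2vec import TaggedDocument

# each document = a word list plus a unique string tag
sentences = [TaggedDocument(words=words, tags=[str(i)])
             for i, words in enumerate(clean_train_reviews)]
```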
pangolulu tried ensembling BoW with doc2vec using a stacking approach: at level L1, BoW features go through an LR classifier and doc2vec features through an RBF-SVM classifier; at level L2, the L1 predicted probabilities are combined into a new feature matrix and fed to an LR classifier; this is repeated over several iterations and the predictions averaged. The ensemble structure, roughly:
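A rough sketch of that stacking scheme, reconstructed from the description above (not pangolulu's actual code; `bow_x` and `d2v_x` stand in for the two L1 feature matrices, `y` for the labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict

# L1: out-of-fold probabilities from each base model
p_bow = cross_val_predict(LogisticRegression(), bow_x, y,
                          cv=5, method="predict_proba")[:, 1]
p_d2v = cross_val_predict(SVC(kernel="rbf", probability=True), d2v_x, y,
                          cv=5, method="predict_proba")[:, 1]

# L2: the two probability columns become a new feature matrix for LR;
# repeating with different seeds and averaging gives the final prediction
l2_x = np.column_stack([p_bow, p_d2v])
stacker = LogisticRegression().fit(l2_x, y)
```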
The AUC comparison across all of the above methods:
Feature | Classifier | AUC |
---|---|---|
BoW | RF | 0.92154 |
(1,3) gram, tfidf | LR | 0.96330 |
(1,3) gram, tfidf | RF | 0.93058 |
word2vec + avg | RF | 0.90798 |
word2vec + cluster | RF | 0.91485 |
doc2vec | RF | 0.87915 |
doc2vec | LR | 0.90573 |
BoW, doc2vec | ensemble | 0.93926 |