To document the process of getting into deep learning through competitions, I'm starting a new series: "From Traditional Methods to Deep Learning".
The Kaggle competition Bag of Words Meets Bags of Popcorn is sentiment analysis on movie reviews, which can be treated as binary classification of short texts (positive vs. negative). The labeled dataset looks like this:
```
id	sentiment	review
"2381_9"	1	"\"The Classic War of the Worlds\" by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H. G. Wells' classic book. Mr. Hines succeeds in doing so. ..."
"2486_3"	0	"What happens when an army of wetbacks, towelheads, and Godless Eastern European commies gather their forces south of the border? Gary Busey kicks their butts, of course. Another laughable example of Reagan-era cultural fallout, Bulletproof wastes a decent supporting cast headed by L Q Jones and Thalmus Rasulala."
```
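For reference, a minimal loading sketch with pandas: `labeledTrainData.tsv` is the competition's tab-separated label file, and `quoting=3` (i.e. `csv.QUOTE_NONE`) keeps the double quotes embedded in reviews from confusing the parser.

```python
import pandas as pd

# tab-separated; quoting=3 == csv.QUOTE_NONE, so double quotes
# inside reviews are passed through verbatim
train = pd.read_csv("labeledTrainData.tsv", header=0,
                    delimiter="\t", quoting=3)
print(train.shape)    # (25000, 3): id, sentiment, review
```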
The evaluation metric is AUC. Therefore, on the test set we should output probabilities rather than class labels; that is, use `predict_proba` instead of `predict`:
```python
# random forest: take the probability of the positive class
result = forest.predict_proba(test_data_features)[:, 1]

# not `predict`, which returns hard 0/1 labels
# result = forest.predict(test_data_features)
```
With BoW features and an RF (random forest) classifier, predicting class labels gives an AUC of 0.84436, while predicting probabilities gives 0.92154.
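To make the difference concrete, here is a minimal sketch of scoring both outputs with scikit-learn's `roc_auc_score` on a held-out split (`x_val` and `y_val` are hypothetical validation arrays, not from the original post):

```python
from sklearn.metrics import roc_auc_score

# AUC measures how well predictions *rank* positives above negatives,
# so continuous scores preserve information that hard 0/1 labels destroy
auc_proba = roc_auc_score(y_val, forest.predict_proba(x_val)[:, 1])
auc_label = roc_auc_score(y_val, forest.predict(x_val))
```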
Traditional methods generally use two kinds of features: BoW (bag of words) and n-grams. BoW ignores word order and simply counts words, while n-grams take word order into account; for example, the bigrams "dog run" and "run dog" are two distinct features. BoW can be vectorized with `CountVectorizer`:
```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)
train_data_features = vectorizer.fit_transform(clean_train_reviews)
```
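The RF classifier behind the 0.92154 figure can then be fit on these features; a sketch, assuming the `train` dataframe loaded earlier (`n_estimators=100` follows the competition tutorial's setting):

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100)
forest.fit(train_data_features, train["sentiment"])
```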
Within a sentence, different words differ in importance, so TF-IDF is needed to weight them. n-gram features can be vectorized with `TfidfVectorizer`:
```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=40000,
                             ngram_range=(1, 3),
                             sublinear_tf=True)
train_x = vectorizer.fit_transform(clean_train_reviews)
```
Using unigram, bigram, and trigram features with an RF classifier gives an AUC of 0.93058; switching to an LR (logistic regression) classifier raises it to 0.96330.
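A sketch of the LR run, assuming `clean_test_reviews` holds the cleaned test text; the key point is that the test set must be transformed with the vocabulary fitted on the training set:

```python
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(train_x, train["sentiment"])

test_x = vectorizer.transform(clean_test_reviews)  # reuse the fitted vocabulary
result = lr.predict_proba(test_x)[:, 1]
```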
The competition tutorial shows how to classify with word2vec word-vector features, and gives two ideas for generating them: averaging the word vectors of all words in a review, or clustering the word vectors and counting cluster occurrences (a bag of centroids). The generated features are then fed to a classifier. However, the AUC of this approach is underwhelming (around 0.91): whether averaging or clustering, it both blurs the information in the individual word vectors and ignores word order and word importance, so it classifies worse than TF-IDF-weighted n-grams.
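For the averaging idea, a minimal sketch written against gensim's `model.wv` interface (my paraphrase of the tutorial's approach, with hypothetical names):

```python
import numpy as np

def average_feature_vec(words, model, num_features):
    """Average the word2vec vectors of all in-vocabulary words in one review."""
    feature_vec = np.zeros(num_features, dtype="float32")
    nwords = 0
    for word in words:
        if word in model.wv:              # skip out-of-vocabulary words
            feature_vec += model.wv[word]
            nwords += 1
    if nwords > 0:
        feature_vec /= nwords             # reviews of different lengths stay comparable
    return feature_vec
```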
After releasing word2vec, Mikolov went on to develop doc2vec (implemented in gensim). Simply put, it turns a piece of text into a single vector. Unlike word2vec, the input includes not only each doc's word list but also a tag (`TaggedDocument`). As it turned out, doc2vec performed even worse than word2vec-derived features, with an AUC of only 0.87915.
```python
from gensim.models import Doc2Vec

# note: gensim >= 4.0 renames `size` to `vector_size`
doc2vec = Doc2Vec(sentences, workers=8, size=300, min_count=40,
                  window=10, sample=1e-4)
```
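The `sentences` above are `TaggedDocument` objects: each pairs a review's word list with a unique tag. A construction sketch, assuming `clean_train_reviews` holds tokenized reviews:

```python
from gensim.models.doc2vec import TaggedDocument

# each document = a word list plus a unique string tag
sentences = [TaggedDocument(words=words, tags=[str(i)])
             for i, words in enumerate(clean_train_reviews)]
```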
pangolulu tried ensembling BoW with doc2vec using a stacking approach: at level L1, BoW features go through an LR classifier and doc2vec features through an RBF-SVM classifier; at level L2, the L1 predicted probabilities are combined into a new feature matrix and fed to an LR classifier; this is repeated over several iterations and the predictions averaged. The ensemble structure, roughly:
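A rough sketch of that stacking scheme, reconstructed from the description above (not pangolulu's actual code; `bow_x` and `d2v_x` stand in for the two L1 feature matrices, `y` for the labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict

# L1: out-of-fold probabilities from each base model
p_bow = cross_val_predict(LogisticRegression(), bow_x, y,
                          cv=5, method="predict_proba")[:, 1]
p_d2v = cross_val_predict(SVC(kernel="rbf", probability=True), d2v_x, y,
                          cv=5, method="predict_proba")[:, 1]

# L2: the two probability columns become a new feature matrix for LR;
# repeating with different seeds and averaging gives the final prediction
l2_x = np.column_stack([p_bow, p_d2v])
stacker = LogisticRegression().fit(l2_x, y)
```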
The AUC comparison across all of the above methods:
Feature | Classifier | AUC |
---|---|---|
BoW | RF | 0.92154 |
(1,3) gram, tfidf | LR | 0.96330 |
(1,3) gram, tfidf | RF | 0.93058 |
word2vec + avg | RF | 0.90798 |
word2vec + cluster | RF | 0.91485 |
doc2vec | RF | 0.87915 |
doc2vec | LR | 0.90573 |
BoW, doc2vec | ensemble | 0.93926 |