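Naive Bayes: Sentiment Analysis and Prediction on Douban Top 250 Movie Reviews

Preface

This article uses the naive Bayes algorithm to perform sentiment analysis and prediction on reviews of the Douban Top 250 movies.

I have recently been studying positive/negative sentiment classification in natural language processing, but almost every hands-on example I could find is the sentiment analysis of IMDB reviews on Kaggle.

So here I use the most basic algorithm, naive Bayes, to analyze and predict the sentiment of Douban reviews.

I referred to github.com/aeternae/IM… for this work, and I am very grateful for it.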

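Naive Bayes classifier

Bayesian classification is a general name for a family of classification algorithms. They are all built on Bayes' theorem, hence the shared name.

These algorithms are often used for article classification and for filtering spam email and spam comments. Naive Bayes gives good results at very low cost.

The underlying question is: given a conditional probability, how do we obtain the probability with the two events swapped? In other words, knowing P(A|B), how do we find P(B|A)?

P(B|A) is the probability that event B occurs given that event A has already occurred. It is called the conditional probability of B given A.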

The naive Bayes formula

P(B|A) = \frac{P(A|B)P(B)}{P(A)}

An easy-to-follow video tutorial

Youtube www.youtube.com/watch?v=Aqo…

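A not entirely appropriate example

Suppose we want to know the relationship between being a programmer and being bald. We can compute it with the naive Bayes formula.

What we want is P(bald | programmer), that is, the probability of being bald given that one is a programmer.

I will never go bald in my life (((o(ﾟ▽ﾟ)o))) !!!

Substituting into the naive Bayes formula:

P(bald | programmer) = \frac{P(programmer | bald) P(bald)}{P(programmer)}

The known data is shown in the table below.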

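Name	Occupation	Bald?
Kratos	God of War	Yes
Agent 47	Assassin	Yes
Saitama	Superhero	Yes
Thanos	Head of the Family Planning Office	Yes
Jason Statham	Tough guy	Yes
A certain 996 programmer	Programmer	Yes
Me	Programmer	No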

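Of the six bald people, one is a programmer, so P(programmer | bald) = 1/6. Six of the seven people are bald, so P(bald) = 6/7, and two of the seven are programmers, so P(programmer) = 2/7. Applying the naive Bayes formula to this table gives:

P(bald | programmer) = \frac{\frac16 \times \frac67}{\frac27} = \frac{\frac17}{\frac27} = \frac{1}{2}

This example briefly illustrates the basic use of the naive Bayes formula.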
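The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the `people` list restates the table, and the helper name `prob` is my own choice for illustration.

```python
from fractions import Fraction

# (occupation, is_bald) pairs restating the table above
people = [
    ("God of War", True),
    ("Assassin", True),
    ("Superhero", True),
    ("Head of the Family Planning Office", True),
    ("Tough guy", True),
    ("Programmer", True),
    ("Programmer", False),
]


def prob(predicate, rows):
    """Fraction of rows for which predicate(row) is true."""
    rows = list(rows)
    return Fraction(sum(1 for row in rows if predicate(row)), len(rows))


def is_bald(person):
    return person[1]


def is_programmer(person):
    return person[0] == "Programmer"


bald_people = [p for p in people if is_bald(p)]

p_bald = prob(is_bald, people)                             # 6/7
p_programmer = prob(is_programmer, people)                 # 2/7
p_programmer_given_bald = prob(is_programmer, bald_people)  # 1/6

# Bayes' theorem: P(bald | programmer)
p_bald_given_programmer = p_programmer_given_bald * p_bald / p_programmer
print(p_bald_given_programmer)  # 1/2
```

Using `Fraction` keeps the intermediate values exact, so the result can be compared directly with the hand calculation.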

Next, I use reviews from the Douban Top 250 ranking to train a naive Bayes model and predict whether a review is positive or negative.

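Sentiment analysis of Douban Top 250 reviews

First we need a corpus of Douban Top 250 reviews. I scraped about 50,000 reviews with Scrapy to use for training and validation.

Douban review crawler: github.com/3inchtime/d…

With the corpus in hand, we can start the actual development.

I recommend using Jupyter for this work.

All of the code below is available on my GitHub. Suggestions are welcome.

github.com/3inchtime/d…

First, load the corpus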

# -*- coding: utf-8 -*-
import csv
import random

import jieba


file_path = './data/review.csv'
# Custom dictionary that corrects some of jieba's segmentations
jieba.load_userdict('./data/userdict.txt')


# Load the corpus saved as CSV; each row is (sentiment, review).
# The file is assumed to be UTF-8 encoded.
def load_corpus(corpus_path):
    with open(corpus_path, 'r', encoding='utf-8', newline='') as f:
        rows = list(csv.reader(f))

    # Shuffle the rows so the later split does not depend on the file order
    random.shuffle(rows)

    review_list = []
    sentiment_list = []
    for row in rows:
        review_list.append(row[1])
        sentiment_list.append(row[0])

    return review_list, sentiment_list


review_list, sentiment_list = load_corpus(file_path)

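Before training, the dataset is usually shuffled to randomize the order of the samples, so that the train/test split is not biased by the order in which the reviews were stored. That is why random.shuffle() is used here.

jieba.load_userdict('./data/userdict.txt') loads a dictionary I built myself to correct some of jieba's inaccurate segmentations. It improves accuracy by roughly 1%.

For example, jieba splits 不是很喜歡 ("don't really like it") into the two words 不是 and 很喜歡, so the sentence is very likely to be predicted as a positive review.

So I added many phrases like this to the custom dictionary, which raised the accuracy a little.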
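To see why a dictionary entry changes the segmentation, here is a toy forward-maximum-matching segmenter. It is only an illustration under my own assumptions: jieba's real algorithm is more sophisticated, and the function name `forward_max_match` and the tiny dictionaries are made up for this sketch.

```python
def forward_max_match(text, dictionary, max_len=5):
    """Greedy longest-match segmentation.

    Characters that match no dictionary entry become single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


base_dict = {"不是", "很喜歡", "喜歡"}
print(forward_max_match("不是很喜歡", base_dict))
# ['不是', '很喜歡']

# Adding the whole phrase keeps the negation attached to the sentiment word
custom_dict = base_dict | {"不是很喜歡"}
print(forward_max_match("不是很喜歡", custom_dict))
# ['不是很喜歡']
```

Once the phrase is a single token, the classifier can learn a weight for it that is separate from the positive weight of 很喜歡.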

Then split the whole corpus into a test set and a training set at a ratio of 1:4

n = len(review_list) // 5

train_review_list, train_sentiment_list = review_list[n:], sentiment_list[n:]
test_review_list, test_sentiment_list = review_list[:n], sentiment_list[:n]


Word segmentation

Use jieba to segment the corpus into words and remove the stopwords.

import re
import jieba


stopword_path = './data/stopwords.txt'


def load_stopwords(file_path):
    with open(file_path, encoding='UTF-8') as words:
        return set(line.strip() for line in words)


# Load the stopwords once instead of re-reading the file for every review
stop_words = load_stopwords(stopword_path)


def review_to_text(review):
    # Keep only Chinese characters and ASCII letters;
    # digits, punctuation, and whitespace are removed
    review = re.sub("[^\u4e00-\u9fa5a-zA-Z]", '', review)
    # Segment the review, then drop the stopwords
    return [w for w in jieba.cut(review) if w not in stop_words]


# Reviews used for training
review_train = [' '.join(review_to_text(review)) for review in train_review_list]
# Positive/negative labels for the training reviews
sentiment_train = train_sentiment_list

# Reviews used for testing
review_test = [' '.join(review_to_text(review)) for review in test_review_list]
# Positive/negative labels for the test reviews
sentiment_test = test_sentiment_list

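TF-IDF and term-frequency vectorization

TF-IDF is a weighting technique commonly used in information retrieval and data mining. It estimates how important a word is from the number of times the word appears in a document and the number of documents in the whole corpus that contain it.

Its advantage is that it filters out words that are common but unimportant while keeping the words that matter to the text.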
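The idea can be sketched in plain Python with the textbook definition, tf = count / document length and idf = log(N / df). This is an illustration only: the three "reviews" are made-up data, the function name `tf_idf` is my own, and scikit-learn uses a smoothed, normalized variant, so its numbers will differ.

```python
import math
from collections import Counter

# Three tiny, already segmented "reviews" (made-up data)
docs = [
    ["電影", "很", "好看"],
    ["電影", "很", "無聊"],
    ["電影", "好看", "好看"],
]


def tf_idf(docs):
    """Return one {word: tf-idf score} dict per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each word
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores


scores = tf_idf(docs)
for doc_scores in scores:
    print({word: round(score, 3) for word, score in doc_scores.items()})
```

電影 appears in every document, so its idf is log(1) = 0 and it contributes nothing, while 無聊 appears in only one document and receives the highest weight there.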

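CountVectorizer() converts each document into a vector by counting how often each word occurs in it.

The CountVectorizer class turns the words in the texts into a term-frequency matrix. An element a[i][j] of that matrix is the number of times word j occurs in document i. The counts are computed by its fit_transform function.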
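A term-frequency matrix of this kind can be built by hand as follows. This is a simplified sketch, not scikit-learn's implementation: the documents are made up, and tokens are obtained with a plain `split()` on spaces.

```python
from collections import Counter

# Space-separated, already segmented documents (made-up data)
docs = ["電影 很 好看", "電影 很 無聊", "好看 好看"]

tokenized = [doc.split() for doc in docs]
# Column order: the sorted vocabulary
vocabulary = sorted({word for doc in tokenized for word in doc})
index = {word: j for j, word in enumerate(vocabulary)}

# matrix[i][j] = number of times word j occurs in document i
matrix = [[0] * len(vocabulary) for _ in docs]
for i, doc in enumerate(tokenized):
    for word, count in Counter(doc).items():
        matrix[i][index[word]] = count

print(vocabulary)
for row in matrix:
    print(row)
```

One difference worth knowing: CountVectorizer's default token_pattern only keeps tokens of two or more characters, so a single-character word such as 很 is dropped unless token_pattern is changed.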

TfidfTransformer computes the TF-IDF value of each word from the counts produced by the vectorizer.

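from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

# Example vectorizer configurations. Note that the Pipeline below builds its own
# CountVectorizer with default settings and does not use these two objects.
count_vec = CountVectorizer(max_df=0.8, min_df=3)

tfidf_vec = TfidfVectorizer()

# Pipeline wraps all the steps into a single estimator, so the vocabulary and
# model fitted on the training set are reused unchanged on new data such as the test set.
def MNB_Classifier():
    return Pipeline([
        ('count_vec', CountVectorizer()),
        ('mnb', MultinomialNB())
    ])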

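max_df acts as a threshold when the vocabulary is built: a word whose document frequency is higher than max_df is not used as a feature.

If the parameter is a float, it is the proportion of documents in the corpus that contain the word. If it is an int, it is the absolute number of documents that contain the word.

min_df works like max_df, except that a word whose document frequency is lower than min_df is not used as a feature.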
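The effect of the two thresholds can be sketched with a small document-frequency filter. This is my own simplified illustration of the behaviour described above, not scikit-learn's code; the function name `filter_vocabulary` and the sample documents are made up.

```python
from collections import Counter


def filter_vocabulary(tokenized_docs, min_df=1, max_df=1.0):
    """Keep the words whose document frequency lies within [min_df, max_df].

    A float threshold is a proportion of documents; an int is an absolute count.
    """
    n_docs = len(tokenized_docs)
    # Count each word once per document
    df = Counter(word for doc in tokenized_docs for word in set(doc))
    low = min_df * n_docs if isinstance(min_df, float) else min_df
    high = max_df * n_docs if isinstance(max_df, float) else max_df
    return sorted(word for word, count in df.items() if low <= count <= high)


docs = [
    ["電影", "好看"],
    ["電影", "無聊"],
    ["電影", "好看", "感人"],
    ["電影", "好看"],
]
kept = filter_vocabulary(docs, min_df=2, max_df=0.8)
print(kept)  # ['好看']
```

Here 電影 is dropped because it appears in all four documents (more than 80%), and 無聊 and 感人 are dropped because each appears in only one document.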

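With that, we have built the Pipeline used for training and testing.

Next, train on the training set with Pipeline.fit()

Then call Pipeline.score() directly to predict on the test set and score the result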

mnbc_clf = MNB_Classifier()

# Train the model
mnbc_clf.fit(review_train, sentiment_train)

# Accuracy on the test set
print('Test set accuracy: {}'.format(mnbc_clf.score(review_test, sentiment_test)))

That completes the whole flow from training to testing.
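To make the fit/score steps less of a black box, here is a rough from-scratch sketch of multinomial naive Bayes with add-one (Laplace) smoothing. It is not scikit-learn's implementation; the class name `TinyMultinomialNB`, the labels "pos"/"neg", and the training sentences are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.split())
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def log_scores(self, doc):
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, n in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n / n_docs)  # log prior of the class
            for word in doc.split():
                if word in self.vocabulary:  # words never seen in training are ignored
                    count = self.word_counts[label][word]
                    score += math.log((count + 1) / (total + vocab_size))
            scores[label] = score
        return scores

    def predict(self, doc):
        scores = self.log_scores(doc)
        return max(scores, key=scores.get)

    def score(self, docs, labels):
        hits = sum(self.predict(d) == y for d, y in zip(docs, labels))
        return hits / len(labels)


train_docs = ["好看 感人", "好看 經典", "無聊 失望", "失望 難看"]
train_labels = ["pos", "pos", "neg", "neg"]
clf = TinyMultinomialNB().fit(train_docs, train_labels)

print(clf.predict("很 好看"))    # pos
print(clf.predict("無聊 難看"))  # neg
```

Working in log space avoids underflow when many small probabilities are multiplied, and the smoothing keeps a word that never appeared with a class from forcing that class's probability to zero.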

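Test-set accuracy is typically around 79% to 80%.

This is because a large share of the positive reviews contain words with negative sentiment. Take this review of the documentary The Cove, for example:

I think most people who are moved by this film do not know that China's baiji dolphin has been extinct for 8 years, nor that only about 1,000 finless porpoises are left in the Yangtze. Rather than lamenting and cursing the Japanese for hunting dolphins, it would be better to do something practical and protect the finless porpoises in the Yangtze, which will also be gone in a few years. What Chinese people have done is no better than what the Japanese have done.

So removing positive reviews of this kind from the corpus could raise the accuracy.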

Saving the trained model

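import pickle

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Assumptions: the original snippet does not show these three definitions.
# The path matches the analyzer's model_path; the vectorizer reuses the count_vec settings.
model_export_path = './data/bayes.pkl'
vectorizer = CountVectorizer(max_df=0.8, min_df=3)
tfidftransformer = TfidfTransformer()

# Convert to a term-frequency matrix first, then compute the TF-IDF values
tfidf = tfidftransformer.fit_transform(vectorizer.fit_transform(review_train))
# Multinomial naive Bayes classifier
clf = MultinomialNB().fit(tfidf, sentiment_train)

with open(model_export_path, 'wb') as file:
    d = {
        "clf": clf,
        "vectorizer": vectorizer,
        "tfidftransformer": tfidftransformer,
    }
    pickle.dump(d, file)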

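Using the trained model to predict review sentiment

Here is the complete source code. It is very simple: I wrapped the whole processing logic in a class so that it is convenient to use.

If you need it, you can clone it directly from my GitHub.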

# -*- coding: utf-8 -*-
import re
import pickle

import numpy as np
import jieba


class SentimentAnalyzer(object):
    def __init__(self, model_path, userdict_path, stopword_path):
        self.clf = None
        self.vectorizer = None
        self.tfidftransformer = None
        self.model_path = model_path
        self.stopword_path = stopword_path
        self.userdict_path = userdict_path
        self.stop_words = []
        self.tokenizer = jieba.Tokenizer()
        self.initialize()

    # Load the stopwords, the pickled model, and the custom dictionary
    def initialize(self):
        with open(self.stopword_path, encoding='UTF-8') as words:
            self.stop_words = [i.strip() for i in words.readlines()]

        with open(self.model_path, 'rb') as file:
            model = pickle.load(file)
            self.clf = model['clf']
            self.vectorizer = model['vectorizer']
            self.tfidftransformer = model['tfidftransformer']
        if self.userdict_path:
            self.tokenizer.load_userdict(self.userdict_path)

    # Strip URLs and irrelevant characters, then split the text into sentences
    def replace_text(self, text):
        text = re.sub(r'((https?|ftp|file)://)?[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|]\.(com|cn)', '', text)
        text = text.replace('\u3000', '').replace('\xa0', '').replace('」', '').replace('"', '')
        text = text.replace(' ', '').replace('↵', '').replace('\n', '').replace('\r', '').replace('\t', '').replace('）', '')
        text_corpus = re.split('[！。？；……;]', text)
        return text_corpus

    # Sentiment prediction
    def predict_score(self, text_corpus):
        # Segment each sentence into words
        docs = [self.__cut_word(sentence) for sentence in text_corpus]
        new_tfidf = self.tfidftransformer.transform(self.vectorizer.transform(docs))
        predicted = self.clf.predict_proba(new_tfidf)
        # Round to three decimal places
        result = np.around(predicted, decimals=3)
        return result

    # Segment with jieba and drop the stopwords
    def __cut_word(self, sentence):
        words = [i for i in self.tokenizer.cut(sentence) if i not in self.stop_words]
        result = ' '.join(words)
        return result

    def analyze(self, text):
        text_corpus = self.replace_text(text)
        result = self.predict_score(text_corpus)

        # Only the probabilities of the first sentence are reported
        neg = result[0][0]
        pos = result[0][1]

        print('Negative: {} Positive: {}'.format(neg, pos))


To use it, just instantiate the analyzer and call its analyze() method.

# -*- coding: utf-8 -*-
from native_bayes_sentiment_analyzer import SentimentAnalyzer


model_path = './data/bayes.pkl'
userdict_path = './data/userdict.txt'
stopword_path = './data/stopwords.txt'
corpus_path = './data/review.csv'


analyzer = SentimentAnalyzer(model_path=model_path, stopword_path=stopword_path, userdict_path=userdict_path)
text = '倍感失望的一部諾蘭的電影，感受更像是盜夢幫的一場大雜燴。雖然看以前就知道確定是一部沒法超越前傳2的蝙蝠狹，但真心沒想到能差到這個地步。節奏的把控的失誤和角色的定位模糊絕對是整部影片的硬傷。'
analyzer.analyze(text=text)

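Note that analyze() splits the text into sentences but prints only result[0], the probabilities for the first sentence. One possible way to summarize a multi-sentence review is to average the per-sentence probabilities. This is a sketch of an option, not part of the original code; the function name `summarize` and the probabilities below are made up, and the column order [negative, positive] follows the assumption made in analyze().

```python
def summarize(probabilities):
    """Average per-sentence [negative, positive] probabilities."""
    rows = list(probabilities)
    if not rows:
        raise ValueError("no sentence probabilities to summarize")
    neg = sum(row[0] for row in rows) / len(rows)
    pos = sum(row[1] for row in rows) / len(rows)
    return round(neg, 3), round(pos, 3)


# Made-up predict_proba-style output for a three-sentence review
per_sentence = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]
neg, pos = summarize(per_sentence)
print('Negative: {} Positive: {}'.format(neg, pos))
```

A plain mean treats every sentence equally. Weighting by sentence length would be another reasonable choice.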

github.com/3inchtime/d…

All of the code above has been pushed to my GitHub. Suggestions are welcome.