We use the naive Bayes algorithms from the sklearn package, which provides three models: Gaussian, multinomial, and Bernoulli. For details, see Naive Bayes — scikit-learn 0.18.1 documentation.
This article uses the multinomial naive Bayes model to classify English emails as spam or ham.
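For reference, all three variants live in sklearn.naive_bayes; only MultinomialNB is used in this article, the other two are listed here purely for orientation.
from sklearn.naive_bayes import GaussianNB     # continuous, roughly Gaussian features
from sklearn.naive_bayes import MultinomialNB  # word counts / frequencies (used below)
from sklearn.naive_bayes import BernoulliNB    # binary present/absent features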
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm_notebook
from wordcloud import WordCloud
from sklearn.metrics import roc_curve, auc
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize, RegexpTokenizer
%matplotlib inline
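If the NLTK corpora used below (the English stop word list and WordNet) have not been downloaded yet, they can be fetched once; a minimal sketch, assuming a standard nltk installation:
nltk.download('stopwords')  # English stop word list used below
nltk.download('wordnet')    # lexical database required by WordNetLemmatizer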
The data come from the Spam Mails Dataset on Kaggle; normal emails are labelled ham/0 and spam emails are labelled spam/1.
data = pd.read_csv('spam_ham_dataset.csv')
data = data.iloc[:, 1:]
data.head()
 | label | text | label_num
---|---|---|---
0 | ham | Subject: enron methanol ; meter # : 988291\r\n... | 0 |
1 | ham | Subject: hpl nom for january 9 , 2001\r\n( see... | 0 |
2 | ham | Subject: neon retreat\r\nho ho ho , we ' re ar... | 0 |
3 | spam | Subject: photoshop , windows , office . cheap ... | 1 |
4 | ham | Subject: re : indian springs\r\nthis deal is t... | 0 |
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5171 entries, 0 to 5170
Data columns (total 3 columns):
label 5171 non-null object
text 5171 non-null object
label_num 5171 non-null int64
dtypes: int64(1), object(2)
memory usage: 121.3+ KB
print('This dataset contains {} emails'.format(data.shape[0]))
This dataset contains 5171 emails
print('There are {} ham emails'.format(data['label_num'].value_counts()[0]))
print('There are {} spam emails'.format(data['label_num'].value_counts()[1]))
plt.style.use('seaborn')
plt.figure(figsize=(6, 4), dpi=100)
data['label'].value_counts().plot(kind='bar')
There are 3672 ham emails
There are 1499 spam emails
Create a new DataFrame; all subsequent processing is done on it.
# we only need the text and label_num columns
new_data = data.iloc[:, 1:]
length = len(new_data)
print('Number of emails: length =', length)
new_data.head()
Number of emails: length = 5171
 | text | label_num
---|---|---
0 | Subject: enron methanol ; meter # : 988291\r\n... | 0 |
1 | Subject: hpl nom for january 9 , 2001\r\n( see... | 0 |
2 | Subject: neon retreat\r\nho ho ho , we ' re ar... | 0 |
3 | Subject: photoshop , windows , office . cheap ... | 1 |
4 | Subject: re : indian springs\r\nthis deal is t... | 0 |
Let's look at the full contents of a few emails.
for i in range(3):
    print(i, '\n', data['text'][i])
0
Subject: enron methanol ; meter # : 988291
this is a follow up to the note i gave you on monday , 4 / 3 / 00 { preliminary
flow data provided by daren } .
please override pop ' s daily volume { presently zero } to reflect daily activity you can obtain from gas control . this change is needed asap for economics purposes .
1 
 Subject: hpl nom for january 9 , 2001
( see attached file : hplnol 09 . xls ) - hplnol 09 . xls
2 
 Subject: neon retreat
ho ho ho , we ' re around to that most wonderful time of the year - - - neon leaders retreat time !
i know that this time of year is extremely hectic , and that it ' s tough to think about anything past the holidays , but life does go on past the week of december 25 through january 1 , and that ' s what i ' d like you to think about for a minute . on the calender that i handed out at the beginning of the fall semester , the retreat was scheduled for the weekend of january 5 - 6 . but because of a youth ministers conference that brad and dustin are connected with that week , we ' re going to change the date to the following weekend , january 12 - 13 . now comes the part you need to think about .
i think we all agree that it ' s important for us to get together and have some time to recharge our batteries before we get to far into the spring semester , but it can be a lot of trouble and difficult for us to get away without kids , etc . so , brad came up with a potential alternative for how we can get together on that weekend , and then you can let me know which you prefer . the first option would be to have a retreat similar to what we ' ve done the past several years . this year we could go to the heartland country inn ( www . . com ) outside of brenham . it ' s a nice place , where we ' d have a 13 - bedroom and a 5 - bedroom house side by side . it ' s in the country , real relaxing , but also close to brenham and only about one hour and 15 minutes from here . we can golf , shop in the antique and craft stores in brenham , eat dinner together at the ranch , and spend time with each other . we ' d meet on saturday , and then return on sunday morning , just like what we ' ve done in the past . the second option would be to stay here in houston , have dinner together at a nice restaurant , and then have dessert and a time for visiting and recharging at one of our homes on that saturday evening . this might be easier , but the trade off would be that we wouldn ' t have as much time together . i ' ll let you decide . email me back with what would be your preference , and of course if you ' re available on that weekend . the democratic process will prevail - - majority vote will rule ! let me hear from you as soon as possible , preferably by the end of the weekend . and if the vote doesn ' t go your way , no complaining allowed ( like i tend to do ! ) have a great weekend , great golf , great fishing , great shopping , or whatever makes you happy ! bobby
The emails mix upper and lower case, so we first convert all of the text to lower case.
new_data['text'] = new_data['text'].str.lower()
new_data.head()
 | text | label_num
---|---|---
0 | subject: enron methanol ; meter # : 988291\r\n... | 0 |
1 | subject: hpl nom for january 9 , 2001\r\n( see... | 0 |
2 | subject: neon retreat\r\nho ho ho , we ' re ar... | 0 |
3 | subject: photoshop , windows , office . cheap ... | 1 |
4 | subject: re : indian springs\r\nthis deal is t... | 0 |
Next we apply stop words: words such as you, me, and be that appear in the emails carry no signal for classification, so they can be filtered out. Note also that every email begins with the word subject (the subject line), so we add it to the stop word list as well. Here we use the stopwords corpus from the natural language toolkit nltk.
stop_words = set(stopwords.words('english'))
stop_words.add('subject')
To extract every word from a long sentence while filtering out punctuation and other symbols, we use nltk's RegexpTokenizer(), which takes a regular expression as its argument. For example:
string = 'I have a pen,I have an apple. (Uhh~)Apple-pen!' # lyrics from "PPAP"
RegexpTokenizer('[a-zA-Z]+').tokenize(string) # filters out all the symbols and returns a list of words
['I', 'have', 'a', 'pen', 'I', 'have', 'an', 'apple', 'Uhh', 'Apple', 'pen']
In English, a word can appear in different forms, for example love and loves, which differ only in inflection but mean the same thing; this is what lemmatization and stemming address. This article uses lemmatization. For details, see the comparison of lemmatization tools on ZMonster's Blog.
Here we use nltk's WordNetLemmatizer(). For example:
word = 'loves'
print('The lemma of {} is {}'.format(word, WordNetLemmatizer().lemmatize(word)))
The lemma of loves is love
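As an aside, since the paragraph above mentions both lemmatization and stemming, here is a small illustrative sketch (not part of the original pipeline) contrasting nltk's PorterStemmer with the WordNetLemmatizer:
from nltk.stem import PorterStemmer
word = 'studies'
print(PorterStemmer().stem(word))            # 'studi'  -- stemming simply strips the suffix
print(WordNetLemmatizer().lemmatize(word))   # 'study'  -- lemmatization maps to a real dictionary word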
把上面的全部操做一塊兒實現,使用pandas的applyapp
def text_process(text):
tokenizer = RegexpTokenizer('[a-z]+') # 只匹配單詞,因爲已經全爲小寫,故能夠只寫成[a-z]+
lemmatizer = WordNetLemmatizer()
token = tokenizer.tokenize(text) # 分詞
token = [lemmatizer.lemmatize(w) for w in token if lemmatizer.lemmatize(w) not in stop_words] # 停用詞+詞形還原
return token
複製代碼
new_data['text'] = new_data['text'].apply(text_process)
We now have a reasonably clean dataset.
new_data.head()
 | text | label_num
---|---|---
0 | [enron, methanol, meter, follow, note, gave, m... | 0 |
1 | [hpl, nom, january, see, attached, file, hplno... | 0 |
2 | [neon, retreat, ho, ho, ho, around, wonderful,... | 0 |
3 | [photoshop, window, office, cheap, main, trend... | 1 |
4 | [indian, spring, deal, book, teco, pvr, revenu... | 0 |
Split the processed dataset into a training set and a test set at a ratio of 3:1.
seed = 20190524 # fix the random seed so the experiment is reproducible
X = new_data['text']
y = new_data['label_num']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed) # 75% training set, 25% test set
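Note that train_test_split does not stratify by default, so the ham/spam ratio can differ slightly between the two splits; if one wanted to preserve the ratio, the stratify parameter could be passed (not done in the original code):
# alternative split that keeps the class ratio identical in both sets (not used here)
# X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed, stratify=y)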
train = pd.concat([X_train, y_train], axis=1) # training set
test = pd.concat([X_test, y_test], axis=1) # test set
train.reset_index(drop=True, inplace=True) # reset the index
test.reset_index(drop=True, inplace=True) # same as above
print('The training set contains {} emails and the test set contains {} emails'.format(train.shape[0], test.shape[0]))
The training set contains 3878 emails and the test set contains 1293 emails
The numbers of spam and ham emails in the training set:
print(train['label_num'].value_counts())
plt.figure(figsize=(6, 4), dpi=100)
train['label_num'].value_counts().plot(kind='bar')
0 2769
1 1109
Name: label_num, dtype: int64
The numbers of spam and ham emails in the test set:
print(test['label_num'].value_counts())
plt.figure(figsize=(6, 4), dpi=100)
test['label_num'].value_counts().plot(kind='bar')
0 903
1 390
Name: label_num, dtype: int64
If we used every word in the training set, the vocabulary would be quite large and the model correspondingly slow to train, so here we randomly sample 10 ham emails and 10 spam emails and use only the words they contain as the vocabulary.
ham_train = train[train['label_num'] == 0] # ham emails
spam_train = train[train['label_num'] == 1] # spam emails
ham_train_part = ham_train['text'].sample(10, random_state=seed) # 10 randomly sampled ham emails
spam_train_part = spam_train['text'].sample(10, random_state=seed) # 10 randomly sampled spam emails
part_words = [] # words from the sampled emails
for text in pd.concat([ham_train_part, spam_train_part]):
    part_words += text
part_words_set = set(part_words)
print('The vocabulary contains {} words'.format(len(part_words_set)))
The vocabulary contains 1528 words
This greatly reduces the vocabulary size.
Next we count how many times each word occurs, using sklearn's CountVectorizer(). For example:
words = ['This is the first sentence', 'And this is the second sentence']
cv = CountVectorizer() # the default lowercase=True converts letters to lower case, but our data is already lower case
count = cv.fit_transform(words)
print('cv.vocabulary_:\n', cv.vocabulary_) # a dict mapping each word to its column index
print('cv.get_feature_names:\n', cv.get_feature_names()) # a list of the vocabulary words
print('count.toarray:\n', count.toarray()) # the count matrix as a dense array
cv.vocabulary_:
{'this': 6, 'is': 2, 'the': 5, 'first': 1, 'sentence': 4, 'and': 0, 'second': 3}
cv.get_feature_names:
['and', 'first', 'is', 'second', 'sentence', 'the', 'this']
count.toarray:
[[0 1 1 0 1 1 1]
[1 0 1 1 1 1 1]]
The row [0 1 1 0 1 1 1] lines up with ['and', 'first', 'is', 'second', 'sentence', 'the', 'this'], i.e. for the first sentence 'and' appears 0 times, 'first' appears once, 'is' appears once, and so on.
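To make the correspondence explicit, the vocabulary and a count row can be zipped together; a small sketch (note that in scikit-learn 1.0 and later, get_feature_names() has been replaced by get_feature_names_out()):
# pair each vocabulary word with its count in the first sentence
for word, n in zip(cv.get_feature_names(), count.toarray()[0]):
    print(word, n)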
Next we compute TF-IDF, which reflects how important a word is within a document, using sklearn's TfidfTransformer(). For example:
tfidf = TfidfTransformer()
tfidf_matrix = tfidf.fit_transform(count)
print('idf:\n', tfidf.idf_) # the idf value of each word
print('tfidf:\n', tfidf_matrix.toarray()) # the tf-idf matrix
idf:
[1.40546511 1.40546511 1. 1.40546511 1. 1.
1. ]
tfidf:
[[0. 0.57496187 0.4090901 0. 0.4090901 0.4090901
0.4090901 ]
[0.49844628 0. 0.35464863 0.49844628 0.35464863 0.35464863
0.35464863]]
You can see that [0 1 1 0 1 1 1] has become [0. 0.57496187 0.4090901 0. 0.4090901 0.4090901 0.4090901].
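These idf values can be reproduced by hand. With TfidfTransformer's defaults (smooth_idf=True), idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing t; each tf-idf row is then L2-normalised. A quick check:
n_docs = 2
idf_rare = np.log((1 + n_docs) / (1 + 1)) + 1    # a word that appears in only 1 of the 2 sentences, e.g. 'first'
idf_common = np.log((1 + n_docs) / (1 + 2)) + 1  # a word that appears in both sentences, e.g. 'is'
print(idf_rare, idf_common)  # ≈ 1.4055 and 1.0, matching tfidf.idf_ above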
Now the real computation begins, but first the word lists need to be joined back into sentences, the format CountVectorizer understands.
# join the words of the sampled ham and spam emails back into sentences; in CountVectorizer()'s input, words are separated by spaces
train_part_texts = [' '.join(text) for text in np.concatenate((spam_train_part.values, ham_train_part.values))]
# join all training-set emails back into sentences
train_all_texts = [' '.join(text) for text in train['text']]
# join all test-set emails back into sentences
test_all_texts = [' '.join(text) for text in test['text']]
cv = CountVectorizer()
part_fit = cv.fit(train_part_texts) # build the vocabulary from the sampled emails only
train_all_count = cv.transform(train_all_texts) # word counts for every email in the training set
test_all_count = cv.transform(test_all_texts) # word counts for every email in the test set
tfidf = TfidfTransformer()
train_tfidf_matrix = tfidf.fit_transform(train_all_count)
test_tfidf_matrix = tfidf.fit_transform(test_all_count)
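One caveat worth flagging: the cell above calls fit_transform on the test counts as well, so the test set's idf weights are computed from the test set itself. The more orthodox approach is to fit the transformer on the training counts only and merely transform the test counts; the original code is kept as-is so that its reported numbers still hold. A sketch of the alternative:
# alternative (not what the cell above does): reuse the training-set idf weights for the test set
# tfidf = TfidfTransformer().fit(train_all_count)
# train_tfidf_matrix = tfidf.transform(train_all_count)
# test_tfidf_matrix = tfidf.transform(test_all_count)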
print('Training set:', train_tfidf_matrix.shape)
print('Test set:', test_tfidf_matrix.shape)
Training set: (3878, 1513)
Test set: (1293, 1513)
mnb = MultinomialNB()
mnb.fit(train_tfidf_matrix, y_train)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
The model's accuracy on the test set:
mnb.score(test_tfidf_matrix, y_test)
0.9265274555297757
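Accuracy alone can hide how the two classes behave individually; a quick sketch (not in the original) of the confusion matrix and per-class precision/recall:
from sklearn.metrics import classification_report, confusion_matrix
y_pred_label = mnb.predict(test_tfidf_matrix)
print(confusion_matrix(y_test, y_pred_label))
print(classification_report(y_test, y_pred_label, target_names=['ham', 'spam']))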
y_pred = mnb.predict_proba(test_tfidf_matrix)
fpr, tpr, thresholds = roc_curve(y_test, y_pred[:, 1])
roc_auc = auc(fpr, tpr)  # renamed so the variable does not shadow the auc() function
# ROC curve
plt.figure(figsize=(6, 4), dpi=100)
plt.plot(fpr, tpr)
plt.title('auc = {:.4f}'.format(roc_auc))
plt.xlabel('fpr')
plt.ylabel('tpr')
Text(0, 0.5, 'tpr')
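For reference, a diagonal chance line is often added to such a plot (not done in the original); inside the plotting cell above one could add:
plt.plot([0, 1], [0, 1], linestyle='--', color='grey')  # a random classifier would fall on this diagonal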
For the ipynb file, head over to: github