Practical Data Mining in Python

Date: 2019-11-30

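These are the notes for a talk titled "Data Science in Python" that I gave at the University of Economics in Prague in December 2014. Questions and comments are welcome via @RadimRehurek.

The goal of this talk is to demonstrate some high-level concepts of machine learning. The notes use concrete code for the demonstrations, which you can run on your own computer (this requires IPython; see the installation steps below).

The audience is expected to know some basic programming (not necessarily Python) and to have a little background in data mining. This is not an "advanced talk" for machine learning experts.

The code examples build a working, executable prototype: an app that classifies English mobile phone text messages (English in "SMS style") as "spam" or "ham" (not spam).

All of the code is in Python. Python is a general-purpose language that works well at every stage of the pipeline: I/O, data cleaning, reshaping and preprocessing, model training and evaluation. Python is not the only choice, but it is flexible, easy to develop in, and performs well, thanks to its mature scientific computing ecosystem. Python's large open-source ecosystem also avoids lock-in to any single framework or library (and the information loss that comes with it).

IPython notebook is a Python tool that provides an interactive environment rendered as HTML, where results are visible immediately. We will also go over other practical tools widely used in data science.

Want to run the examples below interactively (optional)?

1. Install the free Anaconda Python distribution, which already includes Python itself.

2. Install the TextBlob "natural language processing" library, following its installation instructions.

3. Download the source of this notebook (http://radimrehurek.com/data_science_python/data_science_python.ipynb) and run: $ ipython notebook data_science_python.ipynb

4. Watch the IPython tutorial video for the basics of using IPython notebook.

5. Run the first code cell below; if it executes without errors, you are good to go.

End-to-end example: automatic spam filtering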

In [1]:

%matplotlib inline
import matplotlib.pyplot as plt
import csv
from textblob import TextBlob
import pandas
import sklearn
import cPickle  # Python 2 only; Python 3 provides the pickle module instead
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
# The three modules below belong to the scikit-learn release used in 2014;
# newer releases provide these names in sklearn.model_selection.
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split
from sklearn.learning_curve import learning_curve

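Step 1: Load the data and take a look

Let's skip the real first step (fleshing out the specification and understanding what we are trying to do, which matters a great deal in practice) and go straight to https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection to download the zip file needed for this demo, then unzip it into the data subdirectory. You should see a file of about 0.5 MB named SMSSpamCollection: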

$ ls -l data

total 1352

-rw-r--r--@ 1 kofola staff 477907 Mar 15 2011 SMSSpamCollection

-rw-r--r--@ 1 kofola staff 5868 Apr 18 2011 readme

-rw-r-----@ 1 kofola staff 203415 Dec 1 15:30 smsspamcollection.zip

This file contains more than 5,000 SMS messages (see the readme file for more information):

In [2]:

messages = [line.rstrip() for line in open('./data/SMSSpamCollection')]
print len(messages)

5574

A collection of texts is sometimes called a "corpus". Let's print the first ten messages of this SMS corpus:

In [3]:

for message_no, message in enumerate(messages[:10]):
    print message_no, message

0 ham Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...

1 ham Ok lar... Joking wif u oni...

2 spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's

3 ham U dun say so early hor... U c already then say...

4 ham Nah I don't think he goes to usf, he lives around here though

5 spam FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv

6 ham Even my brother is not like to speak with me. They treat me like aids patent.

7 ham As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune

8 spam WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.

9 spam Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030

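We see a TSV file (tab-separated values): the first column is a label marking each message as normal (ham) or spam, and the second column is the message itself.

This corpus will serve as our labeled training set. Using these ham/spam examples, we will train a machine learning model to tell ham from spam automatically. We can then use the trained model to label arbitrary unlabeled messages as ham or spam.

We can use Python's pandas library to handle the TSV file (or CSV or Excel files) for us:

In [4]:

messages = pandas.read_csv('./data/SMSSpamCollection', sep='\t', quoting=csv.QUOTE_NONE,
                           names=["label", "message"])
print messages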

label message

0 ham Go until jurong point, crazy.. Available only ...

1 ham Ok lar... Joking wif u oni...

2 spam Free entry in 2 a wkly comp to win FA Cup fina...

3 ham U dun say so early hor... U c already then say...

4 ham Nah I don't think he goes to usf, he lives aro...

5 spam FreeMsg Hey there darling it's been 3 week's n...

6 ham Even my brother is not like to speak with me. ...

7 ham As per your request 'Melle Melle (Oru Minnamin...

8 spam WINNER!! As a valued network customer you have...

9 spam Had your mobile 11 months or more? U R entitle...

10 ham I'm gonna be home soon and i don't want to tal...

11 spam SIX chances to win CASH! From 100 to 20,000 po...

12 spam URGENT! You have won a 1 week FREE membership ...

13 ham I've been searching for the right words to tha...

14 ham I HAVE A DATE ON SUNDAY WITH WILL!!

15 spam XXXMobileMovieClub: To use your credit, click ...

16 ham Oh k...i'm watching here:)

17 ham Eh u remember how 2 spell his name... Yes i di...

18 ham Fine if thats the way u feel. Thats the way ...

19 spam England v Macedonia - dont miss the goals/team...

20 ham Is that seriously how you spell his name?

21 ham I‘m going to try for 2 months ha ha only joking

22 ham So ü pay first lar... Then when is da stock co...

23 ham Aft i finish my lunch then i go str down lor. ...

24 ham Ffffffffff. Alright no way I can meet up with ...

25 ham Just forced myself to eat a slice. I'm really ...

26 ham Lol your always so convincing.

27 ham Did you catch the bus ? Are you frying an egg ...

28 ham I'm back &amp; we're packing the car now, I'll...

29 ham Ahhh. Work. I vaguely remember that! What does...

... ... ...

5544 ham Armand says get your ass over to epsilon

5545 ham U still havent got urself a jacket ah?

5546 ham I'm taking derek &amp; taylor to walmart, if I...

5547 ham Hi its in durban are you still on this number

5548 ham Ic. There are a lotta childporn cars then.

5549 spam Had your contract mobile 11 Mnths? Latest Moto...

5550 ham No, I was trying it all weekend ;V

5551 ham You know, wot people wear. T shirts, jumpers, ...

5552 ham Cool, what time you think you can get here?

5553 ham Wen did you get so spiritual and deep. That's ...

5554 ham Have a safe trip to Nigeria. Wish you happines...

5555 ham Hahaha..use your brain dear

5556 ham Well keep in mind I've only got enough gas for...

5557 ham Yeh. Indians was nice. Tho it did kane me off ...

5558 ham Yes i have. So that's why u texted. Pshew...mi...

5559 ham No. I meant the calculation is the same. That ...

5560 ham Sorry, I'll call later

5561 ham if you aren't here in the next &lt;#&gt; hou...

5562 ham Anything lor. Juz both of us lor.

5563 ham Get me out of this dump heap. My mom decided t...

5564 ham Ok lor... Sony ericsson salesman... I ask shuh...

5565 ham Ard 6 like dat lor.

5566 ham Why don't you wait 'til at least wednesday to ...

5567 ham Huh y lei...

5568 spam REMINDER FROM O2: To get 2.50 pounds free call...

5569 spam This is the 2nd time we have tried 2 contact u...

5570 ham Will ü b going to esplanade fr home?

5571 ham Pity, * was in mood for that. So...any other s...

5572 ham The guy did some bitching but I acted like i'd...

5573 ham Rofl. Its true to its name

[5574 rows x 2 columns]
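For readers without pandas, the same tab-separated layout can be parsed with the standard library alone. This is a minimal sketch written for Python 3 (the notebook itself targets Python 2); `parse_sms_tsv` is a hypothetical helper name that does not appear in the original notebook, and the two sample rows are shortened copies of messages from the listing above.

```python
import csv
import io

def parse_sms_tsv(handle):
    """Yield (label, message) pairs from a tab-separated file-like object."""
    # QUOTE_NONE mirrors the pandas call above: quote characters are literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) == 2:  # skip blank or malformed lines
            yield row[0], row[1]

# In-memory stand-in for the data file; a real run would open the file instead.
sample = io.StringIO(
    "ham\tOk lar... Joking wif u oni...\n"
    "spam\tWINNER!! As a valued network customer you have been selected\n"
)
rows = list(parse_sms_tsv(sample))
print(rows[0])
```

Disabling quote handling matters for SMS text, where stray quote characters are common and should not be interpreted as field delimiters.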

We can also use pandas to view summary statistics easily:

In [5]:

messages.groupby('label').describe()

Out[5]:

label             message
ham    count      4827
       unique     4518
       top        Sorry, I’ll call later
       freq       30
spam   count      747
       unique     653
       top        Please call our customer service representativ…
       freq       4

How long are the messages?

In [6]:

messages['length'] = messages['message'].map(lambda text: len(text))
print messages.head()

label message length

0 ham Go until jurong point, crazy.. Available only ... 111

1 ham Ok lar... Joking wif u oni... 29

2 spam Free entry in 2 a wkly comp to win FA Cup fina... 155

3 ham U dun say so early hor... U c already then say... 49

4 ham Nah I don't think he goes to usf, he lives aro... 61

In [7]:

messages.length.plot(bins=20, kind='hist')

Out[7]:

<matplotlib.axes._subplots.AxesSubplot at 0x10dd7a990>

In [8]:

messages.length.describe()

Out[8]:

count 5574.000000

mean 80.604593

std 59.919970

min 2.000000

25% 36.000000

50% 62.000000

75% 122.000000

max 910.000000

Name: length, dtype: float64

Which message is the extra-long one?

In [9]:

print list(messages.message[messages.length > 900])

["For me the love should start with attraction.i should feel that I need her every time

around me.she should be the first thing which comes in my thoughts.I would start the day and

end it with her.she should be there every time I dream.love will be then when my every

breath has her name.my life should happen around her.my life will be named to her.I would

cry for her.will give all my happiness and take all her sorrows.I will be ready to fight

with anyone for her.I will be in love when I will be doing the craziest things for her.love

will be when I don't have to proove anyone that my girl is the most beautiful lady on the

whole planet.I will always be singing praises for her.love will be when I start up making

chicken curry and end up makiing sambar.life will be the most beautiful then.will get every

morning and thank god for the day because she is with me.I would like to say a lot..will

tell later.."]

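Is there any difference in message length between spam and ham?

In [10]:

messages.hist(column='length', by='label', bins=50)

Out[10]:

array([<matplotlib.axes._subplots.AxesSubplot object at 0x11270da50>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x1126c7750>], dtype=object)

Good, but how do we get a computer to understand the plain text messages by itself? Can it make sense of this gibberish at all?

Step 2: Data preprocessing

In this section we convert the raw messages (sequences of characters) into vectors (sequences of numbers).

The mapping is not one-to-one; we will use the bag-of-words approach, in which each unique word in a text is represented by one number.

As a first step, let's write a function that splits a message into its individual words:

In [11]:

def split_into_tokens(message):
    message = unicode(message, 'utf8')  # convert bytes into proper unicode
    return TextBlob(message).words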

Here are some of the original texts again:

In [12]:

messages.message.head()

Out[12]:

0 Go until jurong point, crazy.. Available only ...

1 Ok lar... Joking wif u oni...

2 Free entry in 2 a wkly comp to win FA Cup fina...

3 U dun say so early hor... U c already then say...

4 Nah I don't think he goes to usf, he lives aro...

Name: message, dtype: object

And here are the same messages after tokenization:

In [13]:

messages.message.head().apply(split_into_tokens)

Out[13]:

0 [Go, until, jurong, point, crazy, Available, o...

1 [Ok, lar, Joking, wif, u, oni]

2 [Free, entry, in, 2, a, wkly, comp, to, win, F...

3 [U, dun, say, so, early, hor, U, c, already, t...

4 [Nah, I, do, n't, think, he, goes, to, usf, he...

Name: message, dtype: object

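Questions that natural language processing (NLP) raises here:

Do capital letters carry information?
Do different inflected forms of a word ("goes" vs. "go") carry information?
Do interjections and determiners carry information?

In other words, we want to normalize the text better.

We can use textblob to get part-of-speech (POS) tags:

In [14]:

TextBlob("Hello world, how is it going?").tags  # list of (word, POS) pairs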

Out[14]:

[(u'Hello', u'UH'),

(u'world', u'NN'),

(u'how', u'WRB'),

(u'is', u'VBZ'),

(u'it', u'PRP'),

(u'going', u'VBG')]

and to normalize words into their base form (lemmas):

In [15]:

def split_into_lemmas(message):
    message = unicode(message, 'utf8').lower()
    words = TextBlob(message).words
    # for each word, take its "base form" = lemma
    return [word.lemma for word in words]

messages.message.head().apply(split_into_lemmas)

Out[15]:

0 [go, until, jurong, point, crazy, available, o...

1 [ok, lar, joking, wif, u, oni]

2 [free, entry, in, 2, a, wkly, comp, to, win, f...

3 [u, dun, say, so, early, hor, u, c, already, t...

4 [nah, i, do, n't, think, he, go, to, usf, he, ...

Name: message, dtype: object

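Much better. You can probably think of more ways to improve the preprocessing: decoding HTML entities (the &amp; and &lt; we saw above); filtering out stop words (pronouns etc.); adding more features, such as a word-in-all-caps indicator, and so on.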
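As a rough illustration of those three ideas, here is a minimal Python 3 sketch using only the standard library. The function name `preprocess`, the tiny `STOP_WORDS` set and the regex tokenizer are all assumptions made for this example; the notebook itself uses TextBlob and does not implement these steps.

```python
import html
import re

# Tiny illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"i", "we", "you", "he", "she", "it", "they", "the", "a", "an"}

def preprocess(message):
    """Return (tokens, has_all_caps_word) for one raw message."""
    text = html.unescape(message)                # "&amp;" -> "&", "&lt;" -> "<"
    words = re.findall(r"[A-Za-z0-9']+", text)   # crude tokenizer, stand-in for TextBlob
    # Flag messages containing a "shouted" word of at least two characters.
    has_caps = any(w.isupper() and len(w) > 1 for w in words)
    tokens = [w.lower() for w in words if w.lower() not in STOP_WORDS]
    return tokens, has_caps

tokens, has_caps = preprocess("I'm back &amp; we're packing the car now")
shout_tokens, shout_caps = preprocess("URGENT! You have won")
print(tokens, has_caps)
print(shout_tokens, shout_caps)
```

Whether any of these steps actually helps is an empirical question; each one is a candidate to evaluate, not a guaranteed improvement.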

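Step 3: Converting data to vectors

Now we convert each message (a list of lemmas) into a vector that machine learning models can understand.

Doing this with the bag-of-words model takes three steps:

1. count how many times each word occurs in each message (term frequency);

2. weight the counts so that frequent words get a lower weight (inverse document frequency);

3. normalize the vectors to unit length, to abstract away from the original text length (L2 norm).

Each vector has as many dimensions as there are unique words in the SMS corpus.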
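The three steps above can be sketched in plain Python 3. This is an illustration under stated assumptions, not the scikit-learn implementation: `tfidf_vectors` is a hypothetical helper, the toy documents are invented, and the IDF uses one common add-one smoothed variant, ln((1 + n) / (1 + df)) + 1.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Return (vocabulary, list of L2-normalized TF-IDF vectors as dicts)."""
    n_docs = len(tokenized_docs)
    # Step 1: term frequency, i.e. raw counts per message.
    counts = [Counter(doc) for doc in tokenized_docs]
    # Document frequency: in how many messages does each word occur?
    df = Counter(word for c in counts for word in c)
    # Step 2: smoothed inverse document frequency (assumed variant).
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in df}
    vectors = []
    for c in counts:
        weighted = {w: tf * idf[w] for w, tf in c.items()}
        # Step 3: scale to unit Euclidean (L2) length; guard against empty messages.
        norm = math.sqrt(sum(v * v for v in weighted.values())) or 1.0
        vectors.append({w: v / norm for w, v in weighted.items()})
    return sorted(df), vectors

docs = [["free", "entry", "win"], ["ok", "lar"], ["win", "free", "free", "prize"]]
vocab, vecs = tfidf_vectors(docs)
print(len(vocab), vecs[1])
```

Storing each vector as a dict of non-zero entries mirrors the sparse representation that makes a vocabulary of thousands of words practical.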

In [16]:

bow_transformer = CountVectorizer(analyzer=split_into_lemmas).fit(messages['message'])
print len(bow_transformer.vocabulary_)

8874

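Here we use scikit-learn (sklearn), a powerful Python library for machine learning. It contains a large number of methods and options.

Let's take one message and get its bag-of-words counts as a vector, using our new bow_transformer:

In [17]:

message4 = messages['message'][3]
print message4

U dun say so early hor... U c already then say...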

In [18]:

bow4 = bow_transformer.transform([message4])

print bow4

print bow4.shape

(0, 1158) 1

(0, 1899) 1

(0, 2897) 1

(0, 2927) 1

(0, 4021) 1

(0, 6736) 2

(0, 7111) 1

(0, 7698) 1

(0, 8013) 2

(1, 8874)

So message 4 contains nine unique words: two of them appear twice, the rest only once. Sanity check: which words appear twice?

In [19]:

print bow_transformer.get_feature_names()[6736]
print bow_transformer.get_feature_names()[8013]

say

u
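The same sanity check can be done by hand. The Python 3 sketch below counts the words of message 4 with a simple regular-expression tokenizer, which is an assumption standing in for the TextBlob-based split_into_lemmas (for this particular message both should give the same words).

```python
import re
from collections import Counter

message4 = "U dun say so early hor... U c already then say..."

# Crude lowercase word tokenizer; a stand-in for TextBlob, not a replacement.
tokens = re.findall(r"[a-z']+", message4.lower())
counts = Counter(tokens)
repeated = sorted(word for word, n in counts.items() if n == 2)

print(len(counts))
print(repeated)
```

Nine distinct words, two of them repeated, agrees with the nine entries of the sparse vector printed above.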

The bag-of-words counts for the entire SMS corpus form a large, sparse matrix:

In [20]:

messages_bow = bow_transformer.transform(messages['message'])

print 'sparse matrix shape:', messages_bow.shape

print 'number of non-zeros:', messages_bow.nnz

print 'sparsity: %.2f%%' % (100.0 * messages_bow.nnz / (messages_bow.shape[0] * messages_bow.shape[1]))

sparse matrix shape: (5574, 8874)

number of non-zeros: 80272

sparsity: 0.16%

Finally, after the counting, term weighting and normalization are done with TF-IDF, using scikit-learn's TfidfTransformer:

In [21]:

tfidf_transformer = TfidfTransformer().fit(messages_bow)

tfidf4 = tfidf_transformer.transform(bow4)

print tfidf4

(0, 8013) 0.305114653686

(0, 7698) 0.225299911221

(0, 7111) 0.191390347987

(0, 6736) 0.523371210191

(0, 4021) 0.456354991921

(0, 2927) 0.32967579251

(0, 2897) 0.303693312742

(0, 1899) 0.24664322833

(0, 1158) 0.274934159477

What is the IDF (inverse document frequency) of the word "u"? And of the word "university"?

In [22]:

print tfidf_transformer.idf_[bow_transformer.vocabulary_['u']]
print tfidf_transformer.idf_[bow_transformer.vocabulary_['university']]

2.85068150539
8.23975323521
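To see where such numbers can come from, the Python 3 sketch below evaluates the add-one smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, which to my knowledge is what TfidfTransformer uses by default. The document frequencies 875 and 3 are not read from the data: they are assumptions, chosen because they reproduce the printed values for a corpus of 5574 messages.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """IDF with add-one smoothing: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0

N_MESSAGES = 5574  # corpus size reported above

# Document frequencies inferred from the printed IDF values (an assumption).
idf_u = smoothed_idf(N_MESSAGES, 875)
idf_university = smoothed_idf(N_MESSAGES, 3)
print(idf_u, idf_university)
```

Under that assumption, "u" occurs in roughly one message out of six and gets a low weight, while "university" occurs in only a handful of messages and gets a high one.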

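To transform the entire bag-of-words corpus into a TF-IDF corpus at once:

In [23]:

messages_tfidf = tfidf_transformer.transform(messages_bow)
print messages_tfidf.shape

(5574, 8874)

There are many ways to preprocess and vectorize data. Together these two steps, also called "feature engineering", are typically the most time-consuming and least glamorous part of building a predictive pipeline, but they are very important and require experience. The trick is to evaluate constantly: analyze the model's errors, improve data cleaning and preprocessing, brainstorm new features, evaluate again, and so on.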

Step 4: Training a model, detecting spam

With the messages represented as vectors, we can train our spam/ham classifier. This part is straightforward, and many libraries implement the training algorithms.

We use scikit-learn here, starting with the Naive Bayes classifier:

In [24]:

%time spam_detector = MultinomialNB().fit(messages_tfidf, messages['label'])

CPU times: user 4.51 ms, sys: 987 µs, total: 5.49 ms
Wall time: 4.77 ms

Let's try classifying a single random message:

In [25]:

print 'predicted:', spam_detector.predict(tfidf4)[0]
print 'expected:', messages.label[3]

predicted: ham
expected: ham

Hooray! You can try it with your own texts, too.
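To give a feel for what a multinomial Naive Bayes classifier does internally, here is a small Python 3 sketch written from scratch. It is an illustration, not scikit-learn's code: `train_nb` and `predict_nb` are hypothetical names, the four training messages are invented, and the model works on raw word counts rather than TF-IDF weights.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model on tokenized documents."""
    vocab = {w for doc in docs for w in doc}
    word_counts = defaultdict(Counter)   # label -> word -> count
    label_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    model = {}
    for label in label_counts:
        total = sum(word_counts[label].values())
        log_prior = math.log(label_counts[label] / len(labels))
        # Add-alpha (Laplace) smoothing keeps unseen words from zeroing the product.
        log_like = {w: math.log((word_counts[label][w] + alpha) /
                                (total + alpha * len(vocab))) for w in vocab}
        model[label] = (log_prior, log_like)
    return model

def predict_nb(model, doc):
    """Return the label with the highest log posterior; unknown words are ignored."""
    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(log_like[w] for w in doc if w in log_like)
    return max(model, key=score)

train_docs = [["win", "free", "prize"], ["free", "entry", "win"],
              ["ok", "see", "you", "later"], ["call", "you", "later"]]
train_labels = ["spam", "spam", "ham", "ham"]
model = train_nb(train_docs, train_labels)
print(predict_nb(model, ["free", "prize", "now"]))
print(predict_nb(model, ["see", "you"]))
```

Training amounts to counting, which is why fitting the real model above takes only a few milliseconds.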

A natural question is: how many messages do we classify correctly overall?

In [26]:

all_predictions = spam_detector.predict(messages_tfidf)
print all_predictions

['ham' 'ham' 'spam' ..., 'ham' 'ham' 'ham']

In [27]:

print 'accuracy', accuracy_score(messages['label'], all_predictions)
print 'confusion matrix\n', confusion_matrix(messages['label'], all_predictions)
print '(row=expected, col=predicted)'

accuracy 0.969501255831

confusion matrix

[[4827 0]

[ 170 577]]

(row=expected, col=predicted)

In [28]:

plt.matshow(confusion_matrix(messages['label'], all_predictions), cmap=plt.cm.binary, interpolation='nearest')

plt.title('confusion matrix')

plt.colorbar()

plt.ylabel('expected label')

plt.xlabel('predicted label')

Out[28]:

<matplotlib.text.Text at 0x11643f6d0>

From this confusion matrix we can compute precision and recall, or their combination (the harmonic mean), F1:

In [29]:

print classification_report(messages['label'], all_predictions)

precision recall f1-score support

ham 0.97 1.00 0.98 4827

spam 1.00 0.77 0.87 747

avg / total 0.97 0.97 0.97 5574

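There are quite a few possible metrics for evaluating model performance; which one is most suitable depends on the task. For instance, the cost of mispredicting "spam" as "ham" is probably much lower than the cost of mispredicting "ham" as "spam".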
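The numbers in the report can be recomputed by hand from the confusion matrix. The Python 3 sketch below does so for the training-set matrix printed earlier; `binary_metrics` is a hypothetical helper name introduced for this example.

```python
def binary_metrics(confusion, positive_index):
    """Precision, recall and F1 for one class of a 2x2 confusion matrix.

    Rows are expected labels, columns are predicted labels.
    """
    tp = confusion[positive_index][positive_index]
    predicted = sum(row[positive_index] for row in confusion)   # column sum
    actual = sum(confusion[positive_index])                     # row sum
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

confusion = [[4827, 0], [170, 577]]      # training-set matrix printed above
accuracy = (confusion[0][0] + confusion[1][1]) / sum(map(sum, confusion))
ham = [round(x, 2) for x in binary_metrics(confusion, 0)]
spam = [round(x, 2) for x in binary_metrics(confusion, 1)]
print(round(accuracy, 4), ham, spam)
```

The spam recall of 0.77 says that 170 of the 747 spam messages slipped through as ham, while the spam precision of 1.00 says that no ham message was flagged as spam: the cheaper kind of error, by the argument above.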

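Step 5: How to run experiments?

In the "evaluation" above we committed a cardinal sin. For simplicity of demonstration, we evaluated accuracy on the same data we used for training. Never evaluate on your training data. It is wrong.

Such an evaluation tells us nothing about the true predictive power of our model. If we simply memorized every example during training, the accuracy on the training data would be very close to 100%, even though we could not classify any new message.

A proper way is to split the data into a training set and a test set. The model sees only the training data during fitting and parameter tuning; the test data is never used in any way. This ensures that the model is not "cheating", and that the final evaluation on the test data is representative of true predictive performance.

In [30]:

msg_train, msg_test, label_train, label_test = \
    train_test_split(messages['message'], messages['label'], test_size=0.2)

print len(msg_train), len(msg_test), len(msg_train) + len(msg_test)

4459 1115 5574

As requested, the test set is 20% of the entire dataset (1115 of the 5574 records), and the rest is training data (4459 of 5574).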
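Conceptually, such a split is a shuffle followed by a cut. The Python 3 sketch below shows the idea; `split_train_test` is a hypothetical helper, and rounding the test size up is an assumption chosen because it reproduces the sizes printed above.

```python
import math
import random

def split_train_test(items, test_size=0.2, seed=0):
    """Shuffle a copy of items and cut it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)       # seeded for reproducibility
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_train_test(range(5574), test_size=0.2)
print(len(train), len(test))
```

The important property is that the two parts are disjoint: no message used for scoring was ever seen during fitting.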

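Let's recap the entire pipeline so far, putting all the steps into a scikit-learn Pipeline:

In [31]:

def split_into_lemmas(message):
    message = unicode(message, 'utf8').lower()
    words = TextBlob(message).words
    # for each word, take its "base form" = lemma
    return [word.lemma for word in words]

pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=split_into_lemmas)),  # strings to token integer counts
    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
    ('classifier', MultinomialNB()),  # train on TF-IDF vectors w/ Naive Bayes classifier
])

A common practice is to partition the training set again into smaller subsets, for example five equally sized ones. We then train the model on four of them and compute accuracy on the remaining one (called the "validation set"). Repeating this five times, each time with a different subset held out for validation, gives an idea of the model's "stability". If the model's scores differ wildly across subsets, something is probably wrong (bad data, or bad model variance). Go back, analyze the errors, re-check the validity of the input data, re-check the data cleaning.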
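The fold bookkeeping described above can be sketched in a few lines of Python 3. `k_fold_indices` is a hypothetical helper; unlike scikit-learn's cross-validation utilities, this sketch neither shuffles nor stratifies by label.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_indices, validation_indices) for each of n_folds folds."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        stop = start + base + (1 if fold < extra else 0)
        validation = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, validation
        start = stop

folds = list(k_fold_indices(4459, n_folds=5))
fold_sizes = [len(validation) for _, validation in folds]
print(fold_sizes)
```

Every example is used for validation exactly once and for training in the other four rounds, so the five scores together cover the whole training set.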

In our case, everything goes smoothly:

In [32]:

scores = cross_val_score(pipeline,  # steps to convert raw messages into models
                         msg_train,  # training data
                         label_train,  # training labels
                         cv=10,  # split data randomly into 10 parts: 9 for training, 1 for scoring
                         scoring='accuracy',  # which scoring metric?
                         n_jobs=-1,  # -1 = use all cores = faster
                         )
print scores

[ 0.93736018 0.96420582 0.94854586 0.94183445 0.96412556 0.94382022
  0.94606742 0.96404494 0.94831461 0.94606742]

The scores are indeed a little worse than when we trained on the full data (accuracy 0.97 on the 5574 training examples), but they are fairly stable:

In [33]:

print scores.mean(), scores.std()

0.9504386476 0.00947200821389

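A natural question is how to improve this model. The scores are already high here, but how would we go about improving a model in general?

Naive Bayes is an example of a high-bias, low-variance classifier (simple and stable, not prone to overfitting). Examples of the opposite, low bias and high variance (prone to overfitting), are the k-nearest neighbour (kNN) classifier and decision trees. Bagging (random forests) is a way of lowering variance by training many (high-variance) models and averaging them.

In other words:

high bias = the classifier is opinionated. It has its own ideas, and there is only so much room for the data to change them. On the other hand, there is not much room for overfitting either (picture on the left).
low bias = the classifier is more obedient, but also more neurotic. As everybody knows, doing exactly what you are told can cause trouble (picture on the right).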

In [34]:

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 3).
        Specific cross-validation objects can be passed, see
        sklearn.cross_validation module for the list of possible objects

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    # shaded bands: one standard deviation around the mean scores
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

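In [35]:

%time plot_learning_curve(pipeline, "accuracy vs. training set size", msg_train, label_train, cv=5)

CPU times: user 382 ms, sys: 83.1 ms, total: 465 ms
Wall time: 28.5 s

Out[35]:

<module 'matplotlib.pyplot' from '/Volumes/work/workspace/vew/sklearn_intro/lib/python2.7/site-packages/matplotlib/pyplot.pyc'>

(We are effectively training on 64% of all available data: we reserved 20% for the test set above, and the 5-fold cross-validation holds out another 20% of the remainder for validation, which leaves 0.8*0.8*5574 = 3567 training examples.)

Since both the training and the cross-validation scores keep improving as more data is added, we see that our model is not complex or flexible enough to capture all the nuances, given this little data. In this particular case the effect is not very pronounced, since the accuracies are high either way.

At this point, we have two options:

1. use more training data, to make up for the low model complexity;
2. use a more complex (lower-bias) model to begin with, to get more out of the existing data.

Over the past few years, as collecting massive training data has become easier and machines have become faster, approach 1 has grown more and more popular (simpler algorithms, more data). Simple algorithms such as Naive Bayes also have the added benefit of being easier to interpret (compared with more complex black-box models such as neural networks).

Knowing how to evaluate models properly, we can now explore how different parameters affect performance.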

Step 6: How to tune parameters?

What we have seen so far is only the tip of the iceberg: there are many other parameters to tune. One example is which algorithm to use for training.

We used Naive Bayes above, but scikit-learn supports many classifiers out of the box: support vector machines, nearest neighbours, decision trees, ensemble methods, and more.

We can ask: what is the effect of IDF weighting on accuracy? Does the extra processing cost of lemmatization (compared with plain words) really help?

Let's find out:

In [37]:

params = {
    'tfidf__use_idf': (True, False),
    'bow__analyzer': (split_into_lemmas, split_into_tokens),
}

grid = GridSearchCV(
    pipeline,  # pipeline from above
    params,  # parameters to tune via cross validation
    refit=True,  # fit using all available data at the end, on the best found param combination
    n_jobs=-1,  # number of cores to use for parallelization; -1 for "all cores"
    scoring='accuracy',  # what score are we optimizing?
    cv=StratifiedKFold(label_train, n_folds=5),  # what type of cross validation to use
)

In [38]:

%time nb_detector = grid.fit(msg_train, label_train)

print nb_detector.grid_scores_

CPU times: user 4.09 s, sys: 291 ms, total: 4.38 s

Wall time: 20.2 s

[mean: 0.94752, std: 0.00357, params: {'tfidf__use_idf': True, 'bow__analyzer': <function split_into_lemmas at 0x1131e8668>},
 mean: 0.92958, std: 0.00390, params: {'tfidf__use_idf': False, 'bow__analyzer': <function split_into_lemmas at 0x1131e8668>},
 mean: 0.94528, std: 0.00259, params: {'tfidf__use_idf': True, 'bow__analyzer': <function split_into_tokens at 0x11270b7d0>},
 mean: 0.92868, std: 0.00240, params: {'tfidf__use_idf': False, 'bow__analyzer': <function split_into_tokens at 0x11270b7d0>}]

(The best parameter combination is displayed first: in this case, use_idf=True together with analyzer=split_into_lemmas takes the prize.)
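Under the hood, a grid search simply tries every combination of the listed values and keeps the best-scoring one. The Python 3 sketch below enumerates the same 2 x 2 grid; instead of refitting models it looks up the mean accuracies reported in the output above, and the analyzers are represented by their names as plain strings (both are simplifications made for this example).

```python
from itertools import product

param_grid = {
    "tfidf__use_idf": (True, False),
    "bow__analyzer": ("split_into_lemmas", "split_into_tokens"),
}

# Mean cross-validation accuracies copied from the grid search output above.
reported_mean = {
    (True, "split_into_lemmas"): 0.94752,
    (False, "split_into_lemmas"): 0.92958,
    (True, "split_into_tokens"): 0.94528,
    (False, "split_into_tokens"): 0.92868,
}

names = sorted(param_grid)   # fixed order: bow__analyzer, tfidf__use_idf
combos = [dict(zip(names, values))
          for values in product(*(param_grid[name] for name in names))]

def lookup(combo):
    """Stand-in for cross-validating one parameter combination."""
    return reported_mean[(combo["tfidf__use_idf"], combo["bow__analyzer"])]

best = max(combos, key=lookup)
print(len(combos), best)
```

The number of combinations is the product of the list lengths, so the cost of an exhaustive search grows quickly as parameters are added. In these results IDF weighting is worth almost two points of accuracy, while lemmatization changes the score by only about 0.2 points.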

A quick sanity check:

In [39]:

print nb_detector.predict_proba(["Hi mom, how are you?"])[0]
print nb_detector.predict_proba(["WINNER! Credit for free!"])[0]

[ 0.99383955 0.00616045]
[ 0.29663109 0.70336891]

predict_proba returns the predicted probability for each class (ham, spam). In the first case, the message is predicted to be ham with >99% probability and spam with <1%. So if forced to choose, the model says the message is "ham":

In [40]:

print nb_detector.predict(["Hi mom, how are you?"])[0]
print nb_detector.predict(["WINNER! Credit for free!"])[0]

ham

spam
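The step from probabilities to a label is just "pick the largest". The Python 3 sketch below makes that explicit and also shows a stricter alternative that calls a message spam only above a chosen threshold, which is one way to act on the earlier remark that flagging ham as spam is the costlier mistake. The names `most_probable` and `label_with_threshold`, the 0.9 threshold and the assumed column order (ham, spam) are illustrative choices, not part of the notebook.

```python
CLASSES = ["ham", "spam"]   # assumed column order of predict_proba (alphabetical)

def most_probable(probabilities):
    """Return the class with the highest predicted probability."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return CLASSES[best]

def label_with_threshold(probabilities, spam_threshold=0.9):
    """Call a message spam only when the model is at least this sure."""
    return "spam" if probabilities[1] >= spam_threshold else "ham"

first = [0.99383955, 0.00616045]    # "Hi mom, how are you?"
second = [0.29663109, 0.70336891]   # "WINNER! Credit for free!"
print(most_probable(first), most_probable(second))
print(label_with_threshold(second))
```

With the stricter rule the second message, at 70% spam probability, would be let through as ham: fewer false alarms at the price of more missed spam.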

And the overall scores on the test set, which we have not used at all during training:

In [41]:

predictions = nb_detector.predict(msg_test)

print confusion_matrix(label_test, predictions)

print classification_report(label_test, predictions)

[[973 0]

[ 46 96]]

precision recall f1-score support

ham 0.95 1.00 0.98 973

spam 1.00 0.68 0.81 142

avg / total 0.96 0.96 0.96 1115

This is the realistic predictive performance we can expect from our spam detection pipeline when using lemmatization, TF-IDF and a Naive Bayes classifier.

Let's try another classifier: support vector machines (SVM). SVMs are a good starting point for text data: they give good results quickly and need little parameter tuning (although a bit more than Naive Bayes).

In [42]:

pipeline_svm = Pipeline([
    ('bow', CountVectorizer(analyzer=split_into_lemmas)),
    ('tfidf', TfidfTransformer()),
    ('classifier', SVC()),  # <== change here
])

# pipeline parameters to automatically explore and tune
param_svm = [
    {'classifier__C': [1, 10, 100, 1000], 'classifier__kernel': ['linear']},
    {'classifier__C': [1, 10, 100, 1000], 'classifier__gamma': [0.001, 0.0001], 'classifier__kernel': ['rbf']},
]

grid_svm = GridSearchCV(
    pipeline_svm,  # pipeline from above
    param_grid=param_svm,  # parameters to tune via cross validation
    refit=True,  # fit using all data, on the best detected classifier
    n_jobs=-1,  # number of cores to use for parallelization; -1 for "all cores"
    scoring='accuracy',  # what score are we optimizing?
    cv=StratifiedKFold(label_train, n_folds=5),  # what type of cross validation to use
)

In [43]:

%time svm_detector = grid_svm.fit(msg_train, label_train) # find the best combination from param_svm

print svm_detector.grid_scores_

CPU times: user 5.24 s, sys: 170 ms, total: 5.41 s

Wall time: 1min 8s

[mean: 0.98677, std: 0.00259, params: {'classifier__kernel': 'linear', 'classifier__C': 1},
 mean: 0.98654, std: 0.00100, params: {'classifier__kernel': 'linear', 'classifier__C': 10},
 mean: 0.98654, std: 0.00100, params: {'classifier__kernel': 'linear', 'classifier__C': 100},
 mean: 0.98654, std: 0.00100, params: {'classifier__kernel': 'linear', 'classifier__C': 1000},
 mean: 0.86432, std: 0.00006, params: {'classifier__gamma': 0.001, 'classifier__kernel': 'rbf', 'classifier__C': 1},
 mean: 0.86432, std: 0.00006, params: {'classifier__gamma': 0.0001, 'classifier__kernel': 'rbf', 'classifier__C': 1},
 mean: 0.86432, std: 0.00006, params: {'classifier__gamma': 0.001, 'classifier__kernel': 'rbf', 'classifier__C': 10},
 mean: 0.86432, std: 0.00006, params: {'classifier__gamma': 0.0001, 'classifier__kernel': 'rbf', 'classifier__C': 10},
 mean: 0.97040, std: 0.00587, params: {'classifier__gamma': 0.001, 'classifier__kernel': 'rbf', 'classifier__C': 100},
 mean: 0.86432, std: 0.00006, params: {'classifier__gamma': 0.0001, 'classifier__kernel': 'rbf', 'classifier__C': 100},
 mean: 0.98722, std: 0.00280, params: {'classifier__gamma': 0.001, 'classifier__kernel': 'rbf', 'classifier__C': 1000},
 mean: 0.97040, std: 0.00587, params: {'classifier__gamma': 0.0001,

So a linear kernel with C=1 is apparently a very good parameter combination: its mean score of 0.98677 is within one standard deviation of the highest mean listed above (0.98722, for the RBF kernel with C=1000 and gamma=0.001).

Sanity check again:

In [44]:

print svm_detector.predict(["Hi mom, how are you?"])[0]
print svm_detector.predict(["WINNER! Credit for free!"])[0]

ham

spam

In [45]:

print confusion_matrix(label_test, svm_detector.predict(msg_test))
print classification_report(label_test, svm_detector.predict(msg_test))

[[965 8]

[ 13 129]]

precision recall f1-score support

ham 0.99 0.99 0.99 973

spam 0.94 0.91 0.92 142

avg / total 0.98 0.98 0.98 1115

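This is the realistic predictive performance we can expect from our spam detection pipeline when using SVMs.

Step 7: Building a production predictor

With basic analysis and tuning done, the real work (engineering) begins.

The final step for a production predictor is to train it on the entire dataset again, to make full use of all the available data. We would of course use the best parameters found via cross-validation above. This is very similar to what we did at the beginning, but this time we have insight into the model's behaviour and stability, because the evaluation was done on distinct train/test subset splits.

The final predictor can be serialized to disk, so that the next time we want to use it we can skip all the training and use the trained model directly:

In [46]:

# store the spam detector to disk after training
with open('sms_spam_detector.pkl', 'wb') as fout:
    cPickle.dump(svm_detector, fout)

# ...and load it back, whenever needed, possibly on a different machine
svm_detector_reloaded = cPickle.load(open('sms_spam_detector.pkl', 'rb'))

The loaded result is an object that behaves identically to the original:

In [47]:

print 'before:', svm_detector.predict([message4])[0]
print 'after:', svm_detector_reloaded.predict([message4])[0]

before: ham
after: ham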
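On Python 3 the same round trip uses the pickle module in place of cPickle. The sketch below is self-contained, so it pickles a toy keyword-based "model" (a plain dict, purely hypothetical) rather than the trained SVM pipeline; the file name and the `predict` helper are likewise assumptions made for this example.

```python
import os
import pickle
import tempfile

def predict(model, messages):
    """Label a message spam if it contains any of the model's keywords."""
    keywords = set(model["keywords"])
    return ["spam" if keywords & set(m.lower().split()) else "ham" for m in messages]

# Toy stand-in for a trained detector (hypothetical, for illustration only).
model = {"keywords": ["winner!", "free!"]}
path = os.path.join(tempfile.mkdtemp(), "toy_detector.pkl")

with open(path, "wb") as fout:     # pickles are binary data: use "wb" and "rb"
    pickle.dump(model, fout)
with open(path, "rb") as fin:
    reloaded = pickle.load(fin)

texts = ["Hi mom, how are you?", "WINNER! Credit for free!"]
before = predict(model, texts)
after = predict(reloaded, texts)
print(before, after)
```

Two practical caveats apply to real models: the loading side needs compatible versions of the libraries that define the pickled objects, and pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.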

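Another important part of running in production is performance. After rapid, iterative model tuning and parameter search as shown here, a well-performing model can be translated into a different language and optimized. Would trading a few points of accuracy give us a smaller, faster model? Is it worth optimizing memory usage, perhaps using mmap to share memory across processes?

Note that optimization is not always necessary; let the actual circumstances decide.

Other things to consider for a production pipeline are robustness (service failover, redundancy, load balancing), monitoring (including automatic alerts on anomalies) and HR fungibility (avoiding "knowledge silos" about how things are done, arcane or lock-in technologies, the black art of tuning results). These days the open source world can offer viable solutions for all of these areas, and all the tools shown today are free for commercial use, under OSI-approved open source licenses.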