[Scikit-learn Tutorial] 03.02 Text Processing: Classification and Optimization

Date: 2019-11-17

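Review

In the previous section we used several methods provided by Scikit-learn to obtain raw text data from the network and from disk, and extracted text features with the tf-idf method. The example below recaps that process.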

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Select the text categories to analyze
categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
# Load the raw data from disk
twenty_train = load_files("/mnt/vol0/sklearn/20news-bydate-train",
        categories=categories,
        load_content=True,
        encoding="latin1",
        decode_error="strict",
        shuffle=True, random_state=42)
# Count word occurrences
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
# Extract text features with tf-idf
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
# Print the shape of the feature matrix
print(X_train_tfidf.shape)
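To make the tf-idf step less of a black box, here is a minimal pure-Python sketch of the same idea on a tiny made-up corpus. It assumes whitespace tokenization (simpler than what CountVectorizer does) and uses the smoothed idf formula and L2 normalization that we understand to be TfidfTransformer's defaults; the corpus and helper names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus; whitespace splitting is a simplification
docs = ["the cat sat", "the dog sat", "the dog barked"]

tokenized = [doc.split() for doc in docs]
vocabulary = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(docs)

# Document frequency: the number of documents each word appears in
df = Counter(word for tokens in tokenized for word in set(tokens))

# Smoothed idf (assumed to match TfidfTransformer with smooth_idf=True)
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocabulary}


def tfidf_vector(tokens):
    """Return the L2-normalized tf-idf weights for one tokenized document."""
    counts = Counter(tokens)
    weights = [counts[word] * idf[word] for word in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


tfidf = [tfidf_vector(tokens) for tokens in tokenized]
print(vocabulary)
for row in tfidf:
    print([round(w, 3) for w in row])
```

A word that appears in every document ("the") gets the lowest idf, so rarer words such as "cat" carry more weight within a document.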

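In this section we build on the extracted features and continue with the next step of text mining: training and optimizing a classifier.

Training a classifier

Many machine learning algorithms can be used for text classification, and Naïve Bayes is one of the best known. Scikit-learn includes several variants of Naïve Bayes; the one best suited to word counts is Multinomial Naïve Bayes, which can be used as follows.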

# Setup repeated from the review example so this block runs on its own
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
categories = ["alt.atheism", "soc.religion.christian",
              "comp.graphics", "sci.med"]
twenty_train = load_files("/mnt/vol0/sklearn/20news-bydate-train",
        categories=categories,
        load_content=True,
        encoding="latin1",
        decode_error="strict",
        shuffle=True, random_state=42)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# Train a Multinomial Naïve Bayes classifier on the tf-idf features
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
print("Classifier details:")
print(clf)
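For intuition about what the fitted model computes, the following is a small hand-written sketch of multinomial Naïve Bayes with Laplace smoothing. It is a simplified illustration, not Scikit-learn's implementation: it works on raw word counts rather than tf-idf values, and the training sentences, labels, and function names are made up for the example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set: (tokens, label)
train = [
    ("pixel render image".split(), "graphics"),
    ("image shader pixel".split(), "graphics"),
    ("patient doctor treatment".split(), "med"),
    ("doctor treatment dose".split(), "med"),
]
alpha = 1.0  # Laplace smoothing strength

vocabulary = {word for tokens, _ in train for word in tokens}
class_docs = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for tokens, label in train:
    word_counts[label].update(tokens)


def log_posterior(tokens, label):
    """Unnormalized log P(label | tokens) under the multinomial model."""
    total = sum(word_counts[label].values())
    score = math.log(class_docs[label] / len(train))  # log prior
    for word in tokens:
        if word in vocabulary:  # words never seen in training are ignored
            smoothed = (word_counts[label][word] + alpha) / (total + alpha * len(vocabulary))
            score += math.log(smoothed)
    return score


def predict(tokens):
    """Pick the label with the highest log posterior."""
    return max(class_docs, key=lambda label: log_posterior(tokens, label))


print(predict("doctor image treatment".split()))
print(predict("pixel shader".split()))
```

The smoothing term alpha keeps a word that never occurred with a class from driving that class's probability to zero.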

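That completes the training of a classifier. To classify a new document, we must process it with exactly the same data-processing steps that were applied to the training data. In the example below we use a custom list of strings and predict their categories. The strings must go through the transform method (not fit_transform) before prediction, because the vocabulary and idf weights have already been learned from the training set.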

# Setup repeated from the previous examples so this block runs on its own
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
categories = ["alt.atheism", "soc.religion.christian",
              "comp.graphics", "sci.med"]
twenty_train = load_files("/mnt/vol0/sklearn/20news-bydate-train",
        categories=categories,
        load_content=True,
        encoding="latin1",
        decode_error="strict",
        shuffle=True, random_state=42)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

# New strings to classify; replace them with any English sentences
docs_new = ["Nvidia is awesome!"]
# Apply the already-fitted vectorizer and transformer
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

# Predict
predicted = clf.predict(X_new_tfidf)

# Print the predicted category for each string
for doc, category in zip(docs_new, predicted):
    print("%r => %s" % (doc, twenty_train.target_names[category]))
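The reason new documents go through transform rather than fit_transform can be seen in a toy vectorizer. The class below is a hypothetical stand-in written for this illustration (it is not CountVectorizer): the vocabulary is fixed when fit is called, so transform maps new text onto the same columns and drops words it has never seen.

```python
from collections import Counter


class TinyCountVectorizer:
    """Hypothetical, simplified stand-in for CountVectorizer (illustration only)."""

    def fit(self, docs):
        words = sorted({w for doc in docs for w in doc.lower().split()})
        # The column index of every known word is fixed here
        self.vocabulary_ = {word: index for index, word in enumerate(words)}
        return self

    def transform(self, docs):
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for word, count in Counter(doc.lower().split()).items():
                index = self.vocabulary_.get(word)
                if index is not None:  # words unseen during fit are dropped
                    row[index] = count
            rows.append(row)
        return rows


vect = TinyCountVectorizer().fit(["graphics card driver", "doctor prescribed medicine"])
new_rows = vect.transform(["new graphics card released"])
print(vect.vocabulary_)
print(new_rows)
```

Calling fit again on the new strings would build a different vocabulary, and the resulting columns would no longer line up with the features the classifier was trained on.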

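Naïve Bayes is a basic machine learning algorithm built on Bayes' theorem. Although it is simple, sometimes even naive, it runs very fast and often performs well enough to stand alongside many more complex algorithms.

Building a Pipeline

To simplify cleaning the raw data, extracting features, and classifying, Scikit-learn provides the Pipeline class, which bundles these steps into a single classifier-like object. The feature extraction and classification methods are specified directly when the Pipeline is created, which makes coding and debugging considerably more efficient, as shown below: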

# Setup repeated from the previous examples so this block runs on its own
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
categories = ["alt.atheism", "soc.religion.christian",
              "comp.graphics", "sci.med"]
twenty_train = load_files("/mnt/vol0/sklearn/20news-bydate-train",
        categories=categories,
        load_content=True,
        encoding="latin1",
        decode_error="strict",
        shuffle=True, random_state=42)

from sklearn.pipeline import Pipeline
# Build the Pipeline: each step is a (name, estimator) pair
text_clf = Pipeline([("vect", CountVectorizer()),
                     ("tfidf", TfidfTransformer()),
                     ("clf", MultinomialNB()),
])
# Train the classifier
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
# Print the classifier details
print(text_clf)
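Conceptually, a Pipeline just feeds the output of each step into the next one. The sketch below imitates that behaviour with a few hypothetical toy classes (TinyPipeline, Lowercase, KeywordClassifier are invented for this example and are far simpler than the real Scikit-learn classes).

```python
class TinyPipeline:
    """Hypothetical, simplified imitation of sklearn.pipeline.Pipeline."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, object) pairs; the last one is the model

    def fit(self, X, y):
        for _, transformer in self.steps[:-1]:
            X = transformer.fit(X, y).transform(X)  # fit each step, then pass data on
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        for _, transformer in self.steps[:-1]:
            X = transformer.transform(X)  # only transform at prediction time
        return self.steps[-1][1].predict(X)


class Lowercase:
    """Toy transformer: lowercases every text."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [text.lower() for text in X]


class KeywordClassifier:
    """Toy model: remembers the label each training word was seen with."""

    def fit(self, X, y):
        self.word_label_ = {}
        for text, label in zip(X, y):
            for word in text.split():
                self.word_label_[word] = label
        self.default_ = y[0]
        return self

    def predict(self, X):
        results = []
        for text in X:
            votes = [self.word_label_[w] for w in text.split() if w in self.word_label_]
            results.append(max(set(votes), key=votes.count) if votes else self.default_)
        return results


pipe = TinyPipeline([("lower", Lowercase()), ("clf", KeywordClassifier())])
pipe.fit(["GPU Shader", "Doctor Patient"], ["graphics", "med"])
print(pipe.predict(["SHADER code", "patient care"]))
```

Because the whole chain is one object, raw strings can be passed to fit and predict directly, which is exactly how text_clf is used in the examples that follow.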

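Evaluating classifier performance on test data

We can use the approach above to predict on the test set, and then use a NumPy function to compute the evaluation result: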

import numpy as np
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

categories = ["alt.atheism", "soc.religion.christian",
              "comp.graphics", "sci.med"]
# Rebuild and train the pipeline only if it is not already defined in this session
if "text_clf" not in dir():
    twenty_train = load_files("/mnt/vol0/sklearn/20news-bydate-train",
                              categories=categories, load_content=True,
                              encoding="latin1", decode_error="strict",
                              shuffle=True, random_state=42)
    text_clf = Pipeline([("vect", CountVectorizer()),
                         ("tfidf", TfidfTransformer()),
                         ("clf", MultinomialNB()),
    ])
    text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
# Load the test data
twenty_test = load_files('/mnt/vol0/sklearn/20news-bydate-test',
                         categories=categories,
                         load_content=True,
                         encoding='latin1',
                         decode_error='strict',
                         shuffle=True, random_state=42)
docs_test = twenty_test.data
# Predict the category of every test document
predicted = text_clf.predict(docs_test)
# Compute the accuracy of the predictions
print("Accuracy:")
print(np.mean(predicted == twenty_test.target))
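The expression np.mean(predicted == twenty_test.target) works because the comparison yields an array of True/False values, and their mean is the fraction of matches. The same calculation in plain Python, on made-up labels, looks roughly like this (the function and variable names are ours):

```python
def accuracy(predicted_labels, true_labels):
    """Fraction of positions where the prediction equals the true label."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("both sequences must have the same length")
    matches = sum(1 for p, t in zip(predicted_labels, true_labels) if p == t)
    return matches / len(true_labels)


# Hypothetical labels: four of the five predictions are correct
toy_predicted = [0, 1, 2, 2, 3]
toy_target = [0, 1, 2, 1, 3]
print(accuracy(toy_predicted, toy_target))
```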

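If the code above runs normally, we should get an accuracy of 83.4%. There are many ways to improve on this result. One good direction is the support vector machine (SVM), which is widely regarded as one of the best algorithms for text classification (although it is somewhat slower than Naïve Bayes). We can switch to it simply by changing the classifier object specified in the Pipeline: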

import numpy as np
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
# Load the data only if it is not already defined in this session
if 'twenty_train' not in dir():
    twenty_train = load_files('/mnt/vol0/sklearn/20news-bydate-train',
                              categories=categories, load_content=True,
                              encoding='latin1', decode_error='strict',
                              shuffle=True, random_state=42)
if 'twenty_test' not in dir():
    twenty_test = load_files('/mnt/vol0/sklearn/20news-bydate-test',
                             categories=categories, load_content=True,
                             encoding='latin1', decode_error='strict',
                             shuffle=True, random_state=42)
docs_test = twenty_test.data

# Linear SVM trained with stochastic gradient descent (hinge loss).
# Older scikit-learn releases took n_iter=5; that argument was replaced by
# max_iter, and tol=None keeps the fixed number of passes over the data.
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='hinge',
                                           penalty='l2',
                                           alpha=1e-3,
                                           max_iter=5,
                                           tol=None,
                                           random_state=42)),
                    ])
_ = text_clf.fit(twenty_train.data, twenty_train.target)
predicted = text_clf.predict(docs_test)
print("Accuracy:")
print(np.mean(predicted == twenty_test.target))
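Setting loss='hinge' with penalty='l2' is what makes SGDClassifier behave like a linear SVM. As a rough illustration of the idea, and not of Scikit-learn's actual implementation (which also handles a bias term and a learning-rate schedule), here is the hinge loss and a single simplified gradient step; the function names, learning rate, and sample values are assumptions made for this sketch.

```python
def hinge_loss(label, score):
    """label is +1 or -1; score is the raw decision value w·x."""
    return max(0.0, 1.0 - label * score)


def sgd_step(weights, features, label, learning_rate=0.1, alpha=1e-3):
    """One SGD update for the L2-regularized hinge loss (simplified, no bias)."""
    score = sum(w * x for w, x in zip(weights, features))
    updated = []
    for w, x in zip(weights, features):
        grad = alpha * w           # gradient of the L2 penalty (alpha / 2) * w**2
        if label * score < 1:      # the sample is inside the margin or misclassified
            grad -= label * x      # gradient of the hinge term
        updated.append(w - learning_rate * grad)
    return updated


print(hinge_loss(+1, 2.0))   # confidently correct: no loss
print(hinge_loss(+1, 0.5))   # correct but inside the margin: small loss
print(hinge_loss(-1, 0.5))   # wrong side: larger loss

weights = sgd_step([0.0, 0.0], [1.0, 2.0], +1)
print(weights)
```

Samples that are classified correctly with a margin of at least 1 contribute nothing to the update, which is why only the "difficult" documents shape the decision boundary.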

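As we can see, the SVM achieves a considerably higher accuracy than Naïve Bayes.

Scikit-learn provides further evaluation tools for analyzing classifier performance. As shown below, we can obtain the precision, recall, and F-score for each category of the predictions, as well as the confusion matrix.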

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
# Rebuild the SVM pipeline and predictions only if they are not already defined
if 'predicted' not in dir():
    twenty_train = load_files('/mnt/vol0/sklearn/20news-bydate-train',
                              categories=categories, load_content=True,
                              encoding='latin1', decode_error='strict',
                              shuffle=True, random_state=42)
    twenty_test = load_files('/mnt/vol0/sklearn/20news-bydate-test',
                             categories=categories, load_content=True,
                             encoding='latin1', decode_error='strict',
                             shuffle=True, random_state=42)
    docs_test = twenty_test.data
    text_clf = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                               alpha=1e-3, max_iter=5, tol=None,
                                               random_state=42)),
    ])
    _ = text_clf.fit(twenty_train.data, twenty_train.target)
    predicted = text_clf.predict(docs_test)
from sklearn import metrics
print("Classification report:")
print(metrics.classification_report(twenty_test.target, predicted,
    target_names=twenty_test.target_names))
print("Confusion matrix:")
print(metrics.confusion_matrix(twenty_test.target, predicted))
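To see how these numbers relate to one another, the sketch below builds a confusion matrix by hand and derives precision, recall, and F1 for one class. The labels are invented, and the helper functions are our own simplified versions rather than the ones in sklearn.metrics; rows are true classes and columns are predicted classes, the same orientation that metrics.confusion_matrix uses.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    """matrix[t][p] counts samples of true class t predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def precision_recall_f1(matrix, k):
    """Per-class metrics for class k, read off the confusion matrix."""
    true_positive = matrix[k][k]
    predicted_k = sum(row[k] for row in matrix)  # column sum
    actual_k = sum(matrix[k])                    # row sum
    precision = true_positive / predicted_k if predicted_k else 0.0
    recall = true_positive / actual_k if actual_k else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical labels for three classes
toy_target = [0, 0, 0, 1, 1, 2, 2, 2]
toy_predicted = [0, 0, 1, 1, 1, 2, 2, 0]

matrix = confusion_matrix(toy_target, toy_predicted, 3)
for row in matrix:
    print(row)
print(precision_recall_f1(matrix, 1))
```

Off-diagonal entries show which pairs of classes get mixed up, which is the information used in the discussion that follows.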

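As expected, the confusion matrix shows that the atheism (alt.atheism) and Christianity (soc.religion.christian) categories are confused with each other more often than either is with computer graphics (comp.graphics).

Parameter tuning with grid search

We have already met many parameters in the machine learning process, such as use_idf in TfidfTransformer. Classifiers tend to have many parameters as well: Naïve Bayes has the smoothing parameter alpha, and the SVM (here SGDClassifier) has the penalty parameter alpha along with other configurable options such as the loss and penalty functions.

To avoid the tedious work of tuning all of these by hand, we can use grid search to look for the best value of each parameter. In the example below, when building the SVM classifier we try the following settings: single words or words plus two-word phrases, with or without IDF, and a penalty parameter of 0.01 or 0.001.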

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
# GridSearchCV lives in sklearn.model_selection since sklearn 0.18;
# the older sklearn.grid_search module has been removed from current releases
from sklearn.model_selection import GridSearchCV

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
# Rebuild the SVM pipeline only if it is not already defined in this session
if 'text_clf' not in dir():
    twenty_train = load_files('/mnt/vol0/sklearn/20news-bydate-train',
                              categories=categories, load_content=True,
                              encoding='latin1', decode_error='strict',
                              shuffle=True, random_state=42)
    twenty_test = load_files('/mnt/vol0/sklearn/20news-bydate-test',
                             categories=categories, load_content=True,
                             encoding='latin1', decode_error='strict',
                             shuffle=True, random_state=42)
    docs_test = twenty_test.data
    text_clf = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                               alpha=1e-3, max_iter=5, tol=None,
                                               random_state=42)),
    ])

# Parameters to search over, written as <step name>__<parameter name>
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),
}

# Build the grid search object
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
print(gs_clf)

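Clearly, running such a search one combination at a time consumes considerable computing resources. If we have a multi-core CPU, we can evaluate these 8 tasks in parallel (each parameter has two values, so three parameters give 2 × 2 × 2 = 8 combinations), which is controlled by the n_jobs parameter. If we set it to -1, the grid search detects how many CPU cores are available and uses all of them in parallel.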
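The 8 combinations can be enumerated explicitly with the standard library, which also makes the step__parameter naming convention visible. This is only a sketch of how such a grid expands, not GridSearchCV's internal code; note that each combination is additionally fitted once per cross-validation fold.

```python
from itertools import product

# The same grid as in the example above
parameters = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": (True, False),
    "clf__alpha": (1e-2, 1e-3),
}

names = sorted(parameters)
combinations = [
    dict(zip(names, values))
    for values in product(*(parameters[name] for name in names))
]

print(len(combinations))
for combo in combinations:
    print(combo)

# "vect__ngram_range" means: parameter "ngram_range" of the pipeline step "vect"
step, _, param = "vect__ngram_range".partition("__")
print(step, param)
```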

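A grid search object behaves like an ordinary classifier, and we can fit it on a smaller subset of the data to speed up training. Calling fit on the GridSearchCV object yields a classifier similar to the ones in the earlier examples, which we can then use for prediction.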

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
# Rebuild the grid search object only if it is not already defined
if 'gs_clf' not in dir():
    twenty_train = load_files('/mnt/vol0/sklearn/20news-bydate-train',
                              categories=categories, load_content=True,
                              encoding='latin1', decode_error='strict',
                              shuffle=True, random_state=42)
    twenty_test = load_files('/mnt/vol0/sklearn/20news-bydate-test',
                             categories=categories, load_content=True,
                             encoding='latin1', decode_error='strict',
                             shuffle=True, random_state=42)
    docs_test = twenty_test.data
    text_clf = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                               alpha=1e-3, max_iter=5, tol=None,
                                               random_state=42)),
    ])
    parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
                  'tfidf__use_idf': (True, False),
                  'clf__alpha': (1e-2, 1e-3),
    }
    gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
# Fit on a subset of the training data to keep the search fast
gs_clf = gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400])
# Check the prediction for a new text; change the string below to try other inputs
print(twenty_train.target_names[gs_clf.predict(['An apple a day keeps doctor away'])[0]])

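The fitted object also exposes the best_score_ and best_params_ attributes, which hold the best mean cross-validated score and the parameter setting that achieved it. We can also browse gs_clf.cv_results_ for more detailed search results (a feature introduced in the sklearn 0.18 series); it can easily be imported into pandas for deeper study.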

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
# Rebuild the grid search object only if it is not already defined
if 'gs_clf' not in dir():
    twenty_train = load_files('/mnt/vol0/sklearn/20news-bydate-train',
                              categories=categories, load_content=True,
                              encoding='latin1', decode_error='strict',
                              shuffle=True, random_state=42)
    twenty_test = load_files('/mnt/vol0/sklearn/20news-bydate-test',
                             categories=categories, load_content=True,
                             encoding='latin1', decode_error='strict',
                             shuffle=True, random_state=42)
    docs_test = twenty_test.data
    text_clf = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                               alpha=1e-3, max_iter=5, tol=None,
                                               random_state=42)),
    ])
    parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
                  'tfidf__use_idf': (True, False),
                  'clf__alpha': (1e-2, 1e-3),
    }
    gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_clf = gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400])
print("Best score: %r" % (gs_clf.best_score_))

# Print the value chosen for each searched parameter
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))
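To show where a "best score" and "best parameters" come from, here is a small self-contained sketch of grid search with k-fold cross-validation. It is an assumption-laden toy, not GridSearchCV itself: the data are made-up 1-D points, the model is a hand-written nearest-neighbour vote, and the searched parameter n_neighbors has nothing to do with the text pipeline above.

```python
from collections import Counter

# Hypothetical 1-D toy data: small values are class 0, large values are class 1
xs = [1.0, 9.0, 2.0, 8.0, 3.0, 7.0, 1.5, 8.5, 2.5]
ys = [0, 1, 0, 1, 0, 1, 0, 1, 0]


def knn_predict(train_idx, x, n_neighbors):
    """Majority vote among the n_neighbors training points closest to x."""
    nearest = sorted(train_idx, key=lambda i: abs(xs[i] - x))[:n_neighbors]
    return Counter(ys[i] for i in nearest).most_common(1)[0][0]


def fold_accuracy(n_neighbors, train_idx, test_idx):
    """Accuracy on the held-out fold for one parameter value."""
    hits = sum(knn_predict(train_idx, xs[i], n_neighbors) == ys[i] for i in test_idx)
    return hits / len(test_idx)


def k_fold(n_samples, k):
    """Contiguous, unshuffled folds; every sample is held out exactly once."""
    bounds = [round(i * n_samples / k) for i in range(k + 1)]
    return [list(range(bounds[i], bounds[i + 1])) for i in range(k)]


def grid_search(candidates, k=3):
    """Score every candidate by its mean accuracy over the k folds."""
    results = []
    for n_neighbors in candidates:
        scores = []
        for test_idx in k_fold(len(xs), k):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(xs)) if i not in held_out]
            scores.append(fold_accuracy(n_neighbors, train_idx, test_idx))
        results.append({"params": {"n_neighbors": n_neighbors},
                        "mean_test_score": sum(scores) / k})
    best = max(results, key=lambda r: r["mean_test_score"])
    return best, results


best, results = grid_search([1, 3, 5])
print("best params:", best["params"])
print("best score:", best["mean_test_score"])
for entry in results:
    print(entry)
```

Here best plays the role of best_params_ and best_score_, while results is a much-reduced analogue of cv_results_.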

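Summary

At this point we have worked through the complete process of text classification with machine learning: obtaining and loading data from the network, cleaning the raw data and extracting feature vectors, building classifiers with different algorithms, and tuning parameters with grid search, all of which are common topics in supervised machine learning. More complex problems, such as Chinese text processing and text clustering, will be discussed in later articles.

(This lesson is based on Scikit-Learn - Working With Text Data. Please credit the source when reposting.)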