[Study Notes] Hands On Machine Learning - Chap3. Classification

Date: 2019-11-08


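This chapter begins by introducing the MNIST dataset: 70,000 labeled images of handwritten digits (0-9). It is regarded as the "Hello World" of machine learning, and many machine learning algorithms can be trained, tuned, and compared on it.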

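The core of the chapter is how to evaluate a classifier. It introduces the confusion matrix, precision and recall (the key metrics for the positive class), and how to trade one off against the other. It also covers the ROC curve and AUC, and of course the F1 score.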

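Finally, the chapter covers the general approach to building multiclass classifiers. As background, you can also build multilabel classifiers, as well as classifiers in which each label can take several different values.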

Detailed notes follow:

MNIST

The MNIST dataset: 70,000 small images of handwritten digits, often called the "Hello World" of ML.

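Steps to load MNIST manually:

1. Download mnist-original.mat
2. Call sklearn.datasets.get_data_home() to find the local directory where sklearn stores downloaded data
3. Copy the downloaded mnist-original.mat into the get_data_home()/mldata directory
4. Call fetch_mldata() to get the mnist object: if the file is already present locally, it is not downloaded again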

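# Public import path; sklearn.datasets.base is deprecated in newer releases
from sklearn.datasets import get_data_home
# fetch_mldata was removed in scikit-learn 0.22, so this needs an older version
from sklearn.datasets import fetch_mldata
print(get_data_home())
mnist = fetch_mldata('MNIST original')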
mnist
----
/Users/fengyajie/scikit_learn_data
{'DESCR': 'mldata.org dataset: mnist-original',
 'COL_NAMES': ['label', 'data'],
 'target': array([0., 0., 0., ..., 9., 9., 9.]),
 'data': array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=uint8)}
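Since fetch_mldata is not available in recent scikit-learn releases, here is a minimal sketch of one alternative, assuming scikit-learn 0.22 or later and network access: fetch_openml with the "mnist_784" dataset. The helper names load_mnist_openml and to_numeric_labels are hypothetical and not part of the original notes. Two differences to keep in mind: OpenML labels arrive as strings, and the sample order may differ from the mldata.org copy, so an index such as X[36000] need not show the same digit.

```python
import numpy as np

def to_numeric_labels(target):
    """Convert string labels such as '5' into small unsigned integers."""
    return np.asarray(target).astype(np.uint8)

def load_mnist_openml():
    """Download MNIST from OpenML (hypothetical helper; needs network access)."""
    # Imported lazily so the label helper above works without scikit-learn.
    from sklearn.datasets import fetch_openml
    mnist = fetch_openml("mnist_784", version=1, as_frame=False)
    return mnist["data"], to_numeric_labels(mnist["target"])

# Usage (downloads the data on the first call):
# X, y = load_mnist_openml()
# print(X.shape, y.shape)
```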

# Inspect the data: it contains 70,000 images, each with 784 features,
# because each image is 28x28 pixels; every pixel value ranges from 0 (white) to 255 (black)
X,y = mnist["data"],mnist["target"]
print(X.shape,y.shape)
----
(70000, 784) (70000,)

# Display one of the samples
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
some_digit = X[36000]
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap = matplotlib.cm.binary,
           interpolation="nearest")
plt.axis("off")
plt.show()
y[36000]

# Training set and test set
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
# Shuffle the training set so that similar images are not grouped together
import numpy as np
shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]

Training a binary classifier

from sklearn.model_selection import cross_val_score
# A classifier that detects the digit 5, using sklearn's stochastic gradient descent (SGD) implementation
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42)
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
----
array([0.9578 , 0.9607 , 0.96775])

Evaluating performance

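The accuracy above is high, but one reason is that only about 10% of the samples are positive (5s). In that situation, even if I always guess "not 5", the accuracy is still around 90%. This is why accuracy is generally not the metric we use to judge a classifier, especially on a skewed dataset.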
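The arithmetic behind that 90% figure can be sketched as follows. The class counts are read off the confusion matrix shown later in these notes; the variable names are only illustrative.

```python
# Class counts for the "5 vs. not 5" task, taken from the confusion matrix in these notes
n_not_5 = 53556 + 1023   # actual negatives (first row of the matrix)
n_5 = 1252 + 4169        # actual positives (second row of the matrix)

total = n_not_5 + n_5
positive_ratio = n_5 / total
baseline_accuracy = n_not_5 / total  # accuracy of always predicting "not 5"

print(f"share of 5s: {positive_ratio:.2%}")
print(f"accuracy of always guessing 'not 5': {baseline_accuracy:.2%}")
```

A classifier that learns nothing already scores about 91% here, so the cross-validated accuracy of roughly 96% is a much smaller improvement than it first appears.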

Confusion matrix

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
confusion_matrix(y_train_5, y_train_pred)
----
array([[53556,  1023],
       [ 1252,  4169]])

# Print the precision score, the recall score and the F1 score
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score
print("precision_score={}, recall_score={}".format(precision_score(y_train_5, y_train_pred), recall_score(y_train_5, y_train_pred)))
print("f1_score={}".format(f1_score(y_train_5, y_train_pred)))
----
precision_score=0.8029661016949152, recall_score=0.7690463014204021
f1_score=0.7856402525204936
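To connect these scores to the confusion matrix, the following sketch recomputes them by hand from the four counts printed earlier. It assumes scikit-learn's layout, where rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix from the notes: rows = actual (not 5, 5), columns = predicted (not 5, 5)
conf = np.array([[53556, 1023],
                 [1252, 4169]])
tn, fp = conf[0]
fn, tp = conf[1]

precision = tp / (tp + fp)   # of everything predicted as 5, how much really is a 5
recall = tp / (tp + fn)      # of all real 5s, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)
```

The three values agree with the output of precision_score, recall_score and f1_score above.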

Precision/Recall tradeoff

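Precision and recall usually cannot both be maximized: when one goes up, the other tends to go down, so the two metrics have to be traded off against each other. For example, when deciding whether a video is safe for children, we want precision to be as high as possible and can sacrifice recall; when detecting thieves from photos, we care more about recall being as high as possible.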
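The tradeoff comes from the decision threshold applied to the classifier's scores. The sketch below uses made-up toy labels and scores (not MNIST) and a hypothetical helper, precision_recall_at, to show how raising the threshold increases precision and lowers recall.

```python
import numpy as np

def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting positive for scores above the threshold."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(scores, dtype=float) > threshold
    tp = np.sum(y_pred & y_true)
    fp = np.sum(y_pred & ~y_true)
    fn = np.sum(~y_pred & y_true)
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # convention when nothing is predicted positive
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return float(precision), float(recall)

# Toy data: made-up labels and decision scores for six samples
toy_labels = [0, 0, 1, 0, 1, 1]
toy_scores = [-3.0, -1.0, -0.5, 0.5, 1.0, 2.0]

low = precision_recall_at(toy_labels, toy_scores, threshold=-2.0)
high = precision_recall_at(toy_labels, toy_scores, threshold=0.75)
print("low threshold  (precision, recall):", low)
print("high threshold (precision, recall):", high)
```

With the low threshold every positive is caught but two negatives slip in; with the high threshold every prediction is correct but one positive is missed.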

# Plot precision and recall as functions of the decision threshold
from sklearn.metrics import precision_recall_curve
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                                 method="decision_function")
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds): 
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision") 
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall") 
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])
    
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()

ROC curve

from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)
def plot_roc_curve(fpr, tpr, label=None): 
    plt.plot(fpr, tpr, linewidth=2, label=label) 
    plt.plot([0, 1], [0, 1], 'k--') 
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
plot_roc_curve(fpr, tpr)
plt.show()
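As a rough illustration of what roc_curve computes, the sketch below builds the ROC points by hand: for each threshold it records the true positive rate and the false positive rate, then integrates the curve with the trapezoid rule. The helper name roc_points and the toy data are assumptions for illustration; scikit-learn's own implementation differs in its details.

```python
import numpy as np

def roc_points(y_true, scores):
    """Return (fpr, tpr) arrays, one point per distinct threshold, starting at (0, 0)."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    n_pos = y_true.sum()
    n_neg = (~y_true).sum()
    fpr, tpr = [0.0], [0.0]
    for threshold in np.sort(np.unique(scores))[::-1]:  # from strictest to loosest
        y_pred = scores >= threshold
        tpr.append(np.sum(y_pred & y_true) / n_pos)
        fpr.append(np.sum(y_pred & ~y_true) / n_neg)
    return np.array(fpr), np.array(tpr)

# Toy data: made-up labels and decision scores
toy_labels = [0, 0, 1, 0, 1, 1]
toy_scores = [-3.0, -1.0, -0.5, 0.5, 1.0, 2.0]

toy_fpr, toy_tpr = roc_points(toy_labels, toy_scores)
# Area under the curve via the trapezoid rule
toy_auc = float(np.sum(np.diff(toy_fpr) * (toy_tpr[1:] + toy_tpr[:-1]) / 2))
print(toy_auc)
```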

# Compute the AUC
from sklearn.metrics import roc_auc_score
roc_auc_score(y_train_5, y_scores)
----
0.9655990736206981

F1 score or AUC? It depends on the ratio of positive to negative samples: if positive samples are rare, you should prefer the F1 score; otherwise use AUC.
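One way to see why the class ratio matters: a point on the ROC curve is defined by the true positive rate and false positive rate alone, so it does not move when positives become rare, while precision and F1 do. The sketch below uses made-up rates (0.9 and 0.05) and made-up class sizes; the helper precision_and_f1 is hypothetical.

```python
def precision_and_f1(tpr, fpr, n_pos, n_neg):
    """Precision and F1 implied by a fixed ROC point and given class sizes."""
    tp = tpr * n_pos
    fn = n_pos - tp
    fp = fpr * n_neg
    precision = tp / (tp + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, f1

# Same ROC point (TPR=0.9, FPR=0.05), two different class ratios
balanced = precision_and_f1(0.9, 0.05, n_pos=1000, n_neg=1000)
skewed = precision_and_f1(0.9, 0.05, n_pos=100, n_neg=10000)

print("balanced (precision, F1):", balanced)
print("skewed   (precision, F1):", skewed)
```

The ROC point is identical in both cases, yet precision and F1 collapse in the skewed one, which is why ROC-based scores can look overly optimistic when positives are rare.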

Improving with a random forest

from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,
                                    method="predict_proba")
y_scores_forest = y_probas_forest[:, 1]
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5,y_scores_forest)

plt.plot(fpr, tpr, "b:", label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend(loc="lower right")
plt.show()

# AUC of the random forest
roc_auc_score(y_train_5, y_scores_forest)
----
0.993283808868663

Multiclass classification

Types of classifiers

Binary classifiers: Logistic Regression, SVM
Multiclass classifiers: Random Forest, Naive Bayes

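Beyond that, you can also build a multiclass classifier out of binary classifiers. For example, to recognize the ten digits 0-9, you can train 10 binary classifiers, one per digit. To classify an image, run it through all ten classifiers and pick the class whose classifier outputs the highest score. This approach is called one-versus-all (OvA).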
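The OvA idea can be sketched from scratch on toy data. This is an illustration under stated assumptions, not what scikit-learn does internally: the binary scorers here are plain least-squares linear models rather than SGDClassifier, the data is three made-up 2D clusters, and the names fit_ova and predict_ova are hypothetical.

```python
import numpy as np

# Toy data: three well-separated 2D clusters, one per class (made up for illustration)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_toy = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])
y_toy = np.repeat(np.arange(3), 100)

def fit_ova(X, y, n_classes):
    """Fit one linear scorer per class on +1 (this class) / -1 (all others) targets."""
    A = np.hstack([X, np.ones((len(X), 1))])                      # append a bias column
    targets = np.where(y[:, None] == np.arange(n_classes), 1.0, -1.0)
    W, *_ = np.linalg.lstsq(A, targets, rcond=None)               # one weight column per binary task
    return W

def predict_ova(W, X):
    """Score each sample with every binary scorer and pick the highest-scoring class."""
    A = np.hstack([X, np.ones((len(X), 1))])
    return np.argmax(A @ W, axis=1)

W = fit_ova(X_toy, y_toy, n_classes=3)
accuracy = float(np.mean(predict_ova(W, X_toy) == y_toy))
print(accuracy)
```

Each column of W plays the role of one "is it class k or not" classifier, and the final argmax is the "highest score wins" rule described above.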

# sklearn automatically trains multiple binary classifiers under the hood and, at prediction time, returns the class with the highest score
sgd_clf.fit(X_train, y_train)
sgd_clf.predict([some_digit])
----
array([5.])

some_digit_scores = sgd_clf.decision_function([some_digit])
some_digit_scores
----
array([[-227250.7532523 , -511911.42747987, -343850.9936749 ,
        -194518.44134798, -341796.12282028,   10728.59041333,
        -798149.80620821, -263564.01751255, -729498.66535121,
        -553349.11568488]])

# Index of the highest score
np.argmax(some_digit_scores)
# The classes known to the classifier
sgd_clf.classes_
sgd_clf.classes_[5]
----
5.0

Error analysis

# Cross-validation + confusion matrix
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))

y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)
conf_mx
----
array([[5757,    4,   17,   11,    9,   38,   35,   11,   38,    3],
       [   2, 6465,   47,   23,    6,   45,    7,   13,  121,   13],
       [  59,   42, 5330,   96,   91,   24,   81,   55,  163,   17],
       [  45,   44,  139, 5352,    0,  227,   35,   56,  130,  103],
       [  22,   26,   37,    8, 5360,    7,   46,   34,   74,  228],
       [  88,   42,   31,  196,   81, 4577,  107,   28,  175,   96],
       [  40,   25,   48,    2,   44,   86, 5616,    9,   48,    0],
       [  24,   19,   69,   32,   58,   10,    4, 5785,   17,  247],
       [  57,  148,   79,  149,   11,  162,   56,   24, 5003,  162],
       [  45,   35,   27,   86,  161,   29,    2,  182,   67, 5315]])

# Show the confusion matrix as an image
plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.show()

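Most images fall on the main diagonal, so this confusion matrix looks good. The cell for the 5s looks slightly darker than the others, which suggests that 5s are predicted less well than the other digits (it can also mean that the dataset simply contains fewer 5s).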

# Look at the error rates; row_sums is the actual number of samples in each class
row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums
np.fill_diagonal(norm_conf_mx, 0)  # zero the diagonal so that only the errors remain
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()

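Columns 8 and 9 are fairly bright, which means many digits are misclassified as 8 or 9. Very dark rows mean that the digit is almost always classified correctly, for example 1.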
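The same reading can be done numerically instead of by eye. The sketch below re-enters the confusion matrix printed above and looks up the single most frequent confusion and the two predicted classes that attract the most errors; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above (rows = actual digit, columns = predicted digit)
conf_mx = np.array([
    [5757,    4,   17,   11,    9,   38,   35,   11,   38,    3],
    [   2, 6465,   47,   23,    6,   45,    7,   13,  121,   13],
    [  59,   42, 5330,   96,   91,   24,   81,   55,  163,   17],
    [  45,   44,  139, 5352,    0,  227,   35,   56,  130,  103],
    [  22,   26,   37,    8, 5360,    7,   46,   34,   74,  228],
    [  88,   42,   31,  196,   81, 4577,  107,   28,  175,   96],
    [  40,   25,   48,    2,   44,   86, 5616,    9,   48,    0],
    [  24,   19,   69,   32,   58,   10,    4, 5785,   17,  247],
    [  57,  148,   79,  149,   11,  162,   56,   24, 5003,  162],
    [  45,   35,   27,   86,  161,   29,    2,  182,   67, 5315],
])

# Error rate per actual class, with correct predictions zeroed out
error_rates = conf_mx / conf_mx.sum(axis=1, keepdims=True)
np.fill_diagonal(error_rates, 0)
actual_digit, predicted_digit = np.unravel_index(np.argmax(error_rates), error_rates.shape)

# Raw error counts per predicted class (column sums without the diagonal)
errors = conf_mx.copy()
np.fill_diagonal(errors, 0)
worst_columns = np.argsort(errors.sum(axis=0))[-2:]

print("most common confusion:", int(actual_digit), "predicted as", int(predicted_digit))
print("predicted classes with the most errors:", sorted(worst_columns.tolist()))
```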

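For a class you want to improve, you can gather more training samples for it, or preprocess the images (using Scikit-Image, Pillow, or OpenCV), for example so that the digits are centered in the image and not too rotated.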
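As a small illustration of the centering idea, the sketch below shifts an image so that its center of mass lands in the middle of the frame, using only NumPy rather than the libraries named above. The helper center_by_mass is hypothetical, and np.roll wraps pixels around the border, which is assumed to be harmless for digits surrounded by blank margins.

```python
import numpy as np

def center_by_mass(image):
    """Shift a 2D image so that its intensity-weighted center sits at the image center."""
    img = np.asarray(image, dtype=float)
    total = img.sum()
    if total == 0:
        return img.copy()  # blank image: nothing to center
    rows, cols = np.indices(img.shape)
    row_center = (rows * img).sum() / total
    col_center = (cols * img).sum() / total
    shift_rows = int(round((img.shape[0] - 1) / 2 - row_center))
    shift_cols = int(round((img.shape[1] - 1) / 2 - col_center))
    # np.roll wraps around the edges (assumed acceptable for images with blank borders)
    return np.roll(img, (shift_rows, shift_cols), axis=(0, 1))

# Toy 28x28 image with a bright square near the top-left corner
toy_image = np.zeros((28, 28))
toy_image[2:6, 3:7] = 255
centered = center_by_mass(toy_image)
```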

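Multilabel classifiers

Given one input, the classifier outputs several predictions. For example, the program below predicts both whether the digit is large (>= 7) and whether it is odd.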

from sklearn.neighbors import KNeighborsClassifier
y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]

# KNeighborsClassifier supports multilabel targets, so it can output several predictions at once
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)
knn_clf.predict([some_digit])
----
array([[False,  True]])

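# You can evaluate a multilabel classifier with f1_score plus cross-validation.
# If one label has far more positive samples than the other (e.g. many more large digits than odd digits),
# you can weight each label by its support (its number of positive samples):
# just set average="weighted" below instead of average="macro".
# Cross-validate against the multilabel targets (y_multilabel).
y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
f1_score(y_multilabel, y_train_knn_pred, average="macro")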
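To make the difference between average="macro" and average="weighted" concrete, the sketch below computes both by hand on a tiny made-up multilabel example. The helper f1_per_label is hypothetical and assumes every label has at least one positive sample or prediction, so there is no division by zero.

```python
import numpy as np

def f1_per_label(y_true, y_pred):
    """F1 score of each label column in a multilabel indicator matrix."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred, axis=0)
    fp = np.sum(~y_true & y_pred, axis=0)
    fn = np.sum(y_true & ~y_pred, axis=0)
    return 2 * tp / (2 * tp + fp + fn)

# Toy targets: columns are (is_large, is_odd); values are made up
toy_true = np.array([[1, 1], [0, 1], [0, 0], [1, 0], [0, 1], [0, 1]])
toy_pred = np.array([[1, 1], [0, 1], [0, 0], [0, 0], [0, 1], [0, 1]])

scores = f1_per_label(toy_true, toy_pred)
support = toy_true.sum(axis=0)                     # number of positives per label

macro_f1 = float(scores.mean())                    # every label counts equally
weighted_f1 = float(np.average(scores, weights=support))  # labels weighted by support

print(scores, macro_f1, weighted_f1)
```

Here the rarer label (is_large) is predicted worse, so the weighted average, which leans towards the more frequent label, comes out higher than the macro average.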