Model evaluation: quantifying the quality of predictions

Date: 2019-11-07

Original article: https://blog.csdn.net/hustqb/article/details/77922031

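The sklearn library provides three different APIs for evaluating the quality of a model's predictions:

Estimator score method: every estimator has a score() method that provides a default evaluation criterion for the problem it is designed to solve. This is not covered here, because it differs from model to model and is described in each estimator's documentation.
Scoring parameter: model-evaluation tools that use cross-validation (such as model_selection.cross_val_score and model_selection.GridSearchCV) rely on an internal scoring strategy. This is discussed in the section "The scoring parameter: defining model evaluation rules".
Metric functions: the metrics module implements functions that assess prediction error for specific purposes. These metrics are detailed in the sections on Classification metrics, Multilabel ranking metrics, Regression metrics and Clustering metrics.

Finally, Dummy estimators are useful for obtaining baseline values of these metrics for random predictions.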

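Notes:

This post is translated from the official documentation of the Python library sklearn.
Readers are expected to have some familiarity with Python.
This post is very long, and because of my limited ability some passages are translated rather stiffly.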

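The scoring parameter: defining model evaluation rules

Model selection and evaluation tools (such as model_selection.GridSearchCV and model_selection.cross_val_score) take a scoring parameter that controls which metric they apply to the estimators being evaluated.

Common cases: predefined values

For the most common use cases you can designate a scorer object with the scoring parameter; the table below lists all possible values. All scorer objects follow the convention that higher return values are better than lower ones. Metrics that measure the distance between the model and the data, such as metrics.mean_squared_error, are therefore available as neg_mean_squared_error, which returns the negated value of the metric.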

Scoring	Function	Comment
Classification
'accuracy'	`metrics.accuracy_score`
'average_precision'	`metrics.average_precision_score`
'f1'	`metrics.f1_score`	for binary targets
'f1_micro'	`metrics.f1_score`	micro-averaged
'f1_macro'	`metrics.f1_score`	macro-averaged
'f1_weighted'	`metrics.f1_score`	weighted average
'f1_samples'	`metrics.f1_score`	by multilabel sample
'neg_log_loss'	`metrics.log_loss`	requires predict_proba support
'precision' etc.	`metrics.precision_score`	suffixes apply as with 'f1'
'recall' etc.	`metrics.recall_score`	suffixes apply as with 'f1'
'roc_auc'	`metrics.roc_auc_score`
Clustering
'adjusted_mutual_info_score'	`metrics.adjusted_mutual_info_score`
'adjusted_rand_score'	`metrics.adjusted_rand_score`
'completeness_score'	`metrics.completeness_score`
'fowlkes_mallows_score'	`metrics.fowlkes_mallows_score`
'homogeneity_score'	`metrics.homogeneity_score`
'mutual_info_score'	`metrics.mutual_info_score`
'normalized_mutual_info_score'	`metrics.normalized_mutual_info_score`
'v_measure_score'	`metrics.v_measure_score`
Regression
'explained_variance'	`metrics.explained_variance_score`
'neg_mean_absolute_error'	`metrics.mean_absolute_error`
'neg_mean_squared_error'	`metrics.mean_squared_error`
'neg_mean_squared_log_error'	`metrics.mean_squared_log_error`
'neg_median_absolute_error'	`metrics.median_absolute_error`
'r2'	`metrics.r2_score`

Example:

    from sklearn import svm, datasets                    # SVM models and bundled datasets
    from sklearn.model_selection import cross_val_score  # cross-validated scoring

    # Load the iris dataset
    iris = datasets.load_iris()
    X, y = iris.data, iris.target
    clf = svm.SVC(probability=True, random_state=0)      # SVC with probability estimates

    # Score the classifier; cv=3 matches the three folds reported in the original post
    score_arr = cross_val_score(clf, X, y, cv=3, scoring='neg_log_loss')
    print(score_arr)        # original post: [-0.0757138 -0.16816241 -0.07091847]
    print(score_arr.shape)  # (3,)

    model = svm.SVC()       # a second SVC classifier
    # Request a scoring name that does not exist
    cross_val_score(model, X, y, scoring='wrong_choice')  # raises ValueError

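PS: the values listed in the ValueError exception correspond to the functions measuring prediction accuracy described in the following sections. In the scikit-learn version this post covers, the scorer objects for these functions are stored in the dictionary sklearn.metrics.SCORERS (newer releases expose the names through sklearn.metrics.get_scorer_names() instead).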

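Defining your scoring strategy from metric functions

The module sklearn.metrics also exposes a set of simple functions that measure a prediction error given the ground truth and the prediction:

Functions ending with _score return a value to maximize: the higher the better.
Functions ending with _error or _loss return a value to minimize: the lower the better. When converting one into a scorer object with make_scorer(), set the greater_is_better parameter to False (it defaults to True).

Metrics available for the various machine learning tasks are detailed in the sections below.
Many metrics are not given names to be used as scoring values, sometimes because they require additional parameters, such as fbeta_score. In such cases you need to generate an appropriate scorer object. The simplest way to generate a callable object for scoring is make_scorer(), which converts a metric into a callable that can be used for model evaluation.
One typical use case is to wrap an existing metric function from the library with non-default values for its parameters, such as the beta parameter of fbeta_score():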

    from sklearn.metrics import fbeta_score, make_scorer
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import LinearSVC

    # Wrap fbeta_score with beta=2 as a scorer object
    ftwo_scorer = make_scorer(fbeta_score, beta=2)
    # Evaluate a linear SVC over C in {1, 10} using the new scorer
    grid = GridSearchCV(LinearSVC(), param_grid={'C': [1, 10]}, scoring=ftwo_scorer)

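A second use case is to build a completely custom scorer object from a simple Python function using make_scorer(), which can take several parameters:

The Python function you want to use (my_custom_loss_func() in the example below).
Whether the Python function returns a score (greater_is_better=True, the default) or a loss (greater_is_better=False). If it returns a loss, the scorer object negates the output of the Python function, conforming to the cross-validation convention that scorers return higher values for better models.
For classification metrics only: whether the Python function you provided requires continuous decision certainties (needs_threshold=True). The default value is False.
Any additional parameters, such as beta in fbeta_score() or labels in f1_score().

Here is an example of building custom scorers and of using the greater_is_better parameter: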

    import numpy as np
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import make_scorer

    def my_custom_loss_func(ground_truth, predictions):
        """Custom loss: log(1 + largest absolute error).

        ground_truth: true values
        predictions: predicted values
        """
        diff = np.abs(ground_truth - predictions).max()
        return np.log(1 + diff)

    loss = make_scorer(my_custom_loss_func, greater_is_better=False)  # loss: output is negated
    score = make_scorer(my_custom_loss_func, greater_is_better=True)  # score: output kept as is
    X = [[1], [1]]  # features (named ground_truth in the original post)
    y = [0, 1]      # labels (named predictions in the original post)
    clf = DummyClassifier(strategy='most_frequent', random_state=0)  # trivial classifier
    clf = clf.fit(X, y)
    loss(clf, X, y)   # -0.69...
    score(clf, X, y)  # 0.69...

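This code has no practical use; it only serves to illustrate two points:

A custom scoring function takes the true values and the predicted values and returns either an error or a score, computed however you see fit. The function here returns an error.

The greater_is_better parameter of make_scorer() exists to accommodate such custom functions. When your function returns a loss, set it to False so that the loss is negated; when it returns a score, leave it at True. Either way the resulting scorer follows the higher-is-better convention described earlier.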

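Implementing your own scoring object

You can generate even more flexible model scorers by constructing your own scoring object from scratch, without using the make_scorer() factory. For a callable to be a scorer, it needs to meet the protocol specified by the following two rules:

It can be called with parameters (estimator, X, y), where estimator is the model that should be evaluated, X is the validation data, and y is the ground-truth target for X (in the supervised case) or None (in the unsupervised case).
It returns a floating-point number that quantifies the estimator's prediction quality on X with reference to y. Again, by convention higher numbers are better, so if your scorer returns a loss, that value should be negated.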
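
The two rules above can be sketched without make_scorer(). In this minimal illustration, MeanPredictor and neg_max_abs_error are hypothetical names invented for the example (neither is part of scikit-learn); the only point is that the scorer is callable as scorer(estimator, X, y) and returns a float for which higher means better.

```python
class MeanPredictor:
    """Toy estimator (hypothetical): always predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def neg_max_abs_error(estimator, X, y):
    """Scorer following the (estimator, X, y) protocol.

    The largest absolute error is a loss, so it is negated to respect
    the convention that higher return values mean better models.
    """
    predictions = estimator.predict(X)
    worst = max(abs(t - p) for t, p in zip(y, predictions))
    return -float(worst)


X = [[0], [1], [2], [3]]
y = [1.0, 2.0, 3.0, 6.0]
model = MeanPredictor().fit(X, y)
result = neg_max_abs_error(model, X, y)
print(result)  # -3.0
```

A callable with this signature can be passed as the scoring argument of tools such as cross_val_score; the estimator would then also have to be a proper scikit-learn estimator, which this toy class is not.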

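Using multiple metrics

Scikit-learn also permits evaluation of multiple metrics in GridSearchCV, RandomizedSearchCV and cross_validate. There are two ways to specify multiple scoring metrics for the scoring parameter:

As an iterable of string metrics:

    scoring = ['accuracy', 'precision']

As a dict mapping the scorer name to the scoring function:

    from sklearn.metrics import accuracy_score
    from sklearn.metrics import make_scorer
    scoring = {'accuracy': make_scorer(accuracy_score),
               'prec': 'precision'}

Note that the dict values can be either scorer functions or predefined metric strings. Currently only scorer functions that return a single score can be passed inside the dict. Scorer functions that return multiple values are not permitted and require a wrapper that returns a single metric: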

    from sklearn import datasets
    from sklearn.metrics import confusion_matrix, make_scorer
    from sklearn.model_selection import cross_validate
    from sklearn.svm import LinearSVC

    # A toy binary classification dataset
    X, y = datasets.make_classification(n_classes=2, random_state=0)
    clf = LinearSVC(random_state=0)

    # Each wrapper returns one cell of the 2x2 confusion matrix
    # (rows: true class, columns: predicted class)
    def tn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 0]
    def fp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[0, 1]
    def fn(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 0]
    def tp(y_true, y_pred): return confusion_matrix(y_true, y_pred)[1, 1]

    scoring = {'tp': make_scorer(tp), 'tn': make_scorer(tn),
               'fp': make_scorer(fp), 'fn': make_scorer(fn)}
    cv_results = cross_validate(clf.fit(X, y), X, y, scoring=scoring)
    print(cv_results['test_tp'])  # true positives on each test fold
    print(cv_results['test_fn'])  # false negatives on each test fold

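Classification metrics

The sklearn.metrics module implements several loss, score and utility functions to measure classification performance. Some metrics may require probability estimates of the positive class, confidence values, or binary decision values. Most implementations allow each sample to provide a weighted contribution to the overall score through the sample_weight parameter.
Some of these are restricted to the binary classification case: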


precision_recall_curve(y_true, probas_pred)
roc_curve(y_true, y_score[, pos_label, …])

Others also work in the multiclass case:


cohen_kappa_score(y1, y2[, labels, weights, …])
confusion_matrix(y_true, y_pred[, labels, …])
hinge_loss(y_true, pred_decision[, labels, …])
matthews_corrcoef(y_true, y_pred[, …])

Some also work in the multilabel case:


accuracy_score(y_true, y_pred[, normalize, …])
classification_report(y_true, y_pred[, …])
f1_score(y_true, y_pred[, labels, …])
fbeta_score(y_true, y_pred, beta[, labels, …])
hamming_loss(y_true, y_pred[, labels, …])
jaccard_similarity_score(y_true, y_pred[, …])
log_loss(y_true, y_pred[, eps, normalize, …])
precision_recall_fscore_support(y_true, y_pred)
precision_score(y_true, y_pred[, labels, …])
recall_score(y_true, y_pred[, labels, …])
zero_one_loss(y_true, y_pred[, normalize, …])

Some are typically used for ranking:


dcg_score(y_true, y_score[, k])
ndcg_score(y_true, y_score[, k])

And some work with binary and multilabel problems, but not with multiclass problems:


average_precision_score(y_true, y_score[, …])
roc_auc_score(y_true, y_score[, average, …])

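The following sections describe each of these functions.

From binary to multiclass and multilabel

Some metrics are essentially defined for binary classification tasks (for example f1_score and roc_auc_score). In these cases, by default only the positive label is evaluated, and the positive class is assumed to be labelled 1 (though this can be configured through the pos_label parameter).
When a binary metric is extended to multiclass or multilabel problems, the data is treated as a collection of binary problems, one for each class. There are then several ways to average the binary metric calculations across the set of classes, each of which may be useful in some scenario. The method is selected with the average parameter:

"macro" simply computes the mean of the binary metrics, giving equal weight to each class.
"weighted" accounts for class imbalance by computing the average of the binary metrics with each class's score weighted by its presence in the true data.
"micro" gives each sample-class pair an equal contribution to the overall metric (except as a result of sample_weight). Rather than averaging the per-class metrics, it sums the counts that make up the per-class metrics and computes one overall value.
"samples" applies only to multilabel problems. It does not compute a per-class measure; instead it computes the metric over the true and predicted classes of each sample in the evaluation data and returns their (sample_weight-weighted) average.
Selecting average=None returns an array with the score for each class.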
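
The averaging options can be made concrete with a few lines of plain Python. The sketch below computes per-class, macro, weighted and micro precision by hand for a small three-class example; the helper names per_class_counts and precision are invented for the illustration, and treating an undefined precision as 0 is an assumption of this sketch.

```python
def per_class_counts(y_true, y_pred, labels):
    """Return {label: (tp, fp)}, treating each label as its own binary problem."""
    counts = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if p == label and t == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if p == label and t != label)
        counts[label] = (tp, fp)
    return counts


def precision(tp, fp):
    # Assumption of this sketch: precision is 0 when the class was never predicted.
    return tp / (tp + fp) if (tp + fp) else 0.0


y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
labels = [0, 1, 2]
counts = per_class_counts(y_true, y_pred, labels)

# average=None: one score per class
per_class = {label: precision(*counts[label]) for label in labels}
# "macro": unweighted mean over classes
macro = sum(per_class.values()) / len(labels)
# "weighted": mean over classes, weighted by the number of true samples per class
support = {label: y_true.count(label) for label in labels}
weighted = sum(per_class[label] * support[label] for label in labels) / len(y_true)
# "micro": pool the counts of all classes first, then compute a single precision
total_tp = sum(tp for tp, _ in counts.values())
total_fp = sum(fp for _, fp in counts.values())
micro = precision(total_tp, total_fp)

print(round(macro, 2), round(weighted, 2), round(micro, 2))  # 0.22 0.22 0.33
```

Macro and weighted averaging coincide here only because every class has the same number of true samples.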

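If this is unclear, see the examples in the section "Multiclass and multilabel classification" below.
While multiclass data is provided to a metric, like binary targets, as an array of class labels, multilabel data is specified as an indicator matrix, in which cell [i, j] has value 1 if sample i has label j and value 0 otherwise.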

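Accuracy score

The accuracy_score function computes the accuracy, either the fraction (default) or the count (normalize=False) of correct predictions. In multilabel classification the function returns the subset accuracy: if the entire set of predicted labels for a sample strictly matches the true set of labels, the subset accuracy is 1.0; otherwise it is 0.0. If ${\hat{y}}_{i}$ is the predicted value of the i-th sample and $y_{i}$ is the corresponding true value, then the fraction of correct predictions over $n_{samples}$ is defined as
$accuracy (y, \hat{y}) = \frac{1}{n_{samples}} \sum_{i = 0}^{n_{samples} - 1} 1 ({\hat{y}}_{i} = y_{i})$
where $1 (x)$ is the indicator function.

    import numpy as np
    from sklearn.metrics import accuracy_score

    y_pred = [0, 2, 1, 3]
    y_true = [0, 1, 2, 3]
    accuracy_score(y_true, y_pred)                   # fraction of correct predictions: 0.5
    accuracy_score(y_true, y_pred, normalize=False)  # number of correct predictions: 2

    # Multilabel case
    accuracy_score(np.array([[0, 1], [1, 1]]), np.ones((2, 2)))  # 0.5

To explain the multilabel case: sample 0 (the first row) has true labels [0, 1] and predicted labels [1, 1], so the prediction is wrong. Sample 1 (the second row) has true labels [1, 1] and predicted labels [1, 1], so the prediction is correct. The resulting accuracy is therefore 0.5. With normalize=False the function returns 1, meaning one sample was predicted correctly.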
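
The subset rule described above is easy to reproduce by hand. The following sketch uses a hypothetical helper, subset_accuracy, written for this illustration only; it counts a sample as correct only when its whole label vector matches.

```python
def subset_accuracy(y_true, y_pred, normalize=True):
    """Count samples whose entire label vector is predicted exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return exact / len(y_true) if normalize else exact


y_true = [[0, 1], [1, 1]]
y_pred = [[1, 1], [1, 1]]
frac = subset_accuracy(y_true, y_pred)                     # fraction of exact matches
count = subset_accuracy(y_true, y_pred, normalize=False)   # number of exact matches
print(frac, count)  # 0.5 1
```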

Cohen’s kappa

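The function cohen_kappa_score() computes Cohen's kappa statistic. This measure is intended to compare labelings by different human annotators, not a classifier against a ground truth. The kappa score is a number between -1 and 1. Scores above 0.8 are generally considered good agreement; zero or lower means no agreement (practically random labels).
Kappa scores can be computed for binary or multiclass problems, but not for multilabel problems.

    from sklearn.metrics import cohen_kappa_score

    y_true = [2, 0, 2, 2, 0, 1]
    y_pred = [0, 0, 2, 2, 0, 2]
    cohen_kappa_score(y_true, y_pred)  # 0.4285...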
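
To see where 0.4285 comes from, the sketch below applies the standard definition of kappa, (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from the two label distributions. The formula is not spelled out in the post, and cohen_kappa is a hypothetical helper written for this illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa from observed and chance agreement between two labelings."""
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both labelings pick the same class independently
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
kappa = cohen_kappa(y_true, y_pred)
print(round(kappa, 4))  # 0.4286
```

Here the two labelings agree on 4 of 6 samples (p_o = 0.667), while chance agreement is 15/36 (p_e = 0.417), which gives 3/7.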

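Confusion matrix

The confusion_matrix() function evaluates classification accuracy by computing the confusion matrix. By definition, entry i, j is the number of observations whose true class is i and whose predicted class is j. For example:

    from sklearn.metrics import confusion_matrix

    y_true = [2, 0, 2, 2, 0, 1]
    y_pred = [0, 0, 2, 2, 0, 2]
    confusion_matrix(y_true, y_pred)
    # array([[2, 0, 0],
    #        [0, 0, 1],
    #        [1, 0, 2]])

For instance, entry (0, 0) is 2, meaning that a true value of 0 was predicted as 0 twice; entry (2, 0) is 1, meaning that a true value of 2 was predicted as 0 once. The entries sum to 2 + 1 + 1 + 2 = 6, the total number of samples. (The original post shows an example confusion-matrix figure here.) For binary problems, the counts of true negatives, false positives, false negatives and true positives can be read off as follows:

    y_true = [0, 0, 0, 1, 1, 1, 1, 1]
    y_pred = [0, 1, 0, 1, 0, 1, 0, 1]
    # t/f: the prediction is correct/wrong; p/n: the predicted class is positive/negative
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    tn, fp, fn, tp  # (2, 1, 2, 3)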
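
The definition "entry i, j counts samples with true class i and predicted class j" translates directly into a counting loop. This is a minimal pure-Python sketch, with a hypothetical helper named confusion, meant only to make the indexing convention explicit.

```python
def confusion(y_true, y_pred):
    """Build a confusion matrix: rows are true classes, columns are predicted classes."""
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


multi = confusion([2, 0, 2, 2, 0, 1], [0, 0, 2, 2, 0, 2])
print(multi)  # [[2, 0, 0], [0, 0, 1], [1, 0, 2]]

binary = confusion([0, 0, 0, 1, 1, 1, 1, 1], [0, 1, 0, 1, 0, 1, 0, 1])
(tn, fp), (fn, tp) = binary  # row 0: true negatives class, row 1: true positives class
print(tn, fp, fn, tp)  # 2 1 2 3
```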

Classification report

The classification_report() function builds a text report showing the main classification metrics. Here is a small example with custom target_names and inferred labels:

    from sklearn.metrics import classification_report

    y_true = [0, 1, 2, 2, 0]
    y_pred = [0, 0, 2, 1, 0]
    target_names = ['class 0', 'class 1', 'class 2']
    cls_rpt = classification_report(y_true, y_pred, target_names=target_names)
    print(cls_rpt)
    print(type(cls_rpt))  # <class 'str'>

As you can see, classification_report() returns a string laid out like a table: the columns are the metrics and the rows are the custom label names.

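Hamming loss

The hamming_loss() function computes the average Hamming loss or Hamming distance between two sets of samples. If ${\hat{y}}_{j}$ is the predicted value for the j-th label of a given sample, $y_{j}$ is the corresponding true value and $n_{labels}$ is the number of classes or labels, then the Hamming loss between the two is defined as
$L_{Hamming} (y, \hat{y}) = \frac{1}{n_{labels}} \sum_{j = 0}^{n_{labels} - 1} 1 ({\hat{y}}_{j} \neq y_{j})$

This should look familiar from accuracy_score() above. The difference is that the Hamming loss divides by the number of labels rather than the number of samples, so it penalizes individual wrong labels instead of whole samples, and the two give different results on multilabel problems. The Hamming loss is also closely related to the Jaccard similarity coefficient described next.

    import numpy as np
    from sklearn.metrics import hamming_loss

    y_pred = [1, 2, 3, 4]
    y_true = [2, 2, 3, 4]
    hamming_loss(y_true, y_pred)  # multiclass, one label per sample: 0.25

    hamming_loss(np.array([[0, 1], [1, 1]]), np.zeros((2, 2)))  # multilabel with binary indicators: 0.75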
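
The difference between counting wrong labels and counting wrong samples can be checked by hand. The sketch below is a plain-Python illustration with a hypothetical helper named hamming; it reproduces the two values above and contrasts the multilabel one with the fraction of samples that are not matched exactly.

```python
def hamming(y_true, y_pred):
    """Fraction of individual labels that are predicted wrongly."""
    # A flat list of class labels is treated as one label per sample.
    rows_true = [t if isinstance(t, (list, tuple)) else [t] for t in y_true]
    rows_pred = [p if isinstance(p, (list, tuple)) else [p] for p in y_pred]
    wrong = sum(a != b for rt, rp in zip(rows_true, rows_pred) for a, b in zip(rt, rp))
    total = sum(len(rt) for rt in rows_true)
    return wrong / total


single = hamming([2, 2, 3, 4], [1, 2, 3, 4])
multi = hamming([[0, 1], [1, 1]], [[0, 0], [0, 0]])
# Fraction of samples whose full label vector is wrong, for comparison
subset_errors = sum(t != p for t, p in zip([[0, 1], [1, 1]], [[0, 0], [0, 0]])) / 2
print(single, multi, subset_errors)  # 0.25 0.75 1.0
```

Three of the four individual labels are wrong (0.75), while both samples fail the exact-match test (1.0).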

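Jaccard similarity coefficient score

The jaccard_similarity_score() function computes the average (default) or sum of Jaccard similarity coefficients, also called the Jaccard index, between pairs of label sets. The Jaccard similarity coefficient of the i-th sample, with ground-truth label set $y_{i}$ and predicted label set ${\hat{y}}_{i}$, is defined as
$J (y_{i}, {\hat{y}}_{i}) = \frac{| y_{i} \cap {\hat{y}}_{i} |}{| y_{i} \cup {\hat{y}}_{i} |}$

The intersection of the true and predicted label sets is the correctly predicted part. The union equals the true label set only when every predicted label is also a true label; whenever the classifier predicts labels outside the true set, the union is larger and the score drops.
In binary and multiclass classification the result is identical to accuracy_score(), but on multilabel problems the two differ markedly.
Note that jaccard_similarity_score() was deprecated in scikit-learn 0.21 and removed in 0.23 in favour of jaccard_score(), so the example here needs an older release.

    import numpy as np
    from sklearn.metrics import jaccard_similarity_score  # only in scikit-learn < 0.23

    y_pred = [0, 2, 1, 3]
    y_true = [0, 1, 2, 3]
    jaccard_similarity_score(y_true, y_pred)                   # 0.5
    jaccard_similarity_score(y_true, y_pred, normalize=False)  # 2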
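
For the multilabel case, the per-sample Jaccard coefficient can be computed directly with Python sets. This is a minimal sketch with a hypothetical helper, mean_jaccard; scoring two empty label sets as 1.0 is an assumption made for the illustration, not something stated in the post.

```python
def mean_jaccard(true_sets, pred_sets):
    """Average of |intersection| / |union| over samples."""
    scores = []
    for truth, pred in zip(true_sets, pred_sets):
        union = truth | pred
        # Assumption of this sketch: two empty label sets count as a perfect match.
        scores.append(len(truth & pred) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


# Indicator rows [0, 1] and [1, 1] written as label sets; every label is predicted for both samples
true_sets = [{1}, {0, 1}]
pred_sets = [{0, 1}, {0, 1}]
score = mean_jaccard(true_sets, pred_sets)
exact = sum(t == p for t, p in zip(true_sets, pred_sets)) / len(true_sets)
print(score, exact)  # 0.75 0.5
```

The first sample gets partial credit (1/2) under Jaccard but none under subset accuracy, which is why the two metrics diverge on multilabel data.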

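Up to here these have all been common, easy-to-understand scores, and you may be finding them a little dull. The main event starts now.

Precision, recall and F-measures

Intuitively, precision is the ability of the classifier not to label a negative sample as positive, and recall is the ability of the classifier to find all the positive samples.

The F-measure ($F_{β}$ and $F_{1}$ measures) can be interpreted as a weighted harmonic mean of precision and recall. An $F_{β}$ measure reaches its best value at 1 and its worst at 0. With $β = 1$, $F_{β}$ and $F_{1}$ are equivalent, and recall and precision are equally important.
precision_recall_curve() computes a precision-recall curve from the ground-truth labels and the scores given by the classifier by varying a decision threshold.
average_precision_score() computes the average precision (AP) from prediction scores. This score corresponds to the area under the precision-recall curve. The value lies between 0 and 1, and higher is better.
The following functions compute precision, recall and F-measures:

Function	Comment
average_precision_score(y_true, y_score[, …])	Compute average precision (AP) from prediction scores
f1_score(y_true, y_pred[, labels, …])	Compute the F1 score
fbeta_score(y_true, y_pred, beta[, labels, …])	Compute the F-beta score
precision_recall_curve(y_true, probas_pred)	Compute the precision-recall curve
precision_recall_fscore_support(y_true, y_pred)	Compute precision, recall, F-measure and support for each class
precision_score(y_true, y_pred[, labels, …])	Compute the precision
recall_score(y_true, y_pred[, labels, …])	Compute the recall

PS: precision_recall_curve() is restricted to the binary case; average_precision_score() is restricted to the binary and multilabel cases.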

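Binary classification

In a binary classification task, the terms "positive" and "negative" refer to the classifier's prediction, and the terms "true" and "false" refer to whether that prediction corresponds to the external judgment (sometimes known as the "observation"). Given these definitions, we can formulate the following table:

	Actual class (observation): positive	Actual class (observation): negative
Predicted class (expectation): positive	tp (true positive): predicted positive, correct	fp (false positive): predicted positive, wrong
Predicted class (expectation): negative	fn (false negative): predicted negative, wrong	tn (true negative): predicted negative, correct

With these counts, precision, recall and the F-measure are defined as:
$precision = \frac{t p}{t p + f p}$
$recall = \frac{t p}{t p + f n}$
$F_{β} = (1 + β^{2}) \frac{precision \times recall}{β^{2} precision + recall}$

    from sklearn import metrics

    y_pred = [0, 1, 0, 0]
    y_true = [0, 1, 0, 1]
    print(metrics.precision_score(y_true, y_pred))        # precision: 1.0
    print(metrics.recall_score(y_true, y_pred))           # recall: 0.5
    print(metrics.f1_score(y_true, y_pred))               # F1: 0.66...
    print(metrics.fbeta_score(y_true, y_pred, beta=0.5))  # 0.83...
    print(metrics.fbeta_score(y_true, y_pred, beta=1))    # 0.66..., equal to F1
    print(metrics.fbeta_score(y_true, y_pred, beta=2))    # 0.55...
    print(metrics.precision_recall_fscore_support(y_true, y_pred, beta=0.5))
    # (array([ 0.66...,  1. ]), array([ 1. ,  0.5]), array([ 0.71...,  0.83...]), array([2, 2]...))

The output of the last call, metrics.precision_recall_fscore_support(), bundles several results. The data contains two labels, 0 and 1, so every array has two entries, one per class. The four arrays are, in order, the precision, the recall, the F-measure and the support (the number of true samples of each class).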
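
The numbers above follow directly from the tp, fp and fn counts. The sketch below recomputes them in plain Python; precision_recall_fbeta is a hypothetical helper for this illustration, and returning 0 for an undefined ratio is an assumption of the sketch.

```python
def precision_recall_fbeta(y_true, y_pred, beta=1.0, pos_label=1):
    """Precision, recall and F-beta for the positive class, from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == pos_label and t == pos_label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == pos_label and t != pos_label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != pos_label and t == pos_label)
    # Assumption of this sketch: undefined ratios are reported as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * precision + recall
    fbeta = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, fbeta


y_pred = [0, 1, 0, 0]
y_true = [0, 1, 0, 1]
p, r, f1 = precision_recall_fbeta(y_true, y_pred)
_, _, f_half = precision_recall_fbeta(y_true, y_pred, beta=0.5)
_, _, f_two = precision_recall_fbeta(y_true, y_pred, beta=2)
print(p, r, round(f1, 2), round(f_half, 2), round(f_two, 2))  # 1.0 0.5 0.67 0.83 0.56
```

With tp = 1, fp = 0 and fn = 1, a beta below 1 pulls the F-measure toward the (perfect) precision and a beta above 1 pulls it toward the (weaker) recall.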

    import numpy as np
    from sklearn.metrics import precision_recall_curve
    from sklearn.metrics import average_precision_score

    y_true = np.array([0, 0, 1, 1])
    y_scores = np.array([0.1, 0.4, 0.35, 0.8])
    precision, recall, threshold = precision_recall_curve(y_true, y_scores)
    # Precision and recall trade off against each other as the threshold varies.
    # Outputs from the original post; the exact arrays can differ between scikit-learn versions.
    print(precision)  # array([ 0.66...,  0.5 ,  1. ,  1. ])
    print(recall)     # array([ 1. ,  0.5,  0.5,  0. ])
    print(threshold)  # array([ 0.35,  0.4 ,  0.8 ])
    print(average_precision_score(y_true, y_scores))  # 0.83...

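Multiclass and multilabel classification

In multiclass and multilabel classification tasks, the notions of precision, recall and F-measure can be applied to each label independently. As described above, there are several ways to combine results across labels, specified by the average argument of the average_precision_score (multilabel only), f1_score, fbeta_score, precision_recall_fscore_support, precision_score and recall_score functions.
Some notation:

$y$ is the set of predicted (sample, label) pairs
$\hat{y}$ is the set of true (sample, label) pairs
$L$ is the set of labels
$S$ is the set of samples
$y_{s}$ is the subset of $y$ with sample $s$, i.e. $y_{s} := \{(s', l) \in y | s' = s\}$
$y_{l}$ is the subset of $y$ with label $l$
similarly, ${\hat{y}}_{s}$ and ${\hat{y}}_{l}$ are subsets of $\hat{y}$
$P (A, B) := \frac{| A \cap B |}{| A |}$
$R (A, B) := \frac{| A \cap B |}{| B |}$
$F_{β} (A, B) := (1 + β^{2}) \frac{P (A, B) \times R (A, B)}{β^{2} P (A, B) + R (A, B)}$

    from sklearn import metrics

    y_true = [0, 1, 2, 0, 1, 2]
    y_pred = [0, 2, 1, 0, 0, 1]
    print(metrics.precision_score(y_true, y_pred, average='macro'))        # 0.22...
    print(metrics.recall_score(y_true, y_pred, average='micro'))           # 0.33...
    print(metrics.f1_score(y_true, y_pred, average='weighted'))            # 0.26...
    print(metrics.fbeta_score(y_true, y_pred, average='macro', beta=0.5))  # 0.23...
    print(metrics.precision_recall_fscore_support(y_true, y_pred, beta=0.5, average=None))
    # (array([ 0.66...,  0. ,  0. ]), array([ 1.,  0.,  0.]), array([ 0.71...,  0. ,  0. ]), array([2, 2, 2]...))

To explain: the per-class precisions for labels 0, 1 and 2 are $\frac{2}{3}$, 0 and 0, so the macro-averaged precision is $\frac{1}{3} (\frac{2}{3} + 0 + 0) \approx 0.22$.

    print(metrics.recall_score(y_true, y_pred, labels=[1, 2], average='micro'))           # 0.0
    print(metrics.precision_score(y_true, y_pred, labels=[0, 1, 2, 3], average='macro'))  # 0.166...

The code above shows the effect of the labels parameter: it restricts or extends the set of classes included in the average. Dropping label 0 leaves only classes that are never predicted correctly, and adding the absent label 3 contributes an extra zero to the macro average.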

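Hinge loss

The hinge_loss() function computes the average distance between the model and the data using hinge loss, a one-sided metric that considers only prediction errors. (Hinge loss is used in maximum-margin classifiers such as support vector machines.) If the labels are encoded as +1 and -1, $y$ is the true value and $w$ is the predicted decision as output by decision_function(), then the hinge loss is defined as:
$L_{Hinge} (y, w) = \max \{1 - w y, 0\} = {| 1 - w y |}_{+}$

    from sklearn import svm
    from sklearn.metrics import hinge_loss

    X = [[0], [1]]
    y = [-1, 1]
    est = svm.LinearSVC(random_state=0)
    est.fit(X, y)  # train the linear classifier
    pred_decision = est.decision_function([[-2], [3], [0.5]])
    print(pred_decision)                          # array([-2.18...,  2.36...,  0.09...])
    print(hinge_loss([-1, 1, 1], pred_decision))  # 0.3...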
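
The binary formula can be applied by hand to the decision values printed above. In this sketch, mean_hinge_loss is a hypothetical helper and the decision values are the rounded numbers from the post, so the result only approximates what hinge_loss() returns.

```python
def mean_hinge_loss(y_true, decisions):
    """Average of max(0, 1 - y * w) for labels in {-1, +1}."""
    losses = [max(0.0, 1.0 - y * w) for y, w in zip(y_true, decisions)]
    return sum(losses) / len(losses)


decisions = [-2.18, 2.36, 0.09]  # rounded decision_function outputs quoted in the post
loss = mean_hinge_loss([-1, 1, 1], decisions)
print(round(loss, 3))  # 0.303
```

Only the third sample contributes: it is on the correct side of the boundary but inside the margin (0.09 < 1), so its loss is 0.91, and the mean over three samples is about 0.30.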

The following example uses an SVM classifier with hinge_loss in a multiclass problem:

    import numpy as np
    from sklearn import svm
    from sklearn.metrics import hinge_loss

    X = np.array([[0], [1], [2], [3]])
    Y = np.array([0, 1, 2, 3])
    labels = np.array([0, 1, 2, 3])
    est = svm.LinearSVC()
    est.fit(X, Y)
    pred_decision = est.decision_function([[-1], [2], [3]])
    y_true = [0, 2, 3]
    hinge_loss(y_true, pred_decision, labels=labels)  # 0.56...

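Log loss

Log loss, also called logistic regression loss or cross-entropy loss, is defined on probability estimates. It is commonly used in (multinomial) logistic regression and neural networks, as well as in some variants of expectation-maximization.
For binary classification with a true label $y \in \{0, 1\}$ and a probability estimate $p = \Pr (y = 1)$, the log loss per sample is the negative log-likelihood of the classifier given the true label:
$L_{\log} (y, p) = - \log \Pr (y | p) = - (y \log (p) + (1 - y) \log (1 - p))$

    from sklearn.metrics import log_loss

    y_true = [0, 0, 1, 1]
    y_pred = [[.9, .1], [.8, .2], [.3, .7], [.01, .99]]
    log_loss(y_true, y_pred)  # 0.1738...

Note that the predictions here are probabilities. For example, y_pred[0] = [0.9, 0.1] means that the predicted probability of class 0 is 0.9 and that of class 1 is 0.1.
Here is how 0.1738 is obtained. In the formula, y takes only the values 0 and 1, and the two class probabilities sum to 1.
For the first sample, y = 0 and p = 0.1, so its log loss is $- (0 \times \log (0.1) + 1 \times \log (0.9)) \approx 0.105$. The other three samples give $- \log (0.8) \approx 0.223$, $- \log (0.7) \approx 0.357$ and $- \log (0.99) \approx 0.010$, and the mean of the four values is about 0.1738.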
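
The same arithmetic can be checked in a few lines of Python. This sketch uses only the math module; binary_log_loss is a hypothetical helper, and prob_of_one holds the second column of y_pred (the predicted probability of class 1).

```python
import math


def binary_log_loss(y_true, prob_of_one):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over samples, with p = P(y = 1)."""
    terms = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
             for y, p in zip(y_true, prob_of_one)]
    return sum(terms) / len(terms)


y_true = [0, 0, 1, 1]
prob_of_one = [0.1, 0.2, 0.7, 0.99]
value = binary_log_loss(y_true, prob_of_one)
print(round(value, 4))  # 0.1738
```

Because each term is the negative logarithm of a probability, the loss is never negative, and it grows without bound as a confident prediction turns out wrong.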

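Matthews correlation coefficient

The matthews_corrcoef function computes the Matthews correlation coefficient (MCC) for binary classes. Quoting Wikipedia:

"The Matthews correlation coefficient is used in machine learning as a measure of the quality of binary (two-class) classifications. It takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes. The MCC is in essence a correlation coefficient value between -1 and +1. A coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction. The statistic is also known as the phi coefficient."

In short: the MCC scores a binary classification using all four confusion-matrix counts, stays informative when the classes differ greatly in size, and ranges from -1 (inverse prediction) through 0 (average random prediction) to +1 (perfect prediction).

In the binary case, where tp, tn, fp and fn are respectively the numbers of true positives, true negatives, false positives and false negatives, the MCC is defined as
$M C C = \frac{t p \times t n - f p \times f n}{\sqrt{(t p + f p) (t p + f n) (t n + f p) (t n + f n)}}$

In the multiclass case, the MCC can be defined in terms of a confusion matrix $C$ for $K$ classes. To simplify the definition, consider the following intermediate variables:

$t_{k} = \sum_{i}^{K} C_{i k}$, the sum of column $k$ of $C$
$p_{k} = \sum_{i}^{K} C_{k i}$, the sum of row $k$ of $C$
$c = \sum_{k}^{K} C_{k k}$, the total number of samples correctly predicted
$s = \sum_{i}^{K} \sum_{j}^{K} C_{i j}$, the total number of samples

One of $t_{k}$ and $p_{k}$ counts how often class $k$ truly occurred and the other how often it was predicted; the formula is symmetric in the two. The multiclass MCC is then defined as:
$M C C = \frac{c \times s - \sum_{k}^{K} p_{k} \times t_{k}}{\sqrt{(s^{2} - \sum_{k}^{K} p_{k}^{2}) \times (s^{2} - \sum_{k}^{K} t_{k}^{2})}}$

    from sklearn.metrics import matthews_corrcoef

    y_true = [+1, +1, +1, -1]
    y_pred = [+1, -1, +1, +1]
    matthews_corrcoef(y_true, y_pred)  # -0.33...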
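
The binary formula can be evaluated by hand for this example. The following sketch, with a hypothetical helper binary_mcc, counts the four outcomes and plugs them into the definition; returning 0 when the denominator vanishes is an assumption of the sketch.

```python
import math


def binary_mcc(y_true, y_pred, pos_label=1):
    """Matthews correlation coefficient from tp, tn, fp and fn counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    tn = sum(1 for t, p in pairs if t != pos_label and p != pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption of this sketch: report 0 when the denominator is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


y_true = [+1, +1, +1, -1]
y_pred = [+1, -1, +1, +1]
mcc = binary_mcc(y_true, y_pred)
print(round(mcc, 2))  # -0.33
```

Here tp = 2, tn = 0, fp = 1 and fn = 1, so the numerator is 0 - 1 and the denominator is sqrt(3 * 3 * 1 * 1) = 3. Accuracy for the same predictions is 0.5, yet the MCC is negative because the only negative sample is misclassified.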

In short, the Matthews correlation coefficient is another way of scoring a predictive model from the true and predicted values.

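Receiver operating characteristic (ROC)

The function roc_curve() computes the receiver operating characteristic curve, or ROC curve. Quoting Wikipedia:

"A receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold is varied. It is created by plotting the fraction of true positives out of the positives (TPR = true positive rate) vs. the fraction of false positives out of the negatives (FPR = false positive rate), at various threshold settings. TPR is also known as sensitivity, and FPR is one minus the specificity or true negative rate."

In short: a ROC curve plots the true positive rate (the fraction of positives that are correctly identified) against the false positive rate (the fraction of negatives that are wrongly flagged as positive) as the decision threshold varies.

This function requires the true binary labels and the target scores, which can be probability estimates of the positive class, confidence values, or binary decisions. Here is a small example of how to use roc_curve():

    import numpy as np
    from sklearn.metrics import roc_curve

    y = np.array([1, 1, 2, 2])
    scores = np.array([0.1, 0.4, 0.35, 0.8])
    fpr, tpr, thresholds = roc_curve(y, scores, pos_label=2)
    # Outputs from the original post; newer releases prepend an extra starting point
    print(fpr)         # [ 0.   0.5  0.5  1. ]
    print(tpr)         # [ 0.5  0.5  1.   1. ]
    print(thresholds)  # [ 0.8   0.4   0.35  0.1 ]

The meaning of thresholds: every sample's score is compared with the threshold, and samples scoring at or above it are predicted positive. The larger the threshold, the less likely a negative is mistaken for a positive; but positives with lower scores are then missed as well, so the TPR also falls.

(The original post shows an example ROC curve figure here.)

The roc_auc_score() function computes the area under the ROC curve, also denoted AUC or AUROC. Computing the area under the curve summarizes the curve information in a single number.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    y_true = np.array([0, 0, 1, 1])
    y_scores = np.array([0.1, 0.4, 0.35, 0.8])
    roc_auc_score(y_true, y_scores)  # 0.75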
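
Both results can be reproduced by sweeping the threshold by hand. The sketch below is a plain-Python illustration with hypothetical helpers roc_points and auc_trapezoid; it assumes a sample is predicted positive when its score is at or above the threshold and integrates the curve with the trapezoidal rule, starting from the point (0, 0).

```python
def roc_points(y_true, scores, pos_label):
    """Return (fpr, tpr, threshold) for each distinct score, from high to low."""
    positives = sum(1 for y in y_true if y == pos_label)
    negatives = len(y_true) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == pos_label)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y != pos_label)
        points.append((fp / negatives, tp / positives, threshold))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear curve through (0, 0) and the ROC points."""
    xs = [0.0] + [fpr for fpr, _, _ in points]
    ys = [0.0] + [tpr for _, tpr, _ in points]
    return sum((xs[i] - xs[i - 1]) * (ys[i] + ys[i - 1]) / 2 for i in range(1, len(xs)))


points = roc_points([1, 1, 2, 2], [0.1, 0.4, 0.35, 0.8], pos_label=2)
area = auc_trapezoid(points)
print([p[0] for p in points])  # [0.0, 0.5, 0.5, 1.0]
print([p[1] for p in points])  # [0.5, 0.5, 1.0, 1.0]
print(area)                    # 0.75
```

The labels and scores are the same data in both examples of the post (classes 1/2 versus 0/1), which is why the hand-computed area matches the roc_auc_score() result.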

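In multilabel classification, roc_auc_score() is extended by averaging over the labels as described above.
Compared with metrics such as the subset accuracy, the Hamming loss or the F1 score, ROC does not require optimizing a threshold for each label. The roc_auc_score() function can also be used in multiclass classification if the predicted outputs have been binarized.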

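Zero-one loss

The zero_one_loss() function computes the sum or the average of the 0-1 classification loss ($L_{0-1}$) over $n_{samples}$. By default the function normalizes over the samples; to get the sum, set normalize=False. In multilabel classification, zero_one_loss() scores a sample as 1 if its predicted label set does not strictly match the true one, and as 0 otherwise. If ${\hat{y}}_{i}$ is the predicted value of the i-th sample and $y_{i}$ is the corresponding true value, the 0-1 loss is defined as
$L_{0-1} (y_{i}, {\hat{y}}_{i}) = 1 ({\hat{y}}_{i} \neq y_{i})$
where $1 (x)$ is the indicator function.

    import numpy as np
    from sklearn.metrics import zero_one_loss

    y_pred = [1, 2, 3, 4]
    y_true = [2, 2, 3, 4]
    print(zero_one_loss(y_true, y_pred))                   # 0.25: a quarter of the predictions are wrong
    print(zero_one_loss(y_true, y_pred, normalize=False))  # 1: one prediction is wrong

    print(zero_one_loss(np.array([[0, 1], [1, 1]]), np.ones((2, 2))))                   # 0.5
    print(zero_one_loss(np.array([[0, 1], [1, 1]]), np.ones((2, 2)), normalize=False))  # 1

This is closely tied to accuracy_score() above: with normalize=True, the zero-one loss is exactly 1 minus the accuracy.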

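Brier score loss

The brier_score_loss() function computes the Brier score for binary classes. Quoting Wikipedia:

"The Brier score is a proper score function that measures the accuracy of probabilistic predictions. It is applicable to tasks in which predictions must assign probabilities to a set of mutually exclusive discrete outcomes."

In short: the Brier score measures the accuracy of probabilistic predictions over a set of mutually exclusive discrete outcomes.

The function returns the mean squared difference between the actual outcome and the predicted probability of that outcome. The actual outcome must be 1 or 0 (true or false), while the predicted probability can take any value between 0 and 1. The Brier score loss therefore also lies between 0 and 1, and the lower the score (the smaller the mean squared difference), the more accurate the prediction. It can be thought of as a measure of the "calibration" of a set of probabilistic predictions.
$B S = \frac{1}{N} \sum_{t = 1}^{N} (f_{t} - o_{t})^{2}$
where $N$ is the total number of predictions and $f_{t}$ is the predicted probability of the actual outcome $o_{t}$.

    import numpy as np
    from sklearn.metrics import brier_score_loss

    y_true = np.array([0, 1, 1, 0])
    y_true_categorical = np.array(["spam", "ham", "ham", "spam"])
    y_prob = np.array([0.1, 0.9, 0.8, 0.4])
    print(brier_score_loss(y_true, y_prob))                                  # 0.055
    print(brier_score_loss(y_true, 1 - y_prob, pos_label=0))                 # 0.055
    print(brier_score_loss(y_true_categorical, y_prob, pos_label="ham"))     # 0.055
    print(brier_score_loss(y_true, y_prob > 0.5))                            # 0.0

Unlike the earlier examples, the true labels in y_true may be numbers or strings, and nothing in them says which class is positive; that is decided by the pos_label parameter of brier_score_loss().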
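
The value 0.055 is just a mean of squared differences, which the sketch below recomputes in plain Python. The helper brier_score is hypothetical; the second call mirrors the pos_label=0 variant by flipping both the outcomes and the probabilities.

```python
def brier_score(outcomes, probabilities):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((f - o) ** 2 for f, o in zip(probabilities, outcomes)) / len(outcomes)


y_true = [0, 1, 1, 0]
y_prob = [0.1, 0.9, 0.8, 0.4]  # predicted probability of class 1
score = brier_score(y_true, y_prob)

# Taking class 0 as the positive class: flip both the outcomes and the probabilities
flipped = brier_score([1 - y for y in y_true], [1 - p for p in y_prob])
print(round(score, 3), round(flipped, 3))  # 0.055 0.055
```

The squared differences are 0.01, 0.01, 0.04 and 0.16, so the least confident prediction (0.4 for a true 0) accounts for most of the loss.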

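Multilabel ranking metrics

In multilabel learning, each sample can have any number of ground-truth labels associated with it. The goal is to give higher scores and better ranks to the ground-truth labels.

Coverage error

The coverage_error() function computes the average number of labels that have to be included in the final prediction so that all true labels are predicted. This is useful if you want to know how many of the top-scored labels you have to predict on average without missing any true one. The best value of this metric is thus the average number of true labels.
Note: this implementation's score is 1 greater than the one given in Tsoumakas et al., 2010. This extends it to handle the degenerate case in which an instance has 0 true labels.
Given a binary indicator matrix of the ground-truth labels $y \in \{0, 1\}^{n_{samples} \times n_{labels}}$ and the score associated with each label $\hat{f} \in R^{n_{samples} \times n_{labels}}$, the coverage is defined as
$coverage (y, \hat{f}) = \frac{1}{n_{samples}} \sum_{i = 0}^{n_{samples} - 1} \max_{j : y_{i j} = 1} rank_{i j}$
with $rank_{i j} = | \{k : {\hat{f}}_{i k} \geq {\hat{f}}_{i j}\} |$.

    import numpy as np
    from sklearn.metrics import coverage_error

    y_true = np.array([[1, 0, 0], [0, 0, 1]])
    y_score = np.array([[0.75, 0.5, 1], [1, 0.2, 0.1]])
    coverage_error(y_true, y_score)  # 2.5

This is an error measure for multilabel problems. Each row of y_score holds the scores used to rank that sample's labels. For sample 0 the only true label is label 0, with score 0.75; two labels score at least 0.75, so its rank is 2. For sample 1 the only true label is label 2, with score 0.1; all three labels score at least 0.1, so its rank is 3. The coverage error is the mean of the two, 2.5.
The error is smallest when the true labels are the top-scored labels of every sample, in which case it equals the average number of true labels per sample.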
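
The rank computation can be written out in a few lines. This is a minimal pure-Python sketch with a hypothetical helper named coverage; giving a sample without true labels a contribution of 0 is an assumption of the sketch, chosen to mirror the degenerate case mentioned above.

```python
def coverage(y_true, y_score):
    """Average, over samples, of the worst rank among the true labels."""
    worst_ranks = []
    for truth, scores in zip(y_true, y_score):
        # rank of label j = number of labels scoring at least as high as label j
        ranks = [sum(1 for s in scores if s >= scores[j])
                 for j, flag in enumerate(truth) if flag == 1]
        # Assumption of this sketch: a sample without true labels contributes 0.
        worst_ranks.append(max(ranks) if ranks else 0)
    return sum(worst_ranks) / len(worst_ranks)


y_true = [[1, 0, 0], [0, 0, 1]]
y_score = [[0.75, 0.5, 1], [1, 0.2, 0.1]]
value = coverage(y_true, y_score)
best = coverage(y_true, [[1.0, 0.1, 0.2], [0.1, 0.2, 0.9]])  # true labels ranked first
print(value, best)  # 2.5 1.0
```

With the true label ranked first in every sample, the result drops to 1.0, the average number of true labels per sample.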

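Label ranking average precision

The label_ranking_average_precision_score() function implements label ranking average precision (LRAP). This metric is linked to average_precision_score(), but is based on the notion of label ranking instead of precision and recall.
LRAP averages over the samples the answer to the following question: for each ground-truth label, what fraction of the labels ranked at least as high are true labels? The metric yields better scores if you are able to give better ranks to the labels associated with each sample. The score is always strictly greater than 0, and the best value is 1. If there is exactly one relevant label per sample, label ranking average precision is equivalent to the mean reciprocal rank.
Given a binary indicator matrix of the ground-truth labels $y \in \{0, 1\}^{n_{samples} \times n_{labels}}$ and the score associated with each label $\hat{f} \in R^{n_{samples} \times n_{labels}}$, the average precision is defined as
$L R A P (y, \hat{f}) = \frac{1}{n_{samples}} \sum_{i = 0}^{n_{samples} - 1} \frac{1}{| y_{i} |} \sum_{j : y_{i j} = 1} \frac{| L_{i j} |}{rank_{i j}}$
with $L_{i j} = \{k : y_{i k} = 1, {\hat{f}}_{i k} \geq {\hat{f}}_{i j}\}$, $rank_{i j} = | \{k : {\hat{f}}_{i k} \geq {\hat{f}}_{i j}\} |$ and $| y_{i} |$ the number of true labels of sample $i$.

    import numpy as np
    from sklearn.metrics import label_ranking_average_precision_score

    y_true = np.array([[1, 0, 0], [0, 0, 1]])
    y_score = np.array([[0.75, 0.5, 1], [1, 0.2, 0.1]])
    label_ranking_average_precision_score(y_true, y_score)  # 0.416...

The formula is more involved than the one for coverage_error(), but the two metrics share the same roots and address the same problem. The key facts are that the best value is 1 and that the score is always greater than 0.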
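
For the example above, LRAP can be worked out by hand: for each true label, divide the number of true labels ranked at least as high by the label's overall rank, then average. The sketch below does this in plain Python; lrap is a hypothetical helper and it assumes every sample has at least one true label.

```python
def lrap(y_true, y_score):
    """Label ranking average precision (assumes each sample has a true label)."""
    per_sample = []
    for truth, scores in zip(y_true, y_score):
        relevant = [j for j, flag in enumerate(truth) if flag == 1]
        ratios = []
        for j in relevant:
            rank = sum(1 for s in scores if s >= scores[j])
            relevant_above = sum(1 for k in relevant if scores[k] >= scores[j])
            ratios.append(relevant_above / rank)
        per_sample.append(sum(ratios) / len(ratios))
    return sum(per_sample) / len(per_sample)


y_true = [[1, 0, 0], [0, 0, 1]]
y_score = [[0.75, 0.5, 1], [1, 0.2, 0.1]]
value = lrap(y_true, y_score)
print(round(value, 3))  # 0.417
```

The true label of sample 0 is ranked second (1/2) and that of sample 1 is ranked third (1/3), so the mean is 5/12. With one true label per sample this is exactly the mean reciprocal rank.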

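Ranking loss

The label_ranking_loss() function computes the ranking loss, which averages over the samples the number of label pairs that are incorrectly ordered (that is, true labels scored lower than false labels), weighted by the inverse of the number of ordered pairs of false and true labels. The lowest achievable ranking loss is zero.
Given a binary indicator matrix of the ground-truth labels $y \in \{0, 1\}^{n_{samples} \times n_{labels}}$ and the score associated with each label $\hat{f} \in R^{n_{samples} \times n_{labels}}$, the ranking loss is defined as
$ranking\_loss (y, \hat{f}) = \frac{1}{n_{samples}} \sum_{i = 0}^{n_{samples} - 1} \frac{1}{| y_{i} | (n_{labels} - | y_{i} |)} | \{(k, l) : {\hat{f}}_{i k} < {\hat{f}}_{i l}, y_{i k} = 1, y_{i l} = 0\} |$
where $| y_{i} |$ is the number of true labels of sample $i$.

    import numpy as np
    from sklearn.metrics import label_ranking_loss

    y_true = np.array([[1, 0, 0], [0, 0, 1]])
    y_score = np.array([[0.75, 0.5, 1], [1, 0.2, 0.1]])
    label_ranking_loss(y_true, y_score)  # 0.75

    y_score = np.array([[1.0, 0.1, 0.2], [0.1, 0.2, 0.9]])  # true labels ranked first: minimum loss
    label_ranking_loss(y_true, y_score)  # 0.0

Again, label_ranking_loss() is a different formulation of the same problem addressed by the two metrics above.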
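
Counting wrongly ordered pairs by hand shows where 0.75 comes from. In this pure-Python sketch, ranking_loss is a hypothetical helper; it assumes every sample has at least one true and one false label, and it counts a pair as wrong only when the true label scores strictly lower than the false one.

```python
def ranking_loss(y_true, y_score):
    """Average fraction of (true, false) label pairs that are ordered wrongly."""
    per_sample = []
    for truth, scores in zip(y_true, y_score):
        pos = [s for s, flag in zip(scores, truth) if flag == 1]
        neg = [s for s, flag in zip(scores, truth) if flag == 0]
        bad_pairs = sum(1 for p in pos for n in neg if p < n)
        per_sample.append(bad_pairs / (len(pos) * len(neg)))
    return sum(per_sample) / len(per_sample)


y_true = [[1, 0, 0], [0, 0, 1]]
first = ranking_loss(y_true, [[0.75, 0.5, 1], [1, 0.2, 0.1]])
second = ranking_loss(y_true, [[1.0, 0.1, 0.2], [0.1, 0.2, 0.9]])
print(first, second)  # 0.75 0.0
```

In the first case sample 0 has one of its two pairs misordered (0.5) and sample 1 has both misordered (1.0), which averages to 0.75.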

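Regression metrics

The sklearn.metrics module implements several loss, score and utility functions to measure regression performance. Some of them have been enhanced to handle the multioutput case: mean_squared_error(), mean_absolute_error(), explained_variance_score() and r2_score().
These functions have a multioutput keyword argument that specifies how the scores or losses for each individual target should be averaged. The default is 'uniform_average', which specifies a uniformly weighted mean over outputs. If an array of shape (n_outputs,) is passed, its entries are interpreted as weights and the corresponding weighted average is returned. If multioutput is 'raw_values', all individual scores or losses are returned unaltered in an array of shape (n_outputs,).
r2_score() and explained_variance_score() accept an additional value 'variance_weighted' for the multioutput parameter. This option weights each individual score by the variance of the corresponding target variable. This setting quantifies the globally captured unscaled variance. If the target variables are of different scale, this score puts more importance on explaining the higher-variance variables. For backward compatibility, multioutput='variance_weighted' was the default of r2_score() in older releases; from scikit-learn 0.19 on the default is 'uniform_average'.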

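Explained variance score

explained_variance_score() computes the explained variance regression score.
If $\hat{y}$ is the estimated target output, $y$ the corresponding (correct) target output and $Var$ the variance (the square of the standard deviation), then the explained variance is estimated as
$explained\_variance (y, \hat{y}) = 1 - \frac{Var \{y - \hat{y}\}}{Var \{y\}}$
The best possible score is 1.0; lower values are worse.

    from sklearn.metrics import explained_variance_score

    y_true = [3, -0.5, 2, 7]
    y_pred = [2.5, 0.0, 2, 8]
    explained_variance_score(y_true, y_pred)  # 0.957...

    y_true = [[0.5, 1], [-1, 1], [7, -6]]
    y_pred = [[0, 2], [-1, 2], [8, -5]]
    print(explained_variance_score(y_true, y_pred, multioutput='raw_values'))  # [ 0.967...,  1. ]
    print(explained_variance_score(y_true, y_pred, multioutput=[0.3, 0.7]))    # 0.990...

In other words, the score is one minus the ratio of the variance of the residuals to the variance of the targets: the fraction of the target variance that the model accounts for.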

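Mean absolute error

The mean_absolute_error() function computes the mean absolute error, a risk metric corresponding to the expected value of the absolute error loss or l1-norm loss. If ${\hat{y}}_{i}$ is the predicted value of the i-th sample and $y_{i}$ is the corresponding true value, then the mean absolute error (MAE) estimated over $n_{samples}$ is defined as
$M A E (y, \hat{y}) = \frac{1}{n_{samples}} \sum_{i = 0}^{n_{samples} - 1} | y_{i} - {\hat{y}}_{i} |$

    from sklearn.metrics import mean_absolute_error

    y_true = [3, -0.5, 2, 7]
    y_pred = [2.5, 0.0, 2, 8]
    mean_absolute_error(y_true, y_pred)  # 0.5
    y_true = [[0.5, 1], [-1, 1], [7, -6]]
    y_pred = [[0, 2], [-1, 2], [8, -5]]
    mean_absolute_error(y_true, y_pred)  # 0.75

    print(mean_absolute_error(y_true, y_pred, multioutput='raw_values'))  # [0.5, 1.]
    print(mean_absolute_error(y_true, y_pred, multioutput=[0.3, 0.7]))    # 0.849...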

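Mean squared error

The mean_squared_error() function computes the mean squared error, a risk metric corresponding to the expected value of the squared (quadratic) error or loss. If ${\hat{y}}_{i}$ is the predicted value of the i-th sample and $y_{i}$ is the corresponding true value, then the mean squared error (MSE) estimated over $n_{samples}$ is defined as
$M S E (y, \hat{y}) = \frac{1}{n_{samples}} \sum_{i = 0}^{n_{samples} - 1} (y_{i} - {\hat{y}}_{i})^{2}$

    from sklearn.metrics import mean_squared_error

    y_true = [3, -0.5, 2, 7]
    y_pred = [2.5, 0.0, 2, 8]
    print(mean_squared_error(y_true, y_pred))  # 0.375
    y_true = [[0.5, 1], [-1, 1], [7, -6]]
    y_pred = [[0, 2], [-1, 2], [8, -5]]
    print(mean_squared_error(y_true, y_pred))  # 0.7083...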

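Mean squared logarithmic error

The mean_squared_log_error() function computes a risk metric corresponding to the expected value of the squared logarithmic (quadratic) error or loss. If ${\hat{y}}_{i}$ is the predicted value of the i-th sample and $y_{i}$ is the corresponding true value, then the mean squared logarithmic error (MSLE) estimated over $n_{samples}$ is defined as
$M S L E (y, \hat{y}) = \frac{1}{n_{samples}} \sum_{i = 0}^{n_{samples} - 1} (\log_{e} (1 + y_{i}) - \log_{e} (1 + {\hat{y}}_{i}))^{2}$

Note that this metric penalizes an under-predicted estimate more than an over-predicted one.

Here is a small example of using mean_squared_log_error:

    from sklearn.metrics import mean_squared_log_error

    y_true = [3, 5, 2.5, 7]
    y_pred = [2.5, 5, 4, 8]
    print(mean_squared_log_error(y_true, y_pred))  # 0.039...
    y_true = [[0.5, 1], [1, 2], [7, 6]]
    y_pred = [[0.5, 2], [1, 2.5], [8, 8]]
    print(mean_squared_log_error(y_true, y_pred))  # 0.044...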
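
The asymmetry between under- and over-prediction is easy to see numerically. The sketch below recomputes the first value in plain Python and then compares two errors of the same absolute size; msle is a hypothetical helper, and the target value 10 with predictions 5 and 15 is an example made up for the illustration.

```python
import math


def msle(y_true, y_pred):
    """Mean of squared differences between log(1 + y) and log(1 + y_hat)."""
    return sum((math.log1p(t) - math.log1p(p)) ** 2
               for t, p in zip(y_true, y_pred)) / len(y_true)


value = msle([3, 5, 2.5, 7], [2.5, 5, 4, 8])
under = msle([10], [5])   # prediction 5 below a true value of 10
over = msle([10], [15])   # prediction 5 above a true value of 10
print(round(value, 4), round(under, 3), round(over, 3))  # 0.0397 0.367 0.14
```

Both predictions miss by 5, but the logarithm compresses large values, so falling short costs more than overshooting.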

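Median absolute error

median_absolute_error() is particularly interesting because it is robust to outliers. The loss is calculated by taking the median of all absolute differences between the target and the prediction. If ${\hat{y}}_{i}$ is the predicted value of the i-th sample and $y_{i}$ is the corresponding true value, then the median absolute error (MedAE) estimated over $n_{samples}$ is defined as
$M e d A E (y, \hat{y}) = median (| y_{1} - {\hat{y}}_{1} |, \ldots, | y_{n} - {\hat{y}}_{n} |)$

In the scikit-learn version this post covers, median_absolute_error() does not support multioutput.

Here is a small example of using median_absolute_error():

    from sklearn.metrics import median_absolute_error

    y_true = [3, -0.5, 2, 7]
    y_pred = [2.5, 0.0, 2, 8]
    median_absolute_error(y_true, y_pred)  # 0.5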

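$R^{2}$ score, the coefficient of determination

The r2_score() function computes $R^{2}$, the coefficient of determination. It provides a measure of how well future samples are likely to be predicted by the model. The best possible score is 1.0, and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, gets an $R^{2}$ score of 0.0. If ${\hat{y}}_{i}$ is the predicted value of the i-th sample and $y_{i}$ is the corresponding true value, then $R^{2}$ estimated over $n_{samples}$ is defined as
$R^{2} (y, \hat{y}) = 1 - \frac{\sum_{i = 0}^{n_{samples} - 1} (y_{i} - {\hat{y}}_{i})^{2}}{\sum_{i = 0}^{n_{samples} - 1} (y_{i} - \bar{y})^{2}}$
where $\bar{y} = \frac{1}{n_{samples}} \sum_{i = 0}^{n_{samples} - 1} y_{i}$.

    from sklearn.metrics import r2_score

    y_true = [3, -0.5, 2, 7]
    y_pred = [2.5, 0.0, 2, 8]
    print(r2_score(y_true, y_pred))  # 0.948...
    y_true = [[0.5, 1], [-1, 1], [7, -6]]
    y_pred = [[0, 2], [-1, 2], [8, -5]]
    print(r2_score(y_true, y_pred, multioutput='variance_weighted'))  # 0.938...
    print(r2_score(y_true, y_pred, multioutput='uniform_average'))    # 0.936...
    print(r2_score(y_true, y_pred, multioutput='raw_values'))         # [ 0.965...,  0.908...]
    print(r2_score(y_true, y_pred, multioutput=[0.3, 0.7]))           # 0.925...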
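
$R^{2}$ and the explained variance score from the earlier section look alike, and a hand computation shows how they differ. The sketch below is plain Python with hypothetical helpers r2 and explained_variance; the shifted predictions are an example made up to show the effect of a constant bias.

```python
def mean(values):
    return sum(values) / len(values)


def r2(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def explained_variance(y_true, y_pred):
    """1 - Var(residuals) / Var(targets), using the population variance."""
    def variance(values):
        m = mean(values)
        return mean([(v - m) ** 2 for v in values])
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1 - variance(residuals) / variance(y_true)


y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
r2_value = r2(y_true, y_pred)
ev_value = explained_variance(y_true, y_pred)
print(round(r2_value, 3), round(ev_value, 3))  # 0.949 0.957

# Add a constant bias of +1 to every prediction
y_shift = [p + 1 for p in y_pred]
r2_shift = r2(y_true, y_shift)
ev_shift = explained_variance(y_true, y_shift)
print(round(r2_shift, 3), round(ev_shift, 3))  # 0.743 0.957

# A constant model predicting the mean of y scores exactly 0
r2_const = r2(y_true, [mean(y_true)] * len(y_true))
```

The explained variance ignores a constant offset in the predictions, while $R^{2}$ penalizes it; the two agree only when the residuals have zero mean.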

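Clustering metrics

The sklearn.metrics module implements several loss, score and utility functions for clustering. For more information, see the Clustering performance evaluation section of the scikit-learn documentation.

Dummy estimators

When doing supervised learning, a simple sanity check consists of comparing one's estimator against simple rules of thumb. DummyClassifier implements several such simple strategies for classification:

stratified generates random predictions by respecting the class distribution of the training set.
most_frequent always predicts the most frequent label in the training set.
prior always predicts the class that maximizes the class prior (like most_frequent), and predict_proba returns the class prior.
uniform generates predictions uniformly at random.
constant always predicts a constant label provided by the user. A major motivation for this strategy is F1 scoring, when the positive class is in the minority.

To illustrate DummyClassifier, first let's create an imbalanced dataset: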

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    iris = load_iris()
    X, y = iris.data, iris.target
    y[y != 1] = -1  # keep class 1 as is and merge the other two classes
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

Next, let's compare the accuracy of SVC and most_frequent:

    from sklearn.dummy import DummyClassifier
    from sklearn.svm import SVC

    clf = SVC(kernel='linear', C=1).fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # 0.63...
    clf = DummyClassifier(strategy='most_frequent', random_state=0)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # 0.57...

We see that SVC doesn't do much better than a dummy classifier. Now, let's change the kernel of the SVC:

    clf = SVC(kernel='rbf', C=1).fit(X_train, y_train)
    clf.score(X_test, y_test)  # 0.97... in the original post

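We see that the accuracy was boosted to almost 100%. A cross-validation strategy is recommended for a better estimate of the accuracy. More generally, when the accuracy of a classifier is too close to random, it probably means that something went wrong: the features are not helpful, a hyperparameter is not correctly tuned, the classifier is suffering from class imbalance, and so on.

DummyRegressor also implements four simple rules of thumb for regression:

mean always predicts the mean of the training targets.
median always predicts the median of the training targets.
quantile always predicts a user-provided quantile of the training targets.
constant always predicts a constant value provided by the user.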
