Evaluation Metrics

Metrics of Classification and Regression

Classification is about deciding which categories new instances belong to. For example, we can organize objects based on whether they are square or round, or we might have data about different passengers on the Titanic, as in project 0, and want to know whether or not each passenger survived. Then, when we see new objects, we can use their features to guess which class they belong to.

In regression, we want to make a prediction on continuous data. For example, we might have a list of different people's heights, ages, and genders and wish to predict their weights. Or perhaps, as in the final project of this course, we have some housing data and wish to make a prediction about the value of a single home.

The problem at hand will determine how we choose to evaluate a model.

Classification Metrics

In machine learning (ML), natural language processing (NLP), information retrieval (IR), and related fields, evaluation is a necessary task, and the common evaluation metrics include: accuracy, precision, recall, and F1-measure. (Note: in IR, the ground truth is often an ordered list rather than a boolean unordered collection. When all items are found, ranking third versus fourth makes little difference, but ranking first versus one-hundredth, while both count as "found", mean very different things, so metrics such as MAP are often more appropriate.)

This article briefly introduces a few of these concepts. The Chinese translations of these metric names vary, so the English terms are generally recommended.

Let us first set up a concrete scenario as an example.

Suppose a class has 80 boys and 20 girls, 100 students in total, and the goal is to find all of the girls. Someone selects 50 people: 20 of them are girls, but 30 boys have also been mistakenly selected as girls. As the evaluator, you need to evaluate this person's work.

First, we can compute the accuracy, defined as: for a given test data set, the ratio of the number of samples the classifier labels correctly to the total number of samples. Equivalently, it is the score on the test set when the loss function is 0-1 loss.

Accuracy

The most basic and common classification metric is accuracy. Accuracy here is described as the proportion of items classified or labeled correctly.

For instance, if a classroom has 14 boys and 16 girls, can facial recognition software correctly identify all the boys and all the girls? If the software correctly identifies 10 boys and 8 girls, then the software is 60% accurate.

accuracy = number of correctly identified instances / all instances

Accuracy is the default metric used in the .score() method for classifiers in sklearn. You can read more in the documentation here.
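A minimal sketch of this computation using scikit-learn's `accuracy_score` (the function behind the default `.score()` behavior for classifiers). The label arrays below are a hypothetical encoding of the classroom example above, with 0 = boy and 1 = girl:

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels for the 30-student example: 0 = boy, 1 = girl.
# 14 boys and 16 girls; the software gets 10 boys and 8 girls right.
y_true = [0] * 14 + [1] * 16
y_pred = [0] * 10 + [1] * 4 + [1] * 8 + [0] * 8  # 10 boys, then 8 girls correct

print(accuracy_score(y_true, y_pred))  # 18 / 30 = 0.6
```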

This may sound abstract, so simply put: in the scenario above, the class actually contains two categories, boys and girls, and the person (the classifier in the definition) also divides the class into boys and girls. Accuracy is the proportion of people this person classified correctly out of the total. We easily get: 70 people (20 girls + 50 boys) were judged correctly out of 100 in total, so the accuracy is 70% (70/100).

From accuracy we can indeed, in some settings and in some sense, tell whether a classifier is effective, but accuracy does not always evaluate a classifier's work effectively. For example, suppose Google has crawled 100 argcv pages, while its index contains 10,000,000 pages in total. Draw a page at random and classify it: is this an argcv page? If my work is judged by accuracy, I would simply label every page "not an argcv page". This is extremely efficient (a one-line return false) and reaches an accuracy of 99.999% (9,999,900/10,000,000), crushing the values that many other classifiers labor to compute, yet this algorithm is clearly not what was asked for. How do we fix this? This is where precision, recall, and the F1-measure come in.
(With imbalanced data, optimizing a model against accuracy alone easily pushes it to predict the majority class for every sample, leaving the classifier with no practical use.)
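The failure mode above can be demonstrated in a few lines. This is a scaled-down, hypothetical version of the example (100 positive pages among 100,000 instead of 10,000,000), using scikit-learn's `accuracy_score` and `recall_score`:

```python
from sklearn.metrics import accuracy_score, recall_score

# Scaled-down sketch: 100 "argcv" pages (label 1) among 100,000 total.
y_true = [1] * 100 + [0] * 99_900
y_pred = [0] * 100_000  # trivial classifier: always answer "not an argcv page"

print(accuracy_score(y_true, y_pred))  # 0.999 -- looks excellent
print(recall_score(y_true, y_pred))    # 0.0   -- finds no argcv pages at all
```

The accuracy is near-perfect while the classifier does nothing useful, which is exactly why precision and recall are needed.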

Before discussing precision, recall, and the F1-measure, we first need to define the four outcomes TP, FN, FP, and TN. Following the earlier example, we want to find all of the girls in a class. Treating this task as a classifier, girls are what we want and boys are not, so we call girls the "positive class" and boys the "negative class".

The four outcomes form a confusion table over relevant (positive class) versus non-relevant (negative class) items:

  • Retrieved, relevant: true positives (TP: a positive judged as positive; in the example, correctly identifying "this is a girl").
  • Retrieved, not relevant: false positives (FP: a negative judged as positive; in the example, a boy mistakenly selected as a girl).
  • Not retrieved, relevant: false negatives (FN: a positive judged as negative; in the example, a girl mistakenly judged to be a boy).
  • Not retrieved, not relevant: true negatives (TN: a negative judged as negative; in the example, a boy correctly judged to be a boy).

From this table, we can easily obtain the four values for the example: TP = 20, FP = 30, FN = 0, TN = 50.
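These four counts can be recovered with scikit-learn's `confusion_matrix`; the label arrays below are a hypothetical encoding of the class example (0 = boy, 1 = girl):

```python
from sklearn.metrics import confusion_matrix

# 20 girls (positive) and 80 boys (negative); the selector picks all 20 girls
# plus 30 boys by mistake, and leaves the remaining 50 boys unselected.
y_true = [1] * 20 + [0] * 80
y_pred = [1] * 20 + [1] * 30 + [0] * 50

# For binary labels [0, 1], the matrix is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)  # 20 30 0 50
```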

Precision And Recall

  • Precision: $$\frac{True Positive} {True Positive + False Positive}$$. Out of all the items labeled as positive, how many truly belong to the positive class.

The formula for precision is $$P=\frac{TP}{TP+FP}$$; it computes the proportion of "items correctly retrieved (TP)" among all "items actually retrieved (TP+FP)".

In the example, this asks: of all the people this person selected, what proportion are correct (that is, girls)? The precision is therefore 40% (20 girls / (20 girls + 30 boys misjudged as girls)).

  • Recall: $$\frac{True Positive}{True Positive + False Negative}$$. Out of all the items that are truly positive, how many were correctly classified as positive. Or simply, how many positive items were 'recalled' from the dataset.

The formula for recall is $$R=\frac{TP}{TP+FN}$$; it computes the proportion of "items correctly retrieved (TP)" among all "items that should have been retrieved (TP+FN)".

In the example, this asks: of all the girls in the class, what proportion did this person find? The recall is therefore 100%:
$$\frac{20\ \text{girls}}{20\ \text{girls} + 0\ \text{girls misjudged as boys}}$$
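Both values can be checked with scikit-learn's `precision_score` and `recall_score`, again using a hypothetical encoding of the class example (0 = boy, 1 = girl):

```python
from sklearn.metrics import precision_score, recall_score

# 20 girls and 80 boys; the first 50 people (20 girls + 30 boys) are selected.
y_true = [1] * 20 + [0] * 80
y_pred = [1] * 50 + [0] * 50

print(precision_score(y_true, y_pred))  # 20 / (20 + 30) = 0.4
print(recall_score(y_true, y_pred))     # 20 / (20 + 0)  = 1.0
```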

F1 Score

Now that you've seen precision and recall, another metric you might consider using is the F1 score. F1 score combines precision and recall relative to a specific positive class.

The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst at 0:

$$F1 = \frac{2 \cdot precision \cdot recall}{precision + recall}$$

For more information about the F1 score and how to use it in sklearn, check out the documentation here.

The F1 value is the harmonic mean of precision and recall, that is:

$$\frac{2}{F1}=\frac{1}{P}+\frac{1}{R}$$

which rearranges to:

$$F_{1}=\frac{2PR}{P+R}=\frac{2TP}{2TP+FP+FN}$$

Note that some authors generalize the F-measure with the formula
$$F_{a}=\frac{(a^2+1)PR}{a^2 P+R}$$

The F1-measure weights precision and recall equally, but in some settings one of the two matters more than the other; adjusting the parameter a (a < 1 emphasizes precision, a > 1 emphasizes recall) lets the F_a-measure evaluate results accordingly.
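A sketch of both scores with scikit-learn's `f1_score` and `fbeta_score` (scikit-learn calls the parameter `beta` rather than a), reusing the hypothetical class example where precision = 0.4 and recall = 1.0:

```python
from sklearn.metrics import f1_score, fbeta_score

y_true = [1] * 20 + [0] * 80
y_pred = [1] * 50 + [0] * 50  # precision = 0.4, recall = 1.0

# F1 = 2PR / (P + R) = 2 * 0.4 * 1.0 / 1.4
print(f1_score(y_true, y_pred))

# beta < 1 weights precision more heavily; beta > 1 favours recall.
print(fbeta_score(y_true, y_pred, beta=0.5))  # pulled toward precision (0.4)
print(fbeta_score(y_true, y_pred, beta=2.0))  # pulled toward recall (1.0)
```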

Regression Metrics

As mentioned earlier, for regression problems we are dealing with a model that makes continuous predictions. In this case, we care about how close the prediction is to the true value.

For example, with height and weight predictions, it is unreasonable to expect a model to predict someone's weight with 100% accuracy, down to a fraction of a pound! But we do care how consistently the model can make a close prediction, perhaps within 3-4 pounds.

Mean Absolute Error

One way to measure error is absolute error: the distance of a prediction from the true value. The mean absolute error sums the absolute error of each example and averages over the number of data points. By summing the absolute values of the errors, positive and negative errors cannot cancel each other out, and we get an overall error metric for evaluating the model.

For more information about mean absolute error and how to use it in sklearn, check out the documentation here.
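A minimal sketch using scikit-learn's `mean_absolute_error`; the weight values below are made-up numbers for the weight-prediction example:

```python
from sklearn.metrics import mean_absolute_error

# Hypothetical true and predicted weights in pounds.
y_true = [150, 172, 199, 210]
y_pred = [147, 175, 201, 204]

# Mean of |error|: (3 + 3 + 2 + 6) / 4 = 3.5
print(mean_absolute_error(y_true, y_pred))
```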

Mean Squared Error

Mean squared error is the most common metric for measuring model performance. In contrast with absolute error, the residual error (the difference between the predicted and the true value) is squared.

Some benefits of squaring the residual error are that the error terms are always positive, larger errors are emphasized over smaller ones, and the result is differentiable. Differentiability allows us to use calculus to find minima or maxima, which often makes optimization more computationally efficient.

For more information about mean squared error and how to use it in sklearn, check out the documentation here.
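The same hypothetical weight data run through scikit-learn's `mean_squared_error`, to contrast with the mean absolute error above:

```python
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted weights in pounds.
y_true = [150, 172, 199, 210]
y_pred = [147, 175, 201, 204]

# Mean of squared residuals: (9 + 9 + 4 + 36) / 4 = 14.5
print(mean_squared_error(y_true, y_pred))
# Squaring penalises the 6-pound miss (36) far more than the 2-pound one (4).
```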

Regression Scoring Functions

In addition to error metrics, scikit-learn contains two scoring metrics which typically range from 0 to 1, with values near 0 indicating bad performance and 1 indicating perfect performance.

These are the metrics that you'll use in the project at the end of the course. They have the advantage of looking similar to classification metrics, with numbers closer to 1.0 being good scores and bad scores tending to be near 0.

One of these is the R2 score, which computes the coefficient of determination of predictions for true values. This is the default scoring method for regression learners in scikit-learn.

The other is the explained variance score.

While we will not dive deep into the explained variance score and the R2 score in this lecture, one important point to remember is that, in general, these regression scores are "higher is better"; that is, higher scores indicate better performance. With error metrics, such as mean squared error or mean absolute error, this preference is inverted, since lower error means better performance.
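A sketch of both scoring functions, `r2_score` and `explained_variance_score` from scikit-learn, on the same hypothetical weight data used above; both come out close to 1 because the predictions are close:

```python
from sklearn.metrics import r2_score, explained_variance_score

# Hypothetical true and predicted weights in pounds.
y_true = [150, 172, 199, 210]
y_pred = [147, 175, 201, 204]

print(r2_score(y_true, y_pred))                  # close to 1.0
print(explained_variance_score(y_true, y_pred))  # close to 1.0
```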

Error Analysis

For regression analysis, the commonly used error measures are the root mean squared error (RMSE) and R-squared (R2).

RMSE is the square root of the mean squared error between the predicted and true values. This measure is popular (it was the evaluation method of the Netflix machine learning prize) and provides a quantitative way to weigh errors.

The R2 method compares the predictions against simply using the mean. Its value usually falls in the interval (0, 1): 0 means the model is no better than making no prediction at all and just taking the mean, while 1 means every prediction matches the true value perfectly.

The exact computation of R2 differs slightly across references. The R2 function in this article is implemented following the scikit-learn documentation and agrees with the result of the clf.score function.
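A hand-rolled sketch of that definition, 1 - SSE/SSTO as given in the scikit-learn documentation, checked against `r2_score` (the data values are the illustrative ones from the scikit-learn docs, not from this course):

```python
import numpy as np
from sklearn.metrics import r2_score

def r2(y_true, y_pred):
    """R^2 following the scikit-learn definition: 1 - SSE / SSTO."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sse = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
    ssto = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    return 1.0 - sse / ssto

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
print(r2(y_true, y_pred), r2_score(y_true, y_pred))  # both ~0.9486
```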

What Is Goodness-of-Fit for a Linear Model?

Linear regression calculates an equation that minimizes the distance between the fitted line and all of the data points. Technically, ordinary least squares (OLS) regression minimizes the sum of the squared residuals.

In general, a model fits the data well if the differences between the observed values and the model's predicted values are small and unbiased.

Before you look at the statistical measures for goodness-of-fit, you should check the residual plots. Residual plots can reveal unwanted residual patterns that indicate biased results more effectively than numbers. When your residual plots pass muster, you can trust your numerical results and check the goodness-of-fit statistics.

What Is R-squared?

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.

The definition of R-squared is fairly straight-forward; it is the percentage of the response variable variation that is explained by a linear model. Or:

  • R-squared = Explained variation / Total variation

  • R-squared is always between 0 and 100%:

0% indicates that the model explains none of the variability of the response data around its mean.
100% indicates that the model explains all the variability of the response data around its mean.
In general, the higher the R-squared, the better the model fits your data. However, there are important conditions for this guideline that I’ll talk about both in this post and my next post.

The Coefficient of Determination, r-squared

Here's a plot illustrating a very weak relationship between y and x. There are two lines on the plot, a horizontal line placed at the average response, $\bar{y}$, and a shallow-sloped estimated regression line, $\hat{y}$. Note that the slope of the estimated regression line is not very steep, suggesting that as the predictor x increases, there is not much of a change in the average response y. Also, note that the data points do not "hug" the estimated regression line:

[Figure: fitted line plot of the weak y-versus-x relationship, with the regression line and the horizontal mean line]
$$SSR=\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2=119.1$$

$$SSE=\sum_{i=1}^{n}(y_i-\hat{y}_i)^2=1708.5$$

$$SSTO=\sum_{i=1}^{n}(y_i-\bar{y})^2=1827.6$$

The calculations on the right of the plot show contrasting "sums of squares" values:

  • SSR is the "regression sum of squares" and quantifies how far the estimated sloped regression line, $\hat{y}_i$, is from the horizontal "no relationship line," the sample mean or $\bar{y}$.

  • SSE is the "error sum of squares" and quantifies how much the data points, $y_i$, vary around the estimated regression line, $\hat{y}_i$.

  • SSTO is the "total sum of squares" and quantifies how much the data points, $y_i$, vary around their mean, $\bar{y}$.

Note that SSTO = SSR + SSE. The sums of squares appear to tell the story pretty well. They tell us that most of the variation in the response y (SSTO = 1827.6) is just due to random variation (SSE = 1708.5), not due to the regression of y on x (SSR = 119.1). You might notice that SSR divided by SSTO is 119.1/1827.6, or 0.065. Do you see where this quantity appears on Minitab's fitted line plot?

Contrast the above example with the following one in which the plot illustrates a fairly convincing relationship between y and x. The slope of the estimated regression line is much steeper, suggesting that as the predictor x increases, there is a fairly substantial change (decrease) in the response y. And, here, the data points do "hug" the estimated regression line:
[Figure: fitted line plot of the strong y-versus-x relationship, with the regression line and the horizontal mean line]
$$SSR=\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2=6679.3$$

$$SSE=\sum_{i=1}^{n}(y_i-\hat{y}_i)^2=1708.5$$

$$SSTO=\sum_{i=1}^{n}(y_i-\bar{y})^2=8487.8$$

The sums of squares for this data set tell a very different story, namely that most of the variation in the response y (SSTO = 8487.8) is due to the regression of y on x (SSR = 6679.3) not just due to random error (SSE = 1708.5). And, SSR divided by SSTO is 6679.3/8487.8 or 0.799, which again appears on Minitab's fitted line plot.

The previous two examples have suggested how we should define the measure formally. In short, the "coefficient of determination" or "r-squared value," denoted $r^2$, is the regression sum of squares divided by the total sum of squares. Alternatively, since SSTO = SSR + SSE, the quantity $r^2$ also equals one minus the ratio of the error sum of squares to the total sum of squares:

$$r^2=\frac{SSR}{SSTO}=1-\frac{SSE}{SSTO}$$
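The identity SSTO = SSR + SSE, and the equivalence of the two $r^2$ formulas, can be verified numerically on any ordinary least squares fit with an intercept. The data below is randomly generated for illustration (it is not the Minitab example):

```python
import numpy as np

# Illustrative data: a noisy linear relationship, fit by ordinary least squares.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 3.0 * x + 5.0 + rng.normal(scale=2.0, size=x.size)

slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

ssr = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
sse = np.sum((y - y_hat) ** 2)         # error sum of squares
ssto = np.sum((y - y.mean()) ** 2)     # total sum of squares

print(np.isclose(ssto, ssr + sse))             # True: SSTO = SSR + SSE
print(np.isclose(ssr / ssto, 1 - sse / ssto))  # True: the two r^2 formulas agree
```

The decomposition holds exactly for OLS with an intercept because the residuals are orthogonal to the fitted values and sum to zero.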

Here are some basic characteristics of the measure:

  • Since $r^2$ is a proportion, it is always a number between 0 and 1.

  • If $r^2$ = 1, all of the data points fall perfectly on the regression line. The predictor x accounts for all of the variation in y!

  • If $r^2$ = 0, the estimated regression line is perfectly horizontal. The predictor x accounts for none of the variation in y!

We've learned the interpretation for the two easy cases, when $r^2 = 0$ or $r^2 = 1$, but how do we interpret $r^2$ when it is some number between 0 and 1, like 0.23 or 0.57, say? Here are two similar, yet slightly different, ways in which the coefficient of determination $r^2$ can be interpreted. We say either:

$r^2$ ×100 percent of the variation in y is reduced by taking into account predictor x

or:

$r^2$ ×100 percent of the variation in y is 'explained by' the variation in predictor x.

Many statisticians prefer the first interpretation. I tend to favor the second. The risk with using the second interpretation — and hence why 'explained by' appears in quotes — is that it can be misunderstood as suggesting that the predictor x causes the change in the response y. Association is not causation. That is, just because a data set is characterized by having a large r-squared value, it does not imply that x causes the changes in y. As long as you keep the correct meaning in mind, it is fine to use the second interpretation. A variation on the second interpretation is to say, "$r^2$ ×100 percent of the variation in y is accounted for by the variation in predictor x."

Students often ask: "what's considered a large r-squared value?" It depends on the research area. Social scientists, who are often trying to learn something about the huge variation in human behavior, will tend to find it very hard to get r-squared values much above, say, 25% or 30%. Engineers, on the other hand, who tend to study more exact systems, would likely find an r-squared value of just 30% simply unacceptable. The moral of the story is to read the literature to learn what typical r-squared values are for your research area!

Key Limitations of R-squared

R-squared cannot determine whether the coefficient estimates and predictions are biased, which is why you must assess the residual plots.

R-squared does not indicate whether a regression model is adequate. You can have a low R-squared value for a good model, or a high R-squared value for a model that does not fit the data!

The R-squared in your output is a biased estimate of the population R-squared.
