Regression Analysis

Original article: 2016-09-28, by IBM intern 郝建勇, IBM data scientist

Overview

Regression analysis is a widely used statistical method for quantifying the dependence between two or more variables. Put simply, it fits an equation relating a set of influencing factors to an outcome; that equation can then be applied to other events of the same kind to make predictions. By the number of independent variables involved, regression analysis is divided into simple (univariate) regression and multiple regression; by the type of relationship between the independent and dependent variables, it is divided into linear regression and nonlinear regression. This article starts from the basic concepts of regression analysis, introduces its underlying principles and solution methods, and works through an example in Python to give readers a more intuitive understanding.

1. What Regression Analysis Studies

Regression analysis applies statistical methods to organize, analyze, and study large amounts of observational data in order to draw conclusions that reflect the underlying regularities of a phenomenon. These conclusions are then used to predict the outcomes of similar events. Regression analysis is applied very widely, in fields such as psychology, medicine, and economics.

2. Basic Concepts of Regression Analysis

[Figure: basic concepts of regression analysis (original screenshot)]
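The original figure is available only as a screenshot, so the following is a standard formulation rather than the author's exact notation: a regression model relates a dependent (response) variable y to one or more independent (explanatory) variables plus a random error term,

$$y = f(x_1, x_2, \ldots, x_p) + \varepsilon,$$

where f is the regression function to be estimated from observed data and $\varepsilon$ is a zero-mean random error. Linear regression assumes f is linear in its unknown coefficients.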

3. Simple (Univariate) Linear Regression

[Figures: simple linear regression (original screenshots)]
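The derivation in the original figures is not reproduced here; a standard statement of the simple linear regression model and its least-squares solution is

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \ldots, n,$$

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x},$$

where the estimates minimize the sum of squared errors $\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2$.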

4. Multiple Linear Regression

[Figures: multiple linear regression (original screenshots)]
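Again in place of the original figures, the standard matrix formulation: with n observations and p predictors, the model and its least-squares solution are

$$\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad \hat{\boldsymbol{\beta}} = (X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}\mathbf{y},$$

where X is the $n \times (p+1)$ design matrix whose first column is all ones (for the intercept), assuming $X^{\mathsf{T}}X$ is invertible.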

5. Univariate Polynomial Regression

[Figure: univariate polynomial regression (original screenshot)]
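In place of the original figure, the univariate polynomial model of degree k is

$$y = a_0 + a_1 x + a_2 x^2 + \cdots + a_k x^k + \varepsilon.$$

Substituting $x_j = x^j$ turns this into a multiple linear regression in $x_1, \ldots, x_k$, so it can be solved with the same least-squares machinery as in section 4.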

6. Multivariate Polynomial Regression

[Figure: multivariate polynomial regression (original screenshot)]
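In place of the original figure, a second-degree polynomial in two variables illustrates the general form:

$$y = a_0 + a_1 x_1 + a_2 x_2 + a_3 x_1^2 + a_4 x_1 x_2 + a_5 x_2^2 + \varepsilon.$$

As in the univariate case, each polynomial term can be treated as a separate feature, which again reduces the problem to multiple linear regression.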

7. A Polynomial Regression Example in Python

Doing polynomial regression in Python is very convenient. If we want to write the model ourselves, we can follow the methods and formulas introduced above and then train and predict. It is worth noting that the many matrix operations in those formulas can be implemented with the NumPy library, so implementing the polynomial regression model above is actually fairly simple. NumPy is a scientific computing library for Python, an open-source numerical extension of the language; it stores and processes large matrices efficiently. One way to look at it is that NumPy turns Python into a free and more powerful alternative to MATLAB.

Back to the example: the demonstration here does not implement the model from scratch, but instead uses the linear model from scikit-learn.
The experimental data work as follows. When the training data are given as a text file, each line of the file is one sample: the last column is the y value (the dependent variable) and the preceding columns are the independent variables. If the training data have only two columns (one independent variable and one dependent variable), the linear model yields a univariate polynomial regression equation; otherwise it yields a multivariate one. When the training data are passed in as Python lists, the inputs take a form such as [[1, 2], [3, 4], [5, 6], [7, 8]] and the corresponding targets a form such as [3, 7, 8, 9]. Because the two data sources differ, the way the model is trained also differs slightly.

[Figure: sample of the experimental data (original screenshot)]

First, load the text data into the format the linear model needs:

[Figure: original code for loading the data]
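The original loading code is shown only as a screenshot. A minimal sketch of that step with NumPy, assuming a whitespace-separated file laid out as described above (the file name train.txt is a placeholder, not from the original post):

```python
import numpy as np

def load_training_data(path):
    """Load whitespace-separated training data from a text file.

    Each row is one sample; the last column is the target y and the
    preceding columns are the features, as described above.
    """
    data = np.loadtxt(path)    # shape: (n_samples, n_columns)
    X = data[:, :-1]           # every column except the last -> features
    y = data[:, -1]            # last column -> target
    return X, y

# Hypothetical usage:
# X_train, y_train = load_training_data("train.txt")
```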

Next, train the model:

[Figure: original code for training the model]
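The exact training code is likewise not reproduced. A common way to fit a polynomial with scikit-learn's linear model is to expand the inputs with PolynomialFeatures and then fit LinearRegression; the degree used here is an arbitrary choice for illustration:

```python
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

def train_polynomial_model(X, y, degree=2):
    """Fit a polynomial regression model of the given degree."""
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_poly = poly.fit_transform(X)   # expand inputs into polynomial terms
    model = LinearRegression()       # ordinary least squares on the expanded terms
    model.fit(X_poly, y)
    return model, poly
```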

The code to print the regression equation is as follows:

[Figure: original code for printing the regression equation]
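Again as a sketch rather than the original code, the fitted equation can be printed from the model's intercept_ and coef_ attributes together with the names of the polynomial terms:

```python
def print_equation(model, poly):
    """Print the fitted regression equation in a readable form."""
    terms = poly.get_feature_names_out()   # e.g. ['x0', 'x1', 'x0^2', ...]; scikit-learn >= 1.0
    pieces = [f"{coef:+.4f}*{name}" for coef, name in zip(model.coef_, terms)]
    print("y =", f"{model.intercept_:.4f}", " ".join(pieces))
```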

The trained model can also be used to predict values for new input data, as follows:

[Figure: original code for predicting with the trained model]
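A minimal prediction helper, assuming the polynomial feature transformer produced during training is reused on the new inputs:

```python
def predict_values(model, poly, X_new):
    """Predict targets for new samples with a previously trained model."""
    return model.predict(poly.transform(X_new))   # apply the same feature expansion
```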

Calling the test procedure:

[Figures: original code for the test procedure]
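The test driver is also only a screenshot in the original post; putting the sketches above together, a run on the list-form data quoted earlier in this section might look like this:

```python
# Training data in list form, as described above.
X_train = [[1, 2], [3, 4], [5, 6], [7, 8]]
y_train = [3, 7, 8, 9]

model, poly = train_polynomial_model(X_train, y_train, degree=2)
print_equation(model, poly)
print(predict_values(model, poly, [[2, 3]]))   # a new, made-up sample
```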

Sample test output:

[Figure: sample test output (original screenshot)]

Why choose SSE as the loss function?

$$\text{minimize} \sum_{\text{all training points}}(\text{actual}-\text{predicted})\qquad$$ positive and negative errors cancel each other out

$$\text{minimize} \sum_{\text{all training points}}\lvert\text{actual}-\text{predicted}\rvert\qquad$$ the absolute value is not differentiable everywhere, which makes it awkward to optimize

$$\text{minimize} \sum_{\text{all training points}}(\text{actual}-\text{predicted})^{2}\qquad$$ the sum of squared errors (SSE) avoids both problems

Drawbacks of SSE

The value of SSE grows in proportion to the amount of data, so it does not by itself reflect how well the regression fits.

If we want to compare regression performance on two different data sets, we need to use the R-squared score.

What Is R-squared?

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.

The definition of R-squared is fairly straight-forward; it is the percentage of the response variable variation that is explained by a linear model. Or:

  • R-squared = Explained variation / Total variation

  • R-squared is always between 0 and 100%:

0% indicates that the model explains none of the variability of the response data around its mean.
100% indicates that the model explains all the variability of the response data around its mean.
In general, the higher the R-squared, the better the model fits your data. However, there are important conditions for this guideline that I’ll talk about both in this post and my next post.

The Coefficient of Determination, r-squared

Here's a plot illustrating a very weak relationship between y and x. There are two lines on the plot, a horizontal line placed at the average response, $\bar{y}$, and a shallow-sloped estimated regression line, $\hat{y}$. Note that the slope of the estimated regression line is not very steep, suggesting that as the predictor x increases, there is not much of a change in the average response y. Also, note that the data points do not "hug" the estimated regression line:

[Figure: fitted line plot for the weak-relationship data]
$$SSR=\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2=119.1$$

$$SSE=\sum_{i=1}^{n}(y_i-\hat{y}_i)^2=1708.5$$

$$SSTO=\sum_{i=1}^{n}(y_i-\bar{y})^2=1827.6$$

The calculations on the right of the plot show contrasting "sums of squares" values:

  • SSR is the "regression sum of squares" and quantifies how far the estimated sloped regression line, $\hat{y}_i$, is from the horizontal "no relationship line," the sample mean or $\bar{y}$.

  • SSE is the "error sum of squares" and quantifies how much the data points, $y_i$, vary around the estimated regression line, $\hat{y}_i$.

  • SSTO is the "total sum of squares" and quantifies how much the data points, $y_i$, vary around their mean, $\bar{y}$.

Note that SSTO = SSR + SSE. The sums of squares appear to tell the story pretty well. They tell us that most of the variation in the response y (SSTO = 1827.6) is just due to random variation (SSE = 1708.5), not due to the regression of y on x (SSR = 119.1). You might notice that SSR divided by SSTO is 119.1/1827.6 or 0.065. Do you see where this quantity appears on Minitab's fitted line plot?

Contrast the above example with the following one in which the plot illustrates a fairly convincing relationship between y and x. The slope of the estimated regression line is much steeper, suggesting that as the predictor x increases, there is a fairly substantial change (decrease) in the response y. And, here, the data points do "hug" the estimated regression line:
[Figure: fitted line plot for the strong-relationship data]
$$SSR=\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2=6679.3$$

$$SSE=\sum_{i=1}^{n}(y_i-\hat{y}_i)^2=1708.5$$

$$SSTO=\sum_{i=1}^{n}(y_i-\bar{y})^2=8487.8$$

The sums of squares for this data set tell a very different story, namely that most of the variation in the response y (SSTO = 8487.8) is due to the regression of y on x (SSR = 6679.3) not just due to random error (SSE = 1708.5). And, SSR divided by SSTO is 6679.3/8487.8 or 0.799, which again appears on Minitab's fitted line plot.

The previous two examples have suggested how we should define the measure formally. In short, the "coefficient of determination" or "r-squared value," denoted $r^2$, is the regression sum of squares divided by the total sum of squares. Alternatively, since SSTO = SSR + SSE, the quantity $r^2$ also equals one minus the ratio of the error sum of squares to the total sum of squares:

$$r^2=\frac{SSR}{SSTO}=1-\frac{SSE}{SSTO}$$
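As a quick check of this formula, a small Python sketch (with made-up fitted values, not data from the plots above) computes $r^2$ as 1 - SSE/SSTO and confirms that it matches scikit-learn's r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up observed and fitted values, for illustration only.
y     = np.array([3.0, 7.0, 8.0, 9.0])   # observed responses
y_hat = np.array([3.5, 6.5, 8.2, 8.8])   # fitted values from some regression

sse  = np.sum((y - y_hat) ** 2)          # error sum of squares
ssto = np.sum((y - y.mean()) ** 2)       # total sum of squares

print(1 - sse / ssto)                    # r^2 computed as 1 - SSE/SSTO
print(r2_score(y, y_hat))                # the same value from scikit-learn
```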

Here are some basic characteristics of the measure:

  • Since $r^2$ is a proportion, it is always a number between 0 and 1.

  • If $r^2$ = 1, all of the data points fall perfectly on the regression line. The predictor x accounts for all of the variation in y!

  • If $r^2$ = 0, the estimated regression line is perfectly horizontal. The predictor x accounts for none of the variation in y!

We've learned the interpretation for the two easy cases — when $r^2$ = 0 or $r^2$ = 1 — but, how do we interpret $r^2$ when it is some number between 0 and 1, like 0.23 or 0.57, say? Here are two similar, yet slightly different, ways in which the coefficient of determination $r^2$ can be interpreted. We say either:

$r^2$ ×100 percent of the variation in y is reduced by taking into account predictor x

or:

$r^2$ ×100 percent of the variation in y is 'explained by' the variation in predictor x.

Many statisticians prefer the first interpretation. I tend to favor the second. The risk with using the second interpretation — and hence why 'explained by' appears in quotes — is that it can be misunderstood as suggesting that the predictor x causes the change in the response y. Association is not causation. That is, just because a data set is characterized by having a large r-squared value, it does not imply that x causes the changes in y. As long as you keep the correct meaning in mind, it is fine to use the second interpretation. A variation on the second interpretation is to say, "$r^2$ ×100 percent of the variation in y is accounted for by the variation in predictor x."

Students often ask: "what's considered a large r-squared value?" It depends on the research area. Social scientists who are often trying to learn something about the huge variation in human behavior will tend to find it very hard to get r-squared values much above, say 25% or 30%. Engineers, on the other hand, who tend to study more exact systems would likely find an r-squared value of just 30% merely unacceptable. The moral of the story is to read the literature to learn what typical r-squared values are for your research area!

Key Limitations of R-squared

R-squared cannot determine whether the coefficient estimates and predictions are biased, which is why you must assess the residual plots.

R-squared does not indicate whether a regression model is adequate. You can have a low R-squared value for a good model, or a high R-squared value for a model that does not fit the data!

The R-squared in your output is a biased estimate of the population R-squared.
