Translated from: http://news.csdn.net/article_preview.html?preview=1&reload=1&arcid=2825492html
[Editor's note] Regression analysis is an important tool for modeling and analyzing data. This article explains what regression analysis is and why it is useful, summarizes the seven most commonly used regression techniques you should master (linear regression, logistic regression, polynomial regression, stepwise regression, ridge regression, lasso regression, and ElasticNet regression) together with their key elements, and closes with the key factors for choosing the right regression model.
Regression analysis is a predictive modeling technique that investigates the relationship between a dependent (target) variable and independent (predictor) variables. It is commonly used for forecasting, time-series modeling, and finding causal relationships between variables. For example, the relationship between rash driving and the number of road accidents caused by a driver is best studied through regression.
Regression analysis is an important tool for modeling and analyzing data. Here, we fit a curve or line to the data points in such a way that the distances between the data points and the curve or line are minimized. I will explain this in more detail in the coming sections.
As mentioned above, regression analysis estimates the relationship between two or more variables. Let's understand this with a simple example:
Suppose you want to estimate a company's sales growth under current economic conditions. You have the company's recent data, which indicates that sales growth is around 2.5 times the growth of the economy. Using this insight, we can predict the company's future sales based on current and past information.
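As a rough illustration of this idea, here is a minimal sketch (the numbers are made up for illustration; numpy is assumed) that fits a straight line to hypothetical economic-growth and sales-growth figures and uses it to predict future sales growth:

```python
import numpy as np

# Hypothetical historical data (illustrative only):
# economic growth (%) and the company's sales growth (%)
econ_growth = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
sales_growth = np.array([2.4, 3.8, 5.1, 6.2, 7.6])   # roughly 2.5x the economy

# Fit a straight line: sales_growth ~ a + b * econ_growth
b, a = np.polyfit(econ_growth, sales_growth, 1)
print(f"intercept a = {a:.2f}, slope b = {b:.2f}")     # slope should come out close to 2.5

# Predict sales growth if the economy is expected to grow 2.2% next year
print("predicted sales growth:", a + b * 2.2)
```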
There are multiple benefits of using regression analysis. They are as follows:
Regression analysis also allows us to compare the effects of variables measured on different scales, such as the effect of price changes versus the number of promotional activities. These benefits help market researchers, data analysts, and data scientists to eliminate and evaluate the best set of variables for building predictive models.
There are various kinds of regression techniques available for making predictions. These techniques are mostly driven by three metrics: the number of independent variables, the type of dependent variable, and the shape of the regression line. We will discuss them in detail in the following sections.
For the creative ones among you, you can even cook up a new regression model if you feel the need to use a combination of the parameters above that nobody has used before. But before you do that, let us understand the most commonly used regression methods:
Linear regression is one of the most widely known modeling techniques and is usually among the first techniques people pick up when learning predictive modeling. In this technique, the dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the nature of the regression line is linear.
Linear regression establishes a relationship between the dependent variable (Y) and one or more independent variables (X) using a best-fit straight line (also known as the regression line).
It is represented by the equation Y = a + b*X + e, where a is the intercept, b is the slope of the line, and e is the error term. This equation can be used to predict the value of the target variable from the given predictor variable(s).
The difference between simple linear regression and multiple linear regression is that multiple linear regression has more than one independent variable, whereas simple linear regression has only one. The question now is: how do we obtain the best-fit line?
How do we obtain the best-fit line (the values of a and b)?
This task can be accomplished easily with the least squares method, the most common method for fitting a regression line. It calculates the best-fit line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line. Because the deviations are squared before being added up, positive and negative values do not cancel out.
We can evaluate model performance using the R-square metric. To learn more about these metrics, you can read: Model Performance Metrics Part 1, Part 2.
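A minimal sketch of the idea (scikit-learn assumed; the data is synthetic and for illustration only) that fits an ordinary least squares line and reports R-square:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: one predictor X and a continuous target y
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3.0 + 2.5 * X[:, 0] + rng.normal(0, 1.0, size=50)   # Y = a + b*X + e

model = LinearRegression().fit(X, y)                     # least squares fit
print("intercept a:", model.intercept_)
print("slope b:", model.coef_[0])
print("R-square:", model.score(X, y))                    # goodness of fit on the data
```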
Key points:
Logistic regression is used to find the probability of event = Success and event = Failure. We should use logistic regression when the dependent variable is binary (0/1, True/False, Yes/No) in nature. Here the value of Y ranges from 0 to 1, and it can be represented by the following equation:
odds = p / (1-p) = probability of event occurrence / probability of event not occurring
ln(odds) = ln(p / (1-p))
logit(p) = ln(p / (1-p)) = b0 + b1X1 + b2X2 + b3X3 + ... + bkXk
Above, p is the probability of having the characteristic of interest. A question you should ask here is: why do we use the logarithm in the equation?
Since we are working with a binomial distribution (for the dependent variable), we need to choose the link function best suited to this distribution, and that is the logit function. In the equation above, the parameters are chosen to maximize the likelihood of observing the sample values, rather than to minimize the sum of squared errors (as in ordinary regression).
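A minimal sketch of fitting such a model (scikit-learn assumed; the binary data is synthetic and generated with a logistic link purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary outcome: the probability of "success" rises with x
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
p = 1 / (1 + np.exp(-(0.5 + 2.0 * X[:, 0])))    # logistic link: p = 1/(1+exp(-(b0+b1*x)))
y = rng.binomial(1, p)                           # 0/1 outcomes

clf = LogisticRegression().fit(X, y)             # fitted by maximum likelihood
print("b0 (intercept):", clf.intercept_[0])
print("b1 (slope):", clf.coef_[0, 0])
print("P(success | x = 1.0):", clf.predict_proba([[1.0]])[0, 1])
```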
Key points:
A regression equation is a polynomial regression equation if the power of an independent variable is greater than 1, as in the equation below:
y=a+b*x^2
In this regression technique, the best-fit line is not a straight line but rather a curve that fits the data points.
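An illustrative sketch (numpy assumed; the data is synthetic) of fitting a curved relationship of this kind with a degree-2 polynomial fit:

```python
import numpy as np

# Synthetic data following a curved relationship y = a + b*x^2 (plus noise)
rng = np.random.default_rng(2)
x = np.linspace(-3, 3, 40)
y = 1.0 + 0.8 * x**2 + rng.normal(0, 0.3, size=x.size)

# Fit a degree-2 polynomial; polyfit returns coefficients from the highest power down
coeffs = np.polyfit(x, y, deg=2)
print("fitted coefficients (x^2, x, constant):", coeffs)

# Predictions follow the fitted curve, not a straight line
x_new = np.array([0.5, 1.5, 2.5])
print("predictions:", np.polyval(coeffs, x_new))
```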
Key points:
Stepwise regression is used when we deal with multiple independent variables. In this technique, the selection of independent variables is done through an automatic process that involves no human intervention.
This is achieved by observing statistical values such as R-square, t-stats, and the AIC metric to identify significant variables. Stepwise regression fits the model by adding or dropping covariates one at a time based on a specified criterion. Some of the most commonly used stepwise regression methods are listed below:
The aim of this modeling technique is to maximize prediction power with the minimum number of predictor variables. It is one of the methods for handling higher-dimensional data sets.
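A minimal sketch of one common variant, forward selection driven by AIC (statsmodels and pandas assumed; the data and column names are made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data frame with several candidate predictors and a target y
rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["x1", "x2", "x3", "x4"])
y = 2 + 3 * df["x1"] - 1.5 * df["x3"] + rng.normal(size=100)

def forward_select(df, y):
    """Greedily add the predictor that most improves (lowers) the AIC."""
    selected, remaining = [], list(df.columns)
    best_aic = sm.OLS(y, np.ones(len(y))).fit().aic        # intercept-only model
    while remaining:
        scores = {c: sm.OLS(y, sm.add_constant(df[selected + [c]])).fit().aic
                  for c in remaining}
        candidate = min(scores, key=scores.get)
        if scores[candidate] >= best_aic:                   # no improvement: stop
            break
        best_aic = scores[candidate]
        selected.append(candidate)
        remaining.remove(candidate)
    return selected

print("selected predictors:", forward_select(df, y))        # likely ['x1', 'x3'] here
```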
Ridge regression is a technique used when the data suffers from multicollinearity (the independent variables are highly correlated). With multicollinearity, even though the least squares (OLS) estimates are unbiased, their variances are large, which pushes the estimates far from the true values. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors.
We saw the equation for linear regression above. Remember? It can be written as:
y = a + b*x
This equation also has an error term. The complete equation becomes:
y = a + b*x + e (error term) [the error term is the value needed to correct for the prediction error between the observed and predicted values]
=> y = a + b1*x1 + b2*x2 + ... + e, for multiple independent variables.
In a linear equation, the prediction error can be decomposed into two subcomponents: one due to bias and one due to variance. Prediction error can arise from either of these components or from both. Here we will discuss the error caused by variance.
Ridge regression solves the multicollinearity problem through a shrinkage parameter λ (lambda). Look at the formula below:
Σ(y - ŷ)^2 + λ * Σβ^2
In this formula there are two components. The first is the least squares term, and the other is λ times the sum of β^2 (beta squared), where β is the coefficient. The second term is added to the least squares term in order to shrink the parameters so that they have very low variance.
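A minimal sketch of ridge regression (scikit-learn assumed; its alpha parameter plays the role of λ, and the data is synthetic and deliberately collinear):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Two highly correlated predictors (multicollinearity) plus noise
rng = np.random.default_rng(4)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)         # nearly identical to x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)                  # alpha is the shrinkage parameter λ

print("OLS coefficients:  ", ols.coef_)             # unstable, can be wildly large
print("Ridge coefficients:", ridge.coef_)           # shrunk toward smaller, stabler values
```

Increasing alpha shrinks the coefficients more aggressively, trading a little bias for lower variance.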
Key points:
Similar to ridge regression, Lasso (Least Absolute Shrinkage and Selection Operator) also penalizes the absolute size of the regression coefficients. In addition, it is capable of reducing variability and improving the accuracy of linear regression models. Look at the formula below:
Σ(y - ŷ)^2 + λ * Σ|β|
Lasso regression differs from ridge regression in that the penalty function uses absolute values instead of squares. This penalizes (or, equivalently, constrains the sum of the absolute values of) the estimates, which causes some of the parameter estimates to turn out exactly zero. The larger the penalty applied, the further the estimates are shrunk towards zero. This results in variable selection out of the given n variables.
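A minimal sketch (scikit-learn assumed; synthetic data) showing how lasso drives some coefficients exactly to zero and thereby selects variables:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Ten candidate predictors, but only two of them actually matter
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))
y = 4 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)                  # alpha is the L1 penalty strength
print("coefficients:", np.round(lasso.coef_, 2))    # most of the other entries end up exactly 0.0
print("selected predictors:", np.flatnonzero(lasso.coef_))
```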
Key points:
· If a group of predictors is highly correlated, lasso picks only one of them and shrinks the others to zero.
ElasticNet is a hybrid of the lasso and ridge regression techniques. It is trained with both L1 and L2 regularization. Elastic-net is useful when there are multiple features that are correlated with one another: lasso is likely to pick one of them at random, while elastic-net is likely to pick both.
A practical advantage of trading off between lasso and ridge is that it allows elastic-net to inherit some of ridge's stability under rotation.
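A minimal sketch (scikit-learn assumed; the l1_ratio parameter controls the mix of L1 and L2 penalties, and the data is synthetic with deliberately correlated columns):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Two groups of strongly correlated predictors
rng = np.random.default_rng(6)
base = rng.normal(size=(200, 2))
X = np.column_stack([base[:, 0], base[:, 0] + 0.01 * rng.normal(size=200),
                     base[:, 1], base[:, 1] + 0.01 * rng.normal(size=200)])
y = 3 * base[:, 0] - 2 * base[:, 1] + rng.normal(scale=0.5, size=200)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # blend of L1 (lasso) and L2 (ridge)
print("coefficients:", np.round(enet.coef_, 2))         # correlated columns tend to share weight
```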
Key points:
Beyond these seven most commonly used regression techniques, you can also look at other models such as Bayesian, Ecological, and Robust regression.
Life is usually simple when you know only one or two techniques. One training institute I know of tells its students: if the outcome is continuous, apply linear regression; if it is binary, use logistic regression! However, the more options we have at our disposal, the more difficult it becomes to choose the right one. A similar thing happens with regression models.
Among the many types of regression models, it is important to choose the technique best suited to the type of independent and dependent variables, the dimensionality of the data, and other essential characteristics of the data. Below are the key factors to consider when selecting the right regression model:
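One widely used way to compare candidate regression models on the same data is cross-validation; here is a minimal sketch (scikit-learn assumed, synthetic data for illustration) that scores linear, ridge, and lasso models side by side:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score

# Synthetic data for illustration; in practice use your own X and y
rng = np.random.default_rng(7)
X = rng.normal(size=(150, 8))
y = X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.5, size=150)

candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")   # 5-fold cross-validation
    print(f"{name:>6}: mean R-square = {scores.mean():.3f}")
```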
By now, I hope you have an overview of regression. These regression techniques should be applied with the conditions of the data in mind. One of the best tricks for finding out which technique to use is to check the family of the variables, i.e. discrete or continuous.
In this article, I discussed 7 types of regression and some key facts associated with each technique. If you are new to this industry, I'd advise you to learn these techniques and later implement them in your models.
From: http://www.analyticsvidhya.com/blog/2015/08/comprehensive-guide-regression/