The covariance between two random variables is a statistical measure of the degree to which the two variables move together.
The covariance captures the linear relationship between two variables. A positive covariance indicates that the variables tend to move together; a negative covariance indicates that the variables tend to move in opposite directions.
The sample covariance is calculated as:
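The usual textbook definition divides the sum of cross-deviations from the sample means by n-1. A minimal sketch of that definition follows; the function name and the five-point data set are illustrative assumptions, not taken from the text.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of cross-deviations from the means, divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / (n - 1)


# Illustrative data (assumed for this example)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(sample_covariance(x, y))  # 1.5
```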
The actual value of the covariance is not very meaningful because it is extremely sensitive to the scale of the two variables. Also, the covariance may range from negative to positive infinity, and it is expressed in squared units (e.g., percent squared when data are in percent). For these reasons, we take the additional step of calculating the correlation coefficient, which converts the covariance into a standardized measure that is easier to interpret.
The correlation coefficient, r, is a measure of the strength of the linear relationship (correlation) between two variables. The correlation coefficient has no unit of measurement; it is a "pure" measure of the tendency of two variables to move together.
The sample correlation coefficient for two variables, X and Y, is calculated as:
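In the standard definition, r is the sample covariance divided by the product of the two sample standard deviations. The sketch below assumes that definition; the function name and data are illustrative.

```python
import math


def sample_correlation(xs, ys):
    """r = cov(X, Y) / (s_X * s_Y), using n - 1 in every denominator."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov_xy / (s_x * s_y)


# Illustrative data (assumed for this example)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(sample_correlation(x, y), 4))  # 0.7746
```

Because the covariance is divided by both standard deviations, rescaling either variable (for example, from percent to decimals) leaves r unchanged.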
The correlation coefficient is bounded by positive and negative 1 (i.e., -1 <= r <= 1), where a correlation coefficient of +1 indicates that changes in the variables are perfectly positively correlated (i.e., they go up and down together, in lock-step). In contrast, if the correlation coefficient is -1, the changes in the variables are perfectly negatively correlated.
The interpretation of the possible correlation values is summarized in the following figure.
A scatter plot is a collection of points on a graph where each point represents the values of two variables (i.e., an X/Y pair).
Note that for r=1 and r=-1 the data points lie exactly on a line, but the slope of that line is not necessarily +1 or -1.
Outliers represent a few extreme values for sample observations. Relative to the rest of the sample data, the value of an outlier may be extraordinarily large or small. Outliers can result in apparent statistical evidence that a significant relationship exists when, in fact, there is none, or that there is no relationship when, in fact, there is a relationship.
Spurious correlation refers to the appearance of a causal linear relationship when, in fact, there is no relation. Certain data items may be highly correlated purely by chance.
Correlation measures only the linear relationship between two variables; it does not capture strong nonlinear relationships between variables.
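A small numerical illustration may help here. In the sketch below (data and function name are assumptions made for the example), Y is an exact function of X in both cases, yet the quadratic relationship produces r = 0, while the linear one produces r = 1 even though its slope is 3.

```python
import math


def pearson_r(xs, ys):
    """Sample correlation written in terms of sums of squared deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


x = [-2, -1, 0, 1, 2]
y_quadratic = [v ** 2 for v in x]   # exact but nonlinear dependence on X
y_linear = [3 * v + 1 for v in x]   # exact linear dependence, slope 3

print(pearson_r(x, y_quadratic))          # 0.0
print(round(pearson_r(x, y_linear), 6))   # 1.0
```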
The closer the correlation coefficient is to +1 or -1, the stronger the correlation. With the exception of these extremes (i.e., r=+/-1), we cannot really speak of the strength of the relationship indicated by the correlation coefficient without a statistical test of significance.
For our purposes, we want to test whether the population correlation between two variables is equal to zero.
Assuming that the two populations are normally distributed, we can use a t-test to determine whether the null hypothesis should be rejected. The test statistic is computed using the sample correlation, r, with n-2 degrees of freedom (df):
To make a decision, the calculated test statistic is compared with the critical t-value for the appropriate degrees of freedom and level of significance. Bearing in mind that we are conducting a two-tailed test, the decision rule can be stated as:
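The standard form of this test statistic is t = r * sqrt(n-2) / sqrt(1 - r^2), and the two-tailed rule is to reject the null hypothesis of zero correlation when the absolute value of t exceeds the critical t-value. The sketch below assumes that form; the function names and inputs are illustrative, and the critical value is passed in by the caller (3.182 is the two-tailed 5% value for 3 degrees of freedom from a standard t-table).

```python
import math


def correlation_t_stat(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom.

    Undefined for r = +1 or r = -1 (division by zero).
    """
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


def reject_zero_correlation(r, n, t_critical):
    """Two-tailed rule: reject H0 (population correlation = 0) if |t| > t_critical."""
    return abs(correlation_t_stat(r, n)) > t_critical


# Illustrative inputs: r = 0.7746 from n = 5 observations, so df = 3.
t = correlation_t_stat(0.7746, 5)
print(round(t, 2))                                # 2.12
print(reject_zero_correlation(0.7746, 5, 3.182))  # False
```

With so few observations, even a sample correlation of about 0.77 is not statistically different from zero at the 5% level.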
The purpose of simple linear regression is to explain the variation in a dependent variable in terms of the variation in a single independent variable. Here, the term "variation" is interpreted as the degree to which a variable differs from its mean value. Don't confuse variation with variance -- they are related but are not the same.
Linear regression requires a number of assumptions. As indicated in the following list, most of the major assumptions pertain to the regression model's residual term ε.
The variance of the residual term is constant for all observations.
The residual term is independently distributed; that is, the residual for one observation is not correlated with that of another observation.
The residual term is normally distributed.
The following linear regression model is used to describe the relationship between two variables, X and Y:
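Using b0 for the intercept, b1 for the slope, and ε for the residual term, as elsewhere in these notes, the model is conventionally written as follows. The observation index i and the sample size n are assumed notation.

```latex
Y_i = b_0 + b_1 X_i + \varepsilon_i, \qquad i = 1, \dots, n
```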
Based on the regression model stated previously, the regression process estimates an equation for a line through a scatter plot of the data that "best" explains the observed values for Y in terms of the observed values for X.
The linear equation, often called the line of the best fit, or regression line, takes the following form:
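With hats marking estimated quantities (an assumed but common convention), the fitted line for coefficients b0 and b1 is usually written as:

```latex
\hat{Y}_i = \hat{b}_0 + \hat{b}_1 X_i
```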
The regression line is just one of the many possible lines that can be drawn through the scatter plot of X and Y. In fact, the criterion used to estimate this line forms the very essence of linear regression. The regression line is the line for which the estimates of b0 and b1 are such that the sum of the squared differences (vertical distances) between the Y-values predicted by the regression equation and the actual Y-values is minimized. The sum of the squared vertical distances between the estimated and the actual Y-values is referred to as the sum of squared errors (SSE).
Thus, the regression line is the line that minimizes the SSE. This explains why simple linear regression is frequently referred to as ordinary least squares (OLS) regression, and the values estimated by the estimated regression equation are called least squares estimates.
The estimated slope coefficient for the regression line describes the change in Y for a one-unit change in X. It can be positive, negative, or zero, depending on the relationship between the regression variables. The slope term is calculated as:
The intercept term is the line's intersection with the Y-axis at X=0. It can be positive, negative, or zero. A property of the least squares method is that the intercept term may be expressed as:
The intercept equation highlights the fact that the regression line passes through a point with coordinates equal to the mean of the independent and dependent variables.
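In the standard least squares solution, the slope estimate is the covariance of X and Y divided by the variance of X, and the intercept estimate is the mean of Y minus the slope estimate times the mean of X. A minimal sketch under those formulas, with an assumed function name and illustrative data:

```python
def fit_simple_ols(xs, ys):
    """Return (b0_hat, b1_hat) for the least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b1_hat = s_xy / s_xx               # cov(X, Y) / var(X); the (n - 1) terms cancel
    b0_hat = mean_y - b1_hat * mean_x  # forces the line through the point of means
    return b0_hat, b1_hat


# Illustrative data (assumed for this example); the means are 3 and 4.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0_hat, b1_hat = fit_simple_ols(x, y)
print(round(b0_hat, 4), round(b1_hat, 4))  # 2.2 0.6

# Evaluating the fitted line at the mean of X returns the mean of Y.
print(round(b0_hat + b1_hat * 3, 4))       # 4.0
```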
Keep in mind that any conclusion regarding the importance of an independent variable in explaining a dependent variable requires determining the statistical significance of the slope coefficient. Simply looking at the magnitude of the slope coefficient does not address the issue of the importance of the variable. A hypothesis test must be conducted, or a confidence interval must be formed, to assess the importance of the variable.
The standard error of estimate (SEE) measures the degree of variability of the actual Y-values relative to the estimated Y-values from a regression equation. The SEE gauges the "fit" of the regression line. The smaller the standard error, the better the fit.
The SEE is the standard deviation of the error terms in the regression. As such, SEE is also referred to as the standard error of the residual, or standard error of the regression.
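For a regression with one independent variable, the SEE is usually computed as the square root of SSE divided by n-2. The sketch below assumes that formula; the function name is hypothetical, and the coefficients 2.2 and 0.6 are the least squares estimates for the illustrative data shown.

```python
import math


def standard_error_of_estimate(xs, ys, b0_hat, b1_hat):
    """SEE = sqrt(SSE / (n - 2)) for a regression with one independent variable."""
    residuals = [y - (b0_hat + b1_hat * x) for x, y in zip(xs, ys)]
    sse = sum(e ** 2 for e in residuals)
    return math.sqrt(sse / (len(xs) - 2))


# Illustrative data (assumed for this example)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(standard_error_of_estimate(x, y, 2.2, 0.6), 4))  # 0.8944
```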
The coefficient of determination (R^2) is defined as the percentage of the total variation in the dependent variable explained by the independent variable. For example, an R^2 of 0.63 indicates that the variation of the independent variable explains 63% of the variation in the dependent variable.
For simple linear regression (i.e., one independent variable), the coefficient of determination, R^2, may be computed by simply squaring the correlation coefficient, r. In other words, R^2=r^2 for a regression with one independent variable. This approach is not appropriate when more than one independent variable is used in the regression.
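The equality of R^2 and r^2 can be checked numerically. The sketch below computes R^2 once from the fitted regression (as 1 - SSE/SST) and once by squaring the correlation coefficient; the function name and data are assumptions made for illustration.

```python
def r_squared_two_ways(xs, ys):
    """Return (R^2 from the regression, r^2 from the correlation)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)   # total variation (SST)

    b1_hat = s_xy / s_xx
    b0_hat = mean_y - b1_hat * mean_x
    sse = sum((y - (b0_hat + b1_hat * x)) ** 2 for x, y in zip(xs, ys))

    from_regression = 1 - sse / s_yy
    from_correlation = s_xy ** 2 / (s_xx * s_yy)
    return from_regression, from_correlation


# Illustrative data (assumed for this example)
a, b = r_squared_two_ways([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(a, 4), round(b, 4))  # 0.6 0.6
```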
Hypothesis testing for a regression coefficient may use the confidence interval for the coefficient being tested.
The confidence interval for the regression coefficient, b1, is calculated as:
In this expression, tc is the critical two-tailed t-value for the selected confidence level with the appropriate number of degrees of freedom, which is equal to the number of observations minus 2 (i.e., n-2).
The standard error of the regression coefficient is denoted as Sb1. It is a function of the SEE: as SEE rises, Sb1 also increases, and the confidence interval widens. This makes sense because SEE measures the variability of the data about the regression line, and the more variable the data, the less confidence there is in the regression model to estimate the coefficient.
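The interval is conventionally the slope estimate plus or minus tc times Sb1. The sketch below assumes that form, together with the standard simple-regression result that Sb1 equals SEE divided by the square root of the sum of squared deviations of X from its mean. The function name and data are illustrative, and the critical value 3.182 (two-tailed 5%, 3 degrees of freedom) is taken from a standard t-table.

```python
import math


def slope_confidence_interval(xs, ys, t_critical):
    """Return (lower, upper) for b1_hat +/- t_critical * s_b1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)

    b1_hat = s_xy / s_xx
    b0_hat = mean_y - b1_hat * mean_x
    sse = sum((y - (b0_hat + b1_hat * x)) ** 2 for x, y in zip(xs, ys))

    see = math.sqrt(sse / (n - 2))   # standard error of estimate
    s_b1 = see / math.sqrt(s_xx)     # standard error of the slope
    return b1_hat - t_critical * s_b1, b1_hat + t_critical * s_b1


# Illustrative data (assumed for this example)
low, high = slope_confidence_interval([1, 2, 3, 4, 5], [2, 4, 5, 4, 5], 3.182)
print(round(low, 2), round(high, 2))  # -0.3 1.5
```

The interval contains zero, so for this small sample the slope is not statistically different from zero at the 5% level.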
A t-test may also be used to test the hypothesis that the true slope coefficient, b1, is equal to some hypothesized value. Letting b1^ be the point estimate for b1, the appropriate test statistic with n-2 degrees of freedom is:
The decision rule for tests of significance for a regression coefficient is:
Rejection of the null means that the slope coefficient is different from the hypothesized value of b1.
To test whether an independent variable explains the variation in the dependent variable (i.e., it is statistically significant), the hypothesis that is tested is whether the true slope is zero (b1=0). The appropriate test structure for the null and alternative hypotheses is:
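In the usual setup, the null hypothesis is H0: b1 = 0 and the alternative is Ha: b1 != 0. The test statistic is the slope estimate minus the hypothesized value, divided by Sb1, and the null is rejected when the absolute value of t exceeds the critical t-value. The sketch below assumes that setup; the function names and numeric inputs are illustrative.

```python
def slope_t_stat(b1_hat, b1_hypothesized, s_b1):
    """t = (b1_hat - b1) / s_b1, with n - 2 degrees of freedom."""
    return (b1_hat - b1_hypothesized) / s_b1


def reject_null(t_stat, t_critical):
    """Two-tailed decision rule: reject H0 if |t| > t_critical."""
    return abs(t_stat) > t_critical


# Illustrative inputs: slope estimate 0.6, standard error 0.2828, n = 5 (df = 3).
# 3.182 is the two-tailed 5% critical t-value for 3 df from a standard t-table.
t = slope_t_stat(0.6, 0.0, 0.2828)
print(round(t, 2))            # 2.12
print(reject_null(t, 3.182))  # False
```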
Confidence intervals for the predicted value of a dependent variable are calculated in a manner similar to the confidence interval for the regression coefficients.
The challenge with computing a confidence interval for a predicted value is calculating sf.
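A common textbook form of the interval is the predicted Y plus or minus tc times sf, where sf^2 = SEE^2 * [1 + 1/n + (X - mean of X)^2 / ((n-1) * variance of X)]. The sketch below assumes that formula, which is not stated in the text; the function name and data are illustrative, and 3.182 is the two-tailed 5% critical t-value for 3 degrees of freedom.

```python
import math


def prediction_interval(xs, ys, x_new, t_critical):
    """Return (lower, upper) for the predicted Y at x_new."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)   # equals (n - 1) * var(X)

    b1_hat = s_xy / s_xx
    b0_hat = mean_y - b1_hat * mean_x
    sse = sum((y - (b0_hat + b1_hat * x)) ** 2 for x, y in zip(xs, ys))
    see_squared = sse / (n - 2)

    # Standard error of the forecast (assumed textbook formula)
    s_f = math.sqrt(see_squared * (1 + 1 / n + (x_new - mean_x) ** 2 / s_xx))
    y_hat = b0_hat + b1_hat * x_new
    return y_hat - t_critical * s_f, y_hat + t_critical * s_f


# Illustrative data (assumed for this example)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
low, high = prediction_interval(x, y, 3, 3.182)
print(round(low, 2), round(high, 2))  # 0.88 7.12
```

Under this formula the interval is narrowest at the mean of X and widens as the forecast point moves away from it.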
Analysis of variance (ANOVA) is a statistical procedure for analyzing the total variability of the dependent variable.
Note: total variation (SST) is not the same as variance. Variance = SST/(n-1)
Thus, total variation = explained variation + unexplained variation, or:
SST = RSS + SSE
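The decomposition can be verified numerically. The sketch below follows the convention used in these notes, where RSS is the regression (explained) sum of squares and SSE is the sum of squared errors; some other sources use RSS for the residual sum of squares. The function name and data are illustrative assumptions.

```python
def variation_components(xs, ys):
    """Return (SST, RSS, SSE) for a simple linear regression of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b1_hat = s_xy / s_xx
    b0_hat = mean_y - b1_hat * mean_x
    fitted = [b0_hat + b1_hat * x for x in xs]

    sst = sum((y - mean_y) ** 2 for y in ys)               # total variation
    rss = sum((f - mean_y) ** 2 for f in fitted)           # explained variation
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))    # unexplained variation
    return sst, rss, sse


# Illustrative data (assumed for this example)
sst, rss, sse = variation_components([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(sst, 4), round(rss, 4), round(sse, 4))  # 6.0 3.6 2.4
```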
The output of the ANOVA procedure is an ANOVA table, which is a summary of the variation in the dependent variable. ANOVA tables are included in the regression output of many statistical software packages.
A generic ANOVA table for a simple linear regression (one independent variable) is presented in the following figure.
The mean regression sum of squares (MSR) and mean squared error (MSE) are simply calculated as the appropriate sum of squares divided by its degrees of freedom.
The R^2 and the standard error of estimate (SEE) can also be calculated directly from the ANOVA table. The R^2 is the percentage of the total variation in the dependent variable explained by the independent variable:
The SEE is the standard deviation of the regression error terms and is equal to the square root of the mean squared error (MSE):
Note: SSE is the sum of the squared residuals, while SEE is the standard deviation of the residuals.
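The ANOVA quantities described above can be derived from the two sums of squares alone. The sketch below assumes the usual degrees of freedom (k for the regression and n-k-1 for the error, with k independent variables), so that MSR = RSS/k, MSE = SSE/(n-k-1), R^2 = RSS/SST, and SEE is the square root of MSE. The function name and input values are illustrative.

```python
import math


def anova_summary(rss, sse, n, k=1):
    """Derive MSR, MSE, R^2 and SEE from the regression and error sums of squares."""
    sst = rss + sse              # total variation
    msr = rss / k                # regression df = k
    mse = sse / (n - k - 1)      # error df = n - k - 1
    return {"MSR": msr, "MSE": mse, "R2": rss / sst, "SEE": math.sqrt(mse)}


# Illustrative values: RSS = 3.6 and SSE = 2.4 from n = 5 observations.
summary = anova_summary(rss=3.6, sse=2.4, n=5)
for name, value in summary.items():
    print(name, round(value, 4))
# MSR 3.6
# MSE 0.8
# R2 0.6
# SEE 0.8944
```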
An F-test assesses how well a set of independent variables, as a group, explains the variation in the dependent variable. In multiple regression, the F-statistic is used to test whether at least one independent variable in a set of independent variables explains a significant portion of the variation of the dependent variable.
The F-statistic is calculated as:
In multiple regression, the F-statistic tests all independent variables as a group.
For simple linear regression, there is only one independent variable, so the F-statistic tests the same hypothesis as the t-test for statistical significance of the slope coefficient:
To determine whether b1 is statistically significant using the F-test, the calculated F-statistic is compared with the critical F-value, Fc, at the appropriate level of significance. The degrees of freedom for the numerator and the denominator with one independent variable are:
The decision rule for the F-test is:
Rejection of the null hypothesis at a stated level of significance indicates that the slope coefficient is significantly different from zero, which is interpreted to mean that the independent variable makes a significant contribution to the explanation of the dependent variable. In simple linear regression, it tells us the same thing as the t-test of the slope coefficient. In fact, in simple linear regression with one independent variable, F = (tb1)^2, the square of the t-statistic for the slope coefficient.
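The F-test for simple linear regression can be sketched as follows, assuming the standard form F = MSR/MSE with 1 numerator and n-2 denominator degrees of freedom, and the one-tailed rule of rejecting the null hypothesis (b1 = 0) when F exceeds the critical value Fc. The function name and inputs are illustrative; 10.13 is the 5% critical F-value for (1, 3) degrees of freedom from a standard F-table.

```python
def f_statistic(rss, sse, n, k=1):
    """F = MSR / MSE = (RSS / k) / (SSE / (n - k - 1))."""
    msr = rss / k
    mse = sse / (n - k - 1)
    return msr / mse


# Illustrative values: RSS = 3.6 and SSE = 2.4 from n = 5 observations.
f = f_statistic(rss=3.6, sse=2.4, n=5)
print(round(f, 2))   # 4.5
print(f > 10.13)     # False: fail to reject H0 at the 5% level

# For the same illustrative data, the slope t-statistic is about 2.1213,
# and squaring it reproduces the F-statistic.
print(round(2.1213 ** 2, 2))  # 4.5
```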
Linear relationships can change over time. This means that the estimation equation based on data from a specific time period may not be relevant for forecasts or predictions in another time period. This is referred to as parameter instability.
Even if the regression model actually reflects the historical relationship between the two variables, its usefulness in investment analysis will be limited if other market participants are also aware of and act on this evidence.
If the assumptions underlying regression analysis do not hold, the interpretation and tests of hypotheses may not be valid.