Least Squares Estimates
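Below are the mathematical details for deriving the least-squares estimates of the slope ($\beta_1$) and intercept ($\beta_0$). We obtain the estimates $\hat{\beta}_1$ and $\hat{\beta}_0$ by finding the values that minimize the sum of squared residuals (Equation A.1).

$$SSR = \sum_{i=1}^{n} [y_i - \hat{y}_i]^2 = \sum_{i=1}^{n} [y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)]^2 \tag{A.1}$$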
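We find the values of $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize Equation A.1 by taking the partial derivatives with respect to each and setting them to 0. Because the sum of squared residuals is a convex function of $\hat{\beta}_0$ and $\hat{\beta}_1$, the values at which both partial derivatives equal 0 minimize it. The partial derivatives are

$$\frac{\partial SSR}{\partial \hat{\beta}_0} = -2 \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)$$

$$\frac{\partial SSR}{\partial \hat{\beta}_1} = -2 \sum_{i=1}^{n} x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)$$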
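Let’s begin by finding $\hat{\beta}_0$. Setting the partial derivative with respect to $\hat{\beta}_0$ equal to 0 gives

$$-2 \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 \;\Rightarrow\; \sum_{i=1}^{n} y_i - n\hat{\beta}_0 - \hat{\beta}_1 \sum_{i=1}^{n} x_i = 0 \;\Rightarrow\; \hat{\beta}_0 = \frac{1}{n}\Big(\sum_{i=1}^{n} y_i - \hat{\beta}_1 \sum_{i=1}^{n} x_i\Big)$$

Thus $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$.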
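Now, let’s find $\hat{\beta}_1$. Setting the partial derivative with respect to $\hat{\beta}_1$ equal to 0 and substituting $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$ gives

$$\sum_{i=1}^{n} x_i y_i - (\bar{y} - \hat{\beta}_1 \bar{x}) \sum_{i=1}^{n} x_i - \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = 0$$

$$\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} + n\hat{\beta}_1 \bar{x}^2 - \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = 0$$

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}$$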
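To write $\hat{\beta}_1$ in a more recognizable form, we will use the following:

$$\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = (n-1)\,Cov(x, y) \tag{A.5}$$

$$\sum_{i=1}^{n} x_i^2 - n\bar{x}^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 = (n-1)\,s_x^2 \tag{A.6}$$

where $Cov(x, y)$ is the covariance of $x$ and $y$, and $s_x^2$ is the sample variance of $x$ ($s_x$ is the sample standard deviation).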
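Thus, applying Equation A.5 and Equation A.6, we have

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{(n-1)\,Cov(x, y)}{(n-1)\,s_x^2} = \frac{Cov(x, y)}{s_x^2} \tag{A.7}$$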
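The correlation between $x$ and $y$ is $r = \frac{Cov(x, y)}{s_x s_y}$. Thus, $Cov(x, y) = r s_x s_y$, where $s_y$ is the sample standard deviation of $y$. Plugging this into Equation A.7, we have

$$\hat{\beta}_1 = \frac{Cov(x, y)}{s_x^2} = \frac{r s_x s_y}{s_x^2} = r\frac{s_y}{s_x}$$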
The least-squares estimates for the intercept and slope are

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \qquad \hat{\beta}_1 = r\frac{s_y}{s_x}$$
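As a rough numerical check of the closed-form estimates, the sketch below computes the slope as $r \, s_y / s_x$ and the intercept as $\bar{y} - \hat{\beta}_1 \bar{x}$ in plain Python. The function names `least_squares` and `sum_squared_residuals` and the data values are hypothetical, chosen only for illustration.

```python
from math import sqrt

def least_squares(x, y):
    """Return (intercept, slope) for the simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sample covariance and sample standard deviations (divide by n - 1).
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    r = cov_xy / (s_x * s_y)           # sample correlation
    slope = r * s_y / s_x              # slope = r * s_y / s_x
    intercept = y_bar - slope * x_bar  # intercept = y_bar - slope * x_bar
    return intercept, slope

def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared residuals for a candidate line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data, used only to exercise the formulas.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = least_squares(x, y)
```

Evaluating `sum_squared_residuals` at `(b0, b1)` and at nearby values is one informal way to see that the estimates sit at the minimum.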
Sum of Squares
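From Section 4.7.2, we have the following

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{A.8}$$

where $n$ is the number of observations. We will show why Equation A.8 is true mathematically by starting with the fact that

$$(y_i - \bar{y})^2 = [(\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)]^2 = (\hat{y}_i - \bar{y})^2 + 2(\hat{y}_i - \bar{y})(y_i - \hat{y}_i) + (y_i - \hat{y}_i)^2$$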
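We can sum over both sides to get

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + 2\sum_{i=1}^{n} (\hat{y}_i - \bar{y})(y_i - \hat{y}_i) + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{A.9}$$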
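For now, let’s focus on the middle term

$$2\sum_{i=1}^{n} (\hat{y}_i - \bar{y})(y_i - \hat{y}_i) = 2\sum_{i=1}^{n} \hat{y}_i (y_i - \hat{y}_i) - 2\bar{y}\sum_{i=1}^{n} (y_i - \hat{y}_i) = 0 \tag{A.10}$$

Both sums are 0 because setting the partial derivatives of the sum of squared residuals to 0 gives $\sum_{i=1}^{n} (y_i - \hat{y}_i) = 0$ and $\sum_{i=1}^{n} x_i (y_i - \hat{y}_i) = 0$, and $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ is a linear function of $x_i$.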
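Plugging Equation A.10 back into Equation A.9, we have

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + 0 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Thus

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$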
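The decomposition can also be checked numerically. The following is a minimal sketch in plain Python, using hypothetical data and variable names (`sst`, `ssm`, `ssr`, `cross_term`) that are not from the text: it fits the least-squares line, then compares the total sum of squares with the sum of the model and residual sums of squares.

```python
# Hypothetical data, used only to illustrate the identity.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept.
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# The three sums of squares and the cross (middle) term.
sst = sum((yi - y_bar) ** 2 for yi in y)                   # total
ssm = sum((yh - y_bar) ** 2 for yh in y_hat)               # model
ssr = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))      # residual
cross_term = 2 * sum((yh - y_bar) * (yi - yh) for yi, yh in zip(y, y_hat))
```

Up to floating-point rounding, `cross_term` is 0 and `sst` equals `ssm + ssr`. The identity relies on the line being the least-squares fit; for an arbitrary line the cross term is generally nonzero.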
Matrix representation of multiple linear regression
This section provides the details for the matrix notation for multiple linear regression. We assume the reader has some familiarity with linear algebra. Please see Chapter 1 of An Introduction to Statistical Learning for a brief review of linear algebra.
Introduction
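Suppose we have $n$ observations. Let the $i^{th}$ observation be $(x_{i1}, \ldots, x_{ip}, y_i)$, such that $x_{i1}, \ldots, x_{ip}$ are the explanatory variables (predictors) and $y_i$ is the response variable. We assume the data can be modeled using the least-squares regression model, such that the mean response for a given combination of explanatory variables follows the form in Equation A.11.

$$\mu_{y \mid x_1, \ldots, x_p} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p \tag{A.11}$$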
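We can write the response for the $i^{th}$ observation as shown in Equation A.12.

$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \epsilon_i \tag{A.12}$$

such that $\epsilon_i$ is the amount $y_i$ deviates from $\mu_{y \mid x_{i1}, \ldots, x_{ip}}$, the mean response for a given combination of explanatory variables. We assume each $\epsilon_i \sim N(0, \sigma^2)$, where $\sigma^2$ is a constant variance for the distribution of the response $y$ for any combination of explanatory variables $x_1, \ldots, x_p$.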
Matrix Representation for the Regression Model
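We can represent the model using matrix notation.

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$

Here $\mathbf{Y}$ is the $n \times 1$ vector of responses, $\mathbf{X}$ is the $n \times (p+1)$ matrix whose first column is all 1s and whose remaining columns hold the predictors, $\boldsymbol{\beta}$ is the $(p+1) \times 1$ vector of coefficients, and $\boldsymbol{\epsilon}$ is the $n \times 1$ vector of errors.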
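Therefore the estimated response for a given combination of explanatory variables and the associated residuals can be written as

$$\hat{\mathbf{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}} \qquad \mathbf{e} = \mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}}$$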
Estimating the Coefficients
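The least-squares model is the one that minimizes the sum of the squared residuals. Therefore, we want to find the coefficients, $\hat{\boldsymbol{\beta}}$, that minimize

$$\sum_{i=1}^{n} e_i^2 = \mathbf{e}^T\mathbf{e} = (\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}})^T(\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}})$$

where $\mathbf{e}^T$ is the transpose of the matrix $\mathbf{e}$.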
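Expanding the product, the sum of squared residuals is $\mathbf{Y}^T\mathbf{Y} - \mathbf{Y}^T\mathbf{X}\hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}^T\mathbf{X}^T\mathbf{Y} + \hat{\boldsymbol{\beta}}^T\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}}$. Note that $(\mathbf{Y}^T\mathbf{X}\hat{\boldsymbol{\beta}})^T = \hat{\boldsymbol{\beta}}^T\mathbf{X}^T\mathbf{Y}$. Since these are both constants (i.e. $1 \times 1$ vectors), $\mathbf{Y}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}}^T\mathbf{X}^T\mathbf{Y}$. Thus, the sum of squared residuals becomes

$$\mathbf{e}^T\mathbf{e} = \mathbf{Y}^T\mathbf{Y} - 2\hat{\boldsymbol{\beta}}^T\mathbf{X}^T\mathbf{Y} + \hat{\boldsymbol{\beta}}^T\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}}$$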
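Since we want to find the $\hat{\boldsymbol{\beta}}$ that minimizes the sum of squared residuals, we will find the value of $\hat{\boldsymbol{\beta}}$ such that the derivative with respect to $\hat{\boldsymbol{\beta}}$ is equal to 0.

$$\frac{\partial\, \mathbf{e}^T\mathbf{e}}{\partial \hat{\boldsymbol{\beta}}} = -2\mathbf{X}^T\mathbf{Y} + 2\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = 0 \;\Rightarrow\; \mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{Y} \;\Rightarrow\; \hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$$

Thus, the estimate of the model coefficients is $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$.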
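The matrix estimate can be sketched in a few lines of NumPy. The example below is illustrative only: the simulated data, the seed, and the names `beta_true`, `beta_hat`, and `gradient` are assumptions, not part of the text. It solves the normal equations $\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{Y}$ directly and compares the result with NumPy's general least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3

# Design matrix: a column of 1s for the intercept plus p simulated predictors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])  # hypothetical coefficients
Y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Solve the normal equations X^T X beta_hat = X^T Y.
# np.linalg.solve avoids forming the explicit inverse of X^T X.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Reference solution from NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Derivative of the sum of squared residuals, evaluated at beta_hat.
gradient = -2 * X.T @ Y + 2 * X.T @ X @ beta_hat
```

Solving the linear system rather than inverting $\mathbf{X}^T\mathbf{X}$ is a numerical convenience; the two are algebraically the same estimate when $\mathbf{X}^T\mathbf{X}$ is invertible.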
Variance-covariance matrix of the coefficients
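We will use two properties to derive the form of the variance-covariance matrix of the coefficients:
- $E[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T] = \sigma^2\mathbf{I}$
- $\hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\epsilon}$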
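First, we will show that $E[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T] = \sigma^2\mathbf{I}$.

$$E[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T] = \begin{bmatrix} E(\epsilon_1^2) & E(\epsilon_1\epsilon_2) & \dots & E(\epsilon_1\epsilon_n) \\ E(\epsilon_2\epsilon_1) & E(\epsilon_2^2) & \dots & E(\epsilon_2\epsilon_n) \\ \vdots & \vdots & \ddots & \vdots \\ E(\epsilon_n\epsilon_1) & E(\epsilon_n\epsilon_2) & \dots & E(\epsilon_n^2) \end{bmatrix}$$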
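Recall the regression assumption that the errors are Normally distributed with mean 0 and variance $\sigma^2$. Thus, $E(\epsilon_i^2) = Var(\epsilon_i) = \sigma^2$ for all $i$. Additionally, recall the regression assumption that the errors are uncorrelated, i.e. $E(\epsilon_i\epsilon_j) = Cov(\epsilon_i, \epsilon_j) = 0$ for all $i \neq j$. Using these assumptions, we can write $E[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T]$ as

$$E[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T] = \begin{bmatrix} \sigma^2 & 0 & \dots & 0 \\ 0 & \sigma^2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sigma^2 \end{bmatrix} = \sigma^2\mathbf{I}$$

where $\mathbf{I}$ is the $n \times n$ identity matrix.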
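Next, we show that $\hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\epsilon}$.
Recall that $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$ and $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$. Then,

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}) = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\epsilon} = \boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\epsilon}$$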
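Using these two properties, we derive the form of the variance-covariance matrix for the coefficients. Note that the covariance matrix is $E[(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})^T]$. Substituting $\hat{\boldsymbol{\beta}} - \boldsymbol{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\epsilon}$ and $E[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T] = \sigma^2\mathbf{I}$, and using the symmetry of $(\mathbf{X}^T\mathbf{X})^{-1}$, we have

$$\begin{aligned} E[(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})^T] &= E[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}] \\ &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T E[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T] \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} \\ &= \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} \\ &= \sigma^2(\mathbf{X}^T\mathbf{X})^{-1} \end{aligned}$$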
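One informal way to see the result $\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$ is by simulation. The sketch below is a Monte Carlo check under stated assumptions: a fixed, simulated design matrix, a known $\sigma$, and hypothetical names (`A`, `beta_hats`, `empirical`, `theoretical`). It redraws the errors many times, refits the coefficients, and compares the empirical covariance of the estimates with the theoretical matrix.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, sigma = 30, 2, 1.0

# Fixed design matrix with an intercept column; beta is a hypothetical truth.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, 2.0, -1.0])

XtX_inv = np.linalg.inv(X.T @ X)
A = XtX_inv @ X.T  # (p+1) x n matrix that maps Y to beta_hat

# Draw many error vectors and refit; each row of beta_hats is one estimate.
reps = 20000
eps = rng.normal(scale=sigma, size=(reps, n))
Y = X @ beta + eps    # shape (reps, n); X @ beta broadcasts across rows
beta_hats = Y @ A.T   # shape (reps, p + 1)

empirical = np.cov(beta_hats, rowvar=False)  # sample covariance of estimates
theoretical = sigma**2 * XtX_inv             # sigma^2 (X^T X)^{-1}
```

The two matrices agree only approximately, since `empirical` carries simulation noise that shrinks as `reps` grows. Each row also satisfies the second property exactly: `beta_hats[k] - beta` equals `A @ eps[k]`.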
Model diagnostics
Suppose we have $n$ observations. Let the $i^{th}$ observation be $(x_{i1}, \ldots, x_{ip}, y_i)$, such that $x_{i1}, \ldots, x_{ip}$ are the explanatory variables (predictors) and $y_i$ is the response variable. We assume the data can be modeled using the least-squares regression model, such that the mean response for a given combination of explanatory variables follows the form in Equation A.11.
We can write the response for the $i^{th}$ observation as shown in Equation A.12, such that $\epsilon_i$ is the amount $y_i$ deviates from $\mu_{y \mid x_{i1}, \ldots, x_{ip}}$, the mean response for a given combination of explanatory variables. We assume each $\epsilon_i \sim N(0, \sigma^2)$, where $\sigma^2$ is a constant variance for the distribution of the response $y$ for any combination of explanatory variables $x_1, \ldots, x_p$.
Hat Matrix & Leverage
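Combining $\hat{\mathbf{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$ and $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$, we can write $\hat{\mathbf{Y}}$ as the following:

$$\hat{\mathbf{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$$

We define the hat matrix as an $n \times n$ matrix of the form $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$. Thus the expression for $\hat{\mathbf{Y}}$ becomes

$$\hat{\mathbf{Y}} = \mathbf{H}\mathbf{Y}$$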
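The diagonal elements of the hat matrix are a measure of how far the predictor variables of each observation are from the means of the predictor variables. For example, $h_{ii}$ is a measure of how far the values of the predictor variables for the $i^{th}$ observation, $x_{i1}, x_{i2}, \ldots, x_{ip}$, are from the mean values of the predictor variables, $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_p$. In the case of simple linear regression, the $i^{th}$ diagonal, $h_{ii}$, can be written as

$$h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}$$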
We call these diagonal elements the leverage of each observation.
The diagonal elements of the hat matrix have the following properties:
- $\sum_{i=1}^{n} h_{ii} = p + 1$, where $p$ is the number of predictor variables in the model.
- The mean hat value is $\bar{h} = \frac{\sum_{i=1}^{n} h_{ii}}{n} = \frac{p+1}{n}$.
Using these properties, we consider a point to have high leverage if it has a leverage value that is more than 2 times the average. In other words, observations with leverage greater than $\frac{2(p+1)}{n}$ are considered to be high leverage points, i.e. outliers in the predictor variables. We are interested in flagging high leverage points, because they may have an influence on the regression coefficients.
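The leverage rule can be illustrated with a short NumPy sketch. Everything here is an assumption made for illustration: the simulated predictors, the planted extreme point in row 0, and the names `leverage`, `threshold`, and `high_leverage`. The code builds the hat matrix, reads the leverages off its diagonal, and flags observations above $2(p+1)/n$; it also checks the simple linear regression formula for $h_{ii}$ on a one-predictor design.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 2

predictors = rng.normal(size=(n, p))
predictors[0] = [6.0, -6.0]  # planted point far from the predictor means
X = np.column_stack([np.ones(n), predictors])

# Hat matrix H = X (X^T X)^{-1} X^T; its diagonal holds the leverages.
H = X @ np.linalg.solve(X.T @ X, X.T)
leverage = np.diag(H)

# Flag observations whose leverage exceeds twice the mean hat value.
threshold = 2 * (p + 1) / n
high_leverage = np.flatnonzero(leverage > threshold)

# Simple linear regression: compare diag(H) with the closed-form leverage.
x = predictors[:, 0]
X1 = np.column_stack([np.ones(n), x])
h_simple = np.diag(X1 @ np.linalg.solve(X1.T @ X1, X1.T))
h_formula = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
```

In this setup the planted observation (index 0) appears in `high_leverage`, and `leverage.sum()` equals $p + 1$ up to rounding. Forming the full $n \times n$ matrix is fine for small examples; for large $n$ one would typically compute only the diagonal.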
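When there are high leverage points in the data, the regression line will tend towards those points; therefore, one property of high leverage points is that they tend to have small residuals. We will show this by rewriting the residuals $\mathbf{e} = \mathbf{Y} - \hat{\mathbf{Y}}$ using $\hat{\mathbf{Y}} = \mathbf{H}\mathbf{Y}$.

$$\mathbf{e} = \mathbf{Y} - \hat{\mathbf{Y}} = \mathbf{Y} - \mathbf{H}\mathbf{Y} = (\mathbf{I} - \mathbf{H})\mathbf{Y}$$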
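Note that the identity matrix and hat matrix are idempotent, i.e. $\mathbf{I}\mathbf{I} = \mathbf{I}$, $\mathbf{H}\mathbf{H} = \mathbf{H}$. Thus, $(\mathbf{I} - \mathbf{H})$ is also idempotent, since $(\mathbf{I} - \mathbf{H})(\mathbf{I} - \mathbf{H}) = \mathbf{I} - 2\mathbf{H} + \mathbf{H}\mathbf{H} = \mathbf{I} - \mathbf{H}$. These matrices are also symmetric. Using these properties and $\mathbf{e} = (\mathbf{I} - \mathbf{H})\mathbf{Y}$, with the variance of $\mathbf{Y}$ estimated by $\hat{\sigma}^2\mathbf{I}$, we have that the variance-covariance matrix of the residuals, $\mathbf{e}$, is

$$Var(\mathbf{e}) = (\mathbf{I} - \mathbf{H})\,Var(\mathbf{Y})\,(\mathbf{I} - \mathbf{H})^T = \hat{\sigma}^2(\mathbf{I} - \mathbf{H})(\mathbf{I} - \mathbf{H}) = \hat{\sigma}^2(\mathbf{I} - \mathbf{H})$$

where $\hat{\sigma}^2$ is the estimated regression variance. Thus, the variance of the $i^{th}$ residual is $Var(e_i) = \hat{\sigma}^2(1 - h_{ii})$. Therefore, the higher the leverage, the smaller the variance of the residual. Because the expected value of the residuals is 0, we conclude that points with high leverage tend to have smaller residuals than points with lower leverage.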
Standardized Residuals
In general, we standardize a value by shifting by the expected value and rescaling by the standard deviation (or standard error). Thus, the $i^{th}$ standardized residual takes the form

$$std.res_i = \frac{e_i - E(e_i)}{SE(e_i)}$$
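The expected value of the residuals is 0, i.e. $E(e_i) = 0$. From the variance of the $i^{th}$ residual, $Var(e_i) = \hat{\sigma}^2(1 - h_{ii})$, the standard error of the residual is $SE(e_i) = \hat{\sigma}\sqrt{1 - h_{ii}}$. Therefore,

$$std.res_i = \frac{e_i}{\hat{\sigma}\sqrt{1 - h_{ii}}}$$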
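A minimal NumPy sketch of the standardized residuals follows. The data are simulated, and the sketch assumes the usual estimate $\hat{\sigma}^2 = \sum e_i^2 / (n - p - 1)$ for the regression variance; the names `sigma_hat` and `std_res` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 2

X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)  # hypothetical model

# Leverages from the hat matrix, residuals as e = Y - H Y.
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
e = Y - H @ Y

# Assumed estimate of the regression standard deviation.
sigma_hat = np.sqrt(e @ e / (n - p - 1))

# Standardized residuals: e_i / (sigma_hat * sqrt(1 - h_ii)).
std_res = e / (sigma_hat * np.sqrt(1 - h))
```

Because $\sqrt{1 - h_{ii}} \le 1$, each standardized residual is at least as large in magnitude as $e_i / \hat{\sigma}$, and the adjustment is largest for high leverage points.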
Cook’s Distance
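Cook’s distance is a measure of how much each observation influences the model coefficients, and thus the predicted values. The Cook’s distance for the $i^{th}$ observation can be written as

$$D_i = \frac{(\hat{\mathbf{Y}} - \hat{\mathbf{Y}}_{(i)})^T(\hat{\mathbf{Y}} - \hat{\mathbf{Y}}_{(i)})}{(p+1)\hat{\sigma}^2}$$

where $\hat{\mathbf{Y}}_{(i)}$ is the vector of predicted values from the model fitted when the $i^{th}$ observation is deleted. Cook’s Distance can be calculated without deleting observations one at a time, since the expression below is mathematically equivalent to the definition above.

$$D_i = \frac{1}{p+1}\, std.res_i^2 \left[\frac{h_{ii}}{1 - h_{ii}}\right] = \frac{e_i^2}{(p+1)\hat{\sigma}^2(1 - h_{ii})}\left[\frac{h_{ii}}{1 - h_{ii}}\right]$$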
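The equivalence of the two forms of Cook's distance can be checked numerically. The sketch below uses simulated data, assumes $\hat{\sigma}^2 = \sum e_i^2 / (n - p - 1)$ from the full model, and introduces a hypothetical helper `fit`. It computes the distance once by refitting with each observation deleted and once from the residuals and leverages alone.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 2

X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)  # hypothetical model

def fit(X, Y):
    """Least-squares coefficients from the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

Y_hat = X @ fit(X, Y)
e = Y - Y_hat
sigma2_hat = e @ e / (n - p - 1)  # assumed estimate of the variance
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))

# Definition: refit without observation i, then predict all n responses.
cooks_deletion = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    diff = Y_hat - X @ fit(X[keep], Y[keep])
    cooks_deletion[i] = diff @ diff / ((p + 1) * sigma2_hat)

# Shortcut: standardized residuals and leverages from the full fit only.
std_res = e / np.sqrt(sigma2_hat * (1 - h))
cooks_shortcut = std_res**2 * h / ((p + 1) * (1 - h))
```

The shortcut needs a single fit, while the deletion approach needs $n$ refits, which is the practical reason for the second formula.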
Model Selection Criteria
Maximum Likelihood Estimation of $\boldsymbol{\beta}$ and $\sigma^2$
To understand the formulas for AIC and BIC, we will first briefly explain the likelihood function and maximum likelihood estimates for regression.
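Let $\mathbf{Y}$ be the $n \times 1$ matrix of responses, $\mathbf{X}$ the $n \times (p+1)$ matrix of predictors, and $\boldsymbol{\beta}$ the $(p+1) \times 1$ matrix of coefficients. If the multiple linear regression model is correct, then

$$\mathbf{Y} \sim N(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$$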
When we do linear regression, our goal is to estimate the unknown parameters $\boldsymbol{\beta}$ and $\sigma^2$ of this model. In the section on the matrix representation of multiple linear regression, we showed a way to estimate these parameters using matrix algebra. Another approach for estimating $\boldsymbol{\beta}$ and $\sigma^2$ is maximum likelihood estimation.
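A likelihood function is used to summarize the evidence from the data in support of each possible value of a model parameter. Using the Normal model for $\mathbf{Y}$, we will write the likelihood function for linear regression as

$$L(\mathbf{X}, \mathbf{Y} \mid \boldsymbol{\beta}, \sigma^2) = \prod_{i=1}^{n} (2\pi\sigma^2)^{-\frac{1}{2}} \exp\Big\{-\frac{1}{2\sigma^2}(y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2\Big\}$$

where $y_i$ is the response and $\mathbf{x}_i$ is the vector of predictors (including a leading 1 for the intercept) for the $i^{th}$ observation. One approach to estimating $\boldsymbol{\beta}$ and $\sigma^2$ is to find the values of those parameters that maximize this likelihood, i.e. maximum likelihood estimation. To make the calculations more manageable, instead of maximizing the likelihood function, we will maximize its logarithm, i.e. the log-likelihood function. Because the logarithm is an increasing function, the values of the parameters that maximize the log-likelihood function are those that maximize the likelihood function. The log-likelihood function we will maximize is

$$\log L(\mathbf{X}, \mathbf{Y} \mid \boldsymbol{\beta}, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2$$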
[–insert details MLES–]
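The maximum likelihood estimates of $\boldsymbol{\beta}$ and $\sigma^2$ are

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} \qquad \hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \mathbf{x}_i^T\hat{\boldsymbol{\beta}})^2 = \frac{RSS}{n}$$

where $RSS$ is the residual sum of squares. Note that the maximum likelihood estimate $\hat{\sigma}^2_{MLE}$ is not exactly equal to the estimate of $\sigma^2$ we typically use, $\frac{RSS}{n - p - 1}$. This is because the maximum likelihood estimate of $\sigma^2$ is a biased estimator of $\sigma^2$. When $n$ is much larger than the number of predictors $p$, the difference between these two estimates is trivial.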
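To make the comparison between the two variance estimates concrete, here is a small NumPy sketch on simulated data. The function `log_likelihood` and the names `sigma2_mle` and `sigma2_unbiased` are hypothetical; the log-likelihood used is the Normal one with constant variance described in this section.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 25, 3

X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)  # hypothetical

def log_likelihood(beta, sigma2):
    """Normal log-likelihood of the linear model at (beta, sigma2)."""
    resid = Y - X @ beta
    return -0.5 * n * np.log(2 * np.pi * sigma2) - resid @ resid / (2 * sigma2)

# The MLE of beta coincides with the least-squares estimate.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
rss = np.sum((Y - X @ beta_hat) ** 2)

sigma2_mle = rss / n                 # maximum likelihood estimate (biased)
sigma2_unbiased = rss / (n - p - 1)  # estimate typically used in regression
```

The ratio `sigma2_mle / sigma2_unbiased` is exactly $(n - p - 1)/n$, so the two estimates converge as $n$ grows relative to $p$. Evaluating `log_likelihood` at other parameter values gives a smaller result than at `(beta_hat, sigma2_mle)`.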
AIC
Akaike’s Information Criterion (AIC) is

$$AIC = -2\log L + 2(p+1)$$

where $\log L$ is the log-likelihood. This is the general form of AIC that can be applied to a variety of models, but for now, let’s focus on AIC for multiple linear regression.