Appendix A — Mathematics of linear regression
A.1 Least-squares estimators for simple linear regression
Below are the mathematical details for deriving the least-squares estimators for the slope ($\hat{\beta}_1$) and intercept ($\hat{\beta}_0$) in simple linear regression.

Suppose we have a data set with $n$ observations of a response variable $y$ and a single predictor variable $x$. The simple linear regression model is

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma_\epsilon^2)$$

To find the values of $\beta_0$ and $\beta_1$ that best fit the data, we minimize the sum of squared errors (SSE),

$$SSE = \sum_{i=1}^n \epsilon_i^2 = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2$$

Therefore, we want to find the values $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize this quantity.

Let's focus on $\hat{\beta}_0$ first. We take the partial derivative of the SSE with respect to $\beta_0$, set it equal to 0, and solve:

$$\begin{aligned} \frac{\partial\, SSE}{\partial \beta_0} = -2\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) &= 0 \\ \sum_{i=1}^n y_i - n\hat{\beta}_0 - \hat{\beta}_1 \sum_{i=1}^n x_i &= 0 \\ \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \end{aligned} \tag{A.4}$$

The last line of Equation A.4 is derived from the fact that $\sum_{i=1}^n y_i = n\bar{y}$ and $\sum_{i=1}^n x_i = n\bar{x}$.

From Equation A.4, we know the estimator of the intercept once we have the estimator of the slope, so we turn to $\hat{\beta}_1$ next.

The formula for $\hat{\beta}_1$ comes from taking the partial derivative of the SSE with respect to $\beta_1$, setting it equal to 0, substituting $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, and solving:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^n x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^n x_i^2 - n\bar{x}^2} \tag{A.6}$$

We will use the following rules to write Equation A.6 in a form that is more recognizable:

$$\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^n x_i y_i - n\bar{x}\bar{y} \tag{A.7}$$

$$\sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i=1}^n x_i^2 - n\bar{x}^2 \tag{A.8}$$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$ are the sample means.

Applying Equation A.7 and Equation A.8 to Equation A.6, we have

$$\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$$

The correlation between $x$ and $y$ is

$$r = \frac{1}{n-1}\sum_{i=1}^n \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

so the numerator of $\hat{\beta}_1$ equals $(n-1)\, r\, s_x s_y$ and the denominator equals $(n-1)\, s_x^2$. Therefore,

$$\hat{\beta}_1 = r\, \frac{s_y}{s_x}$$

where $s_x$ and $s_y$ are the sample standard deviations of $x$ and $y$.

Plugging in the formula for $\hat{\beta}_1$, the estimator of the intercept is $\hat{\beta}_0 = \bar{y} - r\, \frac{s_y}{s_x}\, \bar{x}$.

We have found values of $\hat{\beta}_0$ and $\hat{\beta}_1$ at which the partial derivatives equal 0, i.e., a critical point. To confirm that this critical point is a minimum, we check the second partial derivatives.

The second partial derivatives are

$$\frac{\partial^2\, SSE}{\partial \beta_0^2} = 2n, \qquad \frac{\partial^2\, SSE}{\partial \beta_1^2} = 2\sum_{i=1}^n x_i^2$$

Both partial derivatives are greater than 0, so we have shown that the estimators minimize the SSE.

The least-squares estimators for the intercept and slope are

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \qquad \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = r\, \frac{s_y}{s_x}$$
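As a numerical check on these formulas, here is a minimal sketch in Python (NumPy assumed; the small data set is invented for illustration):

```python
import numpy as np

# Small invented data set for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum of cross-deviations over sum of squared deviations.
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# Intercept from Equation A.4: beta0_hat = y_bar - beta1_hat * x_bar.
beta0_hat = y_bar - beta1_hat * x_bar

# Equivalent form of the slope: r * s_y / s_x.
r = np.corrcoef(x, y)[0, 1]
slope_via_r = r * y.std(ddof=1) / x.std(ddof=1)

print(np.isclose(beta1_hat, slope_via_r))  # the two slope formulas agree
```

Both expressions for the slope are algebraically identical, so they agree up to floating-point precision.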
A.2 Matrix representation of linear regression
The matrix representation for linear regression introduced in this section will be used for the remainder of this appendix and in Appendix B. We will provide some linear algebra and matrix algebra details throughout, but we assume an understanding of basic linear algebra concepts. Please see Chapter 1 of *An Introduction to Statistical Learning* and online resources for an in-depth introduction to linear algebra.
Suppose we have a data set with $n$ observations and $p$ predictor variables. The linear regression model for observation $i$ is

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma_\epsilon^2) \tag{A.11}$$

The regression model in Equation A.11 can be represented using vectors and matrices:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \tag{A.12}$$

Let's break down the components of Equation A.12:

$$\underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}}_{\mathbf{y}} = \underbrace{\begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix}}_{\mathbf{X}} \underbrace{\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}}_{\boldsymbol{\beta}} + \underbrace{\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}}_{\boldsymbol{\epsilon}} \tag{A.13}$$

From Equation A.12 and Equation A.13, we have the following components of the linear regression model in matrix form:

- $\mathbf{y}$ is an $n \times 1$ vector of the observed responses.
- $\mathbf{X}$ is an $n \times (p+1)$ matrix called the design matrix. The first column is always $\mathbf{1}$, a column vector of 1's, that corresponds to the intercept. The remaining columns contain the observed values of the predictor variables.
- $\boldsymbol{\beta}$ is a $(p+1) \times 1$ vector of the model coefficients.
- $\boldsymbol{\epsilon}$ is an $n \times 1$ vector of the error terms.
As before, the error terms are normally distributed and centered at 0. In matrix notation,

$$\boldsymbol{\epsilon} \sim N(\mathbf{0},\ \sigma_\epsilon^2 \mathbf{I}_n)$$

The variance of the error terms is the $n \times n$ covariance matrix

$$\sigma_\epsilon^2 \mathbf{I}_n = \begin{bmatrix} \sigma_\epsilon^2 & 0 & \cdots & 0 \\ 0 & \sigma_\epsilon^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_\epsilon^2 \end{bmatrix}$$

This is the matrix notation showing that the error terms are independent (the off-diagonal covariances are all 0) and have the same variance $\sigma_\epsilon^2$ (the diagonal entries are all equal).

Based on Equation A.12, the equation for the vector of estimated response values, $\hat{\mathbf{y}}$, is

$$\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$$
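As a sketch of how the design matrix is assembled in practice (Python with NumPy assumed; the data values are invented for illustration):

```python
import numpy as np

# Invented data: n = 5 observations, p = 2 predictors.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([0.5, 1.0, 0.0, 1.5, 2.0])
y = np.array([3.0, 5.0, 4.5, 8.0, 9.5])

n = len(y)
# Design matrix: a column of 1's for the intercept, then the predictors.
X = np.column_stack([np.ones(n), x1, x2])

print(X.shape)  # (5, 3), i.e., n x (p + 1)
```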
A.3 Estimating the coefficients
A.3.1 Least-squares estimation
In matrix notation, the error terms can be written as $\boldsymbol{\epsilon} = \mathbf{y} - \mathbf{X}\boldsymbol{\beta}$.

As with simple linear regression in Section A.1, the least-squares estimator of the vector of coefficients, $\hat{\boldsymbol{\beta}}$, minimizes the sum of squared errors

$$SSE = \boldsymbol{\epsilon}^T\boldsymbol{\epsilon} = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \tag{A.15}$$

where $\boldsymbol{\epsilon}^T$ denotes the transpose of $\boldsymbol{\epsilon}$.

Let's walk through the steps to minimize Equation A.15. We start by expanding the equation:

$$SSE = \mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\boldsymbol{\beta} - \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}$$

$$SSE = \mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} \tag{A.17}$$

Note that $\mathbf{y}^T\mathbf{X}\boldsymbol{\beta}$ is a $1 \times 1$ scalar, so it is equal to its transpose, $\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y}$. This is how the two middle terms are combined to obtain Equation A.17.
Next, we find the partial derivative of Equation A.17 with respect to $\boldsymbol{\beta}$. To do so, we use two properties from matrix calculus.

Property 1

Let $\mathbf{a}$ and $\mathbf{b}$ be $k \times 1$ vectors. Then

$$\frac{\partial\, \mathbf{a}^T\mathbf{b}}{\partial\, \mathbf{b}} = \frac{\partial\, \mathbf{b}^T\mathbf{a}}{\partial\, \mathbf{b}} = \mathbf{a}$$

Property 2

Let $\mathbf{b}$ be a $k \times 1$ vector and $\mathbf{A}$ be a $k \times k$ symmetric matrix. Then

$$\frac{\partial\, \mathbf{b}^T\mathbf{A}\mathbf{b}}{\partial\, \mathbf{b}} = 2\mathbf{A}\mathbf{b}$$

Using the matrix calculus properties above, the partial derivative of Equation A.17 with respect to $\boldsymbol{\beta}$ is

$$\frac{\partial\, SSE}{\partial\, \boldsymbol{\beta}} = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}$$

Note that $\mathbf{X}^T\mathbf{X}$ is symmetric, so Property 2 applies to the last term of Equation A.17.

Thus, the least-squares estimator is the value $\hat{\boldsymbol{\beta}}$ at which this partial derivative equals $\mathbf{0}$.

The steps to find this $\hat{\boldsymbol{\beta}}$ are

$$\begin{aligned} -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} &= \mathbf{0} \\ \mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} &= \mathbf{X}^T\mathbf{y} \\ \hat{\boldsymbol{\beta}} &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \end{aligned}$$

The last step requires that $\mathbf{X}^T\mathbf{X}$ is invertible, which holds when the columns of $\mathbf{X}$ are linearly independent (more on this in Section A.7).
Similar to Section A.1, we check the second derivative to confirm that we have found a minimum. In matrix representation, the second derivative is the Hessian matrix.

The Hessian matrix of a function of $\boldsymbol{\beta}$ is the matrix of all second partial derivatives, with $(j, k)$ entry $\frac{\partial^2 f}{\partial \beta_j\, \partial \beta_k}$.

If the Hessian matrix is…

- positive definite, then we found a minimum.
- negative definite, then we found a maximum.
- neither, then we found a saddle point.

Thus, the Hessian of Equation A.15 is

$$\frac{\partial^2\, SSE}{\partial\, \boldsymbol{\beta}\, \partial\, \boldsymbol{\beta}^T} = 2\mathbf{X}^T\mathbf{X} \tag{A.21}$$

Equation A.21 is proportional to $\mathbf{X}^T\mathbf{X}$, which is positive definite when the columns of $\mathbf{X}$ are linearly independent: for any nonzero vector $\mathbf{v}$, we have $\mathbf{v}^T\mathbf{X}^T\mathbf{X}\mathbf{v} = \|\mathbf{X}\mathbf{v}\|^2 > 0$. Therefore, $\hat{\boldsymbol{\beta}}$ is a minimum.

The least-squares estimator in matrix notation is

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
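A minimal sketch of this estimator in Python (NumPy assumed; the simulated data are invented for illustration). Solving the normal equations with `np.linalg.solve` avoids forming the explicit inverse, which is the numerically preferred approach:

```python
import numpy as np

# Simulated data for illustration: n observations, p predictors.
rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# beta_hat = (X^T X)^{-1} X^T y, computed by solving the normal equations
# X^T X beta = X^T y rather than forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))
```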
A.3.2 Geometry of least-squares regression
In Section A.3.1, we used matrix calculus to find $\hat{\boldsymbol{\beta}}$. Alternatively, we can find $\hat{\boldsymbol{\beta}}$ using the geometry of least-squares regression.

Let $\mathcal{C}(\mathbf{X})$ denote the column space of $\mathbf{X}$, the set of all vectors that can be written as $\mathbf{X}\mathbf{b}$ for some vector $\mathbf{b}$.

For any vector $\mathbf{b}$, the vector $\mathbf{X}\mathbf{b}$ lies in $\mathcal{C}(\mathbf{X})$, so fitting the regression amounts to choosing the vector in $\mathcal{C}(\mathbf{X})$ that is closest to $\mathbf{y}$, i.e., the vector that minimizes $\|\mathbf{y} - \mathbf{X}\mathbf{b}\|^2$.

If $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$ is this closest vector, then the residual vector $\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$ must be orthogonal to $\mathcal{C}(\mathbf{X})$. In other words, $\hat{\mathbf{y}}$ is the orthogonal projection of $\mathbf{y}$ onto $\mathcal{C}(\mathbf{X})$.

Because $\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$ is orthogonal to every column of $\mathbf{X}$,

$$\mathbf{X}^T(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0}$$

which rearranges to $\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{y}$ and thus $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$.

Using the geometric interpretation of least-squares regression, we found that the vector of fitted values $\hat{\mathbf{y}}$ is the orthogonal projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$, and we arrived at the same least-squares estimator as in Section A.3.1.
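The orthogonality condition can be checked numerically; a minimal sketch in Python (NumPy assumed, simulated data for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residual = y - X @ beta_hat

# The residual vector is orthogonal to every column of X: X^T e = 0.
print(np.allclose(X.T @ residual, 0.0, atol=1e-8))
```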
A.4 Hat matrix
The fitted values of least-squares regression are

$$\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{H}\mathbf{y} \tag{A.23}$$

where $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is called the hat matrix, because it "puts the hat on" $\mathbf{y}$.

From Equation A.23, $\mathbf{H}$ is the matrix that projects $\mathbf{y}$ onto the column space of $\mathbf{X}$. It has the following properties:

- $\mathbf{H}$ is symmetric ($\mathbf{H}^T = \mathbf{H}$).
- $\mathbf{H}$ is idempotent ($\mathbf{H}\mathbf{H} = \mathbf{H}$).
- If a vector $\mathbf{v}$ is in $\mathcal{C}(\mathbf{X})$, then $\mathbf{H}\mathbf{v} = \mathbf{v}$.
- If a vector $\mathbf{v}$ is orthogonal to $\mathcal{C}(\mathbf{X})$, then $\mathbf{H}\mathbf{v} = \mathbf{0}$.

From Equation A.23, the hat matrix only depends on the design matrix $\mathbf{X}$, not on the observed responses $\mathbf{y}$.

In multiple linear regression, the leverage is a measure of how far an observation's predictor values are from the means of the predictor values. In simple linear regression, the leverage of observation $i$ can be written as

$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^n (x_j - \bar{x})^2} \tag{A.24}$$

In general, the leverage of observation $i$ is the $i$th diagonal entry of the hat matrix,

$$h_{ii} = \mathbf{x}_i^T (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i \tag{A.25}$$

where $\mathbf{x}_i^T$ is the $i$th row of the design matrix $\mathbf{X}$.

The sum of the leverages for all observations is

$$\sum_{i=1}^n h_{ii} = \operatorname{tr}(\mathbf{H}) = \operatorname{tr}\!\left((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\right) = \operatorname{tr}(\mathbf{I}_{p+1}) = p + 1$$

Using $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I} - \mathbf{H})\mathbf{y}$ together with the symmetry and idempotency of $\mathbf{H}$, the variance of the $i$th residual is

$$\operatorname{Var}(e_i) = \sigma_\epsilon^2 (1 - h_{ii}) \tag{A.26}$$

Equation A.26 shows one feature of observations with large leverage: because $h_{ii}$ is close to 1, the variance of the residual is small, so such observations tend to have small residuals. In other words, the fitted model tends to be pulled toward observations with large leverage.
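The properties of the hat matrix can be verified numerically; a minimal sketch in Python (NumPy assumed, simulated design matrix for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

# Hat matrix H = X (X^T X)^{-1} X^T.
H = X @ np.linalg.solve(X.T @ X, X.T)

print(np.allclose(H, H.T))             # symmetric
print(np.allclose(H @ H, H))           # idempotent
print(np.isclose(np.trace(H), p + 1))  # leverages sum to p + 1
```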
A.5 Assumptions of linear regression
In Section 5.3, we introduced four assumptions of linear regression. Given the matrix form of the linear regression model in Equation A.12, these assumptions are:

- The distribution of the response $\mathbf{y}$ given $\mathbf{X}$ is normal.
- The expected value of $\mathbf{y}$ given $\mathbf{X}$ is $\mathbf{X}\boldsymbol{\beta}$. There is a linear relationship between the response and predictor variables.
- The variance of $\mathbf{y}$ given $\mathbf{X}$ is $\sigma_\epsilon^2\mathbf{I}_n$.
- The error terms in $\boldsymbol{\epsilon}$ are independent. This also means the observations are independent of one another.

From these assumptions, we write the distribution of $\mathbf{y}$ given $\mathbf{X}$ as

$$\mathbf{y} \mid \mathbf{X} \sim N(\mathbf{X}\boldsymbol{\beta},\ \sigma_\epsilon^2\mathbf{I}_n)$$
In Section A.2, we showed Assumption 4 from the distribution of the error terms. Here we will show Assumptions 1 - 3.
Suppose $\mathbf{X}$ is fixed, i.e., we condition on the observed values of the predictors.

The distribution of the error terms is $\boldsymbol{\epsilon} \sim N(\mathbf{0},\ \sigma_\epsilon^2\mathbf{I}_n)$. We will use the following rules for the expected value and variance of a random vector.

Expected value of a vector

Let $\mathbf{u}$ be a random vector, $\mathbf{A}$ a constant matrix, and $\mathbf{c}$ a constant vector. Then $E[\mathbf{A}\mathbf{u} + \mathbf{c}] = \mathbf{A}E[\mathbf{u}] + \mathbf{c}$.

Variance of a vector

Let $\mathbf{u}$, $\mathbf{A}$, and $\mathbf{c}$ be as above. Then $\operatorname{Var}(\mathbf{A}\mathbf{u} + \mathbf{c}) = \mathbf{A}\operatorname{Var}(\mathbf{u})\mathbf{A}^T$.

First, Assumption 1: from Equation A.12, $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$ is a linear function of the normally distributed vector $\boldsymbol{\epsilon}$, so the distribution of $\mathbf{y}$ given $\mathbf{X}$ is also normal.

Next, let's show Assumption 2, that the expected value of $\mathbf{y}$ given $\mathbf{X}$ is $\mathbf{X}\boldsymbol{\beta}$:

$$E[\mathbf{y}] = E[\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}] = \mathbf{X}\boldsymbol{\beta} + E[\boldsymbol{\epsilon}] = \mathbf{X}\boldsymbol{\beta}$$

Lastly, we show Assumption 3, that the variance of $\mathbf{y}$ given $\mathbf{X}$ is $\sigma_\epsilon^2\mathbf{I}_n$. Because $\mathbf{X}\boldsymbol{\beta}$ is a constant vector when $\mathbf{X}$ is fixed,

$$\operatorname{Var}(\mathbf{y}) = \operatorname{Var}(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}) = \operatorname{Var}(\boldsymbol{\epsilon}) = \sigma_\epsilon^2\mathbf{I}_n$$
A.6 Distribution of model coefficients
In Section 8.4.1, we introduced the distribution of a model coefficient estimator $\hat{\beta}_j$. In matrix form, the distribution of the vector of estimated coefficients is

$$\hat{\boldsymbol{\beta}} \sim N\!\left(\boldsymbol{\beta},\ \sigma_\epsilon^2(\mathbf{X}^T\mathbf{X})^{-1}\right) \tag{A.33}$$

Similar to Section A.5, let's derive each part of this distribution. We'll start with the expected value. Using $E[\mathbf{y}] = \mathbf{X}\boldsymbol{\beta}$,

$$E[\hat{\boldsymbol{\beta}}] = E[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}] = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T E[\mathbf{y}] = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta}$$

so $\hat{\boldsymbol{\beta}}$ is an unbiased estimator of $\boldsymbol{\beta}$.

Next, we show that the variance is $\sigma_\epsilon^2(\mathbf{X}^T\mathbf{X})^{-1}$. Let $\mathbf{A} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$, so that $\hat{\boldsymbol{\beta}} = \mathbf{A}\mathbf{y}$. Then

$$\operatorname{Var}(\hat{\boldsymbol{\beta}}) = \mathbf{A}\operatorname{Var}(\mathbf{y})\mathbf{A}^T = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\sigma_\epsilon^2\mathbf{I}_n)\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} = \sigma_\epsilon^2(\mathbf{X}^T\mathbf{X})^{-1}$$

Lastly, we show that the distribution of $\hat{\boldsymbol{\beta}}$ is normal: $\hat{\boldsymbol{\beta}} = \mathbf{A}\mathbf{y}$ is a linear transformation of the normally distributed vector $\mathbf{y}$, so it is also normally distributed.

From Equation A.33, we see that the distribution of an individual coefficient estimator is $\hat{\beta}_j \sim N\!\left(\beta_j,\ \sigma_\epsilon^2\,[(\mathbf{X}^T\mathbf{X})^{-1}]_{jj}\right)$, where $[(\mathbf{X}^T\mathbf{X})^{-1}]_{jj}$ is the $j$th diagonal entry of $(\mathbf{X}^T\mathbf{X})^{-1}$.
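This distribution can be checked by simulation; a sketch in Python (NumPy assumed; the design matrix, coefficients, and error variance are invented for illustration). We hold $\mathbf{X}$ fixed, simulate many response vectors, and compare the sample mean and covariance of the estimates to the theoretical values:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 100, 2, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([0.5, 1.0, -2.0])
XtX_inv = np.linalg.inv(X.T @ X)

# Theoretical covariance of beta_hat: sigma^2 (X^T X)^{-1}.
cov_theory = sigma**2 * XtX_inv

# Monte Carlo: refit on many simulated response vectors, holding X fixed.
B = 20_000
Y = X @ beta + rng.normal(scale=sigma, size=(B, n))  # B simulated data sets
estimates = Y @ X @ XtX_inv  # row b is (X^T X)^{-1} X^T y_b

print(np.allclose(estimates.mean(axis=0), beta, atol=0.01))   # approx. unbiased
print(np.allclose(np.cov(estimates.T), cov_theory, atol=2e-3))
```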
A.7 Multicollinearity
Recall the design matrix for a linear regression model with $p$ predictors,

$$\mathbf{X} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix}$$

The design matrix $\mathbf{X}$ must have linearly independent columns (full column rank) for $\mathbf{X}^T\mathbf{X}$ to be invertible, and therefore for the least-squares estimator $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ to exist.

Let's see what happens when there is a perfect linear dependence, i.e., when one column of $\mathbf{X}$ can be written as a linear combination of the other columns:

$$\mathbf{x}_j = c_0\mathbf{1} + \sum_{k \neq j} c_k\mathbf{x}_k$$

where $\mathbf{x}_k$ denotes the $k$th column of $\mathbf{X}$ and the $c_k$ are constants. In this case, $\mathbf{X}$ is not full rank, $\mathbf{X}^T\mathbf{X}$ is singular, and the least-squares estimator is not uniquely defined.

In practice, we rarely have perfect linear dependencies. In fact, this is mathematically why we only include $k - 1$ indicator variables for a categorical predictor with $k$ levels: including all $k$ indicators would make their columns sum to the intercept column $\mathbf{1}$, a perfect linear dependence.

In Section 8.6, we discussed the practical issues that come from the presence of multicollinearity. These primarily stem from the fact that when there is multicollinearity, $\mathbf{X}^T\mathbf{X}$ is nearly singular, so the entries of $(\mathbf{X}^T\mathbf{X})^{-1}$, and with them the variances $\sigma_\epsilon^2(\mathbf{X}^T\mathbf{X})^{-1}$ of the coefficient estimators, become very large.
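Both cases can be illustrated numerically; a sketch in Python (NumPy assumed, data invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20
x1 = rng.normal(size=n)
x2 = 2.0 * x1                      # perfectly collinear with x1
X = np.column_stack([np.ones(n), x1, x2])

# X^T X is rank-deficient, so (X^T X)^{-1} does not exist.
print(np.linalg.matrix_rank(X.T @ X))  # 2, not 3

# Near-collinearity: a tiny perturbation makes X^T X technically invertible
# but ill-conditioned, which inflates the variance of the coefficients.
x2_near = x2 + rng.normal(scale=1e-4, size=n)
X_near = np.column_stack([np.ones(n), x1, x2_near])
print(np.linalg.cond(X_near.T @ X_near) > 1e6)  # huge condition number
```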
A.8 Maximum likelihood estimation
In Section 10.2.3, we introduced the likelihood function to understand the model performance statistics AIC and BIC. We also used a likelihood function to estimate the coefficients in logistic regression (more on this in Appendix B). The likelihood function is a measure of how likely it is we observe our data given particular value(s) of model parameters. When working with likelihood functions, we have fixed data (our observed sample data) and we can try out different values for the model parameters (e.g., $\boldsymbol{\beta}$ and $\sigma_\epsilon^2$ in linear regression).

Let $f(y_i)$ denote the probability density function of the response $y_i$.

In Section A.5, we showed that the vector of responses $\mathbf{y}$ is normally distributed, so each $y_i \sim N(\mathbf{x}_i^T\boldsymbol{\beta},\ \sigma_\epsilon^2)$ with density

$$f(y_i) = \frac{1}{\sqrt{2\pi\sigma_\epsilon^2}} \exp\!\left(-\frac{(y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2}{2\sigma_\epsilon^2}\right)$$

where $\mathbf{x}_i^T$ is the $i$th row of the design matrix.

The data are independent, so the likelihood function is the product of the individual densities:

$$L(\boldsymbol{\beta}, \sigma_\epsilon^2) = \prod_{i=1}^n f(y_i) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma_\epsilon^2}} \exp\!\left(-\frac{(y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2}{2\sigma_\epsilon^2}\right)$$

To make the calculations more manageable, instead of maximizing the likelihood function, we will instead maximize its logarithm, i.e., the log-likelihood function. The values of the parameters that maximize the log-likelihood function are those that maximize the likelihood function. The log-likelihood function is

$$\ell(\boldsymbol{\beta}, \sigma_\epsilon^2) = \log L(\boldsymbol{\beta}, \sigma_\epsilon^2) = -\frac{n}{2}\log(2\pi\sigma_\epsilon^2) - \frac{1}{2\sigma_\epsilon^2}\sum_{i=1}^n (y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2$$

Given a fixed value of $\sigma_\epsilon^2$, maximizing the log-likelihood function over $\boldsymbol{\beta}$ is equivalent to minimizing $\sum_{i=1}^n (y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2$, which is exactly the SSE from Equation A.15.

We previously found the value of $\boldsymbol{\beta}$ that minimizes the SSE: $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$. Therefore, the maximum likelihood estimator of $\boldsymbol{\beta}$ is the same as the least-squares estimator.
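This equivalence can be illustrated numerically; a sketch in Python (NumPy assumed; the data and the fixed value of $\sigma_\epsilon^2$ are invented for illustration):

```python
import numpy as np

def log_likelihood(beta, sigma2, X, y):
    """Gaussian log-likelihood of the linear model y = X beta + eps."""
    n = len(y)
    sse = np.sum((y - X @ beta) ** 2)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - sse / (2 * sigma2)

rng = np.random.default_rng(5)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)

# With sigma^2 held fixed, maximizing the log-likelihood over beta is
# the same as minimizing the SSE, so the OLS solution is the maximizer.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
sigma2 = 0.09

ll_ols = log_likelihood(beta_ols, sigma2, X, y)
ll_perturbed = log_likelihood(beta_ols + 0.1, sigma2, X, y)
print(ll_ols > ll_perturbed)  # True: moving away from OLS lowers the likelihood
```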
A.9 Variable transformations
In Chapter 9, we introduced regression models with transformations on the response and/or predictor variables. Here we will share some of the mathematical details behind the interpretation of the coefficients in these models.
A.9.1 Transformation on the response variable
In Section 9.2, we introduced a linear regression model with a log transformation on the response variable:

$$\log(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon \tag{A.37}$$

In this text, $\log$ denotes the natural logarithm (base $e$).

From Chapter 7, we have that the change in the response when predictor $x_j$ increases by one unit, holding all other predictors constant, is the coefficient $\beta_j$. Under the model in Equation A.37, the response is $\log(y)$, so a one-unit increase in $x_j$ changes $\log(y)$ by $\beta_j$:

$$\log(y_{new}) - \log(y_{old}) = \beta_j \implies \log\!\left(\frac{y_{new}}{y_{old}}\right) = \beta_j \implies \frac{y_{new}}{y_{old}} = e^{\beta_j}$$

Thus, given the model in Equation A.37, when $x_j$ increases by one unit, holding all other predictors constant, the response $y$ is multiplied by a factor of $e^{\beta_j}$.
A.9.2 Transformation on predictor variable(s)
Next, we consider the models introduced in Section 9.3 that have a log transformation on a predictor variable:

$$y = \beta_0 + \beta_1 \log(x_1) + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon \tag{A.38}$$

Now, because the predictor $x_1$ is log-transformed, we consider multiplying $x_1$ by a factor of $c$ rather than adding one unit. Holding all other predictors constant, the change in the response is

$$y_{new} - y_{old} = \beta_1 \log(c\, x_1) - \beta_1 \log(x_1) = \beta_1 \log(c)$$

Thus, given the model in Equation A.38, when predictor $x_1$ is multiplied by a factor of $c$, holding all other predictors constant, the response $y$ changes by $\beta_1 \log(c)$.
A.9.3 Transformation on response and predictor variables
Lastly, we show the mathematics behind the interpretation of a model coefficient when both the response and a predictor are log-transformed:

$$\log(y) = \beta_0 + \beta_1 \log(x_1) + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon \tag{A.39}$$

Combining the results from Section A.9.1 and Section A.9.2 and holding all other predictors constant, when $x_1$ is multiplied by a factor of $c$, $\log(y)$ changes by $\beta_1 \log(c)$, so

$$\frac{y_{new}}{y_{old}} = e^{\beta_1 \log(c)} = c^{\beta_1}$$

Therefore, given the model in Equation A.39 with a transformed response variable and transformed predictor, when $x_1$ is multiplied by a factor of $c$, holding all other predictors constant, the response $y$ is multiplied by a factor of $c^{\beta_1}$.
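This interpretation can be verified with a noise-free log-log model; a minimal sketch in Python (NumPy assumed; the coefficient values are invented for illustration):

```python
import numpy as np

# Noise-free log-log model: log(y) = beta0 + beta1 * log(x1).
beta0, beta1 = 0.5, 0.8

def y_of(x1):
    return np.exp(beta0 + beta1 * np.log(x1))

# Multiply the predictor by a factor c; the response is multiplied by c^beta1.
c = 2.0
ratio = y_of(3.0 * c) / y_of(3.0)
print(np.isclose(ratio, c ** beta1))  # True
```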
This is the span of the columns of $\mathbf{X}$.↩︎

Note that when there is a single predictor, Equation A.24 and Equation A.25 produce the same result.↩︎
Note that the likelihood function is not the same as a probability function. In a probability function, we fix the model parameters and plug in different values for the data.↩︎
