Appendix A — Mathematics of linear regression

A.1 Least-squares estimators for simple linear regression

Below are the mathematical details for deriving the least-squares estimators for the slope ($\beta_1$) and intercept ($\beta_0$) using calculus. We obtain the estimators $\hat\beta_0$ and $\hat\beta_1$ by finding the values of $\beta_0$ and $\beta_1$, respectively, that minimize the sum of squared residuals (SSR).

Suppose we have a data set with $n$ observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. Then the sum of squared residuals is

$$
\text{SSR} = \sum_{i=1}^{n} [y_i - \hat{y}_i]^2 = \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]^2 = \sum_{i=1}^{n} [y_i - \beta_0 - \beta_1 x_i]^2 \tag{A.1}
$$

To find the value of $\beta_0$ that minimizes SSR, we begin by taking the partial derivative of SSR with respect to $\beta_0$. Similarly, we take the partial derivative with respect to $\beta_1$ to find the value of $\beta_1$ that minimizes SSR. The partial derivatives are

$$
\frac{\partial\, \text{SSR}}{\partial \beta_0} = -2\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i), \qquad
\frac{\partial\, \text{SSR}}{\partial \beta_1} = -2\sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) \tag{A.2}
$$

Therefore, we want to find the $\hat\beta_0$ and $\hat\beta_1$ that satisfy the following:

$$
-2\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0, \qquad
-2\sum_{i=1}^{n} x_i (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0 \tag{A.3}
$$

Let’s focus on $\hat\beta_0$ for now and find the $\hat\beta_0$ that satisfies the first equality in Equation A.3. The mathematical steps are shown in Equation A.4.

$$
\begin{aligned}
-2\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i) &= 0 \\
\sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \hat\beta_0 - \sum_{i=1}^{n} \hat\beta_1 x_i &= 0 \\
\sum_{i=1}^{n} y_i - n\hat\beta_0 - \hat\beta_1\sum_{i=1}^{n} x_i &= 0 \\
\sum_{i=1}^{n} y_i - \hat\beta_1\sum_{i=1}^{n} x_i &= n\hat\beta_0 \\
\frac{1}{n}\left(\sum_{i=1}^{n} y_i - \hat\beta_1\sum_{i=1}^{n} x_i\right) &= \hat\beta_0 \\
\bar{y} - \hat\beta_1\bar{x} &= \hat\beta_0
\end{aligned} \tag{A.4}
$$

The last line of Equation A.4 is derived from the facts that $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.

From Equation A.4, we know the $\hat\beta_0$ that satisfies the first equality in Equation A.3 is

$$
\hat\beta_0 = \bar{y} - \hat\beta_1\bar{x} \tag{A.5}
$$

The formula for $\hat\beta_0$ contains $\hat\beta_1$, so now let’s find the value of $\hat\beta_1$ that satisfies the second equality in Equation A.3. The mathematical steps are shown in Equation A.6.

$$
\begin{aligned}
-2\sum_{i=1}^{n} x_i(y_i - \hat\beta_0 - \hat\beta_1 x_i) &= 0 \\
\sum_{i=1}^{n} x_i y_i - \hat\beta_0\sum_{i=1}^{n} x_i - \hat\beta_1\sum_{i=1}^{n} x_i^2 &= 0 \\
(\text{Fill in } \hat\beta_0) \qquad \sum_{i=1}^{n} x_i y_i - (\bar{y} - \hat\beta_1\bar{x})\sum_{i=1}^{n} x_i - \hat\beta_1\sum_{i=1}^{n} x_i^2 &= 0 \\
\sum_{i=1}^{n} x_i y_i &= (\bar{y} - \hat\beta_1\bar{x})\sum_{i=1}^{n} x_i + \hat\beta_1\sum_{i=1}^{n} x_i^2 \\
\sum_{i=1}^{n} x_i y_i &= \bar{y}\sum_{i=1}^{n} x_i - \hat\beta_1\bar{x}\sum_{i=1}^{n} x_i + \hat\beta_1\sum_{i=1}^{n} x_i^2 \\
\left(\text{Given } \bar{x} = \tfrac{1}{n}\sum_{i=1}^{n} x_i\right) \qquad \sum_{i=1}^{n} x_i y_i &= n\bar{y}\bar{x} - \hat\beta_1 n\bar{x}^2 + \hat\beta_1\sum_{i=1}^{n} x_i^2 \\
\sum_{i=1}^{n} x_i y_i - n\bar{y}\bar{x} &= \hat\beta_1\sum_{i=1}^{n} x_i^2 - \hat\beta_1 n\bar{x}^2 \\
\sum_{i=1}^{n} x_i y_i - n\bar{y}\bar{x} &= \hat\beta_1\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right) \\
\frac{\sum_{i=1}^{n} x_i y_i - n\bar{y}\bar{x}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} &= \hat\beta_1
\end{aligned} \tag{A.6}
$$

We will use the following rules to write Equation A.6 in a form that is more recognizable:

$$
\sum_{i=1}^{n} x_i y_i - n\bar{y}\bar{x} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = (n-1)\,\text{Cov}(x,y) \tag{A.7}
$$

$$
\sum_{i=1}^{n} x_i^2 - n\bar{x}^2 = \sum_{i=1}^{n}(x_i - \bar{x})^2 = (n-1)\,s_x^2 \tag{A.8}
$$

where $\text{Cov}(x,y)$ is the covariance of $x$ and $y$, and $s_x^2$ is the sample variance of $x$ ($s_x$ is the sample standard deviation).
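These identities are easy to verify numerically. Below is a quick sketch using NumPy on simulated data (the data and variable names are illustrative, not from the text), checking both sides of Equations A.7 and A.8 against NumPy's built-in covariance and variance:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = rng.normal(size=50)
n = len(x)

# Equation A.7: sum(x*y) - n*ybar*xbar equals the centered cross-product sum
lhs_cov = np.sum(x * y) - n * x.mean() * y.mean()
rhs_cov = np.sum((x - x.mean()) * (y - y.mean()))
cov_form = (n - 1) * np.cov(x, y, ddof=1)[0, 1]  # (n-1) * Cov(x, y)

# Equation A.8: sum(x^2) - n*xbar^2 equals the centered sum of squares
lhs_var = np.sum(x**2) - n * x.mean() ** 2
rhs_var = np.sum((x - x.mean()) ** 2)
var_form = (n - 1) * x.var(ddof=1)  # (n-1) * s_x^2
```

All three expressions in each identity agree up to floating-point rounding.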

Applying Equations A.7 and A.8 to Equation A.6, we have

$$
\hat\beta_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{y}\bar{x}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{(n-1)\,\text{Cov}(x,y)}{(n-1)\,s_x^2} = \frac{\text{Cov}(x,y)}{s_x^2} \tag{A.9}
$$

The correlation between $x$ and $y$ is

$$
r = \frac{\text{Cov}(x,y)}{s_x s_y}
$$

Therefore,

$$
\text{Cov}(x,y) = r\, s_x s_y
$$

where $s_x$ and $s_y$ are the sample standard deviations of $x$ and $y$, respectively.

Plugging the formula for $\text{Cov}(x,y)$ into Equation A.9, we have

$$
\hat\beta_1 = \frac{\text{Cov}(x,y)}{s_x^2} = \frac{r\, s_x s_y}{s_x^2} = r\frac{s_y}{s_x}
$$

We have found the values of $\hat\beta_0$ and $\hat\beta_1$ that satisfy Equation A.3. Now we need to confirm that we have found a minimum (rather than a maximum or a saddle point). To do so, we check the second partial derivatives. If they are positive, then we have found a minimum.

The second partial derivatives are

$$
\begin{aligned}
\frac{\partial^2\, \text{SSR}}{\partial \beta_0^2} &= \frac{\partial}{\partial \beta_0}\left(-2\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)\right) = 2n > 0 \\
\frac{\partial^2\, \text{SSR}}{\partial \beta_1^2} &= \frac{\partial}{\partial \beta_1}\left(-2\sum_{i=1}^{n} x_i(y_i - \beta_0 - \beta_1 x_i)\right) = 2\sum_{i=1}^{n} x_i^2 > 0
\end{aligned} \tag{A.10}
$$

Both second partial derivatives are greater than 0, so we have shown that the estimators $\hat\beta_0$ and $\hat\beta_1$ do, in fact, minimize SSR.

The least-squares estimators for the intercept and slope are

$$
\hat\beta_0 = \bar{y} - \hat\beta_1\bar{x}, \qquad \hat\beta_1 = r\frac{s_y}{s_x}
$$
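As a sanity check, the closed-form estimators above can be compared against a library fit. The following is a minimal sketch, assuming NumPy and using simulated data (the true coefficients 2 and 3 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=100)

# Closed-form estimators: beta1 = r * s_y / s_x, beta0 = ybar - beta1 * xbar
r = np.corrcoef(x, y)[0, 1]
beta1_hat = r * y.std(ddof=1) / x.std(ddof=1)
beta0_hat = y.mean() - beta1_hat * x.mean()

# Library fit for comparison (polyfit returns slope first, then intercept)
slope, intercept = np.polyfit(x, y, deg=1)
```

The two approaches agree to floating-point precision, since `polyfit` with degree 1 solves the same least-squares problem.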

A.2 Matrix representation of linear regression

The matrix representation for linear regression introduced in this section will be used for the remainder of this appendix. We will provide some linear algebra and matrix algebra details throughout, but we assume an understanding of basic linear algebra concepts. Please see Chapter 1 of An Introduction to Statistical Learning and online resources for an in-depth introduction to linear algebra.

Suppose we have a data set with $n$ observations. The $i$th observation is represented as $(x_{i1}, \ldots, x_{ip}, y_i)$, such that $x_{i1}, \ldots, x_{ip}$ are the predictor variables and $y_i$ is the response variable. We assume the data can be modeled using the least-squares regression model of the form in Equation A.11.

$$
y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma_\epsilon^2) \tag{A.11}
$$

The regression model in Equation A.11 can be represented using vectors and matrices.

$$
\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\epsilon, \qquad \boldsymbol\epsilon \sim N(\mathbf{0}, \sigma_\epsilon^2\mathbf{I}) \tag{A.12}
$$

Let’s break down the components of Equation A.12.

$$
\underbrace{\begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}}_{\mathbf{y}} =
\underbrace{\begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix}}_{\mathbf{X}}
\underbrace{\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}}_{\boldsymbol\beta} +
\underbrace{\begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{bmatrix}}_{\boldsymbol\epsilon}
\tag{A.13}
$$

From Equations A.12 and A.13, we have the following components of the linear regression model in matrix form:

  • $\mathbf{y}$ is an $n \times 1$ vector of the observed responses.
  • $\mathbf{X}$ is an $n \times (p+1)$ matrix called the design matrix. The first column is $\mathbf{1}$, a column vector of 1’s, that corresponds to the intercept. The remaining columns contain the observed values of the predictor variables.
  • $\boldsymbol\beta$ is a $(p+1) \times 1$ vector of the model coefficients.
  • $\boldsymbol\epsilon$ is an $n \times 1$ vector of the error terms.

As before, the error terms are normally distributed, centered at $\mathbf{0}$, a column vector of 0’s, with variance $\sigma_\epsilon^2\mathbf{I}$, where $\mathbf{I}$ is the identity matrix.

The variance of the error terms $\boldsymbol\epsilon$ is

$$
\sigma_\epsilon^2\mathbf{I} = \sigma_\epsilon^2\begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} = \begin{bmatrix} \sigma_\epsilon^2 & 0 & \cdots & 0 \\ 0 & \sigma_\epsilon^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_\epsilon^2 \end{bmatrix}
$$

This is the matrix notation showing that the error terms are independent and have the same variance $\sigma_\epsilon^2$.

Based on Equation A.12, the equation for the vector of estimated response values, $\hat{\mathbf{y}}$, is

$$
\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol\beta}
$$

A.3 Estimating the Coefficients

A.3.1 Least-squares estimation

In matrix notation, the error terms can be written as

$$
\boldsymbol\epsilon = \mathbf{y} - \mathbf{X}\boldsymbol\beta \tag{A.14}
$$

As with simple linear regression in Section A.1, the least-squares estimator of the vector of coefficients, $\hat{\boldsymbol\beta}$, is the vector that minimizes the sum of squared residuals in Equation A.15.

$$
\text{SSR} = \sum_{i=1}^{n} \epsilon_i^2 = \boldsymbol\epsilon^T\boldsymbol\epsilon = (\mathbf{y} - \mathbf{X}\boldsymbol\beta)^T(\mathbf{y} - \mathbf{X}\boldsymbol\beta) \tag{A.15}
$$

where $\boldsymbol\epsilon^T$ is the transpose of the vector $\boldsymbol\epsilon$.

Let’s walk through the steps to minimize Equation A.15. We start by expanding the equation.

$$
\text{SSR} = (\mathbf{y} - \mathbf{X}\boldsymbol\beta)^T(\mathbf{y} - \mathbf{X}\boldsymbol\beta) = \mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\boldsymbol\beta - \boldsymbol\beta^T\mathbf{X}^T\mathbf{y} + \boldsymbol\beta^T\mathbf{X}^T\mathbf{X}\boldsymbol\beta \tag{A.16}
$$

Note that $(\mathbf{y}^T\mathbf{X}\boldsymbol\beta)^T = \boldsymbol\beta^T\mathbf{X}^T\mathbf{y}$. These are both scalars (i.e., $1 \times 1$ matrices), so we have $\mathbf{y}^T\mathbf{X}\boldsymbol\beta = (\mathbf{y}^T\mathbf{X}\boldsymbol\beta)^T = \boldsymbol\beta^T\mathbf{X}^T\mathbf{y}$. Plugging this equality into Equation A.16, we have

$$
\text{SSR} = \mathbf{y}^T\mathbf{y} - 2\boldsymbol\beta^T\mathbf{X}^T\mathbf{y} + \boldsymbol\beta^T\mathbf{X}^T\mathbf{X}\boldsymbol\beta \tag{A.17}
$$

Next, we find the partial derivative of Equation A.17 with respect to $\boldsymbol\beta$.

Let $\mathbf{x} = [x_1 \; x_2 \; \cdots \; x_k]^T$ be a $k \times 1$ vector and $f(\mathbf{x})$ be a function of $\mathbf{x}$.

Then $\nabla_{\mathbf{x}} f$, the gradient of $f$ with respect to $\mathbf{x}$, is

$$
\nabla_{\mathbf{x}} f = \left[\frac{\partial f}{\partial x_1} \; \frac{\partial f}{\partial x_2} \; \cdots \; \frac{\partial f}{\partial x_k}\right]^T
$$


Property 1

Let $\mathbf{x}$ be a $k \times 1$ vector and $\mathbf{z}$ be a $k \times 1$ vector, such that $\mathbf{z}$ is not a function of $\mathbf{x}$. The gradient of $\mathbf{x}^T\mathbf{z}$ with respect to $\mathbf{x}$ is

$$
\nabla_{\mathbf{x}}\, \mathbf{x}^T\mathbf{z} = \mathbf{z}
$$


Property 2

Let $\mathbf{x}$ be a $k \times 1$ vector and $\mathbf{A}$ be a $k \times k$ matrix, such that $\mathbf{A}$ is not a function of $\mathbf{x}$. The gradient of $\mathbf{x}^T\mathbf{A}\mathbf{x}$ with respect to $\mathbf{x}$ is

$$
\nabla_{\mathbf{x}}\, \mathbf{x}^T\mathbf{A}\mathbf{x} = (\mathbf{A}\mathbf{x} + \mathbf{A}^T\mathbf{x}) = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}
$$

If $\mathbf{A}$ is symmetric, then $(\mathbf{A} + \mathbf{A}^T)\mathbf{x} = 2\mathbf{A}\mathbf{x}$.

Using these matrix calculus properties, the partial derivative of Equation A.17 with respect to $\boldsymbol\beta$ is

$$
\frac{\partial\, \text{SSR}}{\partial \boldsymbol\beta} = \frac{\partial}{\partial \boldsymbol\beta}\left(\mathbf{y}^T\mathbf{y} - 2\boldsymbol\beta^T\mathbf{X}^T\mathbf{y} + \boldsymbol\beta^T\mathbf{X}^T\mathbf{X}\boldsymbol\beta\right) = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol\beta \tag{A.18}
$$

Note that XTX is symmetric.

Thus, the least-squares estimator is the β^ that satisfies

$$
-2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\hat{\boldsymbol\beta} = \mathbf{0} \tag{A.19}
$$

The steps to find this $\hat{\boldsymbol\beta}$ are below.

$$
\begin{aligned}
-2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\hat{\boldsymbol\beta} &= \mathbf{0} \\
2\mathbf{X}^T\mathbf{y} &= 2\mathbf{X}^T\mathbf{X}\hat{\boldsymbol\beta} \\
\mathbf{X}^T\mathbf{y} &= \mathbf{X}^T\mathbf{X}\hat{\boldsymbol\beta} \\
(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\hat{\boldsymbol\beta} \\
(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} &= \hat{\boldsymbol\beta}
\end{aligned} \tag{A.20}
$$

Similar to Section A.1, we check the second derivative to confirm that we have found a minimum. In matrix representation, the second derivative is the Hessian matrix.

The Hessian matrix, $\nabla_{\mathbf{x}}^2 f$, is a $k \times k$ matrix of second partial derivatives:

$$
\nabla_{\mathbf{x}}^2 f = \begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_k} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_k} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_k \partial x_1} & \frac{\partial^2 f}{\partial x_k \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_k^2}
\end{bmatrix}
$$

If the Hessian matrix is…

  • positive definite, then we found a minimum.

  • negative definite, then we found a maximum.

  • neither, then we found a saddle point.

Thus, the Hessian of SSR is

$$
\frac{\partial^2\, \text{SSR}}{\partial \boldsymbol\beta^2} = \frac{\partial}{\partial \boldsymbol\beta}\left(-2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol\beta\right) = 2\mathbf{X}^T\mathbf{X} \tag{A.21}
$$

Equation A.21 is proportional to $\mathbf{X}^T\mathbf{X}$, which is a positive definite matrix (when $\mathbf{X}$ has full column rank). Therefore, we found a minimum.

The least-squares estimator in matrix notation is

$$
\hat{\boldsymbol\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}
$$
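The estimator above can be checked numerically against a library least-squares solver. The following is a minimal sketch using NumPy and simulated data (the true coefficient vector is illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 3
# Design matrix: column of 1's for the intercept, then p predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta + rng.normal(scale=0.3, size=n)

# Normal-equations estimator: (X^T X)^{-1} X^T y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Library least-squares solution for comparison
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

The two agree to floating-point precision here. In practice, explicit inversion of $\mathbf{X}^T\mathbf{X}$ is avoided for numerical stability; solvers like `np.linalg.lstsq` or `np.linalg.solve` are preferred.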

A.3.2 Geometry of least-squares regression

In Section A.3.1, we used matrix calculus to find $\hat{\boldsymbol\beta}$, the least-squares estimator of the model coefficients. In this section, we present the geometry of least-squares regression to derive the estimator $\hat{\boldsymbol\beta}$. Figure A.1 is a visualization of the geometric representation of least-squares regression.

Figure A.1: Geometry of least-squares regression

Let $\text{Col}(\mathbf{X})$ be the column space of the design matrix $\mathbf{X}$: the set of all possible linear combinations of the columns of $\mathbf{X}$. We cannot derive the values of $\mathbf{y}$ using only linear combinations of the columns of $\mathbf{X}$ (recall the error term $\boldsymbol\epsilon$ in Equation A.12). Therefore, $\mathbf{y}$ is not in $\text{Col}(\mathbf{X})$. We want to find the vector $\mathbf{X}\mathbf{b}$ in $\text{Col}(\mathbf{X})$ that minimizes the distance to $\mathbf{y}$. We call the vector $\mathbf{X}\mathbf{b}$ a projection of $\mathbf{y}$ onto $\text{Col}(\mathbf{X})$.

For any vector $\mathbf{X}\mathbf{b}$ in $\text{Col}(\mathbf{X})$, the vector $\mathbf{e} = \mathbf{y} - \mathbf{X}\mathbf{b}$ is the difference between $\mathbf{y}$ and the projection $\mathbf{X}\mathbf{b}$. We know $\mathbf{X}$ and $\mathbf{y}$ from the data, so we need to find the $\mathbf{b}$ that minimizes the length of $\mathbf{e} = \mathbf{y} - \mathbf{X}\mathbf{b}$. The length is minimized when $\mathbf{e}$ is orthogonal (also called normal) to $\text{Col}(\mathbf{X})$.

If $\mathbf{A}$, an $n \times k$ matrix, is orthogonal to an $n \times 1$ vector $\mathbf{c}$, then $\mathbf{A}^T\mathbf{c} = \mathbf{0}$.

Because $\mathbf{e}$ is orthogonal to $\text{Col}(\mathbf{X})$, we have $\mathbf{X}^T\mathbf{e} = \mathbf{0}$. We plug in $\mathbf{e} = \mathbf{y} - \mathbf{X}\mathbf{b}$ and solve for $\mathbf{b}$.

$$
\begin{aligned}
\mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{b}) &= \mathbf{0} \\
\mathbf{X}^T\mathbf{y} - \mathbf{X}^T\mathbf{X}\mathbf{b} &= \mathbf{0} \\
\mathbf{X}^T\mathbf{y} &= \mathbf{X}^T\mathbf{X}\mathbf{b} \\
(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} &= \mathbf{b}
\end{aligned} \tag{A.22}
$$

Using the geometric interpretation of least-squares regression, we found that the vector $\mathbf{b}$ that minimizes the length of $\mathbf{e} = \mathbf{y} - \mathbf{X}\mathbf{b}$ is $\hat{\mathbf{b}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$. This is equal to the estimator found in Section A.3.1.
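The orthogonality condition can also be verified numerically: after solving the normal equations, the residual vector is (up to floating-point error) orthogonal to every column of the design matrix. A short sketch with NumPy and simulated data:

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.5, -0.25]) + rng.normal(scale=0.2, size=n)

# Solve the normal equations X^T X b = X^T y for the projection coefficients
b = np.linalg.solve(X.T @ X, X.T @ y)

# Residual vector e = y - Xb should be orthogonal to Col(X)
e = y - X @ b
ortho = X.T @ e  # should be (numerically) the zero vector
```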

A.4 Hat matrix

The fitted values of least-squares regression are $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol\beta}$. Plugging in $\hat{\boldsymbol\beta}$ from Section A.3.1, we have

$$
\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol\beta} = \underbrace{\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T}_{\mathbf{H}}\mathbf{y} \tag{A.23}
$$

From Equation A.23, $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is called the hat matrix. The hat matrix is an $n \times n$ matrix that maps $\mathbf{y}$, the vector of observed responses, onto $\hat{\mathbf{y}}$, the vector of fitted values ($\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$). More precisely, $\mathbf{H}$ is a projection matrix that projects $\mathbf{y}$ onto the column space $\text{Col}(\mathbf{X})$ (see Section A.3.2). Because $\mathbf{H}$ is a projection matrix, it has the following properties:

  • $\mathbf{H}$ is symmetric ($\mathbf{H}^T = \mathbf{H}$).

  • $\mathbf{H}$ is idempotent ($\mathbf{H}^2 = \mathbf{H}$).

  • If a vector $\mathbf{v}$ is in $\text{Col}(\mathbf{X})$, then $\mathbf{H}\mathbf{v} = \mathbf{v}$.

  • If a vector $\mathbf{v}$ is orthogonal to $\text{Col}(\mathbf{X})$, then $\mathbf{H}\mathbf{v} = \mathbf{0}$.

The hat matrix depends only on the design matrix $\mathbf{X}$, i.e., only on the values of the predictors; it does not depend on the vector of responses. The diagonal elements of $\mathbf{H}$ are the leverage values. More specifically, $h_{ii}$ is the leverage for the $i$th observation. Recall that in simple linear regression, the leverage is a measure of how far an observation’s value of the predictor is from the average value of the predictor across all observations:

$$
h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n}(x_j - \bar{x})^2} \tag{A.24}
$$

In multiple linear regression, the leverage is a measure of how far the ith observation is from the center of the x space. It is computed as

$$
h_{ii} = \mathbf{x}_i^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i \tag{A.25}
$$

where $\mathbf{x}_i^T$ is the $i$th row of the design matrix $\mathbf{X}$.

The sum of the leverages across all observations is $p+1$, where $p$ is the number of predictors in the model. Thus, $\text{rank}(\mathbf{H}) = \sum_{i=1}^{n} h_{ii} = p+1$. The average leverage is $\bar{h} = \frac{p+1}{n}$. Observations with leverage greater than $2\bar{h}$ are considered to have large leverage.

Using H, we can write the residuals as

$$
\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y} - \mathbf{H}\mathbf{y} = (\mathbf{I} - \mathbf{H})\mathbf{y} \tag{A.26}
$$

Equation A.26 shows one feature of observations with large leverage: they tend to have small residuals. In other words, the model tends to be pulled towards observations with large leverage.
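The hat matrix properties above are straightforward to confirm numerically. The following sketch, assuming NumPy and simulated data, checks symmetry, idempotence, $\mathbf{H}\mathbf{v} = \mathbf{v}$ for a column of $\mathbf{X}$, and that the leverages sum to $p+1$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)

# Hat matrix H = X (X^T X)^{-1} X^T
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)          # h_ii values; their sum should be p + 1

# Residuals via e = (I - H) y
e = (np.eye(n) - H) @ y
```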

A.5 Assumptions of linear regression

We previously introduced four assumptions of linear regression. Given the matrix form of the linear regression model in Equation A.12, $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\epsilon$ such that $\boldsymbol\epsilon \sim N(\mathbf{0}, \sigma_\epsilon^2\mathbf{I})$, these assumptions are the following:

  1. The distribution of the response $\mathbf{y}$ given $\mathbf{X}$ is normal.
  2. The expected value of $\mathbf{y}$ is $\mathbf{X}\boldsymbol\beta$; there is a linear relationship between the response and predictor variables.
  3. The variance of $\mathbf{y}$ given $\mathbf{X}$ is $\sigma_\epsilon^2\mathbf{I}$.
  4. The error terms in $\boldsymbol\epsilon$ are independent. This also means the observations are independent of one another.

From these assumptions, we can write the distribution of $\mathbf{y}$ given the regression model:

$$
\mathbf{y} \mid \mathbf{X} \sim N(\mathbf{X}\boldsymbol\beta,\; \sigma_\epsilon^2\mathbf{I}) \tag{A.27}
$$

Assumption 4 follows directly from the distribution of the error terms. Here we will show Assumptions 1–3.

Suppose $\mathbf{z}$ is a (multivariate) normal random variable such that $\mathbf{z} \sim N(\boldsymbol\mu, \boldsymbol\Sigma)$. A linear transformation of $\mathbf{z}$ is also multivariate normal, such that

$$
\mathbf{A}\mathbf{z} + \mathbf{B} \sim N(\mathbf{A}\boldsymbol\mu + \mathbf{B},\; \mathbf{A}\boldsymbol\Sigma\mathbf{A}^T)
$$

The distribution of the error terms $\boldsymbol\epsilon$ is normal, and the vector of responses $\mathbf{y}$ is a linear combination of the error terms. More specifically, $\mathbf{y}$ is the error terms shifted by $\mathbf{X}\boldsymbol\beta$. Thus, from the math property above, we know that $\mathbf{y}$ is normally distributed.

Expected value of a vector

Let $\mathbf{z} = [z_1 \; \cdots \; z_p]^T$ be a $p \times 1$ vector of random variables. Then

$$
E(\mathbf{z}) = E[z_1 \; \cdots \; z_p]^T = [E(z_1) \; \cdots \; E(z_p)]^T
$$


Let $\mathbf{A}$ be an $n \times p$ matrix of constants, $\mathbf{C}$ an $n \times 1$ vector of random variables, and $\mathbf{z}$ a $p \times 1$ vector of random variables. Then

$$
E(\mathbf{A}\mathbf{z} + \mathbf{C}) = \mathbf{A}E(\mathbf{z}) + E(\mathbf{C})
$$

Next, let’s show Assumption 2, that the expected value of $\mathbf{y}$ is $\mathbf{X}\boldsymbol\beta$ in linear regression. We could get the expected value of $\mathbf{y}$ from the mathematical properties used for Assumption 1. Here, we show Assumption 2 directly from the properties of expected values.

$$
E(\mathbf{y}) = E(\mathbf{X}\boldsymbol\beta + \boldsymbol\epsilon) = E(\mathbf{X}\boldsymbol\beta) + E(\boldsymbol\epsilon) = \mathbf{X}\boldsymbol\beta + \mathbf{0} = \mathbf{X}\boldsymbol\beta \tag{A.28}
$$

Let $\mathbf{z}$ be a $p \times 1$ vector of random variables. Then

$$
\text{Var}(\mathbf{z}) = E\left[(\mathbf{z} - E(\mathbf{z}))(\mathbf{z} - E(\mathbf{z}))^T\right]
$$

Lastly, we show Assumption 3, that $\text{Var}(\mathbf{y}) = \sigma_\epsilon^2\mathbf{I}$ in linear regression. Similar to the expected value, we could get the variance from the mathematical property used for Assumption 1. Here, we show Assumption 3 directly from the properties of variance.

$$
\begin{aligned}
\text{Var}(\mathbf{y}) &= E\left[(\mathbf{y} - E(\mathbf{y}))(\mathbf{y} - E(\mathbf{y}))^T\right] \\
&= E\left[(\mathbf{X}\boldsymbol\beta + \boldsymbol\epsilon - \mathbf{X}\boldsymbol\beta)(\mathbf{X}\boldsymbol\beta + \boldsymbol\epsilon - \mathbf{X}\boldsymbol\beta)^T\right] \\
&= E\left[\boldsymbol\epsilon\boldsymbol\epsilon^T\right] \\
&= E\left[(\boldsymbol\epsilon - E(\boldsymbol\epsilon))(\boldsymbol\epsilon - E(\boldsymbol\epsilon))^T\right] \qquad (\text{given } E(\boldsymbol\epsilon) = \mathbf{0}) \\
&= \text{Var}(\boldsymbol\epsilon) \\
&= \sigma_\epsilon^2\mathbf{I}
\end{aligned} \tag{A.29}
$$

A.6 Distribution of model coefficients

We previously introduced the distribution of a single model coefficient, $\hat\beta_j \sim N(\beta_j, SE_{\hat\beta_j}^2)$. In matrix notation, the distribution of all the estimated coefficients $\hat{\boldsymbol\beta}$ is

$$
\hat{\boldsymbol\beta} \sim N\left(\boldsymbol\beta,\; \sigma_\epsilon^2(\mathbf{X}^T\mathbf{X})^{-1}\right) \tag{A.30}
$$

Similar to Section A.5, let’s derive each part of this distribution. We’ll start with $E(\hat{\boldsymbol\beta}) = \boldsymbol\beta$.

$$
\begin{aligned}
E(\hat{\boldsymbol\beta}) &= E\left((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\right) \\
&= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T E(\mathbf{y}) \\
&= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T E(\mathbf{X}\boldsymbol\beta + \boldsymbol\epsilon) \\
&= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T E(\mathbf{X}\boldsymbol\beta) + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T E(\boldsymbol\epsilon) \\
&= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\boldsymbol\beta + \mathbf{0} \\
&= \boldsymbol\beta
\end{aligned} \tag{A.31}
$$

Next, we show that $\text{Var}(\hat{\boldsymbol\beta}) = \sigma_\epsilon^2(\mathbf{X}^T\mathbf{X})^{-1}$.

$$
\begin{aligned}
\text{Var}(\hat{\boldsymbol\beta}) &= E\left[(\hat{\boldsymbol\beta} - E(\hat{\boldsymbol\beta}))(\hat{\boldsymbol\beta} - E(\hat{\boldsymbol\beta}))^T\right] \\
&= E\left[\left((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} - \boldsymbol\beta\right)\left((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} - \boldsymbol\beta\right)^T\right] \\
&= E\left[\left((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\boldsymbol\beta + \boldsymbol\epsilon) - \boldsymbol\beta\right)\left((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\boldsymbol\beta + \boldsymbol\epsilon) - \boldsymbol\beta\right)^T\right] \\
&= E\left[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol\epsilon\boldsymbol\epsilon^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\right] \qquad (\text{after distributing and simplifying}) \\
&= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\underbrace{E(\boldsymbol\epsilon\boldsymbol\epsilon^T)}_{\sigma_\epsilon^2\mathbf{I}}\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} \\
&= \sigma_\epsilon^2(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} \\
&= \sigma_\epsilon^2(\mathbf{X}^T\mathbf{X})^{-1}
\end{aligned} \tag{A.32}
$$

Lastly, we show that the distribution of $\hat{\boldsymbol\beta}$ is normal. We’ll start by rewriting $\hat{\boldsymbol\beta}$ in terms of $\boldsymbol\beta$.

$$
\hat{\boldsymbol\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\boldsymbol\beta + \boldsymbol\epsilon) = \boldsymbol\beta + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol\epsilon \tag{A.33}
$$

From Equation A.33, we see that $\hat{\boldsymbol\beta}$ is $\boldsymbol\beta$ plus a linear combination of the error terms $\boldsymbol\epsilon$. From the linear-transformation property of the multivariate normal distribution in Section A.5, because $\boldsymbol\epsilon$ is normally distributed, this linear combination is also normally distributed. Thus, the distribution of the coefficients $\hat{\boldsymbol\beta}$ is normal.
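The mean and variance in Equation A.30 can be illustrated by simulation: repeatedly drawing new error vectors for a fixed design matrix and re-estimating the coefficients, then comparing the empirical mean and covariance of the estimates with the theoretical values. A sketch assuming NumPy (the true coefficients and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma = 100, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # fixed design
beta = np.array([1.0, 2.0])

# Re-fit the model on many simulated response vectors
estimates = []
for _ in range(5000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    estimates.append(np.linalg.solve(X.T @ X, X.T @ y))
estimates = np.asarray(estimates)

emp_mean = estimates.mean(axis=0)                 # should approach beta
emp_cov = np.cov(estimates.T)                     # should approach theory_cov
theory_cov = sigma**2 * np.linalg.inv(X.T @ X)    # sigma^2 (X^T X)^{-1}
```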

A.7 Multicollinearity

Recall the design matrix for a linear regression model with $p$ predictors:

$$
\mathbf{X} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix} \tag{A.34}
$$

In linear regression, the design matrix $\mathbf{X}$ is assumed to have full column rank, $\text{rank}(\mathbf{X}) = p+1$. This means there are no linear dependencies among the columns, and thus no column is a perfect linear combination of the others. The equation for the least-squares estimator $\hat{\boldsymbol\beta}$ includes the term $(\mathbf{X}^T\mathbf{X})^{-1}$. If $\mathbf{X}$ is not full rank, then $\mathbf{X}^T\mathbf{X}$ is not full rank and is therefore singular (not invertible). Thus, if there are linear dependencies in $\mathbf{X}$, we are unable to compute the least-squares estimator $\hat{\boldsymbol\beta}$.

Let $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p$ be a sequence of vectors. Then $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p$ are linearly dependent if there exist scalars $a_1, a_2, \ldots, a_p$, not all 0, such that

$$
a_1\mathbf{x}_1 + a_2\mathbf{x}_2 + \cdots + a_p\mathbf{x}_p = \mathbf{0}
$$

In practice, we rarely have perfect linear dependencies. (In fact, perfect dependence is mathematically why we only include $k-1$ indicator terms in the model for a categorical predictor with $k$ levels.) Ideally, the columns of $\mathbf{X}$ would be orthogonal, indicating the predictors are completely independent of one another. In practice, we expect some dependence between predictors (we see this from the non-zero off-diagonal elements of $\text{Var}(\hat{\boldsymbol\beta})$). If two or more variables are highly correlated, there will be near-linear dependence among the columns. This near-linear dependence is called multicollinearity.

We have previously discussed the practical issues that come from the presence of multicollinearity. These primarily stem from the fact that when there is multicollinearity, $\mathbf{X}^T\mathbf{X}$ is near-singular (almost a singular matrix), making the entries of $(\mathbf{X}^T\mathbf{X})^{-1}$ very large and unstable. Consequently, $\text{Var}(\hat{\boldsymbol\beta})$ is large and unstable, making it hard to find stable values for the least-squares estimators.
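This instability can be seen directly by comparing a design matrix with a nearly collinear predictor against one with an independent predictor. The sketch below, assuming NumPy and simulated data, compares the condition number of $\mathbf{X}^T\mathbf{X}$ and the diagonal of $(\mathbf{X}^T\mathbf{X})^{-1}$ (which scales $\text{Var}(\hat{\boldsymbol\beta})$) in the two cases:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
x1 = rng.normal(size=n)

# Nearly collinear predictor: x2 is x1 plus tiny noise
x2 = x1 + rng.normal(scale=1e-6, size=n)
X_collinear = np.column_stack([np.ones(n), x1, x2])

# Independent second predictor for comparison
X_indep = np.column_stack([np.ones(n), x1, rng.normal(size=n)])

cond_collinear = np.linalg.cond(X_collinear.T @ X_collinear)
cond_indep = np.linalg.cond(X_indep.T @ X_indep)

# Diagonal of (X^T X)^{-1}: explodes under near-collinearity
var_scale_collinear = np.diag(np.linalg.inv(X_collinear.T @ X_collinear))
var_scale_indep = np.diag(np.linalg.inv(X_indep.T @ X_indep))
```

The condition number and the coefficient-variance scale are many orders of magnitude larger in the collinear case.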

A.8 Maximum Likelihood Estimation

We previously introduced the likelihood function to understand the model performance statistics AIC and BIC. We also use a likelihood function to estimate the coefficients in logistic regression. The likelihood function is a measure of how likely it is that we observe our data given particular value(s) of the model parameters. When working with likelihood functions, the data are fixed (our observed sample data) and we can try out different values for the model parameters ($\boldsymbol\beta$ and $\sigma_\epsilon^2$ in the context of regression).

Let $\mathbf{z}$ be a $p \times 1$ vector of random variables, such that $\mathbf{z}$ follows a multivariate normal distribution with mean $\boldsymbol\mu$ and variance $\boldsymbol\Sigma$. Then the probability density function of $\mathbf{z}$ is

$$
f(\mathbf{z}) = \frac{1}{(2\pi)^{p/2}|\boldsymbol\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(\mathbf{z} - \boldsymbol\mu)^T\boldsymbol\Sigma^{-1}(\mathbf{z} - \boldsymbol\mu)\right\}
$$

In Section A.5, we showed that the vector of responses can be written as $\mathbf{y} \mid \mathbf{X} \sim N(\mathbf{X}\boldsymbol\beta, \sigma_\epsilon^2\mathbf{I})$. Using this distribution to describe the relationship between the response and predictor variables, the likelihood function for the regression parameters $\boldsymbol\beta$ and $\sigma_\epsilon^2$ is

$$
L(\boldsymbol\beta, \sigma_\epsilon^2 \mid \mathbf{X}, \mathbf{y}) = \frac{1}{(2\pi)^{n/2}\sigma_\epsilon^n}\exp\left\{-\frac{1}{2\sigma_\epsilon^2}(\mathbf{y} - \mathbf{X}\boldsymbol\beta)^T(\mathbf{y} - \mathbf{X}\boldsymbol\beta)\right\} \tag{A.35}
$$

where $n$ is the number of observations, the mean is $\mathbf{X}\boldsymbol\beta$, and the variance is $\sigma_\epsilon^2\mathbf{I}$.

The data $\mathbf{y}$ and $\mathbf{X}$ are fixed (we use the same sample data in the analysis!), so we can plug in values for $\boldsymbol\beta$ and $\sigma_\epsilon^2$ to determine the likelihood of those parameter values given the observed data. In Section A.3, we used least-squares estimation (minimizing SSR) to find the estimated coefficients $\hat{\boldsymbol\beta}$. Another approach to estimating $\boldsymbol\beta$ (and $\sigma_\epsilon^2$) is to find the $\hat{\boldsymbol\beta}$ (and $\hat\sigma_\epsilon^2$) that maximize the likelihood function in Equation A.35. This is called maximum likelihood estimation.

To make the calculations more manageable, instead of maximizing the likelihood function, we will instead maximize its logarithm, i.e. the log-likelihood function. The values of the parameters that maximize the log-likelihood function are those that maximize the likelihood function. The log-likelihood function is

$$
\log L(\boldsymbol\beta, \sigma_\epsilon^2 \mid \mathbf{X}, \mathbf{y}) = -\frac{n}{2}\log(2\pi) - n\log(\sigma_\epsilon) - \frac{1}{2\sigma_\epsilon^2}(\mathbf{y} - \mathbf{X}\boldsymbol\beta)^T(\mathbf{y} - \mathbf{X}\boldsymbol\beta) \tag{A.36}
$$

Given a fixed value of $\sigma_\epsilon^2$, the log-likelihood in Equation A.36 is maximized when $(\mathbf{y} - \mathbf{X}\boldsymbol\beta)^T(\mathbf{y} - \mathbf{X}\boldsymbol\beta)$ is minimized. This is equivalent to minimizing the SSR in Equation A.15. Thus, the least-squares estimator of $\boldsymbol\beta$ is also the maximum likelihood estimator ($\hat{\boldsymbol\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$) when the error terms are normally distributed as in Equation A.12.
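This equivalence can be illustrated numerically: evaluating the Gaussian log-likelihood of Equation A.36 at the least-squares solution yields a higher value than at any perturbed coefficient vector. The following sketch assumes NumPy and uses simulated data (the true coefficients, noise level, and perturbation scale are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.4, size=n)

def log_lik(beta, sigma):
    # Gaussian log-likelihood of Equation A.36
    resid = y - X @ beta
    return (-n / 2 * np.log(2 * np.pi) - n * np.log(sigma)
            - resid @ resid / (2 * sigma**2))

beta_ls = np.linalg.solve(X.T @ X, X.T @ y)
sigma_hat = np.sqrt(np.mean((y - X @ beta_ls) ** 2))

ll_ls = log_lik(beta_ls, sigma_hat)
# Best log-likelihood found over 100 random perturbations of beta_ls
ll_perturbed = max(
    log_lik(beta_ls + rng.normal(scale=0.1, size=2), sigma_hat)
    for _ in range(100)
)
```

Because SSR is strictly minimized at the least-squares solution, every perturbation produces a strictly lower log-likelihood.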

We previously found $\hat{\boldsymbol\beta}$ using least-squares estimation and the geometry of regression, so why does it matter that $\hat{\boldsymbol\beta}$ is also the maximum likelihood estimator? First, maximum likelihood estimation is used to find the coefficients for generalized linear models and many other statistical models that go beyond linear regression, such as logistic regression. In fact, Casella and Berger stated that maximum likelihood estimation is “by far, the most popular technique for deriving estimators” (p. 315). Second, maximum likelihood estimators have nice statistical properties. These properties are beyond the scope of this text, but we refer interested readers to Chapter 7 of Casella and Berger for an in-depth discussion of maximum likelihood estimators and their properties.

A.9 Variable transformations

We previously introduced regression models with transformations on the response and/or predictor variables. Here we share some of the mathematical details behind the interpretation of the coefficients in these models.

A.9.1 Transformation on the response variable

Consider a linear regression model with a log transformation on the response variable:

$$
\log(Y) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon \tag{A.37}
$$

In this text, $\log(a)$ is the natural log of $a$ (also denoted as $\ln(a)$). Recall the following rules of logarithms and exponents:

  • $\log(a) + \log(b) = \log(ab)$

  • $\log(a) - \log(b) = \log\left(\frac{a}{b}\right)$

  • $e^{\log(a)} = a$

  • $e^{b\log(a)} = a^b$

From Equation A.37, the change in $\log(Y)$ when $X_j$ increases by 1 unit is $\beta_j$. Thus $\beta_j = \log(Y|X_j+1) - \log(Y|X_j)$, where $Y|X_j+1$ is the value of $Y$ when we plug in $X_j + 1$ and $Y|X_j$ is the value of $Y$ when we plug in $X_j$. In practice, we interpret the model coefficients $\beta_j$ in terms of the original variable $Y$, so we can use the rules of logarithms and exponents to rewrite this in terms of the original units of the response variable.

$$
\begin{aligned}
\beta_j &= \log(Y|X_j+1) - \log(Y|X_j) = \log\left(\frac{Y|X_j+1}{Y|X_j}\right) \\
e^{\beta_j} \times (Y|X_j) &= Y|X_j+1
\end{aligned}
$$

Thus, given the model in Equation A.37, when $X_j$ increases by 1 unit, we expect $Y$ to multiply by a factor of $e^{\beta_j}$, assuming all other predictors are held constant.
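The multiplicative interpretation can be confirmed numerically: fit a line to $(x, \log y)$ and check that a 1-unit increase in $x$ multiplies the fitted $y$ by $e^{\hat\beta_1}$. A sketch using NumPy and simulated data (the true coefficients 0.5 and 0.3 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
x = rng.uniform(0, 5, size=n)
# Simulate from log(y) = 0.5 + 0.3 x + noise
y = np.exp(0.5 + 0.3 * x + rng.normal(scale=0.05, size=n))

# Fit the log-response model
b1, b0 = np.polyfit(x, np.log(y), deg=1)

def pred(x0):
    # Fitted value of y (back-transformed from the log scale)
    return np.exp(b0 + b1 * x0)

# Ratio of fitted y at x+1 versus x: should equal e^{b1} for any x
ratio = pred(3.0) / pred(2.0)
```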

A.9.2 Transformation on predictor variable(s)

Next, we consider models that have a log transformation on a predictor variable.

$$
Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_j\log(X_j) + \cdots + \beta_p X_p + \epsilon \tag{A.38}
$$

Now, because the predictor $X_j$ has been transformed to the logarithmic scale, we write interpretations in terms of a multiplicative change in $X_j$. More specifically, for a constant $C$, $\log(CX_j) = \log(C) + \log(X_j)$. Let $Y|CX_j$ be the expected value of $Y$ when we plug in $CX_j$ and $Y|X_j$ be the expected value of $Y$ when we plug in $X_j$. Assuming all other predictors are held constant, we have

$$
\begin{aligned}
(Y|CX_j) - (Y|X_j) &= \beta_j\log(CX_j) - \beta_j\log(X_j) \\
&= \beta_j\left[\log(CX_j) - \log(X_j)\right] \\
&= \beta_j\left[\log(C) + \log(X_j) - \log(X_j)\right] \\
&= \beta_j\log(C) \\
Y|CX_j &= (Y|X_j) + \beta_j\log(C)
\end{aligned}
$$

Thus, given the model in Equation A.38, when the predictor $X_j$ is multiplied by a constant $C$, we expect $Y$ to change (increase or decrease) by $\beta_j\log(C)$, holding all other predictors constant.

A.9.3 Transformation on response and predictor variables

Lastly, we show the mathematics behind the interpretation of a model coefficient $\beta_j$ for linear regression models with a log transformation on both the response variable and a predictor variable. As in Section A.9.1, we want to write the interpretation in terms of the original response variable $Y$.

$$
\log(Y) = \beta_0 + \beta_1 X_1 + \cdots + \beta_j\log(X_j) + \cdots + \beta_p X_p \tag{A.39}
$$

Combining the results from Sections A.9.1 and A.9.2 and holding all other predictors constant, we have

$$
\begin{aligned}
\log(Y|CX_j) - \log(Y|X_j) &= \beta_j\log(C) \\
\log\left(\frac{Y|CX_j}{Y|X_j}\right) &= \log\left(C^{\beta_j}\right) \\
\frac{Y|CX_j}{Y|X_j} &= C^{\beta_j} \\
Y|CX_j &= (Y|X_j)\, C^{\beta_j}
\end{aligned} \tag{A.40}
$$

Therefore, given the model in Equation A.39 with a transformed response variable and transformed predictor, when $X_j$ is multiplied by a constant $C$, $Y$ is expected to multiply by $C^{\beta_j}$, holding all other predictors constant.
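The log-log interpretation can be confirmed the same way: fit a line to $(\log x, \log y)$ and check that multiplying $x$ by $C$ multiplies the fitted $y$ by $C^{\hat\beta_j}$. A sketch using NumPy and simulated data (the true coefficients 1.0 and 0.7, and the constant $C = 2$, are illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 500
x = rng.uniform(1, 10, size=n)
# Simulate from log(y) = 1.0 + 0.7 log(x) + noise
y = np.exp(1.0 + 0.7 * np.log(x) + rng.normal(scale=0.05, size=n))

# Fit the log-log model
bj, b0 = np.polyfit(np.log(x), np.log(y), deg=1)

def pred(x0):
    # Fitted value of y (back-transformed from the log scale)
    return np.exp(b0 + bj * np.log(x0))

C = 2.0
# Ratio of fitted y at C*x versus x: should equal C**bj for any x
ratio = pred(C * 3.0) / pred(3.0)
```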


  1. This is the span of X.↩︎

  2. Note that when there is a single predictor, the estimators from Section A.1 and Section A.3 produce the same result.↩︎

  3. Note that the likelihood function is not the same as a probability function. In a probability function, we fix the model parameters and plug in different values for the data.↩︎