7  Multiple linear regression

7.1 Learning objectives

  • Understand how multiple predictor variables can be used to explain variability in a response variable
  • Interpret the model coefficients
  • Calculate predicted values from the multiple linear regression model
  • Use different types of predictors in the regression model
  • Fit and interpret models with a transformed response and/or predictor variable

7.2 Software and packages
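The examples in this chapter use functions from several R packages. The exact package list is not shown here, but a minimal setup consistent with the functions used later in the chapter (lm from base R, tidy, and kable) would be:

# Load packages (assumed based on the functions used later in this chapter)
library(tidyverse)   # data wrangling and visualization
library(broom)       # tidy() for restructuring model output
library(knitr)       # kable() for formatted tables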

7.3 Introduction: College alumni pay

Most university graduates get a job within the first few years of graduating college, so a natural question that may arise is “what is the expected pay for recent graduates from a university?” In this chapter, we will explore that question by looking at characteristics of colleges and universities that may be associated with variability in the expected pay for recent graduates. The data were originally collected from the PayScale College Salary Report (2024) and were featured as part of the TidyTuesday “College tuition, diversity, and pay” weekly data visualization challenge.

The data used in this chapter also include the United States region for each institution, obtained from the state.region data frame in R.

This analysis will focus on the following variables:

Response

  • early_career_pay: Median salary for alumni with 0 - 5 years of work experience (in thousands of US Dollars)

Predictors

  • region: Geographic regions in the United States (North Central, Northeast, South, West)

  • type: Type of institution (Public, Private)

  • stem_percent: Percent of the institution’s student body pursuing degrees in STEM fields (Science, Technology, Engineering, Mathematics)

  • make_world_better_percent: Percent of current alumni who say they are making the world a better place.

7.3.1 Exploratory data analysis

The distribution and summary statistics for the response variable early_career_pay are shown in Figure 7.1 and Table 7.1, respectively. The center of the distribution is around the median of $49,700, and the middle 50% of the distribution has a spread of about $9,500 (IQR). The distribution has a slight right skew, with some institutions having high median early career pay of over $80,000.

Figure 7.1: Distribution of early career pay
Table 7.1: Summary statistics for early career pay (in thousands of US dollars)

| Variable | Mean | SD | Min | Q1 | Median (Q2) | Q3 | Max | Missing |
|---|---|---|---|---|---|---|---|---|
| early_career_pay | 51 | 8.2 | 32.5 | 45.5 | 49.7 | 55 | 88.8 | 0 |
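As a sketch, a histogram like the one in Figure 7.1 could be created with ggplot2 (assuming the data are stored in a data frame named alumni, as in the modeling code later in the chapter; the bin width is an arbitrary choice):

# Histogram of median early career pay (one possible version of Figure 7.1)
ggplot(alumni, aes(x = early_career_pay)) +
  geom_histogram(binwidth = 2.5, color = "white") +
  labs(
    x = "Median early career pay (in thousands of US dollars)",
    y = "Number of institutions"
  )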

Now let’s look at the distributions of the primary predictors for this analysis. Visualizations of each predictor’s distribution are shown in Figure 7.2, and summary statistics for the quantitative predictors are in Table 7.2.

Figure 7.2: Distributions of predictor variables
Table 7.2: Summary statistics of quantitative predictors (in percent)

| Variable | Mean | SD | Min | Q1 | Median (Q2) | Q3 | Max | Missing |
|---|---|---|---|---|---|---|---|---|
| make_world_better_percent | 53.6 | 8.8 | 33 | 48 | 52.0 | 58 | 94 | 30 |
| stem_percent | 16.6 | 15.4 | 0 | 7 | 12.5 | 22 | 100 | 0 |

Your turn!

Use the visualizations in Figure 7.2 and summary statistics in Table 7.2 to describe the distribution of each predictor variable.

Now, let’s conduct bivariate exploratory data analysis in Figure 7.3 and explore the relationship between the response and each of the predictor variables.

Figure 7.3: Relationships between early career pay and each predictor variable

From Figure 7.3, we see that there is generally a negative relationship between an institution’s percentage of alumni who say they are making the world a better place and the median early career pay. In contrast, there is generally a positive relationship between the percentage of an institution’s student body pursuing STEM degrees and the median early career pay.

Graduates from institutions in the Northeast region have higher median early career pay compared to students from the other regions. The overlapping boxplots indicate there is generally not much difference in early career pay across the other regions. The median early career pay for students from private institutions is slightly higher than those from public institutions; however, the overlapping boxplots again indicate that this effect may not be strong.

Now that we have a better understanding of the variables in this analysis, we can fit a multiple linear regression model to understand how region, type, percentage of alumni who say they are making the world a better place, and percentage of students pursuing STEM degrees explain variability in the expected early career pay.

7.4 Fitting the regression line

Thus far, we have examined the individual relationships between the response variable and each of the predictors. Now we will use multiple linear regression to fit a single model that helps us understand these relationships after adjusting for the other predictors.

7.4.1 Statistical model

Recall from Section 4.3 that the general form of a model for a response variable \(Y\) is

\[ Y = Model + Error \tag{7.1}\]

In the case of simple linear regression, \(Model = f(X)\) where \(X\) is the predictor variable used to explain the variability in \(Y\). Now we will extend Equation 7.1 to the case in which multiple predictor variables are used in combination to explain variability in the response variable \(Y\). The values of the response variable \(Y\) can be generated based on the following form

\[ \begin{aligned} Y &= Model + Error \\ & = f(\mathbf{X}) + \epsilon \\ & = f(X_1, X_2, \ldots, X_p) + \epsilon \end{aligned} \tag{7.2}\]

where \(\mathbf{X} = X_1, X_2, \ldots, X_p\) are the predictors and \(\epsilon\) is the amount in which the actual value of \(Y\) differs from what is expected from the model \(f(X_1, X_2, \ldots, X_p)\).

Equation 7.2 is the general form of the model. As with simple linear regression, we know the form of the model describing the relationship between the response and predictor variables (shown in Equation 4.1). If there is a linear relationship between \(Y\) and the predictors \(X_1, X_2, \ldots, X_p\), Equation 7.2 can be more specifically written as

\[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p + \epsilon, \hspace{8mm} \epsilon \sim N(0, \sigma^2_{\epsilon}) \tag{7.3}\]

where \(\epsilon \sim N(0, \sigma^2_{\epsilon})\) indicates the error terms are normally distributed with a mean of 0 and variance \(\sigma^2_{\epsilon}\). Equation 7.3 is the form of the population-level statistical model for multiple linear regression.

The model \(f(X_1, X_2, \ldots, X_p)\) outputs the expected value of \(Y\) for a combination of the predictors. This can be written as \(\mu_{Y|X} = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p\). Given this and Equation 7.3, we know the distribution of the response variable \(Y\) given the predictor variables \(X_1, X_2, \ldots, X_p\). More specifically, this distribution is shown in Equation 7.4.

\[ Y|X_1, X_2, \ldots, X_p \sim N(\beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p, \sigma^2_\epsilon) \tag{7.4}\]

As with simple linear regression, we will use what we know from Equation 7.3 and Equation 7.4 to estimate the coefficients for the model \(\beta_0, \beta_1, \ldots, \beta_p\) and the variance of the residuals \(\sigma^2_\epsilon\).
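To make Equation 7.4 concrete, the simulation below generates a response from a model with two predictors. The coefficients, sample size, and error standard deviation are arbitrary values chosen for illustration:

# Simulate Y | X1, X2 ~ N(beta0 + beta1*X1 + beta2*X2, sigma^2)
set.seed(123)
n <- 500
x1 <- runif(n, min = 0, max = 100)    # a quantitative predictor
x2 <- rbinom(n, size = 1, prob = 0.5) # an indicator predictor
beta <- c(50, 0.3, -2)                # made-up true coefficients
sigma <- 5                            # made-up true error SD
y <- beta[1] + beta[2] * x1 + beta[3] * x2 + rnorm(n, mean = 0, sd = sigma)

# The estimated coefficients should be close to the true values (50, 0.3, -2)
coef(lm(y ~ x1 + x2))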

7.4.2 Is multiple linear regression appropriate for the data?

Before we move forward with fitting the regression line and obtaining estimates for the coefficients, we should first ask whether the multiple linear regression model is appropriate or useful for the data. We can ask the same questions we did for simple linear regression; however, now we are considering the relationship between \(Y\) and multiple predictor variables rather than a single predictor.

  • Will a multiple linear regression model be practically useful? Does quantifying and interpreting the relationship between the variables make sense in this scenario?
  • Do the observations in the data represent the population of interest, or are there biases that should be addressed when drawing conclusions from the analysis?
  • Is the shape of the relationship between the response and the predictors approximately linear? Does a line reasonably describe the relationship?

The last question can be more challenging to answer for multiple linear regression than it was in the simple linear case, because it is difficult to visualize the relationship between three or more variables. As we saw in Section 7.3.1, we can visualize the relationship between the response and each individual predictor. This can give us some indication of linearity and whether there are non-linear shapes that need to be accounted for (more on this in later sections). We will primarily rely on the assessment of the model conditions to ultimately determine whether the model is a useful fit for the data.

7.5 Estimating the coefficients

Recall that when we do linear regression, we are estimating the mean value of the response \(Y\) given some linear combination of the predictors \(X_1, X_2, \ldots, X_p\). Thus the predicted value of \(Y\) output from the estimated multiple linear regression model is shown in Equation 7.5.

\[ \hat{Y} = \hat{\mu}_{Y|X} = \hat{\beta}_0 + \hat{\beta}_1X_1 + \hat{\beta}_2X_2 + \dots + \hat{\beta}_pX_p \tag{7.5}\]

The error terms \(\epsilon\) are then estimated by calculating residuals, \(e = y - \hat{y}\), the difference between what is observed in the data and what is expected according to Equation 7.5. When conducting least squares regression (also known as ordinary least squares), the estimated coefficients \(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p\) are the combination of coefficients that minimizes the sum of squared residuals shown in Equation 7.6.

\[ \sum_{i=1}^{n}e_i^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^n(y_i - [\hat{\beta}_0 + \hat{\beta}_1x_{i1} + \dots + \hat{\beta}_px_{ip}])^2 \tag{7.6}\]
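To illustrate what minimizing Equation 7.6 means computationally, the sketch below (reusing the simulated x1, x2, and y from the earlier sketch) treats the sum of squared residuals as a function of candidate coefficients and minimizes it numerically; the result approximately matches the lm estimates:

# Sum of squared residuals as a function of candidate coefficients b
ssr <- function(b) {
  y_hat <- b[1] + b[2] * x1 + b[3] * x2
  sum((y - y_hat)^2)
}

# Numerically minimize the SSR; compare to coef(lm(y ~ x1 + x2))
optim(par = c(0, 0, 0), fn = ssr)$par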

7.5.1 Matrix notation for multiple linear regression

As you may have imagined, the differentiation required to find the values that minimize Equation 7.6 can become complicated, particularly if \(p\) is large, that is, when the model has many terms. We can instead think of Equation 7.3 and Equation 7.5 in terms of matrices and utilize some matrix algebra (and calculus) to obtain the estimated values. Suppose there are \(n\) observations and \(p\) predictors, then

\[ \mathbf{Y} = \begin{bmatrix}y_1 \\ y_2 \\ \vdots \\y_n\end{bmatrix} \hspace{10mm} \mathbf{X} = \begin{bmatrix}1 & x_{11} & x_{12} & \dots & x_{1p} \\1& x_{21} & x_{22} & \dots & x_{2p} \\\vdots & \vdots & \vdots & \ddots & \vdots\\1& x_{n1} & x_{n2} & \dots & x_{np} \end{bmatrix} \]

where \(\mathbf{Y}\) is an \(n \times 1\) vector of the values of the response, and \(\mathbf{X}\) is an \(n \times (p + 1)\) matrix called the design matrix. The first column of \(\mathbf{X}\) is a vector of 1’s that correspond to the intercept \(\beta_0\). The remaining columns are the observed values for each predictor variable. Additionally let

\[ \boldsymbol{\beta}= \begin{bmatrix}\beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} \hspace{5mm}\boldsymbol{\epsilon}= \begin{bmatrix}\epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix} \]

such that \(\boldsymbol{\beta}\) is a \((p + 1) \times 1\) vector of the model coefficients and \(\boldsymbol{\epsilon}\) is an \(n \times 1\) vector of the error terms. We can rewrite Equation 7.3 in matrix notation as

\[ \underbrace{\begin{bmatrix}y_1 \\ y_2 \\ \vdots \\y_n\end{bmatrix}}_{\mathbf{Y}} = \underbrace{\begin{bmatrix}1 & x_{11} & x_{12} & \dots & x_{1p} \\1& x_{21} & x_{22} & \dots & x_{2p} \\\vdots & \vdots & \vdots & \ddots & \vdots\\1& x_{n1} & x_{n2} & \dots & x_{np} \end{bmatrix}}_{\mathbf{X}} \hspace{2mm} \underbrace{\begin{bmatrix}\beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}}_{\boldsymbol{\beta}} + \underbrace{\begin{bmatrix}\epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}}_{\boldsymbol{\epsilon}} \]

\[ \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \tag{7.7}\]

Similarly, we can rewrite Equation 7.5 as

\[ \mathbf{\hat{Y}} = \mathbf{X}\boldsymbol{\hat{\beta}} \]

Applying the rules of matrix algebra and matrix calculus, we find that the estimated coefficients \(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p\) can be found using Equation 7.8

\[ \boldsymbol{\hat{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \mathbf{Y} \tag{7.8}\]

We can use this formulation of the model to estimate \(\hat{\sigma}_\epsilon^2\) along with other useful model statistics and diagnostics. Doing these calculations using the matrix form of the model is beyond the primary scope of the text. Interested readers may find more detail on the matrix notation for linear regression in Section A.5.
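As a sketch, Equation 7.8 can be verified directly in R. The example below uses the built-in mtcars data rather than the alumni data so that it is fully self-contained:

# Design matrix X (column of 1's plus the predictors) and response y
X <- model.matrix(mpg ~ wt + hp, data = mtcars)
y <- mtcars$mpg

# beta_hat = (X'X)^{-1} X'y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta_hat

# Matches the estimates from lm()
coef(lm(mpg ~ wt + hp, data = mtcars))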

7.5.2 Estimating coefficients using R

Similar to simple linear regression, we use the lm function to find the coefficient estimates \(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p\). We use + to separate the multiple predictors in the model formula.

# Fit the multiple linear regression model with four predictors
alumni_model <- lm(early_career_pay ~ stem_percent + type + region +
                     make_world_better_percent,
                   data = alumni)

Similarly, we use the tidy function to restructure the model output into a data frame and kable to display the results in a neatly formatted table.

# Restructure the model output and display it as a formatted table
tidy(alumni_model) |>
  kable(digits = 3)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 50.591 | 1.424 | 35.517 | 0.000 |
| stem_percent | 0.297 | 0.014 | 21.251 | 0.000 |
| typePublic | -1.832 | 0.427 | -4.293 | 0.000 |
| regionNortheast | 5.248 | 0.614 | 8.540 | 0.000 |
| regionSouth | -0.661 | 0.523 | -1.263 | 0.207 |
| regionWest | 3.885 | 0.680 | 5.710 | 0.000 |
| make_world_better_percent | -0.094 | 0.025 | -3.830 | 0.000 |
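Based on the estimates in the table above, the fitted regression equation is

\[ \begin{aligned} \widehat{\text{early\_career\_pay}} =\ & 50.591 + 0.297 \times \text{stem\_percent} - 1.832 \times \text{typePublic} \\ & + 5.248 \times \text{regionNortheast} - 0.661 \times \text{regionSouth} \\ & + 3.885 \times \text{regionWest} - 0.094 \times \text{make\_world\_better\_percent} \end{aligned} \]

where typePublic, regionNortheast, regionSouth, and regionWest are indicator variables equal to 1 for institutions in that category and 0 otherwise; private institutions in the North Central region form the baseline level.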

7.6 Interpreting the coefficients

Now that we have our model, let’s interpret the model coefficients to describe how we expect early career pay to change as these factors change.

7.6.1 Interpreting quantitative predictors

The approach for interpreting quantitative predictors is similar to the approach in Section 4.5. If we have a quantitative predictor \(x_j\), its model coefficient \(\beta_j\) is the slope describing the relationship between \(x_j\) and the response, after adjusting for the other predictors in the model. Therefore, when we interpret \(\beta_j\), we want to do so in a way that isolates that effect.

More specifically, the coefficient \(\beta_j\) is the expected change in the response variable when \(x_j\) increases by one unit, holding all other predictors constant. Let’s apply this to interpret the effect of stem_percent in the regression output above.

For each additional percentage point of students pursuing STEM degrees, the median early career pay for an institution’s graduates is expected to be higher by $297.22, holding all other predictors constant.
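We can check this interpretation with the predict function. Below, two hypothetical institutions (the characteristic values are made up for illustration) are identical except that stem_percent differs by one percentage point, so the difference in their predicted pay equals the stem_percent coefficient:

# Two hypothetical institutions that differ only in stem_percent
two_institutions <- data.frame(
  stem_percent = c(25, 26),
  type = "Public",
  region = "South",
  make_world_better_percent = 50
)

preds <- predict(alumni_model, newdata = two_institutions)
diff(preds)  # the stem_percent coefficient, about 0.297 (thousands of dollars)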

Your turn!

Interpret the coefficient of make_world_better_percent in the context of the data.1

Recall from Section 4.5,

“In the context of the data” means

  • using the variable names or meaningful descriptions of the variables,
  • including the units in the interpretation, and
  • indicating the population for which this model applies.

  1. For each additional percentage point of alumni who say they are making the world a better place, the median early career pay for an institution’s graduates is expected to be lower by $94.33, holding all other predictors constant.↩︎