7  Multiple linear regression

This chapter is a work in progress.

7.1 Learning objectives

  • Understand how multiple predictor variables can be used to explain variability in a response variable
  • Interpret the model coefficients
  • Calculate predicted values from the multiple linear regression model
  • Use different types of predictors in the regression model
  • Fit and interpret models with a transformed response and/or predictor variable

7.2 Software and packages
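
The code in this chapter uses the lm function from base R along with the tidyverse for data wrangling, tidymodels (which loads broom, the source of tidy()) for working with model output, and knitr for formatting tables with kable(). A minimal setup:

library(tidyverse)   # data wrangling and visualization
library(tidymodels)  # tidy() and other modeling helpers
library(knitr)       # kable() for formatted tables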

7.3 Introduction: College alumni pay

Most university graduates get a job within the first few years of graduating college, so a natural question that may arise is “what is the expected pay for recent graduates from a university?” In this chapter, we will explore that question by looking at potential characteristics of colleges and universities that may be associated with variability in the expected pay for recent graduates. The data were originally collected from the PayScale College Salary Report and were featured as part of the TidyTuesday College tuition, diversity, and pay weekly data visualization challenge.

The data used in this chapter also includes the United States regions obtained from the state.region data frame in R.

This analysis will focus on the following variables:

Response

  • early_career_pay: Median salary for alumni with 0 - 5 years of work experience (in thousands of US Dollars)

Predictors

  • region : Geographic regions in the United States (North Central, Northeast, South, West)

  • type : Type of institution (Public, Private)

  • stem_percent: Percent of the institution’s student body pursuing degrees in STEM fields (Science, Technology, Engineering, Mathematics)

  • make_world_better_percent: Percent of current alumni who say they are making the world a better place.

7.3.1 Exploratory data analysis

The distribution and summary statistics for the response variable early_career_pay are shown in Figure 7.1 and Table 7.1, respectively. The center of the distribution is around the median of $49,700, and the middle 50% of the distribution has a spread of about $9,500 (IQR). The distribution has a slight right skew, with some institutions having median early career pay over $80,000.

Figure 7.1: Distribution of early career pay
Table 7.1: Summary statistics for early career pay
Variable Mean SD Min Q1 Median (Q2) Q3 Max Missing
early_career_pay 51 8.2 32.5 45.5 49.7 55 88.8 0

Now let’s look at the distributions of the primary predictors for this analysis. The visualizations of each variable’s distribution and the summary statistics for the quantitative predictors are in Figure 7.2 and Table 7.2, respectively.

Figure 7.2: Distributions of predictor variables
Table 7.2: Summary statistics of quantitative predictors
Variable Mean SD Min Q1 Median (Q2) Q3 Max Missing
make_world_better_percent 53.6 8.8 33 48 52.0 58 94 30
stem_percent 16.6 15.4 0 7 12.5 22 100 0
Your turn!

Use the visualizations in Figure 7.2 and the summary statistics in Table 7.2 to describe the distribution of each predictor variable.

Now, let’s conduct bivariate exploratory data analysis, shown in Figure 7.3, to explore the relationship between the response and each of the predictor variables.

Figure 7.3

From Figure 7.3, we see that there is generally a negative relationship between an institution’s percentage of alumni who say they are making the world a better place and the median early career pay. In contrast, there is generally a positive relationship between the percentage of an institution’s student body pursuing STEM degrees and the median early career pay.

Graduates from institutions in the Northeast region have higher median early career pay compared to students from the other regions. The overlapping boxplots indicate there is generally not much difference in early career pay across the other regions. The median early career pay for students from private institutions is slightly higher than those from public institutions; however, the overlapping boxplots again indicate that this effect may not be strong.

Now that we have a better understanding of the variables in this analysis, we can fit a multiple linear regression model to understand how region, type, percentage of alumni who say they are making the world a better place, and the percentage of students pursuing STEM degrees explains variability in the expected early career pay.

7.4 Fitting the regression line

Thus far, we have examined the individual relationships between the response variable and each of the predictors. Now we will use multiple linear regression to fit a single model that helps us understand these relationships after adjusting for the other predictors.

7.4.1 Statistical model

Recall that the general form of a model for a response variable $Y$ is

$$Y = \text{Model} + \text{Error} \tag{7.1}$$

In the case of simple linear regression, $\text{Model} = f(X)$, where $X$ is the predictor variable used to explain the variability in $Y$. Now we will extend this to the case in which multiple predictor variables are used in combination to explain variability in the response variable $Y$. The values of the response variable $Y$ can be generated based on the following form

$$Y = \text{Model} + \text{Error} = f(\mathbf{X}) + \epsilon = f(X_1, X_2, \ldots, X_p) + \epsilon \tag{7.2}$$

where $\mathbf{X} = X_1, X_2, \ldots, X_p$ are the predictors and $\epsilon$ is the amount by which the actual value of $Y$ differs from what is expected from the model $f(X_1, X_2, \ldots, X_p)$.

Equation 7.2 is the general form of the model. As with simple linear regression, we specify the form of the model describing the relationship between the response and predictor variables. If there is a linear relationship between $Y$ and the predictors $X_1, X_2, \ldots, X_p$, Equation 7.2 can be more specifically written as

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon, \quad \epsilon \sim N(0, \sigma^2_\epsilon) \tag{7.3}$$

where $\epsilon \sim N(0, \sigma^2_\epsilon)$ indicates the error terms are normally distributed with a mean of 0 and variance $\sigma^2_\epsilon$. Equation 7.3 is the form of the population-level statistical model for multiple linear regression.

The model $f(X_1, X_2, \ldots, X_p)$ outputs the expected value of $Y$ for a combination of the predictors. This can be written as $\mu_{Y|X} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$. Given this and Equation 7.3, we know the distribution of the response variable $Y$ given the predictor variables $X_1, X_2, \ldots, X_p$. More specifically, this distribution is shown in Equation 7.4 and [FIGURE NAME]

$$Y \mid X_1, X_2, \ldots, X_p \sim N(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p, \; \sigma^2_\epsilon) \tag{7.4}$$

As with simple linear regression, we will use what we know from Equation 7.3 and Equation 7.4 to estimate and conduct inference about the model coefficients $\beta_0, \beta_1, \ldots, \beta_p$ and the variance of the residuals $\sigma^2_\epsilon$.

7.4.2 Is the MLR appropriate for the data?

Before we move forward with fitting the regression line and obtaining estimates for the coefficients, we should first ask whether the multiple linear regression model is appropriate or useful for the data. We can ask the same questions we did for simple linear regression; however, now we are considering the relationship between Y and multiple predictor variables rather than a single predictor.

  • Will a multiple linear regression model be practically useful? Does quantifying and interpreting the relationship between the variables make sense in this scenario?
  • Do the observations in the data represent the population of interest, or are there biases that should be addressed when drawing conclusions from the analysis?
  • Is the shape of the relationship between the response and the predictors approximately linear? Does a line reasonably describe the relationship?

The last question can be more challenging to answer for multiple linear regression than it was in the simple linear case, because it is difficult to visualize the relationship between three or more variables. As we saw in Figure 7.3, we can visualize the relationship between the response and each individual predictor. This can give us some indication of linearity and whether there are non-linear shapes that need to be accounted for (more on this in later sections). We will primarily rely on the assessment of the model conditions to ultimately determine whether the model is a useful fit for the data.

7.5 Estimating the coefficients

Recall that when we do linear regression, we are estimating the mean value of the response $Y$ given some linear combination of the predictors $X_1, X_2, \ldots, X_p$. Thus the predicted value of $Y$ from the estimated multiple linear regression model is shown in Equation 7.5.

$$\hat{Y} = \hat{\mu}_{Y|X} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \cdots + \hat{\beta}_p X_p \tag{7.5}$$

The error terms $\epsilon$ are then estimated by calculating the residuals, $e = y - \hat{y}$, the difference between what is observed in the data and what is expected according to Equation 7.5. When conducting least squares regression (also known as ordinary least squares), the estimated coefficients $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p$ are the combination of coefficients that minimize the sum of squared residuals shown in Equation 7.6.

$$\sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n \left(y_i - [\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}]\right)^2 \tag{7.6}$$

7.5.1 (Optional) Matrix notation for multiple linear regression

As you may have imagined, the differentiation required to find the values that minimize Equation 7.6 can become complicated, particularly if $p$ is large, indicating a lot of terms in the model. We can instead write the model and the least squares problem in terms of matrices and utilize some matrix algebra (and calculus) to obtain the estimated values. Suppose there are $n$ observations and $p$ predictors, then

$$\mathbf{Y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \qquad \mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}$$

where $\mathbf{Y}$ is an $n \times 1$ vector of the values of the response, and $\mathbf{X}$ is an $n \times (p+1)$ matrix called the design matrix. The first column of $\mathbf{X}$ is a vector of 1’s that correspond to the intercept $\beta_0$. The remaining columns are the observed values for each predictor variable. Additionally, let

$$\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} \qquad \boldsymbol{\epsilon} = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}$$

such that $\boldsymbol{\beta}$ is a $(p+1) \times 1$ vector of the model coefficients and $\boldsymbol{\epsilon}$ is an $n \times 1$ vector of the error terms. We can rewrite Equation 7.3 in matrix notation as

$$\underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}}_{\mathbf{Y}} = \underbrace{\begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}}_{\mathbf{X}} \underbrace{\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}}_{\boldsymbol{\beta}} + \underbrace{\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}}_{\boldsymbol{\epsilon}}$$

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \tag{7.7}$$

Similarly, we can rewrite Equation 7.5 as

$$\hat{\mathbf{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$$

Applying the rules of matrix algebra and matrix calculus, we find that the estimated coefficients $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p$ can be computed using

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} \tag{7.8}$$

We can use this formulation of the model to estimate $\hat{\sigma}^2_\epsilon$ along with other useful model statistics and diagnostics. Doing these calculations using the matrix form of the model is beyond the primary scope of the text. Interested readers may find more detail on the matrix notation for linear regression in .
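
For readers who would like to see Equation 7.8 in action, below is a minimal sketch using simulated data (not the alumni data); it constructs the design matrix by hand and checks that the result matches lm().

# simulate a small data set with two predictors
set.seed(123)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 3 + 2 * x1 - 1 * x2 + rnorm(n)

# design matrix: a column of 1's plus the predictor columns
X <- cbind(1, x1, x2)

# beta-hat = (X'X)^{-1} X'y, as in Equation 7.8
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta_hat

# the estimates match those from lm()
coef(lm(y ~ x1 + x2))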

7.5.2 Estimating coefficients using R

Similar to simple linear regression, we use the lm function to find the coefficient estimates $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p$. We use + to separate the multiple predictors in the model formula.

alumni_model <- lm(early_career_pay ~ stem_percent + type + region +
                     make_world_better_percent,
                   data = alumni)

Similarly, we use the tidy function to restructure the model output into a data frame and kable to display the results in a neatly formatted table.

tidy(alumni_model) |>
  kable(digits = 3)
Table 7.3: Output for model of early career salaries for college alumni
term estimate std.error statistic p.value
(Intercept) 50.591 1.424 35.517 0.000
stem_percent 0.297 0.014 21.251 0.000
typePublic -1.832 0.427 -4.293 0.000
regionNortheast 5.248 0.614 8.540 0.000
regionSouth -0.661 0.523 -1.263 0.207
regionWest 3.885 0.680 5.710 0.000
make_world_better_percent -0.094 0.025 -3.830 0.000

7.6 Interpreting the coefficients

Now that we have our model, let’s interpret the model coefficients to describe how we expect early career pay to change as these factors change.

7.6.1 Interpreting quantitative predictors

The approach for interpreting quantitative predictors is similar to the approach used for simple linear regression. If we have a quantitative predictor $x_j$, its model coefficient $\beta_j$ is the slope describing the relationship between $x_j$ and the response, after adjusting for the other predictors in the model. Therefore, when we interpret $\beta_j$, we want to do so in a way that isolates that effect.

More specifically, the coefficient $\beta_j$ is the expected change in the response variable when $x_j$ increases by one unit, holding all other predictors constant. Let’s apply this to interpret the effect of stem_percent in Table 7.3.

For each additional percentage point of students pursuing STEM degrees, the expected early career pay for graduates from an institution is greater by $297.22, holding all other predictors constant.

Your turn!

Interpret the coefficient make_world_better_percent in the context of the data.

Recall from ,

“In the context of the data” means

  • using the variable names or meaningful descriptions of the variables,
  • including the units in the interpretation, and
  • indicating the population for which this model applies.

7.6.2 Interpreting categorical predictors

When we want to include categorical predictors in our model, we need a way to map the qualitative information into quantitative information that can be understood by the model and our regression techniques. To do this, we create an indicator variable for each level of the categorical predictor. We then use these indicator variables in the regression model.

Suppose we have a categorical variable that has k levels (or categories). Then we can make k indicator variables, one for each level, such that the indicator takes the value 1 if the observation belongs to the particular level and 0 if the observation does not belong to the particular level. Let’s take a look at the variable region in our data.

The variable region has four levels: North Central, Northeast, South, West. We can create an indicator variable for each of these levels. From Table 7.4, we see the number of observations at each level of the original variable region and the number of observations that take the value 1 for each indicator variable.

Table 7.4: Indicator variables for region
region NorthCentral Northeast South West n
North Central 1 0 0 0 209
Northeast 0 1 0 0 151
South 0 0 1 0 295
West 0 0 0 1 115

Once we have quantified the information from the categorical predictor, we are ready to use the indicator variables in the model. You may notice in Table 7.3, however, that there are only 3 indicator variables (not 4) for region in the model. When a categorical predictor has $k$ levels, and thus $k$ indicator variables can be created, we include $k - 1$ of those indicator variables in the model. The level that does not have an indicator in the model is called the baseline level. When we interpret the coefficients for our indicator variables, we will interpret them as how the response differs for a given level, on average, compared to the baseline, holding all other variables constant.

Let’s take a look at the indicator variables for region in Table 7.3. The baseline level is “North Central”, the region that does not have an indicator in the model. Therefore, the interpretations of the coefficients for the other regions describe how the median early career salaries for those regions compare, on average, to those for alumni from institutions in the North Central region.
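
We can see this indicator coding directly by building the design matrix for region (a minimal sketch using the alumni data; model.matrix() is the function lm() uses internally to create indicator variables):

# one column per non-baseline level; North Central is absorbed into the intercept
model.matrix(~ region, data = alumni) |>
  head()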

The coefficient for the Northeast region is $\hat{\beta}_{\text{Northeast}} = 5.248$. This means that the expected early career salary is about $5,248 higher, on average, for alumni from institutions in the Northeast region compared to alumni from institutions in the North Central region, holding all else constant.

The coefficient for the South region is $\hat{\beta}_{\text{South}} = -0.661$. This means that the expected early career salary is about $661 lower, on average, for alumni from institutions in the South region compared to alumni from institutions in the North Central region, holding all else constant.

The coefficient for the West region is $\hat{\beta}_{\text{West}} = 3.885$. This means that the expected early career salary is about $3,885 higher, on average, for alumni from institutions in the West region compared to alumni from institutions in the North Central region, holding all else constant.

When we have indicator variables in the model, we are shifting the intercept of the model based on the levels of the categorical variable. The overall intercept of the model is the intercept for the baseline level for the categorical predictor. This means the intercepts for region are the following:

  • North Central: 50.591

  • Northeast: 55.839 (50.591 + 5.248)

  • South: 49.93 (50.591 - 0.661)

  • West: 54.476 (50.591 + 3.885)
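
These shifted intercepts can also be computed directly from the fitted model (a sketch using the coefficient names shown in Table 7.3):

alumni_coefs <- coef(alumni_model)

alumni_coefs["(Intercept)"] + alumni_coefs["regionNortheast"]  # Northeast intercept
alumni_coefs["(Intercept)"] + alumni_coefs["regionSouth"]      # South intercept
alumni_coefs["(Intercept)"] + alumni_coefs["regionWest"]       # West intercept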

Your turn!

The other categorical predictor in the model in Table 7.3 is type.

  • What is the baseline level for type?

  • Interpret the effect of type in the context of the data.

  • What is the intercept for private institutions? For public institutions?

7.6.3 Interpreting the intercept

As with simple linear regression, the intercept is the expected value of the response when all predictors in the model equal 0. Let’s take a look at what this means for our model in Table 7.3. The intercept is at the point such that

  • stem_percent = 0: Zero percent of the students at the institution are studying STEM fields.
  • typePublic = 0: The institution is private.
  • regionNortheast = 0 and regionSouth = 0 and regionWest = 0: The institution is in the North Central region.
  • make_world_better_percent = 0: Zero percent of the students reported wanting a job that makes the world a better place.

Based on the model, the expected early career pay for students graduating from such an institution is about $50,591.

Putting this together in narrative form, we have

The expected early career pay for students graduating from a private institution in the North Central region that has no STEM graduates and no students who want a job to make the world a better place is about $50,591.13.

We need the intercept to obtain the least-squares line. The intercept, however, does not always have a meaningful interpretation in practice. Therefore, as with simple linear regression, the intercept has a meaningful interpretation in practice when the following are true:

  1. The scenario described by all predictors in the model being at or near zero simultaneously is plausible in practice.

  2. The data used to fit the model includes observations in which all the predictors are at or near zero simultaneously.

Your turn!

Based on these criteria, do you think the interpretation of the intercept is meaningful for the alumni data? Why or why not?

7.7 Other types of predictors

7.7.1 Centering quantitative variables

As we saw in the previous section, though we need an intercept to fit the regression model, it often does not have a meaningful interpretation. This is usually because it is not meaningful or realistic for a quantitative predictor to take values at or near zero. One way to remedy this is to transform a quantitative predictor by shifting it, such that it would be feasible for the transformed variable to take values near zero. This process is called centering.

Centering a variable means shifting every observation of that variable by a single constant, $C$. For example, if we have a variable $X$, then the centered version of the variable is $X_{cent} = X - C$. Thus if $X_{cent}$ is a predictor in the model, the intercept is the expected response when $X_{cent} = X - C = 0$, which occurs when $X = C$.

Suppose we create a new variable stem_percent_cent that is stem_percent shifted by 20. The distributions of the original and centered variables are shown in Figure 7.4 and Table 7.5.
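
A minimal sketch of how this shifted variable might be created (the column name stem_percent_cent follows the text; it is not used in later models):

alumni <- alumni |>
  mutate(stem_percent_cent = stem_percent - 20)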

Figure 7.4: Original vs. centered stem_percent
Table 7.5: Original vs. centered stem_percent
Variable Mean SD Min Q1 Median (Q2) Q3 Max
stem_percent 16.6 15.4 0 7 12.5 22 100
stem_percent_cent -3.4 15.4 -20 -13 -7.5 2 80

From Figure 7.4 and Table 7.5, we see that the only statistics that have changed are those about location (e.g., Q1, Q3, min, max) and those related to the center (mean and median). The shape of the distribution and the statistics regarding the spread (standard deviation and IQR) have remained unchanged. By centering the variable we have essentially moved the distribution along the number line by a constant amount.

Similarly, in Figure 7.5, both plots have the same slope, illustrating that the slope of the relationship between the predictor and response does not change when we center the predictor. All we have changed is what it means to be “zero”, and therefore which observations have an expected response equal to the intercept. This is illustrated by the different intercepts for the models shown in the two plots.

Figure 7.5: Relationship between original and centered stem_percent and the response

A common form of centering is mean-centering, in which all the quantitative predictors in the model are shifted by their respective means. For the model in Table 7.3, this means transforming stem_percent and make_world_better_percent by shifting them by their respective means.

stem_cent = stem_percent - mean(stem_percent)
make_world_better_cent = make_world_better_percent - mean(make_world_better_percent)

Note that we do not mean-center the categorical predictors type and region, because there is no realistic notion of the “average region”. Additionally, there is already a meaningful interpretation for all of the indicator variables associated with region being equal to zero.

The output for the model with the mean-centered variables is in Table 7.6.

alumni <- alumni |>
  mutate(
    # mean-center the quantitative predictors; na.rm = TRUE is needed for
    # make_world_better_percent because it has missing values
    stem_cent = stem_percent - mean(stem_percent),
    make_world_better_cent = make_world_better_percent - mean(make_world_better_percent, na.rm = TRUE)
  )

alumni_model_cent <- lm(early_career_pay ~ stem_cent + type + region +
                          make_world_better_cent,
                        data = alumni)

tidy(alumni_model_cent) |>
  kable(digits = 3)
alumni_cent_intercept <- tidy(alumni_model_cent) |> select(estimate) |> slice(1) |> pull()
Table 7.6: Model for alumni data with mean-centered quantitative predictors
term estimate std.error statistic p.value
(Intercept) 50.465 0.415 121.460 0.000
stem_cent 0.297 0.014 21.251 0.000
typePublic -1.832 0.427 -4.293 0.000
regionNortheast 5.248 0.614 8.540 0.000
regionSouth -0.661 0.523 -1.263 0.207
regionWest 3.885 0.680 5.710 0.000
make_world_better_cent -0.094 0.025 -3.830 0.000

A comparison of the model coefficients from the original model and the model with mean-centered values for the quantitative predictors is shown in Table 7.7. In what ways are the two models the same? In what ways are they different?

Table 7.7: Model coefficients for original model and model with mean-centered quantitative predictors
Term Original Mean-Centered
(Intercept) 50.591 50.465
stem_percent 0.297 0.297
typePublic -1.832 -1.832
regionNortheast 5.248 5.248
regionSouth -0.661 -0.661
regionWest 3.885 3.885
make_world_better_percent -0.094 -0.094

Note that the only term that changed between the two models is the intercept; the coefficients for all other predictors remained the same. This shows that centering a predictor, whether by shifting by the mean or by some other constant, does not change its relationship with the response variable. With this in mind, let’s interpret the intercept for the model with mean-centered variables.

As before, the intercept is the expected response, early career pay, when all the predictors are equal to 0. Thus, the intercept is the expected early career pay when

  • stem_cent = 0: The percent of the students at the institution studying STEM fields is equal to the mean, 16.599.
  • typePublic = 0: The institution is private.
  • regionNortheast = 0 and regionSouth = 0 and regionWest = 0: The institution is in the North Central region.
  • make_world_better_cent = 0: The percent of the students who reported wanting a job that makes the world a better place is equal to the mean, 53.63.

Before we write the full interpretation, let’s see why stem_percent and make_world_better_percent are at the mean values when we interpret the intercept.

0 = stem_cent = stem_percent - mean(stem_percent), which implies stem_percent = mean(stem_percent)

The same holds true for make_world_better_percent. Thus, the interpretation of the intercept in narrative form when the quantitative predictors are mean-centered is

The expected early career pay for students graduating from a private institution in the North Central region that has 16.599% STEM graduates and 53.63% of students who want a job to make the world a better place is about $50,465.48.

Because the coefficients for the other predictors are the same, the interpretations for these predictors are the same as in Section 7.6.1 and Section 7.6.2.

7.7.2 Standardizing quantitative variables

Thus far, we have discussed interpretation of the coefficients to understand the relationship between the response and each predictor variable. Sometimes, however, our primary objective of the analysis may be less focused on interpretation and more focused on which predictors have the strongest relationship with the response variable. When that is the case, we can fit a model using standardized values of the quantitative predictors.

Note that we cannot simply compare the coefficients of the original variables to determine which predictor is most strongly related to the response, because the predictors are measured on different scales; a one-unit change in one predictor can represent a very different amount of change than a one-unit change in another.

Standardizing a variable means shifting every observation of that variable by the mean of the variable (as in mean-centering) and dividing by the standard deviation of that variable. Thus, given a quantitative predictor X, the standardized value of X is

$$X_{std} = \frac{X - \bar{X}}{s_X}$$

where $\bar{X}$ is the mean of $X$ and $s_X$ is the standard deviation of $X$. When we standardize a quantitative predictor, we have not only shifted the center to 0 but also rescaled the variable by its standard deviation; thus, the standard deviation (and variance) of the standardized variable is 1. This may sound familiar, as it is similar to a z-score that follows the standard normal distribution.

Unlike centering, standardizing a variable changes its units, since it is rescaled by the standard deviation. Therefore, the units are no longer the original units but rather the number of standard deviations an observation is away from the mean value. Because the units have changed, the value (but not the sign!) of the coefficient for a standardized variable will change as well. Table 7.8 shows the output of the model using standardized versions of stem_percent and make_world_better_percent. Similar to centering, we only standardize quantitative predictors, as there is no meaningful notion of “average” or “standard deviation” for categorical predictors like region or type.

alumni <- alumni |>
  mutate(stem_std = (stem_percent - mean(stem_percent))/sd(stem_percent),
         make_world_better_std = (make_world_better_percent - mean(make_world_better_percent, na.rm = TRUE)) / sd(make_world_better_percent, na.rm = TRUE)
  )

alumni_model_std <- lm(early_career_pay ~ stem_std + type + region +
                          make_world_better_std,
                        data = alumni)

tidy(alumni_model_std) |>
  kable(digits = 3)
Table 7.8: Model with standardized coefficients for quantitative variables
term estimate std.error statistic p.value
(Intercept) 50.465 0.415 121.460 0.000
stem_std 4.586 0.216 21.251 0.000
typePublic -1.832 0.427 -4.293 0.000
regionNortheast 5.248 0.614 8.540 0.000
regionSouth -0.661 0.523 -1.263 0.207
regionWest 3.885 0.680 5.710 0.000
make_world_better_std -0.830 0.217 -3.830 0.000

Now let’s compare the model with standardized quantitative predictors to the original model and the model using mean-centered quantitative predictors. Table 7.9 shows the comparison of these models.

Table 7.9: Model coefficients for original model, model with mean-centered quantitative predictors, and model with standardized quantitative predictors
Term Original Mean-Centered Standardized
(Intercept) 50.591 50.465 50.465
stem_percent 0.297 0.297 4.586
typePublic -1.832 -1.832 -1.832
regionNortheast 5.248 5.248 5.248
regionSouth -0.661 -0.661 -0.661
regionWest 3.885 3.885 3.885
make_world_better_percent -0.094 -0.094 -0.830
Your turn!

The intercept for the model using mean-centered quantitative predictors is the same as the intercept for the model using standardized quantitative predictors. Why do these models have the same intercept?

The interpretation for the standardized predictors is in terms of a one standard deviation (one “unit”) increase in the predictor. Below is the interpretation for stem_percent.

For every one standard deviation increase in the percentage of students pursuing STEM degrees, the early career pay for alumni is expected to increase by $4,587, holding all else constant.

Though this interpretation is technically correct, it is not very meaningful to the reader to merely say a “one standard deviation increase”. Therefore, we can use the exploratory data analysis in Table 7.2 to be more specific about how large a one standard deviation increase in the percentage of STEM degrees is and make this interpretation more meaningful.

For every 15.428 percentage point increase in the percentage of students pursuing STEM degrees, the early career pay for alumni is expected to increase by $4,587, holding all else constant.
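
We can check how the standardized and original slopes relate: because stem_std is stem_percent shifted and then divided by its standard deviation, the standardized slope equals the original slope multiplied by that standard deviation (a quick check, assuming the models fit above):

sd(alumni$stem_percent)                                        # about 15.428
coef(alumni_model_std)["stem_std"] / sd(alumni$stem_percent)   # recovers the original slope, about 0.297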

We often standardize predictors when we want to identify the variables that have the strongest relationship with the response variable. Because all the quantitative predictors are now on the same scale, we can look at the magnitude of the coefficients to rank the predictors in terms of importance. From the coefficients in Table 7.8, we see that regionNortheast has the strongest relationship with the response, early career pay, because it has the coefficient with the largest magnitude, i.e., the largest absolute value.

Your turn!

Based on Table 7.8, which predictor is the least important for predicting early career pay?

7.8 Higher-order polynomial terms

Sometimes the relationship between the response and one or more predictors cannot be adequately described using a model of the form

$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon \tag{7.9}$$

Therefore, we may consider including polynomial terms or non-linear terms in the model to better account for the relationship between the response and predictors. When doing a regression analysis, we can use the exploratory data analysis to gain initial intuition about the relationship between the response and each predictor variable. While the exploratory data analysis is useful for intuition, it is often when examining the residual plots that we observe the need for such terms in the model.

When we fit a model of the form in Equation 7.9, we assume that the relationship between the response variable $Y$ and the predictor $X_j$ is always positive or always negative. This is called a monotonic relationship. A monotonic relationship may not adequately fit the data, however. For example, Figure 7.6 shows how poorly the least-squares regression line $\hat{Y} = 20 + 2.9X$ fits the relationship between the response $Y$ and predictor $X$ in the toy example data set.

Figure 7.6: Quadratic relationship between a response and predictor variable using toy data

As we see here, we can find the equation of the least-squares line even if such a line is not a good fit for the data. As with all linear regression analysis, we want to use the residuals to assess the LINE model assumptions discussed in [INSERT REF]. For now, let’s look at a plot of residuals versus fitted values to assess the linearity condition.

Figure 7.7: Plot of residuals versus fitted for toy data set

From Figure 7.7, we see a clear violation of the linearity assumption. Recall the question from [INSERT REF]:

Based on the plot, can you generally determine with great certainty whether the residual would be positive or negative for a given fitted value or range of fitted values?

The answer to this question is yes. We can be fairly certain the residual will be positive for fitted values that are less than 0 or greater than 40. We can also be fairly certain that the residual will be negative for fitted values between 20 and 30 (though this is not always the case). Thus the linearity condition is not satisfied.

Beyond determining that the linearity condition is violated, we can use the residual plot to determine the shape of the relationship between the response and predictor that isn’t being captured by the linear model. Recall that the residuals should merely capture random error, so ideally there would be no pattern in the plot of residuals versus fitted values. In Figure 7.7, we see a quadratic (“U-shaped”) relationship, indicating there is still some systematic relationship between the response and predictor that must be accounted for. Therefore, we add a quadratic term $X^2$ to the model to capture this relationship.

$$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon \tag{7.10}$$

Equation 7.10 shows the form of the linear regression model that includes a quadratic term. From Figure 7.8, we see that a model of this form is a much better fit for the relationship between $Y$ and $X$.
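
In R, one way to include the quadratic term is to wrap it in I() inside the model formula. Below is a sketch, assuming the toy data live in a data frame called toy_data with columns x and y (the data frame name is ours; the chapter does not show the toy data directly):

quad_model <- lm(y ~ x + I(x^2), data = toy_data)
tidy(quad_model)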

Figure 7.8: Relationship between Y and X with a model fit to capture quadratic relationship

The residual plot in Figure 7.9 shows that the linearity condition is now satisfied; the residuals now represent random error, with no remaining systematic relationship between the response and predictor.

Figure 7.9: Plot of residuals versus fitted for toy data set

Now that we have established how quadratic terms can help with model fit, let’s discuss how to interpret a model with quadratic terms. Equation 7.11 is the equation of the model shown in Figure 7.8.

$$\hat{Y} = 6.298 - 0.436X + 1.103X^2 \tag{7.11}$$

Now that we have the quadratic term in the model, we will interpret the full effect of the predictor $X$, including both the linear and quadratic terms. Recall that we write interpretations saying “holding all else constant”; however, it would be impossible to hold $X^2$ constant as $X$ changes, and vice versa. Additionally, we are typically interested in the overall relationship between a response and predictor, not each component separately. Let’s consider the interpretation in terms of a one-unit increase in $X$. When $X$ increases by 1 unit, say from 0 to 1, we expect $Y$ to increase by 0.667 (-0.436 + 1.103) units.

This interpretation doesn’t quite give the full picture of the change in $Y$ as $X$ changes, because from Figure 7.8, we see that the relationship is different for different ranges of $X$. Therefore, we can write interpretations for different ranges of $X$ to more holistically capture the relationship. The general form of the interpretation as $X$ goes from $a$ to $b$ is

As $X$ increases from $a$ to $b$, $Y$ is expected to change by $-0.436 \times (b - a) + 1.103 \times (b^2 - a^2)$, on average.

Based on Figure 7.8, we will interpret the change in Y for the following ranges of X: -7 (minimum value) to -5, -5 to 5, and 5 to 10.

When X increases from -7 to -5, Y is expected to decrease by 27.344, on average.

When X increases from -5 to 5, Y is expected to decrease by 4.36, on average.
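
A small helper function makes these calculations less error-prone (a sketch; the function name quad_change is our own, with the coefficients from Equation 7.11 filled in):

# expected change in Y as X increases from a to b, based on Equation 7.11
quad_change <- function(a, b) {
  -0.436 * (b - a) + 1.103 * (b^2 - a^2)
}

quad_change(-7, -5)  # -27.344
quad_change(-5, 5)   # -4.36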

Your turn!

Describe the change in Y when X increases from 5 to 10.

If you are unsure about the ranges for interpretation, one strategy is to interpret the change in Y when X makes the following changes: Minimum to Q1, Q1 to Q3, Q3 to Maximum.

What about $X^3$, $X^4$, etc.?

In practice, we often want to avoid polynomial terms in the model that are higher than $X^2$ unless the data show such a term is needed to sufficiently capture the relationship between the response and predictor variable (for example, there are scientific contexts in which a cubic term is necessary). Unnecessarily adding higher-order polynomial terms to the model can lead to overfitting, a case in which the model is fit so closely to the data set at hand that it cannot be used on any other data set. We talk more about overfitting and strategies to avoid it in a later chapter.

If we do add higher-order polynomial terms to the model, then we interpret them using the same approach as with quadratic terms. For example, suppose we fit the model

$$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3$$

The interpretation when X changes from a to b is the following:

As $X$ increases from $a$ to $b$, $Y$ is expected to change by $\beta_1 \times (b - a) + \beta_2 \times (b^2 - a^2) + \beta_3 \times (b^3 - a^3)$, on average.

7.9 Log-transformations

7.10 Interaction effects

  • Use interaction effects when one predictor’s effect on the response differs based on the values of another predictor variable. There are three possible types of interaction terms that we will discuss in detail (a code sketch follows this outline):
  • Categorical X quantitative

    • This is the scenario in which the effect of the quantitative predictor differs for different categories of a predictor variable

    • Show a graph of the line - the non-parallel slopes indicate a potential interaction effect

    • Interpretation of the actual interaction term

    • In practice we may just talk about the effect of a particular variable for different levels of the categorical predictor

  • Categorical X categorical

    • This is similar to the previous scenario, but now the effect of one categorical predictor differs based on the values of another categorical predictor.

    • Because we have two categorical predictors, you can interchange how you interpret them.

      • One way to decide how to interpret them is to figure out which variable is the primary one of interest. Then you can interpret the relationship between that variable and the response for varying levels of the other predictor.
    • You can visualize this on a segmented barplot or a line plot?

  • Quantitative X quantitative

    • This is a less common type of interaction but can be used in certain contexts, such as in physics.

    • Similar to categorical X categorical in that you can interchange the interpretation

      • You can determine which variable is the primary one of interest and interpret the interaction effect based on varying values of the other predictor
    • When you interpret this, it is best to identify particular points - maybe a high, medium, and low value of one predictor and interpret the other predictor’s effect on the response.

    • Show what this looks like visually - and interpret the effects
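
As a concrete sketch of the categorical X quantitative case described above, one could allow the stem_percent effect to differ by institution type by adding an interaction with * in the model formula (this particular model is illustrative and is not fit elsewhere in the chapter):

alumni_model_int <- lm(early_career_pay ~ stem_percent * type + region +
                         make_world_better_percent,
                       data = alumni)

tidy(alumni_model_int) |>
  kable(digits = 3)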

7.11 Prediction

  • Prediction is similar to simple linear regression

  • you plug in the values and get the expected value of the response

  • Show an example (a sketch follows this outline)

  • Your turn: find the predicted value for [insert combination of the predictors]

  • must be careful about extrapolation. given you are working in a lot of different dimensions, there are a lot of different ways you can extrapolate. so you want to be especially mindful not to extrapolate in terms of any single predictor or combination of predictors

  • there is variability in the prediction, which we will discuss in the chapter about inference.
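
A sketch of the workflow described in the outline, assuming the alumni_model fit earlier in the chapter (the predictor values below are made up for illustration, chosen to stay within the observed ranges of the data):

new_institution <- tibble(
  stem_percent = 20,
  type = "Public",
  region = "South",
  make_world_better_percent = 55
)

predict(alumni_model, newdata = new_institution)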

library(tidyverse)
library(tidymodels)
lemurs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-08-24/lemur_data.csv')

# filter for just three species (maybe even just one species) of adults, and take one weight measurement per lemur

set.seed(51325)

lemurs_samp <- lemurs |>
  filter(taxon %in% c("PCOQ", "ERUF",  "VRUB"), sex != "ND", age_at_wt_mo > 0) |> #only 1 obs in final sample with sex = ND
  group_by(dlc_id) |>
  sample_n(1) |>
  ungroup()

m <- lm(weight_g ~ taxon + sex + age_at_wt_mo  + litter_size, data = lemurs_samp)

m2 <- lm(weight_g ~ taxon + sex +scale(age_at_wt_mo, scale = FALSE) + scale(litter_size, scale = FALSE) , data = lemurs_samp)

m3 <- lm(weight_g ~ taxon + sex +scale(age_at_wt_mo) + scale(litter_size) , data = lemurs_samp)

m4 <- lm(weight_g ~ taxon + sex + age_at_wt_mo + litter_size + taxon * age_at_wt_mo , data = lemurs_samp)

m5 <- lm(weight_g ~ taxon + sex + age_at_wt_mo + litter_size + taxon * sex , data = lemurs_samp)

m6 <- lm(weight_g ~ taxon + sex + age_at_wt_mo + litter_size + age_at_wt_mo * litter_size, data = lemurs_samp)

m7 <- lm(weight_g ~ taxon + sex + age_at_wt_mo + litter_size + age_at_wt_mo + I(age_at_wt_mo^2), data = lemurs_samp)

m8 <- lm(weight_g ~ taxon + sex + age_at_wt_mo + litter_size + log(age_at_wt_mo), data = lemurs_samp)


tidy(m, conf.int = TRUE)
# A tibble: 6 × 7
  term         estimate std.error statistic  p.value conf.low conf.high
  <chr>           <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)    1215.     223.       5.46  1.05e- 7   777.      1654. 
2 taxonPCOQ       640.     180.       3.56  4.41e- 4   286.       995. 
3 taxonVRUB      1394.     265.       5.26  2.87e- 7   872.      1916. 
4 sexM           -217.     139.      -1.56  1.20e- 1  -491.        56.6
5 age_at_wt_mo     11.5      1.11    10.3   1.81e-21     9.28      13.7
6 litter_size    -108.     138.      -0.782 4.35e- 1  -381.       164. 
tidy(m2, conf.int = TRUE) 
# A tibble: 6 × 7
  term                  estimate std.error statistic  p.value conf.low conf.high
  <chr>                    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)             1727.     161.      10.8   6.96e-23  1411.      2043. 
2 taxonPCOQ                640.     180.       3.56  4.41e- 4   286.       995. 
3 taxonVRUB               1394.     265.       5.26  2.87e- 7   872.      1916. 
4 sexM                    -217.     139.      -1.56  1.20e- 1  -491.        56.6
5 scale(age_at_wt_mo, …     11.5      1.11    10.3   1.81e-21     9.28      13.7
6 scale(litter_size, s…   -108.     138.      -0.782 4.35e- 1  -381.       164. 
tidy(m3, conf.int = TRUE)
# A tibble: 6 × 7
  term                estimate std.error statistic  p.value conf.low conf.high
  <chr>                  <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)           1727.      161.     10.8   6.96e-23    1411.    2043. 
2 taxonPCOQ              640.      180.      3.56  4.41e- 4     286.     995. 
3 taxonVRUB             1394.      265.      5.26  2.87e- 7     872.    1916. 
4 sexM                  -217.      139.     -1.56  1.20e- 1    -491.      56.6
5 scale(age_at_wt_mo)    899.       87.0    10.3   1.81e-21     727.    1070. 
6 scale(litter_size)     -97.0     124.     -0.782 4.35e- 1    -341.     147. 
tidy(m4, conf.int = TRUE)
# A tibble: 8 × 7
  term                   estimate std.error statistic p.value conf.low conf.high
  <chr>                     <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
1 (Intercept)            1427.       229.      6.23   1.66e-9   976.     1878.  
2 taxonPCOQ               162.       212.      0.767  4.44e-1  -254.      579.  
3 taxonVRUB              1401.       289.      4.85   2.05e-6   832.     1969.  
4 sexM                   -277.       134.     -2.06   3.99e-2  -542.      -12.9 
5 age_at_wt_mo              8.29       1.84    4.49   1.02e-5     4.66     11.9 
6 litter_size            -117.       133.     -0.877  3.81e-1  -379.      145.  
7 taxonPCOQ:age_at_wt_mo   12.0        2.74    4.36   1.78e-5     6.56     17.3 
8 taxonVRUB:age_at_wt_mo   -0.242      2.52   -0.0961 9.24e-1    -5.19      4.71
tidy(m5, conf.int = TRUE)
# A tibble: 8 × 7
  term           estimate std.error statistic  p.value conf.low conf.high
  <chr>             <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)      1251.     242.       5.18  4.26e- 7   775.      1726. 
2 taxonPCOQ         575.     241.       2.38  1.78e- 2   100.      1050. 
3 taxonVRUB        1358.     312.       4.35  1.88e- 5   744.      1972. 
4 sexM             -301.     270.      -1.11  2.66e- 1  -832.       230. 
5 age_at_wt_mo       11.5      1.12    10.2   3.92e-21     9.26      13.7
6 litter_size      -108.     139.      -0.779 4.37e- 1  -382.       165. 
7 taxonPCOQ:sexM    147.     359.       0.410 6.82e- 1  -560.       854. 
8 taxonVRUB:sexM     85.6    350.       0.244 8.07e- 1  -604.       775. 
tidy(m6, conf.int = TRUE)
# A tibble: 7 × 7
  term                  estimate std.error statistic  p.value conf.low conf.high
  <chr>                    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)           1021.       241.     4.25    2.96e- 5   548.    1495.   
2 taxonPCOQ              673.       180.     3.74    2.21e- 4   319.    1027.   
3 taxonVRUB             1429.       264.     5.41    1.35e- 7   909.    1948.   
4 sexM                  -230.       138.    -1.66    9.72e- 2  -503.      42.1  
5 age_at_wt_mo            15.8        2.36   6.68    1.24e-10    11.1     20.4  
6 litter_size              0.431    147.     0.00292 9.98e- 1  -290.     291.   
7 age_at_wt_mo:litter_…   -2.60       1.26  -2.07    3.97e- 2    -5.08    -0.123
tidy(m7, conf.int = TRUE)
# A tibble: 7 × 7
  term              estimate std.error statistic  p.value conf.low conf.high
  <chr>                <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)        633.     188.         3.36  8.86e- 4  262.     1003.   
2 taxonPCOQ          780.     148.         5.28  2.52e- 7  489.     1070.   
3 taxonVRUB         1329.     217.         6.14  2.83e- 9  903.     1755.   
4 sexM              -230.     114.        -2.03  4.34e- 2 -454.       -6.88 
5 age_at_wt_mo        41.1      2.64      15.6   4.93e-40   35.9      46.3  
6 litter_size        -79.4    113.        -0.702 4.83e- 1 -302.      143.   
7 I(age_at_wt_mo^2)   -0.139    0.0116   -12.0   5.03e-27   -0.162    -0.116
tidy(m8, conf.int = TRUE)
# A tibble: 7 × 7
  term              estimate std.error statistic  p.value conf.low conf.high
  <chr>                <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)         390.     134.         2.90 3.97e- 3   126.      654.  
2 taxonPCOQ           780.     105.         7.44 1.24e-12   574.      987.  
3 taxonVRUB          1103.     155.         7.13 8.21e-12   799.     1408.  
4 sexM                -94.9     81.0       -1.17 2.43e- 1  -254.       64.6 
5 age_at_wt_mo         -3.81     0.914     -4.17 4.01e- 5    -5.61     -2.01
6 litter_size         -84.1     80.6       -1.04 2.97e- 1  -243.       74.5 
7 log(age_at_wt_mo)   616.      26.1       23.6  5.36e-69   565.      668.  
ggplot(data = lemurs_samp, aes(x = age_at_wt_mo, y = weight_g, color = taxon)) + 
  geom_point()


  1. For each additional percentage point of students who want a job that makes the world a better place, the expected early career pay for graduates from an institution is lower by $94.33, holding all other predictors constant.↩︎

  2. The intercept is the expected value of the response when all predictors are equal to 0. Given a quantitative predictor $X$, the mean-centered value of $X$ is $X - \bar{X}$ and the standardized value of $X$ is $\frac{X - \bar{X}}{s_X}$. Both of these equal 0 when $X - \bar{X} = 0$, i.e., when $X = \bar{X}$, meaning $X$ is equal to its mean.↩︎

  3. The variable make_world_better_percent , because it has the coefficient with the smallest magnitude, |-0.094| = 0.094.↩︎

  4. When X increases from 5 to 10, Y is expected to increase by 80.545 , on average.↩︎