4  Simple linear regression

Learning outcomes

  • Use exploratory data analysis to explain whether a simple linear regression is an appropriate model to describe the relationship between two variables.
  • Fit a simple linear regression model using equations and software.
  • Interpret the slope and intercept in the context of the data.
  • Use a simple linear regression for prediction.
  • Partition sources of variability in the response variable.
  • Calculate and interpret \(R^2\) in the context of the data.

4.1 Introduction: Movie ratings on Rotten Tomatoes

You decide to go to the movies and are trying to figure out which movie to see. Even after watching a few enticing movie trailers, you’re still unsure about which movie you’d enjoy watching. To help decide, you use data from movie critics and other regular movie-goers.

The data include the critics’ and audience scores for 146 movies released in 2014 and 2015. These are all of the movies released in those years that have “a rating on Rotten Tomatoes, a RT User rating, a Metacritic score, a Metacritic User score, an IMDb score, and at least 30 fan reviews on Fandango” (Albert Y. Kim, Ismay, and Chunn 2018a). These scores are from Rotten Tomatoes, a website for information and ratings on movies and television shows. The data were originally analyzed in the article “Be Suspicious of Online Movie Ratings, Especially Fandango’s” (Hickey 2015) on the data journalism site FiveThirtyEight. The data are in the movie_scores data frame, which was adapted from the fandango data frame in the fivethirtyeight R package (Albert Y. Kim, Ismay, and Chunn 2018b).

We will focus on two variables for this analysis:

  • critics: Critics score calculated as the percentage of critics who have a favorable review of the movie. This is known as the “Tomatometer” score on the Rotten Tomatoes website. The possible values are 0 - 100.

  • audience: Audience score calculated as the percentage of users on the site (regular movie-goers) who have a favorable review of the movie. The possible values are 0 - 100.

The objective of this analysis is to model the relationship between the critics score and audience score using simple linear regression. We want to use the model to

  • describe how the audience score is expected to change as the critics score changes.

  • predict the audience score for a movie given its critics score.

Review

The population is the group we’re interested in understanding using statistical analysis. This could be a group of people, places, objects, etc.


The sample is the subset of the population on which we have data for the analysis. We analyze the sample data to derive insights about the population. Ideally the sample has been generated in a way such that it is representative of the population, so that we can draw more generalizable conclusions about the population.


What is the population in the movie scores analysis? What is the sample?\(^{1}\)

Before taking a look at the data, let’s define two terms that will be important for this chapter and the rest of the text. The response variable is the outcome of interest. It is also known as the outcome or dependent variable and denoted as \(Y\). The predictor variable(s) is the variable (or variables) used to understand variability in the response. It is also known as the explanatory or independent variable. In this chapter, we will fit and analyze models with one predictor variable. We will extend to the case of multiple predictor variables in Chapter 7.

Your turn!

What is the response variable for the movie scores analysis? What is the predictor variable?\(^{2}\)

4.2 Exploratory data analysis

We start every analysis with exploratory data analysis (EDA) to better understand the observations in the data and the distributions of the variables, and to begin gaining initial insights about the relationships between the variables of interest. EDA can also help us identify outliers or other unusual observations, missingness in the data, and potential errors in the data, such as errors in how the data were recorded or how the data set was loaded into the statistical software.

We’ll do an exploratory data analysis that focuses only on the two variables that will be in the regression model. In practice, however, we may want to explore other variables in the data set (e.g., year) to provide additional context to the data as we interpret results. We begin with univariate EDA, exploring one variable at a time, then we’ll conduct bivariate EDA to begin examining the relationship between critics and audience scores.

4.2.1 Univariate EDA

The univariate distributions of the critics and audience scores are visualized in Figure 4.1 and summarized in Table 4.1.

Code
# Packages the code below relies on (assumed; the original listing does not show them)
library(ggplot2)    # ggplot(), geom_histogram(), labs(), xlim()
library(patchwork)  # combines plots with `+`
library(skimr)      # skim()
library(dplyr)      # select()
library(knitr)      # kable()

# Histogram of critics scores
p_critics <- ggplot(data = movie_scores, aes(x = critics)) + 
  geom_histogram(binwidth = 10, fill = "steelblue", color = "black") + 
  labs(x = "Critics Score", 
       y = "Count") +
  xlim(0, 100)

# Histogram of audience scores
p_audience <- ggplot(data = movie_scores, aes(x = audience)) + 
  geom_histogram(binwidth = 10, fill = "steelblue", color = "black") + 
  labs(x = "Audience Score", 
       y = "Count") +
  xlim(0, 100)

# Display the two histograms side by side
p_critics + p_audience

Figure 4.1: Univariate distributions of critics scores and audience scores on Rotten Tomatoes.

# Summary statistics for both variables, formatted as a table
movie_scores |>
  skim(critics, audience) |>
  select(skim_variable, numeric.mean, numeric.sd, numeric.p0, 
         numeric.p25, numeric.p50, numeric.p75, numeric.p100, n_missing) |>
  kable(col.names = c("Variable", "Mean", "SD", "Min", "Q1", 
                      "Median (Q2)", "Q3", "Max", "Missing"), 
        digits = 1)

Table 4.1: Summary statistics for audience and critics score

Variable   Mean   SD     Min   Q1     Median (Q2)   Q3   Max   Missing
critics    60.8   30.2   5     31.2   63.5          89   100   0
audience   63.9   20.0   20    50.0   66.5          81   94    0

The description of the univariate distribution has four components:

  • Shape: A description of the shape includes the skewness (left-skewed, right-skewed, symmetric) and the number of modes, i.e., peaks (unimodal, bimodal, multimodal).

  • Center: The mean or median is typically used to describe the center of the distribution. To determine which measure is the best representation of the center, consider the shape of the distribution and presence of outliers. If the distribution is approximately symmetric, then the mean is the better measure of center. One reason for this is that the mean is calculated using all the values in the data set, in contrast to the median, which only takes into account the middle value (or middle two values if there is an even number of observations). The mean, however, is affected by skewness in the data and the presence of outliers. Therefore, if either of these is present, the median is the more reliable measure of the center of the distribution.

  • Spread: The standard deviation or interquartile range (IQR) is used to describe the spread. If the data are approximately symmetric with no outliers, the standard deviation is a good measure of the spread. The standard deviation is impacted by skewness and outliers, because the mean is used to calculate it. If the distribution is skewed or has outliers, the IQR \((Q_3 - Q_1)\) is a more reliable measure of spread.

Reporting center and spread

To describe the center and spread of a distribution, report

  • the mean and standard deviation, or

  • the median and IQR

Using range as a measure of spread

The range \((max - min)\) is another commonly used measure of the spread of the distribution. The range, however, should be used with caution and never reported as the only measure of spread. Because it only takes into account the tails of the distribution, it only gives an indication of what is happening at the extremes of the distribution, not in the middle where a majority of the data typically lie. Additionally, it is heavily affected by outliers, so it can be a potentially misleading measure of the spread.

  • Outliers or other notable patterns: The last part of describing a univariate distribution is the presence of outliers or other interesting or unusual patterns in the data. Outliers can be observations that just happen to be different from the others (e.g., the salary of Bill Gates, a co-founder of Microsoft, compared to the salaries of 1000 randomly selected adults in the United States); however, they may also be due to data entry errors (e.g., a person’s age recorded as 150 years). Unusual patterns are those that may not follow what we would expect, such as a mode at an unexpected value. This often happens in practice with modes at values such as -1 or 0, which are intended to represent missing data rather than actual observed values.

    Once the outliers have been identified and better understood from further investigation, there are options on how to address them. If they are merely unusual observations, it is good practice to keep them in the analysis or compare models fit with and without these observations. If we remove them from the analysis, we must note that they’ve been removed and discuss potential limitations in the scope of the conclusions. If the outliers are a result of a data entry error, then it is recommended to correct the value, if it is possible to determine what the intended value is, or remove the observation from the analysis. Again, it is important to document how the outlying observations are handled and handle them in a reproducible way. Handling outliers in regression is discussed in more detail in (ch-slr-conditions?).
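To make the guidance on center and spread concrete, the sketch below compares the four summary measures before and after adding a single extreme value. The salary values are made up for illustration and are not from any data set in this chapter; `center_spread()` is a hypothetical helper name.

```r
# Toy salaries in thousands of dollars, made up for illustration
salaries     <- c(42, 48, 51, 55, 58, 61, 64)
with_outlier <- c(salaries, 5000)   # the same values plus one extreme salary

# Hypothetical helper: the four measures of center and spread discussed above
center_spread <- function(v) {
  c(mean = mean(v), sd = sd(v), median = median(v), iqr = IQR(v))
}

# Compare the summaries with and without the extreme value
round(rbind(original     = center_spread(salaries),
            with_outlier = center_spread(with_outlier)), 1)
```

In this toy example the mean and standard deviation change dramatically once the extreme value is included, while the median and IQR barely move, which is why the median and IQR are preferred for skewed distributions or distributions with outliers.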

Putting all this together, we now describe the univariate distribution of the predictor variable critics.

The distribution of critics is left-skewed, with the movies in the data set generally reviewed favorably by critics (higher critics scores). Given the apparent skewness, the center is best described by the median score of 63.5. The IQR, which describes the spread of the middle 50% of the distribution, is 57.8 points (89 - 31.2), so there is a lot of variability in the critics scores for the movies in the data. There are no apparent outliers, but there are two notable observations of movies that have perfect critics scores of 100 (observed in the raw data). There are no missing values of critics score.

Your turn!

Use the histogram in Figure 4.1 and summary statistics in Table 4.1 to describe the distribution of the response variable audience.

4.2.2 Bivariate EDA

After we’ve examined the variables individually, we can explore relationships between variables. We’ll focus on the relationship between the response and predictor variable we’re studying; however, there may be other variable relationships we want to understand to provide context to our analysis results.

Similar to univariate EDA, we will visualize the relationship between variables and calculate a summary statistic to better quantify the association. A scatterplot of the audience score versus critics score is shown in Figure 4.2. When making the scatterplot, put the predictor variable on the \(x\)-axis (horizontal axis) and the response variable on the \(y\)-axis (vertical axis).

Code
ggplot(data = movie_scores, mapping = aes(x = critics, y = audience)) +
  geom_point(alpha = 0.5) + 
  labs(x = "Critics Score" , 
       y = "Audience Score") +
  theme_bw()
Figure 4.2: Relationship between critics and audience scores on Rotten Tomatoes.

The correlation, \(r\), is a measure of the direction and strength of the linear relationship between two variables. It ranges from -1 to 1, with \(r \approx -1\) meaning a very strong negative relationship, \(r \approx 1\) meaning a very strong positive relationship, and \(r \approx 0\) meaning a very weak to no linear relationship. The correlation between critics and audience score is \(r\) = 0.78.
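As a sketch of what the correlation measures, the code below computes \(r\) from its usual definition (the sum of the products of the standardized \(x\) and \(y\) values, divided by \(n - 1\)) and compares it to R's built-in `cor()` function. The vectors are toy values made up for illustration, not the movie scores data.

```r
# Toy vectors, made up for illustration (not the movie data)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Correlation from its definition: sum of products of z-scores over n - 1
# (n - 1 matches the denominator used by sd())
n   <- length(x)
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)
r_by_hand <- sum(z_x * z_y) / (n - 1)

# Built-in Pearson correlation for comparison
r_builtin <- cor(x, y)

c(by_hand = r_by_hand, builtin = r_builtin)
```

With the movie scores data loaded, the analogous call would be `cor(movie_scores$critics, movie_scores$audience)`.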

Similar to univariate EDA, we include several features when describing the relationship between the two variables. These are shape, direction (if applicable), strength, outliers, and other interesting features. Below is an explanation of each component, followed by a description of the relationship between critics and audience score.

  • Shape: The shape is the general pattern of the points in the scatterplot. The most common shapes we may see are linear, quadratic, cubic, and no discernible pattern.

  • Direction: If the shape is linear, then we can describe the overall direction of the points. The direction is positive if \(y\) tends to increase as \(x\) increases, negative if \(y\) tends to decrease as \(x\) increases, and no direction if \(y\) is approximately the same for all values of \(x\). The sign of the correlation coincides with the direction of linear relationships.

  • Strength: The strength is a measure of how closely the observations follow the overall pattern or shape. Points that are tightly clustered together indicate a stronger relationship than points that are more dispersed. When the shape is linear, the correlation quantifies the strength of the relationship between the variables.

  • Outliers: As in univariate EDA, outliers are points that do not follow the general pattern of the data. These can be points that are outliers in the \(x\)-direction, the \(y\)-direction, or both. These points are important to identify in the EDA, as they may influence the regression model. We’ll talk more about the impact of outliers on the regression model in a later section.

  • Other interesting features: There may be other features of the scatterplot that are interesting to highlight. For example, there may be different variability (spread) in the points as \(x\) increases. These kinds of features may help provide additional explanation as we assess the fit of the regression model and understand the estimates.

Below is a summary of the bivariate EDA for the movie scores data.

There is a positive, linear relationship between the critics and audience scores for movies on Rotten Tomatoes. The correlation between these two variables is 0.78, indicating the relationship is moderately strong. Therefore, we can generally expect the audience score to be higher for movies with higher critics scores. There are no apparent outliers, but there does appear to be more variability in the audience score for movies with lower critics scores than for those with higher critics scores.

4.3 Fitting the regression line

As we saw in Section 4.2, we can use visualizations and summary statistics to describe the relationship between two variables. The EDA, however, does not give enough information to reliably predict the response for a given value of the predictor or reliably explain how much the response is expected to change as the predictor changes. Therefore, we will use regression to fit a model to the data and more robustly quantify the relationship between the response and predictor variable. More specifically, we will fit a model of the form

\[ Y = \beta_0 + \beta_1 X + \epsilon \tag{4.1}\]

Equation 4.1, called a simple linear regression (SLR) model, is the equation of a line representing the relationship between one predictor variable and one quantitative response variable. In later chapters, we will consider multiple linear regression models with two or more predictors.

We fit regression models to conduct two types of tasks:

  • Prediction: Finding the expected value of the response variable for given values of the predictor variable(s).
  • Inference: Drawing conclusions about the relationship between the response and predictor variable(s).

Your turn!

We will fit a simple linear regression line to describe the relationship between critics and audience score.

  • What is an example of a prediction question regression can help us answer?\(^{3}\)

  • What is an example of an inference question regression can help us answer?\(^{4}\)

4.3.1 Statistical model for SLR

Suppose we have a response variable \(Y\) and a predictor variable \(X\). The values of the response variable \(Y\) can be generated in the following way:

\[ Y = Model + Error \]

More specifically, we can define the model as a function of the predictor \(X\) such that

\[ Y = f(X) + \epsilon \tag{4.2}\]

The function \(f(X)\) that describes the relationship between the response and predictor variable is the regression model. The error, \(\epsilon\), is how much the actual value of the response \(Y\) deviates from the value produced by the regression model.

Equation 4.2 is the general form of a model of the relationship between \(X\) and \(Y\). When we use simple linear regression for the relationship between a quantitative response variable and one predictor variable, the function \(f(X)\) in Equation 4.2 is

\[ f(X) = \mu_{Y|X} = \beta_0 + \beta_1X \tag{4.3}\]

and the error terms \(\epsilon\) from Equation 4.2 are normally distributed with a mean of 0 and variance \(\sigma_{\epsilon}^2\). The full specification of the simple linear regression model is in Equation 4.4. This may look familiar, as the function was originally presented in Equation 4.1; here we have completed the specification of the simple linear regression model with the inclusion of the distribution of the errors.

\[ Y = \beta_0 + \beta_1 X + \epsilon \hspace{7mm} \epsilon \sim N(0, \sigma_{\epsilon}^2) \tag{4.4}\]

This model is known as the statistical model, or data-generating model, because it tells us exactly how to generate the values of the response \(Y\) given values of the predictor in the population. In this model, \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\sigma^2_{\epsilon}\) is the variance of the errors (its square root, \(\sigma_{\epsilon}\), is the standard error). In practice we don’t know the exact values of \(\beta_0\), \(\beta_1\), and \(\sigma_{\epsilon}^2\), so our goal is to use the sample data to estimate these values. In the remainder of this chapter we’ll focus on estimating \(\beta_0\) and \(\beta_1\). We’ll discuss \(\sigma_{\epsilon}^2\) in more detail in the next chapter.

In Equation 4.3, the mean value of \(Y\) for a given value of \(X\) is \(\mu_{Y|X} = \beta_0 + \beta_1X\). Thus, when a given value of the predictor \(x_i\) is input into Equation 4.3, it outputs \(\mu_{Y|x_i} = \beta_0 + \beta_1x_i\), the mean value of the response when the predictor equals \(x_i\).
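One way to see why Equation 4.4 is called a data-generating model is to simulate from it. The sketch below picks hypothetical values for \(\beta_0\), \(\beta_1\), and \(\sigma_{\epsilon}\) (chosen only for illustration, not estimated from the movie scores data) and generates responses as the mean \(\mu_{Y|X}\) plus a normally distributed error.

```r
set.seed(4)  # fixed only so the sketch is reproducible

# Hypothetical population parameters, chosen for illustration
beta_0  <- 30    # intercept
beta_1  <- 0.5   # slope
sigma_e <- 12    # standard deviation of the errors

n       <- 200
x       <- runif(n, min = 0, max = 100)       # predictor values
mu_y_x  <- beta_0 + beta_1 * x                # mean response given x (Equation 4.3)
epsilon <- rnorm(n, mean = 0, sd = sigma_e)   # errors ~ N(0, sigma_e^2)
y       <- mu_y_x + epsilon                   # responses (Equation 4.4)

# The errors average out near 0, so y scatters around the line mu_y_x
mean(y - mu_y_x)
```

In a real analysis we only observe `x` and `y`; the parameters and the errors are unknown, which is why they must be estimated from the sample.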

Terminology

  • The intercept \(\beta_0\) is the mean value of the response \(Y\) when \(X = 0\).

  • The slope \(\beta_1\) is the change in the mean value of \(Y\) when \(X\) increases by 1.

  • The error \(\epsilon_i\) is how much the value of the response variable \(Y_i\) deviates from the mean value of the response given \(X_i\).

  • The standard error \(\sigma_{\epsilon}\) is the standard deviation of the error terms, a measure of their variability.

Now that we have specified the form of the model, we can estimate \(\beta_0\) and \(\beta_1\) (and later on \(\sigma_{\epsilon}^2\)) using sample data.

4.3.2 Determining if SLR is appropriate

Before doing any more calculations, we need to determine if the simple linear regression model is a reasonable choice for the data based on what we know about the data and what we’ve observed from the exploratory data analysis. Below are a few considerations:

  • Will a simple linear regression model be practically useful? Does quantifying and interpreting the relationship between the variables make sense in this scenario?
  • Is the shape of the relationship approximately linear? Does a line reasonably describe the relationship?
  • Do the observations in the data represent the population of interest, or are there biases that should be addressed when drawing conclusions from the analysis?

Warning

Mathematical equations or statistical software can be used to fit a linear regression model between any two quantitative variables. It is up to the judgment of the data scientist to determine if it is a reasonable choice to proceed with a linear regression model or if doing so could lead to misleading conclusions.

If the answer is “no” to any of the questions above, consider if a different analysis technique is better for the data, or proceed with caution if using regression. If you proceed with regression, be transparent about some of the limitations of the conclusions.

As described in Section 4.1, the goal of this analysis is to understand the relationship between the critics and audience score. Therefore, there is a practical use for fitting the regression model. We observed from Figure 4.2 that the relationship between the two variables is approximately linear, so it could reasonably be summarized using a line. Lastly, the data set includes all movies in 2014 and 2015 that were rated on popular movie ratings websites, so we can reasonably conclude the sample is representative of the population of movies on Rotten Tomatoes. Therefore, we are comfortable drawing conclusions about the population based on the analysis of our sample data.

The simple linear regression model for the movie scores data is

\[ audience = \beta_0 + \beta_1~critics + \epsilon \tag{4.5}\]

In the upcoming sections we’ll talk about this form of the model, how to estimate the coefficients of the model \(\beta_0\) and \(\beta_1\), how to interpret the coefficients, and how to use the model for prediction.

4.4 Estimating slope and intercept

Ideally, we would have data from the entire population of movies rated on Rotten Tomatoes in order to calculate the exact values for \(\beta_1\), \(\beta_0\), and \(\sigma_{\epsilon}^2\). In reality, we don’t have access to the data from the whole population, but we can use the sample to obtain the estimated regression equation in Equation 4.6.

\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X, \tag{4.6}\]

where \(\hat{\beta}_0\) is the estimated intercept, \(\hat{\beta}_1\) is the estimated slope, and \(\hat{Y}\) is the estimated response.

Specifically for the movie scores data, the estimated regression equation is

\[ \hat{audience}_i = 32.3155 + 0.5187~ critics_i \tag{4.7}\]

The subscript \(i\) denotes the \(i^{th}\) observation. In this equation, 32.3155 is the estimated value for the intercept \(\beta_0\), 0.5187 is the estimated value for the slope \(\beta_1\), and \(\hat{audience}_i\) is the expected audience score when the critics score equals \(critics_i\). Notice that Equation 4.6 and Equation 4.7 do not have error terms. The output from the regression equation is the expected mean value of the response for a given value of the predictor. Therefore, when we discuss the values of the response estimated by simple linear regression, what we are really talking about is what we expect the value of the response to be, on average, for a given value of the predictor variable.

Figure 4.3: Least squares regression line for the relationship between critics and audience scores

If we think about the data, we know that the value of the response is not necessarily the same for all observations with the same value of the predictor. In terms of the movie scores, we wouldn’t expect (nor do we observe) the same audience score for every movie with a critics score of 70. We know there are factors other than the critics score that are related to how an audience reacts to a movie. Our data, however, only contain the critics score, so we are unable to capture these additional factors in our regression equation. This is where the error terms come back in!

Once we have the estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) for the regression equation, we can calculate how much the estimated values of the response produced by the regression model differ from the actual values of the response observed in the data. This difference is called the residual.

Terminology

The residual is the difference between the observed and predicted values of the response for a given observation.

\[ residual_i = observed_i - predicted_i \]

We use \(e_i\) to represent the residual for the \(i^{th}\) observation. Equation 4.8 shows the equation of the residual written in statistical notation.

\[ e_i = Y_i - \hat{Y}_i \tag{4.8}\]

In the case of the movie scores data, the residual is the difference between the actual audience score and the audience score estimated by Equation 4.7. For example, the 2015 movie Avengers: Age of Ultron received a critics score of 74. Therefore, using Equation 4.7, the estimated (predicted) audience score is

\[ 32.3155 + 0.5187 \times 74 = 70.6993. \]

The observed audience score is 86, and the residual is

\[ e_i = 86 - 70.6993 = 15.3007 \]
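The same calculation can be sketched in R using the coefficients from Equation 4.7 and the scores quoted above. The object names are our own and are only for illustration.

```r
# Estimated coefficients from Equation 4.7
b0 <- 32.3155   # intercept
b1 <- 0.5187    # slope

# Scores for Avengers: Age of Ultron, as given in the text
critics_score     <- 74
observed_audience <- 86

# Predicted audience score and residual (observed - predicted, Equation 4.8)
predicted_audience <- b0 + b1 * critics_score
residual           <- observed_audience - predicted_audience

predicted_audience   # 70.6993
residual             # 15.3007
```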

Your turn!

Would you rather see a movie that has a positive or negative residual? Why?

4.4.1 Least squares regression

As shown in Figure 4.4, there are many possible lines (infinitely many, in fact) that we could use to describe the relationship between the critics and audience scores. So how did we determine that the line that “best” fits the data is Equation 4.7? We’ll rely on the residuals to help us answer this question.

Figure 4.4

The residuals, represented by the vertical dotted lines in Figure 4.5, are a measure of the “error”. The line that “best” fits the data is the one that results in the smallest total error. One way we could approach this is to add up all the residuals for each possible line in Figure 4.4 and choose the one that has the smallest sum. Notice, however, that for lines that most closely align with the pattern of the data points, there is an approximately equal distribution of points above and below the line. Thus, as we’re trying to compare lines that pretty closely fit the data, we’d expect the residuals to sum to zero or very close to zero. This would make it difficult, then, to determine a best fit line.

Figure 4.5

Instead of using the sum of the residuals, we will choose the line that has the smallest sum of the squared residuals, \(e_1^2 + e_2^2 + \dots + e_n^2\), where \(n\) is the number of observations in the data. This process is called least squares regression.

Terminology

The least squares regression line is the line, \(\hat{\beta}_0 + \hat{\beta}_1 X\), that minimizes the sum of the squared residuals.
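The sketch below illustrates why the plain sum of the residuals cannot pick out a best line while the sum of squared residuals can. It uses simulated data (not the movie scores) and two candidate lines that both pass through the point of averages; `line_summary()` and `through_means()` are hypothetical helper names.

```r
set.seed(1)
x <- runif(50, 0, 100)
y <- 30 + 0.5 * x + rnorm(50, sd = 12)   # simulated, not the movie data

# Sum of residuals and sum of squared residuals for the line b0 + b1 * x
line_summary <- function(b0, b1) {
  e <- y - (b0 + b1 * x)
  c(sum_resid = sum(e), sum_sq_resid = sum(e^2))
}

# Intercept that makes a line with slope b1 pass through (mean(x), mean(y))
through_means <- function(b1) mean(y) - b1 * mean(x)

# Two very different lines: a flat one and one with slope 0.5
rbind(
  flat  = line_summary(through_means(0),   0),
  steep = line_summary(through_means(0.5), 0.5)
)
```

Both lines have residuals that sum to (numerically) zero, so that criterion cannot separate them. The sums of squared residuals differ, which is what lets least squares choose between candidate lines.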

Now that we know the goal is to minimize the sum of squared residuals, how do we find the values of \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that do so? Recall that \(e_i\), the residual of the \(i^{th}\) observation, is \(y_i - \hat{y}_i\), where \(\hat{y}_i\) is the estimated response. Filling in the regression equation we have

\[ \begin{aligned} e_i &= y_i - \hat{y}_i \\ &= y_i - (\beta_0 + \beta_1x_i) \end{aligned} \tag{4.9}\]

Extending Equation 4.9 to all observations and taking the sum of the squared residuals, we have

\[ \begin{aligned} \sum_{i=1}^n e_i^2 &= e_1^2 + e_2^2 + \dots + e_n^2 \\ &= [y_1 - (\beta_0 + \beta_1x_1)]^2 + [y_2 - (\beta_0 + \beta_1x_2)]^2 + \dots + [y_n - (\beta_0 + \beta_1x_n)]^2 \end{aligned} \tag{4.10}\]

Using calculus, the values of \(\beta_0\) and \(\beta_1\) that minimize Equation 4.10 are

\[ \hat{\beta}_1 = r\frac{s_Y}{s_X} \hspace{10mm} \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \tag{4.11}\]

where \(\bar{X}\) and \(\bar{Y}\) are the means of the predictor and response, respectively, \(s_X\) and \(s_Y\) are the standard deviations of the predictor and response, respectively, and \(r\) is the correlation between the response and predictor variables. See Section A.1 for the full details of the derivation from Equation 4.10 to Equation 4.11.
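As a rough check of Equation 4.11, the sketch below plugs in the rounded summary statistics from Table 4.1 and the correlation reported in Section 4.2.2. Because these inputs are rounded, the results only approximate the coefficients in Equation 4.7, which were presumably computed from the full data.

```r
# Rounded summary statistics from Table 4.1 and Section 4.2.2
x_bar <- 60.8   # mean critics score
y_bar <- 63.9   # mean audience score
s_x   <- 30.2   # standard deviation of critics score
s_y   <- 20.0   # standard deviation of audience score
r     <- 0.78   # correlation between critics and audience scores

# Least squares estimates from Equation 4.11
slope     <- r * s_y / s_x
intercept <- y_bar - slope * x_bar

round(c(intercept = intercept, slope = slope), 3)
```

The values land close to the intercept of 32.3155 and slope of 0.5187 in Equation 4.7; the small differences come from rounding the inputs.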

Properties of least squares regression

  • The regression line goes through the center of mass point, the coordinates corresponding to average \(X\) and average \(Y\): \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}\)

  • The slope has the same sign as the correlation coefficient: \(\hat{\beta}_1 = r\frac{s_Y}{s_X}\)

  • The sum of the residuals is zero: \(\sum_{i=1}^n e_i = 0\)

  • The residuals and \(X\) values are uncorrelated
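These properties can be checked numerically. The sketch below fits a least squares line with R's `lm()` function to simulated data (not the movie scores) and verifies each property up to floating-point error.

```r
set.seed(2)
x <- runif(100, 0, 100)
y <- 30 + 0.5 * x + rnorm(100, sd = 12)   # simulated, not the movie data

fit <- lm(y ~ x)    # least squares fit
b   <- coef(fit)    # b[1] is the intercept, b[2] is the slope
e   <- resid(fit)   # residuals

# 1. The line passes through (mean(x), mean(y))
passes_through_means <- isTRUE(all.equal(unname(b[1] + b[2] * mean(x)), mean(y)))

# 2. The slope equals r * s_Y / s_X, so it has the same sign as r
slope_matches_r <- isTRUE(all.equal(unname(b[2]), cor(x, y) * sd(y) / sd(x)))

# 3. The residuals sum to zero
resid_sum_zero <- abs(sum(e)) < 1e-8

# 4. The residuals are uncorrelated with x
resid_uncorrelated <- abs(cor(e, x)) < 1e-8

c(passes_through_means, slope_matches_r, resid_sum_zero, resid_uncorrelated)
```

With the movie scores data loaded, the analogous fit would be `lm(audience ~ critics, data = movie_scores)`.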

4.5 Interpreting slope and intercept

The slope is the estimated change in the response for each unit increase in the predictor variable. But what do we mean by “estimated change”? Recall that the output from the regression equation is \(\hat{\mu}_{Y|X}\), the estimated mean of the response \(Y\) for a given value of \(X\). Thus, as we consider the slope or the “steepness” of the regression line or what happens as we move along the line, we are really looking at how the mean value of the response variable changes as the value of the predictor variable \(X\) changes. In other words, the slope is a measure of how much the response variable is expected to change, on average, for each unit increase in the predictor.

It is good practice to write the interpretation of the slope in the context of the data, so that it can be more easily understood by a reader who isn’t as familiar with the data as you are as the data scientist. “In the context of the data” means

  • using the variable names or meaningful descriptions of the variables,
  • including the units in the interpretation, and
  • indicating the population for which this model applies.

Your turn!

Recall Equation 4.7. What is the value of the slope? Interpret this value in the context of the data.\(^{5}\)

The intercept is the estimated value of the response variable when the predictor variable equals zero \((X = 0)\). On a scatterplot of the response and predictor variable such as Figure 4.3, this is the point where the regression line crosses the \(y\)-axis. Similar to the slope, the “estimated value” is more specifically the estimated mean value of the response variable when \(X = 0\) (\(\hat{\mu}_{Y|X = 0}\)).

We always need a value of the intercept to get the line of best fit using least squares regression. The intercept, however, does not always have a meaningful interpretation. The intercept has a meaningful interpretation if two conditions are met.

  1. It makes sense for the predictor variable to take values at or near zero.

  2. There are observations with the predictor near zero in the data.

If either of these does not hold, then it is not meaningful, and potentially misleading, to interpret the intercept.

Your turn!

Recall Equation 4.7. What is the value of the intercept? Interpret this value in the context of the data. Is the interpretation of the intercept meaningful?\(^{6}\)

Don’t make causal (or declarative) statements!

Avoid using causal language and making declarative statements when interpreting the slope and intercept. Remember the slope and intercept are estimates describing what we expect the relationship between the response and predictor to be based on the sample data and simple linear regression model. They do not tell us exactly what will happen in the data. We would need to analyze all data in the population to know the exact values!

4.6 Prediction

In Section 4.3, we introduced two main uses for a regression line: prediction and inference. We will talk more about inference in the next chapter, so let’s focus on prediction for now.

When we use the regression model for prediction, we will get the estimated value of the response based on a given value of the predictor. Let’s take a look at the model predictions for two movies released in 2023.

Barbie, directed by Greta Gerwig, was released in theaters on July 21, 2023. This movie was widely praised by critics, as it has a critics score of 88. Based on Equation 4.7, the predicted audience score is

\[ \begin{aligned} \hat{audience} &= 32.3155 + 0.5187 \times 88 \\ &= \textbf{77.9611} \\ \end{aligned} \]

From the snapshot of the Barbie Rotten Tomatoes page (Figure 4.6), we see the actual audience score\(^{7}\) is 83. Therefore, the model underpredicted the audience score by about 5 points (83 - 77.9611). Perhaps this isn’t surprising given this film’s massive success!

Figure 4.6: Source: https://www.rottentomatoes.com/m/barbie

Your turn!

Asteroid City, directed by Wes Anderson, was released in theaters on June 23, 2023. The critics score\(^{8}\) for this movie was 75.

  • What is the predicted audience score?

  • The actual audience score is 62. Did the model overpredict or underpredict? What is the residual?\(^{9}\)

We should only use the regression model to predict the response for values of the predictor within the range of the observed data used to fit the regression model. Using the model to predict for values far outside this range is called extrapolation. We only know the relationship between the response and predictor within the range of values in our data set. We cannot safely assume that the linear relationship we quantify by our model holds far outside of this range. Therefore, if we extrapolate, we are producing unreliable predictions that could be misleading if the linear relationship, in fact, does not hold outside the range of the data.

Do not extrapolate!

Do not use the regression model to predict for values of the predictor far outside the range in the data used to fit the model. This is extrapolation and can result in unreliable predictions.
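One way to guard against accidental extrapolation is to compare new predictor values to the observed range before predicting. The sketch below is a minimal illustration: in_observed_range() is a hypothetical helper written for this example, and the values in observed_x are made up. It flags any value outside the observed range; deciding how far outside is "too far" remains a judgment call.

```r
# Hypothetical helper: TRUE when a new predictor value lies within the
# range of the predictor values used to fit the model
in_observed_range <- function(new_x, observed_x) {
  new_x >= min(observed_x) & new_x <= max(observed_x)
}

# Made-up predictor values, for illustration only
observed_x <- c(12, 25, 40, 58, 73, 90)

in_observed_range(c(50, 95, 5), observed_x)  # TRUE FALSE FALSE
```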

4.7 Model assessment

We have shown how a simple linear regression model can be used to describe the relationship between a response and predictor variable and to predict new values of the response. Now we will look at two statistics that will help us make an assessment about how well the model actually fits the data and thus explains variability in the response.

4.7.1 Root Mean Square Error

The Root Mean Square Error (RMSE), shown in Equation 4.12, is a measure of the average difference between the observed and predicted response.

\[ RMSE = \sqrt{\frac{\sum_{i=1}^n(y_i - \hat{y}_i)^2}{n}} = \sqrt{\frac{\sum_{i=1}^ne_i^2}{n}} \tag{4.12}\]

This measure is especially useful if prediction is the primary modeling objective. The RMSE takes values from 0 to \(\infty\) (infinity) and has the same units as the response variable.
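Equation 4.12 translates directly into a short R function. The sketch below is an illustration only: the function name rmse() and the observed and predicted values are made up for this example.

```r
# RMSE as in Equation 4.12: square root of the mean squared residual
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# Made-up observed and predicted responses, for illustration only
observed  <- c(60, 75, 82, 48)
predicted <- c(64, 70, 80, 50)

rmse(observed, predicted)  # 3.5
```

This version divides by \(n\), matching Equation 4.12. The residual standard error reported by summary() for a model fit with lm() divides by \(n - 2\) instead, so the two values differ slightly.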

Your turn!

Do higher or lower values of RMSE indicate a better model fit?¹⁰

There is no universal threshold of RMSE to determine whether the model is a good fit. In fact, the RMSE is most useful when comparing multiple models. If you are using the RMSE to assess the fit of a single model, however, you should take into account two considerations:

  1. What is the range (max value - min value) of the response variable? How does the RMSE compare to the range of the data? On average, what is the error percentage?
  2. What is a reasonable error threshold based on the subject matter and analysis objectives? For example, you may be willing to use a model with higher RMSE for a low stakes analysis objective than a high stakes objective with major implications.

Your turn!

The RMSE for the movie scores model is 12.452. The range for the audience score is 74. Do you think the model is a good fit and the critics score is a useful predictor for the audience score? Explain your response.

4.7.2 Analysis of variance and \(R^2\)

The coefficient of determination, \(R^2\), is a measure of the percentage of variability in the response variable that is explained by the predictor variable. Before talking more about how \(R^2\) is used for model assessment, let’s discuss how this percentage is determined.

There is variability in the response variable, as we see in the exploratory data analysis in Section 4.2. Analysis of Variance (ANOVA), Equation 4.13, is the process of partitioning the sources of variability.

\[ \text{Total variability} = \text{Explained variability} + \text{Unexplained variability} \tag{4.13}\]

The variability in the response variable is from two sources:

  1. Explained variability (Model): This is the variability in the response variable that can be explained from the model. In the case of simple linear regression, it is the variability in the response variable that can be explained by the predictor variable.

  2. Unexplained variability (Residuals): This is the variability in the response variable that is left unexplained after the model is fit. This can be understood by assessing the variability in the residuals.

The variability in the response variable and the contribution from each source are quantified using sums of squares. In general, a sum of squares (SS) is a measure of how far the observations are from a given point, for example the mean. Using sums of squares, we can quantify the values from Equation 4.13.

Let \(SST\) = Sum of Squares Total, \(SSM\) = Sum of Squares Model, and \(SSR\) = Sum of Squares Residuals. Then,

\[ \begin{aligned} SST &= SSM + SSR \\[10pt] \sum_{i=1}^n (y_i - \bar{y})^2 &= \sum_{i=1}^n(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n(y_i - \hat{y}_i)^2 \end{aligned} \tag{4.14}\]

Let’s break down the components of Equation 4.14.

Sum of Squares Total (SST) \(= \sum_{i=1}^n(y_i - \bar{y})^2\) is an overall measure of how far the observed values of the response variable are from the mean value of the response, \(\bar{y}\). The formula for SST may look familiar, as it is \((n-1)s_y^2\), which equals \((n-1)\) times the variance of \(y\). SST can be partitioned into two pieces, Sum of Squares Model (SSM) and Sum of Squares Residuals (SSR).

Sum of Squares Model (SSM) \(= \sum_{i=1}^n(\hat{y}_i - \bar{y})^2\), is an overall measure of how much the value of the response predicted by the model (the expected mean value of the response given the predictor) differs from the overall mean value of the response. This indicates how much the observed response’s deviation from the mean is accounted for by knowing the value of the predictor.

Lastly, the Sum of Squares Residuals (SSR) \(= \sum_{i=1}^n(y_i - \hat{y}_i)^2\) is an overall measure of how much the observed values of the response differ from what’s expected based on the model. This accounts for sources of variability other than the predictor variable.

We use the sums of squares to calculate the coefficient of determination, \(R^2\):

\[ R^2 = \frac{SSM}{SST} = 1 - \frac{SSR}{SST} \tag{4.15}\]

Equation 4.15 shows that \(R^2\) is the proportion of variability in the response (SST) that is explained by the model (SSM). Note that \(R^2\) is calculated as a proportion between 0 and 1, but is often reported as a percentage between 0% and 100%.
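The partition in Equation 4.14 and the two forms of \(R^2\) in Equation 4.15 can be checked numerically. The sketch below is an illustration under stated assumptions: it uses R's built-in cars data (stopping distance versus speed) as a stand-in, not the movie_scores data, and the object names (fit, sst, ssm, ssr, r_squared) are chosen for this example.

```r
# Stand-in example: stopping distance vs. speed from the built-in cars data
fit <- lm(dist ~ speed, data = cars)

y     <- cars$dist   # observed response
y_hat <- fitted(fit) # predicted response
y_bar <- mean(y)     # mean of the response

sst <- sum((y - y_bar)^2)      # total variability
ssm <- sum((y_hat - y_bar)^2)  # variability explained by the model
ssr <- sum((y - y_hat)^2)      # variability left in the residuals

all.equal(sst, ssm + ssr)                 # Equation 4.14
all.equal(sst, (length(y) - 1) * var(y))  # SST = (n - 1) * variance of y

r_squared <- ssm / sst                    # Equation 4.15
all.equal(r_squared, 1 - ssr / sst)
all.equal(r_squared, summary(fit)$r.squared)
```

Each all.equal() call compares two ways of computing the same quantity, so the checks confirm the identities without relying on any particular numeric value.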

Your turn!

Do higher or lower values of \(R^2\) indicate a better model fit?¹¹

Similar to RMSE, there is no universal threshold for what makes a “good” \(R^2\) value. When using \(R^2\) to determine if the model is a good fit, take into account what might be a reasonable fit given the subject matter.

Your turn!

The \(R^2\) for the movies model is 0.611.

  • Interpret this value in the context of the data.¹²

  • Do you think the critics score is a useful predictor for the audience score? Explain your response.

4.8 Simple linear regression in R

We fit linear regression models using the lm function, which is part of the stats R package (R Core Team 2024). We then use the tidy function from the broom R package (Robinson, Hayes, and Couch 2023) to display the results in a tidy format in which each row is a term in the model and each column is a property of that term.

We begin by using the library function to load broom into the R environment. The stats package is automatically loaded into the R environment when R is opened, so we don’t need to load it here.

library(broom)

Then, we fit the linear model of the relationship using the movie_scores data with audience as the response and critics as the predictor (Equation 4.7).

lm(audience ~ critics, data = movie_scores)

Call:
lm(formula = audience ~ critics, data = movie_scores)

Coefficients:
(Intercept)      critics  
    32.3155       0.5187  

Next, we display the model results in a tidy format. We build upon the code above by saving the model in an object called movie_fit and passing that object to tidy. We will also use movie_fit to calculate predictions.

movie_fit <- lm(audience ~ critics, data = movie_scores)  # (1)
tidy(movie_fit)                                           # (2)

(1) Save the model output as movie_fit.
(2) Display the model output in a tidy format.

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   32.3      2.34        13.8 4.03e-28
2 critics        0.519    0.0345      15.0 2.70e-31

Notice that the resulting model is the same as Equation 4.7, which we calculated based on Equation 4.11.

Now that we have fit the model and saved the results as movie_fit, let’s use it for prediction. Below is the code to predict the audience score for Barbie as we showed earlier.

barbie_movie <- tibble(critics = 88)  # (1)
predict(movie_fit, barbie_movie)      # (2)

(1) Create a tibble, a data frame that modifies “some older behaviours to make life a little easier” (Wickham, Çetinkaya-Rundel, and Grolemund 2023), containing the critics score for the Barbie movie. Note that the name of the column in the tibble must exactly match the name of the variable used in the code to fit the model.
(2) The first argument of the predict() function is the object containing the model fit. The second argument is the tibble created in line (1).

       1 
77.95917 

We can produce predictions for multiple movies by putting multiple values of the predictor in the tibble. In the code below we produce predictions for Barbie and Asteroid City.

new_movies <- tibble(critics = c(88, 75))  # (1)
predict(movie_fit, new_movies)             # (2)

(1) Create a tibble whose critics column holds the values of the predictor for the two observations we want to predict. The column uses the same variable name as the predictor used to fit the model.
(2) Calculate predictions for each value in new_movies.

       1        2 
77.95917 71.21636 
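
The prediction for Barbie from predict() (77.95917) differs slightly from the hand calculation in Section 4.6 (77.9611) because predict() uses the unrounded coefficient estimates, while the hand calculation uses coefficients rounded to four decimal places. The sketch below illustrates the same effect with R's built-in cars data as a stand-in, since the movie_scores data are not reproduced here. The object names are chosen for this example, and the coefficients are rounded to one decimal place only to make the difference easy to see.

```r
# Stand-in model fit on the built-in cars data
fit <- lm(dist ~ speed, data = cars)
new_obs <- data.frame(speed = c(10, 20))

# predict() uses the full-precision coefficient estimates
pred_full <- predict(fit, new_obs)

# Same calculation by hand with the unrounded coefficients
b <- coef(fit)
pred_manual <- b[["(Intercept)"]] + b[["speed"]] * new_obs$speed

# By hand with coefficients rounded to one decimal place
b_round <- round(b, 1)
pred_rounded <- b_round[["(Intercept)"]] + b_round[["speed"]] * new_obs$speed

all.equal(unname(pred_full), pred_manual)  # TRUE
pred_full - pred_rounded                   # small, nonzero differences
```

The example passes a base R data.frame to predict(), which works the same way as a tibble as long as the column name matches the predictor used to fit the model.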

  1. The population is all movies on the Rotten Tomatoes website. The sample is the set of 146 movies in our data set.↩︎

  2. The response variable is audience, the audience score. The predictor variable is critics, the critics score.↩︎

  3. Example: What do we expect the audience score to be for movies with a critics score of 75?↩︎

  4. Example: Is the critics score a useful predictor of the audience score?↩︎

  5. The slope is 0.5187. For each additional point in the critics score, the audience score is expected to increase by 0.5187 points, on average.↩︎

  6. The intercept is 32.3155. The expected audience score for movies with a critics score of 0 is 32.3155 points. This interpretation is meaningful, because it is plausible for a movie to have a critics score of 0 and there are observations with scores around 5, which is near 0 on the 0 - 100 point scale.↩︎

  7. Source: https://www.rottentomatoes.com/m/barbie Accessed on August 29, 2023.↩︎

  8. Source: https://www.rottentomatoes.com/m/asteroid_city Accessed on August 29, 2023.↩︎

  9. The predicted audience score is 32.3155 + 0.5187 * 75 = 71.218. The model overpredicted. The residual is 62 - 71.218 = -9.218.↩︎

  10. Lower values indicate a better fit, with 0 indicating a perfect predictor of the response.↩︎

  11. Higher values of \(R^2\) indicate a better model fit, as it means more of the variability in the response is being explained by the model.↩︎

  12. About 61.1% of the variability in the audience score can be explained by the model (critics score).↩︎