4  Simple linear regression

Learning goals

  • Use exploratory data analysis to assess whether a simple linear regression is an appropriate model to describe the relationship between two variables
  • Estimate the slope and intercept for a simple linear regression model
  • Interpret the slope and intercept in the context of the data
  • Use the model to compute predictions and residuals
  • Evaluate model performance using RMSE and $R^2$
  • Conduct simple linear regression using R

4.1 Introduction: Movie ratings

Reviews from movie critics can be helpful when determining whether a movie is high quality and well-made; however, it can sometimes be challenging to determine whether regular audience members will like a movie based on critics' reviews. We would like a way to better understand the relationship between what movie critics and regular movie-goers think about a movie, and ultimately predict how an audience will rate a movie based on its score from movie critics.

To do so, we will analyze data that contain the critics scores and audience scores for 146 movies released in 2014 and 2015. The scores are for every movie released in these years that has “a rating on Rotten Tomatoes, a RT User rating, a Metacritic score, a Metacritic User score, an IMDb score, and at least 30 fan reviews on Fandango” (). The analysis in this chapter focuses on scores from Rotten Tomatoes, a website for information and ratings on movies and television shows. The data were originally analyzed in the article “Be Suspicious of Online Movie Ratings, Especially Fandango’s” () on the former data journalism website FiveThirtyEight. The data are available in movie_scores.csv. The data set was adapted from the fandango data frame in the fivethirtyeight R package ().

We will focus on two variables for this analysis:

  • critics_score: The percentage of critics who have a favorable review of the movie. This is known as the “Tomatometer” score on the Rotten Tomatoes website. The possible values are 0 - 100.

  • audience_score: The percentage of users (regular movie-goers) on Rotten Tomatoes who have a favorable review of the movie. The possible values are 0 - 100.

Our goal is to use simple linear regression to model the relationship between the critics score and audience score. We want to use the model to

  • describe how the audience score is expected to change as the critics score changes.
  • predict the audience score for a movie based on its critics score.

Recall that the response variable is the outcome of interest, meaning the variable whose values we want to predict and whose variability we want to understand. It is also known as the outcome or dependent variable and is represented as $Y$. The predictor variable(s) is the variable (or variables) used to understand variability in the response. It is also known as the explanatory or independent variable and is represented as $X$. The observed values of the response and predictor are represented as $y_i$ and $x_i$, respectively.

What is the response variable for the movie scores analysis? What is the predictor variable?

4.2 Exploratory data analysis

Recall that we begin analysis with exploratory data analysis (EDA) to better understand the data, the distributions of key variables, and relationships in the data before fitting the regression model. The exploratory data analysis here focuses on the two variables that will be in the regression model, critics_score and audience_score. In practice, however, we may want to explore other variables in the data set (for example, year in this analysis) to provide additional context later on as we interpret results from the regression model. We begin with univariate EDA (Section 4.2.1), exploring one variable at a time, then we’ll conduct bivariate EDA (Section 4.2.2) to look at the relationship between the critics scores and audience scores.

4.2.1 Univariate EDA

The univariate distributions of critics_score and audience_score are visualized in Figure 4.1 and summarized in Table 4.1.

(a) Critics score
(b) Audience score
Figure 4.1: Univariate distributions of critics_score and audience_score
Table 4.1: Summary statistics for critics_score and audience_score
Variable        Mean  SD    Min  Q1    Median (Q2)  Q3  Max  Missing
critics_score   60.8  30.2  5    31.2  63.5         89  100  0
audience_score  63.9  20.0  20   50.0  66.5         81  94   0

The distribution of critics_score is left-skewed, meaning the movies in the data set are generally more favorably reviewed by critics (more observations with higher critics scores). Given the apparent skewness, the center is best described by the median score of 63.5 points. The interquartile range (IQR), the spread of the middle 50% of the distribution, is 57.8 points ($Q_3 - Q_1 = 89 - 31.2$), so there is a lot of variability in the critics scores for the movies in the data. There are no apparent outliers, but we observe from the raw data that there are two notable observations of movies that have perfect critics scores of 100. There are no missing values of critics score.

Use the histogram in Figure 4.1 and summary statistics in Table 4.1 to describe the distribution of the response variable audience_score.

4.2.2 Bivariate EDA

Now let’s look at the relationship between critics_score and audience_score. We use visualizations and summary statistics to examine the relationship between two quantitative variables. A scatterplot of the audience score versus the critics score is shown in Figure 4.2. The predictor variable is on the x-axis (horizontal axis), and the response variable is on the y-axis (vertical axis).

Figure 4.2: Scatterplot of critics_score and audience_score

There is a positive, linear relationship between the critics scores and audience scores for the movies in our data. The correlation between these two variables is 0.78, indicating the relationship is strong. Therefore, we can generally expect the audience score to be higher for movies with higher critics scores. There are no apparent outliers, but there does appear to be more variability in the audience score for movies with lower critics scores than for those with higher critics scores.

4.3 Linear regression

In Section 4.2, we used visualizations and summary statistics to describe the relationship between two variables. The exploratory data analysis, however, does not tell us what the audience score is predicted to be for a given value of the critics score or how much the audience score is expected to change as the critics score changes. Therefore, we will fit a linear regression model to quantify the relationship between the two variables. Recall the general form of the linear regression model. More specifically, when we have one predictor variable, we will fit a model of the form

$$Y = \beta_0 + \beta_1 X + \epsilon, \qquad \epsilon \sim N(0, \sigma_\epsilon^2) \tag{4.1}$$

Equation 4.1, called a simple linear regression (SLR) model, is the equation to model the relationship between one quantitative response variable and one predictor variable. For now we will focus on models with one quantitative predictor variable. In later chapters, we will introduce models with two or more predictors (), categorical predictors (), and models with a categorical response variable ().

We are generally interested in using regression models for two types of tasks:

  • Prediction: Finding the expected value of the response variable for given values of the predictor variable(s).
  • Inference: Drawing conclusions about the relationship between the response and predictor variables.

Suppose we fit a simple linear regression line to summarize the relationship between critics_score and audience_score for movies.

  • What is an example of a prediction question that can be answered using a simple linear regression model?
  • What is an example of an inference question that can be answered using a simple linear regression model?

4.3.1 Statistical (theoretical) model

We expand on concepts introduced earlier for the simple linear regression model. Suppose there is a response variable $Y$ and a predictor variable $X$. The values of the response variable $Y$ can be generated in the following way:

$$Y = \text{Model} + \text{Error} \tag{4.2}$$

More specifically, we define the model as a function of the predictor $X$, $\text{Model} = f(X)$, and error $\epsilon$, such that

$$Y = f(X) + \epsilon \tag{4.3}$$

The function f(X) that describes the relationship between the response and predictor variables is the regression model. This is the model we will fit in later sections using equations and software. The error, ϵ, is how much the actual value of the response Y deviates from the value produced by the regression model, f(X). There is some randomness in ϵ, because not all observations with the same value of X have the same value of Y. For example, not all movies with a critics score of 70 have the same audience score.

Equation 4.3 is the general form of the equation to generate values of $Y$ given values of $X$. In the context of simple linear regression, the function $f(X)$ in Equation 4.3 is

$$f(X) = \mu_{Y|X} = \beta_0 + \beta_1 X \tag{4.4}$$

where $\mu_{Y|X}$ is the mean value of $Y$ at a particular value of $X$, and $\beta_0$ and $\beta_1$ are the model coefficients. The error terms $\epsilon$ from Equation 4.3 are normally distributed with a mean of 0 and variance $\sigma_\epsilon^2$, represented as $N(0, \sigma_\epsilon^2)$ (more on this later). The specification of the simple linear regression model written in terms of individual observations $(x_i, y_i)$ is

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma_\epsilon^2) \tag{4.5}$$

such that $y_i$ is the response for the $i$th observation, $x_i$ is the predictor for the $i$th observation, and $\epsilon_i$ is the error for the $i$th observation. Equation 4.5 is the statistical model, also called the data-generating model or population-level model. It is the theoretical form of the model that describes exactly how to generate the values of the response $Y$ given values of the predictor in the population. The model coefficients are the intercept $\beta_0$ and the slope $\beta_1$. The variance of the error terms is $\sigma_\epsilon^2$; its square root, $\sigma_\epsilon$, is called the regression standard error. In practice we don’t know the exact values of $\beta_0$, $\beta_1$, and $\sigma_\epsilon^2$, so our goal is to use sample data to estimate these values. We will focus on the coefficients $\beta_0$ and $\beta_1$ in this chapter. We discuss estimating $\sigma_\epsilon^2$ later.
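To make the idea of a data-generating model concrete, here is a minimal simulation sketch in R. The parameter values (`beta0`, `beta1`, `sigma_e`) and the uniform distribution of the predictor are assumptions chosen only for illustration; they are not the population values for the movie data, which are unknown.

```r
# Simulate from the statistical model y_i = beta0 + beta1 * x_i + epsilon_i.
# All parameter values below are hypothetical, chosen only for illustration.
set.seed(210)                     # make the random draws reproducible

n       <- 146                    # same number of observations as the movie data
beta0   <- 30                     # assumed population intercept
beta1   <- 0.5                    # assumed population slope
sigma_e <- 12                     # assumed standard deviation of the errors

x       <- runif(n, min = 0, max = 100)        # predictor values
epsilon <- rnorm(n, mean = 0, sd = sigma_e)    # errors drawn from N(0, sigma_e^2)
y       <- beta0 + beta1 * x + epsilon         # response = model + error

# Observations with similar x still differ in y because of epsilon
head(data.frame(x, epsilon, y))
```

Because the errors are drawn at random, two simulated observations with nearly the same `x` will generally have different values of `y`, which mirrors the point about movies with the same critics score. Note that this simple sketch does not force `y` to stay between 0 and 100.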

In simple linear regression, we use sample data to estimate a model to understand trends in the population.

What is the population in the movie scores analysis? What is the sample? Recall the definition of population and sample in .

4.3.2 Evaluating whether SLR is appropriate

Before doing any more calculations, we need to determine if the simple linear regression model is a reasonable choice to summarize the relationship between the response variable and predictor variable based on what we know about the data and what we’ve observed from the exploratory data analysis. Determining this early on can help prevent going in a wrong analysis direction if a linear regression model is obviously not a good choice for the data.

We can ask the following questions to evaluate whether simple linear regression is appropriate:

  • Will a linear regression model be practically useful? Does quantifying and interpreting the relationship between the variables make sense in this scenario?
  • Is the shape of the relationship reasonably linear?
  • Do the observations in the data represent the population of interest, or are there biases in the data that could limit conclusions drawn from the analysis?

Mathematical equations or statistical software can be used to fit a linear regression model between any two quantitative variables. Therefore it is up to the judgment of the data scientist to determine if it is reasonable to proceed with a linear regression model or if doing so might result in misleading conclusions about the data. If the answer is “no” to any of the questions above, consider if a different analysis technique is better for the data, or proceed with caution if using regression. If we proceed with regression, we should be transparent about the limitations of the conclusions.

From Section 4.1, the goal of this analysis is to understand the relationship between the critics scores and audience scores for movies on Rotten Tomatoes. Therefore, there is a practical use for fitting the regression model. We observed from Figure 4.2 that the relationship between the two variables is approximately linear, so it could reasonably be summarized by a model of the form in Equation 4.1. Lastly, the data set includes all movies in 2014 and 2015 that have a sufficient number of ratings on popular movie ratings websites, so we can reasonably conclude the sample is representative of the population of movies on Rotten Tomatoes. Therefore, we are comfortable drawing conclusions about the population based on the analysis of our sample data.

The form of the simple linear regression model for the movie scores data is

$$\text{audience\_score} = \beta_0 + \beta_1 \times \text{critics\_score} + \epsilon, \qquad \epsilon \sim N(0, \sigma_\epsilon^2) \tag{4.6}$$

Now that we have the form of the model, let’s discuss how to estimate and interpret the model coefficients, the slope $\beta_1$ and the intercept $\beta_0$. We will estimate $\sigma_\epsilon^2$ later.

4.4 Estimating the model coefficients

Ideally, we would have data from the entire population of movies rated on Rotten Tomatoes in order to calculate the exact values for $\beta_0$ and $\beta_1$. In reality we don’t have access to the data from the entire population, but we can use the sample to obtain the estimated regression equation in Equation 4.7,

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i \tag{4.7}$$

where $\hat{\beta}_0$ is the estimated intercept, $\hat{\beta}_1$ is the estimated slope, and $\hat{y}_i$ is the predicted (estimated) response.

Specifically for the movie scores analysis, the estimated regression equation is

$$\widehat{\text{audience\_score}}_i = 32.316 + 0.519 \times \text{critics\_score}_i \tag{4.8}$$

In this equation 32.316 is $\hat{\beta}_0$, the estimated value for the intercept $\beta_0$, 0.519 is $\hat{\beta}_1$, the estimated value for the slope $\beta_1$, and $\widehat{\text{audience\_score}}_i$ is the expected audience score when the critics score is equal to $\text{critics\_score}_i$. Notice that Equation 4.7 and Equation 4.8 do not have an error term, $\epsilon$. The output from the regression equation is $\hat{f}(X) = \hat{\mu}_{Y|X}$, the estimated mean value of the response given a value of the predictor. Therefore, when we discuss the values of the response estimated using simple linear regression, what we are really talking about is what the value of the response variable is expected to be, on average, for a given value of the predictor variable.

From Section 4.3.1, we know that the value of the response is not necessarily the same for all observations with the same value of the predictor. For example, we wouldn’t expect (nor do we observe) the same audience score for every movie with a critics score of 70. We know there are factors other than the critics score that are related to how an audience reacts to a movie. Our analysis, however, only takes into account the critics score, so we do not capture these additional factors in our regression equation, Equation 4.8. This is where the error terms come back in.

Once we have computed the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ for the regression equation, we can calculate how much the predicted values of the response produced by the regression equation differ from the actual values of the response variable observed in the data. This difference is called the residual, denoted $e_i$.

Equation 4.9 shows the equation of the residual for the $i$th observation.

$$e_i = \text{observed}_i - \text{predicted}_i = y_i - \hat{y}_i \tag{4.9}$$

In the case of the movie scores data, the residual is the difference between the actual audience score and the audience score predicted by Equation 4.8. For example, the 2015 movie Avengers: Age of Ultron received a critics score of $x_i = 74$. Therefore, using Equation 4.8, the estimated (predicted) audience score is

$$\hat{y}_i = 32.316 + 0.519 \times 74 = 70.722$$

The observed audience score is 86, so the residual is

$$e_i = y_i - \hat{y}_i = 86 - 70.722 = 15.278$$
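The same arithmetic can be reproduced in a few lines of R. This is a minimal sketch that types in the rounded coefficients from Equation 4.8 by hand; the object names (`b0`, `b1`, `predicted`, `residual`) are illustrative choices, not part of the original analysis.

```r
# Estimated coefficients, as reported (rounded) in Equation 4.8
b0 <- 32.316
b1 <- 0.519

critics_score  <- 74   # Avengers: Age of Ultron, critics score
audience_score <- 86   # observed audience score

predicted <- b0 + b1 * critics_score      # predicted audience score
residual  <- audience_score - predicted   # observed minus predicted

round(c(predicted = predicted, residual = residual), 3)
```

The residual is positive, so the audience rated this movie more favorably than the model predicted.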

Would you rather see a movie that has a positive or negative residual? Explain your response.

4.4.1 Least squares regression

There are many possible regression lines (infinitely many, in fact) that we could use to summarize the relationship between critics_score and audience_score. We see some of the potential lines represented in Figure 4.3. So how did we determine that the line that “best” fits the data is the one described by Equation 4.8? We’ll use the residuals to help us answer this question.

Figure 4.3: Potential regression lines for the relationship between critics_score and audience_score

The residuals, represented by the vertical dotted lines in Figure 4.4, are a measure of the “error”, the difference between the observed value of the response and the value predicted from a regression model. The line that “best” fits the data is the one that generally results in the smallest overall error. One way to find the line with the smallest overall error is to add up all the residuals for each possible line in Figure 4.3 and choose the one that has the smallest sum. Notice, however, that for lines that seem to closely align with the trend of the data, there is an approximately equal distribution of points above and below the line. Thus, as we compare lines that fit the data fairly closely, we’d expect the positive and negative residuals to cancel and add up to a value very close to zero for each of them. This would make it difficult, then, to determine a best fit line.

Figure 4.4: Regression line of the relationship between critics_score and audience_score with residuals

Instead of using the sum of the residuals, we will consider the sum of the squared residuals in Equation 4.10,

$$\sum_{i=1}^{n} e_i^2 = e_1^2 + e_2^2 + \dots + e_n^2 \tag{4.10}$$

where $n$ is the number of observations in the data. The line that “best” fits the data, then, is the line that minimizes Equation 4.10. This is called the least squares regression model.
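To see why the sum of residuals is a poor criterion while the sum of squared residuals works, consider the sketch below. It uses a small hypothetical data set of five points (not the movie data) and compares two candidate lines: a flat line at the mean of the response and the least squares line returned by `lm()`. The helper functions `sum_resid()` and `sum_sq_resid()` are illustrative names introduced here.

```r
# Hypothetical toy data (not the movie data)
x <- c(10, 30, 50, 70, 90)
y <- c(35, 52, 55, 70, 78)

# Sum of residuals and sum of squared residuals for the line b0 + b1 * x
sum_resid    <- function(b0, b1) sum(y - (b0 + b1 * x))
sum_sq_resid <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

# Candidate 1: a flat line at the mean of y (intercept = mean(y), slope = 0)
# Candidate 2: the least squares line from lm()
ls_coef <- coef(lm(y ~ x))

sum_resid(mean(y), 0)                  # essentially 0
sum_resid(ls_coef[1], ls_coef[2])      # also essentially 0

sum_sq_resid(mean(y), 0)               # 1118
sum_sq_resid(ls_coef[1], ls_coef[2])   # 36.4
```

Both lines have residuals that sum to (essentially) zero, so that criterion cannot tell them apart. The sum of squared residuals clearly favors the least squares line.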

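Let’s expand Equation 4.10. Recall that $e_i$, the residual of the $i$th observation, is $y_i - \hat{y}_i$, where $\hat{y}_i$ is the estimated response. Then,

$$e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \tag{4.11}$$

Thus, substituting Equation 4.11 into Equation 4.10, we have

$$
\begin{aligned}
\sum_{i=1}^{n} e_i^2 &= e_1^2 + e_2^2 + \dots + e_n^2 \\
&= [y_1 - (\hat{\beta}_0 + \hat{\beta}_1 x_1)]^2 + [y_2 - (\hat{\beta}_0 + \hat{\beta}_1 x_2)]^2 + \dots + [y_n - (\hat{\beta}_0 + \hat{\beta}_1 x_n)]^2
\end{aligned} \tag{4.12}
$$

Using calculus, the $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize Equation 4.12 are

$$\hat{\beta}_1 = r \frac{s_Y}{s_X}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \tag{4.13}$$

where $\bar{x}$ and $\bar{y}$ are the mean values of the predictor and response variables, respectively, $s_X$ and $s_Y$ are the standard deviations of the predictor and response variables, respectively, and $r$ is the correlation between the response and predictor variables. See for the full details of the derivation from Equation 4.12 to Equation 4.13.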
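As an outline of the calculus step (a sketch of the standard argument, not the full derivation), set the partial derivatives of the sum of squared residuals with respect to $\hat{\beta}_0$ and $\hat{\beta}_1$ equal to zero and solve:

```latex
\begin{aligned}
\frac{\partial}{\partial \hat{\beta}_0} \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2
  &= -2 \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0
  \quad\Rightarrow\quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \\[6pt]
\frac{\partial}{\partial \hat{\beta}_1} \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2
  &= -2 \sum_{i=1}^{n} x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0
  \quad\Rightarrow\quad \hat{\beta}_1
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
  = r \frac{s_Y}{s_X}
\end{aligned}
```

The last equality uses the definitions $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(n-1) s_X s_Y}$ and $s_X^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$. The first condition also says that the residuals sum to zero.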

Equation 4.14 shows the calculations of the slope and intercept for the movie scores model based on the summary statistics in Table 4.1 and the correlation of 0.78. Note that the small differences in the values compared to Equation 4.8 are due to rounding (versus coefficients computed by software).

$$\hat{\beta}_1 = 0.78 \times \frac{20.0}{30.2} = 0.517, \qquad \hat{\beta}_0 = 63.9 - 0.517 \times 60.8 = 32.466 \tag{4.14}$$
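The calculation from summary statistics can be sketched in R as follows. The inputs are the rounded values from Table 4.1 and the rounded correlation of 0.78; the object names are illustrative. The sketch computes the intercept twice, once with the slope rounded to three decimals (as in the hand calculation) and once with the unrounded slope, to show how sensitive the intercept is to rounding.

```r
# Rounded summary statistics from Table 4.1 and the bivariate EDA
r     <- 0.78   # correlation between critics_score and audience_score
s_x   <- 30.2   # SD of critics_score
s_y   <- 20.0   # SD of audience_score
x_bar <- 60.8   # mean of critics_score
y_bar <- 63.9   # mean of audience_score

b1         <- r * s_y / s_x        # slope, unrounded
b1_rounded <- round(b1, 3)         # slope rounded to 3 decimals: 0.517

b0_from_rounded   <- y_bar - b1_rounded * x_bar   # about 32.47
b0_from_unrounded <- y_bar - b1 * x_bar           # about 32.49

c(slope = b1_rounded,
  intercept_rounded_slope = b0_from_rounded,
  intercept_unrounded_slope = b0_from_unrounded)
```

Both intercepts differ slightly from the software value of 32.316 because the inputs themselves are rounded summary statistics.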

Below are a few properties of least-squares regression models.

  • The regression line goes through the center of mass point, the coordinates corresponding to average $X$ and average $Y$, $(\bar{x}, \bar{y})$: $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

  • The slope has the same sign as the correlation coefficient: $\hat{\beta}_1 = r \frac{s_Y}{s_X}$

  • The sum of the residuals is zero: $\sum_{i=1}^{n} e_i = 0$

  • The residuals and values of the predictor are uncorrelated
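These properties can be checked numerically. The sketch below fits a least squares line to a small hypothetical data set (not the movie data) with `lm()` and verifies each property up to floating-point error.

```r
# Hypothetical toy data (not the movie data)
x <- c(10, 30, 50, 70, 90)
y <- c(35, 52, 55, 70, 78)

fit <- lm(y ~ x)
b   <- unname(coef(fit))   # b[1] = intercept, b[2] = slope
e   <- residuals(fit)

# 1. The line passes through (x-bar, y-bar)
passes_center <- abs((b[1] + b[2] * mean(x)) - mean(y)) < 1e-8

# 2. The slope has the same sign as the correlation
same_sign <- sign(b[2]) == sign(cor(x, y))

# 3. The residuals sum to zero
resid_sum_zero <- abs(sum(e)) < 1e-8

# 4. The residuals are uncorrelated with the predictor
resid_uncorrelated <- abs(cor(e, x)) < 1e-8

c(passes_center, same_sign, resid_sum_zero, resid_uncorrelated)
```

All four checks return `TRUE` for this data set. The comparisons use a small tolerance rather than exact equality because the computed values are only zero up to rounding error.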

4.5 Interpreting the model coefficients

The slope $\hat{\beta}_1$ is the estimated change in the response for each unit increase in the predictor variable. What do we mean by “estimated change”? Recall that the output from the regression equation is $\hat{\mu}_{Y|X}$, the estimated mean of the response $Y$ for a given value of the predictor $X$. Thus, the slope, or the “steepness” of the regression line, is a measure of how much the response variable is expected to change, on average, for each unit increase of the predictor.

It is good practice to write the interpretation of the slope in the context of the data, so that it can be more easily understood by others reading the analysis results. “In the context of the data” means that the interpretation includes

  • meaningful descriptions of the variables, if the variable names would be unclear to an outside reader
  • units for each variable
  • an indication of the population for which the model applies.

The slope of 0.519 in Equation 4.8 is interpreted as the following:

For each additional point in the critics score, the audience score for movies on Rotten Tomatoes is expected to increase by 0.519 points, on average.

The intercept is the estimated value of the response variable when the predictor variable equals zero ($x_i = 0$). On a scatterplot of the response and predictor variable, this is the point where the regression line crosses the y-axis. Similar to the slope, the “estimated value” is more specifically the estimated mean value of the response variable when the predictor equals 0 ($\hat{\mu}_{Y|X=0}$).

The intercept of 32.316 in Equation 4.8 is interpreted as the following:

The expected audience score for movies on Rotten Tomatoes with a critics score of 0 is 32.316 points.

We always need to estimate the intercept in the regression equation to get the line that best fits the data using least squares regression. The intercept, however, does not always have a meaningful interpretation. We ask the following questions to determine if the intercept has a meaningful interpretation:

  1. Is it plausible for the predictor variable to take values at or near zero?

  2. Are there observations in the data with values of the predictor at or near zero?

If the answer to either question is no, then it is not meaningful, and potentially misleading, to interpret the intercept.

Is the interpretation of the intercept in Equation 4.8 meaningful? Briefly explain.

Avoid using causal language and making declarative statements (e.g., “The audience score for a movie with a critics score of 0 points will be 32.316 points.”) when interpreting the slope and intercept. Remember the slope and intercept are estimates describing what the relationship between the response and predictor is expected to be, based on the sample data and linear regression model. They do not tell us exactly what will happen in the data.



There is an area of statistics called causal inference that includes methods for making causal statements from observational (non-experimental) data. See for a brief introduction to causal inference.

4.6 Prediction

One of the primary purposes of a regression model is prediction. When a regression model is used for prediction, the estimated value of the response variable is computed based on a given value of the predictor. We’ve seen this in earlier sections when calculating the residuals. Let’s take a look at the model predictions for two movies released in 2023.

The movie Barbie was released in theaters on July 21, 2023. This movie was widely praised by critics, and it had a critics score of 88 at the time the data were obtained. Based on Equation 4.8, the predicted audience score is

$$\widehat{\text{audience\_score}} = 32.316 + 0.519 \times 88 = 77.988$$

From the snapshot of the Barbie Rotten Tomatoes page (Figure 4.5), we see the actual audience score is 83. Therefore, the model underpredicted the audience score by about 5 points (83 - 77.988). Perhaps this isn’t surprising given this film’s massive box office success!

Figure 4.5: Source: https://www.rottentomatoes.com/m/barbie (accessed August 29, 2023)

The movie Asteroid City was released in theaters on June 23, 2023. The critics score for this movie was 75.

  • What is the predicted audience score?

  • The actual audience score is 62. Did the model over or under predict? What is the residual?

The regression model is most reliable when predicting the response for values of the predictor within the range of the sample data used to fit the regression model. Using the model to predict for values far outside this range is called extrapolation. The sample data provide information about the relationship between the response and predictor variables for values within the range of the predictor in the data. We can not safely assume that the linear relationship quantified by our model is the same for values of the predictor far outside of this range. Therefore, extrapolation often results in unreliable predictions that could be misleading if the linear relationship does not hold outside the range of the sample data.

Only use the regression model to compute predictions for values of the predictor that are within (or very close to) the range of values in the sample data used to fit the model. Extrapolation, using a model to compute predictions for values of the predictor far outside the range in the data, can result in unreliable predictions.
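One way to guard against accidental extrapolation is to check new predictor values against the range of the sample data before predicting. The sketch below is a hypothetical helper, `predict_audience()`, written for illustration only. It hard-codes the rounded coefficients from Equation 4.8 and the observed range of critics_score from Table 4.1 (5 to 100) as default arguments.

```r
# Hypothetical helper: predict the audience score and warn on extrapolation.
# Defaults are the rounded coefficients from Equation 4.8 and the observed
# range of critics_score in Table 4.1.
predict_audience <- function(critics_score, b0 = 32.316, b1 = 0.519,
                             x_min = 5, x_max = 100) {
  outside <- critics_score < x_min | critics_score > x_max
  if (any(outside)) {
    warning("Some critics scores fall outside the range of the sample data; ",
            "those predictions are extrapolations.")
  }
  b0 + b1 * critics_score
}

predict_audience(c(88, 75))   # both values are within the observed range
```

Calling `predict_audience(2)` still returns a number, but it also issues a warning, since no movie in the sample has a critics score below 5.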

4.7 Model evaluation

We have shown how a simple linear regression model can be used to describe the relationship between a response and predictor variable and to predict new values of the response. Now we will look at two statistics that will help us evaluate how well the model fits the data and how well it explains variability in the response.

4.7.1 Root Mean Square Error

The Root Mean Square Error (RMSE), shown in Equation 4.15, is a measure of the average difference between the observed and predicted values of the response variable.

$$RMSE = \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n}} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}} \tag{4.15}$$

This measure is especially useful if prediction is the primary modeling objective. The RMSE takes values from 0 to $\infty$ (infinity) and has the same units as the response variable.
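The formula can be applied step by step in R. The sketch below uses a small hypothetical data set (not the movie data) so the intermediate values are easy to follow; the object names are illustrative.

```r
# Hypothetical toy data (not the movie data)
x <- c(10, 30, 50, 70, 90)
y <- c(35, 52, 55, 70, 78)

fit   <- lm(y ~ x)
y_hat <- fitted(fit)        # predicted values
e     <- y - y_hat          # residuals: observed minus predicted

rmse <- sqrt(sum(e^2) / length(e))   # square, average, then take the root
rmse                                 # about 2.70, in the units of y

# RMSE relative to the range of the response
rmse / (max(y) - min(y))
```

Expressing the RMSE as a fraction of the range of the response is one simple way to judge whether the typical prediction error is large or small for the data at hand.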

Do higher or lower values of RMSE indicate a better model fit?

There is no universal threshold of RMSE to determine whether the model is a good fit. In fact, the RMSE is often most useful when comparing the performance of multiple models. Take the following into account when using RMSE to evaluate model fit.

  1. What is the range (max − min) of the response variable in the data? How does the RMSE compare to the range? For example, RMSE = 10 indicates very good model performance if the response variable ranges from 10000 to 20000, but very poor model performance if the response variable ranges from 0 to 20.
  2. What is a reasonable error threshold based on the subject matter and analysis objectives? We may be willing to use a model with higher RMSE for a low-stakes analysis objective (for example, the model is used to inform the choices of movie-goers) than a high-stakes objective (the model is used to inform how a movie studio’s multi-million dollar marketing budget will be allocated).

The RMSE for the movie scores model is 12.452. The range for the audience score is 20 to 94. What is your evaluation of the model fit based on RMSE? Explain your response.

4.7.2 Analysis of variance and $R^2$

The coefficient of determination, $R^2$, is the percentage of variability in the response variable that is explained by the predictor variable. In terms of the movie scores data, it is the percentage of variability in the audience score that is accounted for by changes in the critics score. Before talking more about how $R^2$ is used for model evaluation, let’s discuss how this percentage is calculated.

There is variability in the response variable, as we see in the exploratory data analysis in Figure 4.1 and Table 4.1. Analysis of Variance (ANOVA), shown in Equation 4.16, is the process of partitioning the various sources of variability.

$$\text{Total variability} = \text{Explained variability} + \text{Unexplained variability} \tag{4.16}$$

From Equation 4.16, the variability in the response variable is from two sources:

  1. Explained variability (Model): This is the variability in the response variable that can be explained by the model. In the case of simple linear regression, it is the variability in the response variable that can be explained by the predictor variable. In the movie scores analysis, this is the variability in audience_score that is explained by the critics_score.

  2. Unexplained variability (Residuals): This is the variability in the response variable that is left unexplained after the model is fit. This can be understood by assessing the variability in the residuals. In the movie scores analysis, this is the variability due to the factors other than critics score.

The variability in the response variable and the contribution from each source is quantified using sums of squares. In general, the sum of squares (SS) is a measure of how far the observations are from a given point, for example the mean. Using sums of squares, we can quantify the components of Equation 4.16.

Let SST = Sum of Squares Total, SSM = Sum of Squares Model, and SSR = Sum of Squares Residuals. Then,

$$
\begin{aligned}
SST &= SSM + SSR \\
\sum_{i=1}^{n} (y_i - \bar{y})^2 &= \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\end{aligned} \tag{4.17}
$$

Sum of Squares Total (SST) $= \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total variability, an overall measure of how far the observed values of the response variable are from the mean value of the response $\bar{y}$. The formula for SST may look familiar, as it is $(n-1)s_y^2$, that is, $(n-1)$ times the variance of $y$. SST can be partitioned into two pieces, Sum of Squares Model (SSM) and Sum of Squares Residuals (SSR).

Sum of Squares Model (SSM) $= \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$ is the explained variability, an overall measure of how much the predicted value of the response variable (the expected mean value of the response given the predictor) differs from the overall mean value of the response. This indicates how much the observed response’s deviation from the mean is accounted for by knowing the value of the predictor.

Lastly, the Sum of Squares Residuals (SSR) $= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the unexplained variability, an overall measure of how much the observed values of the response differ from the predicted values. This is the same sum of squared residuals used to estimate the least-squares regression model in Equation 4.10.

We use the sums of squares to calculate the coefficient of determination $R^2$.

$$R^2 = \frac{SSM}{SST} = 1 - \frac{SSR}{SST} \tag{4.18}$$

Equation 4.18 shows that $R^2$ is the proportion of variability in the response (SST) that is explained by the model (SSM). Note that $R^2$ is calculated as a proportion between 0 and 1, but is reported as a percentage between 0% and 100%.
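The partition of variability and the two equivalent forms of $R^2$ can be verified numerically. The sketch below uses a small hypothetical data set (not the movie data) and computes each sum of squares directly from its definition; the object names are illustrative.

```r
# Hypothetical toy data (not the movie data)
x <- c(10, 30, 50, 70, 90)
y <- c(35, 52, 55, 70, 78)

fit   <- lm(y ~ x)
y_hat <- fitted(fit)

sst <- sum((y - mean(y))^2)       # total variability: 1118
ssm <- sum((y_hat - mean(y))^2)   # explained variability: 1081.6
ssr <- sum((y - y_hat)^2)         # unexplained variability: 36.4

all.equal(sst, ssm + ssr)         # the partition SST = SSM + SSR holds

r2 <- ssm / sst                   # about 0.967
all.equal(r2, 1 - ssr / sst)              # both forms of R^2 agree
all.equal(r2, summary(fit)$r.squared)     # matches the value reported by R
all.equal(sst, (length(y) - 1) * var(y))  # SST = (n - 1) * variance of y
```

Each `all.equal()` call returns `TRUE` here, confirming the identities for this data set.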

The $R^2$ for the model in Equation 4.8 is 0.611. It is interpreted as the following:

About 61.1% of the variability in the audience score for movies on Rotten Tomatoes can be explained by the model (critics score).
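In simple linear regression specifically, $R^2$ equals the square of the correlation between the response and the predictor. As a quick consistency check using the rounded correlation of 0.78 from the exploratory data analysis:

```latex
R^2 = r^2 \approx 0.78^2 = 0.6084
```

The small gap between 0.6084 and 0.611 comes from rounding the correlation to two decimal places.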

Do higher or lower values of $R^2$ indicate a better model fit?

Similar to RMSE, there is no universal threshold for what makes a “good” $R^2$ value. When using $R^2$ to determine if the model is a good fit, take into account what might be reasonable to expect given the subject matter.

4.8 Simple linear regression in R

4.8.1 Fitting the least-squares model

We fit linear regression models using the lm function, which is part of the stats package () built into R. We then use the tidy function from the broom package () to display the results in a tidy data format (). The code to find the linear regression model using the movie_scores data with audience_score as the response and critics_score as the predictor () is below.

lm(audience_score ~ critics_score, data = movie_scores)

Call:
lm(formula = audience_score ~ critics_score, data = movie_scores)

Coefficients:
  (Intercept)  critics_score  
       32.316          0.519  

Next, we want to display the model results in a tidy format. We build upon the code above by saving the model in an object called movie_fit and displaying the object. We will also use movie_fit to calculate predictions.

library(broom)  # provides tidy(), glance(), and augment()

movie_fit <- lm(audience_score ~ critics_score, data = movie_scores)

tidy(movie_fit) 
# A tibble: 2 × 5
  term          estimate std.error statistic  p.value
  <chr>            <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     32.3      2.34        13.8 4.03e-28
2 critics_score    0.519    0.0345      15.0 2.70e-31

Notice the resulting model is the same as Equation 4.8, which we calculated based on Equation 4.13. We will discuss the other columns in the output later.

We can also use kable() from the knitr package () to display the tidy results in a neatly formatted table and control the number of digits in the output.

library(knitr)  # provides kable()

tidy(movie_fit) |>
  kable(digits = 3)
term           estimate  std.error  statistic  p.value
(Intercept)      32.316      2.343       13.8        0
critics_score     0.519      0.035       15.0        0

4.8.2 Prediction

Below is the code to predict the audience score for Barbie as shown earlier in Section 4.6. We create a tibble that contains the critics score for Barbie, then use predict() and the model object to compute the prediction.

library(tibble)  # provides tibble()

barbie_movie <- tibble(critics_score = 88)
predict(movie_fit, barbie_movie)
 1 
78 

We can also produce predictions for multiple movies by putting multiple values of the predictor in the tibble. In the code below we produce predictions for Barbie and Asteroid City. We begin by storing the critics scores for both movies in a tibble. Then we use predict(), as before.

new_movies <- tibble(critics_score = c(88, 75)) 
predict(movie_fit, new_movies) 
   1    2 
78.0 71.2 

4.8.3 $R^2$ and RMSE

The glance() function in the broom package produces model summary statistics, including $R^2$.

glance(movie_fit)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.611         0.608  12.5      226. 2.70e-31     1  -575. 1157. 1166.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

The code below returns only $R^2$ from the output of glance().

glance(movie_fit)$r.squared
[1] 0.611

RMSE is computed using the rmse() function from the yardstick package (). First, we use augment() from the broom package to compute the predicted value for each observation in the data set. These values are stored in the column .fitted. We may notice that many other columns are produced by augment() as well; these are discussed later. We input the augmented data into rmse().

library(yardstick)  # provides rmse()

movies_augment <- augment(movie_fit)

rmse(movies_augment, truth = audience_score, estimate = .fitted) 
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        12.5
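As a cross-check, RMSE can also be computed directly from the residuals in base R. The sketch below uses a small hypothetical data set (not the movie data) so it runs on its own. It also shows how RMSE relates to the residual standard error returned by `sigma()`, which is the quantity in the `sigma` column of `glance()`: the residual standard error divides the sum of squared residuals by $n - 2$ rather than $n$, so it is slightly larger than the RMSE.

```r
# Hypothetical toy data (not the movie data)
x <- c(10, 30, 50, 70, 90)
y <- c(35, 52, 55, 70, 78)

fit <- lm(y ~ x)
n   <- length(y)

rmse_manual <- sqrt(mean(residuals(fit)^2))   # divides by n
resid_se    <- sigma(fit)                     # divides by n - 2

c(rmse = rmse_manual, sigma = resid_se)

# The two differ only by the denominator
all.equal(rmse_manual, resid_se * sqrt((n - 2) / n))
```

With a large sample the two values are very close, which is why both round to 12.5 for the movie scores model.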

4.9 Summary

In this chapter, we introduced simple linear regression. We showed how to use exploratory data analysis to evaluate whether linear regression is appropriate to model the relationship between two variables. Next, we computed the slope and intercept (the model coefficients) and interpreted these values in the context of the data. We used the model to compute predictions and evaluated the model performance using $R^2$ and RMSE. We finished the chapter by conducting simple linear regression in R.

This chapter has helped set the foundation for all the regression methods presented throughout the remainder of the text. Later, we’ll use the simple linear regression model to draw conclusions about the relationship between the response and predictor variables.


  1. The response variable is audience_score, the audience score. The predictor variable is critics_score, the critics score.↩︎

  2. The distribution of audience_score is unimodal and left-skewed. The median score is 66.5 and the IQR is 31 (81 - 50). We note that the center is higher and there is less variability in the middle 50% of the distribution compared to critics_score.↩︎

  3. Example prediction question: What do we expect the audience score to be for movies with a critics score of 75?
    Example inference question: Is the critics score a useful predictor of the audience score?↩︎

  4. The population is all movies on the Rotten Tomatoes website. The sample is the set of 146 movies in the data set.↩︎

  5. Example answer: I would rather see a movie with a positive residual, because that means the audience actually rated the movie more favorably than what was expected based on the model.↩︎

  6. The interpretation of the intercept is meaningful, because it is plausible for a movie to have a critics score of 0 and there are observations with scores around 5, which is near 0 on the 0 - 100 point scale.↩︎

  7. Source: https://www.rottentomatoes.com/m/barbie Accessed on August 29, 2023.↩︎

  8. Source: https://www.rottentomatoes.com/m/asteroid_city Accessed on August 29, 2023.↩︎

  9. The predicted audience score is 32.316 + 0.519 * 75 = 71.241. The model over predicted. The residual is 62 - 71.241 = -9.241.↩︎

  10. Lower values indicate a better fit, with 0 indicating the predictor variable perfectly predicts the response.↩︎

  11. Example answer: An error of 12.452 is about 17% of the range of the observed audience scores (94 - 20 = 74 points). Because audience scores can only take values from 0 to 100, this error seems relatively large.↩︎

  12. Higher values of $R^2$ indicate a better model fit, as it means more of the variability in the response is being explained by the model.↩︎