5  Inference for simple linear regression

Learning outcomes

  • Describe how statistical inference is used to draw conclusions about a population slope
  • Construct confidence intervals using bootstrap simulation
  • Conduct hypothesis tests using permutation
  • Describe how the Central Limit Theorem is applied to the slope
  • Conduct statistical inference on the slope using mathematical models based on the Central Limit Theorem
  • Interpret results from statistical inference in the context of the data
  • Understand the connection between hypothesis tests and confidence intervals

R Packages
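The original book loads its R packages at this point. The exact list is not reproduced here, so the following is an assumed minimal setup that supports the code sketches shown later in the chapter:

```r
library(tidyverse)  # data wrangling and plotting (ggplot2, dplyr, etc.)
library(broom)      # tidy summaries of fitted models
```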

5.1 Introduction: Access to playgrounds

The Trust for Public Land is a non-profit organization that advocates for equitable access to outdoor spaces in cities across the United States. In the 2021 report Parks and an Equitable Recovery (The Trust for Public Land 2021), the organization declares that “parks are not just a nicety—they are a necessity”. The report details the many health, social, and environmental benefits of having ample access to parks in cities along with the various factors that impede the access to parks for some.

One type of outdoor space the authors study in their report is playgrounds. The report describes playgrounds as one type of outdoor space that “bring children and adults together” (The Trust for Public Land 2021, 13) and a place that was important for distributing “fresh food and prepared meals to those in need, particularly school-aged children” (The Trust for Public Land 2021, 9) during the global COVID-19 pandemic.

Given the impact of playgrounds on both children and adults in a community, we will focus on understanding access to playgrounds in this chapter. In particular, we want to (1) investigate whether spending is helpful in understanding variability in playground access, and if so, (2) quantify the true relationship between spending and playground access.

The data include information on 97 of the most populated cities in the United States in the year 2020. The data were originally collected by the Trust for Public Land and were featured as part of the TidyTuesday weekly data visualization challenge. The analysis in this chapter will focus on two variables:

  • spend: Total amount the city spends per resident in 2020 (in US dollars)

  • playgrounds: Number of playgrounds per 10,000 residents in 2020

Your turn!

Do you expect the relationship between spending and playground access to be positive, negative, or nonexistent? Why?

5.1.1 Exploratory data analysis

Figure 5.1: Univariate exploratory data analysis
Figure 5.2: Bivariate exploratory data analysis

From Figure 5.2 we see a positive relationship between spending per resident and the number of playgrounds per 10,000 residents. The correlation is 0.206, however, so the relationship between the variables is relatively weak.
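As a sketch of how this exploratory analysis might be done in R, the code below produces a scatterplot and the correlation. It assumes the data are stored in a data frame called parks with columns spend and playgrounds; the name parks is our assumption, not taken from the source.

```r
library(ggplot2)

# Scatterplot of playground access vs. spending per resident
ggplot(parks, aes(x = spend, y = playgrounds)) +
  geom_point() +
  labs(
    x = "Spending per resident (US dollars)",
    y = "Playgrounds per 10,000 residents"
  )

# Correlation between the two variables
cor(parks$spend, parks$playgrounds)
```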

To explore this relationship, we fit the model

\[ playgrounds = \beta_0 + \beta_1~spend + \epsilon, \hspace{5mm} \epsilon \sim N(0, \sigma^2_{\epsilon}) \tag{5.1}\]

Table 5.1: Linear regression model between spending per resident and number of playgrounds per 10,000 residents
term estimate std.error statistic p.value
(Intercept) 2.418 0.214 11.282 0.000
spend 0.003 0.002 2.057 0.042

We see from the output in Table 5.1 that the regression equation for the relationship between spending and playgrounds per 10,000 residents is

\[ \hat{\text{playgrounds}} = 2.418 + 0.003 \times \text{spend} \tag{5.2}\]
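A table like Table 5.1 could be produced with lm() and broom::tidy(); the sketch below again assumes the hypothetical parks data frame and is not necessarily the code used by the authors.

```r
library(broom)

# Fit the simple linear regression in Equation 5.1
playground_fit <- lm(playgrounds ~ spend, data = parks)

# Estimates, standard errors, test statistics, and p-values (cf. Table 5.1)
tidy(playground_fit)
```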

Your turn!

Interpret the slope and intercept in the context of the data.

From our sample of 97 cities in 2020, we have an estimated slope of 0.0032961. This estimated slope is likely close to, but not exactly equal to, the value of the true population slope. Based on the equation alone, we also cannot tell whether this slope indicates a meaningful relationship between the two variables or whether this value occurred by random chance. Therefore, we will use statistical inference methods to begin to answer these questions.

5.2 Objectives of statistical inference

In Table 5.1 we see the output of the regression model in which spend is used to explain variability in the playgrounds per 10,000 residents of cities. For example, based on this model, for each additional dollar in spending per resident, we expect the number of playgrounds per 10,000 residents to increase by 0.003, on average.

The estimate 0.003 is our “best guess” of the relationship between spending per resident and the number of playgrounds per 10,000 residents; however, it is likely not the exact value of the relationship in the population of all cities. Therefore, we will use statistical inference, in which we draw conclusions about population parameters based on analysis of the sample data. There are two types of statistical inference procedures:

  • Hypothesis tests: A test of a specific claim about the population parameter

  • Confidence intervals: A plausible range of values the population parameter can take

In this chapter we will discuss how to conduct each of these inferential procedures, what conclusions can be drawn from each, and how they are related to one another.

As we’ll see throughout the chapter, a key part of statistical inference is quantifying the sampling variability, the sample-to-sample variability in the statistic that is the “best guess” estimate for the parameter. For example, when we conduct statistical inference on the slope of spending per resident \(\beta_1\), we need to quantify the sampling variability of the statistic \(\hat{\beta}_1\). In other words, we need to quantify the amount of variability in \(\hat{\beta}_1\) that is expected if we repeatedly (1) took samples of size 97 and (2) used the samples to fit a model using the spending per resident to predict playgrounds per 10,000 residents. Thus, there are two approaches to statistical inference that are distinguished by the way the sampling variability is quantified.

  • Simulation-based methods: Quantifying the sampling variability by generating a sampling distribution from the sample data

  • Central Limit Theorem-based methods: Quantifying the sampling variability using mathematical models based on the Central Limit Theorem

We will describe how to conduct hypothesis testing and construct confidence intervals using each approach.

We can use both approaches for hypothesis testing and confidence intervals. Before we get into those details, however, let’s introduce more of the foundational ideas underlying simple linear regression as they relate to statistical inference.

Statistical inference

The goal of statistical inference is to use sample data to draw conclusions about a population.

5.3 Foundations for simple linear regression

In Section 4.3.1, we introduced the statistical model for simple linear regression \[ Y = \beta_0 + \beta_1 X + \epsilon \hspace{8mm} \epsilon \sim N(0, \sigma^2_{\epsilon}) \tag{5.3}\]

where \(Y\) is the response variable, \(X\) is the predictor variable, and \(\epsilon\) is the error term. Equation 5.3 can be rewritten in terms of the distribution of the response variable \(Y\) given the predictor \(X\)

\[Y|X \sim N(\beta_0 + \beta_1X , \sigma^2_\epsilon) \tag{5.4}\]

Equation 5.4 is the assumed distribution of the response variable conditional on the predictor variable under simple linear regression. From this equation we can specify the assumptions that are made when we conduct simple linear regression.

  1. The distribution of the response \(Y\) is Normal for a given value of the predictor \(X\).

  2. The mean of the conditional distribution of \(Y\) given \(X\) is determined using the equation of the line \(\beta_0 + \beta_1 X\), thus indicating a linear relationship between the response and predictor variables.

  3. The variance of the conditional distribution of \(Y\) given \(X\) is \(\sigma^2_{\epsilon}\). This variance does not depend on \(X\) and thus is equal for all values of \(X\).

  4. The error terms \(\epsilon\) are independent of one another. This implies the observations are also independent.

Whenever we fit linear regression models and conduct inference on the slope, we do so under the assumption that some or all of these four statements hold. In the next chapter, we will discuss how to check whether these assumptions hold for a set of data. As we might expect, these assumptions do not always hold perfectly in practice, so we will also discuss circumstances in which each assumption is necessary versus when some can be relaxed. For the remainder of this chapter, however, we will proceed as if all of these assumptions hold.

5.4 Simulation-based inference

In simulation-based inference, we use simulation with the sample data to approximate the sample-to-sample variability in the estimated slope \(\hat{\beta}_1\), rather than relying on mathematical results. We will use two simulation methods: bootstrapping to construct confidence intervals (Section 5.5) and permutation sampling to conduct hypothesis tests (Section 5.7).

5.5 Bootstrap confidence intervals

A confidence interval is a plausible range of values a population parameter (\(\beta_1\) for our analysis) takes; it is determined by the sample data and statistical methods. By calculating this range, we are more likely to capture the true value of the population parameter than if we merely rely on a single estimated value (called a point estimate).

In order to obtain this range of values, we must first get an understanding of the sampling variability of the statistic - the variability if we were to repeatedly take samples of size \(n\) (the same size as the sample data) and fit regression models to estimate \(\hat{\beta}_1\). In practice, it is often not feasible to collect and analyze new sample data repeatedly, so we will use our sample data to generate these new samples. We generate these samples using bootstrapping, a simulation process in which we sample with replacement such that each bootstrapped sample is of size \(n\), the size of the original sample data.

Your turn!

Why do we sample with replacement when doing bootstrapping? What would happen if we sampled without replacement?1

5.5.1 Constructing a bootstrap confidence interval for \(\beta_1\)

We use the following steps to construct a confidence interval for \(\beta_1\) based on the bootstrap method described in the previous section:

  1. Generate \(n_{iter}\) bootstrap samples, where \(n_{iter}\) is the number of iterations. We typically want to use at least 1000 iterations in order to construct a sampling distribution that is close to the theoretical distribution defined in Section 5.9.
  2. Fit the linear model to each of the \(n_{iter}\) bootstrap samples to obtain \(n_{iter}\) values of \(\hat{\beta}_1\), the estimated slope. There will also be \(n_{iter}\) estimates of \(\beta_0\), but we will ignore those for now since we are not focusing on inference for the intercept.
  3. Collect the \(n_{iter}\) values \(\hat{\beta}_1\) from the previous step to obtain the bootstrapped sampling distribution. It is an approximation of the sampling distribution of \(\hat{\beta}_1\), and thus can be used to understand the expected sample-to-sample variability in \(\hat{\beta}_1\).
  4. Use the distribution from the previous step to calculate the \(C\%\) confidence interval. The lower and upper bounds are calculated as the points on the distribution that mark the middle \(C\%\) of the distribution.

Let’s demonstrate these four steps by calculating a 95% confidence interval for \(\beta_1\), the slope for the total spending per resident in Equation 5.1.

  1. We generate 1000 bootstrap samples by sampling with replacement from our sample data of 97 observations. The first 10 rows of the first bootstrapped sample are shown in Table 5.2.
Table 5.2: First 10 rows of the first bootstrapped sample. The replicate column identifies which bootstrap sample.
replicate playgrounds spend
1 2.1 320
1 1.8 65
1 2.2 67
1 1.0 33
1 2.6 42
1 2.2 149
1 3.3 73
1 2.2 35
1 1.8 89
1 1.3 65
Your turn!

How many observations are in each bootstrap sample?2

  2. Next, we fit a linear model of the form in Equation 5.1 to each of the 1000 bootstrap samples. The estimated coefficients for the first two bootstrap samples are shown in Table 5.3.

    Table 5.3: Estimated slope and intercept for the first two bootstrap samples.
    replicate term estimate
    1 intercept 2.3825970
    1 spend 0.0049071
    2 intercept 2.5451734
    2 spend 0.0023708
  3. We are focused on inference for the slope of spend, so we collect the estimated slopes of spend and construct the bootstrap distribution. This is an approximation of the sampling distribution of \(\hat{\beta}_1\). A histogram and summary statistics for this distribution are shown in Figure 5.3 and Table 5.4, respectively.

    Figure 5.3: Bootstrap distribution of spend
    Table 5.4: Summary statistics for bootstrap distribution of spend
    Min Q1 Median Q3 Max Mean Std.Dev.
    -0.001 0.002 0.003 0.002 0.01 0.004 0.002
Your turn!

How many values of \(\hat{\beta}_1\) make up the bootstrap sampling distribution shown in Figure 5.3 and summarized in Table 5.4 ?3

  4. As the final step, we use the bootstrap distribution to calculate the lower and upper bounds of the 95% confidence interval. These bounds are calculated as the points that mark off the middle 95% of the distribution, that is, the points at the \(2.5^{th}\) and \(97.5^{th}\) percentiles, as shown by the vertical lines in Figure 5.4.
Figure 5.4: Bootstrap confidence interval for spend
Table 5.5: 95% confidence interval for the slope of spend
Lower bound Upper bound
0.001 0.007

The 95% bootstrapped confidence interval for \(\beta_1\), the slope of spend, is 0.001 to 0.007.
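For reference, the four bootstrap steps can be carried out with a few lines of base R. This is a minimal sketch under the same assumption of a parks data frame; because the resampling is random, the bounds will differ slightly from run to run and from the values reported above.

```r
set.seed(1234)                 # for reproducibility (seed chosen arbitrarily)
n_iter <- 1000                 # number of bootstrap iterations
n <- nrow(parks)               # 97 cities
boot_slopes <- numeric(n_iter)

for (i in seq_len(n_iter)) {
  # Step 1: sample n rows with replacement to form one bootstrap sample
  boot_sample <- parks[sample(n, size = n, replace = TRUE), ]
  # Step 2: refit the model and keep the estimated slope of spend
  boot_slopes[i] <- coef(lm(playgrounds ~ spend, data = boot_sample))["spend"]
}

# Steps 3-4: the collected slopes form the bootstrap distribution;
# the middle 95% of that distribution gives the confidence interval
quantile(boot_slopes, probs = c(0.025, 0.975))
```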

Your turn!

The points at which percentiles of the bootstrap distribution mark the lower and upper bounds for a

  • 90% confidence interval?4

  • 98% confidence interval?5

5.5.2 Interpreting the interval

The basic interpretation of the 95% confidence interval for \(\beta_1\), the slope of spend, is

We are 95% confident that the interval 0.001 to 0.007 contains the population coefficient for spend in the model of the relationship between spending per resident and number of playgrounds per 10,000 residents.

Though this interpretation provides information about the range of plausible values for the slope of spend, it still requires the reader to recall what that means about the relationship between spending per resident and playgrounds per 10,000 residents in Equation 5.1. It is more informative, then, to interpret the confidence interval in a way that also utilizes the basic interpretation of the slope, so it is clear to the reader exactly what the confidence interval means. Thus, a more complete and informative interpretation of the confidence interval is as follows:

We are 95% confident that for each additional dollar a city spends per resident, the number of playgrounds per 10,000 residents is greater by 0.001 to 0.007, on average.

This interpretation not only indicates the range of values we estimate the population coefficient takes but also clearly describes what this range means in terms of the variability in playgrounds per 10,000 residents as spending per resident increases.

5.5.3 What does confidence mean?

The beginning of the interpretation for a confidence interval is “We are C% confident…”. What do we mean by “C% confident”? The “confidence” referenced here is in the statistical method we used, in this case the bootstrap confidence interval. This means if we were to replicate our process - obtain a sample of 97 cities, construct a bootstrap distribution for \(\hat{\beta}_1\), calculate the bounds that mark the middle C% of the distribution - many, many times, the intervals defined by the upper and lower bounds would include the population slope \(\beta_1\) C% of the time.

In reality we don’t know the value of the population slope (if we did, we wouldn’t need statistical inference!), so we’re not sure if the interval we constructed is one of the C% that actually contains the population slope or not. Though we aren’t certain that our interval contains the population slope, we can conclude with some level of confidence, C% confident to be exact, that we think it does.

With the confidence interval, we produced a plausible range of values for the population slope. We can also test specific claims about the population slope using another inferential procedure called hypothesis testing.

5.6 Hypothesis testing

Hypothesis testing is the process of assessing a statistical claim, or something we think is true, about a population parameter. The claim could be based on previous research, an idea a research or business team wants to assess, or a general statement about the parameter. For now we will focus on conducting hypothesis tests for a slope \(\beta_1\). Just as bootstrapping can be used to calculate a confidence interval for the intercept \(\beta_0\), we can also use the process detailed in this section to test a claim about the intercept; however, this is often not meaningful in practice.

Before digging into the details of the simulation-based hypothesis test, let’s talk about what happens conceptually when we do hypothesis testing. To do so, we’ll use a common analogy for hypothesis testing, the general process of a trial in the United States (U.S.) judicial system, to help illustrate the steps of a hypothesis test.

Define the hypotheses


The first step of any hypothesis test (or trial) is to define the hypotheses to be evaluated. The two hypotheses are called the null and the alternative. The null hypothesis is the baseline condition, typically indicating no relationship, and the alternative hypothesis is defined by the claim being tested. In the U.S. judicial system, a defendant is deemed innocent unless proven otherwise. The null hypothesis in this scenario, then, is that the defendant is innocent, or not guilty. The claim being tested is that the defendant is guilty. This is the alternative hypothesis. In the U.S. judicial system, we say that a person is “innocent until proven guilty beyond a reasonable doubt.” Therefore, the trial (or hypothesis test) proceeds assuming the null hypothesis is true, and the objective is to evaluate the strength of evidence against the null hypothesis.

Evaluate the evidence

The primary component of a hypothesis test (or trial) is the presentation and evaluation of the evidence. In a trial, this is the point when the evidence is presented, and it is up to the jury to evaluate the evidence under the assumption that the null hypothesis, the defendant is innocent, is true. Thus the lens through which the evidence is evaluated is “given the defendant is innocent, how likely is it that this evidence would exist?”

For example, suppose someone is on trial for a jewelry store robbery. The null hypothesis is that they are innocent and did not rob the jewelry store. The alternative hypothesis is they are guilty and did rob the jewelry store. If there is evidence that the person was in a different city during the time of the jewelry store robbery, the evidence would be more in favor of the null hypothesis of innocence. Alternatively, if some of the stolen jewelry was found in the person’s car, the evidence would be less in favor of the null hypothesis and more in favor of the alternative.

In hypothesis testing, the “evidence” being assessed is the analysis of the sample data. Thus the evaluation question being asked is “given the null hypothesis is true, how likely are the results observed in the sample data?” We will introduce approaches to address this question using simulation-based methods in Section 5.7 and methods based on the Central Limit Theorem in Section 5.9.3.

Make a conclusion

There are two usual conclusions in a trial in the U.S. judicial system - the jury concludes that the defendant is guilty or not guilty based on the evidence. Because the criterion in court is that the strength of evidence must be “beyond a reasonable doubt”, that is the threshold the jury uses in order to make their conclusion. If there is sufficiently strong evidence against the null hypothesis of innocence, then they conclude the alternative hypothesis that the defendant is guilty. Otherwise, they conclude that the defendant is not guilty, indicating the evidence against the null was not strong enough to refute it. Note that this is not the same as “accepting” the null hypothesis but rather stating that there wasn’t enough evidence to suggest otherwise.

Similarly in hypothesis testing, we will use statistical criteria to determine if the evidence against the null hypothesis is strong enough to reject this hypothesis and conclude the alternative, or if there is not enough evidence “beyond a reasonable doubt” to draw a conclusion other than the assumed null hypothesis.

5.7 Permutation tests

Now that we understand the general process of hypothesis testing, let’s take a look at hypothesis testing using a simulation-based approach, called a permutation test.

The four steps of a permutation test for a slope coefficient \(\beta_1\) are

  1. State the null and alternative hypotheses.
  2. Generate the null distribution.
  3. Calculate the p-value.
  4. Draw a conclusion.

We’ll discuss each of these steps in detail.

5.7.1 State the hypotheses

As defined in Section 5.6, the null hypothesis (\(H_0\)) is the baseline condition, and the alternative hypothesis (\(H_a\)) is defined by the claim being tested. Recall from Section 5.1 that one objective of the analysis in this chapter is to investigate whether spending is helpful in understanding variability in playground access. Therefore, in terms of the regression model, the claim being tested is whether there is a linear relationship between spending per resident and the number of playgrounds per 10,000 residents. We will use this claim to define the alternative hypothesis. The null hypothesis is the baseline condition of there being no linear relationship between the two variables.

  • Null hypothesis: There is no linear relationship between playgrounds per 10,000 residents and spending per resident. The coefficient of spend is 0.
  • Alternative hypothesis: There is a linear relationship between playgrounds per 10,000 residents and spending per resident. The coefficient of spend is not equal to 0.

The hypotheses are defined specifically in terms of the linear relationship between the two variables, because we are ultimately drawing conclusions about the slope \(\beta_1\). The null and alternative hypotheses stated in mathematical notation are

\[ \begin{aligned} &H_0: \beta_1 = 0\\ &H_a: \beta_1 \neq 0 \end{aligned} \tag{5.5}\]

Note 5.1: Mathematical statement of hypotheses

Suppose there is a response variable \(Y\) and a predictor variable \(X\) such that

\[ Y = \beta_0 + \beta_1X + \epsilon, \hspace{5mm} \epsilon \sim N(0, \sigma^2_\epsilon) \]

The hypotheses for testing whether there is truly a linear relationship between \(X\) and \(Y\) are

\[ \begin{aligned} &H_0: \beta_1 = 0\\ &H_a: \beta_1 \neq 0 \end{aligned} \]

5.7.1.1 One vs. two-sided hypotheses

The alternative hypothesis defined in Note 5.1 and Equation 5.5 is “not equal to 0”. This is the alternative hypothesis corresponding to a two-sided hypothesis test, because it includes the scenarios in which \(\beta_1\) is less than or greater than 0. There are two other options for defining the alternative hypothesis, \(\beta_1\) “less than 0” (\(H_a: \beta_1 < 0\)) and \(\beta_1\) “greater than 0” (\(H_a: \beta_1 > 0\)), corresponding to one-sided hypothesis tests, as they only consider the alternative scenario in which \(\beta_1\) is less than or greater than 0, respectively.

A one-sided hypothesis test imposes some information about the direction of the parameter, e.g., that it is positive (\(> 0\)) or negative (\(< 0\)). Given this information imposed by the direction of the alternative hypothesis, it requires less evidence to reject the null hypothesis in favor of the alternative. Therefore, a one-sided hypothesis should only be used if (1) there is some indication from previous knowledge or research that the relationship between the response variable and the predictor being tested is in a particular direction, or (2) only one direction of the relationship between the response and the predictor being tested is relevant for the research. Outside of these two scenarios, it is not advisable to use a one-sided hypothesis, as there could appear to be a statistically significant relationship between the two variables merely by chance of how the hypotheses were constructed.

A two-sided hypothesis test makes no assumption about the direction of the relationship between the response variable and the predictor being tested. Therefore, it is a good starting point for drawing conclusions about the relationship between a given response and predictor variable. From the two-sided hypothesis test, we will conclude whether there is or is not sufficient statistical evidence of a linear relationship between the response and predictor \((\beta_1 \neq 0)\). With this conclusion alone, we cannot determine whether the relationship between the variables is positive or negative without additional analysis. As we’ll see in Section 5.8, however, we can use a confidence interval to make specific conclusions about the direction (and magnitude) of the relationship.

5.7.2 Generate the null distribution

In the same way a trial in the U.S. judicial system is conducted assuming a defendant is innocent unless there is sufficient evidence proving otherwise, hypothesis tests are conducted assuming the null hypothesis \(H_0\) is true, with the objective of assessing the strength of evidence against the null hypothesis. Based on the hypotheses defined in Section 5.7.1, a hypothesis test for the slope is conducted under the assumption that \(\beta_1 = 0\), i.e., that there is no linear relationship between the response and predictor variables.

To assess the evidence, we need to use a simulation-based method to approximate the sampling distribution of the sample slope \(\hat{\beta}_1\) , assuming \(H_0: \beta_1 = 0\) is true. This distribution, called the null distribution, allows us to understand the sample-to-sample variability under the scenario in which the true population slope equals 0. To get this new distribution, we need a new simulation method, called permutation sampling.

In permutation sampling the values of the predictor variable are randomly paired with values of the response, creating a new sample the same size as the original data. The process of randomly pairing the values of the response and the predictor variables simulates the null hypothesized condition that there is no linear relationship between the two variables.

The permutation sampling method described above is part of the process for generating the null distribution:

  1. Generate \(n_{iter}\) permutation samples, where \(n_{iter}\) is the number of iterations. We ideally use at least 1,000 iterations in order to construct a distribution that is close to the theoretical null distribution defined in Section 5.9.
  2. Fit the linear model to each of the \(n_{iter}\) permutation samples to obtain \(n_{iter}\) values of \(\hat{\beta}_1\), the estimated slope. There will also be \(n_{iter}\) estimates of \(\beta_0\), but we will ignore those for now since we are focusing on inference for the slope.
  3. Collect the \(n_{iter}\) values \(\hat{\beta}_1\) from the previous step to obtain the simulated null distribution. This is an approximation of the distribution of values \(\hat{\beta}_1\) takes if we repeatedly take samples the same size as the original data and fit the linear model to each sample, assuming the true value of the population slope \(\beta_1\) equals 0. We use this distribution to understand the sample-to-sample variability we expect in \(\hat{\beta}_1\) under the null hypothesis.

Let’s look at an example and generate the null distribution to test the hypotheses in Equation 5.5.

  1. First we generate 1000 permutation samples, such that in each sample, we permute the values of spend, randomly pairing each to a value of playgrounds. This is to simulate the scenario in which there is no linear relationship between spend and playgrounds.
  2. Next, we fit a linear model to each of the 1000 permutation samples. This gives us 1000 estimates of the slope and intercept. Table 5.6 shows the estimated slope of spend from the first 10 permutation samples.
Table 5.6: Estimated slopes from first 10 permutation samples.
replicate term estimate
1 spend -0.0002980
2 spend 0.0005629
3 spend 0.0012841
4 spend -0.0015942
5 spend -0.0021873
6 spend 0.0036558
7 spend -0.0024923
8 spend -0.0017146
9 spend -0.0012171
10 spend 0.0014289

We can already see from the first 10 permutation samples in Table 5.6 that many of the estimated slopes are close to 0. This is because permutation sampling simulates the null hypothesized condition that there is no linear relationship between spend and playgrounds.

  3. Finally, we collect the 1000 estimated slopes from the previous step to obtain the null distribution. We will use this distribution to assess the strength of the evidence from the original sample data against the null hypothesis.

    Figure 5.5: Null distribution generated from permutation sampling
Table 5.7: Summary statistics for the null distribution generated from permutation sampling
Min Q1 Median Q3 Max Mean Std.Dev.
-0.005 -0.001 0 -0.001 0.006 0 0.002

Note that the distribution visualized in Figure 5.5 and summarized in Table 5.7 is approximately unimodal, symmetric, and looks similar to the Normal distribution (also known as the Gaussian distribution). As the number of iterations (permutation samples) increases, the null distribution will get closer and closer to the theoretical null distribution.

You may also notice that the center of the distribution is approximately 0, the null hypothesized value. The standard deviation of this distribution, 0.002, is an estimate of the standard error of \(\hat{\beta}_1\), the sample-to-sample variability in \(\hat{\beta}_1\) when taking random samples of size 97, the same size as the original data.

Your turn!
  1. What is the center of the null distribution shown in Figure 5.5 and Table 5.7 ? Is this what you expected? Why or why not?6
  2. How does the estimated variability in the simulated null distribution in Table 5.7 compare to the variability in the bootstrapped distribution in Table 5.4? Is this what you expected? Why or why not?7

5.7.3 Calculate p-value

The null distribution gives us an idea of the values we expect the coefficient of spend to take if we repeatedly take random samples of cities and fit a linear regression model, all under the null hypothesized condition. To evaluate the strength of evidence against the null hypothesis, we will compare the estimated slope in Table 5.1, \(\hat{\beta}_1 =\) 0.003, to what we would expect \(\hat{\beta}_1\) to be based on the null distribution.

This comparison is quantified using a p-value. The p-value is the probability of observing values at least as extreme as the ones observed from the data, given the null hypothesis is true. In other words, this is the probability of observing values of the slope that are at least as extreme as \(\hat{\beta}_1 =\) 0.003 in the null distribution.

In the context of statistical inference, the phrase “more extreme” means the area between the estimated value (\(\hat{\beta}_1\) in our case) and the outer tail(s) of the distribution. The alternative hypothesis determines which tail(s) to include when calculating the p-value.

  • If \(H_a: \beta_1 > 0\), the p-value is the probability of obtaining a value in the null distribution that is greater than or equal to \(\hat{\beta}_1\).

  • If \(H_a: \beta_1 < 0\), the p-value is the probability of obtaining a value in the null distribution that is less than or equal to \(\hat{\beta}_1\).

  • If \(H_a: \beta_1 \neq 0\), the p-value is the probability of obtaining a value in the null distribution whose absolute value is greater than or equal to \(\hat{\beta}_1\) . This includes values that are greater than or equal to \(|\hat{\beta}_1|\) or less than or equal to \(-|\hat{\beta}_1|\).

The p-value calculated by most statistical software is the two-sided p-value. Additionally, we defined the alternative hypothesis in Section 5.7.1 as a two-sided alternative. Therefore, we will calculate the p-value for \(H_a: \beta_1 \neq 0\). As illustrated in Figure 5.6, this p-value is the probability of observing a slope that is 0.003 or more extreme, given the null hypothesis is true. In this case, it is the probability of observing a value in the null distribution that is greater than or equal to |0.003| or a value that is less than or equal to -|0.003|.

The p-value for this hypothesis test is 0.046 and is shown by the dark shaded area in Figure 5.6.

Figure 5.6
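A base-R sketch of the permutation test is shown below, under the same assumption of a parks data frame. The simulated p-value will vary slightly from run to run around the value reported above.

```r
set.seed(1234)                 # seed chosen arbitrarily
n_iter <- 1000
null_slopes <- numeric(n_iter)

# Observed slope of spend from the original sample
obs_slope <- coef(lm(playgrounds ~ spend, data = parks))["spend"]

for (i in seq_len(n_iter)) {
  # Shuffle spend so it is randomly paired with playgrounds,
  # simulating the null condition H0: beta_1 = 0
  permuted <- parks
  permuted$spend <- sample(permuted$spend)
  null_slopes[i] <- coef(lm(playgrounds ~ spend, data = permuted))["spend"]
}

# Two-sided p-value: proportion of null slopes at least as extreme as the observed slope
mean(abs(null_slopes) >= abs(obs_slope))
```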
Your turn!

Using the definition, interpret the p-value 0.046 in the context of the data.8

Caution

When using permutation tests, you may calculate a p-value of 0. Note that the true theoretical p-value is never exactly 0; a simulated p-value of 0 just means that no slope at least as extreme as the slope estimated from the sample data was observed in the permutation samples used to generate the null distribution.

We will calculate the exact p-value in Section 5.9.3 when we conduct hypothesis testing using mathematical models.

5.7.4 Draw conclusion

Recall that we conduct hypothesis tests under the assumption that the null hypothesis is true and we assess the strength of evidence against the null. The p-value is a measure of the strength of that evidence. Therefore, we use the p-value to draw one of the following conclusions:

  • If the p-value is “sufficiently small”, there is strong evidence against the null hypothesis. Therefore we reject the null hypothesis, \(H_0\), and conclude the alternative \(H_a\).
  • If the p-value is not “sufficiently small”, there is not strong enough evidence against the null hypothesis, so we fail to reject the null hypothesis, \(H_0\).

We use a decision-making threshold called an \(\mathbf{\alpha}\)-level to determine whether a p-value is sufficiently small to reject the null hypothesis.

  • If \(p-value < \alpha\), then reject \(H_0\)

  • If \(p-value \geq \alpha\), then fail to reject \(H_0\).

A commonly used threshold is \(\alpha = 0.05\). If stronger evidence is required to reject the null hypothesis, then a lower threshold (\(\alpha < 0.05\)) could be used to make a conclusion. If such strong evidence is not required (as may be the case for analyses with very small sample sizes), then a higher threshold (\(\alpha > 0.05\)) may be used. It is convention not to use a threshold greater than 0.1, so any p-value greater than 0.1 is considered large enough to fail to reject the null hypothesis.

Back to our analysis. Let’s use the common threshold of \(\alpha = 0.05\). The p-value we calculated is 0.046, which is less than 0.05. Therefore, we reject the null hypothesis \(H_0\). The data provide sufficient evidence of a linear relationship between the amount a city spends per resident and the number of playgrounds per 10,000 residents.

5.7.5 Type I and Type II error

Note that in either decision, we have not determined that the null or alternative hypothesis is true. We have just determined that the evidence (i.e., the data) provides more support for one conclusion versus the other. As with any statistical procedure, there is the possibility of making an error - more specifically, a Type I or Type II error. Because we don’t know the value of the population slope, we will not know for certain whether we have made an error; however, understanding the type of error that could potentially be made helps us make a more informed assessment about the implications of these results in practice.

Table 5.8 shows how these errors correspond to the truth and the conclusions drawn from the hypothesis test.

Table 5.8: Illustration of Type I and Type II error
Truth
Hypothesis test decision \(H_0\) true \(H_a\) true
Fail to reject \(H_0\) Correct decision Type II error
Reject \(H_0\) Type I error Correct decision

A Type I error has occurred if the null hypothesis is actually true, but the p-value is small enough that we reject the null hypothesis. The probability of making this type of error is the decision-making threshold \(\alpha\), because we reject the null hypothesis when we obtain a p-value less than \(\alpha\).

A Type II error has occurred if the alternative hypothesis is actually true, but we fail to reject the null hypothesis. The probability of making this type of error is less straightforward. It is \(1 - Power\) , where the \(Power = P(\text{reject }H_0 | H_a \text{ true})\).

In the context of our data, a Type I error is concluding that there is a statistically significant relationship between spending per resident and playgrounds per 10,000 residents in the model when there actually isn’t one. A Type II error is concluding there is not a statistically significant relationship between spending per resident and playgrounds per 10,000 residents when in fact there is.

Your turn!

Given the conclusion we reached about the relationship between spending on residents and number of playgrounds per 10,000 residents, is it possible we’ve made a Type I or Type II error?9

5.8 Relationship between CI and hypothesis test

We have described the two main types of inferential procedures: hypothesis testing and confidence intervals. At this point you may be wondering whether there is any connection between the two. Spoiler alert: there is!

Conducting the hypothesis test with the two-sided alternative \(H_a: \beta_1 \neq 0\) at the decision-making threshold \(\alpha\) is equivalent to assessing the \(C\%\) confidence interval, where \(C = (1 - \alpha)\times100\). This means we can also use confidence intervals to conduct two-sided hypothesis tests. When using a confidence interval to draw conclusions about a claim, we use the following guide:

  • If the null hypothesized value (\(0\) in our case, based on the tests defined in Section 5.7.1) is within the range of the confidence interval, we fail to reject \(H_0\) at the \(\alpha\)-level.

  • If the null hypothesized value is not within the range of the confidence interval, we reject \(H_0\) at the \(\alpha\)-level.

This illustrates the power of confidence intervals; they can not only be used to make a conclusion to reject or fail to reject \(H_0\), but they also give an estimate of the range of values the parameter of interest plausibly takes. Therefore, it is good practice to always report the confidence interval, even if the outcome of the hypothesis test is the primary interest, to provide more context about the results from the analysis beyond the binary reject/fail to reject decision.

Statistical vs. practical significance

When we reject a null hypothesis, we conclude that there is a statistically significant linear relationship between the response and predictor variables. Just because we conclude that there is a statistically significant relationship between the response and predictor, however, does not mean that the relationship is practically significant. The practical significance, how meaningful the results are in practice, is determined by the magnitude of the estimated effect of the predictor on the response and what an effect of that magnitude means in the context of the data and analysis question.

5.9 Inference based on the Central Limit Theorem

Thus far we have approached inference using simulation-based methods (bootstrapping and permutation) to generate sampling distributions and understand the sample-to-sample variability in \(\hat{\beta}_1\). When certain conditions are met, however, we can use theoretical results about the sampling distribution, and hence the variability in the estimated slope. In this section, we present that theory, then use it to conduct statistical inference for the slope \(\beta_1\). One thing you’ll notice as we go through this section is that the inferential procedures and conclusions are very similar to before. The primary difference, then, is in how the estimate of the sampling variability in \(\hat{\beta}_1\) is obtained.

5.9.1 Central Limit Theorem

The Central Limit Theorem (CLT) is a foundational theorem in statistics about the distribution of a statistic and the associated mathematical properties of that distribution. For the purposes of this text, we will focus on what the Central Limit Theorem tells us about the distribution of an estimated slope \(\hat{\beta}_1\), but note that this theorem applies to statistics other than the slope. We will also focus on the results of the theorem and less so on derivations or advanced mathematical details of the Central Limit Theorem.

By the Central Limit Theorem, we know under certain conditions (more on these conditions in the next chapter) \[\hat{\beta}_1 \sim N(\beta_1, SE_{\hat{\beta}_1}) \tag{5.6}\]

By Equation 5.6, we know that \(\hat{\beta}_1\) is (1) Normally distributed, (2) with a mean at the true slope \(\beta_1\), and (3) a standard error of \(SE_{\hat{\beta}_1}\). The center of this distribution \(\beta_1\) is unknown; the purpose of statistical inference is to draw conclusions about this parameter. The standard error of this distribution \(SE_{\hat{\beta}_1}\) is

\[ SE_{\hat{\beta}_1} = \hat{\sigma}_{\epsilon}\sqrt{\frac{1}{(n-1)s_x^2}} \tag{5.7}\]

where \(n\) is the number of observations, \(s_x^2\) is the variance of the predictor variable \(X\), and \(\hat{\sigma}_{\epsilon}\) is the regression standard error introduced in Section 5.9.2. The details of how this formula is derived are beyond the scope of this text, but let’s take a moment to think about what we know about the sampling variability of \(\hat{\beta}_1\) from Equation 5.7.

The regression standard error \(\hat{\sigma}_\epsilon\) is in the numerator, indicating we would expect more variability in \(\hat{\beta}_1\) when there is more variability about the regression line. The variance of the predictor variable \(s_x^2\) is in the denominator, indicating we expect more variability in \(\hat{\beta}_1\) when there is less variability in the values of the predictor. Therefore, \(SE_{\hat{\beta}_1}\) reflects the balance between the variability about the regression line and the variability in the predictor variable itself.
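Equation 5.7 can be evaluated directly in R. The sketch below assumes the hypothetical parks data frame and uses sigma() to extract the regression standard error, which is discussed in the next subsection.

```r
fit <- lm(playgrounds ~ spend, data = parks)

sigma_hat <- sigma(fit)        # regression standard error, sigma-hat_epsilon
n <- nrow(parks)               # number of observations
s2_x <- var(parks$spend)       # variance of the predictor

# Standard error of the estimated slope (Equation 5.7)
sigma_hat * sqrt(1 / ((n - 1) * s2_x))
```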

As you will see in the following sections, we will use this estimate of the sampling variability in the estimated slope to conduct hypothesis testing and calculate confidence intervals, ultimately drawing conclusions about the true relationship between the response and predictor variables.

5.9.2 Estimating \(\sigma_{\epsilon}\)

As discussed in Section 4.3.1, there are three parameters that need to be estimated for simple linear regression: \(\beta_0\), \(\beta_1\), and \(\sigma_{\epsilon}\). Section 4.4 introduced least squares regression, a method for calculating \(\hat{\beta}_0\) and \(\hat{\beta}_1\), the estimates of the intercept and slope, respectively. Now, we turn to the estimation of \(\sigma_\epsilon\), also known as the regression standard error.

By obtaining the estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\), we have the equation of the regression line and therefore can estimate the mean value of \(Y\) given a value of the predictor \(X\). We can use the residuals (observed - predicted), then, to estimate the variability of the distribution of the response about the line, which is the estimated mean of the conditional distribution \(Y|X\). The regression standard error is this estimate of the variability, as it is the estimated standard deviation of the distribution of the residuals about the line. The equation for the regression standard error is shown below.10

\[ \hat{\sigma}_{\epsilon} = \sqrt{\frac{\sum_{i = 1}^ne_i^2}{n - 2}} = \sqrt{\frac{\sum_{i = 1}^n(y_i - \hat{y}_i)^2}{n - 2}} \]

Why divide by \(n-2\)?

You may have noticed that the denominator in the equation above is \(n - 2\), not \(n\) or \(n-1\). The value \(n - 2\) is called the degrees of freedom for simple linear regression. The degrees of freedom are how many observations are available to understand variability about the line. We need two observations to make a line; therefore, any additional observations help us understand variability about the line. We will talk more about how we use degrees of freedom for inference based on the Central Limit Theorem in Section 5.9.

Recall that the standard deviation is roughly the average distance between each observation and the mean of the distribution. Therefore the regression standard error can be thought of as the average distance the observed values of the response are from the regression line.
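To make this concrete, the regression standard error can be computed from the residuals of a fitted model, as in the short sketch below (same assumption about the parks data frame). The built-in extractor sigma() returns the same quantity.

```r
fit <- lm(playgrounds ~ spend, data = parks)

# Regression standard error: square root of the residual sum of squares divided by n - 2
sqrt(sum(residuals(fit)^2) / (nrow(parks) - 2))

# Equivalent built-in extractor
sigma(fit)
```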

The regression standard error \(\hat{\sigma}_{\epsilon}\) is used to quantify the sampling variability in the estimated slope \(\hat{\beta}_1\), so we will use this value as we conduct inference on the slope. This will be especially true in Section 5.9 when we introduce inference methods based on the Central Limit Theorem.

5.9.3 Hypothesis test for the slope

The overall goals of hypothesis tests for a slope are the same when using methods based on the Central Limit Theorem as they are when using simulation-based methods. We define a null and alternative hypothesis, conduct testing assuming the null hypothesis is true, and draw a conclusion based on an assessment of the evidence against the null hypothesis. The main difference between these two approaches is in how we quantify the variability in \(\hat{\beta}_1\) and thus define the null distribution. In Section 5.7.2 we used permutation sampling to generate the null distribution. We’ll see that the Central Limit Theorem gives us results that specify exactly how the null distribution is defined.

The steps for conducting a hypothesis test based on the Central Limit Theorem are the following:

  1. State the null and alternative hypotheses.
  2. Calculate a test statistic.
  3. Calculate a p-value.
  4. Draw a conclusion.

As in Section 5.7 the goal is to use hypothesis testing to determine whether there is evidence of a statistically significant linear relationship between a response and predictor variable, corresponding to the two-sided alternative hypothesis of “not equal to 0”. Therefore, the null and alternative hypotheses are the same as defined in Equation 5.5 \[ \begin{aligned} H_0: \beta_1 = 0 \\ H_a: \beta_1 \neq 0 \\ \end{aligned} \]

The next step is to calculate a test statistic. The general form of a test statistic is

\[ T = \frac{\text{observed } - \text{ hypothesized}}{\text{standard error}} \]

More specifically, in the hypothesis test for \(\beta_1\), the test statistic is

\[ T = \frac{\hat{\beta}_1 - 0}{SE_{\hat{\beta}_1}} \]

To calculate the test statistic, the observed slope is shifted by the hypothesized mean and then rescaled by the standard error. Let’s consider what we learn from the test statistic. Recall that by the Central Limit Theorem, the distribution of \(\hat{\beta}_1\) is \(N(\beta_1, SE_{\hat{\beta}_1})\). Because we conduct hypothesis testing under the assumption the null hypothesis is true, we are assuming that the mean of the distribution of \(\hat{\beta}_1\) is 0. Therefore, the observed slope is shifted by the hypothesized mean 0 and rescaled by \(SE_{\hat{\beta}_1}\), defined in Equation 5.7. Similar to a \(z\)-score in the Normal distribution, the test statistic tells us how far the observed slope is from the center of the distribution assumed by the null hypothesis. The magnitude of the test statistic \(|T|\) provides a measure of how far the observed slope is from the center of the distribution, and the sign of \(T\) indicates whether the observed slope is above (positive sign) or below (negative sign) the hypothesized mean of 0.

Your turn!

Consider the magnitude of the test statistic, \(|T|\). Do you think test statistics with low magnitude provide evidence in support or against the null hypothesis? What about test statistics with high magnitude?11

Next, we use the test statistic to calculate a p-value and ultimately draw a conclusion about the strength of the evidence against the null hypothesis. The test statistic, \(T\), follows a \(t\) distribution with \(n - p - 1\) degrees of freedom, where \(n\) is the number of observations used to fit the model and \(p\) is the number of predictor terms in the model (terms not including the intercept). In simple linear regression, because we only have one predictor term in the model, the degrees of freedom will always be \(n - 1 - 1 = n - 2\). Because the test statistic is essentially a standardization of the observed slope under the assumption that the null hypothesis is true, we use the distribution of the test statistic in place of a simulated null distribution from the permutation test to understand how far the observed slope is from what we would expect given the null hypothesis is true.

Because \(SE_{\hat{\beta}_1}\) in the test statistic is calculated using an estimated value, \(\hat{\sigma}_\epsilon\), to calculate the p-value we need a distribution that can handle a bit more variability, as we know that \(\hat{\sigma}_\epsilon\) and \(SE_{\hat{\beta}_1 }\) are more than likely not the exact values in the population. Thus, we use the \(t\) distribution for statistical inference on the slope, as it accounts for this additional variability.

\(t_{df}\) vs. \(N(0,1)\) distribution

Figure 5.7 shows the standard normal distribution \(N(0,1)\) and the \(t\) distribution for different degrees of freedom. The \(t\) distribution is very similar to the standard normal distribution: they are both centered at 0 and have a shape that is unimodal and symmetric. In other words, they both look like “bell curves” with a mean of 0. The difference, then, is that the \(t\) distribution allows for more variability than what is expected in the standard normal distribution. This is also referred to as having “fatter tails”, as the \(t\) distribution has more area under the curve in the tails (or more extreme values) of the distribution. As the degrees of freedom increase, the \(t\) distribution becomes closer to the standard normal distribution.

Figure 5.7: Standard normal vs. t distributions
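The heavier tails can also be seen numerically. For example, the probability of exceeding 2 is larger under a \(t\) distribution than under \(N(0,1)\), and it shrinks toward the normal value as the degrees of freedom grow:

```r
# P(Z > 2) under the standard normal distribution
pnorm(2, lower.tail = FALSE)

# P(t > 2) for t distributions with increasing degrees of freedom
pt(2, df = c(5, 30, 95), lower.tail = FALSE)
```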

As in the case of simulation-based inference, because the alternative hypothesis is “not equal to”, the p-value is calculated on both the high and low extremes of the distribution as shown in Equation 5.8.

\[ \begin{aligned} &p-value = Pr(|t| > |T|) = Pr(t < -|T| \text{ or } t > |T|) \\[5pt] &\text{where } t \sim t_{n-2} \end{aligned} \tag{5.8}\]

As with simulation-based inference, we compare the p-value to a decision-making threshold \(\alpha\) to draw final conclusions. If \(p-value < \alpha\) , we reject the null hypothesis and conclude the alternative. Otherwise, we fail to reject the null hypothesis.

Let’s apply this process to test whether there is a linear relationship between spending per resident and the number of playgrounds per 10,000 residents. The null and alternative hypotheses are

\[ \begin{aligned} &H_0: \beta_{1} = 0 \\ &H_a: \beta_{1} \neq 0 \end{aligned} \]

where \(\beta_1\) is the true slope between spend and playgrounds. From Table 5.1, we know the observed slope \(\hat{\beta}_{1}\) is 0.0033 and the estimated standard error \(SE_{\hat{\beta}_{1}}\) is 0.0016. The test statistic is

\[ T = \frac{0.0033 - 0}{0.0016} = 2.063 \]

This test statistic means that, given the true slope of spend in this model is 0 and thus the mean of the distribution of \(\hat{\beta}_{1}\) is 0, the observed slope of 0.003 is 2.063 standard errors above the mean. It’s difficult to determine whether or not this is really “far enough” away from the center of the distribution, but we can calculate a p-value to determine the probability of observing a slope at least this far away, given the null hypothesis is true.

Given there are \(n\) = 97 observations and \(p\) = 1 predictor term in the model, the test statistic follows a \(t\) distribution with \(97 - 1 - 1 = 95\) degrees of freedom. The p-value is \(Pr(t < - |2.063| \text{ or } t > |2.063|) =\) 0.042.
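These values can be reproduced with the t distribution functions in R, using the rounded estimates from Table 5.1:

```r
beta_hat <- 0.0033                     # estimated slope of spend (Table 5.1)
se_beta  <- 0.0016                     # standard error of the slope
t_stat   <- (beta_hat - 0) / se_beta   # test statistic, approximately 2.06
df       <- 97 - 1 - 1                 # n - p - 1 = 95 degrees of freedom

# Two-sided p-value: P(|t| >= |T|) under the t distribution with 95 df
2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
```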

Using a decision-making threshold \(\alpha = 0.05\), the p-value \(0.042\) is sufficiently small, so we reject the null hypothesis. The data provide sufficient evidence that the coefficient of spend is not 0 in this model and that there is a statistically significant linear relationship between spending per resident and playgrounds per 10,000 residents in U.S. cities.

Note that this conclusion is the same as in Section 5.7.4 using a simulation-based approach. This is what we would expect, given these are the two different approaches for conducting the same process - hypothesis testing for the slope. We are also conducting the tests under the same assumptions that the null hypothesis is true. The difference is in the methods available - simulation versus mathematical models - to quantify \(SE_{\hat{\beta}_1}\), the sampling variability in the estimated slopes \(\hat{\beta}_1\).

5.9.4 Confidence interval

As with simulation-based inference, a confidence interval calculated based on the results of the Central Limit Theorem is an estimated range of values the parameter \(\beta_1\) plausibly takes. As with hypothesis testing, the purpose, interpretation, and conclusions drawn from confidence intervals are the same as before. What differs, however, is how the interval is calculated. In the simulation-based approach, we used bootstrapping to construct a sampling distribution to understand the sample-to-sample variability in \(\hat{\beta}_1\). By the Central Limit Theorem, we know exactly how to quantify the sample-to-sample variability in \(\hat{\beta}_1\) using mathematical results.

The equation for a \((C \times 100) \%\) confidence interval for \(\beta_1\) is

\[ \hat{\beta}_1 \pm t^* \times SE_{\hat{\beta}_1} \]

where \(t^*\) is a critical value from the \(t_{n - 2}\) distribution.

From earlier sections, we already know the estimated slope \(\hat{\beta}_1\) and the standard error \(SE_{\hat{\beta}_1}\). Therefore, let’s focus on \(t^*\), known as the critical value.

The critical value is the point on the \(t_{n-2}\) distribution such that the probability of being between \(-t^*\) and \(t^*\) is \(C\). Thinking about this visually, this is the point such that the area under the curve between \(-t^*\) and \(t^*\) is \(C\). Note that we are still using a \(t\) distribution with \(n - 2\) degrees of freedom, the same distribution used to calculate the p-value in hypothesis testing. The critical value can be calculated from modern statistical software or using online apps.

Let’s calculate the 95% confidence interval for \(\beta_{1}\) . The critical value on the \(t_{95}\) distribution is 1.985. You may notice that this value is close to the critical value of 1.96 on the standard normal distribution. Therefore, the 95% confidence interval is

\[ \begin{aligned} 0.0033 \pm 1.985 \times 0.0016 \\ 0.0033 \pm 0.0032 \\ [0.0001, 0.0065] \end{aligned} \]
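In R, the critical value and the interval can be computed as follows (again using the rounded estimates from Table 5.1):

```r
beta_hat <- 0.0033
se_beta  <- 0.0016

# Critical value on the t distribution with n - 2 = 95 degrees of freedom
t_star <- qt(0.975, df = 95)   # approximately 1.985

# 95% confidence interval for the slope of spend
beta_hat + c(-1, 1) * t_star * se_beta
```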

The interpretation is the same as before: We are 95% confident that the interval 0.0001 to 0.0065 contains the true slope for spend. This means we are 95% confident that for each additional dollar increase in spending per resident, the number of playgrounds per 10,000 residents is greater by 0.0001 to 0.0065, on average.

5.10 Summary

In this chapter we introduced two approaches for conducting statistical inference and drawing conclusions about a population slope - simulation-based methods and methods based on the Central Limit Theorem. You may have noticed that the standard error, test statistic, p-value, and confidence interval we calculated using the mathematical models from the Central Limit Theorem align with what is seen in the output produced by statistical software in Table 5.1. Modern statistical software will produce these values for you, so in practice you will not typically derive these values “manually” as we did in this chapter. As the data scientist, your role will be to interpret the output and use it to draw conclusions. It’s still valuable, however, to have an understanding of where these values come from in order to interpret and use them accurately.

We have these two approaches for inference, so which one do we use for a given analysis? In the next chapter, we will discuss the model assumptions from Section 5.3 in more detail along with conditions to check if the assumptions hold for a given data set. We will use these conditions in conjunction with other statistical and practical considerations to determine when we might prefer simulation-based methods or those based on the Central Limit Theorem.


  1. We sample with replacement so that we get a new sample each time we bootstrap. If we sampled without replacement, we would always end up with a sample that looks exactly like our sample data.↩︎

  2. There are 97 observations in each bootstrap sample. Each bootstrap sample is the same size as the sample data.↩︎

  3. There are 1000 values, the number of iterations, in the bootstrapped sampling distribution.↩︎

  4. The points at the \(5^{th}\) and \(95^{th}\) percentiles mark the lower and upper bounds for a 90% confidence interval.↩︎

  5. The points at the \(1^{st}\) and \(99^{th}\) percentiles mark the lower and upper bounds for a 98% confidence interval.↩︎

  6. The center of the distribution is approximately 0. This is about what we expect, given the null hypothesized value is 0.↩︎

  7. The variability is approximately equal in both distributions.↩︎

  8. Given there is no linear relationship between spending per resident and playgrounds per 10,000 residents ( \(H_0\) is true), the probability of observing a slope of 0.003 or more extreme in a random sample of 97 cities is 0.046 .↩︎

  9. It is possible we have made a Type I error, since we rejected the null hypothesis.↩︎

  10. Note its similarities to the general equation for sample standard deviation, \(s = \sqrt{\frac{\sum_{i=1}^n(y_i - \bar{y})^2}{n-1}}\)↩︎

  11. Test statistics with low magnitude provide evidence in support of the null hypothesis, as they are close to the hypothesized value. Conversely test statistics with high magnitude provide evidence against the null hypothesis, as they are very far away from the hypothesized value.↩︎