5 Inference for simple linear regression
This chapter is a work in progress.
Learning goals
- Describe how statistical inference is used to draw conclusions about a population slope
- Construct confidence intervals using bootstrap simulation
- Conduct hypothesis tests using permutation
- Describe how the Central Limit Theorem is applied to inference for the slope
- Conduct statistical inference on the slope using mathematical models based on the Central Limit Theorem
- Interpret results from statistical inference in the context of the data
- Understand the connection between hypothesis tests and confidence intervals
R Packages
```r
library(tidyverse)   # (Wickham et al. 2019)
library(patchwork)   # (Pedersen 2022)
library(skimr)       # (Waring et al. 2022)
library(broom)       # (Robinson, Hayes, and Couch 2023)
```
5.1 Introduction: Access to playgrounds
The Trust for Public Land is a non-profit organization that advocates for equitable access to outdoor spaces in cities across the United States. In the 2021 report Parks and an Equitable Recovery (The Trust for Public Land 2021), the organization stated that “parks are not just a nicety—they are a necessity”. The report details the many health, social, and environmental benefits of having ample access to public outdoor space in cities along with the various factors that impede the access to parks and other outdoor space for some residents.
One type of outdoor space the authors study in their report is playgrounds. The report describes playgrounds as one type of outdoor space that “bring children and adults together” (The Trust for Public Land 2021, 13) and a place that was important for distributing “fresh food and prepared meals to those in need, particularly school-aged children” (The Trust for Public Land 2021, 9) during the global COVID-19 pandemic.
Given the impact of playgrounds on both children and adults in a community, we will focus on understanding variability in access to playgrounds in this chapter. In particular, we want to (1) investigate whether local government spending is useful in understanding variability in playground access, and if so, (2) quantify the true relationship between local government spending and playground access.
The data in parks.csv includes information on 97 of the most populated cities in the United States (US) in the year 2020. The data were originally collected by the Trust for Public Land and were featured as part of the TidyTuesday weekly data visualization challenge. The analysis in this chapter will focus on two variables:

- `per_capita_expend`: Total amount the city spends per resident in 2020 (in US dollars). This is referred to as a city's "per capita expenditure".
- `playgrounds`: Number of playgrounds per 10,000 residents in 2020
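A minimal sketch of loading and previewing the data, assuming parks.csv is in the working directory (the file name and variable names come from the text above; the use of `read_csv()` and `glimpse()` here is illustrative):

```r
# Read the parks data; assumes parks.csv is in the working directory
parks <- read_csv("parks.csv")

# Preview the two variables used in this chapter
parks |>
  select(per_capita_expend, playgrounds) |>
  glimpse()
```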
Which of the following do you expect to be true about the relationship between per capita expenditure and the number of playgrounds per 10,000 residents?
- The relationship is positive.
- The relationship is negative.
- There is no relationship.
5.1.1 Exploratory data analysis
The visualizations and summary statistics for univariate and bivariate exploratory data analysis are in Figure 5.1 and ?tbl-parks-univariate-eda.
The distribution of playgrounds per 10,000 residents (the response variable) is unimodal and right-skewed. The center of the distribution is the median of about 2.6 playgrounds per 10,000 residents, and the spread of the middle 50% of the distribution (the IQR) is 1.7. There appear to be two potential outlying cities with more than 6 playgrounds per 10,000 residents, indicating high playground access relative to the other cities in the data set.
The distribution of city expenditures (the predictor variable) is also unimodal and right-skewed. The center of the distribution is around 89 dollars per resident, and the middle 50% of the distribution has a spread of about 77 dollars per resident. Similar to the response variable, there are some potential outliers. There are 5 cities with spending greater than 300 dollars per resident.
From Figure 5.2 there appears to be a positive relationship between a city’s per capita expenditure and the number of playgrounds per 10,000 residents. The correlation is 0.206, indicating the relationship between playground access and city expenditure is not very strong. This is partially influenced by the outlying observations in which there is relatively low city per capita expenditure but high numbers of playgrounds per 10,000 residents.
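The code below is a sketch of how this exploratory analysis could be produced with ggplot2 and patchwork (both loaded above); the exact binwidths and labels used for Figure 5.1 and Figure 5.2 may differ:

```r
# Univariate EDA: distributions of the response and predictor
p1 <- ggplot(parks, aes(x = playgrounds)) +
  geom_histogram() +
  labs(x = "Playgrounds per 10,000 residents")

p2 <- ggplot(parks, aes(x = per_capita_expend)) +
  geom_histogram() +
  labs(x = "Per capita expenditure (US dollars)")

# Bivariate EDA: scatterplot of the two variables
p3 <- ggplot(parks, aes(x = per_capita_expend, y = playgrounds)) +
  geom_point() +
  labs(x = "Per capita expenditure (US dollars)",
       y = "Playgrounds per 10,000 residents")

(p1 + p2) / p3   # patchwork layout: univariate plots on top, scatterplot below

# Correlation between expenditure and playground access
parks |>
  summarize(r = cor(per_capita_expend, playgrounds))
```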
Linear regression model
To better explore this relationship, we fit a simple linear regression model of the form

$$playgrounds = \beta_0 + \beta_1 \times per\_capita\_expend + \epsilon \tag{5.1}$$

The model can be fit using the following R code:
```r
parks_model <- lm(playgrounds ~ per_capita_expend, data = parks)
parks_model_tidy <- tidy(parks_model)

parks_model_tidy |>
  kable(digits = 3)
```
The estimated regression equation from the output in Table 5.2 is

$$\widehat{playgrounds} = 2.418 + 0.003 \times per\_capita\_expend$$
Interpret the slope and intercept in the context of the data.
Interpret the slope in terms of an additional $100 in per capita expenditure. Which interpretation is more meaningful in practice?
Does the intercept have a meaningful interpretation?1
From the sample of 97 cities in 2020, the estimated slope is 0.003. This estimated slope is likely close to but not the exact value of the true population slope that we would obtain if we had data on every city in the US. Based on the equation alone, we are also not sure if this slope indicates an actual meaningful relationship between the two variables, or if this slope is due to random variability in the data. Therefore, we will use statistical inference methods to help answer these questions and use the model to draw conclusions beyond these 97 cities.
5.2 Objectives of statistical inference
Based on the regression output in Table 5.2, for each additional dollar in per capita expenditure, the number of playgrounds per 10,000 residents is expected to be greater by 0.003, on average.
The estimate 0.003 is the "best guess" of the relationship between per capita expenditure and the number of playgrounds per 10,000 residents; however, this is likely not the exact value of the relationship in the population of all US cities. Therefore, we will use statistical inference, the process of drawing conclusions about the population parameters based on the analysis of the sample data.
There are two different types of statistical inference procedures:
- Hypothesis tests: Test a specific claim about the population parameter
- Confidence intervals: A range of values that may reasonably contain the value of the population slope.
This chapter focuses on statistical inference for the slope $\beta_1$.

As we'll see throughout the chapter, a key component of statistical inference is quantifying the sampling variability, the sample-to-sample variability in the statistic that is the "best guess" estimate for the parameter. For example, when we conduct statistical inference on the slope of per capita expenditure, we need to understand how the estimated slope $\hat{\beta}_1$ varies across different samples of 97 cities.
It is not feasible to collect a lot of new samples for this process. Instead, there are two approaches to obtain the sampling distribution, quantify the variability in the estimated slopes, and conduct statistical inference.
- Simulation-based methods: Quantify the sampling variability by generating a sampling distribution directly from the sample data.
- Theory-based methods: Quantify the sampling variability using mathematical models based on the Central Limit Theorem.
Section 5.4 and Section 5.6 introduce statistical inference using simulation-based methods, and Section 5.8 introduces inference using theory-based methods. Before we get into those details, however, let’s introduce more of the foundational ideas underlying simple linear regression and how they relate to statistical inference.
Statistical inference: Using sample data to draw conclusions about a population.
Population parameter: Quantitative measure of a feature of the population (for example, the slope for the population)
Sample statistic: Estimate of the population parameter using the sample data (also called the point estimate)
Sampling distribution: Distribution of the sample statistic across many samples.
Sampling variability: Variability in the sampling distribution. This can be measured using the standard deviation, variance, IQR, etc.
5.3 Foundations of simple linear regression
In Section 4.3.1, we introduced the statistical model for simple linear regression

$$Y = \beta_0 + \beta_1 X + \epsilon \tag{5.3}$$

such that $\epsilon \sim N(0, \sigma_\epsilon^2)$. Equivalently, the distribution of the response conditional on the predictor is

$$Y | X \sim N(\beta_0 + \beta_1 X, \sigma_\epsilon^2) \tag{5.4}$$

We conduct simple linear regression assuming Equation 5.4 is true. Equation 5.4 is the assumed distribution of the response variable conditional on the predictor variable under the simple linear regression model. Based on this equation, we specify the assumptions that are made when we do simple linear regression.
- The distribution of the response $Y$ is normal for a given value of the predictor $X$.
- The mean of the distribution of $Y$ given $X$ is $\beta_0 + \beta_1 X$. There is a linear relationship between the response and predictor variable.
- The variance of the distribution of $Y$ given $X$ is $\sigma_\epsilon^2$. This variance does not depend on $X$; it is equal for all values of $X$.
- The error terms $\epsilon$ in Equation 5.3 are independent of one another. This also means the observations are independent of one another.
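To make these assumptions concrete, here is a short simulation that generates a data set satisfying all four; every parameter value below is hypothetical and chosen only for illustration:

```r
set.seed(42)
n      <- 97
beta_0 <- 2.4       # hypothetical intercept
beta_1 <- 0.005     # hypothetical slope
sigma  <- 1         # hypothetical standard deviation of the errors

x <- runif(n, min = 20, max = 400)         # predictor values
epsilon <- rnorm(n, mean = 0, sd = sigma)  # independent, normal errors with constant variance
y <- beta_0 + beta_1 * x + epsilon         # mean of y given x is linear in x

sim_data <- tibble(x = x, y = y)
```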
Whenever we fit linear regression models and conduct inference on the slope, we do so under the assumption that some or all of these four statements hold. In Chapter 6, we will discuss how to check if these assumptions hold in a given analysis. As we might expect, these assumptions do not always perfectly hold in practice, so we will also discuss circumstances in which each assumption is necessary versus when some can be relaxed. For the remainder of this chapter, however, we will proceed as if all these assumptions hold.
5.4 Bootstrap confidence intervals
A confidence interval is a range of values that may reasonably contain the value of the population slope. This range is computed from the sample data. By computing this range, we are more likely to capture the value of the true population parameter than if we only used the estimated sample statistic (called a point estimate). We will focus on confidence intervals for the slope $\beta_1$.
In order to obtain this range of values, we must understand the sampling variability of the statistic. Suppose we repeatedly take samples of size $n$ from the population and fit the linear model to each sample; the estimated slopes would vary from sample to sample, forming the sampling distribution of $\hat{\beta}_1$. In practice, however, we have only one sample, so we approximate this process using bootstrapping: we generate new samples of size $n$, called bootstrap samples, by repeatedly sampling with replacement from the original sample data. We can compute $\hat{\beta}_1$ for each bootstrap sample and use the collection of estimates to quantify the sampling variability.

Why do we sample with replacement when getting the bootstrap sample? What would the bootstrap sample look like if sampling is done without replacement?2
5.4.1 Constructing a bootstrap confidence interval for $\beta_1$

A bootstrap confidence interval for the population slope, $\beta_1$, is constructed using the following steps:
- Generate $N$ bootstrap samples, where $N$ is the number of iterations. We typically want to use at least 1000 iterations in order to construct a sampling distribution that is close to the theoretical distribution of $\hat{\beta}_1$ defined in Section 5.8.
- Fit the linear model to each of the $N$ bootstrap samples to obtain $N$ values of $\hat{\beta}_1$, the estimated slope. There will also be $N$ values of the estimated intercept, $\hat{\beta}_0$, but we will ignore those for now because we are not focusing on inference for the intercept.
- Collect the $N$ values of $\hat{\beta}_1$ from the previous step to obtain the bootstrapped sampling distribution. It is an approximation of the sampling distribution of $\hat{\beta}_1$, and thus provides information about the sampling variability of $\hat{\beta}_1$.
- Use the distribution from the previous step to calculate the $C\%$ confidence interval. The lower and upper bounds of the interval are the points that mark off the middle $C\%$ of the distribution.
Using these four steps, we can construct the 95% confidence interval for the population slope $\beta_1$.
- Generate 1000 bootstrap samples (97 observations in each sample) by sampling with replacement from the current sample data of 97 observations. The first 10 observations from the first bootstrapped sample are shown in Table 5.3.
Why are there 97 observations in each bootstrap sample?3
- Next, we fit a linear model of the form in Equation 5.1 to each of the 1000 bootstrap samples. The estimated slopes and intercepts for the first three bootstrap samples are shown in Table 5.4.
- We are focused on inference for the slope of `per_capita_expend`, so we collect the estimated slopes of `per_capita_expend` to make the bootstrap distribution. This is the approximation of the sampling distribution of $\hat{\beta}_1$. A histogram and summary statistics for this distribution are shown in Figure 5.3 and Table 5.5, respectively.

How many values of $\hat{\beta}_1$ are in the bootstrapped sampling distribution?4
- As the final step, we use the bootstrap distribution to calculate the lower and upper bounds of the 95% confidence interval. These bounds are calculated as the points that mark off the middle 95% of the distribution, that is, the points at the 2.5th and 97.5th percentiles, as shown by the vertical lines in Figure 5.4.
The 95% bootstrapped confidence interval for `per_capita_expend` is 0.001 to 0.007.
5.4.2 Interpreting the interval
The basic interpretation for the 95% confidence interval for `per_capita_expend` is:
We are 95% confident that the interval 0.001 to 0.007 contains the population slope for per capita expenditure in the model of the relationship between city per capita expenditure and number of playgrounds per 10,000 residents.
Though this interpretation indicates the range of values that may reasonably contain the true population slope for `per_capita_expend`, it still requires the reader to further interpret what it means about the relationship between per capita expenditure and playgrounds per 10,000 residents. It is more informative to interpret the confidence interval in a way that also utilizes the interpretation of the slope from Section 5.1.1, so it is clear to the reader exactly what the confidence interval means. Thus, a more complete and informative interpretation of the confidence interval is as follows:
We are 95% confident that for each additional dollar a city spends per resident, the number of playgrounds per 10,000 residents is greater by 0.001 to 0.007, on average.
This interpretation not only indicates the range of values as before, but it also clearly describes what this range means in terms of the average change in playgrounds per 10,000 residents as per capita expenditure increases.
5.4.3 What does “confidence” mean?
The beginning of the interpretation for a confidence interval is "We are 95% confident…". The confidence level describes the long-run behavior of the procedure: if we were to repeatedly take samples of 97 cities and construct a 95% confidence interval from each sample, we would expect about 95% of the intervals to contain the true population slope.

In reality we don't know the value of the population slope (if we did, we wouldn't need statistical inference!), so we can't definitively conclude if the interval constructed in Section 5.4.1 is one of the approximately 95% of intervals that contain the true slope. The "confidence" is in the procedure used to construct the interval, not in any single interval.
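A small simulation can illustrate this long-run property. The sketch below repeatedly draws samples from a hypothetical population with a known slope and records how often a 95% confidence interval captures it; for speed it uses theory-based intervals from `confint()` rather than bootstrapping, and all population values are made up:

```r
set.seed(2)
n_samples <- 1000
true_slope <- 0.003                    # hypothetical population slope
covered <- logical(n_samples)

for (i in 1:n_samples) {
  x <- runif(97, min = 20, max = 400)            # hypothetical predictor values
  y <- 2.4 + true_slope * x + rnorm(97, sd = 1)  # hypothetical population model
  fit <- lm(y ~ x)
  ci <- confint(fit, "x", level = 0.95)          # 95% CI for the slope
  covered[i] <- ci[1] <= true_slope & true_slope <= ci[2]
}

mean(covered)   # proportion of intervals containing the true slope; about 0.95
```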
Thus far, we have used a confidence interval to produce a plausible range of values for the population slope. We can also test specific claims about the population slope using another inferential procedure called hypothesis testing.
5.4.4 Bootstrap confidence intervals in R
The bootstrap distribution and confidence interval are computed using the infer package (Couch et al. 2021). Because bootstrapping is a random sampling process, the code should include `set.seed()` to ensure the results are reproducible. Any integer value can go inside the `set.seed()` function.
- Set a seed to make the results reproducible.
- Define the number of bootstrap samples (iterations). Bootstrapping can be computationally intensive when using large data sets and a large number of iterations. We recommend using a small number of iterations (10 to 100) when testing code, then increasing the iterations once the code is finalized.
- Specify the data set and save the bootstrap distribution in the object `boot_dist`.
- Specify the response and predictor variable.
- Specify the type of simulation ("bootstrap") and the number of iterations.
- For each bootstrap sample, fit the linear regression model.
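Putting these steps together, a sketch of the infer pipeline consistent with the annotations above (the seed value and iteration count are placeholders):

```r
library(infer)

set.seed(101)                                     # 1. make results reproducible

boot_dist <- parks |>                             # 3. save distribution as boot_dist
  specify(playgrounds ~ per_capita_expend) |>     # 4. response ~ predictor
  generate(reps = 1000, type = "bootstrap") |>    # 2, 5. simulation type and iterations
  fit()                                           # 6. fit the model to each sample
```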
We can use `ggplot` to make a histogram of the bootstrap distribution.
```r
boot_dist |>
  filter(term == "per_capita_expend") |>
  ggplot(aes(x = estimate)) +
  geom_histogram() +
  labs(x = "Estimated coefficient",
       title = "Bootstrapped sampling distribution of slope")
```
Finally, we can compute the lower and upper bounds for the confidence interval using the `quantile` function, as in the sketch below. Note that the code includes `ungroup()` so that the data are not grouped by `replicate`.
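A sketch of that computation (percentile bounds for a 95% interval):

```r
boot_dist |>
  filter(term == "per_capita_expend") |>
  ungroup() |>                                 # remove the grouping by replicate
  summarize(lb = quantile(estimate, 0.025),    # 2.5th percentile
            ub = quantile(estimate, 0.975))    # 97.5th percentile
```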
```
# A tibble: 1 × 2
        lb      ub
     <dbl>   <dbl>
1 0.000735 0.00711
```
5.5 Hypothesis testing
Hypothesis testing is used to evaluate a claim about a population parameter. The claim could be based on previous research, an idea a research or business team wants to explore, or a general statement about the parameter. We can use the process detailed in this section to test a claim about the intercept $\beta_0$ or the slope $\beta_1$; in this chapter we focus on the slope.

Before getting into the details of simulation-based hypothesis testing, we'll illustrate the steps for a hypothesis test using a commonly used analogy, the general procedure of a court trial in the United States (US) judicial system.
Define the hypotheses
The first step of any hypothesis test (or trial) is to define the hypotheses to be evaluated. These hypotheses are called the null and alternative. The null hypothesis ($H_0$) is the baseline claim that is assumed to be true, and the alternative hypothesis ($H_a$) is the claim we are evaluating for evidence.

In the US judicial system, a defendant is deemed innocent unless proven otherwise. Therefore, the null and alternative hypotheses are

- $H_0$: The defendant is not guilty.
- $H_a$: The defendant is guilty.
Evaluate the evidence
The primary component of a trial (or hypothesis test) is presenting and evaluating the evidence. In a trial, this is the point when the evidence is presented and evaluated under the assumption that the null hypothesis (the defendant is not guilty) is true. Thus, the lens through which the evidence is evaluated is "given the defendant is not guilty, how likely is it that this evidence would exist?"
For example, suppose an individual is on trial for a robbery at a jewelry store. The null hypothesis is that they are not guilty and did not rob the jewelry store. The alternative hypothesis is they are guilty and did rob the jewelry store. If there is evidence that the person was in a different city during the time of the jewelry store robbery, the evidence would be more in support of the null hypothesis of innocence. It seems plausible the individual could have been in a different city at the time of the robbery if the null hypothesis is true. Alternatively, if some of the missing jewelry was found in the individual’s car, the evidence would seem to be strongly in support of the alternative hypothesis. If the null hypothesis is true, it does not seem likely that the individual would have the missing jewelry in their car.
In hypothesis testing, the “evidence” being assessed is the analysis of the sample data. Thus we are considering the question “given the null hypothesis is true, how likely is it to observe the results seen in the sample data?” We will introduce approaches to address this question using simulation-based methods in Section 5.6 and theory-based methods in Section 5.8.3.
Make a conclusion
There are two typical conclusions in a trial in the US judicial system: the defendant is guilty or not guilty based on the evidence. The criterion to conclude the alternative, that a defendant is guilty, is that the strength of evidence must be "beyond reasonable doubt". If there is sufficiently strong evidence against the null hypothesis of not guilty, then the conclusion is the alternative hypothesis that the defendant is guilty. Otherwise, the conclusion is that the defendant is not guilty, indicating the evidence against the null was not strong enough to refute it. Note that this is not the same as "accepting" the null hypothesis but rather indicating that there wasn't enough evidence to suggest otherwise.
Similarly in hypothesis testing, we will use a predetermined threshold to assess if the evidence against the null hypothesis is strong enough to reject the null hypothesis and conclude the alternative, or if there is not enough evidence “beyond a reasonable doubt” to draw a conclusion other than the assumed null hypothesis.
5.6 Permutation tests
Now that we have explained the general process of hypothesis testing, let’s take a look at hypothesis testing using a simulation-based approach, called a permutation test.
The four steps of a permutation test for a slope are:
- State the null and alternative hypotheses.
- Generate the null distribution.
- Calculate the p-value.
- Draw a conclusion.
These steps are described in detail in the context of the hypothesis test for the slope in Equation 5.1.
5.6.1 State the hypotheses
As defined in Section 5.5, the null hypothesis ($H_0$) is the baseline claim assumed to be true, and the alternative hypothesis ($H_a$) is the claim being tested. For the parks analysis, the hypotheses are:

- Null hypothesis: There is no linear relationship between playgrounds per 10,000 residents and per capita expenditure. The slope of `per_capita_expend` is equal to 0.
- Alternative hypothesis: There is a linear relationship between playgrounds per 10,000 residents and per capita expenditure. The slope of `per_capita_expend` is not equal to 0.
  - Note that we have not hypothesized whether the slope is positive or negative.
The hypotheses are defined specifically in terms of the linear relationship between the two variables, because we are ultimately drawing conclusions about the slope $\beta_1$.

Suppose there is a response variable $Y$ and a predictor variable $X$. The hypotheses for testing whether there is truly a linear relationship between $X$ and $Y$ are

$$H_0: \beta_1 = 0 \quad \text{versus} \quad H_a: \beta_1 \neq 0$$
5.6.1.1 One vs. two-sided hypotheses
The alternative hypothesis defined in Note 5.1 and ?eq-slr-hypotheses is "not equal to 0". This is the alternative hypothesis corresponding to a two-sided hypothesis test, because it includes the scenarios in which $\beta_1 > 0$ and $\beta_1 < 0$.

A one-sided hypothesis test imposes some information about the direction of the parameter, that is, positive ($H_a: \beta_1 > 0$) or negative ($H_a: \beta_1 < 0$).
A two-sided hypothesis test makes no assumption about the direction of the relationship between the response variable and predictor being tested. Therefore, it is a good starting point for drawing conclusions about the relationship between a given response and predictor variable. From the two-sided hypothesis, we will conclude whether there is or is not sufficient statistical evidence of a linear relationship between the response and predictor. With this conclusion, we cannot determine if the relationship between the variables is positive or negative without additional analysis. We can use a confidence interval (Section 5.4) to make specific conclusions about the direction (and magnitude) of the relationship.
5.6.2 Generate the null distribution
Recall that hypothesis tests are conducted assuming the null hypothesis $H_0$ is true. Therefore, the slope estimated from the sample data must be evaluated against the values we would expect if the null hypothesis were true.

To assess the evidence, we will use a simulation-based method to approximate the sampling distribution of the sample slope $\hat{\beta}_1$ under the null hypothesis. This distribution is called the null distribution, and we generate it using permutation sampling.
In permutation sampling the values of the predictor variable are randomly shuffled and paired with values of the response, thus generating a new sample of the same size as the original data. The process of randomly pairing the values of the response and the predictor variables simulates the null hypothesized condition that there is no linear relationship between the two variables.
The steps for generating the null distribution using permutation sampling are the following:
- Generate $N$ permutation samples, where $N$ is the number of iterations. We ideally use at least 1,000 iterations in order to construct a distribution that is close to the theoretical null distribution defined in Section 5.8.
- Fit the linear model to each of the $N$ permutation samples to obtain $N$ values of $\hat{\beta}_1$, the estimated slope. There will also be $N$ values of the estimated intercept; we will ignore those for now because we are focused on inference for the slope.
- Collect the $N$ values of $\hat{\beta}_1$ from the previous step to make the simulated null distribution. This is an approximation of the distribution of $\hat{\beta}_1$ values if we were to repeatedly take samples the same size as the original data and fit the linear model to each sample, under the assumption that the null hypothesis is true.
Let's look at an example and generate the null distribution to test the hypotheses in ?eq-slr-hypotheses.
- First we generate 1000 permutation samples, such that in each sample we permute the values of `per_capita_expend`, randomly pairing each with a value of `playgrounds`. This simulates the scenario in which there is no linear relationship between `per_capita_expend` and `playgrounds`. The first 10 rows of the first permutation sample are in Table 5.7.
- Next, we fit a linear regression model to each of the 1000 permutation samples. This gives us 1000 estimates of the slope and intercept. The estimated slopes of `per_capita_expend` from the first 10 permutation samples are in Table 5.8.
| replicate | term | estimate |
|---|---|---|
| 1 | per_capita_expend | -0.0002980 |
| 2 | per_capita_expend | 0.0005629 |
| 3 | per_capita_expend | 0.0012841 |
| 4 | per_capita_expend | -0.0015942 |
| 5 | per_capita_expend | -0.0021873 |
| 6 | per_capita_expend | 0.0036558 |
| 7 | per_capita_expend | -0.0024923 |
| 8 | per_capita_expend | -0.0017146 |
| 9 | per_capita_expend | -0.0012171 |
| 10 | per_capita_expend | 0.0014289 |
- Next, we collect the estimated slopes from the previous step to obtain the simulated null distribution. We will use this distribution to assess the strength of the evidence from the original sample data against the null hypothesis.
Note that the distribution visualized in Figure 5.5 and summarized in Table 5.9 is approximately unimodal, symmetric, and looks similar to the normal distribution. As the number of iterations (permutation samples) increases, the simulated null distribution will be closer and closer to a normal distribution.
You may also notice that the center of the distribution is approximately 0, the null hypothesized value. The standard error of this distribution, 0.002, is an estimate of the standard error of $\hat{\beta}_1$ under the null hypothesis.
5.6.3 Calculate p-value
We use the null distribution to understand the values we expect the estimated slope of `per_capita_expend`, $\hat{\beta}_1$, to take if we repeatedly take random samples and fit a linear regression model, assuming the null hypothesis $H_0: \beta_1 = 0$ is true. We then compare $\hat{\beta}_1 = 0.003$, the slope estimated from the observed sample data, to this distribution.

This comparison is quantified using a p-value. The p-value is the probability of observing estimated slopes at least as extreme as the value estimated from the sample data, given the null hypothesis is true. In the context of the parks data, the p-value is the probability of observing values of the slope that are at least as extreme as 0.003, given there is no linear relationship between per capita expenditure and playgrounds per 10,000 residents.
In the context of statistical inference, the phrase "more extreme" means the area of the null distribution at or beyond the estimated value $\hat{\beta}_1$, in the direction specified by the alternative hypothesis:

- If $H_a: \beta_1 > 0$, the p-value is the probability of obtaining a value in the null distribution that is greater than or equal to $\hat{\beta}_1$.
- If $H_a: \beta_1 < 0$, the p-value is the probability of obtaining a value in the null distribution that is less than or equal to $\hat{\beta}_1$.
- If $H_a: \beta_1 \neq 0$, the p-value is the probability of obtaining a value in the null distribution whose absolute value is greater than or equal to $|\hat{\beta}_1|$. This includes values that are greater than or equal to $|\hat{\beta}_1|$ or less than or equal to $-|\hat{\beta}_1|$.
Recall from Section 5.6.1 that we are testing a two-sided alternative hypothesis. Therefore, we will calculate the p-value corresponding to the alternative hypothesis $H_a: \beta_1 \neq 0$.
The p-value for this hypothesis test is 0.046 and is shown by the dark shaded area in Figure 5.6.
Use the definition of the p-value at the beginning of this section to interpret the p-value of 0.046 in the context of the data.8
5.6.4 Permutation test in R
The null distribution and p-value are computed using the infer package (Couch et al. 2021). Much of the code to generate the null distribution is similar to the code to conduct bootstrap sampling. Because permutation sampling is a random process, the code should include `set.seed()` to ensure the results are reproducible. Any integer value can go inside the `set.seed()` function.
- Set a seed to make the results reproducible.
- Define the number of permutation samples (iterations). Permutation sampling can be computationally intensive when using large data sets and a large number of iterations. We recommend using a small number of iterations (10 to 100) when testing code, then increasing the iterations once the code is finalized.
- Specify the data set and save the null distribution in the object `null_dist`.
- Specify the response and predictor variable.
- Specify the null hypothesis of "independence", corresponding to no linear relationship between the response and predictor variables.
- Specify the type of simulation ("permute") and the number of iterations.
- For each permutation sample, fit the linear regression model.
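Putting these steps together, a sketch of the infer pipeline consistent with the annotations above (the seed value and iteration count are placeholders):

```r
library(infer)

set.seed(101)                                     # 1. make results reproducible

null_dist <- parks |>                             # 3. save distribution as null_dist
  specify(playgrounds ~ per_capita_expend) |>     # 4. response ~ predictor
  hypothesize(null = "independence") |>           # 5. null: no linear relationship
  generate(reps = 1000, type = "permute") |>      # 2, 6. simulation type and iterations
  fit()                                           # 7. fit the model to each sample
```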
We can use `ggplot` to make a histogram of the null distribution.
```r
null_dist |>
  filter(term == "per_capita_expend") |>
  ggplot(aes(x = estimate)) +
  geom_histogram() +
  labs(x = "Estimated coefficient",
       title = "Simulated null distribution of slope")
```
Finally, we can compute the p-value using the `get_p_value` function, as in the sketch below.
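A sketch of that computation; `get_p_value()` compares an observed fit, computed with the same `specify() |> fit()` pipeline, to the null distribution:

```r
# Observed model fit from the original sample, in the same format as null_dist
obs_fit <- parks |>
  specify(playgrounds ~ per_capita_expend) |>
  fit()

null_dist |>
  get_p_value(obs_stat = obs_fit, direction = "two-sided")
```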
```
# A tibble: 1 × 2
  term              p_value
  <chr>               <dbl>
1 per_capita_expend   0.054
```
When using permutation tests, the software may output a p-value of 0. In that case, the true theoretical p-value is not exactly 0; it is just so small that no slope at least as extreme as the slope estimated from the sample data was observed in the permutation samples used to generate the null distribution.
We will calculate the exact p-value in Section 5.8.3 when we conduct hypothesis testing using theory-based inference methods.
5.6.5 Draw conclusion
Recall that the goal is to evaluate the strength of evidence against the null hypothesis. The p-value is a measure of the strength of that evidence and is used to draw one of the following conclusions:
- If the p-value is "sufficiently small", there is strong evidence against the null hypothesis. We reject the null hypothesis, $H_0$, and conclude the alternative hypothesis, $H_a$.
- If the p-value is not "sufficiently small", there is not strong enough evidence against the null hypothesis. We fail to reject the null hypothesis, $H_0$, and stay with the null hypothesis.
We use a predetermined decision-making threshold called an $\alpha$-level to determine what is "sufficiently small":

- If p-value $< \alpha$, then reject $H_0$.
- If p-value $\geq \alpha$, then fail to reject $H_0$.

A commonly used threshold is $\alpha = 0.05$.
Back to the parks analysis. We will use the common threshold of $\alpha = 0.05$. The p-value of 0.046 is less than 0.05, so we reject the null hypothesis. The data provide sufficiently strong evidence of a linear relationship between per capita expenditure and the number of playgrounds per 10,000 residents.
5.6.6 Type I and Type II error
Regardless of the conclusion that is drawn (reject or fail to reject the null hypothesis), we have not determined that the null or alternative hypothesis is the definitive truth. We have just concluded that the evidence (the data) provides more support for one conclusion versus the other. As with any statistical procedure, there is the possibility of making an error, more specifically a Type I or Type II error. Because we don't know the value of the population slope, we will not know for certain whether we have made an error; however, understanding the potential errors that can be made can help inform the choice of the decision-making threshold $\alpha$.
Table 5.10 shows how Type I and Type II errors correspond to the (unknown) truth and the conclusion drawn from the hypothesis test.
| Hypothesis test decision | Truth: $H_0$ true | Truth: $H_a$ true |
|---|---|---|
| Fail to reject $H_0$ | Correct decision | Type II error |
| Reject $H_0$ | Type I error | Correct decision |
A Type I error has occurred if the null hypothesis is actually true, but the p-value is small enough to reject the null hypothesis. The probability of making this type of error is the decision-making threshold $\alpha$.

A Type II error has occurred if the alternative hypothesis is actually true, but we fail to reject the null hypothesis because the p-value is large. Computing the probability of making this type of error is less straightforward. It is calculated as 1 minus the power of the test, where the power is the probability of correctly rejecting the null hypothesis when the alternative hypothesis is true.
In the context of the parks data, a Type I error is concluding that there is a linear relationship between per capita expenditure and playgrounds per 10,000 residents in the model, when there actually isn't one in the population. A Type II error is concluding there is no linear relationship between per capita expenditure and playgrounds per 10,000 residents when in fact there is.
Given the conclusion in Section 5.6.5, is it possible we’ve made a Type I or Type II error?9
5.7 Relationship between CI and hypothesis test
We have described the two main types of inferential procedures: hypothesis testing and confidence intervals. At this point you may be wondering whether there is any connection between the two. Spoiler alert: there is!
Testing a claim with the two-sided alternative $H_a: \beta_1 \neq 0$ at the $\alpha$ level is equivalent to checking whether the null hypothesized value falls within the corresponding $(1 - \alpha) \times 100\%$ confidence interval:

- If the null hypothesized value ($\beta_1 = 0$ based on the tests defined in Section 5.6.1) is within the range of the confidence interval, fail to reject $H_0$ at the $\alpha$ level.
- If the null hypothesized value is not within the range of the confidence interval, reject $H_0$ at the $\alpha$ level.
This illustrates the power of confidence intervals; they can not only be used to draw a conclusion about a claim (reject or fail to reject $H_0$) but also provide information about the magnitude and direction of the relationship.
When we reject a null hypothesis, we conclude that there is a statistically significant linear relationship between the response and predictor variables. Concluding there is a statistically significant relationship between the response and predictor, however, does not necessarily mean that the relationship is practically significant. The practical significance, how meaningful the results are in the real world, is determined by the magnitude of the estimated slope of the predictor on the response and what an effect of that magnitude means in the context of the data and analysis question.
5.8 Theory-based inference
Thus far we have approached inference using simulation-based methods (bootstrapping and permutation) to generate sampling distributions and null distributions. When certain conditions are met, however, we can use theoretical results about the sampling distribution to understand the variability in $\hat{\beta}_1$ and conduct inference without simulation.
5.8.1 Central Limit Theorem
The Central Limit Theorem (CLT) is a foundational theorem in statistics about the distribution of a statistic and the associated mathematical properties of that distribution. For the purposes of this text, we will focus on what the Central Limit Theorem says about the distribution of an estimated slope $\hat{\beta}_1$.

By the Central Limit Theorem, we know that under certain conditions (more on these conditions in Chapter 6),

$$\hat{\beta}_1 \sim N\left(\beta_1, SE_{\hat{\beta}_1}\right) \tag{5.5}$$

Equation 5.5 means that by the Central Limit Theorem, we know that the sampling distribution of $\hat{\beta}_1$ is normal with mean $\beta_1$ and standard error $SE_{\hat{\beta}_1}$,

where

$$SE_{\hat{\beta}_1} = \hat{\sigma}_\epsilon \sqrt{\frac{1}{\sum_{i=1}^n (x_i - \bar{x})^2}}$$

The regression standard error $\hat{\sigma}_\epsilon$ is the estimate of $\sigma_\epsilon$, the standard deviation of the errors; Section 5.8.2 shows how it is computed.
As you will see in the following sections, we will use this estimate of the sampling variability in the estimated slope to draw conclusions about the true relationship between the response and predictor variables based on hypothesis testing and confidence intervals.
5.8.2 Estimating $\sigma_\epsilon$

As discussed in Section 4.3.1, there are three parameters that need to be estimated for simple linear regression: $\beta_0$, $\beta_1$, and $\sigma_\epsilon$.

By obtaining the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$, we can compute the residuals $y_i - \hat{y}_i$ and use them to estimate $\sigma_\epsilon$. The estimate, called the regression standard error, is

$$\hat{\sigma}_\epsilon = \sqrt{\frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n - 2}} \tag{5.7}$$

You may have noticed that the denominator in Equation 5.7 is $n - 2$ rather than the $n - 1$ used for the sample standard deviation.11 The denominator is $n - 2$ because we use up two degrees of freedom estimating the two parameters $\beta_0$ and $\beta_1$.

Recall that the standard deviation is the average distance between each observation and the mean of the distribution. Therefore, the regression standard error can be thought of as the average distance between the observed values of the response and the regression line. The regression standard error is thus a measure of how well the regression line fits the data.
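As a check on the formula, $\hat{\sigma}_\epsilon$ can be computed directly from the residuals of the fitted model; it matches the sigma value reported by broom's `glance()`:

```r
# Regression standard error "by hand" from the residuals
n <- nrow(parks)
sigma_hat <- sqrt(sum(residuals(parks_model)^2) / (n - 2))
sigma_hat

# The same value is reported in the `sigma` column of glance()
glance(parks_model)$sigma
```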
5.8.3 Hypothesis test for the slope
The overall goals of hypothesis tests for a population slope are the same when using theory-based methods as previously described in Section 5.5. We define a null and alternative hypothesis, conduct testing assuming the null hypothesis is true, and draw a conclusion based on an evaluation of the strength of evidence against the null hypothesis. The main difference from the simulation-based approach in Section 5.6 is in how we quantify the variability in $\hat{\beta}_1$.
The steps for conducting a hypothesis test based on the Central Limit Theorem are the following:
- State the null and alternative hypotheses.
- Calculate a test statistic.
- Calculate a p-value.
- Draw a conclusion.
As in Section 5.6, the goal is to use hypothesis testing to determine whether there is evidence of a statistically significant linear relationship between a response and predictor variable, corresponding to the two-sided alternative hypothesis of "not equal to 0". Therefore, the null and alternative hypotheses are the same as defined in ?eq-slr-hypotheses.
The next step is to calculate a test statistic. Similar to a $Z$-score, the test statistic measures how many standard errors the point estimate falls from the hypothesized value.

More specifically, in the hypothesis test for $\beta_1$, the test statistic is

$$T = \frac{\hat{\beta}_1 - 0}{SE_{\hat{\beta}_1}}$$
To calculate the test statistic, the estimated slope is shifted by the mean and then rescaled by the standard error. Let's consider what we learn from the test statistic. Recall that by the Central Limit Theorem, the distribution of $\hat{\beta}_1$ is normal with mean $\beta_1$ and standard error $SE_{\hat{\beta}_1}$. Under the null hypothesis, $\beta_1 = 0$, so the mean of this distribution is 0.

The test statistic is calculated by shifting the observed slope by the hypothesized mean 0 and rescaling it by $SE_{\hat{\beta}_1}$. The test statistic is therefore the number of standard errors the observed slope falls from the hypothesized mean of 0.

Consider the magnitude of the test statistic, $|T|$. Values close to 0 indicate the observed slope is close to the null hypothesized value, while values far from 0 indicate the observed slope would be unusual if the null hypothesis were true.12
Next, we use the test statistic to calculate a p-value, and we will ultimately use the p-value to draw a conclusion about the strength of the evidence against the null hypothesis, as before. The test statistic, $T$, follows a $t$-distribution with $n - 2$ degrees of freedom.

Though the sampling distribution of $\hat{\beta}_1$ is normal, the test statistic follows a $t$-distribution rather than the standard normal distribution because we use the sample data to estimate $\sigma_\epsilon$, and thus $SE_{\hat{\beta}_1}$, rather than knowing their true values.

Figure 5.7 shows the standard normal distribution compared to $t$-distributions with different degrees of freedom. The $t$-distribution has thicker tails than the normal distribution; as the degrees of freedom increase, the $t$-distribution approaches the standard normal distribution.
As explained in Section 5.6.3, because the alternative hypothesis is "not equal to", the p-value is calculated on both the high and low extremes of the distribution as shown in Equation 5.8.

$$\text{p-value} = P\left(|t_{n-2}| \geq |T|\right) = 2\,P\left(t_{n-2} \geq |T|\right) \tag{5.8}$$
We compare the p-value to a decision-making threshold $\alpha$ to draw a final conclusion, as described in Section 5.6.5.
Now let's apply this process to test whether there is evidence of a linear relationship between per capita expenditure and the number of playgrounds per 10,000 residents. As before, the null and alternative hypotheses are

$$H_0: \beta_1 = 0 \quad \text{versus} \quad H_a: \beta_1 \neq 0$$

where $\beta_1$ is the population slope for `per_capita_expend` in the model of the relationship between `per_capita_expend` and `playgrounds`. From Table 5.2, we know the observed slope is $\hat{\beta}_1 = 0.0033$ and its standard error is $SE_{\hat{\beta}_1} = 0.0016$, so the test statistic is

$$T = \frac{0.0033 - 0}{0.0016} = 2.06$$

This test statistic means that, given the true slope of `per_capita_expend` in this model is 0 (and thus the mean of the distribution of $\hat{\beta}_1$ is 0), the observed slope of 0.0033 is about 2.06 standard errors above this hypothesized mean.
Given there are $n = 97$ observations in the sample data, the test statistic follows a $t$-distribution with $97 - 2 = 95$ degrees of freedom. The p-value is the probability of obtaining a test statistic with magnitude at least 2.06 from this distribution, which is approximately 0.042.
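A sketch of this calculation in R, using the rounded values from Table 5.2 (so the result differs slightly from the p.value column in the lm output):

```r
# Test statistic: (estimate - 0) / standard error
t_stat <- (0.0033 - 0) / 0.0016
t_stat

# Two-sided p-value from a t-distribution with 97 - 2 = 95 degrees of freedom
2 * pt(abs(t_stat), df = 95, lower.tail = FALSE)
```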
Using a decision-making threshold of $\alpha = 0.05$, we reject the null hypothesis because the p-value of 0.042 is less than 0.05. The data provide statistically significant evidence of a linear relationship between per capita expenditure and the number of playgrounds per 10,000 residents.
Note that this conclusion is the same as in Section 5.6.5 using a simulation-based approach. This is what we would expect, given these are two different approaches for conducting the same process - hypothesis testing for the slope. We are also conducting the tests under the same assumption that the null hypothesis is true. The difference is in the methods available - simulation versus theory-based - to quantify the sampling variability of $\hat{\beta}_1$.
5.8.4 Confidence interval
As with simulation-based inference, a confidence interval calculated based on the results from the Central Limit Theorem is an estimated range of values that may reasonably contain $\beta_1$, the true population slope.
The equation for a $C\%$ confidence interval for $\beta_1$ based on the Central Limit Theorem is

$$\hat{\beta}_1 \pm t^* \times SE_{\hat{\beta}_1}$$

where $t^*$ is the critical value from a $t$-distribution with $n - 2$ degrees of freedom.
From earlier sections, we know the estimated slope $\hat{\beta}_1 = 0.0033$ and its standard error $SE_{\hat{\beta}_1} = 0.0016$.

The critical value is the point on the $t$-distribution with $n - 2 = 95$ degrees of freedom such that the area between $-t^*$ and $t^*$ is the confidence level $C\%$. For a 95% confidence interval, $t^* \approx 1.99$.

Let's calculate the 95% confidence interval for $\beta_1$:

$$0.0033 \pm 1.99 \times 0.0016 = (0.0001, 0.0065)$$
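A sketch of this calculation in R, again using the rounded values from Table 5.2:

```r
# Critical value for a 95% interval with 95 degrees of freedom
t_star <- qt(0.975, df = 95)
t_star   # approximately 1.99

# Lower and upper bounds of the 95% confidence interval
0.0033 - t_star * 0.0016
0.0033 + t_star * 0.0016
```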
The interpretation is the same as before: we are 95% confident that the interval 0.0001 to 0.0065 contains the true slope for `per_capita_expend`. This means we are 95% confident that for each additional dollar increase in per capita expenditure, the number of playgrounds per 10,000 residents is greater by 0.0001 to 0.0065, on average.
The output from the `lm` function contains the statistics discussed in this section to conduct theory-based inference. The p-value in the output corresponds to the two-sided alternative hypothesis $H_a: \beta_1 \neq 0$.
The confidence interval does not display by default, but can be added using the `conf.int` argument in the `tidy` function. The default confidence level is 95%; it can be adjusted using the `conf.level` argument in `tidy()`.
```r
parks_fit <- lm(playgrounds ~ per_capita_expend, data = parks)

tidy(parks_fit, conf.int = TRUE, conf.level = 0.95)
```
```
# A tibble: 2 × 7
  term              estimate std.error statistic  p.value conf.low conf.high
  <chr>                <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)        2.42      0.214       11.3  3.11e-19 1.99       2.84
2 per_capita_expend  0.00330   0.00160      2.06 4.24e- 2 0.000115   0.00648
```
5.9 Summary
In this chapter we introduced two approaches for conducting statistical inference and drawing conclusions about a population slope - simulation-based methods and methods based on the Central Limit Theorem. You may have noticed that the standard error, test statistic, p-value, and confidence interval we calculated using the mathematical models from the Central Limit Theorem align with what is seen in the output produced by statistical software in Table 5.2. Modern statistical software will produce these values for you, so in practice you will not typically derive these values "manually" as we did in this chapter. As the data scientist, your role will be to interpret the output and use it to draw conclusions. It's still valuable, however, to have an understanding of where these values come from in order to interpret and use them accurately.
We have these two approaches for inference, so which one do we use for a given analysis? In the next chapter, we will discuss the model assumptions from Section 5.3 in more detail along with conditions to check if the assumptions hold for a given data set. We will use these conditions in conjunction with other statistical and practical considerations to determine when we might prefer simulation-based methods or those based on the Central Limit Theorem.
1. Slope: For each additional dollar a city spends per resident, the number of playgrounds per 10,000 residents is expected to increase by 0.003. Intercept: The expected number of playgrounds per 10,000 residents for a city that spends $0 on its residents is 2.418. We would not expect a city to spend $0 on its residents, so this interpretation is not meaningful in practice.↩︎
2. We sample with replacement so that we get a new sample each time we bootstrap. If we sampled without replacement, we would always end up with a sample that looks exactly like our existing sample data.↩︎
3. Each bootstrap sample is the same size as our current sample data. In this case, the sample data we're analyzing has 97 observations.↩︎
4. There are 1000 values, the number of iterations, in the bootstrapped sampling distribution.↩︎
5. The points at the 2.5th and 97.5th percentiles make the bounds for the 95% confidence interval.↩︎
6. The points at the 1st and 99th percentiles mark the lower and upper bounds for a 98% confidence interval.↩︎
7. The variability is approximately equal in both distributions.↩︎
8. Given there is no linear relationship between spending per resident and playgrounds per 10,000 residents ($H_0$ is true), the probability of observing a slope of 0.003 or more extreme in a random sample of 97 cities is 0.046.↩︎
9. It is possible we have made a Type I error, since we rejected the null hypothesis.↩︎
10. Standard error is the term used for the standard deviation of a sampling distribution.↩︎
11. Note its similarities to the general equation for sample standard deviation, $s = \sqrt{\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n - 1}}$.↩︎
12. Test statistics with small magnitude provide evidence in support of the null hypothesis, as they are close to the hypothesized value. Conversely, test statistics with large magnitude provide evidence against the null hypothesis, as they are very far away from the hypothesized value.↩︎