8  Inference for multiple linear regression

8.1 Notes

8.2 Learning outcomes

8.3 R Packages

8.4 Introduction: Palmer penguins data

  • Introduce the data set

  • Include exploratory data analysis

  • Include regression model

penguins_m1 <- linear_reg() |>
  fit(body_mass_g ~ flipper_length_mm + bill_length_mm + sex + species, 
      data = penguins)

penguins_m1_tidy <- tidy(penguins_m1)
penguins_m1_tidy |>
  kable(digits = 3)
flipper_length_slope <- penguins_m1_tidy |>
  filter(term == "flipper_length_mm") |>
  pull(estimate)

flipper_length_se <- penguins_m1_tidy |>
  filter(term == "flipper_length_mm") |>
  pull(std.error)
Table 8.1: Multiple linear regression model for body mass of Palmer penguins

term               estimate  std.error  statistic  p.value
(Intercept)        -759.064    541.377     -1.402    0.162
flipper_length_mm    17.847      2.902      6.150    0.000
bill_length_mm       21.633      7.148      3.027    0.003
sexmale             465.395     43.081     10.803    0.000
speciesChinstrap   -291.711     81.502     -3.579    0.000
speciesGentoo       707.028     94.359      7.493    0.000

8.5 Objectives of statistical inference

In Table 8.1 we see the output of the regression model in which flipper length, bill length, sex, and species are used to explain variability in the body mass of Palmer penguins. As with simple linear regression, we can use the coefficient estimates to describe how the body mass is expected to change as each predictor variable changes. For example, based on this model, for each additional millimeter in flipper length, we expect a Palmer penguin’s body mass to increase by 17.847 grams, on average, holding bill length, sex, and species constant.

The estimate 17.847 is our “best guess” of the relationship between flipper length and body mass after adjusting for bill length, sex, and species; however, it is not the exact value of the relationship in the population of all Palmer penguins. Therefore, as we did in simple linear regression, we will conduct statistical inference, in which we draw conclusions about population parameters based on the analysis of sample data. There are two types of statistical inference procedures:

  • Hypothesis tests: A test of a specific claim about the population parameter

  • Confidence intervals: A plausible range of values the population parameter can take

In this chapter we will discuss how to conduct each of these inferential procedures, what conclusions can be drawn from each, and how they are related to one another.

As we’ll see throughout the chapter, a key part of statistical inference is understanding the sample-to-sample variability in the statistic estimating that parameter. For example, if we are doing statistical inference on the coefficient of flipper length \(\beta_{flipper\_length}\), we will need to understand the variability in the statistic \(\hat{\beta}_{flipper\_length}\), i.e., how much variability we would expect in \(\hat{\beta}_{flipper\_length}\) if we repeatedly took samples of size 333 (the size of our data) and fit a model using bill length, flipper length, sex, and species to predict body mass. Thus, there are two approaches to statistical inference that are distinguished by how they get at this sample-to-sample variability:

  • Simulation-based methods: Use the sample data to generate the distribution of the sample statistic

  • Central Limit Theorem-based methods: Use mathematical models based on results from the Central Limit Theorem to quantify the variability in the sample statistic

We will describe how to conduct hypothesis testing and construct confidence intervals using each approach and how we might choose when to use one approach over the other.

8.6 Simulation-based inference

8.6.1 Introduction

  • Using simulation to get the variability in the estimated slope \(\hat{\beta}_j\)

  • Sketch of what happens when we do simulation-based inference

8.6.2 Bootstrap confidence intervals

The goal of a confidence interval is to calculate a range of values in which we think the population parameter plausibly lies, given what we’ve observed in the data. By calculating a range rather than relying on a single point estimate, we are more likely to capture the true value of the population parameter.

As previously mentioned, in order to obtain this range of values we must first understand the variability in the statistic if we were to repeatedly take samples of size \(n\), the size of our sample data, and fit regression models. In practice, it is not feasible to collect enough new samples to approximate the sampling distribution, so we will instead use our sample data to generate new samples. We generate these samples using bootstrapping, in which we sample with replacement from the observed data such that each bootstrap sample is also of size \(n\).
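To make this concrete, below is a minimal sketch of how a single bootstrap sample could be drawn with dplyr; the name boot_sample is ours for illustration.

# one bootstrap sample: 333 rows drawn with replacement from the original data
set.seed(123)
boot_sample <- penguins |>
  slice_sample(n = nrow(penguins), replace = TRUE)

nrow(boot_sample)  # same size as the original sample

Because we sample with replacement, some penguins appear multiple times in boot_sample and others not at all, which is what makes each bootstrap sample differ from the original.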

Your turn!

Why do we sample with replacement when doing bootstrapping? What would happen if we sampled without replacement?1

8.6.2.1 Constructing the bootstrap confidence interval

Using the bootstrap resampling method described above, we will use the following steps to construct a bootstrap confidence interval:

  1. Generate \(n_{reps}\) bootstrap samples, where \(n_{reps}\) is the number of iterations. We typically want to use at least 1000 iterations in order to construct a sampling distribution that is close to the theoretical distribution (discussed in a later section).
  2. Fit the linear model to each of the \(n_{reps}\) bootstrap samples. This will give you \(n_{reps}\) estimates of each model coefficient.
  3. Pull all the estimates from the previous step together to obtain a distribution. The distribution of the \(n_{reps}\) estimates for a given coefficient \(\hat{\beta}_j\) is its bootstrapped sampling distribution. This is an approximation of the distribution of values \(\hat{\beta}_j\) takes if we repeatedly take samples the same size as the original data and fit the same linear model to each sample. We use this distribution to understand the sample-to-sample variability we expect in \(\hat{\beta}_j\).
  4. Use the distribution from the previous step to calculate the \(C\%\) confidence interval. The lower and upper bounds are calculated as the points on the distribution that mark the middle \(C\%\) of the distribution.

Let’s demonstrate these four steps by calculating a 95% confidence interval for \(\beta_{flipper\_length}\).

  1. We construct 1000 bootstrap samples by sampling with replacement from our sample of 333 observations. We see the first 10 rows of this data below. These are the first 10 observations from the first bootstrap sample.
replicate  body_mass_g  flipper_length_mm  bill_length_mm  sex     species
1                 3475                184            36.6  female  Adelie
1                 3550                186            39.0  female  Adelie
1                 4400                214            45.7  female  Gentoo
1                 5200                217            46.5  female  Gentoo
1                 5800                230            48.6  male    Gentoo
1                 4550                210            46.5  female  Gentoo
1                 3250                187            51.5  male    Chinstrap
1                 2900                178            33.1  female  Adelie
1                 3200                189            34.6  female  Adelie
1                 5000                221            46.4  male    Gentoo
Your turn!

How many observations are in each bootstrap sample?2

  2. Next, we fit a linear model of the form in [XX-ref model above] to each of the 1000 bootstrap samples. We see the estimated coefficients for the first two bootstrap samples below.

    set.seed(12345)
    
    # number of bootstrap iterations
    nreps <- 1000
    
    # generate nreps bootstrap samples and fit the model to each one
    boot_dist <- penguins |>
      specify(body_mass_g ~ flipper_length_mm + bill_length_mm + sex + species) |>
      generate(reps = nreps, type = "bootstrap") |>
      fit()
    
    boot_dist |>
      filter(replicate %in% c(1, 2)) |>
      kable()
    replicate  term                  estimate
    1          intercept          -1216.23193
    1          flipper_length_mm     19.13292
    1          bill_length_mm        27.50777
    1          sexmale              494.78695
    1          speciesChinstrap    -372.20470
    1          speciesGentoo        558.59970
    2          intercept          -1487.52982
    2          flipper_length_mm     22.26410
    2          bill_length_mm        17.48488
    2          sexmale              498.08752
    2          speciesChinstrap    -288.14897
    2          speciesGentoo        645.83130
  3. We are focused on inference for the coefficient of flipper length, so we pull just the estimated coefficients of flipper length and construct the bootstrap distribution. This is the bootstrapped sampling distribution of \(\hat{\beta}_{flipper\_length}\). A histogram and summary statistics for this distribution are shown below.
    flipper_length_dist <- boot_dist |>
      filter(term == "flipper_length_mm") |>
      ungroup()
    
    flipper_length_dist |>
      ggplot(aes(x = estimate)) + 
      geom_histogram(binwidth = 2, color = "white", fill = "steelblue") +
      labs(x = "Estimated coefficient", 
           title = "Bootstrapped sampling distribution of flipper length", 
           subtitle = paste0("from ", nreps, " bootstrap samples")) +
      theme_bw()

    flipper_length_dist |>
      summarise(Min = min(estimate), 
                Q1 = quantile(estimate, 0.25), 
                Median = median(estimate), 
                Q3 = quantile(estimate, 0.75), 
                Max = max(estimate), 
                Mean = mean(estimate), 
                Std.Dev. = sd(estimate)) |>
      kable(digits = 3)
    Min    Q1     Median  Q3     Max     Mean    Std.Dev.
    9.427  16.09  17.839  16.09  26.387  17.877  2.726
Your turn!

How many values are in the bootstrapped sampling distribution?3

As the final step, we use the bootstrapped sampling distribution constructed in the previous step to calculate the lower and upper bounds of the 95% confidence interval. These bounds are calculated as the points that mark off the middle 95% of the distribution, i.e., the points at the \(2.5^{th}\) and \(97.5^{th}\) percentiles, as shown in the graph below.

lb <- quantile(flipper_length_dist$estimate, 0.025)
ub <- quantile(flipper_length_dist$estimate, 0.975)

boot_dist |>
  filter(term == "flipper_length_mm") |>
  ggplot(aes(x = estimate)) + 
  geom_histogram(binwidth = 2, color = "white", fill = "steelblue") +
  geom_vline(xintercept = lb, linetype = 2, color = "red") + 
  geom_vline(xintercept = ub, linetype = 2, color = "red") +
  labs(x = "Estimated coefficient", 
       title = "Bootstrapped sampling distribution of flipper length", 
       subtitle = "with 95% confidence interval") +
  theme_bw()

# add annotation to include the actual values on the graph

The 95% bootstrapped confidence interval for \(\beta_{flipper\_length}\) is 12.773 to 23.683.
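As an aside, the infer package provides a helper that computes percentile intervals for every coefficient at once. Below is a sketch, assuming infer’s fitted-model interface; obs_fit is our name for the observed model fit.

# fit the model to the observed data (no resampling)
obs_fit <- penguins |>
  specify(body_mass_g ~ flipper_length_mm + bill_length_mm + sex + species) |>
  fit()

# percentile intervals for each term, from the bootstrap distribution
boot_dist |>
  get_confidence_interval(point_estimate = obs_fit, level = 0.95, 
                          type = "percentile")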

Your turn!

What two points in the distribution mark the lower and upper bounds for a

  • 90% confidence interval?4

  • 98% confidence interval?5

8.6.2.2 Interpreting the interval

The interpretation of the bootstrapped confidence interval for a coefficient in multiple linear regression is very similar to the interpretation in simple linear regression. Using the confidence interval we constructed for \(\beta_{flipper\_length}\), we have the following basic interpretation:

We are 95% confident that the interval 12.773 to 23.683 contains the population coefficient for flipper length in the model that also includes bill length, sex, and species.

Though this interpretation provides some information about the range of values the coefficient of flipper length takes, it still requires the reader to recall what that means about the relationship between flipper length and body mass in this model. Therefore, we can interpret the confidence interval in a way that also utilizes the basic interpretation of the slope, so it is clear to the reader exactly what the confidence interval means. Thus, a more complete and informative interpretation of the confidence interval is as follows:

We are 95% confident that for each additional millimeter in a Palmer penguin’s flipper length, its body mass is greater by 12.773 to 23.683 grams, on average, holding bill length, sex, and species constant.

This interpretation not only indicates the values we estimate the population coefficient takes but also clearly describes what this means in terms of the variability of body mass as flipper length changes.

8.6.2.3 What does confidence mean?

The beginning of the interpretation for a confidence interval is “We are C% confident…”. Here we will briefly discuss what is meant by “C% confident”. The “confidence” is in the statistical method we’re using, in this case constructing the bootstrapped confidence interval. This means that if we were to replicate our process (obtain a sample of 333 Palmer penguins, construct a bootstrapped sampling distribution, calculate the bounds of the interval) many, many times, the intervals we calculate would include the population coefficient C% of the time.

In reality we don’t know the value of the population parameter (if we did, we wouldn’t need statistical inference!), so we’re not sure whether the interval we constructed is one of the C% that actually contains the population parameter. Therefore, we aren’t certain that our interval contains the population parameter, but we can say with some level of confidence, C% in fact, that we think it does.
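One way to build intuition for this idea is a small simulation from a population where we know the truth. The sketch below assumes a hypothetical population with true slope \(\beta_1 = 5\); for speed it uses Central Limit Theorem-based intervals from confint() rather than bootstrapping, but the coverage logic is the same.

set.seed(123)
beta1 <- 5  # true slope in our hypothetical population

covers <- replicate(500, {
  x <- rnorm(100)
  y <- 2 + beta1 * x + rnorm(100, sd = 3)  # draw a new sample of size 100
  ci <- confint(lm(y ~ x))["x", ]          # 95% confidence interval for the slope
  ci[1] <= beta1 & beta1 <= ci[2]          # did this interval capture beta1?
})

mean(covers)  # roughly 0.95: about 95% of the 500 intervals contain the true slope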

8.6.3 Randomization tests for the slope

[Transition from confidence intervals to hypothesis test]

When we do hypothesis testing, we test a claim about a population parameter. The claim could be based on previous research, a claim that a research or business team wants to assess, or a general statement about the parameter. For now we will focus on conducting hypothesis tests for a slope \(\beta_j\). As with confidence intervals, we could use the process detailed in this section to test a claim about the intercept; however, this is often not meaningful in practice.

Before digging into the details of the simulation-based hypothesis test, let’s talk about what happens conceptually when we do a hypothesis test. We’ll use the general format of a trial in the United States court system to help us understand.

Define the hypotheses
The first step of any trial (or hypothesis test) is to define the hypotheses to be evaluated. There are two hypotheses, the null and the alternative. The alternative hypothesis is defined by the claim being tested, and the null hypothesis is the baseline, or assumed, condition. In the U.S. court system, a defendant is deemed innocent unless proven otherwise. Therefore, the claim the prosecutor is trying to prove, that the defendant is guilty, is the alternative hypothesis. The null hypothesis in this scenario, then, is that the defendant is innocent, or not guilty. In the U.S. court system, we say that a person is “innocent until proven guilty beyond a reasonable doubt.” Therefore, as the trial (or hypothesis test) proceeds, we will assume the null hypothesis is true and evaluate how strong the evidence is against the null.

Evaluate the evidence

The largest part of a trial is the evaluation of the evidence. This is when each side presents its case, and it is up to the judge and jury to evaluate the evidence under the assumption that the null hypothesis, the defendant is innocent, is true. Thus the lens through which the evidence is evaluated is “given the defendant is innocent, what are the chances this evidence would exist?” For example, if there is evidence that the defendant was in a different city during the time of the jewelry store robbery, that would be evidence in favor of the null hypothesis. Conversely, if some of the stolen jewelry store items were found in the defendant’s car, that would be evidence in favor of the alternative hypothesis.

In hypothesis testing, the evidence being assessed is the analysis of the sample data. Thus the evaluation question being asked is “given the null hypothesis is true, what are the chances of observing the analysis results that we see in the data?” We will make use of simulation-based methods along with mathematical models based on the Central Limit Theorem in order to answer this question.

Make a conclusion

In general there are two potential conclusions to a trial in the U.S. court system: the judge or jury decides that the defendant is guilty or not guilty. The criterion in court is that the evidence must be “beyond a reasonable doubt,” so that is the bar the judge and jury use to reach their conclusion. If there is sufficiently strong evidence against the null hypothesis of innocence, then they conclude the alternative hypothesis that the defendant is guilty. Otherwise, they conclude that the defendant is not guilty, indicating the evidence against the null was not strong enough to refute it. Note that this is not the same as “accepting” the null hypothesis; rather, it states that there wasn’t enough evidence to suggest otherwise. This is why a defendant is deemed “not guilty” rather than “innocent” when the evidence falls short of proof beyond a reasonable doubt.

Similarly in hypothesis testing, we will use statistical criteria to determine if the evidence against the null hypothesis is strong enough to reject this hypothesis in favor of the alternative, or if there is not enough evidence “beyond a reasonable doubt” to conclude anything other than the assumed null hypothesized claim.

Now that we’ve explored the overarching idea behind hypothesis testing, let’s take a look at hypothesis testing using a simulation-based approach.

The four steps of a randomization test for a slope coefficient \(\beta_j\) are

  1. State the hypotheses.
  2. Generate the null distribution.
  3. Calculate the p-value.
  4. Draw a conclusion.

We’ll discuss each of these steps in detail.

8.6.3.1 State the hypotheses

In a previous section we defined the null and alternative hypotheses as follows: the null hypothesis (\(H_0\)) is the assumed condition, and the alternative hypothesis (\(H_a\)) is determined by the claim being tested. Now let’s consider what this means specifically in terms of a hypothesis test for the coefficient \(\beta_{flipper\_length}\) in Model [REF]:

  • Null hypothesis: The coefficient of flipper length is 0, after adjusting for bill length, sex, and species. This means that there is no linear relationship between body mass and flipper length, even after adjusting for bill length, sex, and species.
  • Alternative hypothesis: The coefficient of flipper length is not equal to 0, after adjusting for bill length, sex, and species. This means there is a linear relationship between body mass and flipper length, even after adjusting for bill length, sex, and species.

In mathematical notation, these hypotheses are

\[ \begin{aligned} &H_0: \beta_{flipper\_length} = 0, \text{ after adjusting for bill length, sex, and species} \hspace{5mm} vs. \\ &H_a: \beta_{flipper\_length} \neq 0, \text{ after adjusting for bill length, sex, and species} \end{aligned} \]

General statement of hypotheses

Suppose you have the following multiple linear regression model

\[ y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_px_{ip} + \epsilon_i, \hspace{5mm} \epsilon_i \sim N(0, \sigma^2_\epsilon) \]

Then, the hypotheses to test whether \(\beta_j\) is a useful predictor of \(y\) after adjusting for the other variables in the model are

\[\begin{aligned}&H_0: \beta_j = 0, \text{ after adjusting for all other variables in the model}\\ &H_a: \beta_j \neq 0, \text{ after adjusting for all other variables in the model}\end{aligned}\]

8.6.4 One vs. two-sided hypotheses

In [REF CALLOUT BOX], we defined the alternative hypothesis as \(\beta_j\) “not equal to 0”. There are two other options for defining the alternative hypothesis, \(\beta_j\) “less than 0” (\(H_a: \beta_j < 0\)) and \(\beta_j\) “greater than 0” (\(H_a: \beta_j > 0\)), with both cases being under the same conditions as earlier “after adjusting for all other variables in the model.” These are the alternative hypotheses for conducting a one-sided hypothesis test and the hypothesis in [REF CALLOUT BOX] is the alternative for conducting a two-sided hypothesis test.

A one-sided hypothesis test imposes some information about the direction of the parameter, e.g., that it is positive (\(> 0\)) or negative (\(< 0\)). Given the information imposed by the direction of the alternative hypothesis, less evidence is required to reject the null hypothesis in favor of the alternative. Therefore, a one-sided hypothesis should only be used if (1) there is indication from previous knowledge or research that the relationship between the response variable and the predictor being tested is in a particular direction, or (2) only one direction of the relationship between the response and the predictor being tested is relevant for the research. Outside of these two scenarios, it is not advisable to use a one-sided hypothesis, as there could appear to be a statistically significant relationship between the two variables merely by chance of how the hypotheses were constructed.

A two-sided hypothesis test makes no assumption about the direction of the relationship between the response variable and the predictor being tested. Therefore, it is a good starting point for drawing conclusions about the relationships between the response and predictor variables. The conclusion from a two-sided hypothesis test will be that there is or is not sufficient evidence against the null hypothesis. Even if we reject the null and conclude there is a statistically significant relationship, however, we cannot conclude that the relationship is positive or negative based on the two-sided hypothesis test alone. We will see in Section [REF] that we can draw conclusions about the direction of the relationship from the confidence interval.

8.6.4.1 Generate the null distribution

Recall that, similar to the United States judicial system’s notion of “innocent until proven guilty beyond a reasonable doubt,” we assume the null hypothesis is true while conducting hypothesis testing and assess the strength of the evidence against the null hypothesis. Therefore, based on the hypotheses defined in Section [REF], when we do a hypothesis test for the slope \(\beta_j\), we do so assuming that \(\beta_j = 0\), in other words that there is no linear relationship between \(x_j\) and the response variable \(y\) after adjusting for the other variables in the model.

Similar to how we used bootstrapping to generate a distribution to understand the sampling variability in \(\hat{\beta}_j\), we will use a simulation-based method called permutation to generate the distribution of \(\hat{\beta}_j\) under the assumption that the null hypothesis is true. This distribution is called the null distribution.

When we conduct permutation sampling, at each iteration we will randomly assign values of \(x_j\) to each observation in the data. Note that we will preserve the original data for the remaining predictors and the response. Because the values are randomly assigned, this simulates the null hypothesized condition that there is no linear relationship between the response and the predictor \(x_j\).
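For intuition, a single permutation sample could be constructed by hand as in the sketch below (perm_sample is our illustrative name); the infer code later in this section does this for us at every iteration.

# one permutation sample: shuffle flipper_length_mm,
# keeping the response and the other predictors fixed
set.seed(123)
perm_sample <- penguins |>
  mutate(flipper_length_mm = sample(flipper_length_mm))

Shuffling breaks any association between flipper_length_mm and body_mass_g while preserving the marginal distribution of the flipper lengths.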

Using the permutation sampling method described above, we will use the following steps to construct a null distribution:

  1. Generate \(n_{reps}\) permutation samples, where \(n_{reps}\) is the number of iterations. We typically want to use at least 1000 iterations in order to construct a distribution that is close to the theoretical null distribution (discussed in a later section).
  2. Fit the linear model to each of the \(n_{reps}\) permutation samples. This will give you \(n_{reps}\) estimates of each model coefficient.
  3. Pull all the estimates from the previous step together to obtain a distribution. The distribution of the \(n_{reps}\) estimates for a given coefficient \(\hat{\beta}_j\) is its null distribution. This is an approximation of the distribution of values \(\hat{\beta}_j\) takes if we repeatedly take samples the same size as the original data and fit the linear model to each sample, assuming the null hypothesis is true. We use this distribution to understand the sample-to-sample variability we expect in \(\hat{\beta}_j\) under the null hypothesis.

Let’s look at an example and construct the null distribution to test the hypotheses in [REF].

  1. First we generate 1000 permutation samples, such that in each sample, we permute (or shuffle) the values of flipper_length_mm, to simulate the scenario in which there is no association between flipper_length_mm and body_mass_g.
  2. Next, we fit a linear model to each of the 1000 permutation samples. This gives us 1000 estimates of each model coefficient. Below we can see the estimates of flipper_length_mm from the first 10 permutation samples.
set.seed(12345)

# number of permutation iterations
nreps <- 1000

# permute flipper_length_mm only, then fit the model to each permutation sample
null_dist <- penguins |>
  specify(body_mass_g ~ flipper_length_mm + bill_length_mm + sex + species) |>
  hypothesize(null = "independence") |>
  generate(reps = nreps, type = "permute", variables = c(flipper_length_mm)) |>
  fit()

null_dist |>
  ungroup() |>
  filter(term == "flipper_length_mm") |>
  slice(1:10) |>
  kable()
replicate  term                 estimate
1          flipper_length_mm  -0.6588419
2          flipper_length_mm  -1.3965712
3          flipper_length_mm  -0.3703079
4          flipper_length_mm  -0.7361458
5          flipper_length_mm  -1.6793636
6          flipper_length_mm  -0.5430295
7          flipper_length_mm  -1.3449740
8          flipper_length_mm   1.0025968
9          flipper_length_mm  -0.0926206
10         flipper_length_mm   0.4248671

We can already see in the first 10 permutation samples that many of the estimated coefficients are close to 0. This is again because we are assuming there is no association between flipper_length_mm and body_mass_g while conducting the hypothesis test.

  3. Next, we pull the estimates from the previous step together to obtain the null distribution. We will use this distribution to assess the strength of the evidence against the null hypothesis.

    flipper_length_null_dist <- null_dist |>
      filter(term == "flipper_length_mm") |>
      ungroup()
    
    flipper_length_null_dist |>
      ggplot(aes(x = estimate)) + 
      geom_histogram(binwidth = 0.25, color = "white", fill = "steelblue") +
      labs(x = "Estimated coefficient", 
           title = "Simulated null distribution of flipper length", 
           subtitle = paste0("from ",nreps, " permutation samples")) +
      theme_bw()

    flipper_length_null_dist |>
      summarise(Min = min(estimate), 
                Q1 = quantile(estimate, 0.25), 
                Median = median(estimate), 
                Q3 = quantile(estimate, 0.75), 
                Max = max(estimate), 
                Mean = mean(estimate), 
                Std.Dev. = sd(estimate)) |>
      kable(digits = 3)
    Min     Q1      Median  Q3      Max    Mean   Std.Dev.
    -4.355  -0.797  0.024   -0.797  5.055  0.067  1.231
    se_flipper <- sd(flipper_length_null_dist$estimate)


    Your turn!

    What does each observation in Figure [REF] represent?6

Note that the distribution in Figure [REF] is approximately unimodal and symmetric, and looks similar to the Normal distribution (also known as the Gaussian distribution). As the number of iterations, i.e., permutation samples, increases, the simulated null distribution will get closer and closer to the theoretical null distribution.

You may also notice that the center of the distribution is approximately 0, the null hypothesized value. The standard deviation of this distribution, 1.231, is an estimate of the standard error of \(\hat{\beta}_{flipper\_length\_mm}\), i.e., the sample-to-sample variability in the estimates of \(\hat{\beta}_{flipper\_length\_mm}\) when taking random samples of size 333. In a later section we will talk about how to calculate the standard error exactly based on mathematical models.

8.6.4.2 Calculate p-value

The null distribution gives us an idea of the values we expect the coefficient of flipper_length_mm to take across many random samples of penguins. Now we will compare the estimated coefficient obtained from our data to what we would expect under the null condition.

To make this comparison, we will calculate a p-value. The p-value is the probability of observing values at least as extreme as the ones observed from the data, given the null hypothesis is true. In other words, this is the probability of observing values that are at least as extreme as the coefficient estimated from our data in the null distribution.

You may now be wondering, “what does ‘more extreme’ mean?” The determination of “more extreme” is made from the alternative hypothesis. Below are the possibilities:

  • For \(H_a: \beta_j > 0\), then “more extreme” means obtaining a value in the null distribution that is greater than or equal to \(\hat{\beta}_j\)

  • For \(H_a: \beta_j < 0\), then “more extreme” means obtaining a value in the null distribution that is less than or equal to \(\hat{\beta}_j\)

  • For \(H_a: \beta_j \neq 0\), then “more extreme” means obtaining a value in the null distribution whose absolute value is greater than or equal to \(|\hat{\beta}_j|\). This means obtaining a value that is greater than or equal to \(|\hat{\beta}_j|\) or a value that is less than or equal to \(-|\hat{\beta}_j|\).

The p-value calculated by most statistical software is the two-sided p-value. Additionally, we defined the alternative hypothesis in Section [REF] as a two-sided alternative. Therefore, we will calculate the p-value for \(H_a: \beta_j \neq 0\). This p-value is calculated as the probability of observing the slope we observed in Output [REF], 17.847, or one more extreme. In this case, it is the probability of observing a value in the null distribution that is greater than or equal to 17.847 or less than or equal to -17.847.

num_extreme_obs <- flipper_length_null_dist |>
  filter(estimate >= abs(flipper_length_slope) | estimate <= -abs(flipper_length_slope)) |>
  nrow()

pvalue <- num_extreme_obs / nreps
pvalue
[1] 0

The p-value for this hypothesis test, therefore, is 0.
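Equivalently, infer provides a helper for this calculation; a sketch, assuming the fitted-model interface (obs_fit is the observed model fit, as in the confidence interval section):

# fit the model to the observed data (no resampling)
obs_fit <- penguins |>
  specify(body_mass_g ~ flipper_length_mm + bill_length_mm + sex + species) |>
  fit()

# two-sided p-value for each term, relative to the null distribution
null_dist |>
  get_p_value(obs_stat = obs_fit, direction = "two-sided")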

Your turn!

Interpret the p-value 0 in the context of the data.7

The p-value is very small (\(\approx 0\)), so it is extremely unlikely that we would observe a slope as extreme as 17.847 if there were, in fact, no linear relationship between flipper length and body mass.

Caution

When using permutation tests, you may calculate a p-value of 0, as we did in this example. Note that the true theoretical p-value is not exactly 0; it is just so small that we did not observe a slope at least as extreme in the 1000 permutation samples used to generate the null distribution.

We will calculate the exact p-value in Section [REF] when we conduct hypothesis testing using mathematical models.

8.6.4.3 Draw conclusion

Now that we have calculated the p-value, we will use it to draw conclusions about the hypotheses. Recall that we conduct hypothesis tests under the assumption that the null hypothesis is true, and we assess the strength of evidence against the null. The p-value is a measure of the strength of that evidence. Therefore, we draw the following conclusions from the p-value:

  • If the p-value is “sufficiently small”, there is strong evidence against the null hypothesis. Therefore we reject the null hypothesis, \(H_0\), and conclude the alternative.
  • Otherwise, there is not strong enough evidence against the null hypothesis, so we fail to reject the null hypothesis, \(H_0\).

In the case of our example, our p-value is \(\approx 0\), so it is sufficiently small to reject the null hypothesis and conclude that there is evidence of a linear relationship between the flipper length and body mass of penguins even after adjusting for bill length, sex, and species.

Note that in either decision, we have not determined that the null or alternative hypothesis is true. We have just determined that the evidence (i.e., the data) provides more support for one conclusion than the other. As with any statistical procedure, there is the possibility of making an error, more specifically a Type I or Type II error. Before we discuss the different error types, let’s first define what “sufficiently small” means.

We use a decision-making threshold called an \(\alpha\)-level to determine if a p-value is sufficiently small enough to reject the null hypothesis.

  • If \(p-value < \alpha\), then reject \(H_0\)

  • If \(p-value \geq \alpha\), then fail to reject \(H_0\).

A commonly used threshold is \(\alpha = 0.05\). If stronger evidence is required to reject the null hypothesis, then a lower threshold (\(\alpha < 0.05\)) could be used to assess the evidence. If such strong evidence is not required (as may be the case for analyses with very small sample sizes), then a threshold \(\alpha > 0.05\) may be used. It is convention not to use a threshold greater than 0.1, so any p-value greater than 0.1 is considered large enough to fail to reject the null hypothesis.

As with any statistical inference methods, there is the possibility we have made an error in our conclusions. There are two possible errors we can make when drawing conclusions from the hypothesis test - Type I and Type II error. We see an illustration of these errors in [REF TABLE]

Illustration of Type I and Type II error

                                        Truth
Hypothesis test decision    \(H_0\) true        \(H_a\) true
Fail to reject \(H_0\)      Correct decision    Type II error
Reject \(H_0\)              Type I error        Correct decision

A Type I error occurs when the null hypothesis is in fact true, but we reject the null hypothesis. The probability of making this type of error is the decision-making threshold \(\alpha\), because we reject the null hypothesis whenever we obtain a p-value less than \(\alpha\).

A Type II error occurs when the alternative hypothesis is in fact true, but we fail to reject the null hypothesis. The probability of making this type of error is less straightforward. It is \(1 - Power\), where \(Power = P(\text{reject } H_0 | H_a \text{ true})\).

In the context of the model from this chapter on Palmer penguins, a Type I error is concluding that there is a statistically significant relationship between flipper length and body mass in the model, when in fact there is not. A Type II error is concluding there is not a statistically significant relationship between flipper length and body mass when in fact there is. Given that we reached a conclusion to reject the null hypothesis in the previous section, there is a chance we’ve made a Type I error in this analysis.


Your turn!

Suppose you are concerned with making a Type I error, much more so than you are with making a Type II error. Which \(\alpha\)-level would you choose as the decision-making threshold for your analysis?8

  1. \(\alpha = 0.01\)
  2. \(\alpha = 0.05\)
  3. \(\alpha = 0.1\)

8.6.5 Relationship between CI and hypothesis test

We have now described the two main types of inferential procedures: hypothesis testing and confidence intervals. You may be wondering whether there is any connection between the two. In fact there is!

Conducting a hypothesis test with the two-sided alternative \(\beta_j \neq 0\) at the decision-making threshold \(\alpha\) is equivalent to assessing whether the null value falls within the \(C\%\) confidence interval, where \(C = (1 - \alpha)\times100\). This means we can use confidence intervals to conduct hypothesis tests in addition to estimating a plausible range of values for the parameter. When using a confidence interval to draw conclusions about a claim, we use the following guide:

  • If the null hypothesized value (\(0\) in our case, based on the tests defined in Section [REF]) is within the range of the confidence interval, we fail to reject \(H_0\) at the \(\alpha\)-level.

  • If the null hypothesized value is not within the range of the confidence interval, we reject \(H_0\) at the \(\alpha\)-level.

This illustrates the power of confidence intervals: we not only get the reject / fail to reject decision, but we also get an estimate of the values the parameter plausibly takes. Therefore, it is good practice to always report the confidence interval so that the audience has more context than just the binary reject / fail to reject decision.
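As a quick sketch using the bootstrap interval computed earlier (the objects lb and ub), checking whether the null value 0 falls inside the interval reproduces the test decision:

# H_0: beta = 0 vs. H_a: beta != 0, at alpha = 0.05
c(lb, ub)            # the 95% bootstrap interval, roughly 12.773 to 23.683
between(0, lb, ub)   # FALSE: 0 is outside the interval, so we reject H_0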

Statistical vs. practical significance

Just because we conclude that there is a statistically significant relationship between the response and predictor does not mean that the relationship is significant practically speaking. That is determined by the size of the effect of the predictor on the response and the context of the analysis.


8.7 Inference based on the Central Limit Theorem

8.8 Central limit theorem

Thus far we have approached inference using simulation-based methods (bootstrapping and permutation) to generate distributions that help us understand the variability in the regression coefficients. When certain conditions are met, however, we can rely on results from mathematical theory that tell us about the distribution, and hence the variability, of the model coefficients. In this section, we present that theory, then go through our previous example of inference for \(\hat{\beta}_{flipper\_length\_mm}\) using results from mathematical models. One thing you’ll notice as we go through this section is that the types of inference and the process for conducting inference are generally the same as before. The primary difference is in how we obtain the sampling distribution and null distribution in order to construct confidence intervals or conduct hypothesis tests.

8.9 Central Limit Theorem

The Central Limit Theorem (CLT) is a foundational theorem in statistics that tells us about the distribution of a statistic and the associated mathematical properties of that distribution. For the purposes of this text, we will focus on what the Central Limit Theorem tells us about the distribution of an estimated slope \(\hat{\beta}_j\), but note that this theorem applies beyond slope coefficients. Here we will also focus on the results of the theorem and less so on the derivation or advanced mathematical details of the distribution as determined by the Central Limit Theorem. Texts such as [INSERT RESOURCES] may be of interest to the reader seeking more theoretical details.

By the Central Limit Theorem, we know that under certain conditions (more on those in Section [REF]), \[\hat{\beta}_j \sim N(\beta_j, SE_{\hat{\beta}_j})\] This means that the distribution of \(\hat{\beta}_j\) is (1) Normally distributed, (2) centered at the true slope \(\beta_j\), and (3) has standard error \(SE_{\hat{\beta}_j}\). The center of this distribution, \(\beta_j\), is unknown; the purpose of statistical inference is to draw conclusions about this parameter. The standard error \(SE_{\hat{\beta}_j}\) depends on the variability in the predictor variable along with the regression standard error. We won’t go into the details of calculating \(SE_{\hat{\beta}_j}\) for multiple linear regression here; the reader may find the mathematical details in Section [REF]. To gain general intuition about \(SE_{\hat{\beta}_j}\), we point the reader to Section [REF] for the formula in the case of simple linear regression.
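We can see this result at work by comparing the Normal distribution implied by the Central Limit Theorem to the simulated null distribution from Section [REF]. A sketch, reusing flipper_length_null_dist and the estimated standard error se_flipper computed there; under the null hypothesis the Normal curve is centered at 0:

flipper_length_null_dist |>
  ggplot(aes(x = estimate)) + 
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.25, 
                 color = "white", fill = "steelblue") +
  stat_function(fun = dnorm, args = list(mean = 0, sd = se_flipper), 
                color = "red", linewidth = 1) +
  labs(x = "Estimated coefficient", 
       title = "Simulated null distribution with Normal curve overlaid") +
  theme_bw()

The close agreement between the histogram and the curve is the Central Limit Theorem at work.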

In the following sections, we will use what we know about the distribution of \(\hat{\beta}_j\) to conduct hypothesis testing and calculate confidence intervals.

8.9.1 Hypothesis test for the slope

The general idea for hypothesis testing for a slope is the same when using methods based on the Central Limit Theorem as it is when using simulation-based methods. We define a null and alternative hypothesis, conduct testing assuming the null hypothesis is true, and draw a conclusion based on an assessment of the evidence against the null hypothesis. The main difference between these two approaches is in how we obtain the null distribution. In Section [REF] we used permutation in order to generate the null distribution. Now we will rely on the results from the Central Limit Theorem in order to obtain that distribution.

Thus, the steps for conducting hypothesis testing based on the Central Limit Theorem are as follows:

  1. State the null and alternative hypotheses.
  2. Calculate a test statistic.
  3. Calculate a p-value.
  4. Draw a conclusion.

We ultimately have the same goal for testing as we did before: to determine whether there is evidence of a statistically significant linear relationship between the response and a predictor variable after adjusting for the other variables in the model. Therefore, the null and alternative hypotheses are

\[ H_0: \beta_j = 0 \text{ vs }H_a: \beta_j \neq 0 \\ \text{ given the other variables in the model} \]

This is a two-sided test given the alternative hypothesis is “not equal to 0”. As with simulation-based methods, we could also conduct one-sided tests by specifying a direction (less or greater than 0); however, the same advantages of the two-sided hypothesis that were previously stated remain.

The next step is to calculate a test statistic. The general form of a test statistic is

\[ T = \frac{\text{observed } - \text{ hypothesized}}{\text{standard error}} \]

More specifically, in our hypothesis test for \(\beta_j\), the test statistic is

\[ T = \frac{\hat{\beta}_j - 0}{SE_{\hat{\beta}_j}} \]

Let’s take a moment to consider what this value means. Recall that by the Central Limit Theorem, the distribution of \(\hat{\beta}_j\) is \(N(\beta_j, SE_{\hat{\beta}_j})\). Because we conduct hypothesis testing under the assumption that the null hypothesis is true, we are assuming that the center of the distribution of \(\hat{\beta}_j\) is 0 as we conduct the test. Therefore, to calculate the test statistic, we shift the observed slope by the mean and then rescale it by the standard error. Similar to a \(z\)-score in the Normal distribution, the test statistic tells us how far the observed slope is from the center of the distribution assumed by the null hypothesis. The magnitude of the test statistic \(|T|\) provides a measure of how far the observed slope is from the center of the distribution, and the sign of \(T\) indicates whether the observed slope is above (positive sign) or below (negative sign) the hypothesized mean.
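We can compute this test statistic directly from the tidy model output; using the unrounded estimate and standard error, it matches the statistic column the software reports (t_stat is our illustrative column name):

penguins_m1_tidy |>
  filter(term == "flipper_length_mm") |>
  mutate(t_stat = (estimate - 0) / std.error) |>  # observed minus hypothesized, over SE
  select(term, estimate, std.error, statistic, t_stat)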

Your turn!

Consider the magnitude of the test statistic, \(|T|\). Do you think test statistics with low magnitude provide evidence in support or against the null hypothesis? What about test statistics with high magnitude?9

Similar to the simulation-based methods, we will use the test statistic to calculate a p-value and ultimately draw a conclusion about the strength of the evidence in support of or against the null hypothesis. The test statistic, \(T\), follows a \(t\) distribution with \(n - p - 1\) degrees of freedom, where \(n\) is the number of observations used to fit the model and \(p\) is the number of predictor terms in the model, i.e., terms not including the intercept. The test statistic is essentially a standardization of the observed slope under the assumption that the null hypothesis is true, so we use the distribution of the test statistic in place of the simulated null distribution from the permutation test in Section [REF] to understand how far the observed slope is from what we would expect given the null hypothesis is true.

What are degrees of freedom anyway?

Use an intuitive understanding from the simple linear regression case of the number of points needed to fit a line.

As in the case of simulation-based inference, because the alternative hypothesis is “not equal to” and thus we are conducting a two-sided hypothesis test, we calculate the p-value on both the high and low extremes of the distribution. In other words,

\[ p-value = Pr(|t| > |T|) = Pr(t < -|T| \text{ or } t > |T|) \]

As with simulation-based inference, we compare the p-value to a decision-making threshold \(\alpha\) to draw final conclusions. If \(p-value < \alpha\) , we reject the null hypothesis in favor of the alternative. Otherwise, we fail to reject the null hypothesis.

Let’s look at the example from Section [REF] conducting a hypothesis test for \(\beta_{flipper\_length\_mm}\) in Model [REF].

As before, the null and alternative hypotheses are

\[ H_0: \beta_{flipper\_length\_mm} = 0 \text{ vs. } H_a: \beta_{flipper\_length\_mm} \neq 0, \text{ given the other variables in the model} \]

The observed slope \(\hat{\beta}_{flipper\_length\_mm}\) is 17.847 and the estimated standard error \(SE_{\hat{\beta}_{flipper\_length\_mm}}\) is 2.902. Therefore, the test statistic is

\[ T = \frac{17.8 - 0}{2.90} = 6.138 \]

This test statistic means that, given the true slope of flipper length in this model is 0 and thus the mean of the distribution of \(\hat{\beta}_{flipper\_length\_mm}\) is 0, the observed slope of 17.847 is 6.138 standard errors above the mean. This sounds very far from the mean, but let’s calculate a p-value to determine the probability of observing a slope at least this extreme given the null hypothesis is true.

n <- nrow(penguins)
p <- nrow(penguins_m1_tidy) - 1
df <- n - p - 1

There are \(n\) = 333 observations and \(p\) = 5 predictor terms in the model. Thus the test statistic follows a \(t\) distribution with 327 degrees of freedom. The p-value, therefore is

pt(-abs(6.138), 327) + pt(abs(6.138), 327, lower.tail = FALSE)
[1] 2.417635e-09

The p-value, \(2.42 \times 10^{-9}\), is very small, so we reject the null hypothesis. The data provide sufficient evidence that the coefficient of flipper length is not 0 in this model and that there is a statistically significant linear relationship between flipper length and body mass for the Palmer penguins.

Note that this conclusion is the same as in Section [REF] using a simulation-based approach. This is what we would expect, given these are two different approaches for conducting the same process, hypothesis testing for the slope, and both tests are conducted under the same assumption that the null hypothesis is true. The difference is in the methods available to us, simulation versus mathematical models, for quantifying the sample-to-sample variability in the estimated slopes.

8.9.2 Confidence interval

As with simulation-based inference, we can calculate a confidence interval to estimate a range of values the parameter \(\beta_j\) plausibly takes. As with hypothesis testing, the purpose, interpretation, and conclusions drawn from confidence intervals are the same as before. What differs, however, is how the interval is calculated. In the simulation-based approach, we used bootstrapping to construct a sampling distribution to understand the sample-to-sample variability in \(\hat{\beta}_j\). With the Central Limit Theorem approach, we quantify the sample-to-sample variability in \(\hat{\beta}_j\) using mathematical models.

The equation for a \((C \times 100) \%\) confidence interval for a slope parameter \(\beta_j\) is

\[ \hat{\beta}_j \pm t^* \times SE_{\hat{\beta}_j} \]

where \(t^*\) is the critical value from the \(t_{n - p - 1}\) distribution.

From earlier sections, we know about the estimated slope \(\hat{\beta}_j\) and the standard error \(SE_{\hat{\beta}_j}\). Therefore, let’s focus on \(t^*\), also known as the critical value.

The critical value is the point on the \(t_{n-p-1}\) distribution such that the probability of being between \(-t^*\) and \(t^*\) is \(C\). Thinking about this visually, this is the point such that the area under the curve between \(-t^*\) and \(t^*\) is \(C\). Note that we are still using a \(t\) distribution with \(n - p -1\) degrees of freedom, the same distribution used to calculate the p-value in hypothesis testing. The critical value can be calculated from modern statistical software or using online apps.
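For example, a sketch of the computation for a 95% interval with 327 degrees of freedom:

C <- 0.95
t_star <- qt(1 - (1 - C) / 2, df = 327)  # upper (1 - alpha/2) quantile of t_327
t_star                                   # approximately 1.967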

Let’s calculate the 95% confidence interval for \(\beta_{flipper\_length\_mm}\) . The critical value on the \(t_{327}\) distribution is 1.967. Note that this is very close to the critical value of 1.96 on the standard normal distribution. Therefore, the 95% confidence interval is

\[ \begin{aligned} 17.8 \pm 1.967 \times 2.90 \\ 17.8 \pm 5.704 \\ [12.096, 23.504] \end{aligned} \]

The interpretation is the same as before: We are 95% confident that the interval 12.096 to 23.504 contains the true slope for flipper length after adjusting for sex, species, and bill length. This means we are 95% confident that for each additional millimeter increase in flipper length, the body mass increases between 12.096 to 23.504 grams, on average, holding sex, species, and bill length constant.
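In practice, these bounds come directly from the model output. A sketch, assuming tidy() forwards conf.int to the underlying lm fit (it does for broom’s tidy method for linear models); the bounds will be close to the hand calculation above, which used rounded values:

tidy(penguins_m1, conf.int = TRUE, conf.level = 0.95) |>
  filter(term == "flipper_length_mm") |>
  select(term, estimate, conf.low, conf.high)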

8.9.3 Concluding thoughts

You may have noticed that the standard error, test statistic, p-value, and confidence interval we calculated using the mathematical models from the Central Limit Theorem align with what is seen in the output produced by statistical software in Figure [REF]. Modern statistical software will produce these values for you, so in practice you will not typically derive them “manually” as we did in this chapter. As the data scientist, your role will be to interpret the output and use it to draw conclusions. It’s still valuable, however, to understand where these values come from in order to interpret and use them accurately.

8.10 Inference for predictions

  • Two types of prediction

    • Estimated mean response for a subset of observations

    • Predicted response for an individual observation

8.11 Model conditions and diagnostics


  1. We sample with replacement so that we get a new sample each time we bootstrap. If we sampled without replacement, we would always end up with a sample that looks exactly like our sample data.↩︎

  2. There are 333 observations in each bootstrap sample. Each bootstrap sample is the same size as the sample data.↩︎

  3. There are 1000 values, the number of iterations, in the bootstrapped sampling distribution.↩︎

  4. The points at the \(5^{th}\) and \(95^{th}\) percentiles mark the lower and upper bounds for a 90% confidence interval.↩︎

  5. The points at the \(1^{st}\) and \(99^{th}\) percentiles mark the lower and upper bounds for a 98% confidence interval.↩︎

  6. Each observation represents the estimated coefficient of flipper_length_mm for a given permutation sample.↩︎

  7. Given there is no linear relationship between flipper length and body mass of penguins (\(H_0\) is true), the probability of observing a slope of 17.847 or more extreme in a random sample of 333 penguins is approximately 0.↩︎

  8. You would choose (a) \(\alpha = 0.01\), so you can minimize the probability of making a Type I error. This means stronger evidence is required to reject the null hypothesis.↩︎

  9. Test statistics with low magnitude provide evidence in support of the null hypothesis, as they are close to the hypothesized value. Conversely, test statistics with high magnitude provide evidence against the null hypothesis, as they are far away from the hypothesized value.↩︎