Preface

Statistical models have become an increasingly prominent part of everyday life. For years, algorithms have provided recommendations on almost everything, from the music we might like to the products we might want to purchase. Algorithms have also been used in more consequential decision-making, such as identifying bank accounts for potential fraud or informing the length of sentences in the United States court system. More recently, artificial intelligence, in particular Large Language Models, has changed how we approach work, school, and everyday tasks. Given the broad impact statistical models have on modern life, it is important that the people who develop these models and the people who use them to inform decision-making understand how the models work and the implications of decisions made during the model development process.

This book is an introduction to the world of modeling, with a focus on linear and logistic regression models. These models are widely used in industry and academia, and they provide a foundation for many models commonly used in machine learning and data science. The ideas presented in this book about how we approach model building, interpret results, and use models to inform decision-making extend naturally to advanced modeling techniques. We’ll see a glimpse of this in Chapter 13  Special topics. Overall, this book provides an in-depth introduction to modeling that prepares the reader to use regression models in practice and to study advanced modeling techniques.

The following are the key points we aim for readers to learn about regression as they read the text.

  1. Regression is a powerful tool for exploring relationships in the world around us.
  2. Regression is as much art as it is science. There is rarely a single “correct” way to approach a problem. In fact, all approaches have inherent advantages and limitations.
  3. Analysis decisions are largely informed by the data and analysis objective. It is important to understand the data, what question(s) can be answered from the data, and the scope of conclusions that can be drawn from the data.
  4. It is important to understand the decisions made throughout the analysis process in order to effectively use regression models to draw insights and conclusions.

Audience

This book is intended for readers who have had an introduction to data science or statistics and are seeking a more in-depth study of regression analysis. It aims to equip readers with the knowledge and skills to use regression analysis in practice, whether in academia or industry. It is primarily written to serve as a textbook for an applied regression analysis course, but it can also be used for self-study by readers who want a robust understanding of regression analysis for their work or research.

We assume the reader has taken an introductory-level statistics or data science course or is familiar with the topics found in texts such as Introduction to Modern Statistics (Çetinkaya-Rundel and Hardin 2024) or Statistical Inference via Data Science: A ModernDive into R and the tidyverse (Ismay and Kim 2019). Throughout the book, we briefly review introductory topics as needed. We refer readers to these and similar texts for a more in-depth introduction to statistics and data science.

We also assume readers interested in the computing aspects of this book have some familiarity with R and the tidyverse. Chapter 2  Data analysis in R provides a computing review that can also serve as a brief introduction. We refer readers to R for Data Science (Wickham, Çetinkaya-Rundel, and Grolemund 2023) for a resource on computing using the tidyverse.

Structure

The book is divided into four parts: Getting started, Simple linear regression, Multiple linear regression, and Beyond linear regression.

  • Part 1: Getting started - This part introduces foundational concepts that are used throughout the book. Chapter 1  Regression in the data science workflow is an introduction to regression analysis and how it fits in the data science landscape. It also introduces a data science workflow that is used in the assignments in the supplemental materials. Chapter 2  Data analysis in R provides a review of (or brief introduction to) R and the tidyverse. It focuses on many of the data manipulation and data analysis functions that are used in the text. It also introduces Quarto, a system for technical documents that is used in the assignments. Chapter 3  Exploratory data analysis introduces exploratory data analysis and how it is used to understand the distributions of individual variables and the relationships between multiple variables. It also discusses strategies for cleaning data and handling unusual features in the data.

  • Part 2: Simple linear regression - This part covers the details of regression in the context of simple linear regression. The concepts introduced in this section extend to the multiple linear regression models introduced in Part 3. Chapter 4  Simple linear regression introduces the simple linear regression model, estimating and interpreting model coefficients, prediction, and model evaluation. Chapter 5  Inference for simple linear regression introduces simulation-based and theory-based inference for model coefficients. Chapter 6  Model conditions and diagnostics discusses model conditions and diagnostics.

  • Part 3: Multiple linear regression - This part extends the concepts from Part 2 to multiple linear regression models with two or more predictors. Chapter 7  Multiple linear regression introduces the multiple linear regression model, estimating and interpreting model coefficients, working with different types of predictors, interaction effects, and prediction. Chapter 8  Inference for multiple linear regression introduces simulation-based and theory-based inference for coefficients in a multiple linear regression model, along with model conditions and diagnostics. Chapter 9  Variable transformations introduces models with transformations on the response and/or predictor variables. Chapter 10  Model selection covers model assessment, model selection, and cross validation.

  • Part 4: Beyond linear regression - This part introduces models for data that do not meet the conditions for linear regression. Chapter 11  Logistic regression introduces logistic regression for binary response variables. It discusses estimating and interpreting model coefficients, simulation-based and theory-based inference for model coefficients, model conditions, and model diagnostics. Chapter 12  Logistic regression: Prediction and evaluation covers prediction and model evaluation for logistic regression. Chapter 13  Special topics is an introduction to a collection of models that are extensions of linear and logistic regression. These models include multinomial logistic regression, random intercepts models, decision trees, and models for causal inference.

The book also includes appendices covering the mathematics underlying linear and logistic regression. These appendices utilize the matrix representation of the models, so they are intended for readers with some familiarity with linear algebra.

Using this book for a course

The content and structure of this book are based on undergraduate regression analysis courses that have been taught by the author at Duke University. Each course follows a 15-week semester.

  • Applied Regression Analysis: Undergraduate regression analysis course focused on application. Prerequisites are introductory statistics or introductory probability.

  • Regression Analysis with Theory: Undergraduate regression analysis course focused on application and mathematical theory. Prerequisites are linear algebra and either introductory statistics or introductory probability.

Key features

Beginning with Chapter 3  Exploratory data analysis, the chapters are written as case studies based on real-world data and a stated analysis objective. The case studies in this book show the variety of contexts in which regression analysis can be applied. Each chapter begins with an introduction to the data and an exploratory data analysis focused on the variables and relationships relevant to the chapter. The data sets used in each chapter are available in Appendix C: Data sets. An analysis objective anchors each chapter, and readers are walked through the analysis process as they learn and apply new concepts.

There are callout boxes throughout the book to help guide the reading experience.

  • Analysis objective: Highlights the main analysis question we seek to answer using the methods introduced in the chapter.

  • Analysis in practice: Provides practical tips and insights about using regression analysis in work and research.

  • Math details: Includes concepts and mathematical facts that are used for computations or derivations.

  • Your turn: Encourages the reader to check understanding using practice questions on the concepts introduced in the chapter. Answers (or example responses) are posted as footnotes.

Computing

Computing is an important and necessary aspect of conducting regression analysis in practice. Therefore, R output and code are included in each chapter. The code in the book primarily follows the tidyverse (Wickham et al. 2019) and tidymodels (Kuhn and Wickham 2020) syntax; both are introduced in Chapter 2  Data analysis in R.

Inspired by the structure of An Introduction to Statistical Learning (James et al. 2021), each chapter contains a section about the code used to apply the new concepts. This format is intended to help readers focus on conceptual understanding before diving into the code. Additionally, it is intended to make the text more accessible to readers not using R.

Supplemental materials

The companion website (https://introregression-resources.netlify.app) contains homework assignments, computing labs, and other resources that accompany the text. These are based on a workflow using R and Quarto and have been used in undergraduate regression analysis courses at Duke University. Many of these assignments have also been adapted and used by instructors at other institutions.

About the author

Acknowledgements