Appendix C — Data sets
The data sets used in this book are listed below in alphabetical order.
coffee-ratings.csv: Aroma and flavor grades for 1338 coffees rated in 2018. The data are originally from the Coffee Quality Database and were featured as part of the TidyTuesday data visualization challenge in July 2020.
- This data set is analyzed in Chapter 6.
fivethirtyeight-voters-data.csv: Voting frequency, political party affiliation, and demographic information for 5836 adults in the United States. The data were collected through an online poll conducted by Ipsos and were analyzed in the FiveThirtyEight article “Why Many Americans Don’t Vote” (Thomson-DeVeaux, Mithani, and Bronner 2020). The final sample includes only adults who had been eligible to vote for at least four election cycles at the time the data were collected in 2020.
- This data set is analyzed in Section 13.1 and Section 13.3.
gss24-ai.csv: Demographic information, political leanings, and comfort with driverless vehicles for 1521 adults in the United States. The data are from the 2024 General Social Survey, which is funded by the National Science Foundation and administered by the National Opinion Research Center (NORC) at the University of Chicago. The data were collected through a combination of online surveys and in-person interviews.
- This data set is analyzed in Chapter 11 and Chapter 12.
lemurs-repeated-measures.csv: Weight, age, and other characteristics for 248 lemurs living at the Duke Lemur Center at the time the data were collected. The data were originally analyzed in Zehr et al. (n.d.) and featured as part of the TidyTuesday data visualization challenge in August 2021. The data set includes the lemurs’ measurements from ages 1 to 24, so there can be multiple measurements for an individual lemur. There are 3715 total observations in the data.
- This data set is analyzed in Section 13.2.
lemurs-sample-young.csv: Weight, age, and other characteristics for 252 lemurs age 24 months or younger living at the Duke Lemur Center at the time the data were collected. The data were originally analyzed in Zehr et al. (n.d.) and featured as part of the TidyTuesday data visualization challenge in August 2021. There is one observation for each lemur.
life-expectancy-data.csv: Information about life expectancy, healthcare, and other societal factors for 140 countries. The data set was obtained from Zarulli et al. (2021) and includes data from the Human Development Database and the World Health Organization.
- This data set is analyzed in Chapter 1.
movie-scores.csv: Critics and audience scores on Rotten Tomatoes for 146 movies released in 2014 and 2015. This data set is adapted from the fandango data frame in the fivethirtyeight R package (Kim, Ismay, and Chunn 2018).
- This data set is analyzed in Chapter 4.
ncaa-basketball-DI-2023-2024.csv: Total expenditure on basketball programs and other features of 355 Division I NCAA colleges and universities in the 2023-2024 academic year. The data were obtained with the Equity in Athletics Data Analysis (EADA) tool from the Office of Postsecondary Education in the United States Department of Education (ope.ed.gov/athletics).
- This data set is analyzed in Chapter 9 and Section 13.3.
parks.csv: Total expenditure per resident and number of playgrounds per 10,000 residents in 97 of the most populous cities in the United States in 2020. The data were originally analyzed in the 2021 report Parks and an Equitable Recovery (The Trust for Public Land 2021) from the Trust for Public Land. They were featured as part of the TidyTuesday data visualization challenge in June 2021.
- This data set is analyzed in Chapter 5.
penguins: Measurements and other features of 344 penguins at Palmer Station in Antarctica. The data were collected by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program. They are available in the palmerpenguins R package (Horst, Hill, and Gorman 2020).
- This data set is analyzed in Chapter 2.
project-ace-data.csv: Demographic information, Project ACE (Action for Equity) participation, and educational outcomes for 1300 high school students in the United States. The data were obtained from Evans, Perez, and Morera (2025).
- This data set is analyzed in Section 13.4.
recipes.csv: Author, cook time, serving size, and other features of 2218 recipes published on Allrecipes.com between 2009 and 2025. The data are modified from the cuisines data frame in the tastyR R package (Mubia 2025). The data were originally scraped from Allrecipes.com by Brian Mubia.
- This data set is analyzed in Chapter 3.
spotify-songs-sample.csv: Music features of 3000 songs on Spotify, a music streaming platform. The data are a subset of the data originally analyzed in Pavlik (2019). They were featured as part of the TidyTuesday data visualization challenge (Community 2024) in January 2020.
- This data set is analyzed in Chapter 10.