We start every analysis with exploratory data analysis (EDA) to better understand the observations in the data and the distributions of the variables, and to gain initial insights into the relationships between the variables of interest. EDA can also help us identify outliers or other unusual observations, missing data, and potential errors in the data, such as errors in how the data were recorded or how the data set was loaded into the statistical software.
We’ll do an exploratory data analysis that focuses only on the two variables that will be in the regression model. In practice, however, we may want to explore other variables in the data set (e.g., year in this example) to provide additional context later on as we interpret results from the regression model. We begin with univariate EDA, exploring one variable at a time, then we’ll conduct bivariate EDA to look at the relationship between critics and audience scores.
3.0.1 Univariate EDA
The univariate distributions of the critics and audience scores are visualized in Figure 3.1 and summarized in Table 3.1.
Code
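library(ggplot2)
library(patchwork)  # assumed: supplies `+` for placing two ggplot objects side by side

# Histogram of critics scores
p_critics <- ggplot(data = movie_scores, aes(x = critics)) +
  geom_histogram(binwidth = 10, fill = "steelblue", color = "black") +
  labs(x = "Critics Score", y = "Count") +
  xlim(0, 100)

# Histogram of audience scores
p_audience <- ggplot(data = movie_scores, aes(x = audience)) +
  geom_histogram(binwidth = 10, fill = "steelblue", color = "black") +
  labs(x = "Audience Score", y = "Count") +
  xlim(0, 100)

# Display the two histograms next to each other
p_critics + p_audience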
Figure 3.1: Univariate distributions of critics scores and audience scores on Rotten Tomatoes.
Table 3.1: Summary statistics for audience and critics score

| Variable | Mean | SD | Min | Q1 | Median (Q2) | Q3 | Max | Missing |
|---|---|---|---|---|---|---|---|---|
| critics | 60.8 | 30.2 | 5 | 31.2 | 63.5 | 89 | 100 | 0 |
| audience | 63.9 | 20.0 | 20 | 50.0 | 66.5 | 81 | 94 | 0 |
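Summary statistics like those in Table 3.1 can be computed with a small helper function. The sketch below is one possible approach in base R, not necessarily how the table was produced. `summarize_var()` is a hypothetical helper name, and `toy_scores` is a made-up vector used only for illustration (it is not the movie data).

```r
# Hypothetical helper: summary statistics for one numeric variable.
summarize_var <- function(x) {
  # Quartiles, ignoring missing values (R's default quantile method)
  qs <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
  data.frame(
    mean    = mean(x, na.rm = TRUE),
    sd      = sd(x, na.rm = TRUE),
    min     = min(x, na.rm = TRUE),
    q1      = qs[1],
    median  = qs[2],
    q3      = qs[3],
    max     = max(x, na.rm = TRUE),
    missing = sum(is.na(x))  # number of missing values
  )
}

# Made-up values for illustration only (not the Rotten Tomatoes data).
toy_scores <- c(12, 35, 48, 60, 71, 88, 95, NA)
toy_summary <- summarize_var(toy_scores)
print(toy_summary)
```

Assuming the data frame `movie_scores` is loaded as in the plotting code, the same helper could be applied as `summarize_var(movie_scores$critics)`.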
The description of the univariate distribution has four components:
Shape: A description of the shape includes the skewness (left-skewed, right-skewed, symmetric) and the number of modes, i.e., peaks (unimodal, bimodal, multimodal).
Center: The mean or median is typically used to describe the center of the distribution. To determine which measure is the best representation of the center, consider the shape of the distribution and whether there are outliers. If the distribution is approximately symmetric, then the mean is the better measure of center. One reason for this is that the mean is calculated using all the values in the data set, in contrast to the median, which only takes into account the middle value (or the middle two values if there is an even number of observations). The mean, however, is affected by skewness in the data and the presence of outliers. Therefore, if either of these is present, the median is the more reliable measure of the center of the distribution.
Spread: The standard deviation or inter-quartile range (IQR) is used to describe the spread. If the data are approximately symmetric with no outliers, the standard deviation is a good measure of the spread. The standard deviation is impacted by skewness and outliers, because the mean is used to calculate it. If the distribution is skewed or has outliers, then the IQR, the difference between the 75th percentile (Q3) and the 25th percentile (Q1), is a more reliable measure of spread.
Reporting center and spread
To describe the center and spread of a distribution, report
the mean and standard deviation, or
the median and IQR
Using range as a measure of spread
The range is commonly used to measure the spread of a distribution. The range, however, should be used with caution and not reported as the only measure of spread. Because it only takes into account the minimum and maximum, it only describes the spread of the extreme ends of the distribution, not the spread of the middle where a majority of the data typically lie. Additionally, it is heavily affected by outliers, so it can be a potentially misleading measure of the spread.
Outliers or other notable patterns: The last part of describing a univariate distribution is describing outliers or other interesting and unusual patterns in the data, if they are present in the distribution. Outliers can be observations that just happen to be different from the others (e.g., the salary of LeBron James, a very famous NBA player, compared to the salaries of 1000 randomly selected adults in the United States); however, they may also be due to data entry errors (e.g., a person’s age recorded as 150 years). Unusual patterns are those that may not follow what we would expect, such as a mode at an unexpected value. This commonly happens in practice with modes at values such as -1 or 0, which are intended to represent missing data rather than actual observed values.
Once the outliers have been identified and better understood from further investigation, there are options on how to address them. If they are merely unusual observations, it is good practice to keep them in the analysis or compare models fit with and without these observations. If we remove them from the analysis, we must note that they’ve been removed and discuss potential limitations in the scope of the conclusions. If the outliers are a result of a data entry error, then it is recommended to correct the value, if it is possible to determine what the intended value is, or remove the observation from the analysis. Again, it is important to document how the outlying observations are handled and handle them in a reproducible way. Handling outliers in regression is discussed in more detail in (ch-slr-conditions?).
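To make the effect of an outlier on the measures of center and spread concrete, the sketch below compares them with and without a single extreme value. All values are made up for illustration: `ages` contains a hypothetical data entry error of 150, and `income` uses -1 as a hypothetical code for missing data.

```r
# Made-up ages for illustration; 150 stands in for a data entry error.
ages <- c(23, 27, 31, 34, 36, 41, 45, 52, 150)
ages_clean <- ages[ages != 150]  # removal done in code, so it is reproducible

# Measures of center and spread for a numeric vector.
measures <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x),
    iqr = IQR(x), range = diff(range(x)))
}

# Compare the measures with and without the outlying value.
comparison <- rbind(with_outlier = measures(ages),
                    without_outlier = measures(ages_clean))
print(round(comparison, 1))

# Made-up incomes where -1 is a code for "missing", not an observed value.
income <- c(40, 55, -1, 62, -1, 48)
income[income == -1] <- NA  # recode so R treats these values as missing
print(sum(is.na(income)))
```

In this toy example, the mean, standard deviation, and range change substantially when the outlier is removed, while the median and IQR shift only slightly.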
Putting all this together, we now describe the univariate distribution of the predictor variable critics.
The distribution of critics is left-skewed, meaning the movies in the data set are generally more favorably reviewed by critics (more observations with higher critics scores). Given the apparent skewness, the center is the median score of 63.5. The IQR describing the spread of the middle 50% of the distribution is 57.8 points, so there is a lot of variability in the critics scores for the movies in the data. There are no apparent outliers, but we observe from the raw data that there are two notable observations of movies that have perfect critics scores of 100. There are no missing values of critics score.
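The IQR reported above follows directly from the quartiles of critics in Table 3.1. The sketch below reproduces that arithmetic and, as an added assumption, applies the common 1.5 × IQR rule of thumb for flagging potential outliers. This rule is a widely used convention, not necessarily the criterion used in the description above.

```r
# Quartiles of critics score, taken from Table 3.1.
q1 <- 31.2
q3 <- 89

# IQR: spread of the middle 50% of critics scores.
iqr <- q3 - q1

# Assumption: 1.5 * IQR rule of thumb for flagging potential outliers.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

print(c(iqr = iqr, lower_fence = lower_fence, upper_fence = upper_fence))
```

Under this rule, both fences fall outside the 0 to 100 range of possible scores, so no critics score would be flagged, which agrees with the observation that there are no apparent outliers.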
3.1 Your turn!
Use the histogram in Figure 3.1 and summary statistics in Table 3.1 to describe the distribution of the response variable audience.
3.1.1 Bivariate EDA
After we’ve examined the variables individually, we begin to explore the relationships between variables. We’ll focus on the relationship between the response and predictor variable we’re studying; however, there may be other variable relationships we want to understand to provide additional context to the results from the regression model.
Similar to univariate EDA, we visualize the relationship between variables and calculate summary statistics to better quantify the relationships. A scatterplot of the audience score versus critics score is shown in Figure 3.2. When making the scatterplot, we put the predictor variable on the $x$-axis (horizontal axis) and the response variable on the $y$-axis (vertical axis).
Code
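# Scatterplot with the predictor (critics) on the x-axis and the response (audience) on the y-axis
ggplot(data = movie_scores, mapping = aes(x = critics, y = audience)) +
  geom_point(alpha = 0.5) +
  labs(x = "Critics Score", y = "Audience Score") +
  theme_bw()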
Figure 3.2: Relationship between critics and audience scores on Rotten Tomatoes.
The correlation, $r$, is a measure of the direction and strength of the linear relationship between two variables. It ranges from -1 to 1, with $r \approx -1$ meaning a very strong negative relationship, $r \approx 1$ a very strong positive relationship, and $r \approx 0$ a very weak to no linear relationship. The correlation between critics and audience score is $r$ = 0.78.
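As a minimal sketch of how a correlation is computed, the code below calculates the Pearson correlation from its standard definition, $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(n - 1) s_x s_y}$, and compares it with R's built-in `cor()`. The vectors `x` and `y` are made-up values for illustration, not the movie data.

```r
# Made-up scores for illustration only (not the Rotten Tomatoes data).
x <- c(20, 35, 50, 65, 80, 95)
y <- c(30, 45, 40, 70, 75, 85)

# Pearson correlation computed from its definition.
n <- length(x)
r_manual <- sum((x - mean(x)) * (y - mean(y))) / ((n - 1) * sd(x) * sd(y))

# Built-in equivalent (Pearson is the default method).
r_builtin <- cor(x, y)

print(c(manual = r_manual, builtin = r_builtin))
```

Assuming `movie_scores` is loaded as in the plotting code, the correlation reported above would correspond to `cor(movie_scores$critics, movie_scores$audience)`.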
Similar to univariate EDA, we include several features when describing the relationship between the two variables. These are shape, direction (if applicable), strength, outliers, and other interesting features. Below is an explanation of each component, followed by a description of the relationship between critics and audience score.
Shape: The shape is the general pattern of the points in the scatterplot. The most common shapes we may see are linear, quadratic, cubic, and no discernible pattern.
Direction: If the shape is linear, then we can describe the overall direction of the points. The direction is positive if $y$ tends to increase as $x$ increases, negative if $y$ tends to decrease as $x$ increases, and no direction if $y$ is approximately the same for all values of $x$. The sign of the correlation coincides with the direction of linear relationships.
Strength: The strength is a measure of how closely the observations follow the overall pattern or shape. Points that are tightly clustered together indicate a stronger relationship than points that are more dispersed. When the shape is linear, the correlation quantifies the strength of the relationship between the variables.
Outliers: As in univariate EDA, outliers are points that do not follow the general pattern of the data. These can be points that are outliers in the $x$-direction, the $y$-direction, or both. These points are important to identify in the EDA, as they may influence the regression model. We’ll talk more about the impact of outliers on the regression model in a later section.
Other interesting features: There may be other features of the scatterplot that are interesting to highlight. For example, there may be different variability (spread) in the points as $x$ increases. These kinds of interesting features may help provide additional explanation as we assess the fit of the regression model and understand the estimates.
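The reason shape is assessed before strength is that the correlation only captures linear association. The toy example below (made-up values, not the movie data) contrasts a perfectly linear relationship with a perfectly quadratic one that is symmetric about zero.

```r
# Made-up predictor values, symmetric about 0.
x <- -3:3

y_linear <- 2 * x + 1   # perfectly linear, increasing in x
y_quadratic <- x^2      # perfectly quadratic, symmetric about x = 0

# Correlation for each relationship.
r_linear <- cor(x, y_linear)
r_quadratic <- cor(x, y_quadratic)

print(c(linear = r_linear, quadratic = r_quadratic))
```

The linear relationship has a correlation of essentially 1, while the quadratic relationship has a correlation of 0 even though `y_quadratic` is completely determined by `x`. A correlation near 0 therefore indicates no linear relationship, not necessarily no relationship.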
Below is a summary of the bivariate EDA for the movie scores data.
There is a positive, linear relationship between the critics and audience scores for movies on Rotten Tomatoes. The correlation between these two variables is 0.78, indicating the relationship is moderately strong. Therefore, we can generally expect the audience score to be higher for movies with higher critics scores. There are no apparent outliers, but there does appear to be more variability in the audience score for movies with lower critics scores than for those with higher critics scores.
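A visual impression of unequal variability can be checked numerically by comparing the spread of the response within ranges of the predictor. The sketch below uses made-up scores and an arbitrary cutoff of 50, both assumptions for illustration. It is one simple way to make the comparison, not a formal test.

```r
# Made-up scores for illustration only (not the Rotten Tomatoes data).
critics  <- c(10, 15, 20, 30, 35, 40, 70, 75, 80, 85, 90, 95)
audience <- c(20, 55, 35, 60, 30, 65, 72, 78, 80, 84, 88, 90)

# Split movies at an arbitrary cutoff of 50 (an assumption for this sketch).
group <- ifelse(critics < 50, "lower critics", "higher critics")

# Standard deviation of audience score within each group.
spread_by_group <- tapply(audience, group, sd)
print(round(spread_by_group, 1))
```

In this toy data, the audience scores vary more among movies with lower critics scores, which mirrors the pattern described for the movie scores data.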