Section 6.6 ANOVA: Analysis of Variance (N6)
In Section 6.4, we compared populations separated into two groups to see if some variable differed on average between the groups. In this section, we will compare across more than two groups.
Suppose you had a variable of interest across \(k\) separate populations. So long as the following conditions are met:
Observations are independent across groups.
Observations within groups are approximately normal.
Variability across groups is roughly equal.
Then we may perform ANOVA (Analysis of Variance) to test the following hypotheses:
\(H_0:\) The means are the same across groups, that is: \(\mu_1=\mu_2=\cdots=\mu_k\text{.}\)
\(H_A:\) At least one of the means is different from another.
Exploration 6.6.1. Batting Average and Position.
Is there any difference in batting average between different positions? We examine 1270 players from the 2018 MLB season, and observe their batting average and position: 1B = first base, 2B = second base, 3B = third base, C = catcher, CF = center field (outfield), DH = designated hitter, LF = left field (outfield), P = pitcher, RF = right field (outfield), SS = shortstop.
Run the following code to download the mlb_players_18.csv data set, which contains information about 1270 players from the 2018 MLB season, and display its variables: https://www.openintro.org/data/index.php?data=mlb_players_18.
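A sketch of what that download might look like in R. The CSV location assumes OpenIntro's usual `/data/csv/` layout, and the name mlb18 matches the one used later in this exploration:

```r
# Download the 2018 MLB players data set and list its variables
# (URL assumes OpenIntro's standard csv path for this data set)
mlb18 <- read.csv("https://www.openintro.org/data/csv/mlb_players_18.csv")
names(mlb18)
```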
(a)
Since pitchers and designated hitters are special cases, let's consider a subset of the players without them. Run the following code to produce batters, a subset of mlb18 with no pitchers or designated hitters.
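A sketch of that subsetting step, assuming the data set stores the fielding position in a column named position with abbreviations like "P" and "DH":

```r
mlb18 <- read.csv("https://www.openintro.org/data/csv/mlb_players_18.csv")

# Keep every row whose position is NOT pitcher or designated hitter
batters <- subset(mlb18, !(position %in% c("P", "DH")))
table(batters$position)  # count how many players remain at each position
```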
(b)
Run the following code to plot a boxplot for the batting averages of batters, separated by position.
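A minimal sketch of that plot, assuming the batting-average column is named AVG:

```r
mlb18   <- read.csv("https://www.openintro.org/data/csv/mlb_players_18.csv")
batters <- subset(mlb18, !(position %in% c("P", "DH")))

# One box per position, using R's formula interface
boxplot(AVG ~ position, data = batters,
        xlab = "Position", ylab = "Batting average")
```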
Exploration 6.6.2. Randomness.
Much like in Exploration 6.4.2, the issue is that it's certainly possible for two different distributions to produce similar-looking samples, or for identical distributions to produce very different-looking samples.
(a)
Run the following code to produce a data frame with two variables: a group variable with values A, B, and C, and a values variable whose distribution is identical across groups. Then plot boxplots of values across the groups.
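A sketch of such a simulation, where all three groups are drawn from one common normal distribution (the specific seed, means, and sizes are illustrative choices):

```r
set.seed(42)  # make the simulation reproducible

sim <- data.frame(
  group  = rep(c("A", "B", "C"), each = 50),
  values = rnorm(150, mean = 10, sd = 2)  # one shared distribution
)

# Identical distributions can still yield visibly different boxes
boxplot(values ~ group, data = sim)
```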
(b)
Run the following code to produce another data frame with the same variables, but where the B group has a different distribution than the others.
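A sketch of the second simulation, where B is drawn with a shifted mean (the shift of 3 is an illustrative choice):

```r
set.seed(42)  # make the simulation reproducible

sim2 <- data.frame(
  group  = rep(c("A", "B", "C"), each = 50),
  values = c(rnorm(50, mean = 10, sd = 2),   # A
             rnorm(50, mean = 13, sd = 2),   # B: shifted mean
             rnorm(50, mean = 10, sd = 2))   # C
)

boxplot(values ~ group, data = sim2)
```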
Subsection 6.6.1 Intuition of ANOVA
Remark 6.6.1. Idea behind ANOVA.
The null hypothesis of ANOVA testing is that all of the groups have identical distributions, and the alternative is that they do not. If the null hypothesis were true, then it must follow that all the variation between the groups is driven by the natural variation of the variable itself.
So what we will do is compare the variation between groups to the variation within groups. If the null hypothesis holds, then we would expect these values to be comparable. If there is some discrepancy, then we become more suspicious of the null. As usual, this level of suspicion will be measured with a \(p\)-value, which measures the probability that, should the null be true, we would see a discrepancy at least as large as the one observed.
Activity 6.6.3. Fertilizer and Crop Yields: Variation between Groups.
Over the next few activities, we will explore how ANOVA works with a small data set, one that is honestly too small to be considered a serious data set in practice. However, it will allow us to see the inner workings of ANOVA without getting bogged down in huge computations.
A farmer tests 3 types of fertilizer, A, B and C, on her crops. She segments her farmland into different plots, applies one type of fertilizer to each plot, and in the end, records the bushels per acre for that plot. The results are as follows:
(a)
Compute \(\bar{x}\) the overall sample mean production rate for all 12 plots.
Hint. Desmos
(b)
Compute \(\bar{x}_1\) the sample mean production rate for Fertilizer A. Repeat for \(\bar{x}_2\) for Fertilizer B and \(\bar{x}_3\) for Fertilizer C.
Hint. Desmos
(c)
Now that we have the overall mean, and the mean of each group, we compute the “variation” amongst the groups.
Compute \(n_1(\bar{x}_1-\bar{x})^2+n_2(\bar{x}_2-\bar{x})^2+n_3(\bar{x}_3-\bar{x})^2\text{,}\) where \(n_i\) is the size of the appropriate group. This computes the “difference” of each group's mean from the overall mean, squared to remove signs, and weighted by the size of the group. Call this value \(SSG\text{,}\) or the “sum of squares for groups”.
(d)
Since there are \(k=3\) groups, the degrees of freedom for groups: \(df_G\) is \(k-1=3-1=2\text{.}\) Compute the “mean square for groups”: \(MSG=\frac{SSG}{df_G}.\)
This value measures the variation between groups, normalized for the number of groups. (If there were a lot of groups, we would see a lot of variation even if they were distributed identically, so we account for that.)
Activity 6.6.4. Fertilizer and Crop Yields: Variation within Groups.
We now continue from Activity 6.6.3 to analyze variation within groups.
(a)
Luckily there is a statistic which measures variation (it's implied by the name). Compute \(Var_1\text{,}\) the sample variance of the yield rates for Fertilizer A. Repeat for \(Var_2, Var_3\text{.}\)
Hint. Desmos
(b)
Now that we have the variance within each group, we compute the total variation.
Compute \((n_1-1)Var_1+(n_2-1)Var_2+(n_3-1)Var_3\text{.}\) This computes the total variance within groups, weighted by the size of the group. Call this value \(SSE\text{,}\) or the “sum of squares for errors”.
(c)
Compute the “degrees of freedom for errors” by summing the degrees of freedom for each group: \(df_{E}=(n_1-1)+(n_2-1)+(n_3-1).\)
(d)
Compute the “mean square error”: \(MSE=\frac{SSE}{df_E}.\)
This value measures the variation within groups, normalized by the size of the groups.
Activity 6.6.5. Fertilizer and Crop Yields: \(F\)-statistic and \(p\)-value.
We continue from Activity 6.6.4 to produce a \(p\)-value.
(a)
Compute \(F=\frac{MSG}{MSE}\text{.}\) This is a comparison between the variation between groups and the variation within groups.
(b)
The \(F\)-distribution is a distribution with two parameters, d_f1 and d_f2, two sets of degrees of freedom. We compute the proportion of the distribution greater than \(F\text{,}\) where d_f1 \(=df_G\) and d_f2 \(=df_E\text{.}\)
Adjust the values of d_f1, d_f2 and F to compute the \(p\)-value p.
(c)
Edit and run the following code to produce the \(p\)-value through R.
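In R, the upper tail of the \(F\)-distribution is available through the built-in pf() function. A minimal sketch, using placeholder values for the statistic and degrees of freedom (the activity's actual numbers will differ):

```r
F_stat <- 4.5   # hypothetical F statistic
df_G   <- 2     # degrees of freedom for groups
df_E   <- 9     # degrees of freedom for errors

# Proportion of the F(df_G, df_E) distribution beyond F_stat
p <- pf(F_stat, df1 = df_G, df2 = df_E, lower.tail = FALSE)
p
```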
Remark 6.6.2.
The details of the \(F\) distribution are extremely technical. This is what you should take away: \(F=\frac{MSG}{MSE}\text{,}\) where \(MSG\) measures the variability between groups and \(MSE\) measures the variability within groups.
If \(MSE\) is big compared to \(MSG\text{,}\) then there is a lot of variability within groups. This variability easily explains the variability between the groups. Thus \(F\) is small, and the tail, and thus the \(p\)-value, is big. We fail to reject the null because the variability within groups plausibly explains our data.
On the other hand, if \(MSG\) is big compared to \(MSE\text{,}\) then there is more variability between groups than natural variation can explain, so there are likely some actual differences between the groups. In these cases, \(F\) is big, the tails are small, and so is the \(p\)-value. We reject the null (when the \(p\)-value is \(\lt 0.05\)) when it's no longer plausible that natural variation within groups can explain our data.
Remark 6.6.3. Recap of ANOVA.
To recall the steps of ANOVA from Activity 6.6.3, Activity 6.6.4 and Activity 6.6.5: suppose that we have \(n\) total observations from \(k\) different groups, with \(n_i\) observations from group \(i\text{.}\) We then:
Compute \(\bar{x}\) the overall sample mean.
Compute \(\bar{x}_i\) the sample mean for each group.
Compute \(SSG=\sum n_i(\bar{x}_i-\bar{x})^2\) the sum of squares for groups.
Compute \(df_G=k-1\) the degrees of freedom for groups.
Compute \(MSG=\frac{SSG}{df_G}\) the mean squares for groups.
Compute \(Var_i\) for each group.
Compute \(SSE=\sum (n_i-1)Var_i\) the sum of squares for errors.
Compute \(df_E=\sum (n_i-1)=n-k\) the degrees of freedom for errors.
Compute \(MSE=\frac{SSE}{df_E}\) the mean squares for errors.
Compute \(F=\frac{MSG}{MSE}\) the \(F\) statistic.
Compute \(P(X\gt F)\text{,}\) where \(X\) is an \(F\)-distributed variable with parameters \(df_G, df_E\text{.}\) This value is the \(p\)-value.
Reject the null hypothesis if \(p\)-value \(\lt 0.05\text{.}\)
Marvel that even for statistics, this is a lot of computation.
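Happily, a computer can marvel on our behalf. The recipe above can be carried out mechanically; here is a sketch in R with made-up yields standing in for the activity's table (the actual values will differ), following the steps in order:

```r
# Hypothetical data: 3 groups of 4 observations each
yields <- list(A = c(10, 12, 11, 13),
               B = c(14, 15, 13, 16),
               C = c(10, 11, 12,  9))

n_i    <- sapply(yields, length)   # group sizes
xbar_i <- sapply(yields, mean)     # group sample means
xbar   <- mean(unlist(yields))     # overall sample mean

SSG  <- sum(n_i * (xbar_i - xbar)^2)   # sum of squares for groups
df_G <- length(yields) - 1             # k - 1
MSG  <- SSG / df_G                     # mean square for groups

Var_i <- sapply(yields, var)           # sample variance of each group
SSE   <- sum((n_i - 1) * Var_i)        # sum of squares for errors
df_E  <- sum(n_i - 1)                  # n - k
MSE   <- SSE / df_E                    # mean square error

F_stat  <- MSG / MSE
p_value <- pf(F_stat, df_G, df_E, lower.tail = FALSE)
c(F = F_stat, p = p_value)
```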
Subsection 6.6.2 ANOVA and R
Remark 6.6.4.
Note that none of the steps in Remark 6.6.3 are particularly difficult, just long and tedious. They're fairly straightforward tasks, and one can automate much of the work. By entering the group sizes, sample means and sample standard deviations into N, M, S below, one can compute all the necessary values:
But we could automate more than that.
Activity 6.6.6. Fertilizer and Crop Yields: R.
We can use commands in R to do ANOVA directly from data.
(a)
Run the following code to produce a data frame crops with variables yield and fertilizer.
(b)
Run the following code to create an aov model of crops and summarize it.
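A sketch of both steps together, with hypothetical yields standing in for the activity's table:

```r
# Hypothetical yields; substitute the values from the activity's table
crops <- data.frame(
  yield = c(10, 12, 11, 13,   # Fertilizer A
            14, 15, 13, 16,   # Fertilizer B
            10, 11, 12,  9),  # Fertilizer C
  fertilizer = rep(c("A", "B", "C"), each = 4)
)

fit <- aov(yield ~ fertilizer, data = crops)
summary(fit)  # reports Df, Sum Sq, Mean Sq, F value and Pr(>F)
```

The summary's Pr(>F) column is exactly the \(p\)-value produced by the hand computation in the earlier activities.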
(c)
State the meaning of the \(p\)-value within the context of this problem in a complete sentence.
(d)
Do we reject the null hypothesis?
(e)
What sort of error could have been made? (Type 1 or Type 2)
Activity 6.6.7. Batting Average and Position: R.
We can finish what we started with Exploration 6.6.1.
(a)
Run the following code to re-plot a boxplot for the batting averages of batters, separated by position.
(b)
Run the following code to create an aov model of batters and summarize it.
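A sketch of that model, again assuming the columns AVG and position and OpenIntro's standard csv path:

```r
mlb18   <- read.csv("https://www.openintro.org/data/csv/mlb_players_18.csv")
batters <- subset(mlb18, !(position %in% c("P", "DH")))

# Batting average modeled against position
fit <- aov(AVG ~ position, data = batters)
summary(fit)
```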
(c)
State the meaning of the \(p\)-value within the context of this problem in a complete sentence.
(d)
Do we reject the null hypothesis?
(e)
What sort of error could have been made? (Type 1 or Type 2)
Activity 6.6.8. Relaxation and Degree Attainment: R.
Is there any relationship between degree attainment and the amount of time spent relaxing?
Run the following code to download the gss2010.csv data set, which contains information from the 2010 General Social Survey, and display its variables: https://www.openintro.org/data/index.php?data=gss2010.
(a)
Run the following code to plot a boxplot for the hours of relaxation in a day, separated by degree attainment.
(b)
Run the following code to create an aov model of gss2010 and summarize it.
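A sketch of the download, plot and model together, assuming the relaxation hours and degree columns are named hrsrelax and degree:

```r
gss2010 <- read.csv("https://www.openintro.org/data/csv/gss2010.csv")

# Hours spent relaxing per day, split by highest degree attained
boxplot(hrsrelax ~ degree, data = gss2010)

fit <- aov(hrsrelax ~ degree, data = gss2010)
summary(fit)  # rows with missing values are dropped automatically
```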
(c)
State the meaning of the \(p\)-value within the context of this problem in a complete sentence.
(d)
Do we reject the null hypothesis?
(e)
What sort of error could have been made? (Type 1 or Type 2)
(f)
mntlhlth measures “For how many days during the past 30 days was your mental health, which includes stress, depression, and problems with emotions, not good?”
Re-run the above comparing this variable across degree attainment.
(g)
hrs1 measures “Hours worked each week.”
Re-run the above comparing this variable across degree attainment.
Activity 6.6.9. Starbucks Nutrition and Item Type: R.
Does nutrition vary across the types of items Starbucks serves: bakery, bistro box, hot breakfast, parfait, petite, salad, and sandwich?
Run the following code to download the starbucks.csv data set, which contains nutritional information about 77 Starbucks menu items, and display its variable names: https://www.openintro.org/data/index.php?data=starbucks.
(a)
Run the following code to plot a boxplot for fiber (fiber in grams), separated by food type type.
(b)
Run the following code to create an aov model of starbucks and summarize it.
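A sketch of the download, plot and model together, assuming columns named fiber and type:

```r
starbucks <- read.csv("https://www.openintro.org/data/csv/starbucks.csv")

# Fiber in grams, one box per item type
boxplot(fiber ~ type, data = starbucks)

fit <- aov(fiber ~ type, data = starbucks)
summary(fit)
```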
(c)
State the meaning of the \(p\)-value within the context of this problem in a complete sentence.
(d)
Do we reject the null hypothesis?
(e)
What sort of error could have been made? (Type 1 or Type 2)
(f)
Re-run the above comparing another nutritional variable across food type.