Section 6.6 ANOVA: Analysis of Variation (N6)
Observations are independent across groups.
Observations within groups are approximately normal.
Variability across groups is roughly equal.
The means are the same across groups, that is: At least one of the means is different from another.
Exploration 6.6.1. Batting Average and Position.
Is there any difference in batting average between different positions? We examine 1270 players from the 2018 MLB season, and observe their batting average and position: 1B
= first base, 2B
= second base, 3B
= third base, C
= catcher, CF
= center field (outfield), DH
= designated hitter, LF
= left field (outfield), P
= pitcher, RF
= right field (outfield), SS
= shortstop.
Run the following code to download the mlb_players_18.csv
the data set which contains information about 1270 players from the 2018 MLB season and display it's variables:
Click here to learn more about this data set:xxxxxxxxxx
mlb18 = read.csv("https://github.com/TienChih/tbil-stats/raw/main/data/mlb_players_18.csv")
β
names(mlb18)
https://www.openintro.org/data/index.php?data=mlb_players_18
.
(a)
Since they are special players, let's consider a subset of the players without pitchers or designated hitters. Run the following code to produce batters
a subset of mlb18
with no pitchers or designated hitters.
xxxxxxxxxx
batters=subset(mlb18, mlb18$position!="P" & mlb18$position!="DH")
summary(batters)
(b)
Run the following code to plot a boxplot for the batting averages of batters
seperated by position.
Is it obvious that these values come from different distributions? Is it obvious they couldn't have?xxxxxxxxxx
boxplot(batters$AVG~batters$position)
Exploration 6.6.2. Randomness.
Much like in Exploration 6.4.2, the issue is that it's certainly possible for two different distributions to produce similar looking samples, or for identical distributions to produce very different looking examples.
(a)
Run the following code to produce a data frame with two variables, a group
variable with values A
, B
and C
, and a values
variable that is identical across group, and plot boxplots of values
across the groups.
xxxxxxxxxx
group1=c(rep("A", 30), rep("B", 30), rep("C", 30))
values1=c(rnorm(30,100,35), rnorm(30,100,35), rnorm(30,100,35))
β
dummy1 = data.frame(group1, values1)
summary(dummy1)
boxplot(dummy1$values1~dummy1$group1)
(b)
Run the following code to produce another data frame with the same variables, but the B
group has a different distribution than the others.
xxxxxxxxxx
group2=c(rep("A", 30), rep("B", 30), rep("C", 30))
values2=c(rnorm(30,100,35), rnorm(30,90,35), rnorm(30,100,35))
β
dummy2 = data.frame(group2, values2)
summary(dummy2)
boxplot(dummy2$values2~dummy2$group2)
Subsection 6.6.1 Intuition of ANOVA
Remark 6.6.1. Idea behind ANOVA.
The null hypothesis of ANOVA testing is that all of the groups have identical distributions, and the alternative is that they do not. If the null hypothesis were true, then it must follow that all the variation between the groups is driven by the natual variation of the variable itself.
So what we will do is compare the variation between groups, and variation within groups. If the null hypothesis holds, them we would expect that these values are comparable. If there is some discrepency, then we become more suspicious of the null. As usual, this level of suspicion will be measured with a
Activity 6.6.3. Fetilizer and Crop Yields: Variation between Groups.
Over the next few activities, we will explore how ANOVA works with a small data set, one that is honestly too samll to be considered a serious data set in practice. However, it will allow us to see the inner workings of ANOVA without getting bogged down in huge computations.
A farmer tests 3 types of fertilizer, A, B and C, on her crops. She segements her farmland into different plots, applies one type of fertilizer for each plot, and in the end, records the bushels per acre for that plot. The results are as follows:
(a)
Compute
Hint. Desmosxxxxxxxxxx
β
(b)
Compute
Hint. Desmosxxxxxxxxxx
β
(c)
Now that we have the overall mean, and the mean of each group, we compute the βvariationβ amongst the groups.
Compute
(d)
Since there are
This value measure the variation between groups, normalized for the number of groups. (If there were a lot of groups, we would see a lot of variation even if they were distributed identically, so we account for that.)
Activity 6.6.4. Fetilizer and Crop Yields: Variation within Groups.
We now continue from Activity 6.6.3 to analyze variation within groups.
(a)
Luckily there is a statistic which measures variation (it's implied by the name). Compute
Hint. Desmosxxxxxxxxxx
β
(b)
Now that we have the variance within each group, we compute the the total variation.
Compute
(c)
Compute the βdegrees of freedom for errorsβ by summing the degrees of freedom for each group:
(d)
Compute the βmean square errorβ:
This value measure the variation within groups, normalized by the size of the groups.)
Activity 6.6.5. Fetilizer and Crop Yields: -statistic and -value.
We continue from Activity 6.6.4 to produce a
(a)
Compute
(b)
The d_f1
and d_f2
two sets of degrees of freedom. We compute the proportion of the distribution greater than d_f1
d_f2
Adjust the values for d_f1
, d_f2
and F
to compute the p
.
(c)
Edit and run the following code to produce the R
.
xxxxxxxxxx
pf(FIXME_F, FIXME_dfG, FICME_dfE, lower.tail = FALSE)
Remark 6.6.2.
The details of the
If
is big compared to then there is a lot of variability within groups. This variability easily explains the variability between the groups. Thus is small, and the tail, and thus -value, is big. We fail to reject the null because the variability within groups plausibly explains our data.On the other hand, if
is big compared to then there is a lot of variability between groups. More than the natural variation can explain. There is likely some actual differences between the groups. In these cases, is big, the tails are small and so are the -values. We reject the null (when -value ) when it's no longer plausible that natural variation within groups can explain our data.
Remark 6.6.3. Recap of ANOVA.
To recall the steps of ANOVA from Activity 6.6.3, Activity 6.6.4 and Activity 6.6.5: suppose that we have a
Compute
the overall sample mean.Compute
the sample mean for each group.Compute
the sum of squares for groups.Compute
the degrees of freedom for groups.Compute
the mean squares for groups.Compute
for each group.Compute
the sum of squares for errors.Compute
the degrees of freedom for errors.Compute
the mean squares for errors.Compute
the statistic.Compute
where is a distributed variable with parameters This value is the -value.Reject the null hypothesis if
-valueMarvel that even for statistics, this is a lot of computation.
Subsection 6.6.2 ANOVA and R
Remark 6.6.4.
Note that none of the steps in Remark 6.6.3 are particularly difficult, just long and tedious. But they're fairly straightforward tasks, and one can automate much of the work. By entering the group sizes, sample means and sample standard deviations into N
, M
, S
below, one can compute all the necessary values:
But we could automate more than that.
Activity 6.6.6. Fetilizer and Crop Yields: R
.
We can use commands in R
to do ANOVA directly from data.
(a)
Run the following code to produce a dataframe crops
with variables yield
and fertilizer
.
xxxxxxxxxx
fertilizer=c("A", "A", "A", "A", "B", "B", "B",
"C","C","C","C","C")
yield=c(176, 173, 175, 179, 168, 171, 172, 177, 179, 173, 176, 181)
β
crops = data.frame(yield, fertilizer)
(b)
Run the following code to create an aov
model of crop
and summarize it.
How do these values compare to what we found before?xxxxxxxxxx
cropsaov=aov(yield~fertilizer, data=crops)
summary(cropsaov)
(c)
State the meaning of the
(d)
Do we reject the null hypothesis?
(e)
What sort of error could have been made? (Type 1 or Type 2)
Activity 6.6.7. Fetilizer and Crop Yields: R
.
We can finish what we started with Exploration 6.6.1.
(a)
Run the following code to re-plot a boxplot for the batting averages of batters
seperated by position.
xxxxxxxxxx
boxplot(batters$AVG~batters$position)
(b)
Run the following code to create an aov
model of batters
and summarize it.
xxxxxxxxxx
battersaov=aov(AVG~position, data=batters)
summary(battersaov)
(c)
State the meaning of the
(d)
Do we reject the null hypothesis?
(e)
What sort of error could have been made? (Type 1 or Type 2)
Activity 6.6.8. Relaxation and Degree Attainment: R
.
Is there any impact on degree attainment and the amount of time spent relaxing?
Run the following code to download the gss2010.csv
data set which contains information from the 2010 social survey and display it's variables:
Click here to learn more about this data set:xxxxxxxxxx
gss2010 = read.csv("https://github.com/TienChih/tbil-stats/raw/main/data/gss2010.csv")
β
names(gss2010)
https://www.openintro.org/data/index.php?data=gss2010
.
(a)
Run the following code to rplot a boxplot for the hours of relaxation in a day seperated by degree attainment.
xxxxxxxxxx
boxplot(hrsrelax ~ degree, data=gss2010)
(b)
Run the following code to create an aov
model of gss2010
and summarize it.
xxxxxxxxxx
relaxaov=aov(hrsrelax ~ degree, data=gss2010)
summary(relaxaov)
(c)
State the meaning of the
(d)
Do we reject the null hypothesis?
(e)
What sort of error could have been made? (Type 1 or Type 2)
(f)
mntlhlth
measures βFor how many days during the past 30 days was your mental health, which includes stress, depression, and problems with emotions, not good?β
Re-run the above comparing this variable across degree attainment.
(g)
hrs1
measures βHours worked each week.β
Re-run the above comparing this variable across degree attainment.
Activity 6.6.9. Starbucks Nutrition and Item Type: R
.
Is there any impact on the type of item Starbucks serves, bakery
, bistro box
, hot breakfast
, parfait
, petite
, salad
, sandwich
,and nutrition?
Run the following code to download starbucks.csv
a data set comtaining information about 77 Starbucks menu items their nutritional value and show it's variable names.
Click here to learn more about this data set:xxxxxxxxxx
starbucks = read.csv("https://github.com/TienChih/tbil-stats/raw/main/data/starbucks.csv")
β
names(starbucks)
https://www.openintro.org/data/index.php?data=starbucks
.
(a)
Run the following code to rplot a boxplot for fiber
fiber in g, seperated by food type type
.
xxxxxxxxxx
boxplot(fiber ~ type, data=starbucks)
(b)
Run the following code to create an aov
model of starbucks
and summarize it.
xxxxxxxxxx
starbucksaov=aov(fiber ~ type, data=starbucks)
summary(starbucksaov)
(c)
State the meaning of the
(d)
Do we reject the null hypothesis?
(e)
What sort of error could have been made? (Type 1 or Type 2)
(f)
Re-run the above comparing another nutrtional variable across food type.