
Section 5.4 Chi-Squared Test and Independence (C4)

In this section, we compare samples from two different populations to see if the populations may have the same distribution. In other words, are the outcomes equally likely no matter which population we draw from?

Exploration 5.4.1. OER vs Predatory Publishers and Student Access.

Students across different sections of the same course are assigned different texts. Some students are assigned free OER resources as their text, and others are assigned overpriced texts from predatory publishers. After 2 weeks, the professors take a random anonymous poll of their classes to see how many students have managed to begin the assigned readings in their respective texts. The results are as follows:

\begin{equation*} \begin{array}{|c|c|c|} \hline \textbf{Sample Frequency} \amp \text{Has Done Reading} \amp \text{Has Not Yet Done Reading} \\ \hline \text{OER} \amp 66 \amp 14 \\ \hline \text{Predatory Publisher} \amp 112 \amp 58 \\ \hline \end{array} \end{equation*}

One professor claims that this shows a clear relationship between the type of resource assigned to students and whether or not they can complete their assigned tasks. Their colleague dismisses it and says any difference is purely due to random chance, and that the type of material has no impact on student accessibility. In other words, the colleague claims these things are independent.

(a)

How many students in all were surveyed? This is the sample size \(n\text{.}\)

(b)

Let \(R\) denote the event “student has done the reading”. What proportion of the sample has done the reading? This is \(P(R)\text{.}\)

(c)

Note the event \(R^c\) would denote “student has not done the reading”. Compute \(P(R^c)\text{.}\)

(d)

Let \(O\) denote the event “student was assigned an OER”. What proportion of the sample was assigned OERs? This is \(P(O)\text{.}\)

(e)

Note the event \(O^c\) would denote “student was assigned a publisher text”. Compute \(P(O^c)\text{.}\)

(f)

Suppose we took the colleague at their word and assumed these events were independent.

Using Remark 2.2.6 compute \(P(R\text{ and }O), P(R\text{ and }O^c), P(R^c\text{ and }O)\) and \(P(R^c\text{ and }O^c)\text{.}\)

(g)

Using the probabilities found in (f), fill out the following table with the expected frequencies if one took a sample of size \(n\) and the probabilities were as in (f).

\begin{equation*} \begin{array}{|c|c|c|} \hline \textbf{Expected Frequency} \amp \text{Has Done Reading} \amp \text{Has Not Yet Done Reading} \\ \hline \text{OER} \amp \amp \\ \hline \text{Predatory Publisher} \amp \phantom{\text{Has Not Yet Done Reading}} \amp \\ \hline \end{array} \end{equation*}

How different are these values from the sample?

Subsection 5.4.1 Testing for Independence

Remark 5.4.1. Test statistics for \(\chi^2\) tests for independence.

When we test whether the distribution of a variable is independent of which population it comes from, we proceed much as we did in Remark 5.3.1. There, we quantified the difference between the frequencies of the sample and the expected frequencies of the outcomes if we assumed they came from a certain distribution. Here, we will quantify the difference between the frequencies of the sample and the expected frequencies of the outcomes if we assumed the outcomes and populations were independent.

Suppose we have \(\ell\) populations \(P_1, \ldots, P_\ell\text{,}\) and a variable that takes one of \(k\) outcomes \(O_1,\ldots, O_k\) in each population. We also have a sample of size \(n\) where the frequency of occurrences from Population \(i\) with Outcome \(j\) is \(n_{i,j}\text{:}\)

\begin{equation*} \begin{array}{|c|c|c|c|c|} \hline \textbf{Sample Frequency} \amp O_1 \amp O_2 \amp \cdots \amp O_k \\ \hline P_1 \amp n_{1,1} \amp n_{1,2} \amp \cdots \amp n_{1,k} \\ \hline P_2 \amp n_{2,1} \amp n_{2,2} \amp \cdots \amp n_{2,k} \\ \hline \vdots \amp \vdots \amp \vdots \amp \vdots \amp \vdots \\ \hline P_\ell \amp n_{\ell,1} \amp n_{\ell,2} \amp \cdots \amp n_{\ell,k} \\ \hline \end{array} \end{equation*}

We first note that if we select an arbitrary data point from this sample, the probability that it comes from Population \(P_i\) is the sum of row \(i\) divided by the sample size \(n\text{.}\) Similarly, the probability that it has Outcome \(O_j\) is the sum of column \(j\) divided by \(n\text{.}\) From this, if we assume the events \(P_i\) and \(O_j\) are independent, we have \(P(P_i\cap O_j)=P(P_i)P(O_j)\text{,}\) and the expected number of occurrences in a sample of size \(n\) from Population \(P_i\) with Outcome \(O_j\) is:

\begin{equation*} E_{i,j}=n\cdot P(P_i)P(O_j)=n\cdot \frac{\text{sum of row $i$}}{n}\frac{\text{sum of column $j$}}{n}=\frac{\text{sum of row $i$}\cdot \text{sum of column $j$}}{n}. \end{equation*}

From here we can compute a table of expected frequencies.

\begin{equation*} \begin{array}{|c|c|c|c|c|} \hline \textbf{Expected Frequency} \amp O_1 \amp O_2 \amp \cdots \amp O_k \\ \hline P_1 \amp E_{1,1} \amp E_{1,2} \amp \cdots \amp E_{1,k} \\ \hline P_2 \amp E_{2,1} \amp E_{2,2} \amp \cdots \amp E_{2,k} \\ \hline \vdots \amp \vdots \amp \vdots \amp \vdots \amp \vdots \\ \hline P_\ell \amp E_{\ell,1} \amp E_{\ell,2} \amp \cdots \amp E_{\ell,k} \\ \hline \end{array} \end{equation*}
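As a rough illustration of the formula for \(E_{i,j}\text{,}\) the R sketch below builds an expected frequency table from a small table of counts. The counts and the names `observed`, `row_totals`, `col_totals` and `expected` are made up for this example and do not come from any activity in this section.

```r
# Hypothetical 2 x 3 table of sample frequencies (made-up counts for illustration only).
observed <- matrix(c(20, 30, 50,
                     30, 30, 40),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("P1", "P2"), c("O1", "O2", "O3")))

n <- sum(observed)               # sample size
row_totals <- rowSums(observed)  # sum of row i, one entry per population
col_totals <- colSums(observed)  # sum of column j, one entry per outcome

# E[i, j] = (sum of row i) * (sum of column j) / n
expected <- outer(row_totals, col_totals) / n
expected
```

The call to `outer` multiplies every row total by every column total, so a single division by `n` gives the whole table at once. The expected frequencies always add up to the sample size, which is a quick check on a hand computation.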

Then, much like Remark 5.3.1, \(Z_{i,j}\) computes the test statistic for \(P_i\cap O_j\) and is computed in a similar way:

\begin{equation*} Z_{i,j}=\frac{n_{i,j}-E_{i,j}}{\sqrt{E_{i,j}}}. \end{equation*}

Then once again \(\chi^2\) is the sum of the squares of the \(Z_{i,j}\text{:}\)

\begin{equation*} \chi^2=\sum_{i=1}^{\ell}\sum_{j=1}^{k} Z_{i,j}^2. \end{equation*}
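The computation of the \(Z_{i,j}\) and of \(\chi^2\) can be sketched in R as follows. The table of counts is hypothetical (made up for this illustration), and the variable names `observed`, `expected`, `Z` and `chi_sq` are our own choices.

```r
# Hypothetical 2 x 3 table of sample frequencies (made-up counts for illustration only).
observed <- matrix(c(20, 30, 50,
                     30, 30, 40),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("P1", "P2"), c("O1", "O2", "O3")))

# Expected frequencies under independence: (row sum) * (column sum) / n
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# One test statistic per cell: Z[i, j] = (n[i, j] - E[i, j]) / sqrt(E[i, j])
Z <- (observed - expected) / sqrt(expected)
Z

# The chi-squared statistic is the sum of the squared Z values over all cells.
chi_sq <- sum(Z^2)
chi_sq
```

Because R does arithmetic on matrices entry by entry, the formula for `Z` is applied to every cell at once.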

Activity 5.4.2. OER vs Predatory Publishers and Student Access: test statistics.

Recall from Exploration 5.4.1 the following table of sample frequencies:

\begin{equation*} \begin{array}{|c|c|c|} \hline \textbf{Sample Frequency} \amp \text{Has Done Reading} \amp \text{Has Not Yet Done Reading} \\ \hline \text{OER} \amp 66 \amp 14 \\ \hline \text{Predatory Publisher} \amp 112 \amp 58 \\ \hline \end{array} \end{equation*}

Recall also the table of expected frequencies computed in Exploration 5.4.1 (g).

(a)

Use these two tables and Remark 5.4.1 to compute \(Z_{1,1}\text{,}\) the test statistic for “OER and Has Done Reading”.

(b)

Use these two tables and Remark 5.4.1 to compute \(Z_{1,2}\text{,}\) the test statistic for “OER and Has Not Yet Done Reading”.

(c)

Use these two tables and Remark 5.4.1 to compute \(Z_{2,1}\text{,}\) the test statistic for “Predatory Publisher and Has Done Reading”.

(d)

Use these two tables and Remark 5.4.1 to compute \(Z_{2,2}\text{,}\) the test statistic for “Predatory Publisher and Has Not Yet Done Reading”.

Remark 5.4.2. Steps to \(\chi^2\) Hypothesis Testing: Independence.

Given the set of hypotheses:

  • \(H_0\text{:}\)“The outcomes are independent of the populations.”

  • \(H_A\text{:}\)“The outcomes are not independent of the populations.”

We compute the \(p\)-value as the area of the tail of the \(\chi^2\) distribution to the right of the \(\chi^2\) value computed via Remark 5.4.1, with \((k-1)(\ell-1)\) degrees of freedom (recall that \(k\) is the number of possible outcomes of the categorical variable and \(\ell\) is the number of populations).

As in other hypothesis testing scenarios, the \(p\)-value measures the probability, assuming the null hypothesis is true, of seeing values as extreme as or more extreme than what was observed.

We then decide whether to reject the null hypothesis based on the level of significance \(\alpha\text{,}\) which is, as before, usually 0.05 or 5%. If the \(p\)-value is less than \(\alpha\text{,}\) we reject the null hypothesis; otherwise we fail to reject it. In this context, failing to reject the null means it is plausible that the populations and outcomes are independent. Rejecting the null means it is implausible that they are independent.
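The steps of this remark can be sketched in R using the built-in function `pchisq`. The values of `chi_sq`, `k` and `ell` below are hypothetical placeholders chosen only to show the mechanics; they are not the answer to any activity.

```r
# Hypothetical inputs (placeholders for illustration only).
chi_sq <- 3.2   # a chi-squared value computed as in Remark 5.4.1
k <- 3          # number of outcomes
ell <- 2        # number of populations

deg_free <- (k - 1) * (ell - 1)   # degrees of freedom

# Area of the right tail of the chi-squared distribution beyond chi_sq.
p_value <- pchisq(chi_sq, df = deg_free, lower.tail = FALSE)
p_value

alpha <- 0.05
reject_null <- p_value < alpha    # TRUE means we reject the null hypothesis
reject_null
```

Setting `lower.tail = FALSE` asks for the area to the right of `chi_sq`, which is the tail used in this test.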

Activity 5.4.3. OER vs Predatory Publishers and Student Access: test statistics.

Recall from Activity 5.4.2 the \(\chi^2\) value you computed.

(a)

What is \(k\) in this problem? What is \(\ell\text{?}\) How many degrees of freedom do we have?

(c)

State the meaning of the \(p\)-value within the context of this problem in a complete sentence.

(d)

If we had a level of significance \(\alpha=0.05\) do we reject the null hypothesis?

(e)

Is it plausible for the type of assigned materials and whether or not students do the reading to be independent?

Activity 5.4.4. \(\chi^2\) independence testing with R.

We can use R to enter the data and compute the \(\chi^2\) statistic and \(p\)-value.

(a)

Run the following code to input the sample data from Exploration 5.4.1 as a matrix.
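The original code cell is not reproduced in this text. A minimal sketch of one way to enter the table, assuming a matrix with named rows and columns (the name `reading` and the row and column labels are our own choices), is:

```r
# Sample frequencies from Exploration 5.4.1, entered row by row.
# The object name and the labels are assumptions made for this sketch.
reading <- matrix(c(66, 14,
                    112, 58),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("OER", "PredatoryPublisher"),
                                  c("HasDoneReading", "HasNotYetDoneReading")))
reading
```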

If you use this method be sure to not use spaces in your names.

(b)

Run the following code to compute a \(\chi^2\) statistic, \(p\)-value and degrees of freedom.
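The original code cell is not reproduced in this text. A minimal sketch using R's built-in `chisq.test` is given below; the name `reading` and the labels are our own choices. For a table with 2 rows and 2 columns, `chisq.test` applies a continuity correction by default, so the sketch passes `correct = FALSE` to match the computation described in Remark 5.4.1. Whether the original cell did the same is an assumption.

```r
# Sample frequencies from Exploration 5.4.1 (labels are assumptions made for this sketch).
reading <- matrix(c(66, 14,
                    112, 58),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("OER", "PredatoryPublisher"),
                                  c("HasDoneReading", "HasNotYetDoneReading")))

# correct = FALSE turns off the continuity correction used by default for 2 x 2 tables.
result <- chisq.test(reading, correct = FALSE)

result$expected    # table of expected frequencies
result$statistic   # chi-squared statistic
result$parameter   # degrees of freedom
result$p.value     # p-value
```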

How do these values compare to what you found in Activity 5.4.3?

Activity 5.4.5. Gender and Protein Preferences.

A restaurateur wonders if there's any relationship between the type of meat her customers order and their gender. She surveys 500 customers, 218 men and 282 women. The choices for meat are Beef, Chicken and Pork. The results are as follows:

\begin{equation*} \begin{array}{|c|c|c|c|} \hline \textbf{Sample Frequency} \amp \text{Beef} \amp \text{Chicken} \amp \text{Pork} \\ \hline \text{Men} \amp 65 \amp 108 \amp 45 \\ \hline \text{Women} \amp 84 \amp 136 \amp 62 \\ \hline \end{array} \end{equation*}

(a)

State a null and alternative hypothesis for the \(\chi^2\) independence test.

(b)

Use any method to compute a \(\chi^2\) statistic and a \(p\)-value.

Hint. Desmos

(c)

State the meaning of the \(p\)-value within the context of this problem in a complete sentence.

(d)

If we had a level of significance \(\alpha=0.05\) do we reject the null hypothesis?

(e)

Is it plausible for gender and meat choice to be independent?

(f)

Fix and run the following code to input the sample data and perform a \(\chi^2\) independence test.

How do the \(\chi^2\) value and the \(p\)-value compare to what you found in (b)?

Activity 5.4.6. Pew Survey on Energy Sources in 2018.

We examine data from a US-based survey on support for expanding six different sources of energy: solar, wind, offshore drilling, hydraulic fracturing ("fracking"), coal, and nuclear.

Run the following code to download the pew_energy_2018.csv data set and to display its variables. To learn more about this data, see https://www.openintro.org/data/index.php?data=pew_energy_2018

(a)

A researcher is curious to see if a person's position on expanding solar energy is independent of their attitude towards expanding coal mining. State a null and alternative hypothesis for the \(\chi^2\) independence test.

(b)

Run the following code to display a sample frequency table comparing support levels for expanding solar energy as rows and expanding coal mining as columns.

(c)

Run the following code to show a mosaic plot comparing support of expansion for solar and coal energy. What can you tell from this plot?

(d)

Run the following code to run the \(\chi^2\) independence test on energy$solar_panel_farms and energy$coal_mining.
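The original code cell is not reproduced in this text. As a sketch of the kind of call involved, the hypothetical helper `independence_test` below builds a frequency table from two categorical vectors and passes it to `chisq.test`. The toy vectors `toy_solar` and `toy_coal` are made-up stand-ins so the sketch runs without downloading anything; they are not the survey data. The commented last line assumes the data set has already been loaded into a data frame named `energy`.

```r
# Hypothetical helper: chi-squared independence test for two categorical vectors.
independence_test <- function(x, y) {
  freq <- table(x, y)   # sample frequency table: x as rows, y as columns
  print(freq)
  chisq.test(freq)      # applies a continuity correction when the table is 2 x 2
}

# Made-up stand-in responses (NOT the survey data), 100 respondents.
toy_solar <- rep(c("favor", "oppose"), times = c(60, 40))
toy_coal  <- c(rep(c("favor", "oppose"), times = c(20, 40)),
               rep(c("favor", "oppose"), times = c(30, 10)))

toy_result <- independence_test(toy_solar, toy_coal)
toy_result

# With the real data loaded as `energy` (an assumption), the call would be:
# independence_test(energy$solar_panel_farms, energy$coal_mining)
```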

(e)

State the meaning of the \(p\)-value within the context of this problem in a complete sentence.

(f)

If we had a level of significance \(\alpha=0.05\) do we reject the null hypothesis?

(g)

Is it plausible for support for solar and coal expansion to be independent?

(h)

Pick any two energy sources and repeat the steps above for those energy sources.

Activity 5.4.7. Movie Data.

We examine data obtained from IMDB and Rotten Tomatoes. The data represent 456 randomly sampled movies released between 1972 and 2014 in the United States.

Run the following code to download the movies.Rdata data set and to display its variables.

(a)

We're curious to see if the genre of a movie and how Rotten Tomatoes rates it are independent. State a null and alternative hypothesis for the \(\chi^2\) independence test.

(b)

Run the following code to display a sample frequency table comparing the genre of a movie and its Rotten Tomatoes rating.

(c)

Run the following code to show a mosaic plot comparing the genre of a movie and its Rotten Tomatoes rating. What can you tell from this plot?

(d)

Run the following code to run the \(\chi^2\) independence test on movies$genre and movies$critics_rating.

(e)

State the meaning of the \(p\)-value within the context of this problem in a complete sentence.

(f)

If we had a level of significance \(\alpha=0.05\) do we reject the null hypothesis?

(g)

Is it plausible for the genre of a movie and the Rotten Tomatoes rating to be independent?

(h)

Run the following code to obtain summaries of the variables. Which are categorical?
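The original code cell is not reproduced in this text; with the data loaded as `movies`, a call such as `summary(movies)` would print one summary per variable. As a complementary sketch, the hypothetical helper `categorical_columns` below lists the columns of a data frame that are stored as factors or character strings. The data frame `toy_movies` is a made-up stand-in so the sketch runs on its own; only the column names `genre` and `critics_rating` are taken from this activity.

```r
# Hypothetical helper: names of the columns stored as factors or character strings.
categorical_columns <- function(data) {
  is_cat <- vapply(data,
                   function(col) is.factor(col) || is.character(col),
                   logical(1))
  names(data)[is_cat]
}

# Made-up stand-in data frame (NOT the movies data set).
toy_movies <- data.frame(
  genre = factor(c("Drama", "Comedy", "Drama")),
  runtime = c(100, 95, 120),
  critics_rating = c("Fresh", "Rotten", "Fresh"),
  stringsAsFactors = FALSE
)

summary(toy_movies)               # one summary per variable
categorical_columns(toy_movies)   # names of the categorical variables
```

A variable stored as numbers can still be categorical in meaning (a year or an ID code, for instance), so the list from such a helper is a starting point and not a final answer.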

(i)

Repeat the above steps comparing any two categorical variables of your choice.