Comparing Two Categorical Variables (C2)

Section 5.2 Comparing Two Categorical Variables (C2)

In this section, we compare proportions between two categorical variables.

Exploration 5.2.1. Gender and Raises.

A watchdog group concerned about gender discrimination at a company randomly polls 528 of the company's employees, 312 men and 216 women. Of those surveyed, 102 employees received raises at the end of last year: 72 men and 30 women. The watchdog group points out that the proportion of men who got raises is significantly greater than the proportion of women who got raises and suggests this is due to inequitable practices within the company. The company claims that any difference between the rates of men and women getting raises is purely due to random chance.

(a)

Find \(\hat{p}\text{,}\) the proportion of the sampled employees of the company who got a raise.

(b)

How many more men than women in this sample got raises this year?

(c)

The company claims that rates of raises are equal across gender and any deviation in a given year is by chance. If this were true, we would expect more men than women to get raises since there are more men in the sample. But by how much?

Let's put the company's claim to the test. Run the following code, which sets raise_rate to the value \(\hat{p}\) found in (a), simulates how many of the num_men=312 men and num_women=216 women got raises, and calculates their difference d.

How does this value compare to what you found in (b)?

(d)

Let's take it a step further. Fix and run the following code to simulate 1000 random raise schedules and plot a distribution of the differences between the numbers of men and women who got raises. Set observed_diff to the value found in (b).

How likely is the difference you observed in (b)?

(e)

Run the following code to see the proportion of trials where the difference in raises was at least as large as the one found in (b).

(f)

How plausible is the company's claim that this difference in raise rates is purely due to chance?
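One way the simulation described in parts (c) through (e) might be sketched in Python, using only the standard library. The names num_men, num_women, raise_rate and observed_diff come from the text; the helper count_raises, the variable names diffs, mean_diff and frac_as_extreme, and the fixed seed are assumptions made for this illustration.

```python
import random

random.seed(0)  # fixed seed (an assumption) so the sketch is reproducible

num_men, num_women = 312, 216
raise_rate = 102 / 528     # overall raise rate in the sample, as in part (a)
observed_diff = 72 - 30    # men's raises minus women's raises, as in part (b)
trials = 1000

def count_raises(n, rate):
    """Simulate n employees who each get a raise with probability `rate`."""
    return sum(random.random() < rate for _ in range(n))

# Under the company's claim, men and women share the same raise rate.
diffs = [count_raises(num_men, raise_rate) - count_raises(num_women, raise_rate)
         for _ in range(trials)]

mean_diff = sum(diffs) / trials
frac_as_extreme = sum(d >= observed_diff for d in diffs) / trials

print(f"average simulated difference: {mean_diff:.1f}")
print(f"fraction of trials with difference >= {observed_diff}: {frac_as_extreme:.3f}")
```

The fraction printed at the end plays the role of the proportion asked about in part (e): a small value means that a gap as large as the observed one rarely arises when the raise rates really are equal.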

Subsection 5.2.1 Differences of Proportion Variables

Remark 5.2.1.

In this section, we sample two categorical variables \(X_1, X_2\text{.}\) Each variable has its own sample size \(n_1, n_2\) and own sample proportion \(\hat{p_1}, \hat{p_2}\text{.}\) If we want to compare the proportions of each variable, we do so by looking at their difference:

\begin{equation*} D=\hat{p_1}-\hat{p_2}. \end{equation*}

Remark 5.2.2.

Recall from Theorem 4.1.4 that the variance of the sampling distribution of a proportion is

\begin{equation*} Var_{\hat{p}}=\frac{p(1-p)}{n}. \end{equation*}

Activity 5.2.2. Smoking and Baby Weight.

A 1967 study shows that the probability that a baby has low birth weight is about 7.8% if the mother smokes, and 2.9% if the mother does not.

(a)

Suppose we sampled 500 babies from mothers who smoke (\(n_1=500\)). We are given the probability a given baby is underweight is \(p_1=0.078\text{.}\) Following Theorem 4.1.4 what is the expected proportion of this sample to be underweight: \(\hat{p_1}\text{?}\)

(b)

Use Remark 5.2.2 to find the variance of \(\hat{p_1}\text{:}\) \(Var_{\hat{p_1}}\text{.}\)

(c)

Suppose we sampled 1000 babies from mothers who do not smoke (\(n_2=1000\)). We are given the probability a given baby is underweight is \(p_2=0.029\text{.}\) Following Theorem 4.1.4 what is the expected proportion of this sample to be underweight: \(\hat{p_2}\text{?}\)

(d)

Use Remark 5.2.2 to find the variance of \(\hat{p_2}\text{:}\) \(Var_{\hat{p_2}}\text{.}\)

(e)

Let \(D\) denote the difference between the sampling distributions. Use Remark 2.5.1 to find \(\mu_D\text{.}\)

(f)

Use Remark 2.5.1 to find the variance of \(D\text{:}\) \(Var_D\text{.}\)

(g)

Find the standard deviation of \(D\text{:}\) \(SE_D=\sqrt{Var_D}\text{.}\)

(h)

Run the following code to simulate 1000 trials of sampling 500 babies from mothers who smoke, 1000 babies from mothers who do not smoke, taking the difference in proportions of babies who are underweight, and plotting a histogram of the results.

(i)

Fix and run the following code so that mu_D=\(\mu_D\) and SE_D=\(SE_D\text{,}\) then plot a normal curve on top of our histogram.

How well does the normal distribution approximate the histogram?

Remark 5.2.3. Difference of Proportions.

Let there be two random categorical variables with population proportions \(p_1, p_2\text{.}\) Let \(D\) be the random variable generated by taking random samples of size \(n_1, n_2\) of the two variables and taking the difference in their sample proportions. Then \(D\) is approximated by a normal variable with mean and standard deviation:

\begin{equation*} \mu_D=p_1-p_2, \quad SE_D=\sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}}. \end{equation*}
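One way to see where this standard error comes from, assuming the two samples are drawn independently of each other: the variance of a difference of independent random variables is the sum of their variances, so applying the variance formula of Remark 5.2.2 to each sample gives

\begin{equation*} Var_D=Var_{\hat{p_1}}+Var_{\hat{p_2}}=\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}, \qquad SE_D=\sqrt{Var_D}. \end{equation*}

Note that the variances add even though the proportions are subtracted.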

Activity 5.2.3. Toppings on Frozen Desserts.

Let's suppose that 60% of ice cream eaters and 45% of frozen yogurt eaters put toppings on their frozen dessert of choice. We sample 80 ice cream eaters and 110 frozen yogurt eaters.

(a)

Run the following code to sample the ice cream eaters, see how many got toppings, and see what proportion of them got toppings.

(b)

Run the following code to sample the frozen yogurt eaters, see how many got toppings, and see what proportion of them got toppings.

(c)

Run the following code to compute the difference in proportions between the samples.

(d)

Run the following code to simulate 1000 trials of random samplings, and plot a histogram of their differences.

(e)

Given that \(p_1=0.6, n_1=80, p_2=0.45, n_2=110\text{,}\) use Remark 5.2.3 to compute \(\mu_D\) the mean of the distribution of differences, and \(SE_D\text{,}\) the standard deviation of the distribution of differences.

(f)

Fix and run the following code to overlay a normal curve with mean mu_D=\(\mu_D\) and standard deviation SE_D=\(SE_D\) over the histogram you found.

How well does this normal curve approximate the distribution of differences?
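A rough numerical check of Remark 5.2.3 for this activity might look like the following Python sketch, which uses only the standard library and compares summary statistics instead of drawing a plot. The helper sample_proportion, the variable names, and the fixed seed are assumptions made for illustration.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed (an assumption) so the sketch is reproducible

p1, n1 = 0.60, 80     # ice cream eaters
p2, n2 = 0.45, 110    # frozen yogurt eaters
trials = 1000

def sample_proportion(n, p):
    """Proportion of n simulated eaters who add toppings, each with probability p."""
    return sum(random.random() < p for _ in range(n)) / n

# One difference in sample proportions per trial.
diffs = [sample_proportion(n1, p1) - sample_proportion(n2, p2)
         for _ in range(trials)]

sim_mean = statistics.mean(diffs)
sim_sd = statistics.stdev(diffs)

# Values predicted by Remark 5.2.3.
mu_D = p1 - p2
SE_D = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

print(f"simulated mean {sim_mean:.4f} vs mu_D {mu_D:.4f}")
print(f"simulated sd   {sim_sd:.4f} vs SE_D {SE_D:.4f}")
```

If the normal approximation is reasonable, the simulated mean and standard deviation should land close to \(\mu_D\) and \(SE_D\text{.}\)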

Subsection 5.2.2 Confidence Intervals for Differences of Proportions

Remark 5.2.4.

Following the same reasoning as in Section 4.2, we can compute a confidence interval for the difference of true proportions. Suppose you have two categorical variables from which you collect samples of size \(n_1, n_2\) and obtain sample proportions \(\hat{p}_1, \hat{p}_2\text{.}\) Then a C% confidence interval for the difference between the true proportions \(p_1-p_2\) is an interval centered at \(d=\hat{p}_1-\hat{p}_2\) which has a C% chance of containing \(p_1-p_2\text{.}\)

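We can compute this interval to be

\begin{equation*} [(\hat{p}_1-\hat{p}_2)-z^*SE_D, (\hat{p}_1-\hat{p}_2)+z^*SE_D] \end{equation*}

where \(SE_D=\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}\) is the formula from Remark 5.2.3 with the sample proportions standing in for the unknown \(p_1, p_2\text{,}\) and \(z^*\) is found as in Remark 4.2.5: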

\begin{equation*} \begin{array}{|c|c|} \hline C\% \amp z^*\\ \hline 90\% \amp 1.645\\ 95\% \amp 1.96\\ 99\% \amp 2.576 \\ \hline \end{array} \end{equation*}
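The critical values in this table can be recovered from the standard normal distribution. The sketch below uses Python's standard-library NormalDist; the function name z_star is a hypothetical helper introduced only for this illustration.

```python
from statistics import NormalDist

def z_star(confidence):
    """Critical value with the central `confidence` area of the standard normal between -z* and z*."""
    # The area to the left of z* is the central area plus one tail.
    return NormalDist().inv_cdf((1 + confidence) / 2)

for c in (0.90, 0.95, 0.99):
    print(f"{c:.0%}: z* = {z_star(c):.3f}")
```

Rounded to three decimal places, these agree with the tabulated values.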

Example 5.2.5. Gender and Raises Revisited.

We recall from Exploration 5.2.1 that we had \(n_1=312\) men whose sample raise rate was \(\hat{p_1}=\frac{72}{312}\) and \(n_2=216\) women whose sample raise rate was \(\hat{p_2}=\frac{30}{216}\text{.}\) Using these sample proportions in the standard error formula, we find that

\begin{equation*} SE_D=\sqrt{\frac{\frac{72}{312}\left(1-\frac{72}{312}\right)}{312}+\frac{\frac{30}{216}\left(1-\frac{30}{216}\right)}{216}}\approx 0.033506. \end{equation*}

The difference between the sample proportions is

\begin{equation*} d=\frac{72}{312}-\frac{30}{216}\approx 0.09188. \end{equation*}

So we can compute a 95% confidence interval for the difference of proportions by

\begin{align*} \amp[(\hat{p}_1-\hat{p}_2)-z^*SE_D, (\hat{p}_1-\hat{p}_2)+z^*SE_D]\\ \amp \approx [0.09188-1.96\cdot 0.033506, 0.09188+1.96\cdot 0.033506]\\ \amp \approx[0.0262, 0.1576]. \end{align*}

That is, we are 95% confident that the difference in raise rates between men and women is between 2.62 and 15.76 percentage points. Note that the company's claim was that raise rates were the same, that is, that the difference in raise rates was zero. Since zero is not in this interval, the claim seems unlikely given this evidence.
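The arithmetic of this example can be checked with a short Python sketch. The function name diff_proportion_ci and its parameter names are hypothetical, introduced only for this illustration; the default z_star=1.96 is the 95% value from the table in Remark 5.2.4.

```python
import math

def diff_proportion_ci(x1, n1, x2, n2, z_star=1.96):
    """Confidence interval for p1 - p2 from success counts x1, x2 and sample sizes n1, n2."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    d = p1_hat - p2_hat
    # Standard error built from the sample proportions.
    se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    return d - z_star * se, d + z_star * se

# Counts from Exploration 5.2.1: 72 of 312 men and 30 of 216 women got raises.
lower, upper = diff_proportion_ci(72, 312, 30, 216)
print(f"[{lower:.4f}, {upper:.4f}]")
```

Passing a different z_star, such as 1.645 or 2.576, gives the 90% or 99% interval from the same data.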

Activity 5.2.4. Spicy Wings.

A restaurateur owns two separate restaurants, an East and a West location. Before contacting her vendors, she wants to know if there is any difference in the rate at which customers at these locations order Spicy Wings. Her plan is to poll 100 customers from each location and compute a 90% confidence interval for the difference in ordering rates of Spicy Wings.

(a)

Run the following code to poll 100 customers in the East location and see how many order spicy wings.

We know that \(n_1=100\text{,}\) compute \(\hat{p_1}\text{,}\) the proportion of the East location sample who orders Spicy Wings.

(b)

Run the following code to poll 100 customers in the West location and see how many order spicy wings.

We know that \(n_2=100\text{,}\) compute \(\hat{p_2}\text{,}\) the proportion of the West location sample who orders Spicy Wings.

(c)

Following Remark 5.2.4, find \(d, SE_D\text{.}\)

(d)

Use Remark 5.2.4 and \(d, SE_D\) to find a 90% confidence interval for the difference in East location ordering rates of Spicy Wings and West location ordering rates of Spicy Wings. Is 0 in this interval?

(e)

Do you have good evidence that the ordering rates between the locations are different? If so, who orders more wings?

(f)

Run the following code to print the lower and upper bounds of the confidence interval to check your answer.

(g)

Run the following code to print the actual difference between the ordering rates. Is it in your interval? (10% of the time it won't be!)
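The claim that a 90% interval misses the true difference about 10% of the time can be explored by simulation. In the Python sketch below, the true ordering rates TRUE_P_EAST and TRUE_P_WEST are invented for illustration (the activity does not state them), and the helper name and fixed seed are likewise assumptions.

```python
import math
import random

random.seed(2)  # fixed seed (an assumption) so the sketch is reproducible

TRUE_P_EAST, TRUE_P_WEST = 0.30, 0.25   # hypothetical true ordering rates
n = 100                                  # customers polled at each location
z_star = 1.645                           # 90% critical value
trials = 2000

def sample_proportion(n, p):
    """Proportion of n simulated customers who order wings, each with probability p."""
    return sum(random.random() < p for _ in range(n)) / n

true_diff = TRUE_P_EAST - TRUE_P_WEST
covered = 0
for _ in range(trials):
    p1_hat = sample_proportion(n, TRUE_P_EAST)
    p2_hat = sample_proportion(n, TRUE_P_WEST)
    d = p1_hat - p2_hat
    se = math.sqrt(p1_hat * (1 - p1_hat) / n + p2_hat * (1 - p2_hat) / n)
    # Count the trial if the interval built from this sample contains the true difference.
    if d - z_star * se <= true_diff <= d + z_star * se:
        covered += 1

coverage = covered / trials
print(f"fraction of intervals containing the true difference: {coverage:.3f}")
```

The printed fraction should be near 0.90, though it will vary somewhat with the seed and the assumed rates.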


Subsection 5.2.3 Hypothesis Testing with two Categorical Variables

Remark 5.2.6.

We approach hypothesis testing of the difference of proportion variables in a similar spirit to Remark 4.3.15. From Remark 5.2.3, we let \(D\) be the difference in sample proportions generated by samples of size \(n_1, n_2\) from variables with true proportions \(p_1, p_2\text{.}\) Then our alternative hypothesis takes one of the following forms:

\(\displaystyle H_A:p_1-p_2\lt p_0\)
\(\displaystyle H_A:p_1-p_2> p_0\)
\(\displaystyle H_A:p_1-p_2\neq p_0\)

We note here \(p_1-p_2=\mu_D\text{.}\) The null hypothesis for all of these is \(H_0: p_1-p_2=p_0\text{.}\)

Then, given an observed difference in proportions \(d=\hat{p_1}-\hat{p_2}\text{,}\) we compute the corresponding \(p\)-value by assuming the null hypothesis \(p_1-p_2=p_0\) and then computing:

\(\displaystyle P(D\lt d)\)
\(\displaystyle P(D> d)\)
\(\displaystyle P(|D-p_0|> |d-p_0|)\)

Note that we compute these by finding the area of the left tail, the right tail, or both tails, respectively, as in Definition 4.3.8.
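The three tail areas can be computed from the normal model of \(D\) under the null hypothesis. The following Python sketch is one way to do so with the standard library; the function name p_value, its alternative labels, and the numbers in the usage lines are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

def p_value(d, p0, se, alternative):
    """Tail area for an observed difference d when D is modeled as Normal(p0, se) under H0."""
    null_dist = NormalDist(mu=p0, sigma=se)
    if alternative == "less":        # H_A: p1 - p2 < p0, left tail
        return null_dist.cdf(d)
    if alternative == "greater":     # H_A: p1 - p2 > p0, right tail
        return 1 - null_dist.cdf(d)
    if alternative == "two-sided":   # H_A: p1 - p2 != p0, both tails
        distance = abs(d - p0)
        return null_dist.cdf(p0 - distance) + (1 - null_dist.cdf(p0 + distance))
    raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")

# Made-up numbers: observed difference 0.08, null value 0, standard error 0.05.
for alt in ("less", "greater", "two-sided"):
    print(alt, round(p_value(0.08, 0.0, 0.05, alt), 4))
```

Because the normal curve is symmetric about \(p_0\text{,}\) the two-sided area is twice the smaller of the one-sided areas.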

Remark 5.2.7.

If the null hypothesis is that \(p_1-p_2=0\text{,}\) which is to say \(p_1=p_2\text{,}\) then assuming the null hypothesis means assuming both populations have the same proportion. In that case, we may as well treat them as one big population. So if we take samples of size \(n_1, n_2\) and have “successes” \(x_1, x_2\text{,}\) then we compute a “pooled” sample proportion:

\begin{equation*} \hat{p}_{pool}=\frac{x_1+x_2}{n_1+n_2}. \end{equation*}

We then use this pooled proportion as both \(\hat{p_1}\) and \(\hat{p_2}\) for the purpose of computing the standard error.

So we would use

\begin{equation*} SE_D=\sqrt{\frac{\hat{p}_{pool}(1-\hat{p}_{pool})}{n_1}+\frac{\hat{p}_{pool}(1-\hat{p}_{pool})}{n_2}}. \end{equation*}
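As a small illustration of the pooled standard error, here is a Python sketch with invented counts (40 successes out of 100 in the first sample, 30 out of 100 in the second). The function name pooled_se and all of the numbers are assumptions for this example and are not data from the text.

```python
import math
from statistics import NormalDist

def pooled_se(x1, n1, x2, n2):
    """Standard error of D under H0: p1 = p2, using the pooled sample proportion."""
    p_pool = (x1 + x2) / (n1 + n2)
    return math.sqrt(p_pool * (1 - p_pool) / n1 + p_pool * (1 - p_pool) / n2)

# Hypothetical counts, chosen only for illustration.
x1, n1, x2, n2 = 40, 100, 30, 100
se = pooled_se(x1, n1, x2, n2)
d = x1 / n1 - x2 / n2

# Right-tail p-value for H_A: p1 - p2 > 0, with D modeled as Normal(0, se).
p_val = 1 - NormalDist(mu=0, sigma=se).cdf(d)
print(f"SE_D = {se:.4f}, d = {d:.2f}, p-value = {p_val:.4f}")
```

Since both terms under the square root share the same numerator, the pooled standard error can also be written as \(\sqrt{\hat{p}_{pool}(1-\hat{p}_{pool})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}\text{.}\)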

Activity 5.2.5. Gender and Raises: Hypothesis Test.

Recall from Exploration 5.2.1 that we have \(n_1=312\) men and \(n_2=216\) women, of whom \(x_1=72\) men and \(x_2=30\) women got raises. The company claims that men and women have equal raise rates. The watchdog group claims that men get more raises than women. Suppose we use an \(\alpha=0.05\) level of significance.

Hint. Desmos

(a)

Which of the following best describes the null hypothesis \(H_0\text{?}\)

\(p_1-p_2\lt 0\text{.}\)
\(p_1-p_2> 0\text{.}\)
\(p_1-p_2\neq 0\text{.}\)
\(p_1-p_2= 0\text{.}\)

(b)

Which of the following best describes the alternative hypothesis \(H_A\text{?}\)

\(p_1-p_2\lt 0\text{.}\)
\(p_1-p_2> 0\text{.}\)
\(p_1-p_2\neq 0\text{.}\)
\(p_1-p_2= 0\text{.}\)

(c)

Compute \(\hat{p}_{pool}\text{.}\)

(d)

Assuming the null hypothesis, follow Remark 5.2.7 to compute \(\mu_D, SE_D\text{.}\)

(e)

Compute \(d=\hat{p_1}-\hat{p_2}\) the difference in proportions of men and women who got raises.

(f)

Let \(X\) be the normal variable with mean \(\mu_{D}\) and standard deviation \(SE_{D}\text{.}\) Compute \(P(X> p_0+|p_0-d|)\text{.}\)

(g)

Use the above value to compute the \(p\)-value. (This is an area of one tail.)

(h)

Compute the \(z\)-score for \(d\text{,}\) call this \(z_d\text{.}\)

(i)

Let \(Z\) denote the standard normal variable. Compute \(P(Z>|z_d|)\) and \(P(Z\lt -|z_d|)\text{.}\) How do these values compare to what you found in (f) and (g)?

(j)

Compute the \(p\)-value.

(k)

State the meaning of the \(p\)-value within the context of this problem in a complete sentence.

(l)

Do we reject the null hypothesis?

(m)

What sort of error could have been made? (Type 1 or Type 2)

(n)

Run the following code, which samples num_men=312 men, of whom \(men_{raise}=72\) get raises, then samples num_women=216 women, of whom \(women_{raise}=30\) get raises. It does this trials=1000 times and plots a histogram of the differences.

(o)

Fix and run the following code to plot a normal curve with mean mu_D=\(p_0\) and standard deviation SE_D=\(SE_D\text{,}\) and plot a line for the observed difference observed_diff=\(d\text{.}\)

How well does this curve match the histogram?

(p)

Run the following code to see what proportion of these differences are greater than or equal to the observed_diff. How does this compare to the \(p\)-value?

Remark 5.2.8.

In contrast to Remark 5.2.7, if the null hypothesis is that \(p_1-p_2=p_0\text{,}\) where \(p_0\neq 0\text{,}\) then we're not actually claiming that the population proportions are equal.

So when we assume the null, we would use

\begin{equation*} \mu_D=p_0, \quad SE_D=\sqrt{\frac{\hat{p}_{1}(1-\hat{p}_{1})}{n_1}+\frac{\hat{p}_{2}(1-\hat{p}_{2})}{n_2}}. \end{equation*}
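To illustrate the unpooled case, the following Python sketch runs a right-tailed test of \(H_0: p_1-p_2=p_0\) with a nonzero \(p_0\text{.}\) The function name shifted_null_test and the counts used (60 of 100 and 45 of 100, with \(p_0=0.1\)) are invented for this illustration.

```python
import math
from statistics import NormalDist

def shifted_null_test(x1, n1, x2, n2, p0):
    """Right-tailed test of H0: p1 - p2 = p0 against H_A: p1 - p2 > p0, with unpooled SE."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    d = p1_hat - p2_hat
    # The null does not say p1 = p2, so each sample keeps its own proportion.
    se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    p_val = 1 - NormalDist(mu=p0, sigma=se).cdf(d)
    return d, se, p_val

# Hypothetical counts, chosen only for illustration.
d, se, p_val = shifted_null_test(60, 100, 45, 100, p0=0.1)
print(f"d = {d:.2f}, SE_D = {se:.4f}, p-value = {p_val:.4f}")
```

The only changes from the pooled case are the center of the normal curve, which moves from 0 to \(p_0\text{,}\) and the standard error, which no longer pools the two samples.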

Activity 5.2.6. Study Guide and Exam Pass Rates.

Suppose that a group of students is studying for the notoriously difficult actuarial probability exam. A publishing company claims its materials raise the average pass rate by more than 10%. A math professor is skeptical of this claim. Suppose that out of 30 people who used the material, 15 passed, and out of 40 people who didn't use the material, 14 passed.

Let \(p_1\) denote the population proportion of people who used the material and passed, and \(p_2\) denote the population proportion of people who did not use the material and passed.

Hint. Desmos

(a)

Which of the following best describes the null hypothesis \(H_0\text{?}\)

\(p_1-p_2\lt 0.1\text{.}\)
\(p_1-p_2> 0.1\text{.}\)
\(p_1-p_2\neq 0.1\text{.}\)
\(p_1-p_2= 0.1\text{.}\)

(b)

Which of the following best describes the alternative hypothesis \(H_A\text{?}\)

\(p_1-p_2\lt 0.1\text{.}\)
\(p_1-p_2> 0.1\text{.}\)
\(p_1-p_2\neq 0.1\text{.}\)
\(p_1-p_2= 0.1\text{.}\)

(c)

Find or compute \(n_1, n_2, \hat{p_1}, \hat{p_2}\text{.}\)

(d)

Assuming the null hypothesis, follow Remark 5.2.8 to compute \(\mu_D, SE_D\text{.}\)

(e)

Compute \(d=\hat{p_1}-\hat{p_2}\text{,}\) the difference in pass rates between those who used the material and those who did not.

(f)

Let \(X\) be the normal variable with mean \(\mu_{D}\) and standard deviation \(SE_{D}\text{.}\) Compute the \(p\)-value \(P(X> d)\text{.}\)

(g)

State the meaning of the \(p\)-value within the context of this problem in a complete sentence.

(h)

Do we reject the null hypothesis?

(i)

What sort of error could have been made? (Type 1 or Type 2)

(j)

Is an increase in pass rates of more than 10% plausible?

Activity 5.2.7. Cancer in Dogs.

A study in 1994 examined 491 dogs that had developed cancer and 945 dogs as a control group to determine whether there is an increased risk of cancer in dogs that are exposed to the herbicide 2,4-Dichlorophenoxyacetic acid (2,4-D). We analyze their data and see what their conclusion would be.

Run the following code to download cancer_in_dogs.csv and display its variables.

order records whether or not each dog was exposed to 2,4-D, and response records whether or not it got cancer. To see a brief breakdown of the data, run the following code:

Let \(p_1\) denote the proportion of dogs exposed to 2,4-D that get cancer, and let \(p_2\) denote the proportion of dogs not exposed to 2,4-D that get cancer.

(a)

Which of the following best describes the null hypothesis \(H_0\text{?}\)

\(p_1-p_2\lt 0\text{.}\)
\(p_1-p_2> 0\text{.}\)
\(p_1-p_2\neq 0\text{.}\)
\(p_1-p_2= 0\text{.}\)

(b)

Which of the following best describes the alternative hypothesis \(H_A\text{?}\)

\(p_1-p_2\lt 0\text{.}\)
\(p_1-p_2> 0\text{.}\)
\(p_1-p_2\neq 0\text{.}\)
\(p_1-p_2= 0\text{.}\)

(c)

Run the following to show a two-way table of how many dogs were or weren't exposed to 2,4-D and how many did or didn't get cancer:

(d)

How many dogs from the sample were exposed to 2,4-D? Out of those dogs, how many got cancer? Let these be \(n_1, x_1\) respectively.

(e)

How many dogs from the sample were not exposed to 2,4-D? Out of those dogs, how many got cancer? Let these be \(n_2, x_2\) respectively.

(f)

Fix and run the following to perform the appropriate proportion hypothesis test. Enter how many dogs were exposed to 2,4-D and got cancer, were not exposed to 2,4-D and got cancer, were exposed to 2,4-D, and weren't exposed to 2,4-D, respectively.

(g)

State the meaning of the \(p\)-value within the context of this problem in a complete sentence.

(h)

Do we reject the null hypothesis?

(i)

What sort of error could have been made? (Type 1 or Type 2)

(j)

Is it plausible that exposure to 2,4-D does not increase the risk of cancer?
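For readers who want to see how a two-way table yields the counts \(n_1, x_1, n_2, x_2\text{,}\) here is a minimal Python sketch using only the standard library. The rows below are toy records invented for illustration and are not the study's data; the category labels are also assumptions about how order and response might be coded.

```python
from collections import Counter

# Toy (order, response) records invented for illustration; NOT the study's data.
rows = (
    [("2,4-D", "cancer")] * 3
    + [("2,4-D", "no cancer")] * 2
    + [("no 2,4-D", "cancer")] * 2
    + [("no 2,4-D", "no cancer")] * 5
)

table = Counter(rows)   # maps each (order, response) pair to its count
orders = sorted({order for order, _ in rows})
responses = sorted({response for _, response in rows})

# Print the two-way table: one row per exposure group, one column per outcome.
print(f"{'':10}" + "".join(f"{r:>12}" for r in responses))
for order in orders:
    print(f"{order:10}" + "".join(f"{table[(order, r)]:>12}" for r in responses))

# Group sizes and cancer counts, in the notation of parts (d) and (e).
n1 = sum(table[("2,4-D", r)] for r in responses)
x1 = table[("2,4-D", "cancer")]
n2 = sum(table[("no 2,4-D", r)] for r in responses)
x2 = table[("no 2,4-D", "cancer")]
print(f"n1={n1}, x1={x1}, n2={n2}, x2={x2}")
```

With the real data set, the same four counts would feed the pooled test of Remark 5.2.7, since the null hypothesis here is \(p_1-p_2=0\text{.}\)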