
Section 6.1 The \(t\)-Variable and Sampling Distribution of Numerical Variables (N1)

In this section, we transition away from categorical variables and begin looking at the sampling distributions of numerical variables.

Remark 6.1.1.

Recall from Section 1.4 that given a numerical variable \(X\) and a sample from \(X\) of size \(n\text{,}\) we can obtain a sample mean \(\bar{x}\text{.}\)

As we know, each time we sample \(X\text{,}\) we potentially obtain a different \(\bar{x}\text{,}\) so \(\bar{x}\) is a random variable from a sampling distribution. The random variable \(\bar{x}\) will play a role similar to that of \(\hat{p}\) in Chapter 5. We recognize \(\bar{x}\) as a point estimate for the true mean of \(X\text{:}\) \(\mu_X\text{.}\)

Exploration 6.1.1. Rolling Dice and Averaging the Results.

Suppose I roll num_dice dice and take the average result. Let \(X\) be the random variable indicating the outcome of a six-sided die.

(a)

Following Section 2.4, find \(E(X), Var(X), \sigma_X\text{.}\)

(b)

Let \(A_2\) be the average of two die rolls: \(A_2=\frac{1}{2}(X_1+X_2).\) Using Remark 2.5.1, find \(E(A_2), Var(A_2), \sigma_{A_2}.\)

(c)

Run the following code to simulate trials=1000 trials where one rolls num_dice=2 dice and takes the average, and display the mean and standard deviation of these outcomes:
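The interactive cell is not reproduced in this text. A minimal Python sketch of the intended simulation, assuming numpy and matplotlib are available (the language and environment of the original cell may differ):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng()
trials = 1000
num_dice = 2

# Each row is one trial: roll num_dice dice and take their average.
rolls = rng.integers(1, 7, size=(trials, num_dice))
averages = rolls.mean(axis=1)

print("mean:", averages.mean())
print("sd:", averages.std(ddof=1))  # sample standard deviation

# Histogram of the simulated averages.
plt.hist(averages, bins=20)
plt.show()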

How do these values compare to what you found in (b)?

(d)

Let \(A_3\) be the average of three die rolls: \(A_3=\frac{1}{3}(X_1+X_2+X_3).\) Using Remark 2.5.1, find \(E(A_3), Var(A_3), \sigma_{A_3}.\)

(e)

Re-run the code in (c) with num_dice=3. How do those values compare to what you found in (d)?

(f)

Re-run the code in (c) with num_dice=30, num_dice=50, and num_dice=100. What do we notice about the mean and standard deviation as num_dice increases? How does the shape of the graph change?

Subsection 6.1.1 \(t\)-Variables

In Exploration 6.1.1, we saw a phenomenon very similar to the one described in Theorem 4.1.4, where the sampling distribution for \(\hat{p}\) was approximately normal. We have a similar result for the sampling distribution for \(\bar{x}\text{.}\)
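
Theorem 6.1.2.

Let \(X\) be a numerical variable with mean \(\mu\) and standard deviation \(\sigma\text{.}\) For a sufficiently large sample size \(n\text{,}\) the sampling distribution of the sample mean \(\bar{x}\) is approximately normal, with mean \(\mu\) and standard error
\begin{equation*} SE=\frac{\sigma}{\sqrt{n}}. \end{equation*}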

Remark 6.1.3.

We can see how the average die rolls of Exploration 6.1.1 follow this. Run the following code to plot the histogram from Exploration 6.1.1, along with a normal curve with mean mu=3.5 and standard deviation SE=sqrt(35/12)/sqrt(num_dice):
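Again the cell itself is absent; a sketch continuing the simulation above, assuming scipy for the normal density:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng()
trials = 1000
num_dice = 2
averages = rng.integers(1, 7, size=(trials, num_dice)).mean(axis=1)

mu = 3.5
SE = np.sqrt(35/12) / np.sqrt(num_dice)

# Density-scaled histogram of the averages ...
plt.hist(averages, bins=20, density=True)

# ... overlaid with the normal curve from Theorem 6.1.2.
x = np.linspace(1, 6, 200)
plt.plot(x, norm.pdf(x, loc=mu, scale=SE))
plt.show()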

Edit num_dice and see how the curve matches the histogram.

Activity 6.1.2. Variation in Parameters.

With categorical variables, the mean and standard deviation of the sampling distribution are completely determined by one parameter: \(p\text{.}\) We can see this in the statement of Theorem 4.1.4. So when we assume a null hypothesis and set \(p=p_0\text{,}\) this strictly determines the hypothetical sampling distribution.

As we see in Theorem 6.1.2, the sampling distribution for \(\bar{x}\) is determined by both \(\mu\) and \(\sigma\text{.}\) So if one were to conduct a hypothesis test and set \(\mu=\mu_0\text{,}\) there's still the question of what \(\sigma\) is. The general practice is to assume it's the same as the sample standard deviation, but how good an estimate is this?

Run the following code to download the ames.csv data set, which contains information about houses in Ames, Iowa, and to see its variable names:
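The download cell is not shown; a minimal pandas sketch (the CSV URL is an assumption based on the OpenIntro data index linked below):

import pandas as pd

# Assumed location of the OpenIntro ames data set.
ames = pd.read_csv("https://www.openintro.org/data/csv/ames.csv")
print(ames.columns)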

Click here to learn more about this data set: https://www.openintro.org/data/index.php?data=ames.

(a)

Run the following code to sample 30 random houses from Ames, Iowa and display their housing prices:
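Continuing the sketch, with the sale-price column assumed to be named "SalePrice" (check the variable names printed above if yours differ):

# Draw 30 houses at random and display their prices.
sample30 = ames.sample(n=30)
print(sample30["SalePrice"])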

(b)

Run the following code to show the standard deviation of housing prices in this sample:
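A possible cell:

# pandas computes the sample (n-1) standard deviation by default.
print(sample30["SalePrice"].std())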

(c)

Run the following code to sample 30 new random houses from Ames, Iowa and display the standard deviation of their housing prices:
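Sketch:

# A fresh sample of 30 houses, and its standard deviation.
new_sample30 = ames.sample(n=30)
print(new_sample30["SalePrice"].std())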

How does this outcome compare to what you found in (b)?

(d)

Run the following code to sample 30 new random houses from Ames, Iowa, trials=1000 times, recording the standard deviation of housing prices in each sample, and to display a histogram of these standard deviations:
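One way this cell might look, continuing from above:

import matplotlib.pyplot as plt

# Record the sample standard deviation from each of trials=1000 samples.
sds = [ames.sample(n=30)["SalePrice"].std() for _ in range(1000)]

plt.hist(sds, bins=20)
plt.show()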

What do we notice about the possible variation in sample standard deviations?

(e)

Run the following code to redisplay the above histogram, along with the actual standard deviation of housing prices in Ames, Iowa:
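Sketch, reusing sds from the previous cell:

# Redraw the histogram and mark the actual (population) standard
# deviation with a vertical line.
plt.hist(sds, bins=20)
plt.axvline(ames["SalePrice"].std(), color="red")
plt.show()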

How do the sample standard deviations compare to the actual standard deviation?

(f)

To see how much variation this could produce in a hypothetical sampling distribution, run the following code to plot a normal curve for each sample standard deviation we found: each curve has mean 0, but standard deviation equal to one of the standard deviations from the simulation above:
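A sketch of such a plot (drawing a subset of the curves keeps the figure readable):

import numpy as np
from scipy.stats import norm

x = np.linspace(-3 * max(sds), 3 * max(sds), 300)
for s in sds[:100]:
    # One mean-0 normal curve per sample standard deviation.
    plt.plot(x, norm.pdf(x, loc=0, scale=s), color="blue", alpha=0.05)
plt.show()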

How much variation is there between curves, even with the same means?

(g)

Run the following code to repeat the simulation of (d), but now drawing trials=1000 samples of n=500 random houses from Ames, Iowa, recording the standard deviation of housing prices in each sample, and displaying a histogram of these standard deviations:
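Sketch:

# The same simulation with samples of size n=500.
sds500 = [ames.sample(n=500)["SalePrice"].std() for _ in range(1000)]

plt.hist(sds500, bins=20)
plt.show()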

What do we notice about the possible variation in sample standard deviations compared to what we saw in (d)?

Subsection 6.1.2 The \(t\)-distribution

Definition 6.1.4. \(t\)-Random Variable.

Looking at Activity 6.1.2, we notice:

  • A normal curve where we use a sample standard deviation \(s\) in lieu of \(\sigma\) may be highly inaccurate, since there is a lot of variation of what \(s\) is in comparison to \(\sigma\text{.}\)

  • This variation decreases as we increase \(n\text{,}\) the size of the sample.

To account for this, we introduce what we call the standard \(t\)-distribution (mean 0, standard deviation 1). The technical formula for a \(t\)-distribution is uninteresting and tedious. The main thing we want to understand about the standard \(t\)-distribution is that it has an additional parameter, the degrees of freedom \(d_f\text{.}\) For a \(t\)-distribution, the number of degrees of freedom is \(d_f=n-1\text{,}\) where \(n\) is the sample size.

We notice that when \(d_f\) is small, the \(t\)-distribution resembles a much-widened normal distribution, which accounts for the variability in the sample standard deviation. As \(d_f\) increases, the \(t\)-distribution grows to resemble the standard normal distribution.
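This is easy to see by plotting a few \(t\)-densities against the standard normal curve; a short sketch, assuming scipy and matplotlib:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, t

x = np.linspace(-4, 4, 400)
plt.plot(x, norm.pdf(x), color="black", label="standard normal")
for df in (2, 5, 30):
    # Small df: heavier tails; large df: nearly indistinguishable
    # from the normal curve.
    plt.plot(x, t.pdf(x, df), label=f"t, df={df}")
plt.legend()
plt.show()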

Activity 6.1.3. Probabilities and the \(t\)-Distribution.

Probabilities of \(t\) variables are computed in a manner similar to that of normal variables, via areas as in Activity 3.1.2.

Below, we can see that for a \(t\) variable with 8 degrees of freedom, the probability that \(t\) is less than \(-0.7\) is \(P(t_8\lt -0.7)\approx 0.2519.\)

We can also confirm this by running the following:
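The confirmation cell is not shown; one way to compute this, assuming scipy:

from scipy.stats import t

# P(t_8 < -0.7): the area to the left of -0.7 under the t-density
# with 8 degrees of freedom.
print(t.cdf(-0.7, df=8))  # approximately 0.2519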

(a)

For a \(t\)-variable with 19 degrees of freedom, compute the probability \(t\) is greater than 1.1: \(P(t_{19}>1.1)\text{.}\)

(b)

For a \(t\)-variable with 100 degrees of freedom, compute the probability \(t\) is between -2 and 1.5: \(P(-2\lt t_{100}\lt1.5)\text{.}\)

(c)

For \(t\) variables with degrees of freedom 5, 20 and 200, compute \(P(-1\lt t\lt1)\text{.}\) What do we notice as the degrees of freedom increase?

(d)

For \(t\) variables with degrees of freedom 5, 20 and 200, compute \(P(t>2)\text{.}\) What do we notice as the degrees of freedom increase?

Activity 6.1.4. Probabilities and the \(t\)-Distribution: Inverses.

Just as for normal variables (see: Activity 3.1.3), given an area or probability, we should be able to recover bounds for it.

Below, we can see that for a \(t\) variable with 13 degrees of freedom, if we wanted to find a value \(t_{13}^*\) so that the probability that \(t\) is less than \(t_{13}^*\) is 70%, or \(P(t_{13}\lt t_{13}^*)=0.7\text{,}\) we can see that the value is about \(t_{13}^*\approx 0.5375\text{.}\)

We can also confirm this by running the following:
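Again the cell is absent; a scipy sketch of the inverse computation:

from scipy.stats import t

# The 70th percentile of the t-distribution with 13 degrees of freedom.
print(t.ppf(0.7, df=13))  # approximately 0.5375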

(a)

For a \(t\)-variable with 13 degrees of freedom, find \(t_{13}^*\) such that \(P(t_{13}>t_{13}^*)=0.45\text{.}\)

(b)

For a \(t\)-variable with 45 degrees of freedom, find \(t_{45}^*\) such that \(P(t_{45}\lt t_{45}^*)=0.55\text{.}\)

(c)

For \(t\) variables with degrees of freedom 5, 20 and 200, find \(t^*\) such that \(P(-t^*\lt t\lt t^*)=0.85\text{.}\) What do we notice as the degrees of freedom increase?

Activity 6.1.5. \(t\)-score for Sampling Distribution Variables.

Just as for general normal variables, most sampling distributions do not have mean 0 and standard deviation 1. Just as with general normal variables, each value of the sampling distribution has a corresponding \(t\)-score on our standard \(t\)-distribution.

Consider a sampling distribution with mean \(\mu\) and standard deviation \(SE\text{.}\) For each value \(X\) from this distribution, we can compute the \(t\)-score for \(X\) via:

\begin{equation*} t=\frac{X-\mu}{SE} \end{equation*}

as in Definition 3.2.1.

(a)

For a sampling distribution with mean \(\mu=45\) and standard error \(SE=12\text{,}\) find the \(t\)-score of \(X=60\text{.}\)

(b)

For a sampling distribution with mean \(\mu=120\) and standard error \(SE=26\text{,}\) find the \(X\) values corresponding to the \(t\)-scores \(-1.97\) and \(1.97\text{.}\)

Activity 6.1.6. Putting it Together.

Suppose you sampled a numerical variable 20 times, and obtained the following values:

\begin{equation*} 20, 36, 23, 38, 42, 29, 47, 23, 30, 37, \end{equation*}
\begin{equation*} 45, 44, 34, 19, 31, 30, 14, 48, 50, 18. \end{equation*}
(a)

Compute the sample mean \(\bar{x}\text{,}\) and the standard deviation \(s\text{,}\) for the above sample.

(b)

What is \(n\text{,}\) the size of the sample? Use Theorem 6.1.2 to compute the standard error \(SE\) for the sampling distribution.

(c)

Use Definition 6.1.4 to compute the degrees of freedom of the sampling distribution.

(d)

If we assume the sampling distribution has mean \(\mu=30\text{,}\) use this mean and \(SE\) to compute a \(t\)-score for \(\bar{x}\text{,}\) \(t_{\bar{x}}\text{.}\)

(e)

Compute \(P(t>t_{\bar{x}})\text{.}\)