Section 4.1 Point Estimates (F1)
We now begin combining the concepts from previous chapters to get to the real heart of statistics. The real-world variables we care about are random variables, which we discussed in Chapter 2. But we can almost never work with such a variable directly; we have only a sample of data from it, as in Chapter 1. We will use what we learned in Chapter 3 to make predictions about the original random variable.
In this section, we identify point estimates for true parameters.
Subsection 4.1.1 Point Estimates vs Parameters of interest
Definition 4.1.1.
A quantity describing a population that we wish to measure or estimate is a parameter of interest. An estimate of that parameter, computed from a sample, is a point estimate. The difference between the point estimate and the parameter of interest (positive or negative) is the error. The sampling error is how much the error will tend to vary between samples. The sample size is the size of the sample.
Bias describes a systematic tendency to over- or underestimate the true population value. We use the sampling techniques in Chapter 1 to minimize bias.
Example 4.1.2. Male African Elephants.
To predict the mean weight of Male African Elephants, 100 Male African Elephants were weighed, and the average of these 100 elephants was 6175 kg. The actual mean weight of Male African Elephants is 6300 kg.
The parameter of interest is the mean weight of Male African Elephants.
The point estimate is the mean weight of the 100 sampled Male African Elephants, 6175 kg.
The error is the difference between the point estimate and the parameter of interest, 6175 kg - 6300 kg = -125 kg.
The sample size is 100 elephants.
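The arithmetic in this example can be checked with a few lines of Python. This is only an illustrative sketch: the variable names are ours, and the sign convention (point estimate minus parameter) is an assumption based on Definition 4.1.1 allowing the error to be positive or negative.

```python
# Values taken from Example 4.1.2 (all weights in kg).
parameter_of_interest = 6300   # actual mean weight of the population
point_estimate = 6175          # mean weight of the 100 sampled elephants
sample_size = 100              # number of elephants weighed

# Assumed sign convention: error = point estimate - parameter of interest.
error = point_estimate - parameter_of_interest
print(f"error = {error} kg")   # prints "error = -125 kg"
```

A negative error means the sample underestimated the true mean.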
Activity 4.1.1. Parameters of Interest.
For each of the following, determine if the parameter of interest is a mean or a proportion. (To help answer this, consider if the underlying variable is numerical or categorical.)
(a)
200 college students were surveyed to see what the cost of their textbooks was.
(b)
200 college students were surveyed to see if their professors used Open Educational Resources in their courses.
(c)
In a survey, adult women were asked if they experienced sexual discrimination in their place of work.
(d)
Adults were asked in a survey how much they spent on groceries every month.
(e)
High school students were asked how many hours a week they spend on the internet.
(f)
Adults were asked if an advanced degree was necessary for their current occupation.
Subsection 4.1.2 Variability in Point Estimates
Exploration 4.1.2. Americans' Support for Solar Energy.
Suppose that the proportion of American adults who support the expansion of solar energy is 88%. If we sampled 1000 random Americans, we wouldn't necessarily expect exactly 88% of them to support solar energy, would we? Maybe we would get a group of 1000 with more support for solar, or less. How much variability can we expect?
(a)
Run the following code to create a vector of 250 million Americans, of whom 88% support solar energy and 12% do not:
(b)
Run the following code to sample 1000 Americans from this Population:
(c)
Run the following code to see what proportion of the sample group support solar energy:
What is the error?
(d)
Rerun (b) and (c). What do we notice about the error?
(e)
Run the following code to simulate 10,000 possible surveys, and display a histogram of the proportion of respondents who support solar energy:
What do we notice about the shape of the histogram?
Remark 4.1.3.
Notice that every time we sample, we get a potentially different proportion of respondents who support solar energy. Thus, the proportion of respondents who support solar energy (\(\hat{p}\)) is a random variable, and so it has a mean and a standard deviation. We will call this standard deviation the standard error, as it measures the tendency of \(\hat{p}\) to deviate from its mean.
In general, we call \(\hat{p}\) the sample proportion, the proportion of a sample which satisfies some condition. It is a point estimate for \(p\text{,}\) the population proportion (which satisfies the same condition). The distribution of \(\hat{p}\) is the sampling distribution.
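The repeated sampling in Exploration 4.1.2 can be sketched in Python as follows. This is not the course's own code. It assumes each respondent can be drawn independently with probability 0.88, which closely approximates sampling 1000 people from a population of 250 million without building the full population vector. The names `survey_proportion`, `P_TRUE`, `SAMPLE_SIZE`, and `NUM_SURVEYS` are ours.

```python
import random
import statistics

random.seed(0)  # fixed seed so reruns give the same numbers

P_TRUE = 0.88         # population proportion from Exploration 4.1.2
SAMPLE_SIZE = 1000    # respondents per survey
NUM_SURVEYS = 10_000  # number of simulated surveys

def survey_proportion(p, n):
    """Simulate one survey and return the proportion of n respondents in support.

    Assumption: each respondent independently supports with probability p.
    """
    supporters = sum(random.random() < p for _ in range(n))
    return supporters / n

# Each entry is one realization of the random variable p-hat.
p_hats = [survey_proportion(P_TRUE, SAMPLE_SIZE) for _ in range(NUM_SURVEYS)]

mean_p_hat = statistics.fmean(p_hats)
sd_p_hat = statistics.stdev(p_hats)
print(f"mean of sample proportions: {mean_p_hat:.4f}")
print(f"standard deviation (standard error): {sd_p_hat:.4f}")

# Coarse text histogram: count sample proportions rounded to the nearest 0.01.
counts = {}
for p_hat in p_hats:
    bin_center = round(p_hat, 2)
    counts[bin_center] = counts.get(bin_center, 0) + 1
for bin_center in sorted(counts):
    print(f"{bin_center:.2f} {'#' * (counts[bin_center] // 50)}")
```

The printed mean should sit very close to 0.88, and the text histogram should look roughly bell shaped.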
Theorem 4.1.4. The Central Limit Theorem (Proportion).
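For a sufficiently large sample (typically \(np, n(1-p)\geq 10\) or at least 10 “successes” and “failures”) the sampling distribution of \(\hat{p}\) is approximately normally distributed with parameters
\begin{equation*}
\mu_{\hat{p}} = p \qquad \text{and} \qquad SE_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}\text{.}
\end{equation*}
This is very similar to Remark 3.4.5.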
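As a rough sketch of how the theorem is used, the Python helpers below check the sample-size condition and compute the two parameters, assuming the standard form \(\mu_{\hat{p}} = p\) and \(SE_{\hat{p}} = \sqrt{p(1-p)/n}\text{.}\) The function names and the values \(p=0.5\text{,}\) \(n=400\) are hypothetical, chosen only to give round numbers.

```python
import math

def success_failure_condition(p, n, threshold=10):
    """Return True when both np and n(1-p) are at least the threshold."""
    return n * p >= threshold and n * (1 - p) >= threshold

def sampling_distribution_of_p_hat(p, n):
    """Return (mean, standard_error) of the sample proportion.

    Assumes the standard parameters: mean = p, SE = sqrt(p(1-p)/n).
    """
    if not 0 <= p <= 1:
        raise ValueError("p must be a proportion between 0 and 1")
    if n <= 0:
        raise ValueError("n must be a positive sample size")
    return p, math.sqrt(p * (1 - p) / n)

# Hypothetical values, not taken from any activity in this section.
p, n = 0.5, 400
condition_ok = success_failure_condition(p, n)
mu_p_hat, se_p_hat = sampling_distribution_of_p_hat(p, n)
print(f"condition satisfied: {condition_ok}")            # np = 200, n(1-p) = 200
print(f"mean = {mu_p_hat:.3f}, SE = {se_p_hat:.3f}")     # mean = 0.500, SE = 0.025
```

Note that quadrupling \(n\) only halves the standard error, since \(n\) sits under a square root.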
Activity 4.1.3. Normal Approximation for Solar Energy Support.
In Exploration 4.1.2 we have a population where the proportion of support for solar energy is \(p=0.88\text{.}\) Moreover, we have a sample size of \(n=1000\text{.}\)
(a)
Compute \(np, n(1-p)\) and ensure that both values are at least 10.
(b)
Follow Theorem 4.1.4 to compute \(\mu_{\hat{p}}\) and \(SE_{\hat{p}}\text{.}\)
(c)
Fix and run the following code to display the histogram from Exploration 4.1.2 (e), overlaid with a normal curve with mean mu_p \(=\mu_{\hat{p}}\) and standard deviation SE_p \(=SE_{\hat{p}}\text{.}\)
(d)
Use the normal approximation for \(\hat{p}\) with mean \(\mu_{\hat{p}}\) and standard deviation \(SE_{\hat{p}}\) to compute \(P(\hat{p}\lt0.85)\text{.}\)
(e)
Use the normal approximation for \(\hat{p}\) to find an interval \([0.88-k, 0.88+k]\) so that there is a 95% chance \(\hat{p}\) falls in this interval.
(f)
Take a moment and appreciate how closely the sample proportions model the population proportion, despite 1000 adults being a paltry percentage of 250 million adults.
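The two normal-approximation computations in this activity, a tail probability and a symmetric 95% interval, can be sketched with Python's standard library. To avoid giving away the activity, the sketch uses hypothetical values \(p=0.5\) and \(n=400\) rather than the solar energy numbers. It also assumes the standard error formula \(\sqrt{p(1-p)/n}\text{.}\)

```python
import math
from statistics import NormalDist

# Hypothetical values, deliberately different from Activity 4.1.3.
p, n = 0.5, 400

# Assumed parameters of the sampling distribution of p-hat.
mu_p_hat = p
se_p_hat = math.sqrt(p * (1 - p) / n)
p_hat_dist = NormalDist(mu=mu_p_hat, sigma=se_p_hat)

# Tail probability: chance that a sample proportion falls below 0.45.
prob_below = p_hat_dist.cdf(0.45)

# Symmetric interval [p - k, p + k] containing p-hat with 95% probability:
# the upper endpoint is the 97.5th percentile of the distribution.
k = p_hat_dist.inv_cdf(0.975) - mu_p_hat

print(f"P(p_hat < 0.45) is about {prob_below:.4f}")
print(f"95% interval: [{mu_p_hat - k:.3f}, {mu_p_hat + k:.3f}]")
```

Here \(k\) works out to about 1.96 standard errors, the familiar multiplier for a central 95% region of a normal distribution.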
Activity 4.1.4. Merchandise Purchasing.
A recording artist wants to know what percentage of attendees of their shows purchase merchandise at the shows. They randomly surveyed 100 attendees of their recent shows, and found that 22 of them purchased merchandise.
(a)
What is the population under consideration?
People who listen to music.
100 attendees.
People who attended the recording artist's shows.
22 people who purchased merchandise.
(b)
What is the parameter of interest?
The average number of shows fans attend a year.
Proportion of show attendees who purchase merchandise.
The average amount of money spent on merchandise.
Proportion of people who bought the artist's album who attended shows.
(c)
What is the point estimate (\(\hat{p}\)) for the parameter?
(d)
Use Theorem 4.1.4 and the point estimate to compute the standard error for the point estimate.
(e)
Suppose that the actual percentage of show attendees who purchase merchandise is 25%. Is this surprising?
(f)
Recompute the standard error using \(p=0.25\text{.}\) Is it much different from what you found in (d)?
(g)
Fix and run the following code to simulate 1000 surveys of size n \(=100\text{,}\) with probability p \(=0.25\) and point estimate phat \(=\hat{p}=22/100\text{,}\) and to plot a histogram of the survey results along with the normal approximation:
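One possible shape for such a simulation is sketched below in Python. It is not the course's own code, and it omits the plot. It assumes each attendee independently purchases merchandise with probability p, and the names `survey_proportion` and `NUM_SURVEYS` are ours.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so reruns give the same numbers

n = 100            # survey size from Activity 4.1.4
p = 0.25           # supposed true proportion from part (e)
phat = 22 / 100    # point estimate from the artist's survey
NUM_SURVEYS = 1000

def survey_proportion(p, n):
    """Simulate one survey of n attendees, each purchasing with probability p."""
    return sum(random.random() < p for _ in range(n)) / n

p_hats = [survey_proportion(p, n) for _ in range(NUM_SURVEYS)]

# Parameters of the normal approximation (assumed standard form).
mu_p = p
SE_p = math.sqrt(p * (1 - p) / n)

# How often does a simulated survey come out at or below the observed 22%?
share_at_or_below = sum(x <= phat for x in p_hats) / NUM_SURVEYS

print(f"simulated mean = {statistics.fmean(p_hats):.3f}, normal mean = {mu_p:.3f}")
print(f"simulated sd   = {statistics.stdev(p_hats):.3f}, normal SE   = {SE_p:.3f}")
print(f"share of surveys with proportion <= {phat}: {share_at_or_below:.3f}")
```

If a sizable share of simulated surveys land at or below 0.22, then observing 22 purchasers out of 100 is not unusual when the true proportion is 25%.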
Activity 4.1.5. Adults who Smoke.
A research team wishes to determine what percentage of adults in a town smoke. In a survey of 232 adults from the town, 48 of them smoke.
(a)
What is the population under consideration?
Adults who smoke.
The adults who live in the town.
232 adults.
48 smokers.
(b)
What is the parameter of interest?
Proportion of adults who smoke.
Proportion of adults living in this town who smoke.
Average number of times adults smoke each week.
Proportion of adult humans on earth who live in this town.
(c)
What is the point estimate for the parameter?
(d)
Use Theorem 4.1.4 and the point estimate to compute the standard error for the point estimate.
(e)
Suppose that the actual percentage of adults who smoke in the town is 20%. Is this surprising?
(f)
Recompute the standard error using \(p=0.2\text{.}\) Is it much different from what you found in (d)?
(g)
Fix and run the following code to simulate 1000 surveys of size n, with probability p and point estimate phat, and to plot a histogram of the survey results along with the normal approximation: