Section 7.3 Inference and Linear Regression (R3)
The slope \(\beta_1\) and vertical intercept \(\beta_0\) we computed in Section 7.2 are determined by a sample of (pairs of) random variables, and as such are merely point estimates of the true parameters. As usual, there is some chance that our sample is not representative of the population.
In this section, we use our sample data to make inferences about the true parameters.
Exploration 7.3.1. Estimating Regression.
Let \(X\) and \(Y\) be variables where \(Y=0.8X+5+\epsilon\) and \(\epsilon\) is a normally distributed variable with mean 0 and standard deviation 3.
(a)
What line probably best fits the relationship between \(X\) and \(Y\text{?}\)
(b)
Run the following code to generate \(n=10\) random \(X\) and \(Y\) values, plot them, and draw both the regression line and the line from (a):
(c)
Set \(n=100\) and run it again. What do you notice now?
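The simulation in parts (b) and (c) can be sketched in Python as follows. This is only a sketch under stated assumptions: the exploration does not say how \(X\) is distributed, so the uniform range from 0 to 20, the fixed seed, and the helper names simulate and least_squares are illustrative choices, and the plot is omitted.

```python
import random

def simulate(n, rng):
    """Draw n pairs (x, y) with y = 0.8*x + 5 + eps, eps ~ Normal(0, 3)."""
    xs = [rng.uniform(0, 20) for _ in range(n)]  # assumed range for X
    ys = [0.8 * x + 5 + rng.gauss(0, 3) for x in xs]
    return xs, ys

def least_squares(xs, ys):
    """Return (slope, intercept) of the least squares regression line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar

rng = random.Random(1)  # fixed seed so reruns give the same samples
for n in (10, 100):
    slope, intercept = least_squares(*simulate(n, rng))
    print(f"n={n}: fitted y = {slope:.3f}x + {intercept:.3f} (model line: y = 0.8x + 5)")
```

With larger \(n\), the fitted slope and intercept should tend to land closer to 0.8 and 5, which is the pattern part (c) asks about.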
Subsection 7.3.1 Hypothesis Testing
Remark 7.3.1.
When it comes to linear regression, we focus on the following pair of hypotheses:
\(H_0: \beta_1=0\text{,}\) or the slope is zero. Changes in the explanatory variable do not result in a change, on average, in the response variable.
\(H_A: \beta_1\neq 0\text{,}\) or the slope is not zero. Changes in the explanatory variable do result in a change, on average, in the response variable.
This is a numerical hypothesis test similar to what we have done before. However, we are not simply given a list of slopes from which to find a sample standard deviation. The computation is tedious, so we use technology to perform it. To compute the standard error for \(\beta_1\) by hand, one would use:
\[SE_{\beta_1}=\sqrt{\frac{\frac{1}{n-2}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}}\]
where \(n\) is the number of data points, \(\bar{x}\) is the mean of the \(x_i\text{,}\) and \(\hat{y}_i=\beta_1\cdot x_i+\beta_0\) is the value of \(y\) predicted by the linear model.
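A minimal Python sketch of this by-hand computation, assuming the standard least squares formula \(\sqrt{\frac{1}{n-2}\sum(y_i-\hat{y}_i)^2\,/\,\sum(x_i-\bar{x})^2}\) for the standard error of the slope. The function name slope_standard_error and the five data points are made up for illustration and do not come from any data set in this section.

```python
import math

def slope_standard_error(xs, ys):
    """Standard error of the least squares slope, computed from scratch."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    b0 = y_bar - b1 * x_bar
    # Sum of squared residuals: observed y minus the value predicted by the line
    sse = sum((y - (b1 * x + b0)) ** 2 for x, y in zip(xs, ys))
    # Divide by n - 2 because two parameters (slope, intercept) were estimated
    return math.sqrt(sse / (n - 2) / s_xx)

# Made-up toy values, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(slope_standard_error(xs, ys))
```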
Exploration 7.3.2. Hypothesis Testing: Slope - Possums.
Run the following code to download possum.csv, as seen in Exploration 7.1.1, Activity 7.1.7, and Exploration 7.2.1, create a linear model for head_l~skull_w, and summarize it.
In the summary output, look at the Coefficients section, in particular the second row.
The first row, starting with (Intercept), gives the point estimate, standard error, test statistic, and \(p\)-value for the alternative hypothesis \(H_A:\beta_0\neq 0\text{.}\)
The second row, starting with skull_w, gives the point estimate, standard error, test statistic, and \(p\)-value for the alternative hypothesis \(H_A:\beta_1\neq 0\text{.}\) This is the row about the slope.
(a)
The first entry of this row gives us the point estimate for the slope. How does this compare to slope found by running:
We call this value \(\beta_1\text{.}\)
(b)
The second entry of this row gives us the standard error for the slope. Call this value \(SE_1\text{.}\) Find the \(t\)-value for \(\beta_1\) by computing
\[t_1=\frac{\beta_1-0}{SE_1}\]
How does this value compare to the third entry?
(c)
According to the summary, how many degrees of freedom are there?
(d)
Compute \(P(t>t_1)\) on the standard \(t\)-distribution with the appropriate degrees of freedom. Then double it to obtain the \(p\)-value for the alternative hypothesis \(H_A: \beta_1\neq 0\text{.}\)
Hint. Desmos
How does this value compare to the fourth entry?
(e)
Do we reject or fail to reject the null hypothesis \(H_0: \beta_1=0\text{?}\)
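The computations in parts (b) and (d) can be sketched in Python as follows. This is a sketch, not the summary routine the exploration relies on: the tail probability is approximated by numerically integrating the \(t\) density with Simpson's rule so that only the standard library is needed, and the names t_pdf, upper_tail, and slope_test, as well as the numbers in the usage line, are hypothetical.

```python
import math

def t_pdf(t, df):
    """Density of the standard t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))

def upper_tail(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0 using Simpson's rule on [0, t]."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    # The density is symmetric, so P(T > 0) = 0.5
    return 0.5 - total * h / 3

def slope_test(b1, se1, df):
    """Return (t statistic, two-sided p-value) for H_0: beta_1 = 0."""
    t1 = b1 / se1
    return t1, 2 * upper_tail(abs(t1), df)

# Hypothetical inputs, not the possum output
t1, p_value = slope_test(b1=1.2, se1=0.4, df=10)
print(t1, p_value)
```

Because the tail is found by subtracting from 0.5, very small \(p\)-values lose precision here; a statistics library's \(t\)-distribution functions would be the usual choice in practice.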
Remark 7.3.2.
The fourth entry of the second Coefficients: row gives the probability that, if the true slope were 0, we would obtain a sample slope at least as steep (in either direction) as the one observed.
The fourth entry of the first Coefficients: row gives the probability that, if the true intercept were 0, we would obtain a sample intercept at least as extreme as the one observed.
Subsection 7.3.2 Confidence Intervals
Remark 7.3.3.
For either \(\beta_i\) (\(i=0\) or \(i=1\)), we compute the \(C\%\) confidence interval as follows:
\[\beta_i\pm t^*\cdot SE_i\]
where \(SE_i\) is the standard error of \(\beta_i\) and \(t^*\) is the \(t\)-value such that \(P(-t^*\lt t\lt t^*)=C\%\) for the standard \(t\)-variable with the appropriate degrees of freedom.
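A minimal Python sketch of this interval, assuming the usual form of point estimate plus or minus \(t^*\) times the standard error. The value \(t^*\) is found by bisection on a numerically integrated \(t\) density so that only the standard library is needed; the names central_prob, t_star, and confidence_interval, and the numbers in the usage line, are hypothetical.

```python
import math

def central_prob(t, df, steps=2000):
    """Approximate P(-t < T < t) for t >= 0 using Simpson's rule."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(u):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))

    h = t / steps  # steps must be even for Simpson's rule
    total = pdf(0) + pdf(t)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * pdf(k * h)
    return 2 * total * h / 3  # doubled because the density is symmetric

def t_star(conf, df):
    """Find t* with P(-t* < T < t*) = conf (conf given as a fraction)."""
    if not 0 < conf < 1:
        raise ValueError("conf must be strictly between 0 and 1")
    lo, hi = 0.0, 1.0
    while central_prob(hi, df) < conf:  # widen until the bracket contains t*
        hi *= 2
    for _ in range(60):  # bisection
        mid = (lo + hi) / 2
        if central_prob(mid, df) < conf:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def confidence_interval(estimate, se, conf, df):
    """Return (lower, upper) = estimate -/+ t* * se."""
    margin = t_star(conf, df) * se
    return estimate - margin, estimate + margin

# Hypothetical inputs, not taken from any output in this section
print(confidence_interval(estimate=1.2, se=0.4, conf=0.95, df=10))
```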
Exploration 7.3.3. Confidence Intervals: \(\beta_i\) - Possums.
We continue from Exploration 7.3.2.
(a)
Find a \(t^*\) so that \(P(-t^*\lt t\lt t^*)=95\%\) for the appropriate degrees of freedom.
(b)
Use the point estimate and standard error for \(\beta_1\text{,}\) \(t^*\) and Remark 7.3.3 to compute a 95% confidence interval for \(\beta_1\text{.}\)
(c)
Explain what this confidence interval means within the context of the problem.
Subsection 7.3.3 Putting it together
Activity 7.3.4. Inference for SP500 companies.
Run the following code to download sp500.csv, a data set containing information on a sample of 50 S&P 500 companies, and show its variable names.
https://www.openintro.org/data/index.php?data=sp500
(a)
Run the following to create and summarize a linear model with debt (the debt in millions of dollars) as the explanatory variable and profit_margin (the percent of earnings that is profit) as the response variable:
(b)
Run the following to plot profit_margin of these companies against debt and draw a regression line:
(c)
State what the slope \(\beta_1\) means in the context of this problem.
(d)
Interpret the \(p\)-value for the alternative hypothesis \(H_A:\beta_1\neq 0\) in the context of this problem.
(e)
Do we reject the null hypothesis that \(H_0:\beta_1=0\text{?}\) What does that say about the relationship between profit margin and debt?
(f)
Find a \(t^*\) so that \(P(-t^*\lt t\lt t^*)=95\%\) for the appropriate degrees of freedom.
(g)
Use the point estimate and standard error for \(\beta_1\text{,}\) \(t^*\) and Remark 7.3.3 to compute a 95% confidence interval for \(\beta_1\text{.}\)
(h)
Explain what this confidence interval means within the context of the problem.
Activity 7.3.5. Inference for Nutrition and Starbucks.
Run the following code to download starbucks.csv, a data set containing nutritional information about 77 Starbucks menu items, and show its variable names.
https://www.openintro.org/data/index.php?data=starbucks
(a)
Run the following to create and summarize a linear model with protein (the protein content of an item in g) as the explanatory variable and calories (the calories of each item, measured in, well, calories) as the response variable:
(b)
Run the following to plot calories of these items against protein and draw a regression line:
(c)
State what the slope \(\beta_1\) means in the context of this problem.
(d)
Interpret the \(p\)-value for the alternative hypothesis \(H_A:\beta_1\neq 0\) in the context of this problem.
(e)
Do we reject the null hypothesis that \(H_0:\beta_1=0\text{?}\) What does that say about the relationship between calories and protein?
(f)
Find a \(t^*\) so that \(P(-t^*\lt t\lt t^*)=95\%\) for the appropriate degrees of freedom.
(g)
Use the point estimate and standard error for \(\beta_1\text{,}\) \(t^*\) and Remark 7.3.3 to compute a 95% confidence interval for \(\beta_1\text{.}\)
(h)
Explain what this confidence interval means within the context of the problem.
(i)
Repeat this for any other pair of numerical variables.