Section 7.3 Inference and Linear Regression (R3)

The slope and vertical intercept \(\beta_1, \beta_0\) that we computed in Section 7.2 are determined by a sample of (pairs of) random variables, and as such are merely point estimates of the true parameters. As usual, there is some chance that our sample is not representative of the population.

In this section, we use our sample data to make inferences on the true parameters.

Exploration 7.3.1. Estimating Regression.

Let X and Y be variables where \(Y=0.8X+5+\epsilon\) and \(\epsilon\) is normally distributed with mean 0 and standard deviation 3.

(a)

What line probably best fits the relationship between X and Y?

(b)

Run the following code to generate n=10 random X and Y values, plot them, fit a regression line, and draw the line from (a):

Run it a few times. What do you notice?

(c)

Set n=100 and run it again. What do you notice now?
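A rough Python sketch of this kind of simulation, using only the standard library, is given here for readers who want to experiment outside the page. It is not the section's own code cell. The function name `simulate_fit`, the choice to draw X uniformly from [0, 20], and the seeds are all assumptions made for illustration; the slope and intercept come from the usual least-squares formulas.

```python
import random

def simulate_fit(n, seed=None):
    """Draw n points from Y = 0.8*X + 5 + eps with eps ~ Normal(0, 3),
    then return the least-squares (slope, intercept)."""
    rng = random.Random(seed)
    # Assumption: X is uniform on [0, 20]; the text does not say how X is drawn.
    xs = [rng.uniform(0, 20) for _ in range(n)]
    ys = [0.8 * x + 5 + rng.gauss(0, 3) for x in xs]

    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Fit several independent samples at each sample size and print the fitted lines.
for n in (10, 100):
    print(f"n = {n}")
    for seed in range(5):
        b1, b0 = simulate_fit(n, seed)
        print(f"  fitted line: y = {b1:.3f} x + {b0:.3f}")
```

Comparing the printed lines across seeds, and then across the two values of n, mirrors parts (b) and (c).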

Subsection 7.3.1 Hypothesis Testing

Remark 7.3.1.

When it comes to linear regression, we focus on the following pair of hypotheses:

  • \(H_0: \beta_1=0\) or the slope is zero. So changes in the explanatory variable do not result in an average change in the response variable.

  • \(H_A: \beta_1\neq 0\) or the slope is not zero. So changes in the explanatory variable do result in an average change in the response variable.

This is a numerical hypothesis test similar to those we have done before. However, we are not simply given a list of slopes from which to compute a sample standard deviation. The computation is tedious, so we use technology to perform it. To compute the standard error for \(\beta_1\) by hand, use:

\begin{equation*} SE_1=\sqrt{\frac{1}{n-2}\cdot \frac{\sum (y_i-\hat{y}_i)^2}{\sum(x_i-\bar{x})^2}} \end{equation*}

where \(\hat{y}_i=\beta_1\cdot x_i+\beta_0\text{,}\) or the value of \(y\) predicted by the linear model.
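To make the formula concrete, here is a minimal Python sketch that evaluates \(SE_1\) term by term. The five data points are hypothetical, the function name `slope_standard_error` is an illustrative choice, and the slope and intercept are assumed to come from the usual least-squares formulas.

```python
import math

def slope_standard_error(xs, ys):
    """Return (b1, b0, se1): least-squares slope, intercept, and the
    standard error of the slope from the formula above."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    # Sum of squared deviations of x: the denominator inside the root.
    sxx = sum((x - x_bar) ** 2 for x in xs)

    # Least-squares slope and intercept.
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar

    # Sum of squared residuals (y_i - y_hat_i)^2: the numerator inside the root.
    sse = sum((y - (b1 * x + b0)) ** 2 for x, y in zip(xs, ys))

    se1 = math.sqrt((1 / (n - 2)) * sse / sxx)
    return b1, b0, se1

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b1, b0, se1 = slope_standard_error(xs, ys)
print(f"slope = {b1:.4f}, intercept = {b0:.4f}, SE_1 = {se1:.4f}")
```

For these five points the sum of squared residuals is 2.4 and \(\sum(x_i-\bar{x})^2=10\), so \(SE_1=\sqrt{\tfrac{1}{3}\cdot\tfrac{2.4}{10}}\approx 0.2828\).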

Exploration 7.3.2. Hypothesis Testing: Slope - Possums.

Run the following code to download possum.csv as seen in Exploration 7.1.1, Activity 7.1.7, and Exploration 7.2.1, and create a linear model for head_l~skull_w and summarize it.

We're focused on the Coefficients, in particular the second row.

The first row starting with (Intercept) gives the point estimate, standard error, test statistic and \(p\)-value for the alternative hypothesis \(H_A:\beta_0\neq 0\text{.}\)

The second row starting with skull_w gives the point estimate, standard error, test statistic and \(p\)-value for the alternative hypothesis \(H_A:\beta_1\neq 0\text{.}\) This is the row about slopes.

(a)

The first entry of this row gives us the point estimate for the slope. How does this compare to the slope found by running:

We call this value \(\beta_1\text{.}\)

(b)

The second entry of this row gives us the standard error for the slope. Call this value \(SE_1\text{.}\) Find the \(t\)-value for \(\beta_1\) by computing

\begin{equation*} t_1=\frac{\beta_1-0}{SE_1}. \end{equation*}

How does this value compare to the third entry?

(c)

According to the summary, how many degrees of freedom are there?

(d)

Compute \(P(t>t_1)\) on the standard \(t\)-distribution with the appropriate degrees of freedom. Then double it to obtain the \(p\)-value for the alternative hypothesis \(H_A: \beta_1\neq 0\text{.}\)

Hint. Desmos

How does this value compare to the 4th entry?

(e)

Do we reject or fail to reject the null hypothesis \(H_0: \beta_1=0\text{?}\)
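The steps in parts (b) through (d) can also be sketched in Python with only the standard library. This is one possible approach, not the method used by the summary output: the tail probability is approximated by integrating the \(t\)-density numerically with Simpson's rule. The function names and the input numbers are hypothetical and are not the possum results.

```python
import math

def t_pdf(t, df):
    """Density of the standard t-distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def upper_tail(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0 via Simpson's rule on [0, t].
    Very small tail probabilities are not resolved precisely."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return max(0.0, 0.5 - total * h / 3)

def two_sided_p_value(estimate, std_error, df):
    """Return (t statistic, two-sided p-value) for H_A: parameter != 0."""
    t_stat = (estimate - 0) / std_error
    return t_stat, 2 * upper_tail(abs(t_stat), df)

# Hypothetical inputs: slope estimate 0.6, standard error 0.2828, n = 5 so df = 3.
t1, p = two_sided_p_value(0.6, 0.2828, df=3)
print(f"t = {t1:.4f}, two-sided p-value = {p:.4f}")
```

Taking the absolute value of the test statistic before doubling keeps the calculation valid when the estimated slope is negative.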

Remark 7.3.2.

The fourth entry of the second Coefficients: row gives the probability that, if the true slope were 0, we would obtain a sample slope as steep or steeper (in either direction).

The fourth entry of the first Coefficients: row gives the probability that, if the true intercept were 0, we would obtain a sample intercept as extreme or more extreme.

Subsection 7.3.2 Confidence Intervals

Remark 7.3.3.

For either \(\beta_i\) (\(i=0\) or \(1\)), we compute the C% confidence interval as follows:

\begin{equation*} [\text{point estimate for }\beta_i - SE_i t^*, \text{point estimate for }\beta_i + SE_i t^*] \end{equation*}

where \(t^*\) is the \(t\)-value such that \(P(-t^*\lt t\lt t^*)=C\%\) for the standard \(t\)-variable with the appropriate degrees of freedom.

Exploration 7.3.3. Confidence Intervals: \(\beta_i\) - Possums.

We continue from Exploration 7.3.2.

(a)

Find a \(t^*\) so that \(P(-t^*\lt t\lt t^*)=95\%\) for the appropriate degrees of freedom.

(b)

Use the point estimate and standard error for \(\beta_1\text{,}\) \(t^*\) and Remark 7.3.3 to compute a 95% confidence interval for \(\beta_1\text{.}\)

(c)

Explain what this confidence interval means within the context of the problem.
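As a hedged illustration of Remark 7.3.3, the following Python sketch finds \(t^*\) by bisection on a numerically integrated \(t\)-density and then forms the interval. The helper names and the input numbers are hypothetical and are not the possum results. In practice a statistics package or a \(t\)-table would supply \(t^*\) directly.

```python
import math

def central_probability(t_star, df, steps=2000):
    """Approximate P(-t* < T < t*) for the standard t-distribution
    using Simpson's rule on [0, t*] and symmetry."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)

    def pdf(t):
        return coef * (1 + t * t / df) ** (-(df + 1) / 2)

    h = t_star / steps  # steps must be even for Simpson's rule
    total = pdf(0) + pdf(t_star)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * pdf(k * h)
    return 2 * total * h / 3

def critical_t(confidence, df):
    """Bisection search for t* with P(-t* < T < t*) = confidence (0 < confidence < 1)."""
    lo, hi = 0.0, 1.0
    while central_probability(hi, df) < confidence:
        hi *= 2  # widen the bracket until it contains t*
    for _ in range(60):
        mid = (lo + hi) / 2
        if central_probability(mid, df) < confidence:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def confidence_interval(estimate, std_error, df, confidence=0.95):
    """Interval [estimate - SE * t*, estimate + SE * t*]."""
    t_star = critical_t(confidence, df)
    return estimate - std_error * t_star, estimate + std_error * t_star

# Hypothetical inputs: slope estimate 0.6, standard error 0.2828, df = 3.
low, high = confidence_interval(0.6, 0.2828, df=3)
print(f"95% confidence interval for the slope: [{low:.3f}, {high:.3f}]")
```

If the resulting interval contains 0, the data are consistent with a slope of zero at the matching significance level, which ties the interval back to the hypothesis test in Subsection 7.3.1.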

Subsection 7.3.3 Putting it together

Activity 7.3.4. Inference for SP500 companies.

Run the following code to download sp500.csv, a data set containing information on a sample of 50 S&P 500 companies, and show its variable names.

Click here to learn more about this data set: https://www.openintro.org/data/index.php?data=sp500.

(a)

Run the following to create and summarize a linear model with debt (the debt in millions of dollars) as the explanatory variable and profit_margin (the percent of earnings that is profit) as the response variable:

(b)

Run the following to plot profit_margin of these companies against the debt and draw a regression line:

(c)

State what the slope \(\beta_1\) means in the context of this problem.

(d)

Interpret the \(p\)-value for the alternative hypothesis \(H_A:\beta_1\neq 0\) in the context of this problem.

(e)

Do we reject the null hypothesis that \(H_0:\beta_1=0\text{?}\) What does that say about the relationship between profit margin and debt?

(f)

Find a \(t^*\) so that \(P(-t^*\lt t\lt t^*)=95\%\) for the appropriate degrees of freedom.

(g)

Use the point estimate and standard error for \(\beta_1\text{,}\) \(t^*\) and Remark 7.3.3 to compute a 95% confidence interval for \(\beta_1\text{.}\)

(h)

Explain what this confidence interval means within the context of the problem.

Activity 7.3.5. Inference for Nutrition and Starbucks.

Run the following code to download starbucks.csv, a data set containing nutritional information about 77 Starbucks menu items, and show its variable names.

Click here to learn more about this data set: https://www.openintro.org/data/index.php?data=starbucks.

(a)

Run the following to create and summarize a linear model with protein (the protein content of an item, in grams) as the explanatory variable and calories (the calories in each item) as the response variable:

(b)

Run the following to plot calories of these items against the protein and draw a regression line:

(c)

State what the slope \(\beta_1\) means in the context of this problem.

(d)

Interpret the \(p\)-value for the alternative hypothesis \(H_A:\beta_1\neq 0\) in the context of this problem.

(e)

Do we reject the null hypothesis that \(H_0:\beta_1=0\text{?}\) What does that say about the relationship between calories and protein?

(f)

Find a \(t^*\) so that \(P(-t^*\lt t\lt t^*)=95\%\) for the appropriate degrees of freedom.

(g)

Use the point estimate and standard error for \(\beta_1\text{,}\) \(t^*\) and Remark 7.3.3 to compute a 95% confidence interval for \(\beta_1\text{.}\)

(h)

Explain what this confidence interval means within the context of the problem.

(i)

Repeat this for any other pair of numerical variables.