Section 7.3 Inference and Linear Regression (R3)

The slope and vertical intercept, β1 and β0, that we computed in Section 7.2 are determined by a sample of (pairs of) random variables and, as such, are merely point estimates of these parameters. As usual, there is some chance that the sample we drew is not representative of the population.

In this section, we use our sample data to make inferences on the true parameters.

Exploration 7.3.1. Estimating Regression.

Let X and Y be variables where Y = 0.8X + 5 + ϵ and ϵ is distributed as a normal variable with mean 0 and standard deviation 3.

(a)

What line probably best fits the relationship between X and Y?

(b)

Run the following code to generate n=10 random X and Y values, plot them, and draw both the fitted regression line and the line from (a):

Run it a few times. What do you notice?

(c)

Change n to 100 and run it again. Now what do you notice?
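The experiment in this exploration can also be tried without a plot. The following is a minimal Python sketch using only the standard library: it draws samples from Y = 0.8X + 5 + ϵ and prints the fitted least-squares line several times for n=10 and n=100. It assumes X is uniform on [0, 20], which the exploration does not specify, and the names `fit_line` and `simulate` are illustrative.

```python
import random

def fit_line(xs, ys):
    """Return the least-squares (slope, intercept) for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def simulate(n, rng):
    """Draw n points from Y = 0.8 X + 5 + eps with eps ~ Normal(0, 3)."""
    # Assumption: X is uniform on [0, 20]; the text does not say how X is drawn.
    xs = [rng.uniform(0, 20) for _ in range(n)]
    ys = [0.8 * x + 5 + rng.gauss(0, 3) for x in xs]
    return xs, ys

rng = random.Random(7)  # fixed seed so that reruns give the same samples
for n in (10, 100):
    print(f"n = {n}")
    for _ in range(5):
        slope, intercept = fit_line(*simulate(n, rng))
        print(f"  fitted line: y = {slope:.2f} x + {intercept:.2f}")
```

Comparing the spread of the printed slopes and intercepts across the two sample sizes is one way to approach parts (b) and (c).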

Subsection 7.3.1 Hypothesis Testing

Remark 7.3.1.

When it comes to linear regression, we focus on the following pair of hypotheses:

  • H0: β1 = 0, or the slope is zero. Changes in the explanatory variable do not result in an average change in the response variable.

  • HA: β1 ≠ 0, or the slope is not zero. Changes in the explanatory variable do result in an average change in the response variable.

This is a numerical hypothesis test similar to what we have done before. However, we are not simply given a list of slopes from which to find a sample standard deviation. The computations here are tedious, so we use technology to perform them. If one wished to compute the standard error for β1 by hand, one would do so via:

\[ SE_1 = \sqrt{\frac{1}{n-2} \cdot \frac{\sum (y_i - \hat{y}_i)^2}{\sum (x_i - \bar{x})^2}} \]

where ŷi = β1xi + β0 is the value of y predicted by the linear model.
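To make the by-hand formula concrete, here is a minimal Python sketch (standard library only) that computes the slope, intercept, standard error of the slope, and the resulting t-value under H0: β1 = 0. The function name `slope_inference` and the five data points are made up purely for illustration; they do not come from any data set in this section.

```python
import math

def slope_inference(xs, ys):
    """Return (b1, b0, se1, t1, df) for simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx                    # point estimate of the slope
    b0 = y_bar - b1 * x_bar           # point estimate of the intercept
    # Sum of squared residuals: (y_i - y_hat_i)^2 with y_hat_i = b1 x_i + b0
    sse = sum((y - (b1 * x + b0)) ** 2 for x, y in zip(xs, ys))
    se1 = math.sqrt(sse / (n - 2) / sxx)  # standard error of the slope
    t1 = (b1 - 0) / se1                   # test statistic under H0: slope = 0
    return b1, b0, se1, t1, n - 2

# Hypothetical toy data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b1, b0, se1, t1, df = slope_inference(xs, ys)
print(f"slope = {b1:.4f}, intercept = {b0:.4f}")
print(f"SE of slope = {se1:.4f}, t = {t1:.4f}, df = {df}")
```

The degrees of freedom are n - 2 because two parameters, the slope and the intercept, are estimated from the data.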

Exploration 7.3.2. Hypothesis Testing: Slope - Possums.

Run the following code to download possum.csv as seen in Exploration 7.1.1, Activity 7.1.7, and Exploration 7.2.1, and create a linear model for head_l~skull_w and summarize it.

We're focused on the Coefficients, in particular the second row.

The first row, starting with (Intercept), gives the point estimate, standard error, test statistic, and p-value for the alternative hypothesis HA: β0 ≠ 0.

The second row, starting with skull_w, gives the point estimate, standard error, test statistic, and p-value for the alternative hypothesis HA: β1 ≠ 0. This is the row about slopes.

(a)

The first entry of this row gives us the point estimate for the slope. How does this compare to the slope found by running:

We call this value β1.

(b)

The second entry of this row gives us the standard error for the slope. Call this value SE1. Find the t-value for β1 by computing

\[ t_1 = \frac{\beta_1 - 0}{SE_1}. \]

How does this value compare to the third entry?

(c)

According to the summary, how many degrees of freedom are there?

(d)

Compute P(T > t1) for the standard t-variable T with the appropriate degrees of freedom. Then double it to obtain the p-value for the alternative hypothesis HA: β1 ≠ 0.

Hint. Desmos

How does this value compare to the 4th entry?

(e)

Do we reject or fail to reject the null hypothesis H0: β1 = 0?

Remark 7.3.2.

The fourth entry of the second Coefficients: row gives the probability, assuming the true slope is 0, of obtaining a sample slope at least as steep as the one observed.

The fourth entry of the first Coefficients: row gives the probability, assuming the true intercept is 0, of obtaining a sample intercept at least as extreme as the one observed.
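The two-sided p-value described here can be approximated directly from the t-distribution. The sketch below is a minimal Python version using only the standard library: it integrates the t density numerically with Simpson's rule rather than calling a statistics package. The function names and the example inputs (t = 2.0 with 10 degrees of freedom) are hypothetical and chosen only to show the call.

```python
import math

def t_pdf(x, df):
    """Density of the standard t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def central_area(t, df, steps=2000):
    """Approximate P(-|t| < T < |t|) with Simpson's rule (steps is even)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0, df) + t_pdf(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    # Integrate over [0, |t|] and double it, using the symmetry of the density.
    return 2 * total * h / 3

def two_sided_p_value(t, df):
    """P(|T| >= |t|): both tails beyond the observed test statistic."""
    return 1 - central_area(t, df)

# Hypothetical inputs, for illustration only.
print(two_sided_p_value(2.0, 10))
```

Because the t-distribution is symmetric, this equals twice the one-tail probability P(T > |t|), which is the doubling step used in Exploration 7.3.2(d).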

Subsection 7.3.2 Confidence Intervals

Remark 7.3.3.

For either βi (i = 0 or 1), we compute the C% confidence interval as follows:

\[ \left[\ \text{point estimate for } \beta_i - SE_i \cdot t,\ \ \text{point estimate for } \beta_i + SE_i \cdot t\ \right] \]

where t is the t-value such that P(−t < T < t) = C% for the standard t-variable T with the appropriate degrees of freedom.
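The interval above needs the critical value t, which is usually read from software or a table. As a rough illustration, the Python sketch below (standard library only) finds t by bisection on a numerically integrated t density and then builds the interval. All function names are illustrative, the estimate, standard error, and degrees of freedom in the example call are hypothetical, and the numerical integration is intended for typical confidence levels such as 90%, 95%, or 99%.

```python
import math

def t_pdf(x, df):
    """Density of the standard t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def central_area(t, df, steps=2000):
    """Approximate P(-t < T < t) for t >= 0 with Simpson's rule."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return 2 * total * h / 3

def critical_t(conf, df):
    """Find t with P(-t < T < t) = conf (0 < conf < 1) by bisection."""
    lo, hi = 0.0, 1.0
    while central_area(hi, df) < conf:  # widen until the answer is bracketed
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if central_area(mid, df) < conf:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def confidence_interval(estimate, se, conf, df):
    """Return (lower, upper) = estimate -/+ se * t for the given level."""
    t = critical_t(conf, df)
    return estimate - se * t, estimate + se * t

# Hypothetical values, for illustration only.
lower, upper = confidence_interval(estimate=0.6, se=0.28, conf=0.95, df=3)
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
```

Bisection is used here only because it is easy to verify; any root finder or a t-table would serve the same purpose.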

Exploration 7.3.3. Confidence Intervals: βi - Possums.

We continue from Exploration 7.3.2.

(a)

Find a t so that P(−t < T < t) = C% for the standard t-variable T with the appropriate degrees of freedom, taking C = 95 to match part (b).

(b)

Use the point estimate and standard error for β1, t and Remark 7.3.3 to compute a 95% confidence interval for β1.

(c)

Explain what this confidence interval means within the context of the problem.

Subsection 7.3.3 Putting it together

Activity 7.3.4. Inference for SP500 companies.

Run the following code to download sp500.csv, a data set containing information on a sample of 50 S&P 500 companies, and show its variable names.

Click here to learn more about this data set: https://www.openintro.org/data/index.php?data=sp500.

(a)

Run the following to create and summarize a linear model with debt (the debt in millions of dollars) as the explanatory variable and profit_margin (the percent of earnings that is profit) as the response variable:

(b)

Run the following to plot profit_margin of these companies against the debt and draw a regression line:

(c)

State what the slope β1 means in the context of this problem.

(d)

Interpret the p-value for the alternative hypothesis HA: β1 ≠ 0 in the context of this problem.

(e)

Do we reject the null hypothesis that H0:β1=0? What does that say about the relationship between profit margin and debt?

(f)

Find a t so that P(−t < T < t) = C% for the standard t-variable T with the appropriate degrees of freedom, taking C = 95 to match part (g).

(g)

Use the point estimate and standard error for β1, t and Remark 7.3.3 to compute a 95% confidence interval for β1.

(h)

Explain what this confidence interval means within the context of the problem.

Activity 7.3.5. Inference for Nutrition and Starbucks.

Run the following code to download starbucks.csv, a data set containing nutritional information about 77 Starbucks menu items, and show its variable names.

Click here to learn more about this data set: https://www.openintro.org/data/index.php?data=starbucks.

(a)

Run the following to create and summarize a linear model with protein (the protein content of an item in grams) as the explanatory variable and calories (the calories of each item, measured in, well, calories) as the response variable:

(b)

Run the following to plot calories of these items against the protein and draw a regression line:

(c)

State what the slope β1 means in the context of this problem.

(d)

Interpret the p-value for the alternative hypothesis HA: β1 ≠ 0 in the context of this problem.

(e)

Do we reject the null hypothesis that H0:β1=0? What does that say about the relationship between calories and protein?

(f)

Find a t so that P(−t < T < t) = C% for the standard t-variable T with the appropriate degrees of freedom, taking C = 95 to match part (g).

(g)

Use the point estimate and standard error for β1, t and Remark 7.3.3 to compute a 95% confidence interval for β1.

(h)

Explain what this confidence interval means within the context of the problem.

(i)

Repeat this for any other pair of numerical variables.