Inference and Linear Regression (R3)

Section 7.3 Inference and Linear Regression (R3)

The slope and vertical intercept, $β_{1}$ and $β_{0}$, that we computed in Section 7.2 are determined by a sample of (pairs of) random variables and, as such, are merely point estimates of the true parameters. As usual, there is some chance that our sample is not representative of the population.

In this section, we use our sample data to make inferences on the true parameters.

Exploration 7.3.1. Estimating Regression.

Let $X$ and $Y$ be variables where $Y = 0.8 X + 5 + ϵ$, and $ϵ$ is normally distributed with mean 0 and standard deviation 3.

(a)

What line probably best fits the relationship between X and Y?

(b)

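Run the following code to generate n=10 random X and Y values, plot them, and draw both the regression line and the line from (a):

    n = 10
    X = runif(n, 0, 10)                  # explanatory values, uniform on [0, 10]
    Y = 0.8*X + 5 + rnorm(n, 0, 3)       # response: true line plus normal noise (sd 3)
    mod = lm(Y ~ X)                      # least-squares regression of Y on X
    plot(X, Y, pch=19)
    abline(mod)                          # fitted regression line (solid)
    abline(5, 0.8, col="blue", lty=2)    # true line y = 0.8x + 5 (dashed blue)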

Run it a few times. What do you notice?

(c)

Set n=100 and run it again. Now what do you notice?
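
The comparison between the two sample sizes can also be made numerically. The following is a minimal sketch rather than part of the original exploration: the helper `simulate_slopes`, the seed, and the choice of 1000 repetitions are all assumptions made for illustration. It refits the model on many fresh samples and summarizes the fitted slopes for each sample size.

```r
# Sketch: how much does the fitted slope vary from sample to sample?
# simulate_slopes() is a hypothetical helper, not something defined in the text.
set.seed(1)  # arbitrary seed so the run is reproducible

simulate_slopes <- function(n, reps = 1000) {
  replicate(reps, {
    X <- runif(n, 0, 10)
    Y <- 0.8 * X + 5 + rnorm(n, 0, 3)   # same model as in the exploration
    unname(coef(lm(Y ~ X))["X"])        # keep only the fitted slope
  })
}

slopes_10  <- simulate_slopes(10)
slopes_100 <- simulate_slopes(100)

# Center and spread of the fitted slopes for each sample size
c(mean_n10 = mean(slopes_10), sd_n10 = sd(slopes_10))
c(mean_n100 = mean(slopes_100), sd_n100 = sd(slopes_100))
```

Comparing the two standard deviations gives a rough sense of how strongly the point estimate of the slope depends on the particular sample drawn.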

Subsection 7.3.1 Hypothesis Testing

Remark 7.3.1.

When it comes to linear regression, we focus on the following pair of hypotheses:

$H_{0} : β_{1} = 0$, or the slope is zero. So changes in the explanatory variable do not result in an average change in the response variable.
$H_{A} : β_{1} \neq 0$, or the slope is not zero. So changes in the explanatory variable do result in an average change in the response variable.

This is a numerical hypothesis test similar to what we have done before. However, we are not simply given a list of slopes from which to compute a sample standard deviation. The computations here are tedious, so we use technology to perform them. If one wished to compute the standard error for $β_{1}$ by hand, one would do so via:

$$SE_{1} = \sqrt{\frac{1}{n - 2} \cdot \frac{\sum (y_{i} - \hat{y}_{i})^{2}}{\sum (x_{i} - \bar{x})^{2}}}$$

where $\hat{y}_{i} = β_{1} \cdot x_{i} + β_{0}$ is the value of $y$ predicted by the linear model.
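
To see the formula in action, here is a minimal sketch that evaluates it directly and compares the result with the standard error reported by `summary()`. The simulated data, the seed, and the variable names `se_by_hand` and `se_from_r` are assumptions made for illustration, not values from the text.

```r
# Sketch: standard error of the slope from the formula, checked against summary().
# The data are simulated (an assumption), not taken from the text.
set.seed(2)
n <- 30
x <- runif(n, 0, 10)
y <- 0.8 * x + 5 + rnorm(n, 0, 3)

mod   <- lm(y ~ x)
y_hat <- fitted(mod)   # predicted values beta_1 * x_i + beta_0

# (1 / (n - 2)) * (sum of squared residuals) / (sum of squared deviations of x)
se_by_hand <- sqrt((1 / (n - 2)) * sum((y - y_hat)^2) / sum((x - mean(x))^2))
se_from_r  <- summary(mod)$coefficients["x", "Std. Error"]

c(by_hand = se_by_hand, from_summary = se_from_r)
```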

Exploration 7.3.2. Hypothesis Testing: Slope - Possums.

Run the following code to download possum.csv, as seen in Exploration 7.1.1, Activity 7.1.7, and Exploration 7.2.1, create a linear model for head_l~skull_w, and summarize it.

    possum = read.csv("https://github.com/TienChih/tbil-stats/raw/main/data/possum.csv")
    possummod = lm(head_l ~ skull_w, data=possum)   # head length explained by skull width
    summary(possummod)                              # coefficient table, degrees of freedom, etc.

We're focused on the Coefficients, in particular the second row.

The first row, starting with (Intercept), gives the point estimate, standard error, test statistic, and $p$-value for the alternative hypothesis $H_{A} : β_{0} \neq 0$.

The second row, starting with skull_w, gives the point estimate, standard error, test statistic, and $p$-value for the alternative hypothesis $H_{A} : β_{1} \neq 0$. This is the row about slopes.

(a)

The first entry of this row gives us the point estimate for the slope. How does this compare to the slope found by running:

We call this value $β_{1}$.
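
One way to obtain a slope for comparison, offered here as an assumption rather than as the text's own computation, uses the identity that the least-squares slope equals the covariance of the two variables divided by the variance of the explanatory variable. The sketch below checks this on simulated data, so the numbers are illustrative and are not the possum values.

```r
# Sketch: slope stored in a fitted model versus slope from cov(x, y) / var(x).
# Simulated data (an assumption); variable names are illustrative only.
set.seed(3)
x <- runif(40, 50, 70)
y <- 1.5 * x + 10 + rnorm(40, 0, 2)

mod <- lm(y ~ x)
slope_from_model   <- unname(coef(mod)["x"])   # first entry of the slope row
slope_from_formula <- cov(x, y) / var(x)       # least-squares slope identity

c(model = slope_from_model, formula = slope_from_formula)
```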

(b)

The second entry of this row gives us the standard error for the slope. Call this value $SE_{1}$. Find the $t$-value for $β_{1}$ by computing

$$t_{1} = \frac{β_{1} - 0}{SE_{1}}.$$

How does this value compare to the third entry?
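
The same division can be carried out in R by reading the entries straight from the coefficient table. This is a minimal sketch on simulated data (an assumption, so the numbers differ from the possum output); it relies on the column names `Estimate`, `Std. Error`, and `t value` that `summary()` uses for a model fitted with `lm()`.

```r
# Sketch: t-value computed as estimate / standard error, versus the tabulated value.
set.seed(4)
x <- runif(25, 0, 10)
y <- 0.8 * x + 5 + rnorm(25, 0, 3)

# Coefficient table: columns Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(lm(y ~ x))$coefficients

beta_1 <- coefs["x", "Estimate"]
se_1   <- coefs["x", "Std. Error"]
t_1    <- (beta_1 - 0) / se_1   # the null hypothesis value of the slope is 0

c(by_hand = t_1, from_summary = coefs["x", "t value"])
```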

(c)

According to the summary, how many degrees of freedom are there?

(d)

Compute $P(t > t_{1})$ on the standard $t$-distribution with the appropriate degrees of freedom. Then double it to obtain the $p$-value for the alternative hypothesis $H_{A} : β_{1} \neq 0$.

Hint. Desmos

How does this value compare to the fourth entry?
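
As an alternative to a graphing tool, the tail area can be computed in R with `pt()`. The sketch below uses simulated data (an assumption) and takes the absolute value of the test statistic so that the doubling also works when the slope is negative.

```r
# Sketch: two-sided p-value from the t-value and the residual degrees of freedom.
set.seed(5)
x <- runif(25, 0, 10)
y <- 0.8 * x + 5 + rnorm(25, 0, 3)
mod   <- lm(y ~ x)
coefs <- summary(mod)$coefficients

t_1 <- coefs["x", "t value"]
df  <- df.residual(mod)   # n - 2 when there is one explanatory variable

# Upper-tail area beyond |t_1|, doubled for the two-sided alternative
p_by_hand <- 2 * pt(abs(t_1), df, lower.tail = FALSE)

c(df = df, by_hand = p_by_hand, from_summary = coefs["x", "Pr(>|t|)"])
```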

(e)

Do we reject or fail to reject the null hypothesis $H_{0} : β_{1} = 0$?

Remark 7.3.2.

The fourth entry of the second Coefficients row gives the probability that, if the true slope were 0, we would obtain a sample slope as steep or steeper.

The fourth entry of the first Coefficients row gives the probability that, if the true intercept were 0, we would obtain a sample intercept as extreme or more extreme.
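
This interpretation can be illustrated by simulation. In the sketch below, which is an illustration under assumed settings (sample size 20, 2000 repetitions, an arbitrary seed) and not part of the original remark, the data are generated with a true slope of 0, and we record how often the slope's $p$-value falls below 0.05.

```r
# Sketch: when the true slope is 0, small p-values should be correspondingly rare.
set.seed(6)
p_values <- replicate(2000, {
  x <- runif(20, 0, 10)
  y <- 5 + rnorm(20, 0, 3)   # true slope is 0: y does not depend on x
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
})

# Proportion of samples whose slope looks "significant" at the 5% level
mean(p_values < 0.05)
```

If the remark's reading of the $p$-value is right, this proportion should land close to 0.05, since each such sample is a steep slope obtained purely by chance.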

Subsection 7.3.2 Confidence Intervals

Remark 7.3.3.

For either $β_{i}$, we compute the $C\%$ confidence interval as follows:

$$[\text{point estimate for } β_{i} - SE_{i} \cdot t^{*},\ \text{point estimate for } β_{i} + SE_{i} \cdot t^{*}]$$

where $t^{*}$ is the $t$-value such that $P(-t^{*} < t < t^{*}) = C\%$ for the standard $t$-variable with the appropriate degrees of freedom.
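
The interval can be assembled in R with `qt()` and checked against the built-in `confint()`. This is a minimal sketch on simulated data (an assumption); the names `level`, `t_star`, and `ci_by_hand` are illustrative.

```r
# Sketch: confidence interval for the slope by hand, compared with confint().
set.seed(7)
x <- runif(25, 0, 10)
y <- 0.8 * x + 5 + rnorm(25, 0, 3)
mod   <- lm(y ~ x)
coefs <- summary(mod)$coefficients

level  <- 0.95
# t* leaves (1 - level) / 2 in each tail, so P(-t* < t < t*) = level
t_star <- qt((1 + level) / 2, df.residual(mod))

est <- coefs["x", "Estimate"]
se  <- coefs["x", "Std. Error"]
ci_by_hand <- c(est - se * t_star, est + se * t_star)

ci_by_hand
confint(mod, "x", level = level)
```

Changing `"x"` to `"(Intercept)"` in both places gives the corresponding interval for $β_{0}$.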

Exploration 7.3.3. Confidence Intervals: $β_{i}$ - Possums.

We continue from Exploration 7.3.2.

(a)

Find a $t^{*}$ so that $P(-t^{*} < t < t^{*}) = C\%$, with $C = 95$, for the appropriate degrees of freedom.

(b)

Use the point estimate and standard error for $β_{1}$, together with $t^{*}$ and Remark 7.3.3, to compute a 95% confidence interval for $β_{1}$.

(c)

Explain what this confidence interval means within the context of the problem.

Subsection 7.3.3 Putting it together

Activity 7.3.4. Inference for SP500 companies.

Run the following code to download sp500.csv, a data set containing information on a sample of 50 S&P 500 companies, and show its variable names.

    sp500 = read.csv("https://github.com/TienChih/tbil-stats/raw/main/data/sp500.csv")

    names(sp500)   # list the variable names

Learn more about this data set: https://www.openintro.org/data/index.php?data=sp500

(a)

Run the following to create and summarize a linear model with debt (the debt in millions of dollars) as the explanatory variable and profit_margin (the percent of earnings that is profit) as the response variable:

    sp500mod = lm(profit_margin ~ debt, data=sp500)   # profit margin explained by debt
    summary(sp500mod)

(b)

Run the following to plot profit_margin of these companies against their debt and draw the regression line:

    plot(sp500$debt, sp500$profit_margin)   # debt on the horizontal axis
    abline(sp500mod, col="red")             # add the fitted regression line

(c)

State what the slope $β_{1}$ means in the context of this problem.

(d)

Interpret the $p$-value for the alternative hypothesis $H_{A} : β_{1} \neq 0$ in the context of this problem.

(e)

Do we reject the null hypothesis $H_{0} : β_{1} = 0$? What does that say about the relationship between profit margin and debt?

(f)

Find a $t^{*}$ so that $P(-t^{*} < t < t^{*}) = C\%$, with $C = 95$, for the appropriate degrees of freedom.

(g)

Use the point estimate and standard error for $β_{1}$, together with $t^{*}$ and Remark 7.3.3, to compute a 95% confidence interval for $β_{1}$.

(h)

Explain what this confidence interval means within the context of the problem.

Activity 7.3.5. Inference for Nutrition and Starbucks.

Run the following code to download starbucks.csv, a data set containing information about 77 Starbucks menu items and their nutritional value, and show its variable names.

    starbucks = read.csv("https://github.com/TienChih/tbil-stats/raw/main/data/starbucks.csv")

    names(starbucks)   # list the variable names

Learn more about this data set: https://www.openintro.org/data/index.php?data=starbucks

(a)

Run the following to create and summarize a linear model with protein (the protein content of an item in grams) as the explanatory variable and calories (the calories of each item, measured in, well, calories) as the response variable:

    starbucksmod = lm(calories ~ protein, data=starbucks)   # calories explained by protein
    summary(starbucksmod)

(b)

Run the following to plot calories of these items against their protein content and draw the regression line:

    plot(starbucks$protein, starbucks$calories)   # protein on the horizontal axis
    abline(starbucksmod, col="red")               # add the fitted regression line

(c)

State what the slope $β_{1}$ means in the context of this problem.

(d)

Interpret the $p$-value for the alternative hypothesis $H_{A} : β_{1} \neq 0$ in the context of this problem.

(e)

Do we reject the null hypothesis $H_{0} : β_{1} = 0$? What does that say about the relationship between calories and protein?

(f)

Find a $t^{*}$ so that $P(-t^{*} < t < t^{*}) = C\%$, with $C = 95$, for the appropriate degrees of freedom.

(g)

Use the point estimate and standard error for $β_{1}$, together with $t^{*}$ and Remark 7.3.3, to compute a 95% confidence interval for $β_{1}$.

(h)

Explain what this confidence interval means within the context of the problem.

(i)

Repeat this for any other pair of numerical variables.
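
When repeating the analysis for several pairs of variables, it may help to wrap the steps in a function. The following is a sketch under stated assumptions: `slope_inference` is a hypothetical helper, not something defined in the text or in base R, and the data frame `toy` is simulated so that the block runs without a download.

```r
# slope_inference() is a hypothetical helper (not from the text or base R).
# It fits response ~ explanatory and collects the slope's inference results.
slope_inference <- function(data, response, explanatory, level = 0.95) {
  mod   <- lm(reformulate(explanatory, response = response), data = data)
  coefs <- summary(mod)$coefficients
  ci    <- confint(mod, explanatory, level = level)
  list(
    estimate  = coefs[explanatory, "Estimate"],
    std_error = coefs[explanatory, "Std. Error"],
    t_value   = coefs[explanatory, "t value"],
    p_value   = coefs[explanatory, "Pr(>|t|)"],
    df        = df.residual(mod),
    conf_int  = as.numeric(ci)   # lower and upper endpoints
  )
}

# Usage on simulated data standing in for a real data frame (values are arbitrary)
set.seed(8)
toy <- data.frame(protein = runif(50, 0, 30))
toy$calories <- 12 * toy$protein + 150 + rnorm(50, 0, 40)

result <- slope_inference(toy, "calories", "protein")
str(result)
```

With a downloaded data frame such as `starbucks`, the call would take the same form, with the two column names of interest passed as strings.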
