Suppose I run a simple linear regression of an outcome variable on a predictor variable. If I save the fitted values from this regression and then run a second regression of the outcome variable on the fitted values, what will I get? For extra credit: how will the R-squared from the second regression compare to that from the first regression?
Example: Height and Handspan
Here’s a simple example: a regression of height, measured in inches, on handspan, measured in centimeters.
library(tidyverse)
library(broom)
dat <- read_csv('https://ditraglia.com/data/height-handspan.csv')
ggplot(dat, aes(y = height, x = handspan)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", color = "purple") +
  labs(y = "Height (in)", x = "Handspan (cm)")
# Fit the regression
reg1 <- lm(height ~ handspan, data = dat)
tidy(reg1)
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    40.9     1.67        24.5 9.19e-76
## 2 handspan        1.27    0.0775      16.3 3.37e-44
As expected, bigger people are bigger in all dimensions, on average, so we see a positive relationship between handspan and height. Now let’s save the fitted values from this regression and run a second regression of height on the fitted values:
dat <- reg1 |>
  augment(dat)
reg2 <- lm(height ~ .fitted, data = dat)
tidy(reg2)
## # A tibble: 2 × 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept) -1.76e-13    4.17   -4.23e-14 1.00e+ 0
## 2 .fitted      1.00e+ 0    0.0612  1.63e+ 1 3.37e-44
The intercept isn’t quite zero, but it’s about as close as we can reasonably expect to get on a computer, and the slope is exactly one. Now how about the R-squared? Let’s check:
glance(reg1)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.452         0.450  3.02      267. 3.37e-44     1  -822. 1650. 1661.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(reg2)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.452         0.450  3.02      267. 3.37e-44     1  -822. 1650. 1661.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
The R-squared values from the two regressions are identical! Surprised? Now’s your last chance to think it through on your own before I give my solution.
Solution
Suppose we wanted to choose \(\alpha_0\) and \(\alpha_1\) to minimize \(\sum_{i=1}^n (Y_i - \alpha_0 - \alpha_1 \widehat{Y}_i)^2\) where \(\widehat{Y}_i = \widehat{\beta}_0 + \widehat{\beta}_1 X_i\). This is equivalent to minimizing
\[
\sum_{i=1}^n \left[Y_i - (\alpha_0 + \alpha_1 \widehat{\beta}_0) - (\alpha_1\widehat{\beta}_1)X_i\right]^2.
\]
By construction \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) minimize \(\sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2\), and setting \(\alpha_0 = 0\) and \(\alpha_1 = 1\) makes the intercept and slope in the expression above equal to exactly these values. So unless \(\widehat{\alpha}_0 = 0\) and \(\widehat{\alpha}_1 = 1\), the second regression would deliver an intercept and slope with a smaller sum of squared residuals than \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\): we’d have a contradiction!
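As a quick numerical sanity check, here’s a short sketch (my own addition, not part of the original exercise) that recombines the coefficients from the two regressions and confirms that the implied intercept and slope in terms of handspan reproduce those from the first regression:
# Sketch: alpha_0 + alpha_1 * beta_0 and alpha_1 * beta_1 should reproduce
# the intercept and slope from the first regression.
a <- coef(reg2)  # (alpha_0_hat, alpha_1_hat)
b <- coef(reg1)  # (beta_0_hat, beta_1_hat)
implied <- c(a[1] + a[2] * b[1], a[2] * b[2])
all.equal(unname(implied), unname(b))  # TRUE, up to numerical precision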
Similar reasoning explains why the R-squared values for the two regressions are the same. The R-squared of a regression equals \(1 - \text{SS}_{\text{residual}} / \text{SS}_{\text{total}}\) where
\[
\text{SS}_{\text{total}} = \sum_{i=1}^n (Y_i - \bar{Y})^2, \quad
\text{SS}_{\text{residual}} = \sum_{i=1}^n (Y_i - \widehat{Y}_i)^2.
\]
The total sum of squares is the same for both regressions because they have the same outcome variable. The residual sum of squares is the same because \(\widehat{\alpha}_0 = 0\) and \(\widehat{\alpha}_1 = 1\) together imply that both regressions have the same fitted values.
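To make this concrete, here’s another sketch (again my own addition) that computes both R-squared values directly from these sums of squares and checks that the fitted values coincide:
# Sketch: compute R-squared by hand from SS_total and SS_residual.
ss_total  <- sum((dat$height - mean(dat$height))^2)
ss_resid1 <- sum(residuals(reg1)^2)
ss_resid2 <- sum(residuals(reg2)^2)
c(1 - ss_resid1 / ss_total, 1 - ss_resid2 / ss_total)  # both equal 0.452
all.equal(unname(fitted(reg1)), unname(fitted(reg2)))  # identical fitted values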
Here I focused on the case of a simple linear regression, one with a single predictor variable, but the same basic idea holds in general.
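For instance, here’s a sketch using the built-in mtcars data (my own example, not the height-handspan data) with several predictors at once:
# Sketch: the same result holds for a multiple regression.
reg_multi <- lm(mpg ~ wt + hp + disp, data = mtcars)
reg_fit   <- lm(mpg ~ fitted(reg_multi), data = mtcars)
coef(reg_fit)  # intercept is (numerically) zero, slope is one
c(summary(reg_multi)$r.squared, summary(reg_fit)$r.squared)  # identical R-squared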