The result that I like to call Yule's Rule, more commonly known as the "Frisch-Waugh-Lovell (FWL) theorem," shows how to calculate the regression slope coefficient for one predictor by carrying out additional "auxiliary" regressions that adjust for all of the other predictors.
You've probably encountered this result if you've studied introductory econometrics.
But it may surprise you to learn that there are actually two variants of the FWL theorem, each with its own pros and cons.
Today we'll take a look at the less familiar version and then circle back to understand what makes the more familiar one a textbook staple.
Simulation Example
Let's start with a little simulation.
First we'll generate 5000 observations of the predictors \(X\) and \(W\) from a joint normal distribution with standard deviations of one, means of zero, and a correlation of 0.5.
set.seed(1066)
library(mvtnorm)
# Simulate linear regression with two predictors: X and W
covariance_matrix <- matrix(
  c(1, 0.5, 0.5, 1),
  nrow = 2
)
n_sims <- 5000
x_w <- rmvnorm(
  n = n_sims,
  mean = c(0, 0),
  sigma = covariance_matrix
)
x <- x_w[, 1]
w <- x_w[, 2]
Next we'll simulate the outcome variable \(Y\), where the true coefficient on \(X\) is one and the true coefficient on \(W\) is -1, adding standard normal errors.
y <- 0.5 + x - w + rnorm(n_sims)
Now we'll run the "auxiliary regressions." The first one regresses \(X\) on \(W\) and saves the residuals. Call these residuals x_tilde.
# Residuals from regression of X on W
x_tilde <- lm(x ~ w) |>
  residuals()
The next one regresses \(Y\) on \(W\) and saves the residuals. Call these residuals y_tilde.
# Residuals from regression of Y on W
y_tilde <- lm(y ~ w) |>
  residuals()
To make the code that follows a little simpler, I'll also create a helper function that runs a linear regression and returns the coefficients after stripping away any variable names.
get_coef <- function(formula) {
  lm(formula) |>
    coef() |>
    unname()
}
Now we're ready to compare some regressions!
The "Long Regression" is a standard linear regression of \(Y\) on \(X\) and \(W\).
The "FWL Standard" is a regression of y_tilde on x_tilde.
In other words, it regresses the residuals of \(Y\) on the residuals of \(X\).
The FWL theorem as it's usually encountered in textbooks implies that we should recover the same coefficient on \(X\) in "Long Regression" and in "FWL Standard," and indeed the simulation bears this out.
c(
"Lengthy Regression" = get_coef(y ~ x + w)[2],
"FWL Customary" = get_coef(y_tilde ~ x_tilde - 1)[1],
"FWL Different" = get_coef(y ~ x_tilde)[2]
)
## Long Regression    FWL Standard FWL Alternative 
##       0.9937046       0.9937046       0.9937046
But now take a look at "FWL Alternative": this is a regression of \(Y\) on x_tilde.
Compared to the standard FWL approach, this version does not residualize \(Y\) with respect to \(W\).
But it still gives us exactly the same coefficient on \(X\) as the other two regressions.
That leaves us with two unanswered questions:
- Why does the "alternative" FWL approach work?
- Given that the alternative approach works, why does anyone ever teach the "standard" version?
In the rest of this post we'll answer both questions using simple algebra and the properties of linear regression.
There are plenty of deep ideas here, but there's no need to bring out the big matrix algebra guns to explain them.
A Bit of Notation
First we need a bit of notation.
I find it a bit simpler to work with population linear regressions rather than sample regressions, but the ideas are the same either way.
So if you prefer to put "hats" on everything and work with sums rather than expectations and covariances, be my guest!
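(If it helps to see the correspondence concretely, here's a quick sanity check using the simulated data from above: the sample analogue of a population slope like \(\text{Cov}(X,W)/\text{Var}(W)\) is just a ratio of a sample covariance to a sample variance, and it matches the slope that lm() computes.)
# Sample analogue of the population slope Cov(X, W) / Var(W):
# it coincides with the OLS slope from lm() up to floating-point error
all.equal(unname(coef(lm(x ~ w))[2]), cov(x, w) / var(w))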
First we'll define the "long regression" as a population linear regression of \(Y\) on \(X\) and \(W\), namely
\[
Y = \beta_0 + \beta_X X + \beta_W W + U, \quad \mathbb{E}(U) = \text{Cov}(X,U) = \text{Cov}(W,U) = 0.
\]
Next I'll define two more population linear regressions: first the regression of \(X\) on \(W\),
\[
X = \gamma_0 + \gamma_W W + \tilde{X}, \quad \mathbb{E}(\tilde{X}) = \text{Cov}(W,\tilde{X}) = 0
\]
and second the regression of \(Y\) on \(W\),
\[
Y = \delta_0 + \delta_W W + \tilde{Y}, \quad \mathbb{E}(\tilde{Y}) = \text{Cov}(W,\tilde{Y}) = 0.
\]
I've already linked to a post making this point, but it bears repeating: all of the properties of the error terms \(U\), \(\tilde{X}\), and \(\tilde{Y}\) that I've stated here hold by construction.
They are not assumptions; they are simply what defines an error term in a population linear regression.
Why does the "alternative" FWL approach work?
As mentioned in the discussion of our simulation experiment above, the standard FWL theorem says that a regression of \(\tilde{Y}\) on \(\tilde{X}\) with no intercept gives us \(\beta_X\), while the alternative version says that a regression of \(Y\) on \(\tilde{X}\) with an intercept also gives us \(\beta_X\).
It's the second claim that we'll prove now.1
The alternative FWL theorem claims that \(\beta_X = \text{Cov}(Y,\tilde{X})/\text{Var}(\tilde{X})\).
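Before working through the algebra, here's a quick sanity check: in our simulation the sample version of this ratio matches the coefficient on \(X\) from the long regression. (This is just a numerical illustration with the variables defined above, not part of the proof.)
# Sample analogue of Cov(Y, X_tilde) / Var(X_tilde) matches the
# long-regression coefficient on X (up to floating-point error)
all.equal(cov(y, x_tilde) / var(x_tilde), get_coef(y ~ x + w)[2])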
Since \(\tilde{X}\) is uncorrelated with \(W\) by construction, we can expand the numerator as follows:
\[
\text{Cov}(Y,\tilde{X}) = \text{Cov}(\beta_0 + \beta_X X + \beta_W W + U, \tilde{X}) = \beta_X \text{Cov}(X,\tilde{X}) + \text{Cov}(U,\tilde{X}).
\]
But since \(\tilde{X} = (X - \gamma_0 - \gamma_W W)\) we also have
\[
\text{Cov}(U, \tilde{X}) = \text{Cov}(U, X - \gamma_0 - \gamma_W W) = \text{Cov}(U,X) - \gamma_W \text{Cov}(U,W) = 0
\]
since \(X\) and \(W\) are uncorrelated with \(U\) by construction.
So to prove our original claim it suffices to show that \(\text{Cov}(X,\tilde{X}) = \text{Var}(\tilde{X})\).
To see why this holds, first write
\[
\text{Cov}(X, \tilde{X}) = \text{Cov}(X, X - \gamma_0 - \gamma_W W) = \text{Var}(X) - \gamma_W \text{Cov}(X,W)
\]
using \(\text{Cov}(X,X) = \text{Var}(X)\).
Next, expand \(\text{Var}(\tilde{X})\) as follows:
\[
\text{Var}(\tilde{X}) = \text{Var}(X - \gamma_0 - \gamma_W W) = \text{Var}(X) + \gamma_W^2 \text{Var}(W) - 2 \gamma_W \text{Cov}(X,W)
\]
and then subtract \(\text{Cov}(X,\tilde{X})\) from \(\text{Var}(\tilde{X})\):
\[
\text{Var}(\tilde{X}) - \text{Cov}(X,\tilde{X}) = \gamma_W \left[ \gamma_W \text{Var}(W) - \text{Cov}(X,W) \right].
\]
This shows that \(\text{Var}(\tilde{X})\) and \(\text{Cov}(X,\tilde{X})\) are equal whenever \(\gamma_W \text{Var}(W) = \text{Cov}(X,W)\).
But since \(\gamma_W\) is the coefficient from the regression of \(X\) on \(W\), we already know that \(\gamma_W = \text{Cov}(X,W)/\text{Var}(W)\)!
With a bit of algebra using the properties of covariance and the definition of a population linear regression, we've shown that the alternative FWL theorem holds.
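The key step, \(\text{Cov}(X,\tilde{X}) = \text{Var}(\tilde{X})\), also holds exactly for the sample residuals in our simulation, since OLS residuals are uncorrelated with the regressors by construction. Here's a quick check (illustration only; it should return TRUE):
# Because x_tilde contains the OLS residuals from regressing X on W, the
# sample covariance of X with x_tilde equals the sample variance of x_tilde
all.equal(cov(x, x_tilde), var(x_tilde))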
What's different about the "usual" FWL theorem?
At this point you may be wondering why anyone teaches the "usual" version of the FWL theorem at all.
If that extra short regression of \(Y\) on \(W\) isn't needed to learn \(\beta_X\), why bother?
To answer this question, we'll start by re-writing the long regression in two different ways.
First, we'll substitute \(X = \gamma_0 + \gamma_W W + \tilde{X}\) into the long regression and re-arrange, yielding
\[
Y = (\beta_0 + \beta_X \gamma_0) + \beta_X \tilde{X} + (\beta_W + \beta_X \gamma_W) W + U.
\]
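Because \(\{1, \tilde{X}, W\}\) spans the same space as \(\{1, X, W\}\), the same reparametrization works for the sample OLS fit. Here's a quick check in the simulation; beta_hat and gamma_hat are just names I'm introducing here for the lm() estimates, and the two printed vectors should agree.
# Regressing Y on x_tilde and W reproduces the long-regression fit, with the
# coefficients rearranged exactly as in the display above
beta_hat <- coef(lm(y ~ x + w))
gamma_hat <- coef(lm(x ~ w))
coef(lm(y ~ x_tilde + w))
c(beta_hat["(Intercept)"] + beta_hat["x"] * gamma_hat["(Intercept)"],  # intercept
  beta_hat["x"],                                                       # slope on x_tilde
  beta_hat["w"] + beta_hat["x"] * gamma_hat["w"])                      # slope on w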
Next we'll substitute \(Y = \delta_0 + \delta_W W + \tilde{Y}\) into the left-hand side of that equation and rearrange to isolate \(\tilde{Y}\).
This leaves us with
\[
\tilde{Y} = (\beta_0 + \beta_X \gamma_0 - \delta_0) + \beta_X \tilde{X} + (\beta_W + \beta_X \gamma_W - \delta_W) W + U.
\]
Now we have two expressions, each with \(\beta_X \tilde{X}\) as one of the terms on the right-hand side and \(U\) as another.
Notice that both expressions have an intercept and a term in which \(W\) is multiplied by a constant.
What's more, the intercepts are closely related across the two equations, as are the \(W\) coefficients.
I'm now going to make a bold assertion: the intercept and \(W\) coefficient in the second expression, the \(\tilde{Y}\) one, are both equal to zero,
\[
\beta_0 + \beta_X \gamma_0 - \delta_0 = 0, \quad \text{and} \quad \beta_W + \beta_X \gamma_W - \delta_W = 0.
\]
Perhaps you don't believe me, but just for the moment suppose that I'm right.
In that case it would immediately follow that
\[
\beta_0 + \beta_X \gamma_0 = \delta_0, \quad \text{and} \quad \beta_W + \beta_X \gamma_W = \delta_W
\]
leaving us with two simple linear regressions, namely
\[
\begin{aligned}
Y &= \delta_0 + \beta_X \tilde{X} + (\beta_W W + U) \\
\tilde{Y} &= \beta_X \tilde{X} + U.
\end{aligned}
\]
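(The sample version of this bold claim is easy to check in our simulation before we prove it: regressing y_tilde on both x_tilde and w should give an intercept and a coefficient on w that are zero to machine precision, along with a coefficient on x_tilde equal to the long-regression estimate.)
# Numerical check of the bold claim: the intercept and the coefficient
# on w should both be numerically zero
coef(lm(y_tilde ~ x_tilde + w))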
We're tantalizingly close to unraveling the mystery of why the "usual" FWL theorem is so popular.
But first we need to prove my bold claim from the previous paragraph.
To do so, we'll fall back on our old friend the omitted variable bias formula, also known as the regression anatomy formula:
\[
\begin{aligned}
\delta_W &\equiv \frac{\text{Cov}(Y,W)}{\text{Var}(W)} = \frac{\text{Cov}(\beta_0 + \beta_X X + \beta_W W + U, W)}{\text{Var}(W)} = \frac{\beta_W \text{Var}(W) + \beta_X \text{Cov}(X,W)}{\text{Var}(W)} \\
&= \beta_W + \beta_X \frac{\text{Cov}(X,W)}{\text{Var}(W)} = \beta_W + \beta_X \gamma_W.
\end{aligned}
\]
Thus, \(\beta_W + \beta_X \gamma_W - \delta_W = 0\) as claimed.
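The same identity holds exactly for the lm() estimates in our simulation, since the omitted variable bias formula is an algebraic property of OLS. Here's a quick check (b, g, and d are just shorthand names for the estimates; the result should be TRUE):
# Omitted variable bias formula in the sample: delta_W = beta_W + beta_X * gamma_W
b <- coef(lm(y ~ x + w))   # long regression
g <- coef(lm(x ~ w))       # regression of X on W
d <- coef(lm(y ~ w))       # short regression of Y on W
all.equal(unname(d["w"]), unname(b["w"] + b["x"] * g["w"]))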
One down, one more to go.
By definition, \(\delta_0 = \mathbb{E}(Y) - \delta_W \mathbb{E}(W)\).
Substituting the long regression for \(Y\), we have
\[
\begin{aligned}
\delta_0 &= \mathbb{E}(\beta_0 + \beta_X X + \beta_W W + U) - \delta_W \mathbb{E}(W) \\
&= \beta_0 + \beta_X \mathbb{E}(X) + (\beta_W - \delta_W) \mathbb{E}(W)
\end{aligned}
\]
by the linearity of expectation and the fact that \(\mathbb{E}(U) = 0\) by construction.
Now, we're trying to show that \(\delta_0 = \beta_0 + \beta_X \gamma_0\).
Substituting \(\gamma_0 = \mathbb{E}(X) - \gamma_W \mathbb{E}(W)\) into this expression gives
\[
\beta_0 + \beta_X \gamma_0 = \beta_0 + \beta_X [\mathbb{E}(X) - \gamma_W \mathbb{E}(W)] = \beta_0 + \beta_X \mathbb{E}(X) - \beta_X \gamma_W \mathbb{E}(W).
\]
Inspecting our work so far, we see that the two alternative expressions for \(\delta_0\) will be equal precisely when \(\beta_X \gamma_W = \delta_W - \beta_W\).
But re-arranging this gives \(\delta_W = \beta_W + \beta_X \gamma_W\), which we already proved above using the omitted variable bias formula!
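As with the slope identity, the intercept identity holds exactly for the sample estimates. Re-using b, g, and d from the check above, this should also return TRUE:
# The intercept identity delta_0 = beta_0 + beta_X * gamma_0 also holds exactly
all.equal(unname(d["(Intercept)"]),
          unname(b["(Intercept)"] + b["x"] * g["(Intercept)"]))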
Taking Stock
That was a lot of algebra, so let's spend some time thinking about the results.
We showed that
\[
\begin{aligned}
Y &= \delta_0 + \beta_X \tilde{X} + (\beta_W W + U) \\
\tilde{Y} &= \beta_X \tilde{X} + U.
\end{aligned}
\]
Now, if you'll permit me, I'd like to re-write that first equality as
\[
Y = \delta_0 + \beta_X \tilde{X} + V, \quad \text{where } V \equiv \beta_W W + U.
\]
Since \(\tilde{X}\) is uncorrelated with \(U\), as explained above, and since \(\mathbb{E}(U) = 0\) by construction, it follows that \(\tilde{Y} = \beta_X \tilde{X} + U\) is a bona fide population linear regression model.
If we regress \(\tilde{Y}\) on \(\tilde{X}\), the slope coefficient will be \(\beta_X\) and the error term will be \(U\).
This regression corresponds to the usual FWL theorem.
Notice that it has an intercept of zero and an error term that is identical to that of the long regression.
We can verify this using our simulation experiment from above as follows:
# Standard FWL has the same residuals as the long regression
u_hat <- resid(lm(y ~ x + w))
u_tilde <- resid(lm(y_tilde ~ x_tilde - 1))
all.equal(u_hat, u_tilde)
## [1] TRUE
# Standard FWL has an intercept of zero (to machine precision!)
coef(lm(y_tilde ~ x_tilde))[1] # fit with an intercept; check that it's (numerically) zero
## (Intercept)
## 6.260601e-18
So what about \(Y = \delta_0 + \beta_X \tilde{X} + V\)?
This is the regression that corresponds to the alternative FWL theorem.
Since \(V = \beta_W W + U\) and \(\tilde{X}\) is uncorrelated with both \(U\) and \(W\), this too is a population regression.
But unless \(\beta_W = 0\), it has a different error term.
In other words, \(V \neq U\).
Moreover, this regression includes an intercept that is not in general zero.
Again we can verify this using our simulation example from above:
# Alternative FWL has different residuals than the long regression
v_hat <- resid(lm(y ~ x_tilde))
all.equal(u_hat, v_hat)
## [1] "Imply relative distinction: 0.4905107"
# Different FWL has a non-zero intercept
coef(lm(y ~ x_tilde))[1]
## (Intercept)
## 0.4878453
The Punchline
If your goal is merely to learn \(\beta_X\), then either version of the FWL theorem will do the trick, and the alternative version is simpler because it involves only one auxiliary regression instead of two.
But if you want to be sure that you end up with the same error term as in the original long regression, then you need to use the usual version of the FWL theorem.
This matters for inference, because the properties of the error term determine the standard errors of your estimates.
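To see this in our simulation, compare the reported standard errors on the \(X\) coefficient across the three regressions (se_x is just a small helper defined for this comparison). The first two should be nearly identical, differing only through lm()'s degrees-of-freedom correction, while the third should be noticeably larger because its error term \(V = \beta_W W + U\) has a larger variance than \(U\) in this simulation.
# Standard errors on the X coefficient: long regression and standard FWL
# agree (up to a degrees-of-freedom correction); alternative FWL does not
se_x <- function(fit, name) summary(fit)$coefficients[name, "Std. Error"]
c(
  "Long Regression" = se_x(lm(y ~ x + w), "x"),
  "FWL Standard" = se_x(lm(y_tilde ~ x_tilde - 1), "x_tilde"),
  "FWL Alternative" = se_x(lm(y ~ x_tilde), "x_tilde")
)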
- Fear not: we'll return to the first claim soon!↩︎