By the end of a typical introductory econometrics course, students have become accustomed to the idea of “controlling” for covariates by adding them to a linear regression model. But this familiarity can sometimes cause confusion when students later encounter regression adjustment, a widely-used approach to causal inference under the selection-on-observables assumption. While regression adjustment is simple in theory, the finer points of how and when to apply it in practice are much more subtle. One of these finer points is how to tell whether a particular covariate is a “good control” that will help us learn the causal effect of interest or a “bad control” that will only make things worse. Another, and the topic of today’s post, is how to actually implement regression adjustment once we’ve decided which covariates to adjust for.
The prerequisites for this post are a basic understanding of selection-on-observables and regression adjustment. If you’re a bit rusty on these points, you may find it helpful to look at the first half of my lecture slides along with this collection of short videos. If you’re still hungry for more after this, you might also enjoy this earlier post from econometrics.blog on common misunderstandings about the selection-on-observables assumption.
A Quick Review
Consider a binary treatment \(D\) and an observed outcome \(Y\). Let \((Y_0, Y_1)\) be the potential outcomes corresponding to the treatment \(D\). Our goal is to learn the average treatment effect \(\text{ATE} \equiv \mathbb{E}(Y_1 - Y_0)\) but, unless \(D\) is randomly assigned, using the difference of observed means \(\mathbb{E}(Y|D=1) - \mathbb{E}(Y|D=0)\) to estimate the ATE generally won’t work. The idea of selection-on-observables is that \(D\) might be “as good as randomly assigned” once we adjust for a set of observed covariates \(X\).
Regression adjustment relies on two assumptions: selection-on-observables and overlap. The selection-on-observables assumption says that learning \(D\) provides no additional information about the average values of \(Y_0\) and \(Y_1\), provided that we already know \(X\). This implies that we can learn the conditional average treatment effect (CATE) by comparing observed outcomes of the treated and untreated holding \(X\) fixed:
\[
\text{CATE}(x) \equiv \mathbb{E}[Y_1 - Y_0|X = x] = \mathbb{E}[Y|D=1, X = x] - \mathbb{E}[Y|D=0, X = x].
\]
For example: older people might be more likely to take a new drug but also more likely to die without it. If so, perhaps by comparing average outcomes holding age fixed we can learn the causal effect of the drug.
The overlap assumption says that, for any fixed value \(x\) of the covariates, there are some treated and some untreated people. This allows us to learn \(\text{CATE}(x)\) for every value of \(x\) in the population and average it using the law of iterated expectations to recover the ATE:
\[
\text{ATE} = \mathbb{E}[\text{CATE}(X)] = \mathbb{E}[\mathbb{E}(Y|D=1, X) - \mathbb{E}(Y|D=0, X)].
\]
In the drug example, this would correspond to computing the difference of means for each age group separately, and then averaging them using the share of people in each age group. Notice that this is only possible if there are some people who took the drug and some who didn’t in each age group. That’s exactly what the overlap assumption buys us. For example, if there were no senior citizens who didn’t take the drug, we wouldn’t be able to learn the effect of the drug for senior citizens.
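To make this concrete, here’s a minimal simulated sketch of the drug example. Everything below—the sample size, the covariate, and the treatment effects—is made up purely for illustration: we compute the difference of means within each age group, then average the group-specific differences using the group shares.
library(tidyverse)

set.seed(1)
n_obs <- 10000
old   <- rbinom(n_obs, 1, 0.4)                        # 1 = senior citizen
drug  <- rbinom(n_obs, 1, ifelse(old == 1, 0.7, 0.3)) # seniors more likely to take the drug
y     <- -10 * old + 5 * drug + 3 * drug * old + rnorm(n_obs) # effect is 5 if old = 0, 8 if old = 1
sim   <- tibble(old, drug, y)

sim |>
  group_by(old) |>
  summarize(cate  = mean(y[drug == 1]) - mean(y[drug == 0]),
            share = n() / nrow(sim)) |>
  summarize(ATE = sum(cate * share)) # close to the true ATE of 5 * 0.6 + 8 * 0.4 = 6.2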
Which regression should we run?
So suppose that we’ve discovered a set of covariates (X) that fulfill the required assumptions. How ought to we truly perform regression adjustment? To reply this query, let’s begin by making issues a bit less complicated. Suppose that (X) is a single binary covariate. On the finish of the publish, we’ll return to the final case. Since (X) and (D) are each binary, we are able to write the conditional imply perform of (Y) given ((D, X)) as
\[
\mathbb{E}(Y|D, X) = \beta_0 + \beta_1 D + \beta_2 X + \beta_3 DX.
\]
Since the true conditional mean function is linear, a linear regression of \(Y\) on \(D\), \(X\), \(DX\), and an intercept will recover \((\beta_0, \beta_1, \beta_2, \beta_3)\).
But what on earth do these coefficients actually mean?! Substituting all possible values of \((D, X)\),
\[
\begin{align*}
\mathbb{E}(Y|D=0, X=0) &= \beta_0 \\
\mathbb{E}(Y|D=1, X=0) &= \beta_0 + \beta_1 \\
\mathbb{E}(Y|D=0, X=1) &= \beta_0 + \beta_2 \\
\mathbb{E}(Y|D=1, X=1) &= \beta_0 + \beta_1 + \beta_2 + \beta_3.
\end{align*}
\]
And so, after a bit of re-arranging,
\[
\begin{align*}
\beta_0 &= \mathbb{E}(Y|D=0, X=0) \\
\beta_1 &= \mathbb{E}(Y|D=1, X=0) - \mathbb{E}(Y|D=0, X=0) \\
\beta_2 &= \mathbb{E}(Y|D=0, X=1) - \mathbb{E}(Y|D=0, X=0) \\
\beta_3 &= \mathbb{E}(Y|D=1, X=1) - \mathbb{E}(Y|D=1, X=0) - \mathbb{E}(Y|D=0, X=1) + \mathbb{E}(Y|D=0, X=0).
\end{align*}
\]
What a mess! Alas, we’ll need a few more steps of algebra to figure out how these relate to the ATE. Notice that \(\beta_1\) equals the CATE when \(X=0\) since
\[
\begin{align*}
\text{CATE}(0) &\equiv \mathbb{E}(Y|D=1, X=0) - \mathbb{E}(Y|D=0, X=0) \\
&= (\beta_0 + \beta_1) - \beta_0 \\
&= \beta_1.
\end{align*}
\]
Proceeding similarly for the CATE when \(X = 1\), we find that
\[
\begin{align*}
\text{CATE}(1) &\equiv \mathbb{E}(Y|D=1, X=1) - \mathbb{E}(Y|D=0, X=1) \\
&= (\beta_0 + \beta_1 + \beta_2 + \beta_3) - (\beta_0 + \beta_2) \\
&= \beta_1 + \beta_3.
\end{align*}
\]
Now that we have expressions for each of the two conditional average treatment effects, corresponding to each of the values that \(X\) can take, we’re finally ready to compute the ATE:
\[
\begin{align*}
\text{ATE} &= \mathbb{E}[\text{CATE}(X)] \\
&= \text{CATE}(0) \times \mathbb{P}(X = 0) + \text{CATE}(1) \times \mathbb{P}(X = 1) \\
&= \beta_1 \left[1 - \mathbb{P}(X = 1)\right] + (\beta_1 + \beta_3)\, \mathbb{P}(X = 1) \\
&= \beta_1 + \beta_3 p
\end{align*}
\]
where we define the shorthand \(p \equiv \mathbb{P}(X=1)\). So to compute the ATE, we need to know the coefficients \(\beta_1\) and \(\beta_3\) from the regression of \(Y\) on \(D\), \(X\), and \(DX\), in addition to the share of people with \(X = 1\). Needless to say, your favorite regression package will not spit out the ATE for you if you run the regression from above. And it certainly won’t spit out the standard error! So what can we do besides computing everything by hand?
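In the simulated drug example from the quick review (again, purely illustrative numbers), the manual combination looks like this: the saturated regression hands you \(\beta_1\) and \(\beta_3\), but you have to assemble \(\beta_1 + \beta_3 p\) yourself, and nothing in the default output gives you its standard error.
# Saturated regression on the simulated data from the quick review
fit <- lm(y ~ drug * old, data = sim)
b   <- coef(fit)
p   <- mean(sim$old)
unname(b['drug'] + b['drug:old'] * p) # beta_1 + beta_3 * p: close to the true ATE of 6.2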
Two Simple Solutions
It turns out that there are two easy ways to get your favorite software package to spit out the ATE for you along with its standard error. Each involves a slight re-parameterization of the conditional mean expression from above. The first one replaces \(DX\) with \(D\tilde{X}\) where \(\tilde{X} \equiv X - p\) and \(p \equiv \mathbb{P}(X=1)\). To see why this works, notice that
\[
\begin{align*}
\mathbb{E}(Y|D, X) &= \beta_0 + \beta_1 D + \beta_2 X + \beta_3 DX \\
&= \beta_0 + \beta_1 D + \beta_2 X + \beta_3 D(X - p) + \beta_3 pD \\
&= \beta_0 + (\beta_1 + \beta_3 p) D + \beta_2 X + \beta_3 D\tilde{X} \\
&= \beta_0 + \text{ATE}\times D + \beta_2 X + \beta_3 D\tilde{X}.
\end{align*}
\]
This works perfectly well, but there’s something about it that offends my sense of order: why subtract the mean from \(X\) in one place but not in another? If you share my aesthetic sensibilities, then you should feel free to replace that offending \(X\) with another \(\tilde{X}\) since
\[
\begin{align*}
\mathbb{E}(Y|D, X) &= \beta_0 + \text{ATE}\times D + \beta_2 X + \beta_3 D\tilde{X} \\
&= \beta_0 + \text{ATE}\times D + \beta_2 (X-p) + p \beta_2 + \beta_3 D\tilde{X} \\
&= (\beta_0 + p \beta_2) + \text{ATE}\times D + \beta_2 \tilde{X} + \beta_3 D\tilde{X} \\
&= \tilde{\beta}_0 + \text{ATE}\times D + \beta_2 \tilde{X} + \beta_3 D\tilde{X}
\end{align*}
\]
where we define \(\tilde{\beta}_0 \equiv \beta_0 + p \beta_2\). Notice that the only coefficient that changes is the intercept, and we’re usually not interested in this anyway!
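Here’s what the two re-parameterizations look like in the simulated example from above (a sketch on made-up data, not part of the empirical analysis below): both put the ATE directly on the coefficient of drug.
# De-mean the covariate, then interact the treatment with the de-meaned version
sim <- sim |>
  mutate(old_tilde = old - mean(old))
coef(lm(y ~ drug + old + drug:old_tilde, data = sim))['drug'] # first re-parameterization
coef(lm(y ~ drug * old_tilde, data = sim))['drug']            # second re-parameterization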
What if we ignore the interaction?
Wait a minute, you may be ready to object, when researchers claim to be “adjusting” or “controlling” for \(X\) in practice, they very rarely include an interaction term between \(D\) and \(X\) in their regression! Instead, they simply regress \(Y\) on \(D\) and \(X\). What can we say about this approach? To answer this question, let’s continue with our example from above and define the following population linear regression model:
\[
Y = \alpha_0 + \alpha_1 D + \alpha_2 X + V
\]
where \(V\) is the population linear regression error term so that, by construction, \(\mathbb{E}(V) = \mathbb{E}(DV) = \mathbb{E}(XV) = 0\). Notice that I’ve called the coefficients in this regression \(\alpha\) rather than \(\beta\). That’s because they will not generally coincide with those of the conditional mean function from above, namely \(\mathbb{E}(Y|D, X) = \beta_0 + \beta_1 D + \beta_2 X + \beta_3 DX\). In particular, the regression of \(Y\) on \(D\) and \(X\) without an interaction will only coincide with the true conditional mean function if \(\beta_3 = 0\).
So what, if anything, can we say about \(\alpha_1\) in relation to the ATE? By the Frisch–Waugh–Lovell (FWL) theorem, we have
\[
\alpha_1 = \frac{\text{Cov}(Y, \tilde{D})}{\text{Var}(\tilde{D})}, \quad
D = \gamma_0 + \gamma_1 X + \tilde{D}, \quad \mathbb{E}(\tilde{D}) = \mathbb{E}(X\tilde{D}) = 0
\]
where \(\tilde{D}\) is the error term from a population linear regression of \(D\) on \(X\). In words, the way that a regression of \(Y\) on \(D\) and \(X\) “adjusts” for \(X\) is by first regressing \(D\) on \(X\), taking the part of \(D\) that is uncorrelated with \(X\), namely \(\tilde{D}\), and regressing \(Y\) on this alone. As shown in the appendix to this post,
\[
\frac{\text{Cov}(Y,\tilde{D})}{\text{Var}(\tilde{D})} = \frac{\mathbb{E}[\text{Var}(D|X)(\beta_1 + \beta_3 X)]}{\mathbb{E}[\text{Var}(D|X)]}
\]
in this example. And since \(\text{CATE}(X) = \beta_1 + \beta_3 X\), it follows that
\[
\alpha_1 = \frac{\mathbb{E}[\text{Var}(D|X) \cdot \text{CATE}(X)]}{\mathbb{E}[\text{Var}(D|X)]}.
\]
The one factor that’s random on this expression is (X). Each expectations contain averaging over its distribution. To make this clearer, outline the propensity rating (pi(x) equiv mathbb{P}(D=1|X=x)). Utilizing this notation,
\[
\begin{align*}
\text{Var}(D|X) &= \mathbb{E}(D^2|X) - \mathbb{E}(D|X)^2 = \mathbb{E}(D|X) - \mathbb{E}(D|X)^2 \\
&= \pi(X) - \pi(X)^2 = \pi(X)[1 - \pi(X)]
\end{align*}
\]
since \(D\) is binary. Defining \(p(x) \equiv \mathbb{P}(X = x)\), we see that
\[
\begin{align*}
\alpha_1 &= \frac{\mathbb{E}\left[\pi(X)\{1 - \pi(X)\}\cdot \text{CATE}(X)\right]}{\mathbb{E}\left[\pi(X)\{1 - \pi(X)\}\right]} \\
&= \frac{p(0) \cdot \pi(0)[1 - \pi(0)]\cdot \text{CATE}(0) + p(1) \cdot \pi(1)[1 - \pi(1)]\cdot \text{CATE}(1)}{p(0) \cdot \pi(0)[1 - \pi(0)] + p(1) \cdot \pi(1)[1 - \pi(1)]} \\
&= w(0) \cdot \text{CATE}(0) + w(1) \cdot \text{CATE}(1)
\end{align*}
\]
where we introduce the shorthand
\[
w(x) \equiv \frac{p(x) \cdot \pi(x)[1 - \pi(x)]}{\sum_{\text{all } k} p(k) \cdot \pi(k)[1 - \pi(k)]}.
\]
In other words, the coefficient on \(D\) in a regression of \(Y\) on \(D\) and \(X\) excluding the interaction term \(DX\) gives a weighted average of the conditional average treatment effects for the different values of \(X\). The weights are between zero and one and sum to one. Because \(w(x)\) is increasing in \(p(x)\), values of \(X\) that are more common receive more weight, just as they do in the ATE. But since \(w(x)\) is also increasing in \(\pi(x)[1 - \pi(x)]\), values of \(X\) for which \(\pi(x)\) is closer to 0.5 receive more weight, unlike in the ATE. As such, we could describe \(\alpha_1\) as a variance-weighted average of the conditional average treatment effects.
In general, the weighted average \(\alpha_1\) will not coincide with the ATE, although there are two special cases in which it will. The first is when \(\text{CATE}(X)\) does not depend on \(X\), i.e. treatment effects are homogeneous. In this case \(\beta_3 = 0\), so there is no interaction term in the conditional mean function! The second is when \(\pi(X)\) does not depend on \(X\), in which case the probability of treatment does not depend on \(X\), so we didn’t need to adjust for \(X\) in the first place!
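If you’d like to see these claims in action before the empirical example below, here’s a quick check on the simulated data: the no-interaction coefficient from lm(), the FWL partialling-out coefficient, and the hand-computed variance-weighted average of CATEs should all agree with each other, and none of them equals the true ATE of 6.2.
# (1) coefficient on drug in the regression that omits the interaction
alpha_1 <- coef(lm(y ~ drug + old, data = sim))['drug']

# (2) FWL: residualize drug with respect to old, then regress y on the residual alone
drug_tilde <- resid(lm(drug ~ old, data = sim))
fwl <- coef(lm(y ~ drug_tilde, data = sim))['drug_tilde']

# (3) variance-weighted average of the group-specific CATEs
wavg <- sim |>
  group_by(old) |>
  summarize(cate = mean(y[drug == 1]) - mean(y[drug == 0]),
            w    = n() * mean(drug) * (1 - mean(drug))) |>
  summarize(wavg = sum(w * cate) / sum(w)) |>
  pull(wavg)

c(alpha_1, fwl, wavg) # all three agree; none equals the ATE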
What about the general case?
All of the above derivations assumed that \(X\) is one-dimensional and binary. So how much of this still applies more generally? First, if \(X\) is a vector of binary variables representing categories like sex, race, etc., everything goes through exactly as above. All that changes is that \(\beta_2\), \(\beta_3\), and \(p = \mathbb{E}(X)\) become vectors. The coefficient on \(D\) in a regression of \(Y\) on \(D\), \(X\), and the interaction \(D \tilde{X}\) is still the ATE, and the coefficient on \(D\) in a regression that excludes the interaction term is still a weighted average of CATEs that does not in general equal the ATE.
So whenever the covariates you need to adjust for are categorical, this post has you covered. But what if some of our covariates are continuous? In this case things are a bit more complicated, but all of the results from above still go through if we’re willing to assume that the conditional mean functions \(\mathbb{E}(Y|D=0, X)\), \(\mathbb{E}(Y|D=1, X)\), and \(\mathbb{E}(D|X)\) are linear in \(X\). This is undoubtedly a strong assumption, but perhaps not as strong as it seems. For example, \(X\) could include logs, squares, or other functions of some underlying continuous covariates, e.g. age or years of experience. In this case, the weighted-average interpretation of the coefficient on \(D\) in a regression that excludes the interaction term still holds, but now involves an integral rather than a sum.
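As a hedged sketch of what this looks like in code, suppose we also had a continuous covariate—call it age, which I’m inventing here purely to illustrate the syntax. Under the linearity assumption just described, interacting the treatment with each de-meaned covariate again puts the ATE on the treatment coefficient:
# Add a made-up continuous covariate to the simulated data (illustration only)
sim <- sim |>
  mutate(age       = rnorm(n(), mean = 50, sd = 10),
         age_tilde = age - mean(age),
         old_tilde = old - mean(old))

# Interact the treatment with every de-meaned covariate: the coefficient on drug
# is the ATE, provided the conditional mean functions really are linear in (old, age)
lm(y ~ drug * (old_tilde + age_tilde), data = sim)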
Does it really work? An Empirical Example
However maybe you don’t belief my algebra. To assuage your fears, let’s take this to the information! The next instance relies on Peisakhin & Rozenas (2018) – Electoral Results of Biased Media: Russian Tv in Ukraine. I’ve tailored it from Llaudet and Imai’s unbelievable e-book Knowledge Evaluation for Social Science, the proper vacation or birthday present for the budding social scientist in your life.
Right here’s a little bit of background. Within the lead-up to Ukraine’s 2014 parliamentary election, Russian state-controlled TV mounted a fierce media marketing campaign in opposition to the Ukrainian authorities. Ukrainians who lived close to the border with Russia may doubtlessly obtain Russian TV indicators. Did receiving these indicators trigger them to help pro-Russia events within the election? To reply this query, we’ll use a dataset known as precincts
that comprises mixture election ends in precincts near the Russian border:
library(tidyverse)
precincts <- read_csv('https://ditraglia.com/data/UA_precincts.csv')
Each row of precincts is an electoral precinct in Ukraine that is close to the Russian border. The columns pro_russian and prior_pro_russian give the vote share (in percentage points) of pro-Russian parties in the 2014 and 2012 Ukrainian elections, respectively. Our outcome of interest will be the change in pro-Russian vote share between the two elections, so we first need to construct this:
precincts <- precincts |>
  mutate(change = pro_russian - prior_pro_russian) |>
  select(-pro_russian, -prior_pro_russian)
precincts
## # A tibble: 3,589 × 3
##    russian_tv within_25km change
##         <dbl>       <dbl>  <dbl>
##  1          0           1  -22.4
##  2          0           0  -34.5
##  3          1           1  -18.8
##  4          0           1  -12.2
##  5          0           0  -27.7
##  6          1           0  -44.2
##  7          0           0  -34.5
##  8          0           0  -29.5
##  9          0           0  -24.1
## 10          0           0  -25.4
## # ℹ 3,579 more rows
The column russian_tv equals 1 if the precinct has Russian TV reception. This is our treatment variable \(D\). But crucially, it is not randomly assigned. While it’s true that there is some natural variation in signal strength that is plausibly independent of other factors related to voting behavior, on average precincts closer to Russia are more likely to receive a signal. So suppose, for the sake of argument, that conditional on proximity to the Russian border, russian_tv is as good as randomly assigned. This is the selection-on-observables assumption. There’s no way to check it using our data alone; it’s something we need to justify based on our understanding of the world and the substantive problem at hand.
As our measure of proximity, we’ll use the dummy variable within_25km, which equals 1 if the precinct is within 25km of the Russian border. This is our \(X\) variable. The overlap assumption requires that there are some precincts with Russian TV reception and some without in each distance category. This is an assumption that we can check using the data, so let’s do so before proceeding:
precincts |>
  group_by(within_25km) |>
  summarize(`share with Russian TV` = mean(russian_tv))
## # A tibble: 2 × 2
##   within_25km `share with Russian TV`
##         <dbl>                   <dbl>
## 1           0                   0.105
## 2           1                   0.692
We see that just over 10% of precincts that are not within 25km of the border have Russian TV reception, while just under 70% of those within 25km do. Neither of these values is close to 0% or 100%, so this dataset comfortably satisfies the overlap assumption.
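Another quick way to check overlap, if you prefer raw counts to proportions, is to tabulate treated and untreated precincts within each distance category (output omitted here):
precincts |>
  count(within_25km, russian_tv)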
To avoid taxing your memory about which variable is which, for the rest of this exercise I’ll create a new dataset that renames the columns of precincts to D, X, and Y for the treatment, covariate, and outcome, respectively.
dat <- precincts |>
  rename(D = russian_tv, X = within_25km, Y = change)
Computing the ATE the Hard Way
Now we’re able to confirm the calculations from above. First we’ll compute the ATE “the exhausting method”, in different phrases by computing every of the CATEs individually and averaging them. Warning: there’s a good bit of dplyr
to return!
# Step 1: compute the mean of Y for each combination of (D, X)
means <- dat |>
  group_by(D, X) |>
  summarize(Ybar = mean(Y))
means # display the results
## # A tibble: 4 × 3
## # Groups:   D [2]
##       D     X  Ybar
##   <dbl> <dbl> <dbl>
## 1     0     0 -24.6
## 2     0     1 -34.2
## 3     1     0 -13.0
## 4     1     1 -32.2
# Step 2: reshape so the means of Y|D=0,X and Y|D=1,X are in separate cols
means <- means |>
  pivot_wider(names_from = D,
              values_from = Ybar,
              names_prefix = 'Ybar')
means # display the results
## # A tibble: 2 × 3
##       X Ybar0 Ybar1
##   <dbl> <dbl> <dbl>
## 1     0 -24.6 -13.0
## 2     1 -34.2 -32.2
# Step 3: attach a column with the proportion of X = 0 and X = 1
regression_adjustment <- dat |>
  group_by(X) |>
  summarize(count = n()) |>
  mutate(p = count / sum(count)) |>
  select(-count) |>
  left_join(means) |>
  mutate(CATE = Ybar1 - Ybar0) # compute the CATEs
regression_adjustment # display the results
## # A tibble: 2 × 5
##       X     p Ybar0 Ybar1  CATE
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     0 0.849 -24.6 -13.0 11.6
## 2     1 0.151 -34.2 -32.2  2.01
# Step 4: finally, compute the ATE!
ATE <- regression_adjustment |>
  mutate(out = (Ybar1 - Ybar0) * p) |>
  pull(out) |>
  sum()
ATE
## [1] 10.12062
Computing the ATE the Easy Way
And now the easy way, using the two regressions described above:
# Construct Xtilde = X - mean(X)
dat <- dat |>
  mutate(Xtilde = X - mean(X))
# Regression of Y on D, X, and D:Xtilde
lm(Y ~ D + X + D:Xtilde, dat)
##
## Call:
## lm(formula = Y ~ D + X + D:Xtilde, data = dat)
##
## Coefficients:
## (Intercept)            D            X     D:Xtilde
##     -24.591       10.121       -9.604       -9.562
# Regression of Y on D, Xtilde, and D:Xtilde
lm(Y ~ D * Xtilde, dat)
##
## Call:
## lm(formula = Y ~ D * Xtilde, data = dat)
##
## Coefficients:
## (Intercept)            D       Xtilde     D:Xtilde
##     -26.045       10.121       -9.604       -9.562
Everything works as it should! The coefficient on D in each regression equals the ATE we computed by hand, namely 10.121, and the two regressions agree with each other except for the intercept.
Standard Errors
The nice thing about computing the ATE by running a regression rather than computing it “by hand” is that we can easily obtain valid standard errors, confidence intervals, and p-values if desired. For example, if you wanted “robust” standard errors for the ATE, you could simply use lm_robust() from the estimatr package as follows:
library(estimatr)
library(broom)
lm_robust(Y ~ D * Xtilde, dat) |>
  tidy() |>
  filter(term == 'D') |>
  select(-df, -outcome)
##   term estimate std.error statistic      p.value conf.low conf.high
## 1    D 10.12062 0.4838613  20.91636 9.315921e-92 9.171946  11.06929
Getting these “by hand” would have been much more work!
There is one subtle point that I should mention. I’ve heard it said on numerous occasions that the above standard error calculation is “not quite right” since we estimated the mean of X and used it to re-center X in the regression. Surely we should account for the sampling variability of \(\bar{X}\) around its mean, the argument goes.
Maybe I’m about to get blacklisted by the Econometrician’s alliance for saying this, however I’m not satisfied. The same old mind-set about inference for regression is conditional on the regressors, on this case (X) and (D). Seen from this attitude, (bar{X}) isn’t random. Now, after all, in case you choose to see the world by means of finite-population design-based lenses, (D) is positively random. However on this case it’s the solely factor that’s random. The design-based view situates randomness completely within the remedy task mechanism. Beneath this view, for the reason that items in our dataset usually are not thought of as having been drawn from a hypothetical super-population, any abstract statistic of their covariates (X) is fastened. So once more, (bar{X}) isn’t random and doesn’t contribute any uncertainty.
Update: I originally concluded this section with “as far as I can see, it’s perfectly reasonable to use the sample mean of \(X\) to re-center \(X\) in the regression” but apoorva.lal pointed out that this elides an important distinction. The key is that whether \(\bar{X}\) is random or not depends on the question you’re interested in. If you want inference for the ATE computed using the population distribution of \(X\), then \(\bar{X}\) is random and you should account for its variability. But if you’re interested in the ATE computed using the observed values of \(X\) in the sample, then \(\bar{X}\) is fixed and you shouldn’t:
Point about whether Xbar is random depends on whether you’re interested in SATE v PATE right? In any case, it’s surprisingly easy to propagate that uncertainty forward with (what else?) GMM (earlier posts in the thread discuss the recentering point)https://t.co/3GXfTeF9DW
— apoorva.lal (@Apoorva__Lal) August 2, 2024
This agrees with my logic about conditioning on \(X\) and the design-based perspective, but it’s a much clearer way of making the relevant distinction, so thanks for pointing it out!
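If you do want inference that treats \(\bar{X}\) as random—the population-ATE flavor—one rough-and-ready option, sketched below as my own illustration rather than anything from the thread, is a nonparametric bootstrap that re-estimates the mean of X inside every replication:
set.seed(1234)
boot_ate <- replicate(2000, {
  idx <- sample(nrow(dat), replace = TRUE)
  b <- dat[idx, ]
  b$Xtilde <- b$X - mean(b$X) # re-centering is re-done within each bootstrap draw
  coef(lm(Y ~ D * Xtilde, data = b))['D']
})
sd(boot_ate) # standard error that accounts for estimating the mean of X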
Excluding the Interaction
Lastly, we’ll confirm the derivations from above for (alpha_1) within the regression that excludes an interplay time period. First we’ll compute the “variance weighted common” of CATEs by hand and examine that it doesn’t agree with the ATE:
# Compute the propensity score pi(X)
pscore <- dat |>
  group_by(X) |>
  summarize(pi = mean(D))
# Compute the weights w
regression_adjustment <- left_join(regression_adjustment, pscore) |>
  mutate(w = p * pi * (1 - pi) / sum(p * pi * (1 - pi)))
regression_adjustment # display the results
## # A tibble: 2 × 7
##       X     p Ybar0 Ybar1  CATE    pi     w
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     0 0.849 -24.6 -13.0 11.6  0.105 0.713
## 2     1 0.151 -34.2 -32.2  2.01 0.692 0.287
# Compute the variance-weighted average of the CATEs
wCATE <- regression_adjustment |>
  summarize(wCATE = sum(w * CATE)) |>
  pull(wCATE)
c(wCATE = wCATE, ATE = ATE)
##     wCATE       ATE
##  8.822285 10.120617
Lastly, we’ll evaluate this hand calculation to the outcomes of a regression of (Y) on (D) and (X) with out an interplay:
lm(Y ~ D + X, dat)
##
## Call:
## lm(formula = Y ~ D + X, data = dat)
##
## Coefficients:
## (Intercept)            D            X
##     -24.302        8.822      -14.614
As promised, the coefficient on \(D\) equals the variance-weighted average of CATEs that we computed by hand, namely 8.822, which does not equal the ATE, 10.121. Here the CATE for \(X=1\) receives more weight than it does in the ATE, pulling the coefficient on \(D\) away from the ATE and towards the (smaller) CATE for \(X=1\).
Conclusion
I hope this post has convinced you that regression adjustment isn’t simply a matter of tossing a set of covariates into your regression! In general, the coefficient on \(D\) in a regression of \(Y\) on \(X\) and \(D\) will not equal the ATE of \(D\). Instead it will be a weighted average of CATEs. To obtain the ATE, we need to include an interaction between \(D\) and \(X\). The easiest way to get your favorite statistical software package to calculate this for you, along with an appropriate standard error, is by de-meaning \(X\) before including the interaction. And don’t forget that causal inference always requires untestable assumptions, in this case the selection-on-observables assumption. While implementation details are important, getting them right won’t make any difference if you’re not adjusting for the right covariates in the first place.
Appendix: The Missing Algebra
This section provides the algebra needed to justify the expression for \(\alpha_1\) from a regression that omits the interaction between \(D\) and \(X\). Specifically, we’ll show that
\[
\frac{\text{Cov}(Y,\tilde{D})}{\text{Var}(\tilde{D})} = \frac{\mathbb{E}[\text{Var}(D|X)(\beta_1 + \beta_3 X)]}{\mathbb{E}[\text{Var}(D|X)]}
\]
where \(\tilde{D}\) is the error term from a population linear regression of \(D\) on \(X\), namely \(D = \gamma_0 + \gamma_1 X + \tilde{D}\) so that \(\mathbb{E}(\tilde{D}) = \mathbb{E}(X\tilde{D}) = 0\) by construction. The proof isn’t too difficult, but it’s a bit tedious, so you might prefer to skip it on a first reading. Still here? Great! Let’s dive into the algebra.
We need to calculate \(\text{Cov}(Y, \tilde{D})\) and \(\text{Var}(\tilde{D})\). A nice way to carry out this calculation is by applying the law of total covariance. You may have heard of the law of total variance, but in my opinion the law of total covariance is more useful. Just as you can deduce all the properties of variance from the properties of covariance, using \(\text{Cov}(W, W) = \text{Var}(W)\), you can deduce the law of total variance from the law of total covariance! In the present example, the law of total covariance allows us to write
\[
\text{Cov}(Y, \tilde{D}) = \mathbb{E}[\text{Cov}(Y, \tilde{D}|X)] + \text{Cov}[\mathbb{E}(Y|X), \mathbb{E}(\tilde{D}|X)].
\]
If this looks intimidating, don’t worry: we’ll break it down piece by piece. The second term on the RHS is the covariance between two random variables: \(\mathbb{E}(Y|X)\) and \(\mathbb{E}(\tilde{D}|X)\). We already have an equation for \(\tilde{D}\), namely the population linear regression of \(D\) on \(X\), so let’s use it to simplify \(\mathbb{E}(\tilde{D}|X)\):
\[
\mathbb{E}(\tilde{D}|X) = \mathbb{E}(D - \gamma_0 - \gamma_1 X|X) = \mathbb{E}(D|X) - \gamma_0 - \gamma_1 X.
\]
Right here’s the important thing factor to notice: since (D) is binary, the inhabitants linear regression of (D) on (X) is equivalent to the conditional imply of (D) given (X). This tells us that (mathbb{E}(tilde{D}|X)=0). For the reason that covariance of something with a relentless is zero, the second time period on the RHS of the legislation of whole covariance drops out, leaving us with
\[
\text{Cov}(Y, \tilde{D}) = \mathbb{E}[\text{Cov}(Y, \tilde{D}|X)] = \mathbb{E}[\text{Cov}(Y, D - \gamma_0 - \gamma_1 X | X)].
\]
Now let’s cope with the conditional covariance contained in the expectation. Keep in mind that conditioning on (X) is equal to saying “suppose that (X) have been recognized”. Something that’s recognized is fixed, not random. So we are able to deal with each (X) and (delta) as constants and apply the standard guidelines for covariance to acquire
\[
\text{Cov}(Y, D - \gamma_0 - \gamma_1 X | X) = \text{Cov}(Y, D|X).
\]
Therefore, \(\text{Cov}(Y, \tilde{D}) = \mathbb{E}[\text{Cov}(Y, D|X)]\). A very similar calculation using the law of total variance gives
\[
\begin{align*}
\text{Var}(\tilde{D}) &= \mathbb{E}[\text{Var}(\tilde{D}|X)] + \text{Var}[\mathbb{E}(\tilde{D}|X)] = \mathbb{E}[\text{Var}(\tilde{D}|X)] \\
&= \mathbb{E}[\text{Var}(D - \gamma_0 - \gamma_1 X| X)] \\
&= \mathbb{E}[\text{Var}(D|X)]
\end{align*}
\]
since \(\mathbb{E}(\tilde{D}|X) = 0\) and the variance of any constant is simply zero. So, with the help of the laws of total covariance and variance, we’ve established that
\[
\alpha_1 \equiv \frac{\text{Cov}(Y, \tilde{D})}{\text{Var}(\tilde{D})} = \frac{\mathbb{E}[\text{Cov}(Y, D|X)]}{\mathbb{E}[\text{Var}(D|X)]}
\]
in this example. Note that this does not hold in general: it relies on the fact that \(\mathbb{E}(\tilde{D}|X)=0\), which holds in our example because \(\mathbb{E}(D|X) = \gamma_0 + \gamma_1 X\), given that \(X\) is binary.
We’re very almost completed. All that is still is to simplify the numerator. To do that, we’ll use the equality
\[
Y = \beta_0 + \beta_1 D + \beta_2 X + \beta_3 DX + U
\]
where \(U \equiv Y - \mathbb{E}(Y|D, X)\) satisfies \(\mathbb{E}(U|D,X) = 0\) by construction. This allows us to write
\[
\begin{align*}
\text{Cov}(Y, D|X) &= \text{Cov}(\beta_0 + \beta_1 D + \beta_2 X + \beta_3 DX + U, D|X) \\
&= \beta_1 \text{Cov}(D, D|X) + \beta_3 \text{Cov}(DX, D|X) + \text{Cov}(U,D|X) \\
&= \beta_1 \text{Var}(D|X) + \beta_3 X \cdot \text{Var}(D|X) + \text{Cov}(U,D|X) \\
&= \text{Var}(D|X)(\beta_1 + \beta_3 X) + \text{Cov}(U, D| X).
\end{align*}
\]
So what about that pesky \(\text{Cov}(U,D|X)\) term? By the law of iterated expectations this turns out to equal zero, since
\[
\begin{align*}
\text{Cov}(U,D|X) &= \mathbb{E}(DU|X) - \mathbb{E}(D|X)\, \mathbb{E}(U|X) \\
&= \mathbb{E}[D\,\mathbb{E}(U|D,X) \mid X] - \mathbb{E}(D|X)\, \mathbb{E}[\mathbb{E}(U|D,X) \mid X]
\end{align*}
\]
and, again, \(\mathbb{E}(U|D,X) = 0\) by construction. So we’re left with
\[
\alpha_1 = \frac{\mathbb{E}[\text{Cov}(Y, D|X)]}{\mathbb{E}[\text{Var}(D|X)]} = \frac{\mathbb{E}[\text{Var}(D|X)(\beta_1 + \beta_3 X)]}{\mathbb{E}[\text{Var}(D|X)]}.
\]