Tuesday, January 13, 2026

Estimation under omitted confounders, endogeneity, omitted variable bias, and related problems


Preliminary ideas

Estimating causal relationships from data is one of the fundamental endeavors of researchers, but causality is elusive. In the presence of omitted confounders, endogeneity, omitted variables, or a misspecified model, estimates of predicted values and effects of interest are inconsistent; causality is obscured.

A controlled experiment to estimate causal relations is an alternative. Yet conducting a controlled experiment may be infeasible. Policy makers cannot randomize taxation, for example. In the absence of experimental data, an option is to use instrumental variables or a control-function approach.

Stata has many built-in estimators to implement these potential solutions and tools to construct estimators for situations that are not covered by built-in estimators. Below I illustrate both possibilities for a linear model and, in a later post, will discuss nonlinear models.

Linear model example

I start with a linear model with two covariates, \(x_1\) and \(x_2\). In this model, \(x_1\) is unrelated to the error term, \(\varepsilon\); this is given by the condition \(E\left(x_1\varepsilon\right) = 0\). \(x_1\) is exogenous. \(x_2\) is related to the error term; this is given by \(E\left(x_2\varepsilon\right) \neq 0\). \(x_2\) is endogenous. The model is given by

\begin{eqnarray*}
y &=& \beta_0 + x_1\beta_1 + x_2\beta_2 + \varepsilon \\
E\left(x_1\varepsilon\right) &=& 0 \\
E\left(x_2\varepsilon\right) &\neq& 0
\end{eqnarray*}

The fact that \(x_2\) is related to the unobservable component \(\varepsilon\) means that fitting this model using linear regression yields inconsistent parameter estimates.

One option is to use a two-stage least-squares estimator. For two-stage least squares to be valid, I need to correctly specify a model for \(x_2\) that includes a variable, \(z_1\), that is unrelated to the unobservables of the outcome of interest, and \(x_1\). We also need \(z_1\) and \(x_1\) to be unrelated to the unobservable of the outcome, \(\varepsilon\), and to the unobservable of the equation for \(x_2\). These conditions are expressed by

\begin{eqnarray}
x_2 &=& \Pi_0 + z_1\Pi_1 + x_1\Pi_2 + \nu \label{instrument}\tag{1} \\
E\left(z'\varepsilon\right) &=& E\left(z'\nu\right) = 0 \label{scores} \tag{2} \\
z &\equiv& \left[z_1 \quad x_1 \right] \notag
\end{eqnarray}

The relationship in \eqref{instrument} implies that \(x_2\) can be split into two components: one that is related to \(\varepsilon\), and is therefore the crux of the problem, \(\nu\), and another that is unrelated to \(\varepsilon\), \(\Pi_0 + z_1\Pi_1 + x_1\Pi_2\). The key to two-stage least squares is to get a consistent estimator of the latter component of \(x_2\).

Below I simulate data that satisfy the assumptions above.


set obs 1000
set seed 111
generate e  = rchi2(2) - 2            // outcome error, mean 0
generate v  = 0.5*e + rchi2(1) - 1    // first-stage error, correlated with e
generate x1 = rchi2(1) - 1            // exogenous covariate
generate z1 = rchi2(1) - 2            // instrument
generate x2 = 2*(1 - x1 - z1) + v     // endogenous covariate
generate y  = 2*(1 - x1 - x2) + e     // outcome
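For readers who want to verify the logic outside Stata, here is a minimal numpy sketch of the same design with two-stage least squares done by hand. This is an illustration under the assumption that numpy's chi-squared generator and seeding differ from Stata's, so the draws, and hence the estimates, approximate rather than reproduce the tables below.

```python
import numpy as np

# Re-create the design above with numpy draws (numpy's RNG differs from
# Stata's, so results match the Stata tables only approximately).
rng = np.random.default_rng(111)
n = 1000
e  = rng.chisquare(2, n) - 2             # outcome error, mean 0
v  = 0.5 * e + rng.chisquare(1, n) - 1   # first-stage error, correlated with e
x1 = rng.chisquare(1, n) - 1             # exogenous covariate
z1 = rng.chisquare(1, n) - 2             # instrument
x2 = 2 * (1 - x1 - z1) + v               # endogenous covariate
y  = 2 * (1 - x1 - x2) + e               # outcome

# Naive OLS of y on (x2, x1, constant) is inconsistent for the x2 coefficient
X = np.column_stack([x2, x1, np.ones(n)])
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Two-stage least squares by hand:
# stage 1: project x2 on the exogenous variables (x1, z1, constant)
Z = np.column_stack([x1, z1, np.ones(n)])
x2_hat = Z @ np.linalg.lstsq(Z, x2, rcond=None)[0]
# stage 2: regress y on the fitted x2, x1, and a constant
b_2sls = np.linalg.lstsq(np.column_stack([x2_hat, x1, np.ones(n)]), y,
                         rcond=None)[0]
print(b_ols[0], b_2sls)   # b_2sls is close to (-2, -2, 2); b_ols[0] is not
```

The OLS coefficient on x2 is pulled away from \(-2\) because \(x_2\) is correlated with \(\varepsilon\), while the two-stage estimate recovers it.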

If I estimate the model parameters using two-stage least squares, I obtain


. ivregress 2sls y x1 (x2 = z1)

Instrumental variables (2SLS) regression          Number of obs   =      1,000
                                                  Wald chi2(2)    =    9351.74
                                                  Prob > chi2     =     0.0000
                                                  R-squared       =     0.9124
                                                  Root MSE        =     2.1175

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x2 |   -2.00309   .0227842   -87.92   0.000    -2.047746   -1.958434
          x1 |  -2.010345   .0665863   -30.19   0.000    -2.140851   -1.879838
       _cons |   2.098502   .1158818    18.11   0.000     1.871378    2.325626
------------------------------------------------------------------------------
Instrumented:  x2
Instruments:   x1 z1

I recover the coefficient values for the covariates, which are \(-2\) for x1, \(-2\) for x2, and \(2\) for the constant.

I can also recover the parameters of the model via structural equation modeling using sem. The key here is to specify two linear equations and state that the unobservable components of both equations are correlated. Interestingly, sem estimation assumes joint normality of the unobservables, which is not satisfied by the model, yet I obtain consistent estimates as illustrated by the coefficient values in the equation for y in the output table below:


. sem (y <- x1 x2) (x2 <- x1 z1), cov(e.y*e.x2) nolog

Endogenous variables

Observed:  y x2

Exogenous variables

Observed:  x1 z1

Structural equation model                       Number of obs     =      1,000
Estimation method  = ml
Log likelihood     = -7564.0866

------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural   |
  y <-       |
          x2 |   -2.00309   .0227842   -87.92   0.000    -2.047746   -1.958434
          x1 |  -2.010345   .0665863   -30.19   0.000    -2.140851   -1.879838
       _cons |   2.098502   .1158818    18.11   0.000     1.871378    2.325626
  -----------+----------------------------------------------------------------
  x2 <-      |
          x1 |   -1.97412   .0431461   -45.75   0.000    -2.058685   -1.889555
          z1 |  -1.958336   .0394089   -49.69   0.000    -2.035576   -1.881096
       _cons |   2.164649   .0713838    30.32   0.000     2.024739    2.304558
-------------+----------------------------------------------------------------
     var(e.y)|   4.483833   .2266015                      4.060989    4.950704
    var(e.x2)|    3.49781   .1564268                      3.204271    3.818238
-------------+----------------------------------------------------------------
cov(e.y,e.x2)|   2.316083   .1655267    13.99   0.000     1.991657     2.64051
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(0)   =      0.00, Prob > chi2 =      .

The syntax of sem requires that I write the two linear equations; establish which variables are endogenous using <-; and state that the unobservables of the two endogenous variables, denoted by e.y and e.x2, are correlated. The correlation is specified using the option cov(e.y*e.x2).

The coefficients and standard errors I obtain using sem are exactly the same as those from two-stage least squares. This equivalence occurs between moment-based estimators, like two-stage least squares and the generalized method of moments (GMM), and likelihood- and quasilikelihood-based estimators, when the moment conditions and score equations are the same. Therefore, even when the assumptions are different, the estimating equations are the same. The estimating equations for these models are given by \eqref{scores}.

I could also fit this model using GMM as implemented in gmm. Here is one way to do that:

  1. Write the residuals of the equations of the endogenous variables. In this example, \(\varepsilon = y - (\beta_0 + x_1\beta_1 + x_2\beta_2)\) and \(\nu = x_2 - (\Pi_0 + z_1\Pi_1 + x_1\Pi_2)\).

  2. Use all the exogenous variables in the system as instruments, in this case, \(x_1\) and \(z_1\).
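The recipe can also be checked numerically outside Stata. In the exactly identified case, the sample analogs of the moment conditions in \eqref{scores} form two square linear systems that can be solved directly; here is a numpy sketch under the assumption of the same simulated design (numpy draws, so the numbers only approximate the tables):

```python
import numpy as np

# Same design as the Stata simulation, re-drawn with numpy (an assumption:
# numpy's RNG is not Stata's, so estimates match only approximately).
rng = np.random.default_rng(111)
n = 1000
e  = rng.chisquare(2, n) - 2
v  = 0.5 * e + rng.chisquare(1, n) - 1
x1 = rng.chisquare(1, n) - 1
z1 = rng.chisquare(1, n) - 2
x2 = 2 * (1 - x1 - z1) + v
y  = 2 * (1 - x1 - x2) + e

Z = np.column_stack([x1, z1, np.ones(n)])   # instruments z = [x1 z1 1]
X = np.column_stack([x1, x2, np.ones(n)])   # regressors of the y equation

# Exactly identified case: set the sample moments Z'(y - X b) = 0 and
# Z'(x2 - Z pi) = 0 and solve each 3x3 linear system.
b  = np.linalg.solve(Z.T @ X, Z.T @ y)    # coefficients of the y equation
pi = np.linalg.solve(Z.T @ Z, Z.T @ x2)   # coefficients of the x2 equation

# Both moment vectors are numerically zero at the solution, mirroring the
# near-zero GMM criterion Q(b) that gmm reports for this model.
print(np.abs(Z.T @ (y - X @ b)).max() < 1e-6)   # True
```

Both coefficient vectors come out close to \((-2, -2, 2)\), as expected from the data-generating process.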

Using gmm gives us


. gmm (eq1: y  - {xb: x1 x2 _cons})          ///
> (eq2: x2 - {xpi: x1 z1 _cons}),            ///
> instruments(x1 z1)                         ///
> winitial(unadjusted, independent) nolog

Final GMM criterion Q(b) =  7.35e-32

note: model is exactly identified

GMM estimation

Number of parameters =   6
Number of moments    =   6
Initial weight matrix: Unadjusted                 Number of obs   =      1,000
GMM weight matrix:     Robust

------------------------------------------------------------------------------
             |               Robust
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
xb           |
          x1 |  -2.010345   .0776184   -25.90   0.000    -2.162474   -1.858215
          x2 |   -2.00309   .0323531   -61.91   0.000    -2.066501   -1.939679
       _cons |   2.098502   .1530806    13.71   0.000     1.798469    2.398534
-------------+----------------------------------------------------------------
xpi          |
          x1 |   -1.97412   .0380642   -51.86   0.000    -2.048724   -1.899515
          z1 |  -1.958336   .0448162   -43.70   0.000    -2.046174   -1.870498
       _cons |   2.164649   .0772429    28.02   0.000     2.013255    2.316042
------------------------------------------------------------------------------
Instruments for equation eq1: x1 z1 _cons
Instruments for equation eq2: x1 z1 _cons

Once again, I obtain the same parameter values as with ivregress and sem. However, the standard errors are different. The reason is that gmm computes robust standard errors by default. If I use ivregress with robust standard errors, the results are again exactly the same:


. ivregress 2sls y x1 (x2 = z1), vce(robust)

Instrumental variables (2SLS) regression          Number of obs   =      1,000
                                                  Wald chi2(2)    =    6028.31
                                                  Prob > chi2     =     0.0000
                                                  R-squared       =     0.9124
                                                  Root MSE        =     2.1175

------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x2 |   -2.00309   .0323531   -61.91   0.000    -2.066501   -1.939679
          x1 |  -2.010345   .0776184   -25.90   0.000    -2.162474   -1.858215
       _cons |   2.098502   .1530806    13.71   0.000     1.798469    2.398534
------------------------------------------------------------------------------
Instrumented:  x2
Instruments:   x1 z1

Another way to obtain the parameters of interest is a control-function approach. This uses the residuals from a regression of the endogenous variable \(x_2\) on the instruments \(x_1\) and \(z_1\) as regressors in the regression of \(y\) on \(x_1\) and \(x_2\). Below I implement the control-function approach using gmm.


. local xb ({b1}*x1 + {b2}*x2 + {b3}*(x2-{xpi:}) + {b0})

. gmm (eq1: x2 - {xpi: x1 z1 _cons})         ///
>     (eq2: y  - `xb')                       ///
>     (eq3: (y  - `xb')*(x2-{xpi:})),        ///
> instruments(eq1: x1 z1)                    ///
> instruments(eq2: x1 z1)                    ///
> winitial(unadjusted, independent) nolog

Final GMM criterion Q(b) =  1.02e-31

note: model is exactly identified

GMM estimation

Number of parameters =   7
Number of moments    =   7
Initial weight matrix: Unadjusted                 Number of obs   =      1,000
GMM weight matrix:     Robust

------------------------------------------------------------------------------
             |               Robust
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   -1.97412   .0380642   -51.86   0.000    -2.048724   -1.899515
          z1 |  -1.958336   .0448162   -43.70   0.000    -2.046174   -1.870498
       _cons |   2.164649   .0772429    28.02   0.000     2.013255    2.316042
-------------+----------------------------------------------------------------
         /b1 |  -2.010345   .0776184   -25.90   0.000    -2.162474   -1.858215
         /b2 |   -2.00309   .0323531   -61.91   0.000    -2.066501   -1.939679
         /b3 |   .6621525   .0700282     9.46   0.000     .5248998    .7994052
         /b0 |   2.098502   .1530806    13.71   0.000     1.798469    2.398534
------------------------------------------------------------------------------
Instruments for equation eq1: x1 z1 _cons
Instruments for equation eq2: x1 z1 _cons
Instruments for equation eq3: _cons

As in the previous examples, I define residuals and instruments, and gmm creates a moment condition using these two pieces of information. In the example above, the residuals from the regression of the endogenous variable on the exogenous variables of the model are at the same time residuals and instruments. Thus, I do not include them as an exogenous instrument. Instead, I construct the moment condition for these residuals manually in eq3.

Using the control-function approach again gives the same results as in the three previous cases.
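A quick way to see why the control function reproduces the two-stage least-squares point estimates is to run both by hand. Here is a numpy sketch under the same simulated design (an assumption: numpy draws rather than Stata's, so the coefficient values are approximate, though the equivalence between the two estimators is an exact algebraic identity):

```python
import numpy as np

# Same simulated design, re-drawn with numpy (numbers are approximate,
# since numpy's RNG is not Stata's).
rng = np.random.default_rng(111)
n = 1000
e  = rng.chisquare(2, n) - 2
v  = 0.5 * e + rng.chisquare(1, n) - 1
x1 = rng.chisquare(1, n) - 1
z1 = rng.chisquare(1, n) - 2
x2 = 2 * (1 - x1 - z1) + v
y  = 2 * (1 - x1 - x2) + e

Z = np.column_stack([x1, z1, np.ones(n)])

# Step 1: residuals from regressing the endogenous x2 on the instruments
vhat = x2 - Z @ np.linalg.lstsq(Z, x2, rcond=None)[0]

# Step 2: include vhat as an extra regressor; it absorbs the endogenous
# part of x2 (its coefficient plays the role of /b3 in the gmm output)
Xcf = np.column_stack([x1, x2, vhat, np.ones(n)])
bcf = np.linalg.lstsq(Xcf, y, rcond=None)[0]

# The control-function point estimates of the x1, x2, and constant
# coefficients coincide exactly with the instrumental-variables solution
X = np.column_stack([x1, x2, np.ones(n)])
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)
print(np.allclose(bcf[[0, 1, 3]], b_iv))   # True
```

The coefficient on vhat estimates \(\textrm{cov}(\varepsilon,\nu)/\textrm{var}(\nu)\), which is \(2/3\) in this design, consistent with the /b3 estimate of about 0.66 in the gmm table.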

In the first example, I used an estimator that exists in Stata. In the last two examples, I used estimation tools that allow us to obtain estimators for a large class of models.

Concluding remarks

Estimating the parameters of a model in the presence of endogeneity or related problems (http://blog.stata.com/tag/endogeneity/) is daunting. Above, I illustrated how to estimate the parameters of such models using commands in Stata that were created for this purpose and also illustrated how you can use gmm and sem to estimate these models.


