Many outcomes of interest in economics are binary.
For instance, we might want to find out how employment status \(Y^*\) varies with demographics \(X\), where \(Y^*=1\) means “employed” and \(Y^*=0\) means unemployed or not in the labor force.
But how do we know if someone is employed?
Typically we ask them, perhaps as part of a large, nationally representative survey such as the CPS.
Researchers who study labor market dynamics have long known, however, that observed data on labor market status are often inaccurate (Poterba & Summers, 1986).
Administrative errors creep into even the most carefully administered surveys.
But more importantly, survey respondents don’t always tell the truth, whether by mistake or intentionally, and this problem seems to have gotten worse in recent years (Meyer et al., 2015). Instead of true employment status \(Y^*\), researchers only observe a noisy measure \(Y \in \{0, 1\}\).
In my previous post I showed that classical measurement error in an outcome variable is basically innocuous.
But I also showed that measurement error in a binary random variable cannot be classical.
In this post, I’ll explore the implications of this fact when we want to learn \(\mathbb{P}(Y^*=1|X)\) but only observe \(Y\) and \(X\), not the true outcome variable \(Y^*\).
To keep things concrete, I’ll assume throughout that \(\mathbb{P}(Y^*=1|X) = F(X'\beta)\) where \(F\) is a strictly increasing, differentiable function. This covers all the usual suspects: logit, probit, and the linear probability model.
The parameter \(\beta\) may have a causal interpretation or merely a predictive one.
Either way, the question I’ll focus on here is whether, and if so how, \(\beta\) can be identified in the presence of measurement error.
For simplicity I’ll assume throughout that the covariates \(X\) are measured without error.
Why does observing \(Y\) rather than \(Y^*\) present a problem?
To answer this question, we need to derive the relationship between \(\mathbb{P}(Y=1|X)\) and \(\mathbb{P}(Y^*=1|X)\).
Since \(Y\) and \(Y^*\) are both binary, \(\mathbb{E}(Y|X) = \mathbb{P}(Y=1|X)\) and similarly
\[
\mathbb{E}(Y^*|X) = \mathbb{P}(Y^*=1|X) = F(X'\beta).
\]
Now define the measurement error \(W\) as \(W = Y - Y^*\) so we can write \(Y = Y^* + W\).
By the linearity of expectation,
\[
\begin{aligned}
\mathbb{P}(Y=1|X) &= \mathbb{E}(Y|X) \\
&= \mathbb{E}(Y^* + W|X) = \mathbb{E}(Y^*|X) + \mathbb{E}(W|X) \\
&= \mathbb{P}(Y^*=1|X) + \mathbb{E}(W|X)
\end{aligned}
\]
so we see that if \(\mathbb{E}(W|X)=0\) then \(\mathbb{P}(Y=1|X)\) and \(\mathbb{P}(Y^*=1|X)\) will coincide.
Unfortunately, \(\mathbb{E}(W|X)\) in general cannot be zero.
This means that learning \(\mathbb{P}(Y=1|X)\) will not tell us what we want to know: \(\mathbb{P}(Y^*=1|X)\).
To see why \(\mathbb{E}(W|X) \neq 0\), first define the mis-classification probabilities \(\alpha_0(\cdot)\) and \(\alpha_1(\cdot)\)
\[
\begin{aligned}
\alpha_0(X) &\equiv \mathbb{P}(Y=1|Y^*=0,X)\\
\alpha_1(X) &\equiv \mathbb{P}(Y=0|Y^*=1,X).
\end{aligned}
\]
The subscripts on \(\alpha\) refer to the value of \(Y^*\) on which we condition: \(\alpha_0(\cdot)\) conditions on \(Y^*=0\) while \(\alpha_1(\cdot)\) conditions on \(Y^*=1\).
You can interpret the mis-classification probabilities by analogy with null hypothesis testing: \(\alpha_0(X)\) is effectively the type I error rate as a function of \(X\), and \(\alpha_1(X)\) is the type II error rate as a function of \(X\).
To keep things fully general for the moment, we allow the mis-classification probabilities to depend on \(X\). Perhaps a young male worker with five years of experience is more likely to make an erroneous self-report in the CPS than an older female worker with more experience, for example. Since \(Y\) and \(Y^*\) are both binary, \(W \in \{-1, 0, 1\}\) and we can calculate \(\mathbb{E}(W|X)\) as follows:
\[
\begin{aligned}
\mathbb{E}(W|X) &= -1 \times \mathbb{P}(W=-1|X) + 0 \times \mathbb{P}(W=0|X) + 1 \times \mathbb{P}(W=1|X)\\
&= \mathbb{P}(W=1|X) - \mathbb{P}(W=-1|X).
\end{aligned}
\]
Now consider the event \(\{W = -1\}\).
The only way this can occur is if \(Y = 0\) and \(Y^* = 1\).
Accordingly,
\[
\begin{aligned}
\mathbb{P}(W = -1|X) &= \mathbb{P}(Y = 0, Y^* = 1|X)\\
&= \mathbb{P}(Y=0|Y^*=1,X)\,\mathbb{P}(Y^*=1|X)\\
&= \alpha_1(X)\, F(X'\beta).
\end{aligned}
\]
Similarly, the only way that \(\{W=1\}\) can occur is if \(Y=1\) and \(Y^*=0\), so that
\[
\begin{aligned}
\mathbb{P}(W = 1|X) &= \mathbb{P}(Y = 1, Y^* = 0|X)\\
&= \mathbb{P}(Y=1|Y^*=0,X)\,\mathbb{P}(Y^*=0|X)\\
&= \alpha_0(X) \left[1 - F(X'\beta)\right].
\end{aligned}
\]
Therefore,
\[
\begin{aligned}
\mathbb{E}(W|X) &= \mathbb{P}(W=1|X) - \mathbb{P}(W=-1|X)\\
&= \alpha_0(X)\left[1 - F(X'\beta)\right] - \alpha_1(X) F(X'\beta).
\end{aligned}
\]
So how could \(\mathbb{E}(W|X) = 0\)?
Re-arranging the preceding expression to solve for \(F(X'\beta)\),
\[
\mathbb{E}(W|X) = 0 \iff F(X'\beta) = \frac{\alpha_0(X)}{\alpha_0(X) + \alpha_1(X)}.
\]
This shows that \(\mathbb{E}(W|X)\) can only be zero in an extremely peculiar case in which \(\alpha_0(\cdot)\) and \(\alpha_1(\cdot)\) depend on \(X\) in just the right way.
If the mis-classification probabilities are constants that do not depend on \(X\), we would need \(F(X'\beta) = \alpha_0/(\alpha_0 + \alpha_1)\) for every value of \(X\).
This is only possible if all the elements of \(\beta\) besides the intercept are zero, so that \(F(X'\beta)\) does not actually vary with \(X\).
Since \(\mathbb{E}(W|X)\) will not in general equal zero, \(\mathbb{P}(Y=1|X)\) will not in general equal \(\mathbb{P}(Y^*=1|X)\).
Substituting our expression for \(\mathbb{E}(W|X)\) and factoring the result,
\[
\begin{aligned}
\mathbb{P}(Y=1|X) &= \mathbb{P}(Y^*=1|X) + \mathbb{E}(W|X)\\
&= F(X'\beta) + \alpha_0(X)\left[1 - F(X'\beta)\right] - \alpha_1(X) F(X'\beta)\\
&= \alpha_0(X) + F(X'\beta)\left[1 - \alpha_0(X) - \alpha_1(X)\right].
\end{aligned}
\]
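To make this identity concrete, here’s a minimal Python simulation sketch (with made-up parameter values and a probit \(F\)): it generates \(Y^*\), corrupts it using mis-classification probabilities that depend on \(X\), and checks that the observed frequency of \(Y=1\) matches \(\alpha_0(X) + F(X'\beta)\left[1 - \alpha_0(X) - \alpha_1(X)\right]\).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1234)
n = 1_000_000

x = rng.normal(size=n)                 # a single continuous covariate
beta0, beta1 = -0.5, 1.0               # made-up true coefficients
p_star = norm.cdf(beta0 + beta1 * x)   # P(Y* = 1 | X) under a probit model

# Mis-classification probabilities that depend (mildly) on X
alpha0 = 0.05 + 0.03 * (x > 0)         # P(Y = 1 | Y* = 0, X)
alpha1 = 0.10 + 0.05 * (x > 0)         # P(Y = 0 | Y* = 1, X)

y_star = rng.binomial(1, p_star)                                    # true outcome
flip = rng.uniform(size=n) < np.where(y_star == 1, alpha1, alpha0)  # mis-classification events
y = np.where(flip, 1 - y_star, y_star)                              # observed, noisy outcome

# Compare the observed frequency of Y = 1 with the formula, within bins of X
implied = alpha0 + p_star * (1 - alpha0 - alpha1)
for lo, hi in [(-2, -1), (-1, 0), (0, 1), (1, 2)]:
    mask = (x >= lo) & (x < hi)
    print(f"x in [{lo}, {hi}): observed {y[mask].mean():.4f}, formula {implied[mask].mean():.4f}")
```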
Because we observe \((Y, X)\), \(\mathbb{P}(Y=1|X)\) is identified.
With enough data, we can learn this conditional probability as a function of \(X\) as accurately as we like.
The problem is that \(\alpha_0(X)\) and \(\alpha_1(X)\) drive a wedge between what we can observe, \(\mathbb{P}(Y=1|X)\), and what we are trying to learn, \(\mathbb{P}(Y^*=1|X) = F(X'\beta)\). Without knowing more about the functions \(\alpha_0(\cdot)\) and \(\alpha_1(\cdot)\), we can’t say much about how \(\mathbb{P}(Y=1|X)\) and \(\mathbb{P}(Y^*=1|X)\) will differ.
Because they are probabilities,
\[
0 \leq \alpha_0(X) \leq 1, \quad 0 \leq \alpha_1(X) \leq 1.
\]
But because they are conditional probabilities that condition on different events, \(\{Y^*=0, X=x\}\) versus \(\{Y^*=1, X=x\}\), the sum \(\alpha_0(x) + \alpha_1(x)\) could be greater than one.
This means that \(1 - \alpha_0(X) - \alpha_1(X)\) could be negative, at least for certain values of \(X\).
It is common in practice, however, to assume that \(\alpha_0(X) + \alpha_1(X) < 1\) for all possible values that the covariates \(X\) could take on.
To understand this assumption, and the problem more generally, it is helpful to consider a simple special case in which the mis-classification probabilities do not depend on \(X\).
In this case we can say precisely how measurement error in the outcome affects what we can learn about the relationship between \(X\) and \(Y^*\), and explain why \(\alpha_0(X) + \alpha_1(X) < 1\) is usually a reasonable assumption.
A Special Case: Fixed Mis-classification
Suppose that the mis-classification probabilities are fixed, i.e. that
\[
\begin{aligned}
\alpha_0(X) &\equiv \mathbb{P}(Y=1|Y^*=0,X) = \mathbb{P}(Y=1|Y^*=0) \equiv \alpha_0\\
\alpha_1(X) &\equiv \mathbb{P}(Y=0|Y^*=1,X) = \mathbb{P}(Y=0|Y^*=1) \equiv \alpha_1.
\end{aligned}
\]
This is a fairly strong assumption.
It says that both self-reporting and administrative errors occur at the same rate for everyone, regardless of their observed characteristics.
In this case, our expression for \(\mathbb{P}(Y=1|X)\) from above becomes
\[
\mathbb{P}(Y=1|X) = \alpha_0 + F(X'\beta)(1 - \alpha_0 - \alpha_1).
\]
Defining \(f\) as the derivative of \(F\), this means that the observed partial effect of a continuous covariate \(X_j\), i.e. its effect on \(\mathbb{P}(Y=1|X)\), is
\[
\begin{aligned}
\frac{\partial}{\partial X_j} \mathbb{P}(Y=1|X) &= \frac{\partial}{\partial X_j} \left[\alpha_0 + F(X'\beta)(1 - \alpha_0 - \alpha_1)\right]\\
&= f(X'\beta)\,\beta_j\,(1 - \alpha_0 - \alpha_1)
\end{aligned}
\]
while the true partial effect, the effect on \(\mathbb{P}(Y^*=1|X)\), is
\[
\frac{\partial}{\partial X_j} \mathbb{P}(Y^*=1|X) = \frac{\partial}{\partial X_j} F(X'\beta) = f(X'\beta)\,\beta_j.
\]
If \((\alpha_0 + \alpha_1) > 1\) then \((1 - \alpha_0 - \alpha_1)\) will be negative.
This means that the measurement error problem is so severe that all of the observed partial effects have the wrong sign.
A bit of tedious algebra shows that \(Y\) and \(Y^*\) must be negatively correlated in this case: \(Y\) is such a noisy measure of \(Y^*\) that when \(Y=1\) we are better off predicting that \(Y^*=0\).
For this reason, it is conventional to assume that \(\alpha_0 + \alpha_1 < 1\). In this case \(0 < (1 - \alpha_0 - \alpha_1) \leq 1\), so the observed partial effects are attenuated versions of the true partial effects, in that (taking \(\beta_j > 0\) for concreteness)
\[
0 < \frac{\partial}{\partial X_j} \mathbb{P}(Y=1|X) \leq \frac{\partial}{\partial X_j} \mathbb{P}(Y^*=1|X).
\]
So in this special case, non-classical measurement error in a binary outcome variable has the same effect as classical measurement error in a continuous regressor: attenuation bias.
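A quick simulation illustrates this attenuation. The sketch below (assuming a logistic \(F\), made-up values of \(\alpha_0\) and \(\alpha_1\), and the statsmodels package) fits a naive logit to the mis-classified outcome; its slope estimate is pulled toward zero relative to the infeasible logit fit to \(Y^*\).

```python
import numpy as np
import statsmodels.api as sm
from scipy.special import expit   # the logistic CDF

rng = np.random.default_rng(42)
n = 200_000
alpha0, alpha1 = 0.1, 0.2         # made-up fixed mis-classification rates
beta = np.array([-0.5, 1.5])      # made-up true (intercept, slope)

x = rng.normal(size=n)
X = sm.add_constant(x)            # design matrix [1, x]
y_star = rng.binomial(1, expit(X @ beta))

# Misclassify: flip 1 -> 0 with probability alpha1, 0 -> 1 with probability alpha0
flip = rng.uniform(size=n) < np.where(y_star == 1, alpha1, alpha0)
y = np.where(flip, 1 - y_star, y_star)

fit_true = sm.Logit(y_star, X).fit(disp=0)    # infeasible: uses the true outcome
fit_naive = sm.Logit(y, X).fit(disp=0)        # feasible: uses the noisy outcome
print("logit on Y*:", fit_true.params.round(3))
print("logit on Y :", fit_naive.params.round(3))   # slope pulled toward zero
```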
Observational Equivalence and \((\alpha_0 + \alpha_1)\)
You may be wondering: do we really need the assumption \(\alpha_0 + \alpha_1 < 1\), or is it merely convenient? Couldn’t the observed data tell us whether \(\alpha_0 + \alpha_1\) is less than one or greater than one? The answer turns out to be no, and it’s easy to show why if \(F(t) = 1 - F(-t)\), as in the probit, logit, and linear probability models. Suppose that this condition on \(F\) holds. Then we can write
\[
\begin{aligned}
\mathbb{P}(Y=1|X) &= \alpha_0 + (1 - \alpha_0 - \alpha_1) F(X'\beta)\\
&= \alpha_0 + (1 - \alpha_0 - \alpha_1)\left[1 - F\big(X'(-\beta)\big)\right]\\
&= \left(\alpha_0 + 1 - \alpha_0 - \alpha_1\right) - (1 - \alpha_0 - \alpha_1) F\big(X'(-\beta)\big)\\
&= (1 - \alpha_1) + \left[\alpha_0 - (1 - \alpha_1)\right] F\big(X'(-\beta)\big)\\
&= (1 - \alpha_1) + \left[1 - (1 - \alpha_1) - (1 - \alpha_0)\right] F\big(X'(-\beta)\big).
\end{aligned}
\]
Defining \(\widetilde{\alpha}_0 \equiv (1 - \alpha_1)\), \(\widetilde{\alpha}_1 \equiv (1 - \alpha_0)\), and \(\widetilde{\beta} \equiv -\beta\), we have established that
\[
\begin{aligned}
\mathbb{P}(Y=1|X) &= \alpha_0 + (1 - \alpha_0 - \alpha_1) F(X'\beta)\\
&= \widetilde{\alpha}_0 + \left(1 - \widetilde{\alpha}_0 - \widetilde{\alpha}_1\right) F\big(X'\widetilde{\beta}\big).
\end{aligned}
\]
Since \(Y\) is binary, \(\mathbb{P}(Y=1|X)\) tells us everything there is to know about the distribution of \(Y\) given \(X\).
The preceding pair of equalities shows that the observed conditional distribution \(\mathbb{P}(Y=1|X)\) could just as well have arisen from \(\mathbb{P}(Y^*=1|X) = F(X'\widetilde{\beta})\) with mis-classification probabilities \((\widetilde{\alpha}_0, \widetilde{\alpha}_1)\) as from \(\mathbb{P}(Y^*=1|X) = F(X'\beta)\) with mis-classification probabilities \((\alpha_0, \alpha_1)\). From observations of \((Y, X)\) alone there is no way to tell these possibilities apart: we say that they are observationally equivalent. Notice that if \(\alpha_0 + \alpha_1 < 1\) then \(\widetilde{\alpha}_0 + \widetilde{\alpha}_1 > 1\), and vice-versa. This shows that the only way to point identify \(\beta\) is to assume either that \(\alpha_0 + \alpha_1 < 1\) or the reverse inequality. For the reasons discussed above, it usually makes sense to choose \(\alpha_0 + \alpha_1 < 1\).
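Here’s a tiny numerical check of this equivalence (a sketch with arbitrary parameter values and a logistic \(F\), which satisfies \(F(t) = 1 - F(-t)\)): the two parameterizations imply identical observed probabilities at every value of \(X\).

```python
import numpy as np
from scipy.special import expit   # logistic CDF, which satisfies F(t) = 1 - F(-t)

alpha0, alpha1 = 0.1, 0.2                  # arbitrary values with alpha0 + alpha1 < 1
beta = np.array([0.5, -1.0, 2.0])          # arbitrary coefficient vector

# The observationally equivalent parameterization
alpha0_t, alpha1_t, beta_t = 1 - alpha1, 1 - alpha0, -beta

X = np.column_stack([np.ones(5), np.random.default_rng(0).normal(size=(5, 2))])
p1 = alpha0 + (1 - alpha0 - alpha1) * expit(X @ beta)
p2 = alpha0_t + (1 - alpha0_t - alpha1_t) * expit(X @ beta_t)
print(np.allclose(p1, p2))   # True: both imply the same P(Y = 1 | X)
```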
So what is an applied researcher to do? If we could somehow learn the mis-classification probabilities, we could use them to “adjust” \(\mathbb{P}(Y=1|X)\) and identify \(\mathbb{P}(Y^*=1|X) = F(X'\beta)\) as follows:
\[
F(X'\beta) = \frac{\mathbb{P}(Y=1|X) - \alpha_0(X)}{1 - \alpha_0(X) - \alpha_1(X)}.
\]
Broadly speaking, there are two ways to learn the mis-classification probabilities.
The first approach estimates \(\alpha_0(X)\) and \(\alpha_1(X)\) using a second dataset.
The second approach uses a single dataset and instead exploits non-linearity in the function \(F\).
For the remainder of this discussion I’ll assume that \(\alpha_0(X) + \alpha_1(X) < 1\), or \(\alpha_0 + \alpha_1 < 1\) if the mis-classification probabilities are fixed.
Method 1: Auxiliary Data
Let’s start by making life simple: assume fixed mis-classification. Now suppose that we observe two random samples from the same population. In the first, we observe pairs \((Y_i, X_i)\) for \(i = 1, \dots, n\) and in the second we observe pairs \((Y_j, Y^*_j)\) for \(j = 1, \dots, m\). Notice that neither dataset contains observations of \(X\) and \(Y^*\) for the same person. Using the \((Y_i, X_i)\) observations we can estimate \(\mathbb{P}(Y=1|X)\), and using the \((Y_j, Y^*_j)\) observations we can estimate
\[
\alpha_0 = \mathbb{P}(Y=1|Y^*=0), \quad \alpha_1 = \mathbb{P}(Y=0|Y^*=1).
\]
This gives us everything we need to determine \(F(X'\beta)\) as a function of \(X\) and hence \(\beta\). The observations of \((Y_j, Y^*_j)\) are called an auxiliary dataset. In principle, auxiliary data provide a simple and general solution to measurement error problems of all stripes. Suppose, for example, that we were uncomfortable with the assumption of fixed mis-classification. If we observed an auxiliary dataset of triples \((Y_j, Y_j^*, X_j)\), then we could directly estimate \(\alpha_0(X)\) and \(\alpha_1(X)\). Of course, if we observed \((Y_j, Y_j^*, X_j)\) for a random sample drawn from the population of interest, we could estimate \(\mathbb{P}(Y^*=1|X)\) directly without any need to account for measurement error! And herein lies the fundamental tension of the auxiliary data approach: if we had sufficiently rich auxiliary data, we wouldn’t have a measurement error problem in the first place. More typically, we either observe \((Y_j, Y_j^*, X_j)\) for a different population, or only observe a subset of these variables for our population of interest. Either way, we need to rely on modeling assumptions to bridge the gap. For example, fixed mis-classification and an auxiliary dataset of \((Y^*_j, Y_j)\) suffice to solve the measurement error problem, but only if \(\alpha_0(X)\) and \(\alpha_1(X)\) do not in fact depend on \(X\).
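As a rough illustration, the following Python sketch simulates a main sample \((Y_i, X_i)\) and an auxiliary sample \((Y_j, Y^*_j)\) from the same hypothetical population (probit \(F\), made-up parameters, fixed mis-classification), estimates \(\alpha_0\) and \(\alpha_1\) from the auxiliary sample, and then applies the adjustment formula from above within coarse bins of \(X\) to recover \(F(X'\beta)\).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
alpha0, alpha1 = 0.08, 0.15        # made-up truth, unknown to the researcher
beta0, beta1 = 0.2, 1.0            # made-up true probit coefficients

def simulate(n):
    """Draw (Y, Y*, X) from the same hypothetical population."""
    x = rng.normal(size=n)
    y_star = rng.binomial(1, norm.cdf(beta0 + beta1 * x))
    flip = rng.uniform(size=n) < np.where(y_star == 1, alpha1, alpha0)
    return np.where(flip, 1 - y_star, y_star), y_star, x

# Main sample: we only use (Y, X). Auxiliary sample: we only use (Y, Y*).
y, _, x = simulate(500_000)
y_aux, y_star_aux, _ = simulate(50_000)

a0_hat = y_aux[y_star_aux == 0].mean()       # estimates P(Y=1 | Y*=0)
a1_hat = 1 - y_aux[y_star_aux == 1].mean()   # estimates P(Y=0 | Y*=1)

# Adjust the observed conditional probability within coarse bins of x
for lo, hi in [(-1.5, -0.5), (-0.5, 0.5), (0.5, 1.5)]:
    mask = (x >= lo) & (x < hi)
    p_obs = y[mask].mean()                               # estimates P(Y=1 | X in bin)
    f_adj = (p_obs - a0_hat) / (1 - a0_hat - a1_hat)     # adjusted probability
    f_true = norm.cdf(beta0 + beta1 * x[mask]).mean()    # average of F(X'beta) over the bin
    print(f"x in [{lo}, {hi}): adjusted {f_adj:.3f}, true {f_true:.3f}")
```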
Method 2: Nonlinearity of \(F\)
The auxiliary data approach is very general in principle, but it relies on information that we may simply not have in practice: a second dataset from the same population. An alternative approach uses only one dataset, \((Y_i, X_i)\) for \(i = 1, \dots, n\), and instead exploits the shape of the function \(F\). This second approach is a bit less general but doesn’t require any outside sources of information.
To begin, suppose that the mis-classification probabilities are fixed and that \(F\) is a known function, e.g. the standard logistic CDF. Suppose further that \(F\) is strictly increasing and hence invertible. Then, applying \(F^{-1}\) to both sides of our expression for \(F(X'\beta)\) from above,
\[
X'\beta = F^{-1}\left[\frac{\mathbb{P}(Y = 1|X) - \alpha_0}{1 - \alpha_0 - \alpha_1}\right]
\]
and thus, pre-multiplying both sides by \(X\) and taking expectations,
\[
\mathbb{E}[XX']\beta = \mathbb{E}\left\{X F^{-1}\left[\frac{\mathbb{P}(Y = 1|X) - \alpha_0}{1 - \alpha_0 - \alpha_1}\right]\right\}.
\]
Therefore, if \(\mathbb{E}[XX']\) is invertible,
\[
\beta = \mathbb{E}[XX']^{-1}\,\mathbb{E}\left\{X F^{-1}\left[\frac{\mathbb{P}(Y = 1|X) - \alpha_0}{1 - \alpha_0 - \alpha_1}\right]\right\}.
\]
Since \(\mathbb{P}(Y=1|X)\) depends only on the observed data \((Y, X)\), this function is point identified. Since \(F\) is assumed to be a known function, it follows that \(\beta\) is point identified whenever \(\mathbb{E}[XX']\) is invertible and \((\alpha_0, \alpha_1)\) are known. So if we can find a way to point identify \(\alpha_0\) and \(\alpha_1\), we will immediately identify \(\beta\).
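Here’s a short Python sketch of this moment condition (logistic \(F\), made-up \(\alpha_0\), \(\alpha_1\), and \(\beta\)). To isolate the identification argument it plugs the true \(\mathbb{P}(Y=1|X)\) into the formula rather than an estimate; the sample analogue of \(\mathbb{E}[XX']^{-1}\,\mathbb{E}\{X F^{-1}[\cdot]\}\) then recovers \(\beta\) up to numerical error.

```python
import numpy as np
from scipy.special import expit, logit   # logistic CDF and its inverse

rng = np.random.default_rng(3)
n = 1_000_000
alpha0, alpha1 = 0.1, 0.2                # treated as known here
beta = np.array([-0.3, 0.8, -0.5])       # made-up true coefficients

X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
p_obs = alpha0 + (1 - alpha0 - alpha1) * expit(X @ beta)   # P(Y = 1 | X), plugged in directly

# Sample analogue of beta = E[XX']^{-1} E{ X F^{-1}[(P(Y=1|X) - a0) / (1 - a0 - a1)] }
adjusted_index = logit((p_obs - alpha0) / (1 - alpha0 - alpha1))
beta_hat = np.linalg.solve(X.T @ X, X.T @ adjusted_index)
print(beta_hat.round(3))   # recovers beta up to numerical error
```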
Easier said than done! How can we possibly learn \(\alpha_0\) and \(\alpha_1\) without auxiliary data? Nonlinearity is the key. If \(F\) is a cumulative distribution function, then \(\lim_{t\rightarrow \infty} F(t) = 1\) and \(\lim_{t\rightarrow -\infty} F(t) = 0\). Now suppose that \(X\) contains at least one covariate, call it \(V\), that is continuous and has “large support,” i.e. takes on values in a very wide range. Without loss of generality, suppose that the coefficient \(\beta_v\) on \(V\) is positive. (If it’s negative, apply the following argument to \(-V\) instead.) For \(V\) large and positive, \(X'\beta\) is large and positive, so \(F(X'\beta)\) is close to one.
In this case
\[
\begin{aligned}
\alpha_0 + (1 - \alpha_0 - \alpha_1) F(X'\beta) &\approx \alpha_0 + (1 - \alpha_0 - \alpha_1) \times 1\\
&= (1 - \alpha_1).
\end{aligned}
\]
For \(V\) large and negative, on the other hand, \(X'\beta\) is large and negative, \(F(X'\beta)\) is close to zero, and
\[
\begin{aligned}
\alpha_0 + (1 - \alpha_0 - \alpha_1) F(X'\beta) &\approx \alpha_0 + (1 - \alpha_0 - \alpha_1) \times 0\\
&= \alpha_0.
\end{aligned}
\]
Intuitively, by examining values of \(X_i\) for which \(F(X_i'\beta)\) is close to one we can learn \((1 - \alpha_1)\), and by examining values for which \(F(X_i'\beta)\) is close to zero we can identify \(\alpha_0\).
You may object that the preceding identification argument sounds suspiciously circular: doesn’t this idea at least implicitly require us to know \(\beta\)? Fortunately, the answer is no. We only need to know the signs of \(\beta\). Under the assumption that \(\alpha_0 + \alpha_1 < 1\), these are the same as the signs of the observed partial effects \(\partial\, \mathbb{P}(Y=1|X)/\partial X_j\). An example may help. Suppose \(Y=1\) means “graduated from college.” Under fixed mis-classification, we would learn \(\alpha_0\) from the observations of \((Y_i, X_i)\) for people who almost certainly did not graduate from college, based on their covariates, and \((1 - \alpha_1)\) from the observations of \((Y_i, X_i)\) for people who almost certainly did. By first estimating \(\mathbb{P}(Y=1|X)\), we learn attenuated versions of the true partial effects \(f(X'\beta)\,\beta_j\). In other words, we learn how reported education varies with \(X\). But this information suffices to show us how to make \(F(X'\beta)\) close to zero or one.
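The sketch below (logistic \(F\), a regressor with wide support, made-up parameter values) illustrates the point: among observations whose observed covariate \(V\) is pushed far enough in either direction, the frequency of \(Y = 1\) is approximately \(\alpha_0\) at one extreme and \(1 - \alpha_1\) at the other.

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(11)
n = 2_000_000
alpha0, alpha1 = 0.07, 0.12          # made-up mis-classification rates
beta = np.array([0.5, 1.0])          # made-up (intercept, coefficient on V), positive sign on V

v = rng.uniform(-10, 10, size=n)     # observed regressor with wide ("large") support
y_star = rng.binomial(1, expit(beta[0] + beta[1] * v))
flip = rng.uniform(size=n) < np.where(y_star == 1, alpha1, alpha0)
y = np.where(flip, 1 - y_star, y_star)          # observed, misclassified outcome

# With beta_v > 0: very negative V pushes F(X'beta) toward 0, very positive V toward 1,
# so the observed frequency of Y = 1 reveals alpha0 and 1 - alpha1 respectively.
print("mean of Y when V < -7:", round(y[v < -7].mean(), 4), " vs alpha0 =", alpha0)
print("mean of Y when V >  7:", round(y[v > 7].mean(), 4), " vs 1 - alpha1 =", 1 - alpha1)
```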
The preceding argument relies crucially on the assumption that \(F\) is nonlinear. To see why, consider the linear probability model \(F(X'\beta) = X'\beta\) and let \(X' = (1, X_1')\) and \(\beta' = (\beta_0, \beta_1')\).
Then,
\[
\begin{aligned}
\mathbb{P}(Y=1|X) &= \alpha_0 + \left(1 - \alpha_0 - \alpha_1\right) F(X'\beta)\\
&= \alpha_0 + \left(1 - \alpha_0 - \alpha_1\right)(X'\beta)\\
&= \alpha_0 + \left(1 - \alpha_0 - \alpha_1\right)(\beta_0 + X_1'\beta_1)\\
&= \alpha_0 + (1 - \alpha_0 - \alpha_1)\beta_0 + X_1'(1 - \alpha_0 - \alpha_1)\beta_1.
\end{aligned}
\]
Now, defining \(\widetilde{\beta}_0 \equiv \alpha_0 + (1 - \alpha_0 - \alpha_1)\beta_0\), \(\widetilde{\beta}_1 \equiv (1 - \alpha_0 - \alpha_1)\beta_1\), and \(\widetilde{\beta}' \equiv (\widetilde{\beta}_0, \widetilde{\beta}_1')\), we have
\[
\mathbb{P}(Y=1|X) = \alpha_0 + (1 - \alpha_0 - \alpha_1) X'\beta = X'\widetilde{\beta}.
\]
This shows that a linear probability model with coefficient vector \(\beta\) and mis-classification probabilities \((\alpha_0, \alpha_1)\) is observationally equivalent to a linear probability model with no mis-classification and coefficient vector \(\widetilde{\beta}\). To put it another way: there is no way to tell whether mis-classification is present or absent in a linear model. Doing so requires non-linearity.
So how do we use these results in practice? If \((\alpha_0, \alpha_1, \beta)\) are identified and \(F\) is assumed known, we can proceed via garden-variety maximum likelihood estimation. The log-likelihood function is only slightly more complicated than in the standard binary outcome setting, namely:
\[
\begin{aligned}
\ell_n(\alpha_0, \alpha_1, \beta) &= \frac{1}{n} \sum_{i=1}^n \log\left\{\mathbb{P}(Y_i=1|X_i)^{\mathbb{1}(Y_i=1)}\,\mathbb{P}(Y_i=0|X_i)^{\mathbb{1}(Y_i=0)}\right\}\\
&= \frac{1}{n} \sum_{i=1}^n Y_i \log\left\{\alpha_0 + (1 - \alpha_0 - \alpha_1) F(X_i'\beta)\right\} + (1 - Y_i)\log\left\{1 - \alpha_0 - (1 - \alpha_0 - \alpha_1) F(X_i'\beta)\right\}.
\end{aligned}
\]
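As a sketch of what this looks like in code (Python with scipy, a logistic \(F\), and simulated data with made-up parameters), the snippet below maximizes this log-likelihood jointly over \((\alpha_0, \alpha_1, \beta)\), mapping the mis-classification probabilities through a logistic transform to keep them in \((0,1)\). The starting values matter: beginning near small mis-classification rates steers the optimizer toward the \(\alpha_0 + \alpha_1 < 1\) solution rather than its observationally equivalent twin.

```python
import numpy as np
from scipy.special import expit
from scipy.optimize import minimize

rng = np.random.default_rng(99)
n = 100_000
alpha0, alpha1 = 0.1, 0.15
beta = np.array([0.3, 1.2])                    # made-up (intercept, slope)

X = np.column_stack([np.ones(n), rng.uniform(-5, 5, size=n)])   # large-support regressor
y_star = rng.binomial(1, expit(X @ beta))
flip = rng.uniform(size=n) < np.where(y_star == 1, alpha1, alpha0)
y = np.where(flip, 1 - y_star, y_star)

def neg_loglik(theta):
    """Negative misclassification-adjusted log-likelihood."""
    a0, a1 = expit(theta[0]), expit(theta[1])   # keep alpha0, alpha1 in (0, 1)
    b = theta[2:]
    p = a0 + (1 - a0 - a1) * expit(X @ b)       # P(Y = 1 | X) under the model
    p = np.clip(p, 1e-10, 1 - 1e-10)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Starting near small mis-classification rates selects the alpha0 + alpha1 < 1 branch
start = np.array([-2.0, -2.0, 0.0, 0.5])
res = minimize(neg_loglik, start, method="BFGS")
print("alpha estimates:", expit(res.x[:2]).round(3))
print("beta estimates: ", res.x[2:].round(3))
```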
If \(F\) is unknown, estimation is more complicated, but the intuition from above continues to hold: a regressor \(V\) with “large support” allows us to identify the mis-classification probabilities, and hence \(F\). Indeed, we can even allow \(\alpha_0\) and \(\alpha_1\) to depend on covariates, so long as they don’t depend on \(V\) itself. For more details on the “identification by nonlinearity” approach, see Hausman et al. (1998) and Lewbel (2000).
That’s more than enough about measurement error for now! When I return to this topic in a few weeks’ time, I’ll consider the problem of a mis-measured binary regressor. In my next installment, however, I’ll put measurement error to one side and revisit a classic problem from introductory statistics: constructing a confidence interval for a population proportion. Sometimes the simplest things turn out to be much harder than they first appear.
